Our Monitoring Tools are Lying to Us

I posted the article below to LinkedIn a few weeks back. Since it was relatively popular and relevant to the Circonus community we decided to repost to our Blog. You can find the original here.

I came across this vendor blog post today extolling the virtues of monitoring application performance using Percentiles versus Averages. Hard to believe in late 2012 there was still convincing to be done on this concept.

But in the age of Agile Computing and DevOps at scale, fixed percentiles over arbitrary, pre-determined time windows no longer cut the mustard for measuring application performance. Did they ever? Probably not, but they’re easy to calculate and cheap to store using 20th century “Small Data” technologies.

What if the proper threshold for supporting your service SLA for one KPI is measured at the 85th percentile over 1 min and for another KPI is measured at the 95th percentile over an hour? What if those thresholds change as your business changes and your business is changing rapidly? Are your tools as agile as your business?

What if consistently delighting your customers requires you to monitor a percentile of a particular metric at 1 min, 5 min, 1 hour, and 1 day intervals? Even if your tools imply they can do that, they probably can’t in reality. They weren’t designed to do that.

Let’s say you are monitoring “response time” and that over the course of 1 min you typically have thousands of response time measurements. Existing tools will calculate the chosen percentile of those thousands of measurements and store the result in a database every 5 min. After 60 min they have 12 values, one for each 5 min window. Want to “calculate” the 95th percentile over an hour? More than likely what your tools will actually calculate is the average of those 12 values. But in reality there were 12 × thousands of response times measured over that hour, not 12. What’s the actual 95th percentile? Your tools probably can’t answer that question because they don’t have the data.
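To make the gap concrete, here is a minimal Python 3 sketch. Everything in it is invented for illustration: the response times are synthetic, one of the twelve 5 min windows is deliberately slow, and `percentile` uses the simple nearest-rank definition rather than whatever a given monitoring tool implements.

```python
import random


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (len(ordered) * pct + 99) // 100  # ceiling of n * pct / 100
    return ordered[rank - 1]


def simulate_hour(seed=42, samples_per_window=1000):
    """Return 12 five-minute windows of made-up response times in ms."""
    rng = random.Random(seed)
    windows = []
    for window_index in range(12):
        # Hypothetical scenario: one window out of twelve is much slower.
        if window_index == 7:
            low, high = 500.0, 1500.0
        else:
            low, high = 50.0, 150.0
        windows.append([rng.uniform(low, high) for _ in range(samples_per_window)])
    return windows


def averaged_p95(windows):
    """What a summarizing tool reports: the mean of the stored per-window p95s."""
    per_window = [percentile(window, 95) for window in windows]
    return sum(per_window) / len(per_window)


def true_p95(windows):
    """The p95 computed from every raw measurement taken during the hour."""
    return percentile([value for window in windows for value in window], 95)


if __name__ == "__main__":
    hour = simulate_hour()
    print("average of twelve window p95s: %.0f ms" % averaged_p95(hour))
    print("p95 of all raw measurements:   %.0f ms" % true_p95(hour))
```

With these made-up numbers the averaged figure lands well under half of the percentile computed from the raw data. The exact gap depends entirely on the distribution, but once only the 12 summaries are stored, the second number cannot be recovered at all.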

If you are like almost all of your IT peers, your monitoring tools begin to summarize performance data before it becomes even an hour old. Automatically summarizing performance data is one of the most “valuable” features of RRD Tools which I would bet is the single most common repository for IT performance data today. The perfect Small Data solution.

The point is, as soon as our tools begin summarizing performance data, we lose the ability to accurately analyze that data. Our tools begin to lie to us.

One-Minute-Resolution Data Storage is Here!

Have you noticed your graphs looking better? Zoom in. Notice more detail? That’s what our switch from 5 to 1-minute-resolution data storage has done for our SaaS customers. Couple that with our 7 years of data-retention and you’ve got an unrivaled monitoring and analytics tool. Dear customers, your computations are becoming more accurate as we speak!

“Keeping 5 times the data, combined with visualizing data in 1 second resolution in real-time, gives us unprecedented ability to forecast future trends and ask questions like ‘what happened last year during that big event’,” adds Circonus CEO, Theo Schlossnagle.

The switch from 5 to 1-minute-resolution data storage means more data points.

One Minute Data Storage Now and Then


Companies around the world depend on Circonus to provide unparalleled insight into all aspects of their infrastructure. Their websites typically have large swings in traffic due to a variety of factors, such as product launches, news events, or holiday shopping and other annual events. Our switch to 1-minute-resolution data storage is great news for our SaaS customers.

“Our web infrastructure is essential to our business,” says Kevin Way, Director of Engineering Operations at Monetate. “Circonus gives us the insight we need to provide our customers with the reliability they deserve. The ability to use detailed data from previous events to predict a future event is incredibly valuable! This is a major differentiator for Circonus.”

“At Wanelo we invested heavily into metrics and visibility into our infrastructure, and Circonus is a huge part of our strategy. Being able to visualize our data in 1-minute resolution gives us unprecedented ability to diagnose and remediate issues across all aspects of our infrastructure,” says Konstantin Gredeskoul, CTO at Wanelo.com.

Along with the adoption of DevOps, companies are increasingly dependent on highly dynamic cloud environments, provisioning and deprovisioning infrastructure as needed. This affects decisions made at every level of an organization, from the CEO, to the product team, to IT Operations. A detailed history of usage and traffic patterns – such as from AWS, Azure, Google Cloud, Heroku, Rackspace, or one’s own private cloud infrastructure – gives an organization immeasurable insight into network performance monitoring, as well as the costs associated with providing customers with a world-class experience.

Just a little reminder, Circonus has your back!

Video: Math in Big Systems

Every year the esteemed USENIX organization holds its LISA conference. LISA has transformed slowly over the years as systems, architectures, and the nature of large-scale deployments have changed, but this year represented the largest change to date.

“The format of the conference was substantially different and I believe it (changed) for the best. The topics, content, and speakers were both relevant and fantastic while keeping just enough of the UNIX neckbeard vibe to make it familiar.” – Theo Schlossnagle

This year at LISA, our CEO presented what we’ve been doing in the realm of automatic anomaly detection on high-frequency time-series data; an otherwise dry subject was cordially delivered by Theo and very well received.

Watch LISA14: Math in Big Systems by Theo Schlossnagle

Alerting on disk space the right way.

Most people that alert on disk space use an arbitrary threshold, such as “notify me when my disk is 85% full.” Most people then get alerted, spend an hour trying to delete things, and update their rule to “notify me when my disk is 86% full.” Sounds dumb, right? I’ve done it and pretty much everyone I know in operations has done it. The good news is that we didn’t do this because we are all stupid people, we did it because the tools we were using didn’t allow us to ask the questions we really want to answer. Let’s work backwards to a better disk space check.

There are occasionally reasons to set static thresholds, but most of the time we care about disk space it’s because we need to buy more. The question then becomes, “how much advance notice do I need?” Let’s assume, for the sake of argument, that I need 4 weeks to execute on increasing storage capacity (planning for and scheduling possible system downtime, resizing a LUN, etc.). If you’re a cloudy sort of architecture, maybe you’re looking at a single day so that this sort of change happens during a maintenance window where all necessary parties are available. After all, why would you want to act on this in an emergency?

Really, the question we’re aiming at is “will I run out of disk space within 4 weeks’ time?” It turns out that this is a very simple statistical question and with a few hints, you can get an answer in short order. First we need a model of the data growth and this is where we need a bit more information. Specifically, how much history should drive the model? This depends heavily on the usage of the system, but most systems have a fairly steady growth pattern and you’d like to include some multiple of the period of that pattern.

Graph: Adding an Exponential Regression

To be a little more example oriented, let’s say we have a system that is growing over time and also generates logs that get deleted daily. We expect a general trend upward with daily periodic oscillation as we accumulate log files and then wipe them out. As a rule of thumb, I would say that one week of data should be sufficient for most systems, so we should build our model off 7 days’ worth of history.

Graph: Looking 1 week back and 28 days forward.

Quite simply, we should take our data over the last 7 days and generate a regression model. Then we time-shift the regression model backwards by 4 weeks (the amount of notice we’d like), so that the “current value” is the model-predicted value four weeks from today. If that value is more than 100%, we need to tell someone. Easy.
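As a rough illustration of that idea (not the way Circonus implements it), the Python 3 sketch below fits an ordinary least-squares line to 7 days of samples and evaluates it 28 days past the last sample. The straight-line fit is a simplifying assumption standing in for whatever regression suits your data, and the function names and synthetic data are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def projected_usage(times, values, lead_days):
    """Model-predicted value lead_days after the most recent sample."""
    slope, intercept = fit_line(times, values)
    return intercept + slope * (times[-1] + lead_days)


def needs_alert(times, values, lead_days=28, capacity_pct=100.0):
    """True when the projection exceeds capacity within the lead time."""
    return projected_usage(times, values, lead_days) > capacity_pct


def synthetic_week(start_pct, growth_pct_per_day):
    """Hourly samples for 7 days: steady growth plus a daily log sawtooth."""
    times, values = [], []
    for hour in range(7 * 24):
        day = hour / 24.0
        logs = 2.0 * (hour % 24) / 24.0  # logs accumulate, then are wiped daily
        times.append(day)
        values.append(start_pct + growth_pct_per_day * day + logs)
    return times, values


if __name__ == "__main__":
    for growth in (0.5, 1.5):
        t, v = synthetic_week(70.0, growth)
        print("growth %.1f%%/day -> projected %.1f%%, alert=%s"
              % (growth, projected_usage(t, v, 28), needs_alert(t, v)))
```

In a monitoring context the fit has to be recomputed every time new samples arrive, which is exactly what makes a one-off spreadsheet or script a poor substitute.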

Suffice it to say some tools require extracting the data into Excel or pulling data out with R or Python to accomplish this. While those tools work well, they fail to fit the bill with respect to monitoring because this model and projected value must be constantly recalculated as new data arrives so that we can reduce the MTTD to something expected.

While Circonus has had this feature squirreled away for many months, I’m pleased to say that the alerting UI has been refactored and it is now accessible to mere mortals (at least those mortals that use Circonus).

A New Day for Navigation and Search

Today we finally rolled out the new navigation menu and search bar that we’ve been working on for a while. We had been getting feedback on the poor usability of the old horizontal menu system and knew that a “pain point” had been reached—it was time to revisit how we treated navigation and search in Circonus.

We heard numerous times from users that our old navigation menus were difficult to use, and a recent survey we performed simply underscored that feedback. The horizontal nature of the menus made them tricky to navigate, especially when combined with the fact that they were not very tall. Also, we had outgrown them; after the recent addition of Metric Clusters and Blueprints, we were feeling cramped and were running out of room in the menu system. The last problem (which we started hearing recently from users) is that the location of the search field made it seem like a global search despite the placeholder hint text it contained. Some users who were new to Circonus hadn’t even noticed the search field; it just blended into the interface too well.

In this redesign we’ve shifted paradigms dramatically to alleviate these three problems. We’ve done away with the notion of showing all the menu all the time, and have implemented a large “sitemap” style menu. When the menu is collapsed, you see the current section name and page title beside a hamburger menu icon. This offers a large trigger area and easy-to-use menu with very few “moving parts.” The menu appears when hovering anywhere over the trigger area, making clicking unnecessary (clicking does work, however, for tablet and other touch-based users). This offers plenty of room both horizontally and vertically for future expansion, and it frees up room to the right for more page-related buttons.

our newly redesigned navigation menu

On pages which are searchable, the search bar now sits immediately beside the menu trigger area (containing the page section and title). This makes it easier for users to recognize the contextual nature of the search, and also increases the visibility of search in general. This new search bar provides a dedicated space to show any current search string in operation on the page, and also offers a “minus” button to clear it with a single click. To enter a search string or edit an existing search string, you can click the magnifying glass or click the existing search string, if present. To commit your search string after typing, simply hit enter on your keyboard.

You’ll also notice that we’ve slightly reorganized the menu structure. The main goal of this was to make things more logical; to provide a better model upon which users can base their own mental models, making it easier to navigate Circonus. As such, the sections have been renamed with verbs pertaining to the general tasks related to each section. First is “Collect,” where you’ll find pages related to collecting and organizing data with checks, metrics, metric clusters, templates, and beacons. Next is “Monitor,” where you’ll go to see your hosts’ statuses, set rules, follow up on alerts, and work with contact groups and maintenance windows. Last is “Visualize.” This is where you work with graphs, worksheets, events, and dashboards. Hopefully this will make it easier for new users to get acquainted with the Circonus workflow of collecting data, setting rules to monitor that data, and working with visualizations.

One last benefit of this new menu design is that we now have the opportunity to highlight some secondary links at the bottom of the menu (documentation links, mobile site and changelog links, as well as keyboard shortcuts help). These have been present in the site footer, but many users are unaware of their existence. We wanted to pull some of these links up into a more prominent position since they’re helpful for users.

Thank you to all of our users whose feedback helps us shape Circonus into a better and more useful tool. We couldn’t do this without you!

Blueprints – Graphing made easy

Introducing Blueprints

Today I’d like to introduce a new Circonus feature we’re calling Blueprints. Blueprints is a way to effortlessly create reusable graphs that can be used to visualize any host where the data you’re collecting is similar.

In the modern age of Internet infrastructure our customers are often faced not with managing just one or two machines, but whole clusters of near identical hosts. Deployed with automation tools like Chef, Puppet or cloud virtual machine imaging systems such as Amazon’s AMI, these all need monitoring and visualizing in a way a powerful tool like Circonus can provide.

Circonus has long supported features such as check templates and a comprehensive API that allows easy configuration for gathering similar data across multiple similar machines. When we came up with the concept of Blueprints, we wanted to bring the same power to visualization of the data we’re storing on these multiple instances, and do it in a way that was simple and intuitive to use. Now that concept is a reality as a powerful new tool for you to use.

Within Circonus any graph can now be quickly turned into a Blueprint with just one click and by entering a catchy name:

All the configuration for the graph is gathered up into the Blueprint. Everything from visual components (for example: colors, line style, and axis assignment) through to the more technical details (the metrics that are being rendered, with any formulas, derivatives, and mathematical functions that are applied to them) is saved in a Blueprint so that it can be applied to any future graphs you might create.

Creating a new graph from a Blueprint is a breeze. One click pops open a dialog that allows you to map the hosts that were in the original graph to any replacement hosts from which you’re already collecting similar data:

The selector intelligently offers you only the hosts that make sense for each check. Click, click and you’re done. A new graph for the new host is created in seconds, ready to further customize or share.

Having made the creation of new graphs easy, we wondered if we could do away with it entirely. And we can…with ephemeral visualizations offered on each check:

Clicking the visualize link next to each check now allows you to pick from the blueprints that you can use with this check, and instantly get a popup containing a rendering of the resulting graph. You now can have instant access to the right graph for any check you’re monitoring.

In our own internal use we were taken by surprise at just how powerful creating dynamic visualizations for our hosts is. Blueprints can not only provide us with the most up-to-date graph for each check on our system, but in times of stress they can be used to create ad-hoc graphs that we can then quickly apply to any of the hosts in our system to see which is misbehaving.

We are constantly working to add powerful new features and functionality to Circonus, like Blueprints, that expand its capabilities and make your job easier.

AWS CloudWatch Support

This month we pushed native CloudWatch support – any metric that you have in CloudWatch can now be added to graphs and dashboards, and alerts can be created for them.

  • Auto Scaling (AWS/AutoScaling)
  • AWS Billing (AWS/Billing)
  • Amazon DynamoDB (AWS/DynamoDB)
  • Amazon ElastiCache (AWS/ElastiCache)
  • Amazon Elastic Block Store (AWS/EBS)
  • Amazon Elastic Compute Cloud (AWS/EC2)
  • Elastic Load Balancing (AWS/ELB)
  • Amazon Elastic MapReduce (AWS/ElasticMapReduce)
  • AWS OpsWorks (AWS/OpsWorks)
  • Amazon Redshift (AWS/Redshift)
  • Amazon Relational Database Service (AWS/RDS)
  • Amazon Route 53 (AWS/Route53)
  • Amazon Simple Notification Service (AWS/SNS)
  • Amazon Simple Queue Service (AWS/SQS)
  • AWS Storage Gateway (AWS/StorageGateway)

Overview

This check monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real-time. You can use the CloudWatch Check to collect and track metrics, which are the variables you want to measure for your resources and applications.

From the CloudWatch Check, you can set alerts within Circonus to send notifications, allowing you to make changes to the resources within AWS.  For example, you can monitor the CPU usage and disk reads and writes of your Amazon Elastic Compute Cloud (Amazon EC2) instances and then use this data to determine whether you should launch additional instances to handle increased load. You can also use this data to stop under-utilized instances to save money. With the CloudWatch Check, you gain system-wide visibility into resource utilization, application performance, and operational health.

Circonus takes the AWS Region, API Key, and API Secret, then polls the endpoint (AWS) for a list of all available Namespaces, Metrics, and Dimensions that are specific to the user (AWS Region, API Key, and API Secret combination). Only those returned are displayed in the fields. The names that are displayed under each Dimension type (for example: Volume for EBS) are all the instances that run this Dimension type and have detailed monitoring enabled.

For information on the master list of Namespace, Metric, and Dimension names available and additional information on CloudWatch in general, see AWS’s CloudWatch documentation.

JSON Over HTTP – Data Collection Made Simple

At Circonus, one of our goals is to try to make it as easy as possible to monitor your data. One of the ways we do this is to allow data formatted in JSON to be pushed or pulled over HTTP into Circonus. Since HTTP is spoken everywhere, and JSON is understood everywhere, this allows for easy metric submission so you can collect, store, graph, and analyze everything that you care about.

The HTTPTrap check type accepts JSON payloads via HTTP PUT requests. This allows you to push data from your devices or applications directly into Circonus. This is useful for data that happens sporadically, instead of at a regular or constant interval. HTTPTraps also let you send histogram data into Circonus, so you can see the whole picture instead of one aspect of your data.

The JSON check type gets data from an HTTP endpoint at the interval you select. This allows you to make applications that expose metrics in a JSON format that can be polled regularly from Circonus. These checks allow you to specify a username/password, port, and any additional headers, which gives you security and flexibility in what you allow to connect to your hosts.
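For the pull side, a minimal Python 3 sketch of an application exposing its metrics as JSON might look like the following. The metric names, the loopback bind address, and port 8080 are all placeholders, and the sketch does not cover the username/password or custom-header options mentioned above.

```python
import json
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def collect_metrics():
    """Gather whatever the application wants to expose (hypothetical names)."""
    return {
        "uptime_seconds": round(time.time() - START_TIME, 3),
        "pid": os.getpid(),
        "status": "ok",
        "queue": {"depth": 0},
    }


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers every GET with the current metrics as a JSON document."""

    def do_GET(self):
        body = json.dumps(collect_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Build (but do not start) an HTTP server bound to the loopback address."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling `make_server().serve_forever()` starts serving; a real deployment would bind to an address the broker can reach and restrict who may connect.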

One of the major shortcomings with JSON in most languages is the inability to deal with large numbers. Our parser works around that by allowing you to send the number as a string. This means there is no data that you’re interested in that we can’t collect or accept.
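The underlying problem is that many JSON parsers store every number as a 64-bit floating-point value, which cannot represent every integer above 2^53. The short Python 3 sketch below imitates that with `float()` and shows that the same value survives when sent as a string. The helper names are hypothetical, and Python’s own `json` module does not have this limitation for integers.

```python
import json


def survives_double(n):
    """True if the integer n is unchanged after a trip through a 64-bit float."""
    return int(float(n)) == n


def encode_metric(name, n):
    """Encode a large integer as a JSON string so no digits are lost in transit."""
    return json.dumps({name: str(n)})


def decode_metric(document, name):
    """Recover the exact integer from the string form."""
    return int(json.loads(document)[name])


if __name__ == "__main__":
    big = 2 ** 53 + 1
    print("exact as a double?", survives_double(big))
    print("payload:", encode_metric("bignum_as_string", big))
```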

The ability to use JSON as a format for data also allows you to write your own data collector. For instance, Gollector was written by the folks at Triggit who wanted to have an agent that relied on the proc filesystem and C POSIX calls. Additionally, both Panoptimon (written in Ruby) and our very own nad agent (written in Node.js) utilize JSON to send system information. Customized agents like these allow you to adapt Circonus to your infrastructure and monitoring needs.

To show just how easy it is to format data so Circonus can read it, this is an example Python script that runs once per minute to generate some randomized data. Once you create an HTTPTrap check in Circonus, you can look at the check to get the URL that should be used in the PUT call. The example includes submitting strings, small numbers, large numbers, and a set of numbers that can be used for histogram data. Similar setups can be used in other languages and in your own custom applications.

import json
import urllib2
import time
import random

# Use the URL provided in the UI from the Circonus HTTPTrap check
httptrapurl = "https://trap.noit.circonus.net/module/httptrap/01234567-89ab-cdef-0123-456789abcdef/mys3cr3t"

while True:
    # Make up the data
    data = {
            "number": random.uniform(1.0, 2.0),
            "test": "a text string",
            "bignum_as_string": "281474976710656", 
            "container": { "key1": random.randint(1200, 1300) },
            "array": [
                random.randint(1200, 1300),
                "string",
                { "crazy": "like a fox" }
            ],
            "testingtypedict": { "_type": "L", "_value": "12398234" },
            # Set the type to "n" for histogram-enabled data
            "histogramdata": { "_type": "n", "_value": [int(1000*random.betavariate(1,3)) for i in xrange(10000)] }
    }
    jsondata = json.dumps(data)

    # Form the PUT request
    requestHeaders = {"Accept": "application/json"}
    req = urllib2.Request(httptrapurl, jsondata, headers = requestHeaders)
    req.get_method = lambda: 'PUT'
    opener = urllib2.urlopen(req)
    putresponse = json.loads(opener.read())

    # Print the data we get back to the screen so we can make sure it's working
    print putresponse
    print jsondata
    print

    # Wait a minute
    time.sleep(60)

This will show up in Circonus as:

You can refer to the Circonus User Manual for more details about the HTTPTrap check. Also, please refer to the information there to import our certificate if you see the following error while following these instructions:
urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>

Exploring Keynote XML Data Pulse

I’ll be the first to admit that the Circonus service can be somewhat intimidating. Sometimes it is hard to puzzle out what we do and what we don’t do. Case in point: perspective-based transactional web monitoring.

Many people have asked us, given our global infrastructure, why don’t we support complex web flows from all of our global nodes and report back granular telemetry regarding the web pages, assets and interactivity. The short and simple answer is: someone else is better at it. It turns out they are a lot better at it.

Keynote has been providing synthetic web transaction monitoring via their global network of nodes for many years and has an evolved and widely adopted product offering. So, why all this talk about Keynote?

But why?

You might ask why it is important to get deep, global data about web page performance into Circonus. It’s already in Keynote, right? Their tools even support exploring that data, arguably better than Circonus does.

The reason is simple… your other critical performance data is in Circonus too. Real-time correlations, visualization and trending can’t happen easily unless the data is available in the same toolset. Web performance is delivered by web infrastructure. Web performance powers business. Once all your performance data is in Circonus, you can tie these three macro-systems together in a cohesive view and produce actionable information quickly.

The story of how we made this possible is, as most good stories are, rife with failures.

Phase Failure: the Keynote API

For over a year, we’ve had support for extracting telemetry data from Keynote via their traditional API. For over a year, most of our customers had no idea… because it was in hidden beta. It was hidden because we struggled to make it work. Honestly, the integration was painful due to the API allowing us to pull only a single telemetry point at a time. It was so painful that we struggled to add any real value on top of the data they stored. The API is so bad (for our needs) it almost looks like Amazon CloudWatch (a pit of hell deserving of a separate blog post).

If you look at a standard deployment of Keynote, you might find yourself pulling 200-300 measurements from 15 different locations every minute. For Circonus to pull that feed, we’d have to make up to 4500 API calls/minute (300 measurements × 15 locations) to Keynote for each customer! That’s not good for anyone involved.

Phase Success: the Keynote XML Data Pulse

Recently, our friends over at Keynote let us in on their new XML Data Pulse service which looks at their data more “east and west” as opposed to “north and south.” This newer interface with Keynote’s global infrastructure allows us to pull wide swaths of telemetry data into our systems in near real-time… just like Circonus wants it.

If you’re a Keynote customer and are interested in leveraging our new Data Pulse integration, please reach out to your Keynote contact and get set up with a Data Pulse agreement.

Monitoring Elasticsearch

With the much anticipated announcement of the Elasticsearch 1.0.0 release, we thought we’d mention that several of the features that you use within Circonus are powered by Elasticsearch behind the scenes.

We could never, in good conscience, run a product or service that we couldn’t extensively monitor. So, when it comes to monitoring things we say once again, “Yeah, we do that too.”

Adding Elasticsearch telemetry collection in Circonus is as easy as selecting the Elasticsearch check type and entering the node name. What comes back is a plethora of statistics from the cluster node.

{
  "cluster_name": "elasticsearch",
  "nodes": {
    "zB3lYhArQJCJgJ5szVr4uA": {
      "timestamp": 1392415145096,
      "name": "Hawkeye II",
      "transport_address": "inet[/10.8.3.13:9300]",
      "host": "client-10-8-3-13.dev.circonus.net",
      "indices": {
        "docs": {
          "count": 0,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 0,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 0,
          "index_time_in_millis": 0,
          "index_current": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
...

On an instance here, 382 gratuitous lines of JSON ensue, all of which we turn into metrics for trending and alerting.
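One plausible way to picture that conversion is to walk the nested document and emit one metric per leaf value, as in the Python 3 sketch below. The dotted naming scheme and the `flatten` helper are assumptions made for illustration, not necessarily how Circonus names the resulting metrics.

```python
def flatten(doc, prefix="", sep="."):
    """Walk nested dicts and lists, returning {metric_name: leaf_value}."""
    if isinstance(doc, dict):
        items = doc.items()
    elif isinstance(doc, list):
        items = ((str(index), value) for index, value in enumerate(doc))
    else:
        # A leaf (number, string, boolean, null) becomes one metric.
        return {prefix: doc}
    metrics = {}
    for key, value in items:
        name = key if not prefix else prefix + sep + key
        metrics.update(flatten(value, name, sep))
    return metrics


if __name__ == "__main__":
    # A small fragment shaped like the node statistics shown above.
    sample = {
        "cluster_name": "elasticsearch",
        "nodes": {
            "zB3lYhArQJCJgJ5szVr4uA": {
                "indices": {"docs": {"count": 0, "deleted": 0}},
            },
        },
    }
    for name, value in sorted(flatten(sample).items()):
        print(name, "=", value)
```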

We use this to track the inserts and deletes and the searches performed on each node:

We’d also like to give a shout out to the Elasticsearch crew for their successful release. As “metrics people” we’re pleased to see that the old “*_time” metrics that were not easily machine readable have gone the way of the Dodo and “*_time_in_millis” style metrics have prevailed. You all made the most of the breaking 1.0.0 opportunity to break things in a good way!