Understanding service latencies via TCP analysis

Latency is the root of all that is evil on the Internet… or so the saying goes. CPUs get faster, storage gets cheaper, IOPS are more plentiful, yet we feeble engineers have done close to nothing on improving the speed of light. While there is certainly a lower bound to latencies rooted in the physical world, many of the latencies of today’s services (delivered both globally and within the datacenter) are affected by things that are under our control. From a low level to a high level, latencies suffer from service capacity issues, routing issues, systems and network tuning issues, bad code, bad deployments, human errors and even usage errors. You can’t improve what you can’t measure, so let’s get to measuring some latencies.

Most Internet services are pretty simple; a client calls you up and asks a question and you provide an answer. Due to this incredibly common pattern and the nuances of HTTP, it turns out to be trivial to measure service response times with nothing more than a packet sniffer (e.g. tcpdump).

“What is this nuance of HTTP?” you might ask. It is that HTTP requires the client to speak first. While all of the rest of the article can be used for more complicated services where that is not true, it oftentimes requires a deeper understanding of the protocol at hand and more adroit tcpdump manipulation.

“Client speaks first” makes it very easy to blindly assess the service latency on the initial HTTP request on a TCP connection. Returning to the phone call metaphor, I can simply watch the phones and measure the time between the first ring and the first word spoken by the receiver; I don’t care much about what happens in between. Granted, there is a wealth of information in the packets themselves: client (IP), agent (User-Agent), endpoint (requested URI), etc. Yet, to measure overall service latency, I simply need the first TCP SYN packet in and the first non-zero payload packet out on a given TCP/IP connection.

What are we looking at?

At this point we should stop and understand a bit better what we are actually measuring. Wouldn’t it be easier to just ask the service we’re running to measure its service times and report them? The answer to this varies from architecture to architecture, but if the answer is “no,” you should have birds and bees conversation with the software engineers or operations folk responsible for running the service. It should be absolutely trivial to track service times within the architecture.

Here, we’re asking something a tad more complex. Our measurements on the network side include an element of latency between the client and the server. What we are trying to estimate here is the elapsed time between the client’s attempt to retrieve content and the first byte of that content arriving.

If we measure from SYN receipt to data write, we miss two vital times from the user’s perspective. We didn’t get the SYN until sometime after they sent it, and they won’t see our data until sometime after we’ve sent it. Because each of these is a single packet, we just need to know the time it takes to roundtrip a packet. Where can we get that information?

; ping 2.15.226.34
PING 2.15.226.34 (2.15.226.34): 56 data bytes
64 bytes from 2.15.226.34: icmp_seq=0 ttl=249 time=39.022 ms
64 bytes from 2.15.226.34: icmp_seq=1 ttl=249 time=40.013 ms
64 bytes from 2.15.226.34: icmp_seq=2 ttl=249 time=34.300 ms
64 bytes from 2.15.226.34: icmp_seq=3 ttl=249 time=40.373 ms
64 bytes from 2.15.226.34: icmp_seq=4 ttl=249 time=33.129 ms
^C

It isn’t like we can ping the remote machine. Instead we need to calculate this passively. TCP has this really convenient 3-way handshake that starts up a session, which goes something like:

  1. Client: “I’d like to ask you to tell me a joke.”
  2. Server: “Okay, ask away”
  3. Client: “Okay, I’d like you to tell me a joke about TCP.”
  4. Server: “These two packets walked into a bar…”

From the TCP nitty-gritty, if we measure the time from the first SYN to the subsequent ACK packet (before any data has transited), we have a rough estimate of the roundtrip time between the client and server. If we measure between observing (1) and sending (4), the part that we’re missing is the time between (1) being sent by the client and arriving at the server, and (4) being sent and arriving at the client. That is approximately one roundtrip.

So, to summarize the long story: if we take the time between observing (1) and (4) and add the time between observing (2) and (3), we should have a rough approximation of the total time the client witnessed between making the request and receiving the first byte of the response. We’ll take a further shortcut and note that the latency between (1) and (2) is nominal on (at least our) systems, so we only track the timings of (1), (3) and (4) and calculate (4) – (1) + (3) – (1).

Actually doing this.

We’ll use tcpdump on our server to try to find what we’re looking for: the SYN from the client that will start the session, the ACK from the client that completes the three-way TCP handshake (in response to the server’s SYN-ACK) and the first non-zero-length data payload from the server to the client.

For our example, let’s look at web traffic on port 80. First we need to know how much data is in the packet. The IP header has the total packet length (a 2-byte short starting at the second byte, which in BPF syntax is ‘ip[2:2]’). However, we’re looking for the payload length, so we need to subtract the length of the IP header, ‘((ip[0]&0xf)*4)’, and the length of the TCP header, ‘((tcp[12]&0xf0)/4)’. We expect one of the SYN or ACK bits to be set in the TCP flags and (because these are from the client) a destination port of 80:

'((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80)'

The data payload from the server to the client is simply a non-zero payload length with a source port of 80:

'(((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80)'

To test this, I’ll further restrict this to the host 10.10.10.10 (my laptop) and run tcpdump. I don’t want to resolve IP addresses to names and I would like to keep the output terse with high-precision timestamps: (-n -ttq). Additionally, I need to capture at least 54 bytes of each packet to get to the TCP headers (this is left as an exercise for the reader). Putting this all together in a tcpdump command (and limiting the scope to my client at 10.10.10.10), we can see something like this:

/opt/omni/sbin/tcpdump -n -ttq ' ((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80) or (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80) and host 10.10.10.10'
listening on us340, link-type EN10MB (Ethernet), capture size 384 bytes
1370816555.510014 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.542407 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.585086 IP 2.15.226.34.80 > 10.10.10.10.57300: tcp 458
1370816555.617389 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.617398 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.617622 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0

The first three lines are exactly what we’re looking for. The subsequent lines would be (in this case) the shutting down of this TCP connection. These packets are harder to avoid using BPF syntax and we’ll figure out a way to ignore them as needed.
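As a quick sanity check, plugging the three timestamps above into (4) – (1) + (3) – (1) gives (1370816555.585086 – 1370816555.510014) + (1370816555.542407 – 1370816555.510014) ≈ 0.075 + 0.032 ≈ 0.107 seconds, or roughly 107ms to first byte for this connection; the ~32ms handshake roundtrip also lines up nicely with the ping times we saw earlier.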

Doing something useful.

So, it is clear that on any high-volume service, these packets are going to be coming in quickly and we’ll need to process these latency calculations programmatically.

This is a bit more complicated, so I whipped together a small node.js program that runs tcpdump, consumes its output, calculates the estimated time-to-first-byte latencies and securely pushes them up to a specific Circonus httptrap check.

The only two arguments you need to supply are the UUID and secret from your httptrap in Circonus. Once running, the script will submit all latencies to one metric named “aggregate`latency” and another named “<hostname>`latency” (where <hostname> is the hostname of the machine on which it is running). This allows you to point a handful of machines at the same httptrap check and collect both their individual latencies and a recalculated aggregate set of latencies.
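To give a feel for the core of that calculation, here is a minimal sketch (not the actual script, which also spawns tcpdump itself and handles the Circonus submission); it assumes the -n -ttq output format shown above, and the names in it are purely illustrative:

    // Minimal sketch (not the actual script): read "tcpdump -n -ttq" lines on stdin and
    // estimate time-to-first-byte per connection as (4) - (1) + (3) - (1).
    var readline = require('readline');

    var conns = {};   // keyed by "client_ip.port" => { syn: t1, ack: t3 }

    readline.createInterface({ input: process.stdin }).on('line', function (line) {
      // e.g. "1370816555.510014 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0"
      var m = line.match(/^(\d+\.\d+) IP (\S+) > (\S+): tcp (\d+)/);
      if (!m) return;
      var ts = parseFloat(m[1]), src = m[2], dst = m[3], len = parseInt(m[4], 10);

      if (len === 0 && /\.80$/.test(dst)) {
        // Zero-length packet from the client: the first is the SYN (1), the second the handshake ACK (3).
        if (!conns[src]) conns[src] = { syn: ts };
        else if (conns[src].ack === undefined) conns[src].ack = ts;
      } else if (len > 0 && /\.80$/.test(src) && conns[dst] && conns[dst].ack !== undefined) {
        // First data payload from the server (4): report the estimate for this connection.
        var c = conns[dst];
        console.log(dst, ((ts - c.syn) + (c.ack - c.syn)).toFixed(6));
        delete conns[dst];   // the later teardown packets still need filtering in a real script
      }
    });

Piping the capture shown above through it would print roughly 0.107 seconds for that connection; the real script pushes the number to the httptrap instead of printing it.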

Now for the magic:


In the above graph, we can see three distinct nodes and the estimated latency (in seconds) for delivering the first byte of data to the client. Every single client TCP connection is accounted for. There is no sampling or missing data; we see a comprehensive and complete picture of our service latencies. Talk about an eye-opener.

This requires only 4 metrics in Circonus, which costs less than $2/month. Unparalleled value.

Manage Metrics with circonus-cmi

As an environment grows, so does the headache of managing checks and metrics. Every new machine and cluster node requires a new set of checks and metrics. There are the Circonus API and the circonusvi tool, but those still require manual intervention. Wouldn’t it be nice if the same tools used for provisioning could handle Circonus setup as well? With circonus-cmi, that gap has now been bridged. Circonus-cmi is a lightweight command line application written in Node.js, which reads configuration information from provisioners and updates Circonus accordingly.

Circonus-cmi currently supports Amazon AWS and Chef Server. AWS integration is used to tag existing checks based on target with ec2 region, zone, id, and instance type information. This becomes very useful when one wants to compare performance between regions, for instance. The Chef Server integration, on the other hand, is able to do the heavy lifting of check management. By specifying checks, tags, and metrics within the Chef attribute hierarchy, Circonus metrics can be both added and updated. It is additive by default, merging the check specification in Chef with an existing check configuration in Circonus, unless explicitly told to do otherwise. As a result, there’s no worry that circonus-cmi will wipe out changes made manually.

The goal of circonus-cmi is to make check management as easy and flexible as possible. If there are any features that you’d like to see added, please let us know.

On the Move: Circonus Mobile

It’s been a long time coming, but it’s finally here: a Circonus that’s optimized for low-resolution devices like your smartphone or tablet! For the past few weeks, eagle-eyed Circonus customers may have noticed links to the mobile site creeping into various places, such as the login page (login.circonus.com) and the application footer. Although we officially “flipped the switch” a few weeks ago to enable the new Circonus Mobile experience, it is definitely a work in progress. To visit Circonus Mobile, visit m.circonus.com or use the “Switch to Mobile Site” link in the main site footer.

The most common need when using Circonus on a smartphone is being able to easily and quickly respond to alerts, so we made alerting functionality a priority when developing this new interface. The Current Alerts and Alert History pages are complete, already having feature parity with the main site. Also fully implemented are the Account Profile and Account Brokers pages, along with the User Profile and User Preferences pages, but most other pages are being implemented in a phased manner, beginning with read-only versions. TIP: when viewing a paginated list of items in Circonus Mobile, you may swipe left and right to page through the list (if you’re using a browser which supports touch events).

Our long-term goal for Circonus Mobile is not only achieving feature parity with the main site, but going further with push notifications and other native app functionality. But to give you a peek into our short-term goals, we’re currently working on making the Maintenance page fully functional, and after that we will move on to read-only versions of the Graphs and Worksheets pages.

If you use Circonus Mobile and don’t see a feature from the main site that you want, please let us know! We’re continually implementing features on the mobile site, but we’re more than willing to rearrange our implementation roadmap based on our customers’ needs.

What’s New Q1 2013 Edition

Navigation and URL Changes

I’ll start this update talking about the most obvious changes to the UI, which is our new completely horizontal navigation and new URL structure.

We had received a lot of feedback about the mix of horizontal and vertical navigation. We were told the tabs were hard to read, especially with the text rotated 90 degrees; users weren’t sure the tabs were part of the navigation once they got to the page; and the ever-confusing + button had to be explained to almost everyone. It was our take on what I would call a classic site layout, but in the end it didn’t really resonate with our customers.

The new layout is much easier to understand and use: main nav at the top, sub nav below, search box in the upper right where one would expect it, and the + button replaced with text like “+ New Check” or “+ New Graph”. An added benefit of this change is that we were able to reclaim some vertical whitespace, making the page a bit tighter. Not all of the changes were aesthetic: we recently read an article by Ben Kamens about Breaking down Amazon’s Mega Dropdown, so we incorporated some of those ideas, looking at mouse acceleration and direction so we don’t flip to another sub nav on hover as you try to navigate to a new page.

As you probably also noticed, our URL structure changed to .circonus.com. This should have been as transparent as possible: old URLs will continue to work for the foreseeable future, though in-app links will take you to the new format.

Simplified Maintenance

In the middle of February we launched the much-awaited maintenance window rework. Prior to this, the controls were fine-grained, requiring users to mark individual metrics as in maintenance. While this provided a ton of flexibility, it made it more difficult to work at a higher level as machines were taken offline for work, etc.

The new maintenance windows solve this by providing multiple layers of control. Starting at the metric level, you can now also set maintenance on a check, a check bundle, a host, or your entire account. Along with this change, items in maintenance no longer trigger new alerts in the UI; previously you would see each alert but notifications would be silenced. This reduction in noise should help you find problems faster.

MySQL From C to Java

Our last item involves a broker update that is coming down the pike later this week, and one we wanted to make sure was announced a little ahead of time.

Currently, the way we communicate with MySQL is through the C library that we build as a module for our broker. This library can at times cause stability issues in certain scenarios, typically when a server stops responding correctly. Additionally, MySQL does some tricky things internally with the various “show X” commands that make the nuances hard to maintain in C. Because of this we are switching out the C lib for a JDBC implementation; JDBC is already how we talk to Oracle and MS SQL Server.

We’ve been running this new version of the check internally and with some of our customers for a while, taking special care to make sure metric names don’t change, and we are pretty confident that this will be a transparent switch. Additionally, we have been able to add much improved support for commands like “show status” and “show master/slave status”, among others. Should you notice any issues after an upgrade, please contact support.

Update (3/21): Due to some unrelated changes, the release of the new broker code has been rescheduled to next week.

2013 has already proven busy for us and doesn’t look to be slowing down. As always, give us your feedback, user input helps drive our development.

Web Application Stat Collection with Node.js and Circonus

Collecting and carefully monitoring statistics is an essential practice for catching and debugging issues early, as well as minimizing degraded performance and downtime. When running a web service, two of the most fundamental and useful stats are request/response rates (throughput) and response times (latency). This includes both interactions between a user and the service as well as between the backend components. Fortunately, they’re also two of the most straightforward metrics to aggregate and track.

The rates are perhaps the easiest to track, as they are not much more than incrementing counters. In essence, one increments a counter each time a request is made and each time a response is sent. Individual requests and responses are unrelated as far as counts are concerned. In a multi-process or clustered environment, a tool such as Redis, with its atomic increments, can be used for global aggregation. Once aggregated, it can be checked and monitored by Circonus, with rates being derived from count changes over time.
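As a rough sketch of that pattern (this is not the stat-collector module described below; the hash and field names are illustrative and the node-redis client is assumed):

    // Rough sketch of counter aggregation (illustrative names, not the stat-collector module).
    var redis = require('redis');
    var client = redis.createClient();          // assumes a reachable Redis instance

    var counts = { request: 0, response: 0 };   // local, per-process counters

    function onRequest()  { counts.request++;  }
    function onResponse() { counts.response++; }

    // Periodically fold the local counts into shared Redis hashes atomically, then reset them.
    setInterval(function () {
      client.hincrby('foo_requestTotal',  'total', counts.request);
      client.hincrby('foo_responseTotal', 'total', counts.response);
      counts.request = 0;
      counts.response = 0;
    }, 30000);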

Response times are a bit trickier. Time needs to be tracked between a specific request being made and the response being sent for that request. Each time needs to be tracked for use beyond simple averages, as histograms and heatmaps often tell a very different story. Sending these values to Circonus via an HTTP trap will provide that level of detail. Local collection of times before communication to Circonus should be used to prevent an unnecessarily large number of calls being made to store the data.

The above image is a line graph of the approximate average latency in ms over time for a particular URI. According to that data, average latency is more or less consistent, just below 1k. That conclusion, however, does not accurately describe the system. With a heatmap overlay, this becomes clear.

The picture painted here is quite different than with the simple line graph. The first key element revealed is that the latency is multimodal, with high frequency around both 1k and 0.5k. The second is that there is a comparatively small, though non-negligible, frequency well in excess of 1k. Perhaps these two features of the data are expected, but may indicate an issue with the system. This kind of insight is often essential to the identification and remedy of problems.

The process of collecting all of this information will become clearer with an example. The following implementation is based on a Node.js web API service using the Express.js framework, but the principles apply equally well to any internal queries or flows. A module has been created to provide this sample functionality and can be installed with:

npm install stat-collector

A simple app utilizing the module to collect both stat types:

    var express = require('express');
    var StatCollector = require('stat-collector');

    var statCollector = new StatCollector();
    // Global debug logging (default false)
    statCollector.setDebug(true);

    var queryCounterConfig = {
        // Configuration options for node-redis
        redisConfig: {
            host: 'localhost',
            port: 1234,
            options: {}
        },
        // Prefix for stat hashes in Redis (e.g. 'foo_' for foo_requestTotal)
        statPrefix: 'foo_',
        // Interval (ms) in which to update Redis (default 30000)
        pushInterval: 30000,
        // Debug logging (default false)
        debug: false
    };

    var responseTimerConfig = {
        // Circonus HTTP Trap check config
        circonusConfig: {
            host: 'foo',
            port: 1234,
            path: '/'
        },
        // Interval (ms) in which to push to Circonus (default 500)
        pushInterval: 500,
        // Debug logging for this module (default false)
        debug: false
    };

    statCollector.configureQueryCounter(queryCounterConfig);
    statCollector.configureResponseTimer(responseTimerConfig);

    var app = express();

    app.configure( function(){
        app.use(statCollector.middleware());
        app.use(app.router);
    });

    app.get('/foo',function(req,res){
        res.jsend('success',{'foo':'bar'});
    });

    app.listen(9000);

Only the components of StatCollector that are configured will be enabled. The QueryCounter stats contain counter sets for requests and responses, as well as for each response status, with counters for each URI. It establishes a connection to Redis, and a method for pushing data to Redis is set on an interval. The ResponseTimer stats contain one object of the following form for each URI:

{ _type: "n", _value: [] }

A method to push this data to Circonus is also set on an interval.

The middleware takes care of three things: enhancing the response object, registering the URI with enabled components, and telling the enabled components to track the request/response. In the basic preparation, res is given a method jsend to be used instead of res.send, standardizing the JSON output structure and saving the response status (“success”, “fail”, “error”) to res.jsendStatusSent. The res.end method, which is called automatically after the response is handled, is wrapped so that it will emit an end event. The URI registration adds URI-specific counters and objects to the QueryCounter and ResponseTimer stats, respectively, if they don’t already exist. The request and response are then fed through a tracking method that records statistics both at the time of request and after a response, on the end event. This is just a detailed way of saying: when a request completes, note how long it took and what kind of response it was.
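The timing portion of such middleware boils down to something like the sketch below. This is a simplification: the real module standardizes output via res.jsend and emits an end event, whereas this version simply wraps res.end and calls a hypothetical recordLatency callback directly:

    // Simplified sketch of response timing middleware (not the module's implementation).
    function timingMiddleware(recordLatency) {
      return function (req, res, next) {
        var start = Date.now();
        var end = res.end;
        res.end = function () {            // res.end fires once the response has been handled
          end.apply(res, arguments);
          recordLatency(req.path, Date.now() - start);   // elapsed milliseconds for this URI
        };
        next();
      };
    }

Registering it with app.use(timingMiddleware(recordLatency)) ahead of the router would then time every response.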

The general flow for request and response count statistics is Node.js processes to Redis to Circonus via Redis-type checks. On a request, the total request and URI-specific request counters are incremented. When res.end fires, the URI-specific counters are incremented in QueryCounter, with res.jsendStatusSent determining the appropriate response type counter. Periodically, these stats are used to increment a set of Redis hashes and then reset. In Redis, the request, response, and each response type are separate hashes, with ‘total’ and specific URIs as fields, and the corresponding counters as values. A set of Redis-type checks are set up in Circonus to run an hgetall query on each of those hashes. Since these are monotonically increasing values, the ‘counter’ display type should be used when graphing rates to show the first order derivative (rate of change). When graphing total values, gauge (raw value) is appropriate.

For response times, data goes from Node.js processes to Circonus via an HTTP Trap. On the res.end event, the overall response time is calculated as the difference between the time saved when the req came in and the current time. The difference is pushed into the _value array of the URI-specific object. Periodically, the entire set of stat objects is converted to JSON and pushed to a Circonus HTTP trap, being subsequently reset. For optimal histogram live view, this push is done at sub-second intervals. In Circonus, the metrics must be enabled twice for histograms; the first time will enable collection, the second time will enable histogram support via the icon that resembles three boxes. Both are collected, distinguished by the icons. Graphing the standard metric will yield the normal line/area graph while graphing the histogram metric will yield the histogram/heatmap.
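Continuing that sketch, a recordLatency that accumulates values in the { _type: "n", _value: [] } shape and periodically pushes them to an HTTP trap might look roughly like this (the host and path are placeholders; substitute the data submission URL of your own HTTP Trap check):

    // Sketch of accumulating response times and pushing them to an HTTP Trap check.
    var https = require('https');

    var stats = {};   // uri => { _type: "n", _value: [ms, ms, ...] }

    function recordLatency(uri, ms) {
      if (!stats[uri]) stats[uri] = { _type: 'n', _value: [] };
      stats[uri]._value.push(ms);
    }

    setInterval(function () {
      var body = JSON.stringify(stats);
      stats = {};                                   // reset after each push
      var req = https.request({
        host: 'trap.example.net',                   // placeholder: your check's submission host
        path: '/module/httptrap/UUID/SECRET',       // placeholder: your check's submission path
        method: 'PUT',
        headers: { 'Content-Type': 'application/json',
                   'Content-Length': Buffer.byteLength(body) }
      }, function (res) { res.resume(); });
      req.on('error', console.error);
      req.end(body);
    }, 500);                                        // sub-second pushes keep the live histogram view fresh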

Request & response rates and response times are so valuable and straightforward to track, there’s absolutely no reason not to.

Brian Bickerton is a Web Engineer at OmniTI, working on applications where scalability and uptime are critical, using a variety of tools and technologies.

Interacting with Circonus through your text editor – circonusvi

I’m a big fan of command line tools for many tasks. There are just some tasks that can be done quicker and easier with a little typing rather than pointing and clicking. A little while ago, I discovered a gem of a command line tool called ldapvi. Without going into too much detail, this tool lets you run a query against an LDAP server where the results show up in your favorite text editor. The magic part is that when you edit that file and save it, whatever changes you made are then pushed back to the server. This method of working can be extremely flexible, as you have all of the power of your text editor at your fingertips to make both small and sweeping changes quickly and easily.

When the new Circonus API was released, I realized that it was now possible to do the same thing for Circonus with relative ease. It would have been impossible to do with the old API, but the new one was designed with consistency in mind, and a simple (dare I say the word REST) interface. The result is circonusvi. Circonusvi essentially allows you to interact directly with the circonus API via your text editor, and it works as follows:

  • You decide what you want to work with: check bundles, rule sets, worksheets and so on, and pick that as your endpoint.
  • Circonusvi queries the API and lists all resources, which are returned as JSON.
  • Optionally, circonusvi filters this output to only show the resources you want.
  • This JSON (slightly modified) is shown in your text editor, where you can make changes (edit checks, remove checks, add new checks) as needed.
  • Circonusvi identifies what changes were made, and pushes them back to the server.

Here’s a simple example, taken from an actual situation we came across: needing to switch contact groups for a series of rules. We had some rules which were a severity 2 (our convention is that sev2 alerts are mail only), and we realized that we really should have them be sev1 (in other words, wake up the on call SA). To make this change in the web interface, you need to go into each rule individually, change the criteria from sev2 to sev1, and change the contact group associated with sev1 alerts. This is time consuming and error prone, but at the same time, it’s a one-off task that you probably don’t want to write a script for.

First, we identify the contact groups in circonusvi:

./circonusvi.py -a example_com -e contact_group

{
    # Mail only
    "/contact_group/123": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "sa@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "Mail only",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    },
    # On call
    "/contact_group/124": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "oncall@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "On call",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    }
}

This shows two contact groups, with the ID of the mail only contact group being /contact_group/123, and the on call contact group being /contact_group/124.

Next, we need to do the actual editing:

./circonusvi.py -a example_com -e rule_set 'metric_name=^foo$'

Here we specified a filter on the rules to match only those that apply to a metric named ‘foo’. The part before the equals sign is the key you filter on. You can specify any key in the results for the filter, as long as it has a simple value (i.e. string or number). The right hand side of the filter is a regular expression. Any entry that matches the regular expression will be displayed. A small section of the results is shown below:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [],
            "2": [
                # Mail only
                "/contact_group/123"
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "2",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

There are two things we want to change here. First the severity when the rule fires. This is a simple search/replace in your text editor. I’m a vim fan, so:

:%s/"severity": "2"/"severity": "1"/

Next, we need to make sure that severity 1 alerts go to the on call group. They don’t currently:

:%s#"1": \[\],#"1": [ "/contact_group/124" ],#

If you want, you can remove the sev2 contact group by deleting that section:

:g#"/contact_group/123"#d

The ‘comment’ line above can be left alone, or deleted too with a similar command.

Once you are finished, your file will look something like this:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [ "/contact_group/124" ],
            "2": [
                # Mail only
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "1",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

Save and quit, and you’ll be presented with a prompt asking you to confirm your changes, and give you a chance to preview the changes before submitting them:

0 additions, 0 deletions, 85 edits
Do you want to proceed? (YyNnEeSs?)

I would recommend previewing exactly what has changed before submitting. Pressing S will show a unified diff of what changed. Once you proceed, circonusvi will apply the changes via the API, and you’re all done!

This method of interacting with the API is most useful when making edits or deletions in bulk (if you delete an entry in the text editor, circonusvi will delete it via the API). Additions are also possible, but are more cumbersome as you have to create the JSON precisely from scratch. For bulk additions, there are other scripts available, such as the circus scripts. However, using your editor to make bulk changes is a very powerful way of interacting with an application. Give it a try and see how well it works for you.

Mark Harrison leads the Site Reliability Engineering team at OmniTI where he manages people and networks, systems and complex architectures for some of today’s largest Internet properties.

PagerDuty Integration Improvements and Alert Formats

Recently we got burned by ignoring a page because the actual message we received lacked detail; it looked like an alert that was known to clear itself. At 3am it is hard to bring yourself to get out of bed when you have seen this alert page and clear time and time again, so it was forgotten. Four hours later the alert was spotted by another admin and resolved, and an analysis was done to determine how this happened.

The root cause, we determined, was the aforementioned lack of detail. When Circonus would send an alert to PagerDuty, we would do so in our “long format”, which is the alert format you get when you receive email notifications (more on this and the “short format” later). PagerDuty then truncates this message to fit a standard 160-character SMS; this truncation of detail led to a lot of alerts looking like each other, and some that were more critical were assumed to be of lesser importance and ignored.

Improvements to PagerDuty Messages

To solve this, we just pushed out a change to include both the short format and long format in a PagerDuty message. The short format is what we use for SMS alerts, and is now the first line of the message. When the truncation happens on their end, you should receive as much detail as possible about the alert. This does lead to redundant information in the message body in their UI and email alerts, but we feel it is for the better.

Secondly, we are providing details about the alerts to PagerDuty’s API. These details currently are:

  • account
  • host
  • check_name
  • metric_name
  • severity
  • alert_value
  • client_url

These details are useful if you are pulling alerts from the PagerDuty API; instead of parsing the message, you should receive a JSON object with these keys and their associated values.
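As an illustration only, using the sample alert shown in the next section, the details object would contain something along these lines (values taken from that example, URL shortened):

    {
        "account": "Circonus Testing",
        "host": "development.internal",
        "check_name": "Test Check",
        "metric_name": "cpu_used",
        "severity": "2",
        "alert_value": "89.65",
        "client_url": "https://..."
    }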

How Circonus Alerts are Formatted

As mentioned before, Circonus has two alert formats. A long format which is used for email, XMPP, AIM and PagerDuty alerts, and a short format which is used for SMS, Twitter and now PagerDuty.

The short format is intended to compress as much detail about the alert as possible while remaining readable and useful. An example of this type of alert:

[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)

I’ll break this alert up into its various sections to describe it:

  • [Circonus Testing] is the name of the account
  • A = Alert, 2 = Severity 2, 1 = The number of sev 2 alerts. The “A” here could also be R for Recovery
  • development.internal is the hostname or IP this alert was triggered on
  • “Test Check” is the name of the check bundle in the Circonus UI
  • cpu_used is our metric name and (89.65) is the value that triggered the alert

The long format is more self-explanatory since we have many more characters to work with.

Account: Circonus Testing

ALERT Severity 2
Check: Test Check
Host: development.internal
Metric: cpu_used (89.65)
Agent: Ashburn, VA, US
Occurred: Tue, 8 Jan 2013 2:25:53

This is the same alert as above, so breaking it apart we have:

  • Account name
  • Alert of Severity 2 (this could also be RECOVERY). The alert count is missing because in the long format we list out each alert separately.
  • The check name
  • The host / IP
  • The metric name and alert value
  • The broker / agent that the alert was triggered from
  • The time that the alert was triggered; if this is a recovery you will also have a cleared time.
  • The Circonus URL to view the alert in the UI

In the future we intend to allow the alert formats to be customized for each contact group, or use these current formats as the default.

Thanks to redundancy built into Circonus, our users were never impacted by the outage that precipitated this change, but if it can happen to us it will happen to others, so we hope these minor changes bring improvements to your response times.

Sometimes you just need a different hammer

Circonus has a lot of powerful tools inside, but as anyone who has worked with real data knows: if you can’t get your data out into the tool you need, you’re going to suffer. We do all sorts of advanced analysis on telemetry data that is sent our way, but the systems we use to do that are somewhat isolated from the end user. While simple composites and math including population statistics are available in “point-and-click” fashion, more complicated, ad-hoc analysis on the “back-end” is only possible by Circonus engineers.

The “back-end”/”front-end” delineation is quite important here. In an API-driven service like Circonus, the front-end is you or, more specifically, anything you want to write against our APIs. Oftentimes, when using services, there are two types of API consumers: third-party tools and internal tools. Internal tools are the Python or Ruby or Java programs you write to turn knobs, register checks, configure alerts and otherwise poke and prod at your Circonus account. The third-party tools are simply the tools that someone else authored and released for your use.

On to data languages.

So, where is this going? The traditional scripting languages aren’t really the limit of API consumers when it comes to big data services like Circonus. If you think about the fact that you’ve been dumping millions or billions of telemetry points at Circonus, some of our backend systems look a lot more like a data storage service than a telemetry collection orchestration service. More languages enter the fray when the APIs return data… such as R.

Working with Circonus data from within R is easy. After installing the circonus R package, you simply create a token to use the API with your Circonus account and then use that UUID to establish a connection within R:

api <- circonus.default('06178a18-0dff-0b0e-0794-0cc3fd350f04')

It's worth noting that the first time you run this command after creating your token, you'll get an error message saying that you need to authorize it for use with this application. Just go back to the API tokens page and you'll see a button to allow access to R. Once authorized, it will continue to work going forward.

Getting at data.

In the following Circonus graph, I'm looking at inbound and outbound network traffic from a node in Germany over a two month period starting on 2012/09/01.

If I click on the "view check" link in the legend, it takes me to check 111. Using that little bit of information, I can pull some of that information right into R. Assuming I want to pull the outbound network traffic, Circonus tells me the metric name is "outoctets."

> checkid <- 111
> metric <- "outoctets"
> period <- 300
> start_time <- '2012-09-01 00:00:00'
> end_time <- '2012-11-01 00:00:00'
> data <- circonus.fetch_numeric(api,checkid,metric,start_time,end_time,period)

The above R session is fairly self-explanatory. I'm pulling the "outoctets" metric from check 111 over the two month time period starting on September 1st, 2012 at a granularity of 300 seconds (5 minutes).

Working with data.

This will give me several columns of numerical aggregates that I can explore. Specifically, I get a column that describes the time intervals, and columns corresponding to the number of samples, the average value, and the standard deviation of those samples, as well as the same information for the first order derivatives over that time series data. All of this should look familiar if you have ever created a graph in Circonus, as the same information is available for use there.

> names(data)
[1] "whence"            "count"             "value"            
[4] "derivative"        "counter"           "stddev"           
[7] "derivative_stddev" "counter_stddev"

September and October combined have 61 days. 61 days of 5 minute intervals should result in 17568 data points in each column.

> length(data$value)
[1] 17568

Our values of outoctets are (you guessed it) in octets and I'd like those in bits, so I need to multiply all the values by 8 (in the networking world, octets and bytes are different names for the same thing and there are 8 bits in a byte). The $value column contains byte counts, and we want the first order derivative to see bandwidth, which is the $counter column. Let's now ask a question that would be somewhat tricky via the point-and-click interface of Circonus: "Over the 5 minute samples in question, what was the minimum bandwidth, maximum bandwidth and, while we're at it, the 50th (median), 95th, 99th and 99.9th percentiles?"

> quantile(data$counter * 8, probs=c(0,0.5,0.95,0.99,0.999,1))
        0%        50%        95%        99%      99.9%       100% 
  221682.7  4069960.6  9071063.1 10563452.4 14485084.9 17172582.0 

For those that don't use R, everything statistical is just about this easy... After all, it is a language designed for statisticians crunching data. We can also quickly visualize the same graph in R.

We can see it has the same contour and values as our Circonus graph albeit far less beautiful!

I often pull data into R to perform discrete Fourier transforms (fft) to extract the frequency domain. It can help me programmatically determine if the graph has hourly/daily/weekly trends. That, however, is a bit too complicated to dive into here.

As a Circonus user, plug in some R! Please drop us a note to let us know what awesome things you are doing with your data, and we will work to find a way to cross-pollinate in our Circonus ecosystem.

Fault Detection: New Features and Fixes

One of the trickier problems when detecting faults is detecting the absence of data. Did the check run and not produce data? Did we lose connection and miss the data? The latter problem is where we had lost a bit of insight, which we sought to correct.

The system is down

A loss of connection to the broker happens for one of two reasons. First, the broker itself might be down, the software restarted, machine crashed, etc. Second, there was a loss of connectivity in the network between the broker and the Circonus NOC. Note that for our purposes, a failure in our NOC would look identical to the broker running but having network problems.

Let's start with a broker being down. Since we aren’t receiving any data, it looks to the system like all of the metrics just went absent. In the event that a broker goes down, the customer owning that broker would be inundated with absence alerts.

Back in July, we solved this by adding the ability to set a contact group on a broker. If the broker disconnects, you will get a single alert notifying you that the broker is down. While disconnected, the system automatically puts all metrics on the broker into an internal maintenance mode; when it reconnects, we flip them out of maintenance and then ask for a current state of the world, so anything that is bad will alert. Note that if you do not set a contact group, we have no way to tell you the broker is disconnected, so we fall back to not putting metrics in maintenance and you will get paged about each one as it goes absent. Even though this feature isn’t brand new, it is worth pointing out.

Can you hear me now?

It is important to know a little about how the brokers work. When they restart, all the checks configured on them are scheduled to run within the first minute; after that they follow their normal frequency settings. To this end, when we reestablish connectivity with a broker, we look at its internal uptime monitor; if it is >= 60 seconds, we know all the checks have run and we can again use the data for alerting purposes.

This presented a problem when an outage was caused by a network interruption or a problem in our NOC. Such a network problem happened late one night, and connections to a handful of brokers were lost temporarily. When they came back online, because they had never restarted, the uptime looked good and we immediately started using the data. This poses a problem if we reconnected at the very end of an absence window: a given check might not run again for 1 – 5 minutes, so we would potentially trigger absences, and then recover them when the check ran.

We made two changes to fix this. First, we now have two criteria for a stable / connected broker:

  • Uptime >= 60 seconds
  • Connected to the NOC for >= 60 seconds

Since the majority of the checks run every minute, this meant that we would see the data again before declaring it absent. This, however, doesn’t account for checks with a larger period. To that end, we changed the absence alerting to first check how long the broker has been connected. If it has been connected for less than the absence window length, we push the absence check out to another window in order to first ensure the check would have run. A small change, but one that took a lot of testing and should drastically cut down on false absence alerts due to network problems.
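In rough pseudo-code (a sketch of the rule just described, not the production implementation), the decision now looks something like this:

    // Sketch of the deferral rule described above (not the production implementation).
    function shouldEvaluateAbsence(broker, absenceWindowSeconds, nowSeconds) {
      var connectedFor = nowSeconds - broker.connectedSince;
      if (broker.uptime < 60 || connectedFor < 60) {
        return false;                              // broker not yet considered stable
      }
      // If we have not been connected for a full absence window, the check may simply not
      // have run yet since reconnecting; defer the evaluation to the next window.
      return connectedFor >= absenceWindowSeconds;
    }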

Updates From The Tech Team

Now that it is fall and the conference season is just about over, I thought it would be a good time to give you an update on some items that didn’t make our change log (and some that did), what is coming shortly down the road and just generally what we have been up to.

CEP woes and engineering salvation.

The summer started out with some interesting challenges involving our streaming event processor. When we first started working on Circonus, we decided to go with Esper as a complex event processor to drive fault detection. Esper offers some great benefits and a low barrier of entry to stream processing by placing your events into windows that are analogous to database tables, and then gives you the ability to query them with a language akin to SQL. Our initial setup worked well, and was designed to scale horizontally (federated by account) if needed. Due to demand, we started to act on this horizontal build out in mid-March. However, as more and more events were fed in, we quickly realized that even when giving an entire server to one account, the volume of data could still overload the engine. We worked on our queries, tweaking them to get more performance, but every gain was wiped away with a slight bump in data volume. This came to a head near the end of May when the engine started generating late alerts and alerts with incorrect data. At this point, too much work was put into making Esper work for not enough gain, so we started on a complete overhaul.

The new system was still in Java, but this time we wrote all the processing code ourselves. The improvement was incredible: events that once took 60ms to process now took on the order of 10µs. To validate the system we split the incoming data stream onto the old and new systems and compared the data coming out. The new system, as expected, found alerts faster, and when we saw a discrepancy, the new system was found to be correct. We launched this behind the scenes for the majority of the users on May 31st, and completed the rollout on June 7th. Unless you were one of the two customers affected by the delinquency of the old system, this mammoth amount of work got rolled out right under your nose and you never even noticed; just the way we like it. In the end we collapsed our CEP system from 3 (rather saturated) nodes back to 1 (almost idle) node and have a lot more faith in the new code. Here is some eye candy that shows the CEP processing time in microseconds over the last year. The green, purple and blue lines are the old CEP being split out, and the single remaining green line is the current system.

We tend to look at this data internally on a logarithmic scale to better see the minor changes in usage. Here is the same graph but with a log base 10 y-axis.

Distributed database updates.

Next up were upgrades to our metric storage system. To briefly describe the setup: it is based on Amazon’s Dynamo; we have a ring of nodes, and as data is fed in, we hash the ids and names to find which node it goes on, insert the data, and use a rather clever means to deterministically find subsequent nodes to meet our redundancy requirements. All data is stored at least twice and never on the same node. Theo gave a talk at last year’s Surge conference that is worth checking out for more details. The numeric data is stored in a proprietary format, highly compact, while text data was placed into a Berkeley DB whenever it changed.
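As a toy illustration of that placement idea (not the actual implementation; see Theo’s Surge talk for the real details), you hash the identifying information, pick a starting node on the ring, and walk deterministically until you have the required number of distinct nodes:

    // Toy sketch of Dynamo-style placement: store each metric on two distinct nodes.
    var crypto = require('crypto');

    function placement(nodes, checkId, metricName, copies) {
      var hash = crypto.createHash('sha256').update(checkId + '/' + metricName).digest();
      var start = hash.readUInt32BE(0) % nodes.length;
      var chosen = [];
      for (var i = 0; chosen.length < copies && i < nodes.length; i++) {
        var node = nodes[(start + i) % nodes.length];
        if (chosen.indexOf(node) === -1) chosen.push(node);   // never the same node twice
      }
      return chosen;
    }

    // e.g. placement(['node-a', 'node-b', 'node-c', 'node-d'], '111', 'outoctets', 2)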

The Berkeley DB decision was haunting us. We started to notice potential issues with locking as the data size grew, and the performance and disk usage weren’t quite where we wanted them to be. To solve this we wanted to move to leveldb. The code changes went smoothly, but then a problem arose: how do we get the data from one on-disk format to another?

The storage system was designed from the beginning to allow one node to be destroyed and rebuilt from the others. Of course a lot of systems are like this, but who actually wants to try it with production data? We do. With the safeguards of ZFS snapshotting, over the course of the summer we would destroy a node, bring it up to date with the most recent code, and then have the other nodes rebuild it. Each destroy, rebuild, bring-online cycle took the better part of a work day, and got faster and more reliable after each exercise as we cleaned up some problem areas. During the process, user requests were simply served from the active nodes in the cluster, and outside of a few minor delays in data ingestion, no users were impacted. Doing these “game day” rebuilds has given us a huge confidence boost that should a server go belly up, we can quickly be back to full capacity.

More powerful visualizations.

Histograms were another big addition to our product. I won’t speak much about them here; instead you should head to Theo’s post on them. We’ve been showing these off at various conferences, and have given attendees at this year’s Velocity and Surge insight into the wireless networks with real-time dashboards showing client signal strengths, downloads and uploads, and total clients.

API version 2.

Lastly, we’ve received a lot of feedback on our API: some good, some indifferent, but a lot of requests to make it better, so we did. This rewrite was mostly from the ground up, but we did try to keep a lot of code the same underneath since we knew it worked (some is shared by the web UI and the current API). It more tightly conforms to what one expects from a RESTful API, and for our automation-enabled users, we have added in some idempotence so your consecutive Chef or Puppet runs on the same server won’t create duplicate checks, rules, etc. We are excited about getting this out; stay tuned.

It was a busy summer and we are looking forward to an equally busy fall and winter. We will be giving you more updates, hopefully once a month or so, with more behind the scenes information. Until then keep an eye on the change log.