Web Application Stat Collection with Node.js and Circonus

Collecting and carefully monitoring statistics is an essential practice for catching and debugging issues early, as well as minimizing degraded performance and downtime. When running a web service, two of the most fundamental and useful stats are request/response rates (throughput) and response times (latency). This includes both interactions between a user and the service as well as between the backend components. Fortunately, they’re also two of the most straightforward metrics to aggregate and track.

The rates are perhaps the easiest to track, as they are not much more than incrementing counters. In essence, one increments a counter each time a request is made and each time a response is sent. Individual requests and responses are unrelated as far as counts are concerned. In a multi-process or clustered environment, a tool such as Redis, with its atomic increments, can be used for global aggregation. Once aggregated, it can be checked and monitored by Circonus, with rates being derived from count changes over time.
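
In code, this amounts to little more than a map of counters keyed by URI. A minimal sketch (the names here are illustrative, not part of any particular library):

    // In-process counters, keyed by URI; 'total' covers the service as a whole.
    var requestCounts = { total: 0 };
    var responseCounts = { total: 0 };

    function countRequest(uri) {
        requestCounts.total++;
        requestCounts[uri] = (requestCounts[uri] || 0) + 1;
    }

    function countResponse(uri) {
        responseCounts.total++;
        responseCounts[uri] = (responseCounts[uri] || 0) + 1;
    }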

Response times are a bit trickier. Time needs to be tracked between a specific request being made and the response being sent for that request. Each time needs to be tracked for use beyond simple averages, as histograms and heatmaps often tell a very different story. Sending these values to Circonus via an HTTP trap will provide that level of detail. Times should be collected locally before being sent to Circonus, to avoid making an unnecessarily large number of calls to store the data.
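
A minimal sketch of that local collection, assuming the request’s start time is captured when the request arrives (again, the names are illustrative):

    // Buffer individual response times per URI in memory; they are flushed
    // to Circonus periodically rather than on every request.
    var responseTimes = {};

    function recordTime(uri, startMs) {
        var elapsed = Date.now() - startMs;
        (responseTimes[uri] = responseTimes[uri] || []).push(elapsed);
    }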

Circonus line plot of latency

The above image is a line graph of the approximate average latency in ms over time for a particular URI. According to that data, average latency is more or less consistent, just below 1k. That conclusion, however, does not accurately describe the system. This becomes clear with a heatmap overlay.

Circonus histogram plot of latency with line overlay

The picture painted here is quite different than with the simple line graph. The first key element revealed is that the latency is multimodal, with high frequency around both 1k and 0.5k. The second is that there is a comparatively small, though non-negligible, frequency well in excess of 1k. Perhaps these two features of the data are expected, but they may also indicate an issue with the system. This kind of insight is often essential to identifying and remedying problems.

The process of collecting all of this information will become clearer with an example. The following implementation is based on a Node.js web API service using the Express.js framework, but the principles apply equally well to any internal queries or flows. A module has been created to provide this sample functionality and can be installed with:

npm install stat-collector

A simple app utilizing the module to collect both stat types:

    var express = require('express');
    var StatCollector = require('stat-collector');

    var statCollector = new StatCollector();
    // Global debug logging (default false)
    statCollector.setDebug(true);

    var queryCounterConfig = {
        // Configuration options for node-redis
        redisConfig: {
            host: 'localhost',
            port: 1234,
            options: {}
        },
        // Prefix for stat hashes in Redis (e.g. 'foo_' for foo_requestTotal)
        statPrefix: 'foo_',
        // Interval (ms) in which to update Redis (default 30000)
        pushInterval: 30000,
        // Debug logging (default false)
        debug: false
    };

    var responseTimerConfig = {
        // Circonus HTTP Trap check config
        circonusConfig: {
            host: 'foo',
            port: 1234,
            path: '/'
        },
        // Interval (ms) in which to push to Circonus (default 500)
        pushInterval: 500,
        // Debug logging for this module (default false)
        debug: false
    };

    statCollector.configureQueryCounter(queryCounterConfig);
    statCollector.configureResponseTimer(responseTimerConfig);

    var app = express();

    app.configure( function(){
        app.use(statCollector.middleware());
        app.use(app.router);
    });

    app.get('/foo',function(req,res){
        res.jsend('success',{'foo':'bar'});
    });

    app.listen(9000);

Only the components of StatCollector that are configured will be enabled. The QueryCounter stats contain counter sets for requests and responses, as well as for each response status, with counters for each URI. It establishes a connection to Redis, and a method for pushing data to Redis is set on an interval. The ResponseTimer stats contain objects of the following form, one for each URI:

{ _type: "n", _value: [] }

A method to push this data to Circonus is also set on an interval.

The middleware takes care of three things: enhancing the response object, registering the URI with the enabled components, and telling the enabled components to track the request/response. In the basic preparation, res is given a method jsend to be used instead of res.send, standardizing the JSON output structure and saving the response status (“success”, “fail”, “error”) to res.jsendStatusSent. The res.end method, which is called automatically after the response is handled, is wrapped so that it will emit an end event. The URI registration adds URI-specific counters and objects to the QueryCounter and ResponseTimer stats, respectively, if they don’t already exist. The request and response are then fed through a tracking method that will record statistics both at the time of the request and after a response, on the end event. This is just a detailed way of saying: when a request completes, note how long it took and what kind of response it was.
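
Conceptually, the middleware does something along these lines. This is a simplified sketch rather than the module’s actual source, reusing the illustrative countRequest, countResponse, and recordTime helpers from above:

    function trackingMiddleware() {
        return function (req, res, next) {
            var startMs = Date.now();
            countRequest(req.path);

            // Wrap res.end so we know exactly when the response has been handled.
            // (The real module emits an end event here and listens for it.)
            var originalEnd = res.end;
            res.end = function () {
                originalEnd.apply(res, arguments);
                // Per-status counters would key off res.jsendStatusSent here.
                countResponse(req.path);
                recordTime(req.path, startMs);
            };

            next();
        };
    }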

Circonus plot of some data

The general flow for request and response count statistics is from the Node.js processes to Redis to Circonus via Redis-type checks. On a request, the total and URI-specific request counters are incremented. When res.end fires, the URI-specific counters are incremented in QueryCounter, with res.jsendStatusSent determining the appropriate response type counter. Periodically, these stats are used to increment a set of Redis hashes and are then reset. In Redis, the request, response, and each response type are separate hashes, with ‘total’ and specific URIs as fields, and the corresponding counters as values. A set of Redis-type checks is set up in Circonus to run an hgetall query on each of those hashes. Since these are monotonically increasing values, the “counter” display type should be used when graphing rates, to show the first order derivative (rate of change). When graphing total values, gauge (raw value) is appropriate.
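
The push to Redis can be sketched roughly as follows, using the node-redis client; the hash names and interval are illustrative and mirror the statPrefix idea from the configuration above:

    var redis = require('redis');
    var client = redis.createClient(1234, 'localhost');

    // Every push interval, fold the local counters into shared Redis hashes
    // (HINCRBY is atomic, so multiple processes can push safely) and reset them.
    setInterval(function () {
        Object.keys(requestCounts).forEach(function (uri) {
            client.hincrby('foo_request', uri, requestCounts[uri]);
            requestCounts[uri] = 0;
        });
        Object.keys(responseCounts).forEach(function (uri) {
            client.hincrby('foo_response', uri, responseCounts[uri]);
            responseCounts[uri] = 0;
        });
    }, 30000);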

For response times, data goes from the Node.js processes to Circonus via an HTTP trap. On the res.end event, the overall response time is calculated as the difference between the time saved when the request came in and the current time. The difference is pushed into the _value array of the URI-specific object. Periodically, the entire set of stat objects is converted to JSON, pushed to a Circonus HTTP trap, and then reset. For an optimal histogram live view, this push is done at sub-second intervals. In Circonus, the metrics must be enabled twice for histograms; the first time enables collection, and the second time enables histogram support via the icon that resembles three boxes. Both are collected, distinguished by their icons. Graphing the standard metric will yield the normal line/area graph, while graphing the histogram metric will yield the histogram/heatmap.
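
The push itself is just the JSON document sent to the trap’s URL. Roughly, and reusing the placeholder host, port, and path from circonusConfig above in place of a real HTTP trap check URL:

    var http = require('http');

    // Periodically wrap the buffered times in Circonus histogram objects,
    // send them to the HTTP trap, and reset the local buffers.
    setInterval(function () {
        var stats = {};
        Object.keys(responseTimes).forEach(function (uri) {
            stats[uri] = { _type: 'n', _value: responseTimes[uri] };
            responseTimes[uri] = [];
        });

        var req = http.request({
            host: 'foo',   // placeholder values, as in circonusConfig above
            port: 1234,
            path: '/',
            method: 'PUT',
            headers: { 'Content-Type': 'application/json' }
        });
        req.on('error', function (err) { console.error(err); });
        req.end(JSON.stringify(stats));
    }, 500);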

Request & response rates and response times are so valuable and straightforward to track that there’s absolutely no reason not to.

Brian Bickerton is a Web Engineer at OmniTI, working on applications where scalability and uptime are critical, using a variety of tools and technologies.

Interacting with Circonus through your text editor – circonusvi

I’m a big fan of command line tools for many tasks. There are just some tasks that can be done quicker and easier with a little typing rather than pointing and clicking. A little while ago, I discovered a gem of a command line tool called ldapvi. Without going into too much detail, this tool lets you run a query against an LDAP server where the results show up in your favorite text editor. The magic part is that when you edit that file and save it, whatever changes you made are then pushed back to the server. This method of working can be extremely flexible, as you have all of the power of your text editor at your fingertips to make both small and sweeping changes quickly and easily.

When the new Circonus API was released, I realized that it was now possible to do the same thing for Circonus with relative ease. It would have been impossible to do with the old API, but the new one was designed with consistency in mind and a simple (dare I say the word REST) interface. The result is circonusvi. Circonusvi essentially allows you to interact directly with the Circonus API via your text editor, and it works as follows:

  • You decide what you want to work with: check bundles, rule sets, worksheets and so on, and pick that as your endpoint.
  • Circonusvi queries the API and lists all resources, which are returned as JSON.
  • Optionally, circonusvi filters this output to only show the resources you want.
  • This JSON (slightly modified) is shown in your text editor, where you can make changes (edit checks, remove checks, add new checks) as needed.
  • Circonusvi identifies what changes were made, and pushes them back to the server.

Here’s a simple example, taken from an actual situation we came across: needing to switch contact groups for a series of rules. We had some rules which were a severity 2 (our convention is that sev2 alerts are mail only), and we realized that we really should have them be sev1 (in other words, wake up the on call SA). To make this change in the web interface, you need to go into each rule individually, change the criteria from sev2 to sev1, and change the contact group associated with sev1 alerts. This is time consuming and error prone, but at the same time, it’s a one-off task that you probably don’t want to write a script for.

First, we identify the contact groups in circonusvi:

./circonusvi.py -a example_com -e contact_group

{
    # Mail only
    "/contact_group/123": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "sa@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "Mail only",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    },
    # On call
    "/contact_group/124": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "oncall@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "On call",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    }
}

This shows two contact groups, with the ID of the mail only contact group being /contact_group/123, and the on call contact group being /contact_group/124.

Next, we need to do the actual editing:

./circonusvi.py -a example_com -e rule_set 'metric_name=^foo$'

Here we specified a filter on the rules to match only those rules that apply to a metric named ‘foo’. The part before the equals sign is the key you filter on. You can specify any key in the results for the filter, as long as it has a simple value (i.e. a string or number). The right hand side of the filter is a regular expression. Any entry that matches the regular expression will be displayed. A small section of the results is shown below:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [],
            "2": [
                # Mail only
                "/contact_group/123"
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "2",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

There are two things we want to change here. First, the severity when the rule fires. This is a simple search/replace in your text editor. I’m a vim fan, so:

:%s/"severity": "2"/"severity": "1"/

Next, we need to make sure that severity 1 alerts go to the on call group. They don’t currently:

:%s/"1": [],/"1": [ "/contact_group/124" ],/

If you want, you can remove the sev2 contact group by deleting that section:

:g/"/contact_group/123"/d

The ‘comment’ line above can be left alone, or deleted too with a similar command.

Once you are finished, your file will look something like this:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [ "/contact_group/124" ],
            "2": [
                # Mail only
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "1",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

Save and quit, and you’ll be presented with a prompt asking you to confirm your changes and giving you a chance to preview them before submitting:

0 additions, 0 deletions, 85 edits
Do you want to proceed? (YyNnEeSs?)

I would recommend previewing exactly what has changed before submitting. Pressing S will show a unified diff of what changed. Once you proceed, circonusvi will apply the changes via the API, and you’re all done!

This method of interacting with the API is most useful when making edits or deletions in bulk (if you delete an entry in the text editor, circonusvi will delete it from the API). Additions are also possible, but are more cumbersome, as you have to create the JSON precisely from scratch. For bulk additions, there are other scripts available, such as the circus scripts. However, using your editor to make bulk changes is a very powerful way of interacting with an application. Give it a try and see how well it works for you.

Mark Harrison leads the Site Reliability Engineering team at OmniTI where he manages people and networks, systems and complex architectures for some of today’s largest Internet properties.

PagerDuty Integration Improvements and Alert Formats

Recently we got burned by ignoring a page because the actual message we received lacked detail; it looked like an alert that was known to clear itself. At 3am it is hard to bring yourself to get out of bed when you have seen this alert page and then clear time and time again, so it was forgotten. Four hours later the alert was spotted by another admin and resolved, and an analysis was done to determine how this happened.

We determined the root cause to be the aforementioned lack of detail. When Circonus would send an alert to PagerDuty, it would do so in our “long format”, which is the alert format you get when you receive email notifications (more on this and the “short format” later). PagerDuty then truncates this message to fit a standard 160 character SMS. This truncation of detail led to a lot of alerts looking like each other, and some that were more critical were assumed to be of lesser importance and ignored.

Improvements to PagerDuty Messages

To solve this, we just pushed out a change to include both the short format and the long format in a PagerDuty message. The short format is what we use for SMS alerts, and it is now the first line of the message. When the truncation happens on their end, you should receive as much detail as possible about the alert. This does lead to redundant information in the message body in their UI and email alerts, but we feel it is for the better.

Secondly, we are providing details about the alerts to PagerDuty’s API. These details currently are:

  • account
  • host
  • check_name
  • metric_name
  • severity
  • alert_value
  • client_url

These details are useful if you are pulling alerts from the PagerDuty API; instead of parsing the message, you will receive a JSON object with these keys and their associated values.
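
For example, an incident’s details might look something like the following; the values are purely illustrative, borrowed from the sample alert later in this post:

{
    "account": "Circonus Testing",
    "host": "development.internal",
    "check_name": "Test Check",
    "metric_name": "cpu_used",
    "severity": "2",
    "alert_value": "89.65",
    "client_url": "https://..."
}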

How Circonus Alerts are Formatted

As mentioned before, Circonus has two alert formats. A long format which is used for email, XMPP, AIM and PagerDuty alerts, and a short format which is used for SMS, Twitter and now PagerDuty.

The short format is intended to compress as much detail about the alert as possible while remaining readable and useful. An example of this type of alert:

[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)

I’ll break this alert up into its various sections to describe it:

  • [Circonus Testing] is the name of the account
  • A = Alert, 2 = Severity 2, 1 = The number of sev 2 alerts. The “A” here could also be R for Recovery
  • development.internal is the hostname or IP this alert was triggered on
  • “Test Check” is the name of the check bundle in the Circonus UI
  • cpu_used is our metric name and (89.65) is the value that triggered the alert

The long format is more self-explanatory, since we have many more characters to work with.

Account: Circonus Testing

ALERT Severity 2
Check: Test Check
Host: development.internal
Metric: cpu_used (89.65)
Agent: Ashburn, VA, US
Occurred: Tue, 8 Jan 2013 2:25:53

This is the same alert as above, so breaking it apart we have:

  • Account name
  • Alert of Severity 2; this could also be RECOVERY. The alert count is missing because in the long format we will list out each alert separately.
  • The check name
  • The host / IP
  • The metric name and alert value
  • The broker / agent that the alert was triggered from
  • The time that the alert was triggered; if this is a recovery, you will also have a cleared time.
  • The Circonus URL to view the alert in the UI

In the future we intend to allow the alert formats to be customized for each contact group, or use these current formats as the default.

Thanks to redundancy built into Circonus, our users were never impacted by the outage that precipitated this change, but if it can happen to us it will happen to others, so we hope these minor changes bring improvements to your response times.