Web Application Stat Collection with Node.js and Circonus

Collecting and carefully monitoring statistics is an essential practice for catching and debugging issues early, as well as minimizing degraded performance and downtime. When running a web service, two of the most fundamental and useful stats are request/response rates (throughput) and response times (latency). This includes both interactions between a user and the service as well as between the backend components. Fortunately, they?re also two of the most straightforward metrics to aggregate and track.

The rates are perhaps the easiest to track, as they are not much more than incrementing counters. In essence, one increments a counter each time a request is made and each time a response is sent. Individual requests and responses are unrelated as far as counts are concerned. In a multi-process or clustered environment, a tool such as Redis, with its atomic increments, can be used for global aggregation. Once aggregated, it can be checked and monitored by Circonus, with rates being derived from count changes over time.

Response times are a bit trickier. Time needs to be tracked between a specific request being made and the response being sent for that request. Each time needs to be tracked for use beyond beyond simple averages, as histograms and heatmaps often tell a very different story. Sending these values to Circonus via an HTTP trap will provide that level of detail. Local collection of times before communication to Circonus should be used to prevent an unnecessarily large number of calls being made to store the data.

The above image is a line graph of the approximate average latency in ms over time for a particular URI. According to that data, average latency is more or less consistent, just below 1k. That conclusion, however, is not entirely accurate to describe the system. With a heatmap overlay, this becomes clear.

Circonus histogram plot of latency with line overlay

The picture painted here is quite different than with the simple line graph. The first key element revealed is that the latency is multimodal, with high frequency around both 1k and 0.5k. The second is that there is a comparatively small, though non-negligible, frequency well in excess of 1k. Perhaps these two features of the data are expected, but may indicate an issue with the system. This kind of insight is often essential to the identification and remedy of problems.

The process to collecting all of this information will become clearer with an example. The following implementation is based on a Node.js web API service using the Express.js framework, but the principles are applicable for any circumstance apply equally as well to any internal queries or flows. A module has been created to provide this sample functionality and can be installed with:

npm install stat-collector

A simple app utilizing the module to collect both stat types:

    var express = require('express');
    var StatCollector = require('StatCollector');

    var statCollector = new StatCollector();
    // Global debug logging (default false)
    statCollector.setDebug(true);

    var queryCounterConfig = {
        // Configuration options for node-redis
        redisConfig: {
            host: 'localhost',
            port: 1234,
            options: {}
        },
        // Prefix for stat hashes in Redis (e.g. 'foo_' for foo_requestTotal)
        statPrefix: 'foo_',
        // Interval (ms) in which to update Redis (default 30000)
        pushInterval: 30000,
        // Debug logging (default false)
        debug: false
    };

    var responseTimerConfig = {
        // Circonus HTTP Trap check config
        circonusConfig: {
            host: 'foo',
            port: 1234,
            path: '/'
        },
        // Interval (ms) in which to push to Circonus (default 500)
        pushInterval: 500,
        // Debug loging for this module (default false)
        debug: false
    };

    statCollector.configureQueryCounter(queryCounterConfig);
    statCollector.configureResponseTimer(responseTimerConfig);

    var app = express();

    app.configure( function(){
        app.use(statCollector.middleware());
        app.use(app.router);
    });

    app.get('/foo',function(req,res){
        res.jsend('success',{'foo':'bar'});
    });

    app.listen(9000);

Only the components of StatCollector that are configured will be enabled. The QueryCounter stats contain counter sets for request and response, as well as for each response status, with a counter counters for each URIs. It establishes a connection to Redis and a method for pushing data to Redis is set set on an interval. The ResponseTimer stats is contains objects of the form, one for each URI:

{ _type: "n", _value: [] }

A method to push this data to Circonus is also set on an interval.

The middleware takes care of three things: enhance the response object, register the URI with enabled components, and tell the enabled components to track the request/response. In the basic preparation, res is given a method jsend to be used instead of res.send, standardizing the json output structure and saving the response status (“success”, “fail”, “error”) to res.jsendStatusSent. The res.end method, which is called automatically after the response is handled, is wrapped so that it will emit an end event. The URI registration adds URI-specific counters and objects to QueryCounter and ResponseTimer stats, respectively, if they don’t already exist. The request and response are then fed through a tracking method that will record statistics both at the time of request and after a response, on the end event. This is just a detailed way of saying: ?when a request completes, note how long it took and what kind of response it was.

The general flow for request and response count statistics is Node.js processes to Redis to Circonus via Redis-type checks. On a request, the total request and URI-specific request counters are incremented. When res.end fires, the set URI-specific counters are incremented in QueryCounter, with res.jsendStatusResult determining the appropriate response type counter. Periodically, these stats are used to increment a set Redis hashes and then reset. In Redis, the request, response, and each response type are separate hashes, with ‘total’ and specific URIs as fields, and the corresponding counters as values. A set of Redis-type checks are set up in Circonus to run an hgetall query on each of those hashes. Since these are monotonically increasing values, when graphing when graphing rates, the ?counter? display type should be used to show the first order derivative (rate of change). When graphing total values, gauge (raw value) is appropriate.

For response times, data goes from Node.js processes to Circonus via an HTTP Trap. On the res.end event, the overall response time is calculated as the difference between the time saved when the req came in and the current time. The difference is pushed into the _value array of the URI-specific object. Periodically, the entire set of stat objects is converted to JSON and pushed to a Circonus HTTP trap, being subsequently reset. For optimal histogram live view, this push is done at sub-second intervals. In Circonus, the metrics must be enabled twice for histograms; the first time will enable collection, the second time will enable histogram support via the icon that resembles three boxes. Both are collected, distinguished by the icons. Graphing the standard metric will yield the normal line/area graph while graphing the histogram metric will yield the histogram/heatmap.

Request & response rates and response times are so valuable and straightforward to track, there’s absolutely no reason not to.

Brian Bickerton is a Web Engineer at OmniTI, working on applications where scalability and uptime are critical, using a variety of tools and technologies.