Circonus Instrumentation Packs

In our Circonus Labs public github repo, we have started a project called Circonus Instrumentation Packs, or CIP. This is a series of libraries to make it even easier to submit telemetry data from your application.

Currently there are CIP directories for Go, Java, and Node.js. Each language directory has useful resources to help instrument applications written in that language.

Some languages have a strong leaning toward frameworks, while others are about patterns, and still others are about tooling. These packs are intended to “meld in” with the common way of doing things in each language, so that developer comfort is high and integration time and effort are minimal.

Each of these examples utilizes the HTTP Trap check, which you can create within Circonus. Simply create a new JSON push (HTTPTrap) check in Circonus using the HTTPTRAP broker, and the CheckID, UUID, and secret will then be available on the check details page.

CheckID / UUID / Secret example

This can be done via the user interface or via the API. The “target” for the check does not need to be an actual hostname or IP address; the name of your service might be a good substitute.

We suggest that you use a different trap for different node.js apps, as well as for production, staging, and testing.
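Roughly speaking, what these libraries do under the hood is push a small JSON document over HTTPS (a PUT or POST) to the check's submission URL, which is derived from the UUID and secret shown on the check details page. Here is a minimal, illustrative sketch of such a push in Node.js; TRAP_URL and the metric name example`latency are placeholders, and the payload uses the { _type: "n", _value: [...] } numeric form described later in this post:

// Minimal sketch: submit one batch of numeric samples to an HTTPTrap check.
// TRAP_URL is a placeholder -- use the submission URL from your check details page.
var https = require('https');
var url = require('url');

var TRAP_URL = 'https://trap.example.com/module/httptrap/UUID/SECRET';

function submit(payload) {
  var opts = url.parse(TRAP_URL);
  opts.method = 'PUT';
  opts.headers = { 'Content-Type': 'application/json' };
  var req = https.request(opts, function (res) {
    console.log('trap responded with HTTP', res.statusCode);
  });
  req.on('error', function (err) { console.error('submit failed:', err); });
  req.write(JSON.stringify(payload));
  req.end();
}

// One histogram-style metric with two latency samples (milliseconds).
submit({ 'example`latency': { _type: 'n', _value: [12.7, 14.1] } });

In practice, the language-specific libraries below handle metric naming, batching, and error handling for you.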

Below is a bit more detail on each of the currently available CIPs:

Java

Java has a very popular instrumentation library called “metrics,” originally written by Coda Hale and later adopted by Dropwizard. Metrics has some great ideas that we support wholeheartedly; in particular, the use of histograms for more insightful reporting. Unfortunately, the way these measurements are captured and reported makes calculating service level agreements and other such analytics impossible. Furthermore, the implementations of the underlying histograms (Reservoirs in metrics terminology) are opaque to the reporting tools. The Circonus metrics support in this CIP is designed to layer (non-disruptively) on top of the Dropwizard metrics packages.

Go

This library supports named counters, gauges, and histograms. It also provides convenience wrappers for registering latency instrumented functions with Go’s built-in http server.

Initializing only requires that you set the AuthToken (which you generate on your API Tokens page) and the CheckId, and then “Start” the metrics reporter.

You’ll need two github repos:

Here is the sample code (also found in the circonus-gometrics readme):

package main

import (
    "fmt"
    "net/http"

    metrics "github.com/circonus-gometrics"
)

func main() {
    // Get your Auth token at https://login.circonus.com/user/tokens
    metrics.WithAuthToken("cee5d8ec-aac7-cf9d-bfc4-990e7ceeb774")
    // Get your CheckId on the check details page
    metrics.WithCheckId(163063)
    metrics.Start()

    http.HandleFunc("/", metrics.TrackHTTPLatency("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
    }))
    http.ListenAndServe(":8080", http.DefaultServeMux)
}

After you start the app (go run the_file_name.go), load http://localhost:8080 in your browser, or curl http://localhost:8080. You’ll need to approve access for the API Token (if it is the first time you have used it), and then you can create a graph (make sure you are collecting histogram data) and you’ll see something like this:

go-httptrap-histogram-example

Node.js

This instrumentation pack is designed to allow node.js applications to easily report telemetry data to Circonus using the UUID and Secret (instead of an API Token and CheckID). It has special support for providing sample-free (100% sampling) collection of service latencies for submission, visualization, and alerting to Circonus.

Here is a basic example to measure latency:

First, some setup – creating the app:

% mkdir restify-circonus-example
% cd restify-circonus-example
% npm init .

(Accepting the defaults from npm init works fine.) Then:

% npm install --save restify
% npm install --save probability-distributions
% npm install --save circonus-cip

Next, edit index.js and include:

var restify = require('restify'),
    PD = require("probability-distributions"),
    circonus_cip = require('circonus-cip')

var circonus_uuid = '33e894e6-5b94-4569-b91b-14bda9c650b1'
var circonus_secret = 'ssssssssh_its_oh_so_quiet'

var server = restify.createServer()
server.on('after', circonus_cip.restify(circonus_uuid, circonus_secret))

server.get('/', function (req, res, next) {
  setTimeout(function() {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    //res.write("Hello to a new world of understanding.\n");
    res.end("Hello to a new world of understanding.\n");
    next();
  }, PD.rgamma(1, 3, 2) * 200);
})

server.listen(8888)

Now just start up the app:

node index.js

Then go to your browser and load localhost:8888, or, at the prompt, curl http://localhost:8888.

Then create the graph in your account. Make sure to enable collection of the metric (“… httptrap: restify `GET `/ `latency…”) as a histogram, and you’ll end up with a graph like this:

The Restify Histogram graph

Show Me the Data

Avoid Spike Erosion with Percentile and Histogram Aggregation

It has become common wisdom that the lossy process of averaging measurements leads to all kinds of problems when measuring the performance of services (see Schlossnagle2015, Ugurlu2013, Schwartz2015, Gregg2014). Yet most people are not aware that averages crop up in many more places than old-fashioned SLA formulations and the storage backends of monitoring tools. In fact, it is likely that most graphs you are viewing involve some averaging behind the scenes, which introduces severe side effects. In this post, we describe a phenomenon called spike erosion and highlight some alternative views that give you a more accurate picture of your data.

Meet Spike Erosion

Spike Erosion of Request Rates

Take a look at Figure 1. It shows a graph of request rates over the last month. The spike near December 23 marks the apparent maximum, at around 7 requests per second (rps).

Figure 1: Web request rate in requests per second over one month time window

What if I told you the actual maximum request rate was almost double that value, at 13.67 rps (marked with the horizontal guide)? And moreover, it was not reached on December 23, but on December 15 at 16:44, near the left boundary of the graph?

Looks way off, right?

But it’s actually true! Figure 2 shows the same graph zoomed in on that time window.

Figure 2: Web request rates (in rps) over a 4h period

We call this phenomenon spike erosion: the farther you zoom out, the lower the spikes. It is very common in all kinds of graphs across all monitoring products.

Let’s see another example.

Spike Erosion of Ping Latencies

Take a look at Figure 3. It shows a graph of ping latencies (from twitter.com) over the course of 4 weeks. Again, it looks like the latency is rather stable around 0.015ms, with occasional spikes above 0.02ms and a clear maximum around December 23 with a value of roughly 0.03ms.

Figure 3: Ping latency of twitter.com in ms over the last month

Again, we have marked the actual maximum with a horizontal guide line. It is more than double the apparent maximum and is attained at every one of the visible spikes. That’s right: all of the spikes in fact have the same maximal height. Figure 4 shows a closeup of the one on December 30, in the center.

Figure 4: Ping latency of twitter.com in ms on December 30

What’s going on?

The mathematical explanation of spike erosion is actually pretty simple. It is an artifact of an averaging process that happens behind the scenes, in order to produce sensible plots with high performance.

Note that within a four-week period we collect a total of 40,320 one-minute samples that need to be represented in a plot over that time window. Figure 5 shows how a plot of all those samples looks in gnuplot. There are quite a few issues with this raw presentation.

Figure 5: Plot of the raw data of request rates over a month

First, there is a ton of visual noise in that image. In fact, you cannot even see the individual 40,000 samples for the simple reason that the image is only 1240 pixels wide.

Also, rendering such an image within a browser puts a lot of load on the CPU. The biggest issue with producing such an image is the latency involved with retrieving 40K float values from the db and transmitting them as JSON over the internet.

In order to address the above issues, all mainstream graphing tools pre-aggregate the data before sending it to the browser. The size of the graph determines the number of values to display, e.g. 500. The raw data is then distributed across 500 bins; for each bin the average is taken and displayed in the plot.

This process leads to plots like Figure 1, which (a) can be produced much faster, since less data has to be transferred and rendered (in fact, you can cache the pre-aggregated values to speed up retrieval from the db), and (b) are less visually cluttered. However, it also leads to (c) spike erosion!

When looking at a four-week time window, the raw number of 40,320 samples is reduced to a mere 448 plotted values, where each plotted value corresponds to an average over a 90-minute period. If there is a single spike in one of the bins, it gets averaged together with roughly 90 other samples of lower value, which erodes the spike height.
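To see the arithmetic at work, here is a small illustrative sketch (in Node.js, with made-up numbers) of the kind of bin averaging a graphing backend performs; a single 13.67 rps spike among 89 samples near 7 rps shrinks to roughly 7.07 in the plotted value:

// Illustrative sketch: averaging 90 one-minute samples into a single
// plotted value erodes a lone spike (numbers are made up).
function mean(xs) {
  return xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
}

// 89 "normal" samples around 7 rps plus one 13.67 rps spike.
var bin = [];
for (var i = 0; i < 89; i++) bin.push(7);
bin.push(13.67);

console.log('max of the raw samples  :', Math.max.apply(null, bin)); // 13.67
console.log('plotted (averaged) value:', mean(bin).toFixed(2));      // ~7.07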

What to do about it?

There are (at least) two ways to avoid spike erosion and get more insight into your data. Both change the way the data is aggregated.

Min-Max Aggregation

The first way is to show the minimum and the maximum values of each bin along with the mean value. By doing so, you get a sense of the full range of the data, including the highest spikes. Figures 6 and 7 show how Min-Max Aggregation looks in Circonus for the request rate and latency examples.

Figure 6: Request rate graph with Min-Max Aggregation Overlay
Figure 7: Latencies with Min/Max-Aggregation Overlay

In both cases, the points where the maximum values are attained are clearly visible in the graph. When zooming into the spikes, the Max aggregation values stay aligned with the global maximum.

Keeping in mind that minimum and maximum are special cases of percentiles (namely the 0%-percentile and 100%-percentile), it appears natural to extend the aggregation methods to allow general quantiles as well. This is what we implemented in Circonus with the Percentile Aggregation overlay.
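To make this concrete, here is a small illustrative sketch (not the actual Circonus implementation) of per-bin aggregation that reports the min, mean, max, and an arbitrary percentile; using the same made-up bin as before, the max and high percentiles keep the spike visible while the mean does not:

// Sketch of per-bin Min/Mean/Max (and general percentile) aggregation.
function aggregateBin(samples, p) {
  var sorted = samples.slice().sort(function (a, b) { return a - b; });
  var idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return {
    min: sorted[0],
    mean: sorted.reduce(function (a, b) { return a + b; }, 0) / sorted.length,
    max: sorted[sorted.length - 1],
    percentile: sorted[idx] // p-th percentile (nearest-rank style)
  };
}

// The same bin as in the previous sketch: 89 samples at 7 rps plus one spike.
var bin = [];
for (var i = 0; i < 89; i++) bin.push(7);
bin.push(13.67);

console.log(aggregateBin(bin, 99));
// { min: 7, mean: ~7.07, max: 13.67, percentile: 13.67 }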

Histogram Aggregation

There is another, structurally different approach to mitigate spike erosion. It begins with the observation that histograms have a natural aggregation logic: Just add the bucket counts. More concretely, a histogram metric that stores data for each minute can be aggregated to larger time windows (e.g. 90 minutes) without applying any summary statistic, like a mean value, simply by adding the counts for each histogram bin across the aggregation time window.

If we combine this observation with the simple fact that a time-series metric can be considered a histogram with a single value per sample period, we arrive at the powerful Histogram Aggregation, which rolls up time series into histogram metrics with lower time resolution. Figures 8 and 9 show Histogram Aggregation Overlays for the Request Rate and Latency examples discussed above.
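A minimal sketch of that roll-up logic (illustrative only, with histograms represented as plain objects mapping bin labels to counts and made-up numbers): aggregating histograms across a time window is nothing more than summing the counts per bin.

// Sketch: merging per-minute histograms into one larger-window histogram by
// simply adding bucket counts (bins are plain { "bin_label": count } maps).
function mergeHistograms(histograms) {
  return histograms.reduce(function (acc, h) {
    Object.keys(h).forEach(function (bin) {
      acc[bin] = (acc[bin] || 0) + h[bin];
    });
    return acc;
  }, {});
}

// Three per-minute histograms of request rates...
var minutes = [
  { '5': 20, '7': 38, '10': 2 },
  { '5': 18, '7': 40 },
  { '7': 41, '13': 1 }   // ...one of them containing the spike.
];

console.log(mergeHistograms(minutes));
// { '5': 38, '7': 119, '10': 2, '13': 1 } -- the spike's bin survives intact.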

Figure 8: Request Rates with Histogram Aggregation Overlay
Figure 9: Latencies with Histogram Aggregation Overlay

In addition to showing the value range (which in the above figures is amplified by using the Min-Max Aggregation Overlay), we also gain a sense of the distribution of values within each bin. In the request rate example, the data varies widely across a corridor of 2.5-10 rps. In the latency example, the distribution is concentrated near the global median of 0.015ms, with single-value outliers.

Going Further

We have seen that displaying data as histograms gives a more concise picture of what is going on. Circonus allows you to go one step further and collect your data as histograms in the first place. This allows you to capture the latencies of all requests made to your API, instead of only probing your API once per minute. See [G.Schlossnagle2015] for an in-depth discussion of the pros and cons of this “passive monitoring” approach. Note that you can still compute averages and percentiles for viewing and alerting.

Figure 10: API Latency Histogram Metric with Average Overlay

Figure 10 shows a histogram metric of API latencies, together with the mean value computed as an overlay. While this figure looks quite similar to Figures 8 and 9, the logical dependency is reversed. The mean values are computed from the histogram, not the other way around. Also, note that the time window of this figure only spans a few hours, instead of four weeks. This shows how much richer the captured histogram data is.
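To illustrate the direction of that computation, here is a small sketch (illustrative only, using a simplified histogram representation of value/count pairs rather than any real storage format) of deriving a mean and an approximate percentile from histogram data:

// Sketch: computing an average and an approximate percentile from
// histogram data (each bin holds a representative value and a count).
function histMean(bins) {
  var sum = 0, count = 0;
  bins.forEach(function (b) { sum += b.value * b.count; count += b.count; });
  return sum / count;
}

function histPercentile(bins, p) {
  var sorted = bins.slice().sort(function (a, b) { return a.value - b.value; });
  var total = sorted.reduce(function (a, b) { return a + b.count; }, 0);
  var target = (p / 100) * total;
  var seen = 0;
  for (var i = 0; i < sorted.length; i++) {
    seen += sorted[i].count;
    if (seen >= target) return sorted[i].value;
  }
  return sorted[sorted.length - 1].value;
}

// Made-up API latencies in ms: mostly fast, with a slow tail.
var latencyHist = [
  { value: 20, count: 900 },
  { value: 50, count: 80 },
  { value: 250, count: 20 }
];
console.log('mean:', histMean(latencyHist).toFixed(1), 'ms'); // 27.0 ms
console.log('p99 :', histPercentile(latencyHist, 99), 'ms');  // 250 ms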

Resources

  1. Theo Schlossnagle – Problem with Math
  2. Dogan Ugurlu (Optimizely) – The Most Misleading Measure of Response Time: Average
  3. Baron Schwartz – Why percentiles don’t work the way you think
  4. Brendan Gregg – Frequency Tails: What the mean really means
  5. George Schlossnagle – API Performance Monitoring

Web Application Stat Collection with Node.js and Circonus

Collecting and carefully monitoring statistics is an essential practice for catching and debugging issues early, as well as minimizing degraded performance and downtime. When running a web service, two of the most fundamental and useful stats are request/response rates (throughput) and response times (latency). This includes both interactions between a user and the service as well as between the backend components. Fortunately, they’re also two of the most straightforward metrics to aggregate and track.

The rates are perhaps the easiest to track, as they are not much more than incrementing counters. In essence, one increments a counter each time a request is made and each time a response is sent. Individual requests and responses are unrelated as far as counts are concerned. In a multi-process or clustered environment, a tool such as Redis, with its atomic increments, can be used for global aggregation. Once aggregated, it can be checked and monitored by Circonus, with rates being derived from count changes over time.
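As an illustrative sketch (not the StatCollector module described below), the counting side can be as simple as local per-URI counters that are periodically flushed to Redis hashes with atomic HINCRBY calls, which a Circonus Redis-type check can then read. The hash names here follow the 'foo_' prefix example used later; everything else is assumed.

// Sketch: local request/response counters flushed to Redis hashes with
// atomic increments (uses the classic node_redis callback API).
var redis = require('redis');
var client = redis.createClient({ host: 'localhost', port: 6379 });

var requestCounts = {};   // per-URI request counts since the last flush
var responseCounts = {};  // per-URI response counts since the last flush

function countRequest(uri)  { requestCounts[uri]  = (requestCounts[uri]  || 0) + 1; }
function countResponse(uri) { responseCounts[uri] = (responseCounts[uri] || 0) + 1; }

// Every 30s, add the local counts to the shared hashes and reset them.
setInterval(function () {
  Object.keys(requestCounts).forEach(function (uri) {
    client.hincrby('foo_requestTotal', uri, requestCounts[uri]);
  });
  Object.keys(responseCounts).forEach(function (uri) {
    client.hincrby('foo_responseTotal', uri, responseCounts[uri]);
  });
  requestCounts = {};
  responseCounts = {};
}, 30000);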

Response times are a bit trickier. Time needs to be tracked between a specific request being made and the response being sent for that request. Each time needs to be tracked for use beyond simple averages, as histograms and heatmaps often tell a very different story. Sending these values to Circonus via an HTTP trap will provide that level of detail. Times should be collected locally and sent to Circonus in batches to prevent an unnecessarily large number of calls being made to store the data.

Circonus line plot of latency

The above image is a line graph of the approximate average latency in ms over time for a particular URI. According to that data, the average latency is more or less consistent, just below 1k. That conclusion, however, does not accurately describe the system. With a heatmap overlay, this becomes clear.

Circonus histogram plot of latency with line overlay

The picture painted here is quite different from the one painted by the simple line graph. The first key element revealed is that the latency is multimodal, with high frequency around both 1k and 0.5k. The second is that there is a comparatively small, though non-negligible, frequency well in excess of 1k. Perhaps these two features of the data are expected, but they may also indicate an issue with the system. This kind of insight is often essential to the identification and remedy of problems.

The process of collecting all of this information will become clearer with an example. The following implementation is based on a Node.js web API service using the Express.js framework, but the principles apply equally well to other frameworks and to any internal queries or flows. A module has been created to provide this sample functionality and can be installed with:

npm install stat-collector

A simple app utilizing the module to collect both stat types:

    var express = require('express');
    var StatCollector = require('stat-collector');

    var statCollector = new StatCollector();
    // Global debug logging (default false)
    statCollector.setDebug(true);

    var queryCounterConfig = {
        // Configuration options for node-redis
        redisConfig: {
            host: 'localhost',
            port: 1234,
            options: {}
        },
        // Prefix for stat hashes in Redis (e.g. 'foo_' for foo_requestTotal)
        statPrefix: 'foo_',
        // Interval (ms) in which to update Redis (default 30000)
        pushInterval: 30000,
        // Debug logging (default false)
        debug: false
    };

    var responseTimerConfig = {
        // Circonus HTTP Trap check config
        circonusConfig: {
            host: 'foo',
            port: 1234,
            path: '/'
        },
        // Interval (ms) in which to push to Circonus (default 500)
        pushInterval: 500,
        // Debug logging for this module (default false)
        debug: false
    };

    statCollector.configureQueryCounter(queryCounterConfig);
    statCollector.configureResponseTimer(responseTimerConfig);

    var app = express();

    app.configure( function(){
        app.use(statCollector.middleware());
        app.use(app.router);
    });

    app.get('/foo',function(req,res){
        res.jsend('success',{'foo':'bar'});
    });

    app.listen(9000);

Only the components of StatCollector that are configured will be enabled. The QueryCounter stats contain counter sets for requests and responses, as well as for each response status, with a counter for each URI. It establishes a connection to Redis, and a method for pushing data to Redis is set on an interval. The ResponseTimer stats contain one object of the following form for each URI:

{ _type: "n", _value: [] }

A method to push this data to Circonus is also set on an interval.

The middleware takes care of three things: enhancing the response object, registering the URI with the enabled components, and telling the enabled components to track the request/response. In the basic preparation, res is given a method jsend to be used instead of res.send, standardizing the JSON output structure and saving the response status (“success”, “fail”, “error”) to res.jsendStatusSent. The res.end method, which is called automatically after the response is handled, is wrapped so that it will emit an end event. The URI registration adds URI-specific counters and objects to the QueryCounter and ResponseTimer stats, respectively, if they don’t already exist. The request and response are then fed through a tracking method that records statistics both at the time of the request and after the response, on the end event. This is just a detailed way of saying: when a request completes, note how long it took and what kind of response it was.
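To make the wrapping concrete, here is a stripped-down sketch (not the actual StatCollector internals; timingMiddleware and onComplete are names invented for this example) of middleware that notes the start time and hooks res.end to record how long the request took:

// Sketch: Express-style middleware that times each request by wrapping
// res.end. onComplete is whatever records the stat, e.g. pushing the
// duration into a { _type: "n", _value: [] } object for the URI.
function timingMiddleware(onComplete) {
  return function (req, res, next) {
    var start = Date.now();
    var originalEnd = res.end;
    res.end = function () {
      var result = originalEnd.apply(res, arguments);
      onComplete(req.path || req.url, Date.now() - start);
      return result;
    };
    next();
  };
}

// Usage: app.use(timingMiddleware(function (uri, ms) { /* record ms */ }));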

Circonus plot of some data

The general flow for request and response count statistics is Node.js processes to Redis to Circonus via Redis-type checks. On a request, the total and URI-specific request counters are incremented. When res.end fires, the corresponding URI-specific counters are incremented in QueryCounter, with res.jsendStatusResult determining the appropriate response type counter. Periodically, these stats are used to increment a set of Redis hashes and are then reset. In Redis, the request, response, and each response type are separate hashes, with ‘total’ and specific URIs as fields, and the corresponding counters as values. A set of Redis-type checks is set up in Circonus to run an hgetall query on each of those hashes. Since these are monotonically increasing values, when graphing rates, the “counter” display type should be used to show the first-order derivative (rate of change). When graphing total values, gauge (raw value) is appropriate.

For response times, data goes from Node.js processes to Circonus via an HTTP Trap. On the res.end event, the overall response time is calculated as the difference between the time saved when the request came in and the current time. The difference is pushed into the _value array of the URI-specific object. Periodically, the entire set of stat objects is converted to JSON, pushed to a Circonus HTTP trap, and then reset. For an optimal histogram live view, this push is done at sub-second intervals. In Circonus, the metrics must be enabled twice for histograms; the first time enables collection, and the second time enables histogram support via the icon that resembles three boxes. Both are collected, distinguished by their icons. Graphing the standard metric yields the normal line/area graph, while graphing the histogram metric yields the histogram/heatmap.

Request and response rates and response times are so valuable and so straightforward to track that there’s absolutely no reason not to track them.

Brian Bickerton is a Web Engineer at OmniTI, working on applications where scalability and uptime are critical, using a variety of tools and technologies.

Understanding Data with Histograms

For the last several years, I’ve been speaking about the lies that graphs tell us. We all spend time looking at data, commonly through line graphs, that actually show us averages. A great example of this is showing average response times for API requests.


The above graph shows the average response time for calls made to an HTTP REST endpoint. Each pixel in this line graph is the average of thousands of samples. Each of these samples represents a real user of the API. Thousands of users distilled down to a single value sounds ideal until you realize that you have no idea what the distribution of the samples looks like. Basically, this graph only serves to mislead you. Having been misled for years by these graphs with little recourse, we decided to do something about it and give Circonus users more insight into their data.

Each of these pixels is the average of many samples. If we were to take those samples and put them in a histogram, it would provide dramatically improved insight into the underlying data. But a histogram is a visually bulky representation of data, and we have a lot of data to show (over time, no less). When I say visually bulky what do I mean? A histogram takes up space on the screen and since we have a histogram of data for each period of time and hundreds of periods of time in the time series we’d like to visualize… well, I can’t very well show you hundreds of histograms at once and expect you to be able to make any sense of them; or can I?

Enter heat maps. Heat maps are a way of displaying histograms using color saturation instead of bar heights. So heat maps remove the “bulkiness” and provide sufficient visual density of information, but the rub is that people have trouble grasping them at first sight. Once you look at them for a while, they start to make sense. The question we faced is: how do we tie it all together and make it more accessible? The journey started for us about six months ago, and we’ve arrived at a place that I find truly enlightening.

Instead of a tutorial on histograms, I think throwing you into the interface is far more constructive.


The above graph provides a very deep, rich understanding of the same data that powered the first line graph. This graph shows all of the API response times for the exact same service over the same time period.

In my first (#1) point of interest, I am hovering the pointer over a specific bit of data. This happens to be August 31st at 8pm. I’ll note that not only does our horizontal position matter (affecting time), but my vertical position indicates the actual service times. I’m hovering between 23 and 24 on the y-axis (23-24 milliseconds). The legend shows me that there were 1383 API calls made at that time and 96 of them took between 23 and 24 milliseconds. Highlighted at #3, I also have some invaluable information about where these samples sit in our overall distribution: these 96 samples constitute only 7% of our dataset, 61% of the samples are less than 23ms and the remaining 32% are greater than or equal to 24ms. If I move the pointer up and down, I can see this all dynamically change on-screen. Wow.

As if that wasn’t enough, a pop-up histogram of the data from the time interval over which I’m hovering is available (#2) that shows me the precise distribution of samples. This histogram changes as I move my pointer horizontally to investigate different points in time.

Now that I’ve better prepared you for the onslaught of data, poke around a live interactive visualization of a histogram with similar data.

With these visualizations at my disposal, I am now able to ask more intelligent questions about how our systems behave and how our business reacts to that. All of these tools are available to Circonus users, and you should be throwing every piece of data you have at Circonus… just be prepared to have your eyes opened.