Percentile Aggregation with Histograms and CAQL

Percentiles are commonly used for measuring statistics, particularly when analyzing things like latency. Unfortunately, people frequently get tripped up when they want to take multiple percentiles and aggregate them.

For example, let’s say we are monitoring a set of ten web servers and we want to collect latency statistics across all of them. A common way to do that is to calculate percentiles for each of the nodes with some sort of monitoring agent or instrumentation library, and then store those calculated percentiles. If we want to see global latency percentiles, we have to somehow aggregate the percentile metrics from all ten servers. Unfortunately, calculating accurate global percentiles from pre-calculated percentile values is famously impossible. Once we have converted our raw data to percentiles, there is no meaningful way to aggregate percentiles any further.
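
To make the problem concrete, here is a small Python sketch (with made-up synthetic latencies) showing how averaging ten per-server p99 values diverges from the true p99 of the pooled raw data:

```python
import random
import statistics

random.seed(42)

# Synthetic per-request latencies (seconds) for ten servers; one server
# is markedly slower than the rest, which is exactly what breaks the
# naive averaging of pre-calculated percentiles.
servers = [[random.expovariate(100 if i < 9 else 10) for _ in range(1000)]
           for i in range(10)]

def p99(samples):
    """99th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    return ordered[int(0.99 * len(ordered)) - 1]

# Naive: average the ten pre-calculated per-server p99 values.
naive_p99 = statistics.mean(p99(s) for s in servers)

# Correct: pool all raw samples, then take the percentile once.
global_p99 = p99([x for server in servers for x in server])

print(naive_p99, global_p99)  # the two values disagree badly
```

The naive average badly understates the global p99 because the slow server dominates the global tail, yet contributes only one tenth of the average.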

At Circonus, we have long advocated histogram metrics as a solution to this problem. The basic idea is that instead of pre-calculating percentiles and storing those calculations, the agent publishes histogram summaries of the raw data. Those histograms can then be freely aggregated, and they contain enough information to calculate accurate percentiles at display time instead of pre-calculating them (this method has a typical error of < 0.1% and a maximum error of < 5%).

In this article, we will assume that latency data is collected as histograms in Circonus.

Step 1: Select Metrics for Aggregation

The best way to select the metrics we want to aggregate is to use Circonus’ Metrics Explorer. It allows us to fine-tune our search query and double-check that it returns the metrics that we want to aggregate. In our example, we end up using the following search query to select our latency metrics from a REST endpoint called nnt_get on an IRONdb cluster identified by an IP range:

mtev`*`rest_nnt_get_asynch`latency and(__check_target:10.128.0.*)

Notice that this search query returns a total of ten histogram metrics.

Step 2: Create a CAQL Statement

Next we will use the power of CAQL to aggregate the histograms and calculate percentiles from them. To start, we need to create a new graph and add a CAQL Datapoint as explained in the Getting Started guide. Then we convert the search query into a CAQL find() statement like so:

find("mtev`*`rest_nnt_get_asynch`latency", "and(__check_target:10.128.0.*)")

Note that we split the metric search query into two parts:

  1. mtev`*`rest_nnt_get_asynch`latency – name search pattern
  2. and(__check_target:10.128.0.*) – tag query

In the Metrics Explorer those parts are simply space-separated, but the CAQL find() function expects those parts to be passed as separate string arguments. Unfortunately, when you enter this query into the graph datapoint, you will find that it doesn’t work as expected; it returns a blank graph. This is because the find() function is for pulling numeric data, not histogram data. To pull histogram data we need to instead use the find:histogram() function, like so:

find:histogram("mtev`*`rest_nnt_get_asynch`latency", "and(__check_target:10.128.0.*)")

This CAQL query is closer: at least we’re seeing histograms in the graph. But it does not look quite as we expected; the histogram heatmap looks “smeared” across the horizontal time axis.

The reason for this is that the data was collected as a time-cumulative histogram. We need to account for this by changing our data function again, to use find:histogram_cum(). The resulting graph looks like this (all of the selected histogram metrics are indicated by different colors):

Hint: Use your mouse wheel or touchpad scrolling to set the y-axis scale; it allows you to easily “zoom in” or “zoom out” when viewing histograms.
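
To illustrate why the _cum variant matters: a time-cumulative histogram reports, at each sample time, bucket counts covering everything observed since the agent started. Here is a rough Python sketch of the idea with hypothetical snapshots (the bucket labels and counts are invented; Circonus’ actual bucket boundaries differ):

```python
from collections import Counter

# Two hypothetical cumulative snapshots taken one minute apart,
# mapping bucket label -> count of all observations since agent start.
snapshot_t0 = Counter({"0.000100": 50, "0.001000": 5})
snapshot_t1 = Counter({"0.000100": 80, "0.001000": 9, "0.010000": 1})

# The histogram for just that one-minute window is the difference of
# consecutive snapshots -- the correction find:histogram_cum() applies.
window = snapshot_t1 - snapshot_t0

print(window)  # Counter({'0.000100': 30, '0.001000': 4, '0.010000': 1})
```

Without this differencing, every sample repeats all earlier observations, which is why the heatmap appears smeared along the time axis.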

Step 3: Aggregate Histogram Data

Now that we have our histograms being found and rendered correctly, we need to aggregate them. For this we will use the histogram:merge() function to merge the ten histograms into a single histogram.

find:histogram_cum("mtev`*`rest_nnt_get_asynch`latency", "and(__check_target:10.128.0.*)")
| histogram:merge()
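
Unlike percentiles, histograms merge losslessly: merging is just bucket-wise addition of counts. A minimal Python sketch of what histogram:merge() does conceptually (the buckets and counts here are invented):

```python
from collections import Counter

# Hypothetical per-server latency histograms,
# mapping bucket value (seconds) -> sample count.
server_a = Counter({0.0001: 900, 0.001: 90, 0.01: 10})
server_b = Counter({0.0001: 500, 0.001: 450, 0.01: 50})

# Bucket-wise addition -- no information is lost, so percentiles
# computed from the merge are as accurate as the buckets allow.
merged = server_a + server_b

print(merged[0.001])  # 540
```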

Step 4: Calculate Percentiles

Our histograms have been found and aggregated, so now we can calculate our percentiles based on that aggregated data. For this we use the histogram:percentile() function. We are interested in the commonly used percentiles p50, p90, p99, and p99.9, so we simply specify them as parameters to the histogram:percentile() function like so:

find:histogram_cum("mtev`*`rest_nnt_get_asynch`latency", "and(__check_target:10.128.0.*)")
| histogram:merge()
| histogram:percentile(50,90,99,99.9)
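
Conceptually, computing a percentile from a histogram is a walk over the cumulative bucket counts until the desired rank is reached. A simplified Python sketch follows (Circonus’ actual implementation uses much finer log-linear buckets, which is where the small error bounds quoted earlier come from):

```python
def histogram_percentile(buckets, p):
    """Approximate the p-th percentile from (bucket_value, count) pairs,
    the way it can be done at display time from stored histograms."""
    total = sum(count for _, count in buckets)
    rank = p / 100.0 * total
    seen = 0
    for value, count in sorted(buckets):
        seen += count
        if seen >= rank:
            return value  # report the bucket containing this rank
    return sorted(buckets)[-1][0]

# Hypothetical merged latency histogram: (bucket value in seconds, count).
merged = [(0.0001, 1400), (0.001, 540), (0.01, 60)]

print(histogram_percentile(merged, 50))  # 0.0001
print(histogram_percentile(merged, 99))  # 0.01
```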

We can now see that instead of a heatmap, we have four numeric line series being rendered. If we flip the graph into view mode, we then have a legend where we can read the percentile values for individual one-minute time windows:

Step 5: Tune Aggregation Periods

In many cases we are not interested in percentiles calculated over one-minute time windows, but those calculated over longer time periods like days or weeks. Let’s flip the graph back to edit mode and update the CAQL statement, changing the aggregation window with the window:merge() function.

To calculate percentiles over one-hour windows instead of one-minute windows, we would insert the window:merge() function near the end of the CAQL query, like so:

find:histogram_cum("mtev`*`rest_nnt_get_asynch`latency", "and(__check_target:10.128.0.*)")
| histogram:merge()
| window:merge(1h)
| histogram:percentile(50,90,99,99.9)
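
The effect of window:merge(1h) can be sketched in Python as collapsing sixty consecutive one-minute histograms into a single one-hour histogram (the data below is synthetic):

```python
from collections import Counter

def window_merge(minute_histograms, window=60):
    """Collapse consecutive per-minute histograms into per-window
    histograms, as window:merge(1h) does for sixty one-minute samples."""
    out = []
    for i in range(0, len(minute_histograms), window):
        merged = Counter()
        for h in minute_histograms[i:i + window]:
            merged += h
        out.append(merged)
    return out

# Three hours of hypothetical per-minute histograms
# (bucket value in seconds -> count).
minutes = [Counter({0.0001: 10, 0.001: 1}) for _ in range(180)]
hours = window_merge(minutes)

print(len(hours))  # 3
```

Because the merge happens before histogram:percentile(), the hourly percentiles are computed from all of the hour’s raw bucket counts, not from per-minute percentiles.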

Now if we again flip to view mode, we can see in the legend that we have accurate percentile values for one-hour intervals:

Hint: Add a Legend Formula when editing the CAQL datapoint to specify the precision of the values displayed in the legend. Here we used a legend formula of =format("%.6f",VAL).

Hint: The window:merge() function also supports a skip parameter. For example, window:merge(1d, skip=1h) would compute one-day aggregations that are advanced every hour.

Hint: The window:merge() function also supports an offset parameter. For example, window:merge(1d, offset="US/Eastern") will compute aggregations across days in the US/Eastern time zone.
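
The skip parameter turns the fixed window into a sliding one. Here is a hedged Python sketch of what window:merge(1d, skip=1h) does over hypothetical hourly histograms (invented data, simplified semantics):

```python
from collections import Counter

def sliding_window_merge(histograms, window, skip):
    """Sketch of window:merge(window, skip=skip): each output merges
    `window` consecutive inputs, and successive outputs advance by `skip`."""
    return [sum(histograms[i:i + window], Counter())
            for i in range(0, len(histograms) - window + 1, skip)]

# 48 hypothetical one-hour histograms (two days of data); a one-day
# window advanced every hour yields 25 overlapping aggregations.
hours = [Counter({0.001: 1}) for _ in range(48)]
days = sliding_window_merge(hours, window=24, skip=1)

print(len(days))  # 25
```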

So in that last example graph, we can see our final calculations. Here’s a simplified version of the legend shown; we’re getting the following percentile values for latency data aggregated over one hour windows:

Global Percentile    1h Value (seconds)
p50                  0.000141
p90                  0.003898
p99                  0.00615
p99.9                1.25

The Power of CAQL

The preceding example illustrates the power of histogram data storage combined with CAQL for retrieving and displaying that data. There’s no need to predetermine which percentiles we calculate; no need to lose the ability to accurately aggregate our data by pre-calculating the percentiles. CAQL lets us decide how to calculate our percentiles later, when we want to actually view the data.