Monitoring Latency SLOs with Histograms and CAQL

Latency SLOs help us quantify the performance of an API endpoint over a period of time. A typical latency SLO reads as follows:

The proportion of valid* requests served over the last 4 weeks that were slower than 100ms is less than 1%.

*In this context, “valid” means that the request responded with a status code in the 200s.
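To make the arithmetic concrete, here is a minimal Python sketch of checking such an SLO from raw request counts (all numbers are made up for illustration):

```python
# Hypothetical counts for the last 4 weeks, for illustration only.
total_valid = 1_000_000      # valid (2xx) requests served
slow = 8_500                 # of those, requests slower than 100ms

proportion_slow = slow / total_valid
SLO_TARGET = 0.01            # at most 1% of requests may be slow

print(f"{proportion_slow:.2%} slow")   # -> 0.85% slow
print("SLO met" if proportion_slow < SLO_TARGET else "SLO violated")
```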

There are a number of challenges involved with monitoring SLOs such as this one:

  1. The threshold value (100ms) is arbitrary and might be changed in the future.
  2. Latency percentiles cannot be aggregated once they have been pre-computed and stored.
  3. Reporting periods of multiple months require long data retention.

Let’s take a look at how Circonus histograms and Circonus Analytics Query Language (CAQL) let us overcome these challenges and effectively monitor SLOs.

Step 1: Aggregate Latency Data as a Histogram

The first step is to aggregate all of our latency data into a single histogram, displayed as a heatmap. We will assume that the service in question is instrumented to emit histogram metrics. In our example, we have ten database nodes emitting histogram metrics named “latency” and tagged with their service name (service:www) and HTTP status code (status:200, status:500, etc.).

Since we’re only concerned with valid requests, we’ll restrict our search to metrics with tag status:2* (200, 204, etc.). Then we use the CAQL histogram:merge() function to aggregate the histogram metrics captured from all nodes, like so:

SLO histogram

find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()

For a more in-depth look at aggregating latency data, see this blog post.
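Conceptually, merging histograms amounts to summing the per-bucket counts across all nodes. A minimal Python sketch, using hypothetical bucket boundaries rather than the actual Circonus log-linear bucketing:

```python
from collections import Counter

# Per-node latency histograms, mapping a (hypothetical) latency bucket in
# seconds to the number of requests that fell into it.
node_histograms = [
    {0.05: 120, 0.1: 40, 0.2: 5},   # node 1
    {0.05: 200, 0.1: 30},           # node 2
    {0.05: 90,  0.2: 12},           # node 3
]

# Merging = summing counts per bucket across all nodes.
merged = Counter()
for hist in node_histograms:
    merged.update(hist)

print(dict(merged))  # per-bucket totals across all nodes
```

Because no information is discarded in this step, the merged histogram can still answer any percentile or threshold question later, which is what makes histogram storage robust against challenges 1 and 2 above.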

Step 2: Separate Good and Bad Requests

Now that we have a single aggregated histogram of our latency, we can separate the data into good and bad requests. Good requests are those served faster than our latency threshold of 100ms. Bad requests are those served slower than 100ms. For this, we will use the CAQL functions histogram:count_below(), histogram:count_above(), and histogram:count(). These allow us to count good, bad, and total requests, like so:

SLO request rates

find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
| histogram:count_below(0.100)
| label("Good Requests")
find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
| histogram:count_above(0.100) | op:neg()
| label("Bad Requests")

Notice that for visualization purposes, we added op:neg() to the “bad request” statement to render the “bad” line below the x-axis.
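In Python terms, counting good and bad requests against a threshold is a sum over the histogram buckets on either side of it. A sketch, assuming buckets labeled by their lower bound (a simplification of the real bucketing scheme):

```python
# Merged histogram: (hypothetical) latency bucket in seconds -> request count.
merged = {0.050: 410, 0.100: 70, 0.200: 17}
threshold = 0.100  # 100ms SLO threshold

good  = sum(n for bucket, n in merged.items() if bucket < threshold)
bad   = sum(n for bucket, n in merged.items() if bucket >= threshold)
total = sum(merged.values())

print(good, bad, total)  # -> 410 87 497
```

Because the threshold is applied at query time rather than at collection time, changing the 100ms value later (challenge 1) requires only editing the query, not re-instrumenting the service.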

Step 3: Count Requests Over Four Weeks of Time

If we look back at our SLO, we notice that it measures requests “served over the last 4 weeks,” not over the past minute or hour. So for our graph to actually be useful, it needs to reflect that time aggregation. Thankfully, CAQL makes this easy with the rolling:merge() function, which merges the past four weeks of counts into each time value in a rolling manner: each value in the line series reflects the four weeks prior to that moment in time, which is exactly what our SLO examines. We add rolling:merge() immediately after histogram:merge(), like so:

SLO request counts for last four weeks

find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
| rolling:merge(4w, skip=1d)
| histogram:count_below(0.100)
| label("Good Requests / 4 weeks")
find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
| rolling:merge(4w, skip=1d)
| histogram:count_above(0.100)
| op:neg()
| label("Bad Requests / 4 weeks")

The parameter skip=1d is added for performance reasons. Otherwise, when zooming into the graph, we could run into situations where four weeks of data are requested at one-minute resolution. That high resolution could result in time-outs or quota limits due to the large volume of high-granularity data.
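The effect of a rolling merge over count data can be sketched in Python as a sliding-window sum, where each output value aggregates the previous `window` inputs:

```python
from collections import deque

def rolling_sum(values, window):
    """Sliding-window sum: out[i] covers the last `window` values up to i."""
    buf, out = deque(), []
    running = 0
    for v in values:
        buf.append(v)
        running += v
        if len(buf) > window:       # drop the value that fell out of the window
            running -= buf.popleft()
        out.append(running)
    return out

# With window=3, each output reflects the 3 most recent inputs.
print(rolling_sum([10, 20, 30, 40, 50], window=3))
# -> [10, 30, 60, 90, 120]
```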

Step 4: Calculate the Proportion of Bad Requests

We’ve aggregated and separated the data. We’re now ready for the final step. Instead of simply counting the requests, we want to show the proportion of bad requests to total requests. We do this by replacing the histogram:count_above() function with the CAQL histogram:ratio_above() function, like so:

SLO request proportion over last four weeks

find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
| rolling:merge(4w, skip=1d)
| histogram:ratio_above(0.100)
| label("Proportion of Bad Requests / 4 weeks")

Note that we have removed the “good requests” line and made our primary line series green. The red line series is the original count, just included for reference. We also added a horizontal guide to the graph at 1%, since that is our SLO threshold.

Now this graph can be used effectively for monitoring our latency SLO:

  • If the green line is below the black guide, the SLO is met.
  • If the green line is above the black guide, we are in violation of the SLO.

Alternative: Running Monthly Counts

If your SLO involves monitoring the proportion of bad requests on a monthly basis, the above calculation can be modified to output running counts based on calendar months. We need a few ingredients for this:

  • time:tz("UTC", "month") will output the current month number in UTC time.
  • integrate:while() will integrate the subsequent input slots while the first slot is constant.

We add those functions to our CAQL statement to count bad requests made in the current calendar month like so:

SLO running request count

integrate:while(prefill=4w){ time:tz("UTC", "month"),
   find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
   | histogram:count_above(0.100)
}
| label("Bad Requests in calendar month")
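The behavior of integrate:while() keyed on the month number can be sketched in Python as a running sum that resets whenever the UTC calendar month changes (the sample data is hypothetical):

```python
from datetime import datetime, timezone

def monthly_running_count(samples):
    """samples: list of (utc_timestamp, bad_request_count) in time order.
    Returns the running total, reset at each calendar-month boundary."""
    out, running, current_month = [], 0, None
    for ts, count in samples:
        month = (ts.year, ts.month)
        if month != current_month:   # month rolled over: reset the integral
            running, current_month = 0, month
        running += count
        out.append(running)
    return out

samples = [
    (datetime(2023, 1, 30, tzinfo=timezone.utc), 5),
    (datetime(2023, 1, 31, tzinfo=timezone.utc), 3),
    (datetime(2023, 2, 1,  tzinfo=timezone.utc), 2),  # new month -> reset
    (datetime(2023, 2, 2,  tzinfo=timezone.utc), 4),
]
print(monthly_running_count(samples))  # -> [5, 8, 2, 6]
```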

The prefill parameter to integrate:while() starts the integration a given time before the selected view range. Since our SLO examines the requests served over four weeks, we can normalize the values by the total request count over the previous 4 weeks. This gives an estimate of our error budget, like so:

SLO running proportion

integrate:while(prefill=4w){ time:tz("UTC", "month"),
   find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
   | histogram:count_above(0.100)
}
/
(
   find:histogram("latency", "and(service:www,status:2*)") | histogram:merge()
   | rolling:merge(4w, skip=1d)
   | histogram:count()
   | op:prod(60) | op:prod(VIEW_PERIOD)
)
| label("Latency Budget spent this month")

The transformation | op:prod(60) | op:prod(VIEW_PERIOD) is needed to convert between different aggregation modes used by integrate, rolling:merge(), and histogram:count(). The need for this might go away in the future.
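Stripped of the unit conversion, the budget calculation is simply the running count of bad requests this month divided by the total request count over the last 4 weeks. A sketch with hypothetical numbers:

```python
# Hypothetical values, for illustration only.
bad_this_month = 4_200    # running count of slow requests this calendar month
total_4w = 1_000_000      # total valid requests over the last 4 weeks

budget_spent = bad_this_month / total_4w   # proportion of bad requests
SLO_TARGET = 0.01                          # the 1% latency budget

print(f"{budget_spent / SLO_TARGET:.0%} of the latency budget spent")
# -> 42% of the latency budget spent
```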

For the final version, we add a guide at the 1% SLO target, and the current count of bad requests for reference.

SLO running request with 1% guide and bad request count

The purple area indicates how much of the latency budget is spent within this month already. It must stay under the black line in order for the Latency SLO to be met in the current calendar month. The red area indicates how many slow requests are produced at that moment.

Summary

Combining Circonus histogram data collection with CAQL data analysis makes it easy to perform complex tasks like monitoring your SLOs. No need to pre-calculate anything or worry about erroneous data aggregations: collect the data as a Circonus histogram and you can choose how to analyze your data later on.

For more in-depth treatments of latency SLOs, check out this blog post and this talk.