Latency SLOs Done Right

In their excellent SLO-workshop at SRECon2018 (programLiz Fong-Jones, Kristina Bennett and Stephen Thorne (Google) presented some best practice examples for Latency SLI/SLOs. At Circonus we care deeply about measuring latency and SRE techniques such as SLI/SLOs. For example, we recently produced A Guide to Service Level Objectives. As we will explain here, Latency SLOs are particularly delicate to implement and benefit from having Histogram-data available to understand distributions and adjust SLO targets.

Latency SLOs

Here the key definitions regarding Latency SLOs from the Workshop Handout (pdf).

Latency SLI

The suggested specification for a request/response Latency SLI is:

The proportion of valid requests served faster than a threshold.

Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and that threshold marks the difference between requests that are fast enough and those that are not?

Latency SLO

99% of home page requests in the past 28 days served in < 100ms.

Latency Metrics

Latency is typically measured with percentile metrics like these, which were presented for a similar use case:

Given this data, what can we say about the SLO?

What is the p90 computed over the full 28 days?

It’s very tempting to take the average of the p90 metric which is displayed in the graph, which would be just below the 500ms mark.
It’s important to note, and it was correctly pointed out during the session, that this is not generally true. There is no mathematical way determine the 28 day-percentile from the series of 1h(?)-percentiles that are shown in the above graphs (reddit, blog, math). You need to look at different metrics if you want to implement a latency SLO. In this post we will discuss three different methods how to do this correctly.

Latency metrics in the wild

In the example above the error of averaging percentiles might not actually be that dramatic. The system seems to be very well-behaved with a high constant load. In this situation the average p90/1h is typically close to the total p90/28 days.
Let’s take a look at another API, from a less loaded system. This API handles very few requests between 2:00am and 4:00am:

What’s the true p90 over the 6h drawn on the graph? Is it above or below 30ms?
The average p90/1M (36.28ms) looks far less appealing than before.

Computing Latency SLOs

So how to better compute latency? There are three ways to go about this:

  1. Compute the SLO from stored raw data (logs)
  2. Count the number of bad requests in a separate metric
  3. Use histograms to store latency distribution.

Method 1: Using Raw/Log data

Storing access logs with latency data gives you accurate results. The drawback with this approach is that you must keep your logs over long periods of time (28 hrefdays) which can be costly.

Method 2: Counting bad requests

For the first case, instrument the application to count the number of requests that violated the threshold. The resulting metrics will look like this:

Percent good = 100 - (2262/60124)*100 = 96.238%
Percent good = 100 – (2262/60124)*100 = 96.238%

Using this metrics we see that 96% of our requests over the past 6h were faster than 30ms. Our SLO stated, that 90% of the requests should be good, so that objective was met.

The drawback of this approach is that you have to choose the latency threshold upfront. There is no way to calculate the percentage of requests that were faster than, say, 200ms from the recorded data.

If your SLO changes, you will need to change the executable or the service configuration to count requests above a different threshold.

Method 3: Using Histograms

The third practical option you have for computing accurate SLOs is storing your request latency data as histograms. The advantages of storing latency data as histograms are:

  1. Histograms can be freely aggregated across time.
  2. Histograms can be used to derive approximations of arbitrary percentiles.

For (1) to be true it’s critical that your histograms have common bin choices. It’s usually a good idea to mandate the bin boundaries for your whole organization, otherwise you will not be able to aggregate histograms from different services.
For (2), it’s critical that you have enough bins in the latency range that are relevant for your percentiles. Sparsely encoded log linear histograms allow you to cover a large floating point range (E.g. 10^-127 .. 10^128) with a fixed relative precision (E.g. 5%). In this way you can guarantee 5% accuracy on all percentiles, no matter how the data is distributed.

Two popular implementations of log linear histograms are:

Circonus comes with native support for Circllhist and is used for this example.
Histogram metrics store latency information per minute, and are commonly visualized as heatmaps:

Merging those 360x1M-histograms shown above into a single 6h-Histogram, we arrive at the following graph:

This is the true latency distribution over the full SLO reporting period of 6h, in this example.
At the time of this writing, there is no nice UI option to overlay percentiles in the above histogram graph. As we will see, you can perform the SLO calculation with CAQL or Python.

SLO Reporting via CAQL

We can use the CAQL functions histogram:rolling(6h) and histogram:percentile() to aggregate histograms over the last 6h and compute percentiles over the aggregated histograms. The SLO value we are looking for will be the very last value displayed on the graph.

SLO Reporting using Python

Using the Python API the calculation could look as follows:

# 1. Fetch Histogram Data
t = 1528171020 # exact start time of the graph 
N = 364        # exact number of minutes on the above graph
circ = circonusdata.CirconusData(config["demo"])
data = circ.caql('search:metric:histogram("api`GET`/getState")', t, 60, N)

# 2. Merge Histograms
H=Circllhist()
for h in data['output[0]']: H.merge(h)
# Let's check the fetched data is consistent with Histogram in the UI
circllhist_plot(H)
# 3. Calculate Aggregated Percentiles:
P = [50, 90, 95, 99, 99.9] # arbitrary percentiles
for p in P: print("{:>8}-latency percentile over 6h: {:>8.3f}ms".format(p, H.quantile(p/100)))
      50-latency percentile over 6h:   13.507ms
      90-latency percentile over 6h:   21.065ms
      95-latency percentile over 6h:   27.796ms
      99-latency percentile over 6h:   56.058ms
    99.9-latency percentile over 6h:  918.760ms

In particular we see that the true p90 is around 21ms, which is far away from the average p90 of 36.28ms we computed earlier.

# 4. Calculate Aggregated Counts:
Y = [10, 30, 50, 100, 200] # arbitrary thresholds
for y in Y: print("{:>10.3f} percent faster than {}ms".format(H.count_below(y)/H.count()*100,y))
    18.465 percent faster than 10ms
    96.238 percent faster than 30ms
    98.859 percent faster than 50ms
    99.484 percent faster than 100ms
    99.649 percent faster than 200ms

In particular we replicate the “96.238% below 30ms” result, that we calculated using the counter metrics before.

Conclusion

It’s important to understand that percentile metrics do not allow you to implement accurate Service Level Objectives that are formulated against hours or weeks. Aggregating 1M-percentiles seems tempting, but can produce materially wrong results — especially if your load is highly volatile.
The most practical way to calculate correct SLO percentiles are counters and histograms. Histograms give you additional flexibility to choose the latency threshold after the fact. This comes in particularly handy when you are still evaluating your service, and are not ready to commit yourself to a latency threshold just yet.

Effective Management of High Volume Numeric Data with Histograms

How do you capture and organize billions of measurements per second such that you can answer a rich set of queries effectively (percentiles, counts below X, aggregations across streams), and you don’t blow through your AWS budget in minutes?

To effectively manage billions of data points, your system has to be both performant and scalable. How do you accomplish that? Not only do your algorithms have to be on point, but your implementation of them has to be efficient. You want to avoid allocating memory where possible, avoid copying data (pass pointers around instead), avoid locks, and avoid waits. Lots of little optimizations that add up to being able to run your code as close to the metal as possible.

You also need to be able to scale your data structures. They need to be as size efficient as possible, which means using strongly typed languages with the optimum choice of data types. We’ve found that histograms are the most efficient data structure for storing the data types we care about at scale.

What is a histogram?

A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.

This histogram diagram shows a skewed histogram where the mode is near the minimum value q(0). The Y axis is the number of samples (or sample density), and the X axis shows the sample value. On this histogram we can see that the median is slightly left of the midpoint between the highest value and the lowest value. The mode is at a low sample value, so the median is below the mean, or average value. The 90th percentile is also called q(0.9), and is where 90 percent of the sample values are below it.

This might look like the 2nd generation display in KITT from Knight Rider, but this is a heatmap. A heatmap is essentially a series of histograms over time. This heatmap represents web service request latency. Imagine each column in the heatmap is a bar graph (like the previous histogram) viewed “from above,” the parts that are red are where the sample density is the highest. So we can see that most of the latencies tend to concentrate around 500 nanoseconds. We can overlay quantiles onto this visualization, we’ll cover that in a bit.

Types of Histograms

There are five types of histograms:

  • Fixed Bucket – require the user to specify the bucket or bin boundaries.
  • Approximate – use approximations of values.
  • Linear – have one bin at even intervals, such as one bin per integer.
  • Log Linear – have bins at logarithmically increasing intervals.
  • Cumulative – each successive bin contains the sum of the counts of previous bins.

Fixed Bucket Histograms

Fixed bucket histograms require the user to specify the bin boundaries.

Traits of fixed bucket histograms:

  • Histogram bin sizes can be fine tuned for known data sets to achieve increased precision.
  • Cannot be merged with other types of histograms because the bin boundaries are likely uncommon.
  • Less experienced users will likely pick suboptimal bin sizes.
  • If you change your bin sizes, you can’t do calculations across older configurations.

Approximate Histograms

Approximate histograms such as the t-digest histogram (created by Ted Dunning) use approximations of values, such as this example above which displays centroids. The number of samples on each side of the centroid is the same.

Traits of approximate histograms:

  • Space efficient.
  • High accuracy at extreme percentiles (95%, 99%+).
  • Worst case errors ~10% at the median with small sample sizes.
  • Can be merged with other t-digest histograms.

Linear Histograms

Linear histograms have one bin at even intervals, such as one bin per integer. Because the bins are all evenly sized, this type of histogram uses a large number of bins.

Traits of linear histograms:

  • Accuracy dependent on data distribution and bin size.
  • Low accuracy at fractional sample values (though this indicates improperly sized bins).
  • Inefficient bin footprint at higher sample values.

Log Linear Histograms

Log Linear histograms have bins at logarithmically increasing intervals.

Traits of log linear histograms:

  • High accuracy at all sample values.
  • Fits all ranges of data well.
  • Worst case bin error ~5%, but only with absurdly low sample density.
  • Bins are often subdivided into even slices for increased accuracy.
  • HDR histograms (high dynamic range) are a type of log linear histograms.

Cumulative Histograms

Cumulative histograms are different from other types of histograms in that each successive bin contains the sum of the counts of previous bins.

Traits of cumulative histograms:

  • Final bin contains the total sample count, and as such is q(1).
  • Used at Google for their Monarch monitoring system.
  • Easy to calculate bin quantile – just divide the count by the maximum.

Open Source Log Linear Histograms

So what does a programmatic implementation of a log linear histogram looks like? Circonus has released this implementation of log linear histograms as open source, in both C and Golang.

We need to ask a few things:

  • Why does it scale?
  • Why is it fast?
  • How does it calculate quantiles?

First let’s examine what the visual representation of this histogram is, to get an idea of the structure:

At first glance this looks like a linear histogram, but take a look at the 1 million point on the X axis. You’ll notice a change in bin size by a factor of 10. Where there was a bin from 990k to 1M, the next bin spans 1M to 1.1M. Each power of 10 contains 90 bins evenly spaced. Why not 100 bins you might think? Because the lower bound isn’t zero. In the case of a lower bound of 1 and an upper bound of 10, 10-1 = 9, and spacing those in 0.1 increments yields 90 bins.

Here’s an alternate look at the bin size transition:

You can see the transition to larger bins at the 1,000 sample value boundary.

Bin Data Structure

The C implementation of each bin is a struct containing a value and exponent struct paired with the count of samples in that bin. The diagram above shows the overall memory footprint of each bin. The value is one byte representing the value of the data sample times 10. The Exponent is the power of 10 of the bin, and ranges from -128 to +127. The sample count is an unsigned 64 bit integer. This field is variable bit encoded, and as such occupies a maximum of 8 bytes.

Many of the existing time series data stores out there store a single data point as an average using the value as a uint64. That’s one value for every eight bytes – this structure can store virtually an infinite count of samples in this bin range with the same data storage requirements.

Let’s take a look at what the storage footprint is in practice for this type of bin data structure. In practice, we have not seen more than a 300 bin span for operational sample sets. Bins are not pre-allocated linearly, they are allocated as samples are recorded for that span, so the histogram data structure can have gaps between bins containing data.

To calculate storage efficiency for 1 month, that’s 30 days of one minute histograms:

30 days * 24 hours/day * 60 bins/hour * 300 bin span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB = 123.6 MB

These calculations show that we can store 30 days of one minute distributions in a maximum space of 123.6 megabytes. Less than a second of disk read operations, if that.

Now, 30 days of one minute averages only takes about a third of a megabyte – but that data is essentially useless for any sort of analysis.

Let’s examine what a year’s worth of data looks like in five minute windows. That’s 365 days of five minute histograms.

365 days * 24 hours/day * 12 bins/hour * 300 bin span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB = 300.9 MB

The same calculation with different values yield a maximum of about 300 megabytes to represent a year’s worth of data in five minute windows.

Note that this is invariant to the total number of samples; the biggest factors in the actual size of the data are the span of bins covered, and the compression factor per bin.

Quantile Calculations

Let’s talk about performing quantile calculations.

  1. Given a quantile q(X) where 0 < X < 1.
  2. Sum up the counts of all the bins, C.
  3. Multiply X * C to get count Q.
  4. Walk bins, sum bin boundary counts until > Q.
  5. Interpolate quantile value q(X) from bin.

The quantile notation q of X is just a fancy way of specifying a percentile. The 90th percentile would be q of 0.9, the 95th would be q of 0.95, and the maximum would be q of 1.

So say we wanted to calculate the median, q of 0.5. First we iterate over the histogram and sum up the counts in all of the bins. Remember that cumulative histogram we talked about earlier? Guess what – the far right bin already contains the total count, so if you are using a cumulative histogram variation, that part is already done for you and can be a small optimization for quantile calculation since you have an O(1) operation instead of O(n)

Now we multiple that count by the quantile value. If we say our count is 1,000 and our median is 0.5, we get a Q of 500.

Next iterate over the bins, summing the left and right bin boundary counts, until Q is between those counts. If the count Q matches the left bin boundary count, our quantile value is that bin boundary value. If Q falls in between the left and right boundary counts, we use linear interpolation to calculate the sample value.

Let’s go over the linear interpolation part of quantile calculation. Once we have walked the bins to where Q is located in a certain bin, we use the formula shown here.

Little q of X is the left bin value plus big Q minus the left bin boundary, divided by the count differences of the left and ride sides of the bin, times the bin width.

Using this approach we can determine the quantile for a log linear histogram to a high degree of accuracy.

In terms of what error levels we can experience in terms of quantile calculation, with one sample in a bin it is possible to see a worst case 5% error if the value is 109.9 in a bin bounding 100-110; the reported value would be 105. The best case error for one sample is 0.5% for a value of 955 in a bin spanning 950-960.

However, our use of histograms is geared towards very large sets of data. With bins that contain dozens, hundreds, or more samples, accuracy should be expected to 3 or 4 nines.

Inverse Quantiles

We can also use histograms to calculate inverse quantiles. Humans can reason about thresholds more naturally than they can about quantiles.

If my web service gets a surge in requests, and my 99th percentile response time doesn’t change, that not only means that I just got a bunch of angry users whose requests took too long, but even worse I don’t know by how much those requests exceeded that percentile. I don’t know how bad the bleeding is.

Inverse quantiles allow me to set a threshold sample value, then calculate what percentage of values exceeded it.

To calculate the inverse quantile, we start with the target sample and work backwards towards the target count Q.

  1. Given a sample value X, locate its bin.
  2. Using the previous linear interpolation equation, solve for Q given X.

Given the previous equation we had, we can use some middle school level algebra (well, it was when I was in school) and solve for Q.

X = left_value+(Q-left_count) / (right_count-left_count)*bin_width

X-left_value = (Q-left_count) / (right_count-left_count)*bin_width

(X-left_value)/bin_width = (Q-left_count)/(right_count-left_count)

(X-left_value)/bin_width*(right_count-left_count) = Q-left_count

Q = (X-left_value)/bin_width*(right_count-left_count)+left_count

Solving for Q, we get 700, which is expected for our value of 1.05.

Now that we know Q, we can add up the counts to the left of it, subtract that from the total and then divide by the total to get the percentage of sample values which exceeded our sample value X.

  1. Sum the bin counts up to Q as Qleft.
  2. Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal.
  3. For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3.
  4. 30% of sample values exceeded X.

So if we are running a website, and we know from industry research that we’ll lose users if our requests take longer than three seconds, we can set X at 3 seconds, calculate the inverse quantile for our request times, and figure out what percentages of our users are getting mad at us. Let’s take a look at how we have been doing that in real time with Circonus.

Examples

This is a heatmap of one year’s worth of latency data for a web service. It contains about 300 million samples. Each histogram window in the heatmap is one day’s worth of data, five minute histograms are merged together to create that window. The overlay window shows the distribution for the day where the mouse is hovering over.

Here we added a 99th percentile overlay which you can see implemented as the green lines. It’s pretty easy to spot the monthly recurring rises in the percentile to around 10 seconds. Looks like a network timeout issue, those usually default to around 10 seconds. For most of the time the 99th percentile is relatively low, a few hundred millisecond.

Here we can see the inverse quantile shown for 500 milliseconds request times. As you can see, for most of the graph, at least 90% of the requests are finishing within the 500 millisecond service level objective. We can still see the monthly increases which we believe are related to network client timeouts, but when they spike, they only affect about 25% of requests – not great, but at least we know the extent of that our SLO is exceeded.

We can take that percentage of requests that violated our SLO of 500 milliseconds, and multiply them by the total number of requests to get the number of requests which exceeded 500 milliseconds. This has direct bearing on your business if each of these failed requests is costing you money.

Note that we’ve dropped the range here to a month to get a closer look at the data presented by these calculations.

What is we sum up over time the number of requests that exceeded 500 millseconds? Here we integrate the number of requests that exceeded the SLO over time, and plot that as the increasing line. You can clearly see now where things get bad with this service by the increase in slope of the blue line. What if we had a way to automatically detect when that is happening and then page whoever is on call?

Here we can see the red line on the graph is the output of a constant anomaly detection algorithm. When the number of SLO violations increases, the code identifies it as an anomaly and rates it on a 0-100 scale. There are several anomaly detection algorithms available out there, and most don’t use a lot of complicated machine learning, just statistics.

Video Presentation

This video of our presentation at DataEngConf 2018 covers the material explained in this post.

You can also view the slides here.

Conclusion

We looked at a few different implementations of histograms, an overview of Circonus’ open source implementation of a log linear histogram, what data structures it uses to codify bin data, and the algorithm used to calculate quantiles. Reading the code (in C or in Golang) will demonstrate how avoiding locks, memory allocations, waits, and copies are essential to making these calculations highly optimized. Good algorithms are still limited by how they are implemented on the metal itself.

One problem to solve might be to collect all of the syscall latency data on your systems via eBPF, and compare the 99th percentile syscall latencies once you’ve applied the Spectre patch which disabled CPU branch prediction. You could find that your 99th percentile syscall overhead has gone from 10 nanoseconds to 30!

While percentiles are a good first step for analyzing your Service Level Objectives, what if we could look at the change in the number of requests that exceeded that objective? Say if 10% of your syscall requests exceeded 10 nanoseconds before the spectre patch, but after patching 50% of those syscalls exceeded 10 nanoseconds?

Soon, in a later post, we’ll talk more about Service Level Objectives and SLO analysis.