Latency SLOs Done Right

In their excellent SLO workshop at SRECon 2018, Liz Fong-Jones, Kristina Bennett, and Stephen Thorne (Google) presented some best practice examples for latency SLIs/SLOs. At Circonus we care deeply about measuring latency and SRE techniques such as SLIs/SLOs. For example, we recently produced A Guide to Service Level Objectives. As we will explain here, latency SLOs are particularly delicate to implement and benefit from having histogram data available to understand distributions and adjust SLO targets.

Latency SLOs

Here are the key definitions regarding latency SLOs from the workshop handout (pdf).

Latency SLI

The suggested specification for a request/response Latency SLI is:

The proportion of valid requests served faster than a threshold.

Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and what threshold marks the difference between requests that are fast enough and those that are not?

Latency SLO

99% of home page requests in the past 28 days served in < 100ms.
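
To make these definitions concrete, here is a minimal sketch (not from the workshop material) that computes the SLI as the proportion of good requests and checks it against the SLO target. The latency_sli function and the toy sample are hypothetical placeholders.

# Hypothetical sample of valid-request latencies (ms) over the reporting window
last_28_days_ms = [12.3, 45.0, 80.1, 99.5, 150.2, 30.7, 101.0, 76.4, 88.8, 95.0]

def latency_sli(latencies_ms, threshold_ms=100):
    """Proportion of valid requests served faster than the threshold."""
    good = sum(1 for l in latencies_ms if l < threshold_ms)
    return good / len(latencies_ms)

# SLO: 99% of home page requests in the past 28 days served in < 100ms
slo_met = latency_sli(last_28_days_ms, threshold_ms=100) >= 0.99
print(latency_sli(last_28_days_ms), slo_met)  # 0.8, False for this toy sample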

Latency Metrics

Latency is typically measured with percentile metrics like these, which were presented for a similar use case:

Given this data, what can we say about the SLO?

What is the p90 computed over the full 28 days?

It’s very tempting to take the average of the p90 metric displayed in the graph, which would come out just below the 500ms mark.
It’s important to note, as was correctly pointed out during the session, that this is not generally valid. There is no mathematical way to determine the 28-day percentile from the series of 1h(?) percentiles shown in the above graphs (reddit, blog, math). You need to look at different metrics if you want to implement a latency SLO. In this post we will discuss three different methods for doing this correctly.
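
A quick synthetic experiment (not from the workshop, with made-up lognormal latencies) illustrates the problem: the average of per-window p90 values and the p90 computed over the pooled data generally disagree, and the gap grows when traffic volume varies between windows.

import numpy as np

rng = np.random.default_rng(42)

# Synthetic latencies: 5 busy hours with fast responses, 1 quiet hour with slow ones
windows = [rng.lognormal(mean=2.5, sigma=0.5, size=10000) for _ in range(5)]
windows.append(rng.lognormal(mean=4.0, sigma=0.5, size=100))  # quiet, slow hour

avg_of_p90s = np.mean([np.percentile(w, 90) for w in windows])
true_p90 = np.percentile(np.concatenate(windows), 90)

print(f"average of per-window p90s: {avg_of_p90s:.1f}ms")
print(f"p90 over all requests:      {true_p90:.1f}ms")  # noticeably lower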

Latency metrics in the wild

In the example above, the error introduced by averaging percentiles might not actually be that dramatic. The system seems to be very well behaved, with a high, constant load. In this situation the average p90/1h is typically close to the total p90/28 days.
Let’s take a look at another API, from a less loaded system. This API handles very few requests between 2:00am and 4:00am:

What’s the true p90 over the 6h drawn on the graph? Is it above or below 30ms?
The average p90/1M (36.28ms) looks far less appealing than before.

Computing Latency SLOs

So how can latency SLOs be computed correctly? There are three ways to go about this:

  1. Compute the SLO from stored raw data (logs)
  2. Count the number of bad requests in a separate metric
  3. Use histograms to store the latency distribution

Method 1: Using Raw/Log data

Storing access logs with latency data gives you accurate results. The drawback of this approach is that you must keep your logs over long periods of time (28 days in our example), which can be costly.
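
A minimal sketch of this method, assuming a hypothetical access-log format in which the request latency in milliseconds is the last whitespace-separated field:

import numpy as np

# Hypothetical log format: "... <latency_ms>" as the last field of each line
latencies = []
with open("access.log") as f:
    for line in f:
        latencies.append(float(line.split()[-1]))

# Because the raw data is kept, any percentile or threshold can be evaluated later
print(f"p90 over the full log: {np.percentile(latencies, 90):.1f}ms")
print(f"fraction faster than 100ms: {np.mean(np.array(latencies) < 100):.3%}")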

Method 2: Counting bad requests

For this method, instrument the application to count the number of requests that violate the latency threshold, as well as the total number of requests. The resulting metrics will look like this:

Percent good = 100 - (2262/60124)*100 = 96.238%

Using these metrics, we see that 96% of our requests over the past 6h were faster than 30ms. Our SLO stated that 90% of the requests should be good, so that objective was met.

The drawback of this approach is that you have to choose the latency threshold upfront. There is no way to calculate the percentage of requests that were faster than, say, 200ms from the recorded data.

If your SLO changes, you will need to change the executable or the service configuration to count requests above a different threshold.
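
A minimal sketch of this kind of instrumentation; the counter names and the in-process dictionary are hypothetical, and in a real service these counts would be exported through your metrics library:

import time

# Hypothetical in-process counters; a real service would export these as metrics
counters = {"requests_total": 0, "requests_over_30ms": 0}
THRESHOLD_MS = 30  # fixed at instrumentation time; changing it requires a redeploy

def handle_request(handler, *args):
    start = time.monotonic()
    result = handler(*args)
    elapsed_ms = (time.monotonic() - start) * 1000
    counters["requests_total"] += 1
    if elapsed_ms > THRESHOLD_MS:
        counters["requests_over_30ms"] += 1
    return result

# Percent good = 100 - (bad / total) * 100, as in the example above
def percent_good():
    return 100 - counters["requests_over_30ms"] / counters["requests_total"] * 100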

Method 3: Using Histograms

The third practical option you have for computing accurate SLOs is storing your request latency data as histograms. The advantages of storing latency data as histograms are:

  1. Histograms can be freely aggregated across time.
  2. Histograms can be used to derive approximations of arbitrary percentiles.

For (1) to be true, it’s critical that your histograms have common bin choices. It’s usually a good idea to mandate the bin boundaries for your whole organization; otherwise you will not be able to aggregate histograms from different services.
For (2), it’s critical that you have enough bins in the latency range that is relevant for your percentiles. Sparsely encoded log linear histograms allow you to cover a large floating point range (e.g. 10^-127 .. 10^128) with a fixed relative precision (e.g. 5%). In this way you can guarantee 5% accuracy on all percentiles, no matter how the data is distributed.
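
To make the log linear idea concrete, here is a toy binning function (not the circllhist implementation) that maps a latency value to a bucket with a two-digit mantissa and a decimal exponent:

import math

def log_linear_bin(value):
    """Toy log linear binning: map a positive value to a bucket of the form
    m * 10**e with a two-digit mantissa (10 <= m <= 99), i.e. 90 bins per decade."""
    e = math.floor(math.log10(value)) - 1  # exponent of the bucket width
    m = int(value / 10**e)                 # two significant digits of the value
    return (m, e)

# Each bucket spans at most 10% of its lower bound, so percentiles reconstructed
# from bucket midpoints stay within a few percent of the true value, regardless
# of how the data is distributed.
for v in (0.00123, 4.2, 4.25, 873.0):
    m, e = log_linear_bin(v)
    print("{:>10} -> bucket [{}e{}, {}e{})".format(v, m, e, m + 1, e))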

Two popular implementations of log linear histograms are HdrHistogram and Circllhist.

Circonus comes with native support for Circllhist, which is used for this example.
Histogram metrics store latency information per minute, and are commonly visualized as heatmaps:

Merging those 360x1M-histograms shown above into a single 6h-Histogram, we arrive at the following graph:

This is the true latency distribution over the full SLO reporting period, which is 6h in this example.
At the time of this writing, there is no nice UI option to overlay percentiles in the above histogram graph. As we will see, you can perform the SLO calculation with CAQL or Python.

SLO Reporting via CAQL

We can use the CAQL functions histogram:rolling(6h) and histogram:percentile() to aggregate histograms over the last 6h and compute percentiles over the aggregated histograms. The SLO value we are looking for will be the very last value displayed on the graph.
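
As a sketch, the composed CAQL statement might look like the string below; the pipe composition and the metric name are assumptions based on the functions named above and on the Python example that follows, so verify the exact syntax against your own account.

# Assumed CAQL composition (verify the syntax); reuses the metric from the Python example below
slo_query = (
    'search:metric:histogram("api`GET`/getState")'
    ' | histogram:rolling(6h)'
    ' | histogram:percentile(90)'
)
# It could then be fetched with the same client as below, e.g. circ.caql(slo_query, t, 60, N)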

SLO Reporting using Python

Using the Python API the calculation could look as follows:

# 0. Imports (assuming the circonusapi and circllhist Python packages; config holds
#    the API credentials and circllhist_plot is a small plotting helper)
from circonusapi import circonusdata
from circllhist import Circllhist

# 1. Fetch Histogram Data
t = 1528171020 # exact start time of the graph
N = 364        # exact number of minutes on the above graph
circ = circonusdata.CirconusData(config["demo"])
data = circ.caql('search:metric:histogram("api`GET`/getState")', t, 60, N)

# 2. Merge the 1M-histograms into a single 6h histogram
H = Circllhist()
for h in data['output[0]']:
    H.merge(h)
# Check that the fetched data is consistent with the histogram shown in the UI
circllhist_plot(H)

# 3. Calculate Aggregated Percentiles:
P = [50, 90, 95, 99, 99.9]  # arbitrary percentiles
for p in P:
    print("{:>8}-latency percentile over 6h: {:>8.3f}ms".format(p, H.quantile(p/100)))
      50-latency percentile over 6h:   13.507ms
      90-latency percentile over 6h:   21.065ms
      95-latency percentile over 6h:   27.796ms
      99-latency percentile over 6h:   56.058ms
    99.9-latency percentile over 6h:  918.760ms

In particular we see that the true p90 is around 21ms, which is far away from the average p90 of 36.28ms we computed earlier.

# 4. Calculate Aggregated Counts:
Y = [10, 30, 50, 100, 200]  # arbitrary latency thresholds in ms
for y in Y:
    print("{:>10.3f} percent faster than {}ms".format(H.count_below(y)/H.count()*100, y))
    18.465 percent faster than 10ms
    96.238 percent faster than 30ms
    98.859 percent faster than 50ms
    99.484 percent faster than 100ms
    99.649 percent faster than 200ms

In particular, we replicate the “96.238% below 30ms” result that we calculated using the counter metrics before.

Conclusion

It’s important to understand that percentile metrics do not allow you to implement accurate Service Level Objectives that are formulated over periods of hours or weeks. Aggregating 1M percentiles seems tempting, but can produce materially wrong results, especially if your load is volatile.
The most practical ways to calculate correct SLO percentiles are counter metrics and histograms. Histograms give you the additional flexibility of choosing the latency threshold after the fact. This comes in particularly handy when you are still evaluating your service and are not ready to commit to a latency threshold just yet.

SREcon 2018 Americas

Getting paged at 11pm on New Year’s Eve because the application code used sprintf %d on a 32 bit system and your ids just passed 4.295 billion, sending the ids negative and crashing your object service. A wakeup call at 2 am (or is it 3 am?) on the ‘spring forward’ Daylight Savings transition because your timezone libraries didn’t incorporate one of the several dozen new politically mandated timezone changes. Sweating a four hour downtime two days in a row due to a primary/replica database failover, because your kernel RAID driver threw the same unhandled exception twice in a row; your backup primary database server naturally uses the same hardware as the active one, of course.

Circonus was created by its founders because they experienced the pain of reliability engineering on large scale systems first hand. They needed tools to efficiently diagnose and resolve problems in distributed systems, and they needed to do it at scale. The existing tools at the time (Nagios, Ganglia, etc.) couldn’t cope with the volume of telemetry, nor provide the insight into system behavior that was needed. So they set out to develop tools and methods that would fill the void.

The first of these was using histograms to visualize timing data. Existing solutions would give you the average latency, the 95th percentile, the 99th percentile, and maybe a couple of others. This information was useful for one host, but mathematically useless for aggregate system metrics. Capturing latency metrics and storing them as log linear histograms allowed users to see the distribution of values over a time window. Moreover, this data could be aggregated across multiple hosts to give a holistic view of the performance of a distributed system or service.

However, systems are dynamic and constantly changing. Systems that behave well one second and poorly the next are the norm, not the exception, in today’s ephemeral infrastructures. So we added heatmaps, which are histogram representations over discrete windows of time, giving users an overview of the actual performance of their system. If the diagram below were a traditional line graph showing the average latency value, it would be a mostly straight line, hiding the parts where long tail latencies became unbearable for certain customers. Heatmaps give SREs the power to separate ‘works fine’ in testing from ‘this is really slow’ for those outlier large customers (who are generally the ones paying the big bucks).

These tools became formative components of standards that have been developing in the SRE community. A few years ago, Brendan Gregg introduced the USE method (Utilization, Saturation, Errors), a set of metrics which are key indicators of host level health. Following on the tails of USE, Tom Wilkie introduced the RED method (Rate, Errors, Duration), a set of metrics which are indicators of service level health. Combining the two gives SREs a powerful set of standard frameworks for quickly identifying bad behavior in both hosts and services.

These types of visualizations display a wealth of information, and as a result can put demands on the underlying metrics storage layer. A year ago we released IRONdb, the time series database that we developed in C and Lua. This standalone TSDB can now power Grafana based visualizations, which have become part of the toolset for many SREs. As the complexity of today’s microservice based architectures grows, and the lifetime of individual components falls, the need for high volume time series telemetry continues to increase. Here at Circonus, we are dedicated to bringing you solutions for the parts of reliability engineering that have caused us pain in the past and that affect all SREs, so that you can focus your efforts on the parts of your business that you know better than anyone else.