The Problem with Percentiles – Aggregation brings Aggravation

Percentiles have become one of the primary service level indicators used to represent the performance of real systems in monitoring. When used correctly, they provide a robust metric that can serve as the basis of mission-critical service level objectives. However, there’s a reason for the “when used correctly” above.

For all their potential, percentiles do have subtle limitations that are very often overlooked by the people using them in analyses.

There’s no shortage of previous writings on this topic, most notably Baron Schwartz’s “Why Percentiles Don’t Work the Way You Think.” Here, I’ll focus on data, and why you should pay close attention to how percentiles are applied.

Right off the bat, the most misused technique is aggregation of percentiles. You should almost never average percentiles, because percentile metrics cannot support even very fundamental aggregation tasks. It is often thought that because most telemetry systems make percentiles cheap to obtain, and because they are good enough to use with little effort, they are appropriate for aggregation and system-wide performance analysis. While that holds most of the time and for most systems, you lose the ability to determine when your data is lying to you, for example, when high error rates (+/- 5% and greater) are hidden from you.

Those times when your systems are misbehaving the most?
That’s exactly when you don’t have the data to tell you where things are going wrong.

Check the Math

Let’s look at an example* of request latencies from two web servers (W1 blue, W2 red). The p95 of the blue server is 220ms; the p95 of the red one is 650ms:

What’s the total p95 across both nodes (W1, W2)? (plot generated with matplotlib)

By aggregating the latency distributions of each web server, we find that the total p95 is 230ms. W2 barely served any requests, so adding requests from there did not change the p95 of W1 by much. Now, naive averaging of the percentiles would have given you: (220 + 650) / 2 = 870 / 2 = 435ms, which is nearly double the true total percentile (230ms), an overestimate of about 89%.
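To get a feel for the effect, here is a minimal Python sketch. The sample lists are made up for illustration (they are not the data behind the plot), and `nearest_rank_percentile` is a hypothetical helper that uses the simple nearest-rank definition of a percentile:

```python
import math


def nearest_rank_percentile(samples, p):
    """Return the p-th percentile (integer p, 1-100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted data
    return ordered[rank - 1]


# Hypothetical latencies in ms: W1 is busy and fast, W2 is nearly idle and slow.
w1 = list(range(10, 210, 2))      # 100 requests between 10ms and 208ms
w2 = [600, 620, 640, 650, 700]    # only 5 requests

p95_w1 = nearest_rank_percentile(w1, 95)
p95_w2 = nearest_rank_percentile(w2, 95)
averaged_p95 = (p95_w1 + p95_w2) / 2                 # the mistake
merged_p95 = nearest_rank_percentile(w1 + w2, 95)    # computed from all samples

print(f"p95 W1 = {p95_w1}ms, p95 W2 = {p95_w2}ms")
print(f"average of the two p95s = {averaged_p95}ms")
print(f"p95 of the merged samples = {merged_p95}ms")
```

With these made-up numbers the averaged value lands at 449ms while the p95 of the merged samples is 208ms, because the five slow requests from W2 barely move the combined distribution.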

So, if you have a Service Level Indicator in this scenario of “95th percentile latency of requests over past 5 minutes < 300ms,” and you averaged p95s instead of calculating from a distribution, you would be led to believe that you have exceeded your SLI by ~45% (435ms against a 300ms target). Folks would be getting paged even though they didn’t need to be, and might conclude that additional servers were needed (when in fact this scenario represents overprovisioning).

Incorrect math can result in tens of thousands of dollars of unneeded capacity,
not to mention the cost of the time of the humans in the loop.

*If you want to play with the numbers yourself to get a feel for how these scenarios can develop, there is a sample calculation with link to an online percentile calculator in the appendix [1]

“Almost Never” Actually Means “Never Ever”

“But above, you said ‘almost never’ when saying that we shouldn’t ever average percentiles?”

That’s 100% correct. (No pun intended.)

You see, there are circumstances where you can average percentiles and get a result that has low errors: namely, when the distributions of your data sources are identical. The most obvious case is when the latencies are from two web servers that are (a) healthy and (b) serving very similar load.

Be aware that this supposition breaks down as soon as either of those conditions is violated! Those are the cases where you are most interested in your monitoring data: when one of your servers starts misbehaving or you have a load balancing problem.

“But my servers usually have an even distribution of request latencies which are nearly identical, so that doesn’t affect me, right?”

Well, sure, if your web servers have nearly identical latency distributions, go ahead and calculate your total 95th percentile for the system by averaging the percentiles from each server. But when you have one server that decides to run into swap and slow down, you likely won’t notice a problem, since the data indicating it is effectively hidden.

So still, you should never average percentiles; you won’t be able to know when the approach you are taking is hurting you at the worst time.

Averaging percentiles masks problems with nodes that would otherwise be apparent

Percentiles are aggregates, but they should not be aggregated. They should be calculated, not stored. There are a considerable number of operational time series data monitoring systems, both open source and commercial, which will happily store percentiles at 5 minute (or similar) intervals. If you want to look at a year’s worth of data, you will encounter spike erosion. The percentiles are averaged to fit the number of pixels in the time windows on the graph. And that averaged data is mathematically wrong.

Example of Spike Erosion: 2 week view on the left shows a max of ~22ms. 24 hour view on the right shows a max of ~70ms.
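A minimal sketch of how spike erosion arises, assuming a hypothetical series of stored p99 values (one per 5 minute window) that a graphing layer averages down to fit fewer pixels. The numbers and the `downsample_by_mean` helper are made up for illustration:

```python
from statistics import mean


def downsample_by_mean(values, bucket_size):
    """Average consecutive groups of `bucket_size` points, as a zoomed-out graph might."""
    return [
        mean(values[i:i + bucket_size])
        for i in range(0, len(values), bucket_size)
    ]


# Hypothetical stored p99 latencies (ms): 24 hours of 5 minute windows = 288 points.
stored_p99 = [10.0] * 288
stored_p99[100] = 70.0  # a single 5 minute spike

zoomed_in_max = max(stored_p99)
# Zoomed out: each plotted point now covers 12 windows (one hour).
zoomed_out_max = max(downsample_by_mean(stored_p99, 12))

print(f"max at full resolution: {zoomed_in_max}ms")
print(f"max after averaging to hourly points: {zoomed_out_max}ms")
```

In this sketch the 70ms spike shrinks to 15ms once twelve stored percentiles are averaged into one plotted point, and the averaged value is not a percentile of anything.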

Clever Hacks

“So, the solution is to store every single sample like you did in the example, right?”

Well, yes and no.

You can store each sample and generate correct percentiles from them, but at any more than a modest scale this becomes prohibitively expensive. Some open-source time series databases and monitoring systems do this, but you give up either scalability in data ingest or length of data retention. One million 64-bit integer samples per second for a year occupies about 229 TiB of space. One week of this data is about 4.4 TiB; doable with off-the-shelf hardware, but economically impractical for analysis, as well as wasteful.
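The storage figures can be checked with a few lines of arithmetic. This sketch assumes 8 bytes per sample, no compression or indexing overhead, a 365 day year, and binary units (TiB):

```python
BYTES_PER_SAMPLE = 8              # one 64-bit integer
SAMPLES_PER_SECOND = 1_000_000
SECONDS_PER_WEEK = 7 * 24 * 3600
SECONDS_PER_YEAR = 365 * 24 * 3600
TIB = 2 ** 40                     # bytes in one tebibyte

week_bytes = BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * SECONDS_PER_WEEK
year_bytes = BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * SECONDS_PER_YEAR

print(f"one week: {week_bytes / TIB:.1f} TiB")
print(f"one year: {year_bytes / TIB:.1f} TiB")
```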

“Ah, but I’ve thought up a better solution. I can just store the number of requests that are under my desired objective, say 500 milliseconds, and the number of requests that are above, and I can divide the count under the objective by the total to calculate a correct percentage!”

This is a valid approach, one that I have even implemented with a monitoring system that was not able to store full distributions. However, the limitation is subtle; if after some time I decide that my objective of 500ms was too aggressive and move it to 600ms, all of the historical data that I’ve collected is useless. I have to reset my counters and begin anew.
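A minimal sketch of that counter approach, with a hypothetical `ThresholdCounter` class (not taken from any particular monitoring system). The counts merge correctly across servers, but only for the one threshold that was chosen up front:

```python
from dataclasses import dataclass


@dataclass
class ThresholdCounter:
    """Counts requests at or under a fixed latency threshold versus over it."""
    threshold_ms: float
    under: int = 0
    over: int = 0

    def record(self, latency_ms):
        if latency_ms <= self.threshold_ms:
            self.under += 1
        else:
            self.over += 1

    def fraction_under(self):
        total = self.under + self.over
        return self.under / total if total else 0.0

    def merge(self, other):
        # Counts are only comparable when both sides used the same threshold.
        if other.threshold_ms != self.threshold_ms:
            raise ValueError("cannot merge counters with different thresholds")
        return ThresholdCounter(
            self.threshold_ms, self.under + other.under, self.over + other.over
        )


# Latencies (ms) borrowed from the appendix, judged against a 500ms objective.
server_1 = ThresholdCounter(threshold_ms=500)
for latency in [22, 90, 73, 296, 55]:
    server_1.record(latency)

server_2 = ThresholdCounter(threshold_ms=500)
for latency in [935, 72, 18, 553, 267, 351, 56, 28, 45, 873]:
    server_2.record(latency)

combined = server_1.merge(server_2)
print(f"{combined.fraction_under():.0%} of requests were at or under 500ms")
```

Nothing in `combined` can say what fraction of requests were under 600ms; answering that question requires new counters and new data.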

Store Distributions, Not Percentiles

A better approach than storing percentiles is to store the source sample data in a manner that is more efficient than storing single samples, but still able to produce statistically significant aggregates. The histogram, or distribution, is one such approach.

There are many types of histograms, but here at Circonus we use the log-linear histogram. It provides a mix of storage efficiency and statistical accuracy. Worst-case errors at single digit sample sizes are 5%, quite a bit better than the nearly twofold overestimate that we demonstrated above by averaging percentiles.
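To make the binning idea concrete, here is a simplified sketch. It assumes bins that keep the two most significant decimal digits of a value, so each power of ten is split into 90 equal-width bins; it only handles integers of 10 or more and is not the code of the Circonus libraries. `log_linear_bin` and `midpoint_error` are hypothetical helper names:

```python
def log_linear_bin(value):
    """Return (lower, upper) bounds of the bin holding an integer value >= 10.

    Assumed scheme: keep the two most significant decimal digits, so the bin
    width grows by a factor of ten at every power of ten.
    """
    if value < 10:
        raise ValueError("this sketch only handles integers >= 10")
    scale = 10 ** (len(str(value)) - 2)
    lower = (value // scale) * scale
    return lower, lower + scale


def midpoint_error(value):
    """Relative error when the bin midpoint stands in for the true value."""
    lower, upper = log_linear_bin(value)
    return abs((lower + upper) / 2 - value) / value


print(log_linear_bin(873))          # bin of width 10
print(log_linear_bin(999_999))      # bin of width 10,000
print(log_linear_bin(1_000_000))    # bin of width 100,000
print(max(midpoint_error(v) for v in range(10, 10_000)))
```

Under this assumed scheme the worst case occurs at the lower edge of the first bin in each decade (for example 1,000 reported as 1,050), which gives the 5% bound.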

Log Linear histogram view of load balancer request latency. Note the increase in bin size by a factor of 10 at 1.0M (1 million)

Storage efficiency is significantly better than storing individual samples; a year’s worth of 5 minute log linear histogram windows (10 bytes per bin, 300 bins/window) can be stored in ~300MB (sans compression). Reading this amount of data from disk in under a second is tractable with most physical (and virtualized) systems. The mergeability properties of histograms allow precomputed cumulative histograms to be stored for analytically useful windows such as 1 minute and 3 hours. This allows large time spans of time series telemetry to be rapidly assembled from sets that are visually relevant to the end user (think one year of data with histogram windows of six hours each).
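Mergeability is what makes this work: two histograms with the same bin layout combine by adding their bin counts, and a percentile can then be read from the merged result. The sketch below assumes the same simplified two-significant-digit bins (integers of 10 or more) rather than the actual library format, and uses the latencies from the appendix; all function names are hypothetical:

```python
import math
from collections import Counter


def bin_lower_bound(value):
    """Lower bound of the assumed two-significant-digit bin for an integer >= 10."""
    scale = 10 ** (len(str(value)) - 2)
    return (value // scale) * scale


def to_histogram(samples):
    """Map bin lower bound -> number of samples that fell into that bin."""
    return Counter(bin_lower_bound(v) for v in samples)


def percentile_bin(histogram, p):
    """Lower bound of the bin holding the nearest-rank p-th percentile (integer p)."""
    total = sum(histogram.values())
    rank = math.ceil(p * total / 100)  # 1-based rank within all samples
    seen = 0
    for lower in sorted(histogram):
        seen += histogram[lower]
        if seen >= rank:
            return lower
    raise ValueError("empty histogram")


# Latencies (ms) from the appendix, one histogram per web server.
web_1 = to_histogram([22, 90, 73, 296, 55])
web_2 = to_histogram([935, 72, 18, 553, 267, 351, 56, 28, 45, 873])

merged = web_1 + web_2  # merging is just adding bin counts
print(f"p90 of the merged histogram falls in the bin starting at {percentile_bin(merged, 90)}ms")
```

The same addition works across time: twelve 5 minute histograms can be summed into an hourly one without going back to the raw samples.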

Using histograms for operational time series data may seem like a challenge at first, but there are a lot of resources out there to help you out. We have published open source libraries of our log linear histograms in C, Golang, and even JavaScript. The Envoy proxy is one project that has implemented the log linear histogram C implementation for operational statistics. The Istio service mesh uses the Golang version of the log linear histogram library via our open source gometrics package to record latency metrics as distributions.

libcircllhist
https://github.com/circonus-labs/libcircllhist
circonusllhist
https://github.com/circonus-labs/circonusllhist
circllhist
https://github.com/circonus-labs/circllhist.js

In Conclusion

Percentiles are a staple tool of real systems monitoring, but their limitations should be understood. Because percentiles are provided by nearly every monitoring and observability toolset without limitations on their usage, they can be applied to analyses easily without the operator needing to understand the consequences of *how* they are applied. Understanding the common scenarios where percentiles give the wrong answers is just as important as having an understanding of how they are generated from operational time series data.

If you use percentiles now, you are already capturing data as distributions to some degree through your toolset. Knowing how that toolset generates percentiles from that source telemetry will ensure that you can evaluate if your use of percentiles to answer business questions is mathematically correct.


Appendix

Suppose I have two servers behind a load balancer, answering web requests. I use some form of telemetry gathering software to ingest the time it takes for each web server to complete serving its respective requests in milliseconds. I get results like this:

Web server 1: [22,90,73,296,55]
Web server 2: [935,72,18,553,267,351,56,28,45,873]

I’ll make the calculations here so easy that you can try them out yourself at https://goodcalculators.com/percentile-calculator/

What is the 90th percentile of requests from web server 1?

Solution:

  • Step 1. Arrange the data in ascending order: 22, 55, 73, 90, 296
  • Step 2. Compute the position of the pth percentile (index i):
    i = (p / 100) * n, where p = 90 and n = 5
    i = (90 / 100) * 5 = 4.5
  • Step 3. The index i is not an integer, round up.
    (i = 5) ⇒

    the 90th percentile is the value in 5th position, or 296

  • Answer: the 90th percentile is 296

What is the 90th percentile of requests from web server 2?

Solution:

  • Step 1. Arrange the data in ascending order: 18, 28, 45, 56, 72, 267, 351, 553, 873, 935
  • Step 2. Compute the position of the pth percentile (index i):
    i = (p / 100) * n, where p = 90 and n = 10
    i = (90 / 100) * 10 = 9
  • Step 3. The index i is an integer ⇒ the 90th percentile is the average of the values in the 9th and 10th positions (873 and 935 respectively)
  • Answer: the 90th percentile is
    (873 + 935) / 2 = 904
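The rule used in the steps above can be sketched in a few lines of Python. The function name `percentile` is hypothetical; the rule is the one described in the steps: round a fractional index up, and average positions i and i + 1 when the index is a whole number.

```python
def percentile(samples, p):
    """p-th percentile (integer p, 0 < p <= 100) using the rule from the steps above."""
    ordered = sorted(samples)
    n = len(ordered)
    scaled = p * n                 # i = scaled / 100, kept as an integer to avoid float error
    if scaled % 100:               # i is not an integer: round up
        i = scaled // 100 + 1
        return ordered[i - 1]
    i = scaled // 100              # i is an integer: average positions i and i + 1
    if i == n:                     # there is no (i + 1)-th value when p = 100
        return ordered[-1]
    return (ordered[i - 1] + ordered[i]) / 2


web_server_1 = [22, 90, 73, 296, 55]
web_server_2 = [935, 72, 18, 553, 267, 351, 56, 28, 45, 873]

print(percentile(web_server_1, 90))
print(percentile(web_server_2, 90))
```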

So web server 1 q(0.9) is 296, web server 2 q(0.9) is 904. If our objective is to keep our 90th percentile of requests overall under 500ms, did we meet that objective?

Let’s try averaging these percentiles, which you should not ever do in reality.

q(0.9) as the average of (q(0.9) web server 1, q(0.9) web server 2) = (904 + 296) / 2 = 600ms.

That is over our 500ms objective by 20%, which looks like a miss, but a modest one. So we think things are only slightly off, right? Let’s calculate the p90 the right way and see what we get.

First we merge all values together, then use the percentile calculator on the merged set.

935,72,18,553,267,351,56,28,45,873,22,90,73,296,55

Solution:

  • Step 1. Arrange the data in ascending order: 18, 22, 28, 45, 55, 56, 72, 73, 90, 267, 296, 351, 553, 873, 935
  • Step 2. Compute the position of the pth percentile (index i):
    i = (p / 100) * n, where p = 90 and n = 15
    i = (90 / 100) * 15 = 13.5
  • Step 3. The index i is not an integer, round up.
    (i = 14) ⇒

    the 90th percentile is the value in 14th position, or 873

  • Answer: the 90th percentile is 873

Now our correct 90th percentile is 873 instead of 600, an increase of 273, or 45.5%. Quite a difference, no? If you had used the averaged value, you would be working from a p90 of 600ms. In reality, the threshold that 90% of requests come in under is 873ms, almost 50% larger than you thought.
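The whole comparison fits in one short script. As before, `percentile` is a hypothetical helper that follows the calculator's rule (round a fractional index up; average positions i and i + 1 when the index is whole):

```python
def percentile(samples, p):
    """p-th percentile (integer p, 0 < p <= 100) using the calculator's rule."""
    ordered = sorted(samples)
    n = len(ordered)
    i, remainder = divmod(p * n, 100)      # index i = p * n / 100
    if remainder:                          # fractional index: round up
        return ordered[i]                  # 0-based slot i is 1-based position i + 1
    if i == n:                             # p = 100: no next value to average with
        return ordered[-1]
    return (ordered[i - 1] + ordered[i]) / 2


web_server_1 = [22, 90, 73, 296, 55]
web_server_2 = [935, 72, 18, 553, 267, 351, 56, 28, 45, 873]

averaged = (percentile(web_server_1, 90) + percentile(web_server_2, 90)) / 2
merged = percentile(web_server_1 + web_server_2, 90)
relative_gap = (merged - averaged) / averaged

print(f"average of the p90s: {averaged}ms")
print(f"p90 of merged samples: {merged}ms")
print(f"the merged p90 is {relative_gap:.1%} higher")
```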