Math is perfect – the most perfect thing in the world. The problem with math is that it is perpetrated by us imperfect humans. Circonus has long aimed to bring deep numerical analysis to business telemetry data. That may sound like a lot of mumbo-jumbo, but it really means we want better answers to better questions about all that data your business creates. Like it or not, to do that, we need to do maths and to do them right.

That’s not how it works

I was watching a series of Esurance commercials that have recently aired wherein people execute Internet-sounding tasks in nonsensical ways, like posting photographs to your [Facebook] wall by pinning things to your living room wall while your friends sit on the couch to observe, and I was reminded of something I see too often: bad math. Math to which I’m inclined to rebut: “That’s not how it works; that’s not how any of this works!”

A great example of such bad math is the inaccurate calculation of quantiles. This may sound super “mathy,” but quantiles have wide applications and there is a strong chance that you need them for things such as enforcing service level agreements and calculating billing and payments.

What is a quantile? First we’ll use the syntax q(N, v) to represent a quantile, where N is a set of samples and v is some number between 0 and 1, inclusive. Herein, we’ll assume some set N and just write a quantile as q(v). Remember that 0% is 0 and 100% is actually 1.0 in probability, so the quantile is simply asking: what sample in my set is such that a fraction v of the samples are less than it and (1-v) of the samples are greater than it? This may sound complicated. Most descriptions of math concepts are a bit opaque at first, but a few examples can be quite illustrative (and a short code sketch follows the list below).

  • What is q(0)? What sample is such that 0 (or none) in the set are smaller and the rest are larger? Well, that’s easy: the smallest number in the set. q(0) is another way of writing the minimum.
  • What is q(1)? What sample is such that 1 (100% or all) in the set are smaller and none are larger? Also quite simple: the largest number in the set. q(1) is another way of writing the maximum.
  • What is q(0.5)? What sample is such that 50% (or half) in the set are smaller and the other half are larger? This is the middle, or, in statistics, the median. q(0.5) is another way of writing the median.
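
For the code-inclined, here is a minimal sketch of those three cases. It uses a nearest-rank style of indexing; real implementations differ in how they interpolate between samples, so treat this particular convention as an illustrative assumption rather than the one true formula:

```python
# A minimal nearest-rank quantile: sort the samples and pick the element
# such that roughly a fraction v of the samples fall at or below it.
def q(samples, v):
    s = sorted(samples)
    idx = min(int(v * (len(s) - 1) + 0.5), len(s) - 1)
    return s[idx]

latencies = [12, 7, 3, 41, 8, 19, 5]  # an arbitrary sample set N
print(q(latencies, 0))    # 3  -> the minimum
print(q(latencies, 0.5))  # 8  -> the median
print(q(latencies, 1))    # 41 -> the maximum
```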

SLA calculations – the good, the bad and the ugly

Quite often when articulating Internet bandwidth billing scenarios, one will measure the traffic over each 5-minute period throughout an entire month and calculate q(0.95) over those samples. This is called 95th percentile billing. In service level agreements, one can stipulate that the latency for a particular service must be faster than a specific amount (some percentage of the time), or that some percentage of all interactions with said service must be at a specific latency or faster. Why are those methods different? In the first, your set of samples is calculated over discrete buckets of time, whereas in the second your samples are simply the latencies of each service request. As a note to those writing SLAs: the first is dreadfully difficult to articulate and thus nigh impossible to calculate consistently or meaningfully; the second just makes sense. While discrete time buckets might make sense for availability-based SLAs, they make little to no sense for latency-based SLAs. As “slow is the new down” is adopted across modern businesses, simple availability-based SLAs are rapidly becoming irrelevant.
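
For the billing case, the calculation itself is simple enough to sketch. A minimal example, assuming one throughput sample (in Mbps, averaged over each 5-minute bucket) for a 30-day month, with the traffic values invented:

```python
import random

# Hypothetical 95th percentile billing: one Mbps sample per 5-minute bucket
# for a 30-day month gives 30 * 24 * 12 = 8640 samples.
random.seed(1)
samples = [random.uniform(50, 400) for _ in range(30 * 24 * 12)]

ranked = sorted(samples)
billable_mbps = ranked[int(0.95 * (len(ranked) - 1) + 0.5)]  # q(0.95): the top 5% of buckets are forgiven
print(f"bill at roughly {billable_mbps:.1f} Mbps")
```

Note that the samples fed into that q(0.95) are the raw per-bucket measurements; the rest of this post is about why you cannot reconstruct such a number from pre-aggregated summaries.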

So, let’s get more concrete: I have an API service with a simply stated SLA: 99.9% of my accesses should be serviced in 100 or fewer milliseconds. It might be a simple statement, but it is missing a critical piece of information: over what time period is this enforced? Not specifying the time unit of SLA enforcement is most often the first mistake people make. A simple example will best serve to illustrate why.

Assume I have an average of 5 requests per second to this API service. “Things go wrong” and I have twenty requests that are served at 200ms (significantly above our 100ms requirement) during the day: ten slow requests at 9:02am and the other ten offenders at 12:18pm. Some back-of-the-napkin math says that as long as less than 1 request out of 1000 is slow, I’m fine (99.9% are still fast enough). As I far exceed 20,000 total requests during the day, I’ve not violated my SLA… However, if I enforce my SLA on five-minute intervals, I have 1500 requests occurring between 9:00 and 9:05 and 10 slow responses. 10 out of 1500 is… well… there goes my SLA. Same blatant failure from 12:15 to 12:20. So, based on a five-minute-enforcement method I have 10 minutes of SLA violation during the day vs. no SLA violation whatsoever using a one-day-enforcement method. But wait… it gets worse. Why? Bad math.
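
Spelling out that back-of-the-napkin math, using the hypothetical request counts above:

```python
# Hypothetical scenario: ~5 requests/second all day, 20 slow requests in
# total, 10 of them landing in a single 5-minute bucket.
requests_per_day = 5 * 60 * 60 * 24        # 432,000 requests
slow_per_day = 20
print(slow_per_day / requests_per_day)     # ~0.00005 -> well under 0.1%; the day-long SLA holds

requests_per_bucket = 5 * 60 * 5           # ~1,500 requests in a 5-minute bucket
slow_in_bad_bucket = 10
print(slow_in_bad_bucket / requests_per_bucket)  # ~0.0067 -> 0.67%; that bucket blows the SLA
```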

Many systems out there calculate quantiles over short periods of time (like 1 minute or 5 minutes). Instead of storing the latency for every request, the system retains 5 minutes of these measurements, calculates q(0.999), discards the samples, and stores the resulting quantile for later use. At the end of the day, you have a q(0.999) for each 5-minute period throughout the day (288 of them). So, given 288 quantiles throughout the day, how do you calculate the quantile for the whole day? Despite the fictitious answer some tools provide, the answer is: you don’t. Math… it doesn’t work that way.

There are only two magical quantiles that will allow this type of reduction: q(0) and q(1). The minimum of a set of minimums is indeed the global minimum; the same is true for maximums. Do you know what the q(0.999) of a set of q(0.999)s is? Or what the average of a set of q(0.999)s is? Hint: it’s not the answer you’re looking for. Basically, if you have a set of 288 quantiles representing each of the day’s five-minute intervals and you want the quantile for the whole day, you are simply out of luck. Because math.
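
To see this for yourself, here is a rough sketch with invented latency data: 288 five-minute windows, one of which holds all of the slow requests. The per-window q(0.999)s cannot be recombined into the day’s q(0.999), while the minimum reduces just fine:

```python
import random

# Invented latencies (ms): 288 five-minute windows of 1500 requests each.
# Most traffic is fast; one window holds all 50 of the slow requests.
random.seed(7)
windows = [[random.uniform(10, 90) for _ in range(1500)] for _ in range(288)]
windows[100][:50] = [random.uniform(200, 400) for _ in range(50)]

def q(samples, v):
    """Nearest-rank quantile computed over a raw sample set."""
    s = sorted(samples)
    return s[min(int(v * (len(s) - 1) + 0.5), len(s) - 1)]

per_window = [q(w, 0.999) for w in windows]       # what a typical tool stores
all_samples = [x for w in windows for x in w]     # what you actually need to keep

print(q(all_samples, 0.999))               # the real q(0.999) for the whole day
print(q(per_window, 0.999))                # q(0.999) of the stored q(0.999)s: a very different number
print(sum(per_window) / len(per_window))   # average of the stored q(0.999)s: also not the answer
print(min(min(w) for w in windows) == min(all_samples))  # True: only q(0) and q(1) reduce this way
```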

In fact, the situation is quite dire. If I calculate the quantile of the aforementioned 9:00 to 9:05 time interval, where ten samples of 1500 are 200ms, the q(0.999) is 200ms despite the other 1490 samples being faster than 100ms. Devastatingly, if the other 1490 samples were 150ms (such that every single request was over the prescribed 100ms limit), the q(0.999) would still be 200ms. Because I’ve tossed the original samples, I have no idea if all of my samples violated the SLA or just 0.1% of them. In the worst-case scenario, all of them were “too slow” and now I have 3000 requests that were too slow. While 20 requests aren’t enough to break the day, 3000 most certainly are, and my whole day is actually in violation. Because the system I’m using for collecting quantile information is doing math wrong, the only reasonable answer to “did we violate the SLA today?” is “I don’t know, maybe.” It doesn’t need to be like this – Circonus calculates these quantiles correctly.
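
Here is that information loss in miniature, with made-up latencies standing in for the two scenarios:

```python
def q(samples, v):
    """Nearest-rank quantile computed over a raw sample set."""
    s = sorted(samples)
    return s[min(int(v * (len(s) - 1) + 0.5), len(s) - 1)]

# Scenario A: 1490 requests at 50ms (comfortably within the SLA), 10 at 200ms.
mostly_fine  = [50.0] * 1490 + [200.0] * 10
# Scenario B: 1490 requests at 150ms (every single one violates the SLA), 10 at 200ms.
all_too_slow = [150.0] * 1490 + [200.0] * 10

print(q(mostly_fine, 0.999))   # 200.0
print(q(all_too_slow, 0.999))  # 200.0 -- identical, so once the raw samples are
                               # discarded, the two days are indistinguishable
```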

An aside on SLAs: While some of this might be a bit laborious to follow, the takeaway is to be very careful how you articulate your SLAs or you will often have no idea whether you are meeting them. I recommend calculating quantiles on a day-to-day basis (and that all days be measured in UTC so the abomination that is daylight saving time never foils you). So, to restate the example SLA above: 99.9% or more of the requests occurring on a 24-hour calendar day (UTC) shall be serviced in 100ms or less. If you prefer to keep your increments of failure smaller, you can opt for an hour-by-hour SLA instead of a day-by-day one. I do not recommend stating an SLA in anything less than one-hour spans.

Bad math in monitoring – don’t let it happen to you

Quantiles are an example where the current methods that most tools use are simply wrong and there is no trick or method that can help. However, something that I’ve learned working on Circonus is how often other tools screw up even the most basic math. I’ll list a few examples in the form of tips, without the indignity of attributing them to specific products. (If you run any of these products, you might recognize the example… or at some point in the future you will either wake up in a cold sweat or simply let loose a stream of astounding expletives.)

  • The average of a set of minimums is not the global minimum. It is, instead, nothing useful.
  • The simple average of a set of averages of varying sample sizes isn’t the average of all the samples combined. It is, instead, nothing useful (see the sketch after this list).
  • The average of the standard deviations of separate sets of samples is not the standard deviation of the combined set of samples. It is, instead, nothing useful.
  • The q(v) of a set of q(v) calculated over sample sets is not the q(v) of the combined sample set. While creative, it is nothing useful.
  • The average of a set of q(v) calculated over sample sets is, you guessed it, nothing useful.
  • The average of a bunch of rates (which are nothing more than a change in value divided by a change in time: dv/dt) with varying dts is not the damn average rate (and a particularly horrible travesty).
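
To pick on just one of those tips, here is a tiny sketch of the averages-of-averages trap with invented request counts. Only the count-weighted combination reproduces the true average:

```python
# Two buckets with wildly different sample counts.
bucket_a = [10.0] * 9     # 9 fast requests, each 10ms
bucket_b = [1000.0]       # 1 slow request at 1000ms

avg_a = sum(bucket_a) / len(bucket_a)   # 10.0
avg_b = sum(bucket_b) / len(bucket_b)   # 1000.0

naive    = (avg_a + avg_b) / 2          # 505.0 -- "nothing useful"
weighted = (avg_a * len(bucket_a) + avg_b * len(bucket_b)) / (len(bucket_a) + len(bucket_b))
true_avg = sum(bucket_a + bucket_b) / (len(bucket_a) + len(bucket_b))

print(naive, weighted, true_avg)        # 505.0 109.0 109.0
```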

At Circonus we learned early on that math is both critically important to our mission and quite challenging for many of our customers to understand. This makes it imperative that we not screw it up. Many new adopters contact support asking for an explanation as to why the numbers they see in our tool don’t match their existing tools. We have to explain that we thought they deserved the right answer. As for their existing tools: that’s not how it works; that’s not how any of this works.
