This is a story about an engineering group celebrating success when it shouldn’t be… and their organization buying into it. This is not the fault of the engineering groups, or the operations team, or any one person. This is the fault of yesterday’s tools not providing the right data. The right insights. The ability to dig into the data to get meaningful information to push your business forward.
Herein, we’ll dive into a day in the life of an online service where a team wakes up and triages an outage after ignorantly celebrating a larger outage as a success just twelve hours before. All names have been removed to protect the exceptionally well-intending and competent parties. You see, the problem is that the industry has been misleading us with misapplied math and bad statistics for years.
I’ll set the stage with a simple fact of this business… when it takes longer than one and half seconds to use their service, users leave. Armed with this fact, let’s begin our journey. Despite this data living in Circonus, it isn’t measuring Circonus; alas, as stories are best told in the first person with friends along for the ride, I shall drop into the first-person plural for the rest of the ride: let’s go.
We track the user’s experience logging into the application. We do this not by synthetically logging in and measuring (we do this too, but only for functional testing), but by measuring each user’s experience and recording it. When drawn as a heatmap, the data looks like the graph below. The red line indicates a number that, through research, we’ve found to be a line of despair and loss. Delivering an experience of 1.5 seconds or slower causes our users to leave.
Heatmaps can be a bit confusing to reason about, so this is the last we’ll see of it here. The important part to remember is that we are storing a complete model of the distribution of user experiences over time and we’ll get to why that is important in just a bit. From this data, we can calculate and visualize all the things we’re used to.
The above is a graph for that same lonely day in June, and it shows milliseconds of latency; specifically, the line represents the average user experience. If I ask you to spot the problem on the graph, you can do so just as easily as a four year old; it’s glaring. However, you’ll note that our graph indicates we’re well under our 1.5s line of despair and loss. We’re all okay right? Wrong.
A long time ago, the industry realized that averages (and standard deviations) are very poor representations of sample sets because our populations are not normally distributed. Instead of using an average (specifically an arithmetic mean), we all decided that measuring on some large quantile would be better. We were right. So, an organization would pick a percentage: 99.9% or 99% and articulate, “I have to be at least ‘this good’ for at least ‘this percentage’ of my users.” If this percentage seems arbitrary, it is… but, like the 1.5 second line of despair and loss, it can be derived from lots of business data and user behavior studies.
This, ladies and gentlemen, is why we don’t use averages. Saying that averages are misleading is a bit less accurate than admitting that many people are misled by averages. They simply don’t represent the things that are important to us here: how are we treating our users? This question is critical because it is our users who fund us and our real question is, “How many users are having a dissatisfying experience?”
The above graph is radically different than the first; it might surprise you to know that it is showing the same underlying dataset. Instead of the average experience, it shows the 99th percentile experience over time. It is much clearer that we had something catastrophically bad happen at 5am. It also shows that aside from two small infractions (7:52pm and 11:00pm), the rest of the day delivered the objective of a “less than 1.5s 99th percentile experience.” Okay, let’s stop. That’s such a disgustingly opaque and clinical way to talk about what we’re representing. These are actual people attempting to use this service.
What we’re saying here is that for each of the points on the purple line in this graph, during the time window that it represents (at this zoom level, each point represents 4 minutes), that 99% of visitors had an experience better than the value, and 1% had an experience worse than the value. Here we should see our first problem: percentages aren’t people.
Reflecting on the day as a whole, we see a catastrophic problem at 5am, to which our mighty engineering organization responded and remediated diligently over the course of approximately fifty minutes. Go Team! The rest of the day was pretty good, and we have those two little blips to diagnose and fix going forward.
I’m glad we’re not using averages for monitoring! We’d most likely not have been alerted to that problem at 5am! Here is where most monitoring stories end because a few quantiles is all that is stored and the raw data behind everything isn’t available for further analysis. Let’s return to our earlier question, “How many users are having a dissatisfying experience?” Luckily for us, we know how many users were on the site, so we can actually just multiply 1% by the number of current visitors to understand “how many” of the users are having an experience worse than the graph… But that isn’t the question is it? The question is how many users are having a worse experience than 1.5s, not worse than the 99th percentile.
This graph adds a black line that shows the number of current users each minute on the site (numbered on the right axis). To illustrate how we’re really missing the point, let’s just take a random point from our 99th percentile graph (again each point represents 4 minutes at this zoom level). We randomly pick 9:32pm. The graph tells us that the 99th percentile experience at that point is at 1.266s. This is better than our goal of 1.5s. Well, looking at the black line we see that we have about 86 users each minute on the site at that point, or 344 users over the four minute period. 1% of that is between 3 and 4 users. Okay, we’re getting somewhere! So we know that between 3 and 4 users had an experience over 1.266s. Wait, that wasn’t our question. Who cares about 1.266s, when we want to know about 1.5s? We’re not getting anywhere at all.
Our objective is 1.5 seconds. We’re looking at this all upside down and backwards. We should not be asking how bad the experience is for the worst 1%, instead we should be asking what percentage has a bad experience (any experience worse than our objective of 1.5 seconds). We shouldn’t be asking about quantiles; we should be asking about inverse quantiles. Since we’re storing the whole distribution of experiences in Circonus, we can simply ask, “What percentage of the population is faster than 1.5s?” If we take one minus this inverse quantile at 1.5 seconds, we get exactly the answer to our question: What percentage of users had a “bad experience?”
Now we’re getting somewhere. It is clear that we had a bad time at 5am and we did pretty well with just some line noise during our successful prior evening, right? Let’s return to our first problem: percentages aren’t people.
Luckily, just as we did before, we can simply look at how many people are visiting the site (the green line above) and multiply that by the percentage of people having a bad time and we get the number of actual people. Now we’re talking about something everyone understands. How many people had a bad experience? Let’s multiply!
In this image, we have simply multiplied two data streams from before, and we see the human casualties of our system. This is the number of users per minute that we screwed out of a good experience. These are users that, in all likelihood, are taking their business elsewhere. As anyone that thinks about it for more than a few seconds realizes, a small percentage of a large number can easily be bigger than a large percentage of a small number. Managing to inverse quantile numbers (let alone abstractly reasoning about quantiles), without knowing the size of the population, is misleading (to put it mildly).
Another way to look at this graph is to integrate; that is, to calculate the area under the curve. Integrating a graph representing users over time results in a graph of users. In other words, the number of cumulative users that have had a bad experience.
This should be flat-out eye opening. The eight hours from 2am to 10am (including the event of despair and loss) affected 121 people. The eight hours preceding it affected almost as many: 113.
It can be pretty depressing to think you’ve celebrated a successful day of delivery only to learn that it really wasn’t that successful at all. But, this isn’t so much about celebrating successes that were actually failures; it’s about understanding what, when, and where you can improve. Every user matters; and if you treat them that way, you stand to get a lot more of them.
When you look back at your own graphs, just remember that the most casualties of our day happened in these two bands. You should be using inverse quantiles for SLA reporting; if you don’t have those, use quantiles… if you only have averages, you’re blind as a bat.