Our Monitoring Tools are Lying to Us

I posted the article below to LinkedIn a few weeks back. Since it was relatively popular and relevant to the Circonus community, we decided to repost it to our blog. You can find the original here.

I came across this vendor blog post today extolling the virtues of monitoring application performance using Percentiles versus Averages. Hard to believe in late 2012 there was still convincing to be done on this concept.

But in the age of Agile Computing and DevOps at scale, fixed percentiles over arbitrary, pre-determined time windows no longer cut the mustard for measuring application performance. Did they ever? Probably not, but they’re easy to calculate and cheap to store using 20th century “Small Data” technologies.

What if the proper threshold for supporting your service SLA for one KPI is measured at the 85th percentile over 1 min and for another KPI is measured at the 95th percentile over an hour? What if those thresholds change as your business changes and your business is changing rapidly? Are your tools as agile as your business?

What if consistently delighting your customers requires you to monitor a percentile of a particular metric at 1 min, 5 min, 1 hour, and 1 day intervals? Even if your tools imply they can do that, they probably can’t in reality. They weren’t designed to do that.

Let’s say you are monitoring “response time” and that over the course of 1 min you typically have thousands of response time measurements. Existing tools will calculate the chosen percentile of those thousands of measurements and store the result in a database every 5 min. After 60 min they have 12 values, one for each 5 min window. Want to “calculate” the 95th percentile over an hour? More than likely what your tools will actually calculate is the average of those 12 values. But in reality there were 12 x thousands of response times measured over that hour, not 12. What’s the actual 95th percentile? Your tools probably can’t answer that question because they don’t have the data.
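To make the gap concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the simulated response times, the single slow 5 min window, the 5,000 samples per window, and the helper names `percentile` and `simulate_hour`. It is not how any particular monitoring tool works. It simply compares the average of twelve stored 95th percentiles with the 95th percentile of all the raw measurements.

```python
import math
import random


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def simulate_hour(seed=42, per_window=5000):
    """Hypothetical data: twelve 5 min windows of response times in ms, one of them slow."""
    rng = random.Random(seed)
    windows = []
    for i in range(12):
        if i == 7:
            # Assumed incident: one window where responses take roughly ten times longer.
            windows.append([rng.uniform(900, 1100) for _ in range(per_window)])
        else:
            windows.append([rng.uniform(90, 110) for _ in range(per_window)])
    return windows


windows = simulate_hour()

# What a summarizing tool keeps: one 95th percentile per 5 min window.
per_window_p95 = [percentile(w, 95) for w in windows]

# What it reports for the hour: the average of those 12 stored values.
summarized = sum(per_window_p95) / len(per_window_p95)

# What actually happened: the 95th percentile of every raw measurement.
raw = [v for w in windows for v in w]
true_p95 = percentile(raw, 95)

print(f"average of 12 stored p95 values: {summarized:.1f} ms")
print(f"p95 of all raw measurements:     {true_p95:.1f} ms")
```

In this made-up hour, the slow window holds more than 5% of all measurements, so the real hourly 95th percentile lands inside the incident, while the averaged figure stays much closer to a normal window. The exact numbers depend entirely on the assumed data; the point is only that the two calculations answer different questions.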

If you are like almost all of your IT peers, your monitoring tools begin to summarize performance data before it becomes even an hour old. Automatically summarizing performance data is one of the most “valuable” features of RRD Tools, which I would bet is the single most common repository for IT performance data today. The perfect Small Data solution.

The point is, as soon as our tools begin summarizing performance data, we lose the ability to accurately analyze that data. Our tools begin to lie to us.