The Uphill Battle for Visibility

In 2011, Circonus released a feature that revolutionized the way people could observe systems. We throw around the word “histogram” a lot and the industry has listened. And while we’ve made great progress, it seems that somewhere along the way the industry got lost.

I’ve said for some time that without a histogram, you just don’t understand the behavior of your system. Many listened and the “histogram metric type” was born, and it is now used in many of today’s server apps to provide “better understanding” of system behavior. The deep problem is that those histograms might be used “inside” the product, but from the operator’s perspective you only see things like min, max, median, and some other arbitrary percentiles (like 75 and 99). Quick question: do you know what you get when you store a 99th percentile over time? Answer: not a histogram. The “problem” with averages that I’ve railed against for years is not that averages are bad or useless, but instead that an average is a single statistical assessment of a large distribution of data… it doesn’t provide enough insight into the data set. Adding real histograms solves this issue, but analyzing a handful of quantiles that people are calling histograms puts us right back where we started. Perhaps we need a T-shirt: “I was sold a histogram, but all I got was a few lousy quantiles. :(”

I understand the problem: it’s hard. Storing histograms as an actual datatype is innovative to the point that people don’t understand it. Let’s say, for example, you’re trying to measure the performance of your Cassandra cluster… as you know, every transaction counts. In the old days, we all just tracked things like averages. If you served approximately 100,000 requests in a second, you would measure the service time of each, track an EWMA (exponentially weighted moving average), and store this single output over time as an assessment of ongoing performance.

Stored Values
Time  EWMA (seconds)
T1    0.0020
T2    0.0023
T3    0.0019
T4    0.0042
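
To make the loss concrete, here is a minimal sketch of an EWMA in Python. The smoothing factor `alpha` and both sample workloads are made up for illustration; they are not taken from any real system. The point is that two very different sets of service times can collapse to the same stored number.

```python
def ewma(samples, alpha=0.1):
    """Exponentially weighted moving average of a sequence of samples.

    alpha is an assumed smoothing factor; real systems pick their own.
    """
    avg = None
    for x in samples:
        # The first sample seeds the average; later samples are blended in.
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
    return avg

# Hypothetical workload A: 99 fast requests (2 ms), then one 500 ms stall.
spiky = [0.002] * 99 + [0.5]

# Hypothetical workload B: every request takes about 52 ms.
steady = [0.0518] * 100

spiky_ewma = ewma(spiky)    # about 0.0518
steady_ewma = ewma(steady)  # about 0.0518
```

Both workloads store essentially the same value, so the stored value alone cannot tell you whether every request was slow or one request was catastrophic.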

Now, most monitoring systems didn’t actually store second-by-second data, so instead of having an approximation of an average of 100,000 or so measurements every second, you would have an approximation of an average of around 6,000,000 measurements every minute. When I first started talking about histograms and their value, it was easy for an audience to say, “Wow, if I see one datapoint representing six million, it stands to reason I don’t know all that much about the nature of the six million.”

Enter the histogram hack:

By tracking the service latencies in a histogram (and there are a lot of methods, from replacement sampling, to exponentially decaying, to high-definition-resolution collection), one now had a much richer model to reason about for understanding behavior. This is not the hack part… that followed. You see, people didn’t know what to do with these histograms because their monitoring/storage systems didn’t understand them, so they degraded all that hard work back into something like this:

Stored Values (seconds)
Time  mean    min      max    50th    75th    95th    99th
T1    0.0020  0.00081  0.120  0.0031  0.0033  0.0038  0.091
T2    0.0023  0.00073  0.140  0.0027  0.0031  0.0033  0.092
T3    0.0019  0.00062  0.093  0.0024  0.0027  0.0029  0.051
T4    0.0042  0.00092  0.100  0.0043  0.0050  0.0057  0.082

This may seem like a lot of information, but even at second-by-second sampling, we’re taking 100,000 data points and reducing them to a mere seven. This is most certainly better than a single statistical output (EWMA), but claiming even marginal success is folly. You see, the histogram that was used to calculate those seven outputs could calculate myriad other statistical insights, but all of that insight is lost if those histograms aren’t stored. Additionally, reasoning about the extracted statistics over time is extremely difficult.
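
One reason the extracted statistics are hard to reason about over time: stored percentiles cannot be recombined. The sketch below uses invented data for two time windows and a simple nearest-rank quantile (one of several common definitions, chosen here as an assumption) to compare the average of two stored 99th percentiles with the 99th percentile of the underlying requests.

```python
import math

def quantile(values, pct):
    """Nearest-rank quantile; pct is a percentage in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical busy window: 1,000 requests, almost all fast.
busy = [0.002] * 990 + [0.010] * 10

# Hypothetical quiet window: only 10 requests, all slow.
quiet = [0.100] * 10

# What a system that stores only percentiles has to work with.
p99_busy = quantile(busy, 99)
p99_quiet = quantile(quiet, 99)
p99_averaged = (p99_busy + p99_quiet) / 2   # about 0.051

# What the 99th percentile across both windows actually is.
p99_true = quantile(busy + quiet, 99)       # 0.010
```

With this data, averaging the two stored values overstates the real 99th percentile by roughly a factor of five, and nothing in the stored values lets you correct for it, because the request counts and the shape of each distribution are gone.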

At Circonus, from the beginning, we realized the value of storing a histogram as a first-class data type in our time series database.

      Stored Values  Derived Values
Time  Histogram      mean      min      max        50th      nth
T1    H1(100427)     mean(H1)  q(H1,0)  q(H1,100)  q(H1,50)  q(H1,n)
T2    H2(108455)     mean(H2)  q(H2,0)  q(H2,100)  q(H2,50)  q(H2,n)
T3    H3(94302)      mean(H3)  q(H3,0)  q(H3,100)  q(H3,50)  q(H3,n)
T4    H4(98223)      mean(H4)  q(H4,0)  q(H4,100)  q(H4,50)  q(H4,n)

Obviously, this is a little abstract, because how do you store a histogram? Well, we take the abstract to something quite concrete with implementations in Go, Java, JavaScript, and C. Storing the histogram also means that we can view the actual shape of the distribution of the data and extract other assessments, like estimated modality (how many modes there are).
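
For readers who want a feel for the idea, here is a toy sketch in Python of a histogram as a storable, mergeable value. It is not the Circonus implementation: the `Histogram` class, its method names, and the choice to bin positive values by two significant decimal digits are all assumptions made for illustration.

```python
import math
from collections import Counter

class Histogram:
    """Toy histogram: positive values binned by two significant digits."""

    def __init__(self, values=()):
        self.bins = Counter()   # (exponent, mantissa) -> count
        for v in values:
            self.record(v)

    def record(self, value):
        # Assumes value > 0, which holds for service latencies.
        exp = math.floor(math.log10(value))
        mantissa = int(value / 10.0 ** exp * 10)
        mantissa = min(99, max(10, mantissa))   # guard float edge cases
        self.bins[(exp - 1, mantissa)] += 1

    def merge(self, other):
        # Merging is just adding bin counts, so windows can be combined.
        merged = Histogram()
        merged.bins = self.bins + other.bins
        return merged

    def count(self):
        return sum(self.bins.values())

    def quantile(self, pct):
        """Nearest-rank quantile, reported as the lower edge of its bin."""
        rank = max(1, math.ceil(pct * self.count() / 100))
        seen = 0
        # (exponent, mantissa) tuples sort in order of increasing value.
        for exp, mantissa in sorted(self.bins):
            seen += self.bins[(exp, mantissa)]
            if seen >= rank:
                return mantissa * 10.0 ** exp

# Two hypothetical windows, stored as histograms rather than as quantiles.
h1 = Histogram([0.00215] * 990 + [0.0115] * 10)
h2 = Histogram([0.125] * 10)

combined = h1.merge(h2)
total = combined.count()        # 1010
p50 = combined.quantile(50)     # about 0.0021
p99 = combined.quantile(99)     # about 0.011
```

Because the stored value is the set of bin counts, any quantile can be derived after the fact, and histograms from different time windows or different hosts can be merged before asking the question. The cost in this sketch is a bounded error from reporting bin edges instead of exact values.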

All this sounds awesome, right? It is! However, there is an uphill battle. When the industry started to get excited about using histograms, it adopted the hacky model where histograms are used internally to provide poor statistics upstream, so we’re left at Circonus trying to get water from a stone. It seems each and every product out there requires code modifications to be able to expose this sort of rich insight. Don’t fear, we’re climbing that hill and more and more products are falling into place.

Today’s infrastructure and infrastructure applications (like Cassandra and Kafka), as well as today’s user- and machine-facing APIs, are incredibly sensitive to performance regressions. Without good visibility, the performance of your platform and the performance of your efforts atop it will always be guesswork.

To make things easier, we’re introducing Circonus’s Wirelatency Protocol Observer. Run it on your API server or on your Cassandra node and you’ll immediately get deep insight into the performance of your system via real histograms in all their glory.