Effective Management of High Volume Numeric Data with Histograms
How do you capture and organize billions of measurements per second such that you can answer a rich set of queries effectively (percentiles, counts below X, aggregations across streams), and you don’t blow through your AWS budget in minutes?
To effectively manage billions of data points, your system has to be both performant and scalable. How do you accomplish that? Not only do your algorithms have to be on point, but your implementation of them has to be efficient. You want to avoid allocating memory where possible, avoid copying data (pass pointers around instead), avoid locks, and avoid waits. Lots of little optimizations that add up to being able to run your code as close to the metal as possible.
You also need to be able to scale your data structures. They need to be as size efficient as possible, which means using strongly typed languages with the optimum choice of data types. We’ve found that histograms are the most efficient data structure for storing the data types we care about at scale.
What is a histogram?
A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.
This diagram shows a skewed histogram where the mode is near the minimum value q(0). The Y axis is the number of samples (or sample density), and the X axis shows the sample value. On this histogram we can see that the median is slightly left of the midpoint between the highest value and the lowest value. The mode is at a low sample value, so the median is below the mean, or average value. The 90th percentile, also called q(0.9), is the value below which 90 percent of the sample values fall.
This might look like the 2nd generation display in KITT from Knight Rider, but this is a heatmap. A heatmap is essentially a series of histograms over time. This heatmap represents web service request latency. Imagine each column in the heatmap is a bar graph (like the previous histogram) viewed “from above”; the parts that are red are where the sample density is the highest. So we can see that most of the latencies tend to concentrate around 500 nanoseconds. We can also overlay quantiles onto this visualization, which we’ll cover in a bit.
Types of Histograms
There are five types of histograms:
- Fixed Bucket – require the user to specify the bucket or bin boundaries.
- Approximate – use approximations of values.
- Linear – have one bin at even intervals, such as one bin per integer.
- Log Linear – have bins at logarithmically increasing intervals.
- Cumulative – each successive bin contains the sum of the counts of previous bins.
Fixed Bucket Histograms
Fixed bucket histograms require the user to specify the bin boundaries.
Traits of fixed bucket histograms:
- Histogram bin sizes can be fine tuned for known data sets to achieve increased precision.
- Cannot be merged with other types of histograms because the bin boundaries are unlikely to line up.
- Less experienced users will likely pick suboptimal bin sizes.
- If you change your bin sizes, you can’t do calculations across older configurations.
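As a minimal sketch of the idea (not taken from any particular library), a fixed bucket histogram can be built by binary searching each sample against user-chosen boundaries. The function name, the boundaries, and the sample latencies below are all made up for illustration.

```python
from bisect import bisect_right

def fixed_bucket_counts(samples, boundaries):
    """Count samples into bins delimited by sorted, user-chosen boundaries.

    Bin 0 holds everything below the first boundary, bin i holds values in
    [boundaries[i-1], boundaries[i]), and the last bin holds everything at
    or above the last boundary.
    """
    counts = [0] * (len(boundaries) + 1)
    for sample in samples:
        counts[bisect_right(boundaries, sample)] += 1
    return counts

# Hypothetical latencies in milliseconds and hand-picked boundaries.
latencies_ms = [3, 12, 48, 51, 99, 100, 250, 2000]
print(fixed_bucket_counts(latencies_ms, [10, 50, 100, 500]))  # [1, 2, 2, 2, 1]
```

Two histograms built this way can only be added bin by bin if they were created with the same boundary list, which is the merging limitation noted above.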
Approximate Histograms
Approximate histograms such as the t-digest histogram (created by Ted Dunning) use approximations of values, such as the example above, which displays centroids. The number of samples on each side of the centroid is the same.
Traits of approximate histograms:
- Space efficient.
- High accuracy at extreme percentiles (95%, 99%+).
- Worst case errors ~10% at the median with small sample sizes.
- Can be merged with other t-digest histograms.
Linear Histograms
Linear histograms have one bin at even intervals, such as one bin per integer. Because the bins are all evenly sized, this type of histogram uses a large number of bins.
Traits of linear histograms:
- Accuracy dependent on data distribution and bin size.
- Low accuracy at fractional sample values (though this indicates improperly sized bins).
- Inefficient bin footprint at higher sample values.
Log Linear Histograms
Log Linear histograms have bins at logarithmically increasing intervals.
Traits of log linear histograms:
- High accuracy at all sample values.
- Fits all ranges of data well.
- Worst case bin error ~5%, but only with absurdly low sample density.
- Bins are often subdivided into even slices for increased accuracy.
- HDR histograms (high dynamic range) are a type of log linear histograms.
Cumulative Histograms
Cumulative histograms are different from other types of histograms in that each successive bin contains the sum of the counts of previous bins.
Traits of cumulative histograms:
- Final bin contains the total sample count, and as such is q(1).
- Used at Google for their Monarch monitoring system.
- Easy to calculate bin quantile – just divide the count by the maximum.
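The cumulative form and the "divide the count by the maximum" shortcut can be sketched in a few lines. The function names and the counts are hypothetical; this is an illustration of the idea, not how any particular monitoring system stores its data.

```python
from itertools import accumulate

def to_cumulative(counts):
    """Turn per-bin counts into running totals, one per bin."""
    return list(accumulate(counts))

def bin_quantiles(cumulative):
    """Quantile reached at the upper edge of each bin: count / final count."""
    total = cumulative[-1]
    return [count / total for count in cumulative]

counts = [100, 300, 400, 200]          # made-up per-bin counts
cumulative = to_cumulative(counts)
print(cumulative)                      # [100, 400, 800, 1000]
print(bin_quantiles(cumulative))       # [0.1, 0.4, 0.8, 1.0]
```

The last entry is always 1.0, which matches the final bin being q(1).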
Open Source Log Linear Histograms
We need to ask a few things:
- Why does it scale?
- Why is it fast?
- How does it calculate quantiles?
First let’s examine what the visual representation of this histogram is, to get an idea of the structure:
At first glance this looks like a linear histogram, but take a look at the 1 million point on the X axis. You’ll notice a change in bin size by a factor of 10. Where there was a bin from 990k to 1M, the next bin spans 1M to 1.1M. Each power of 10 contains 90 bins evenly spaced. Why not 100 bins you might think? Because the lower bound isn’t zero. In the case of a lower bound of 1 and an upper bound of 10, 10-1 = 9, and spacing those in 0.1 increments yields 90 bins.
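One way to see why there are 90 bins per power of 10 is that the bin is picked by keeping two significant decimal digits of the value. The sketch below is an assumption-laden illustration of that rule for positive values only; `log_linear_bin` is a hypothetical helper, not a function from the open source implementation, and it uses `Decimal` purely to avoid floating point surprises at bin edges.

```python
from decimal import Decimal

def log_linear_bin(value):
    """Return (lower, upper) bounds of the log linear bin for a positive value.

    Each power of 10 is cut into 90 equal bins, which is the same as
    keeping the two leading significant digits (10 through 99).
    """
    d = Decimal(str(value))
    if d <= 0:
        raise ValueError("this sketch only handles positive values")
    exponent = d.adjusted()                      # power of 10 of the leading digit
    two_digits = int(d.scaleb(-exponent) * 10)   # 10..99
    width = Decimal(1).scaleb(exponent - 1)      # one tenth of that power of 10
    lower = two_digits * width
    return lower, lower + width

print([int(b) for b in log_linear_bin(995_000)])    # [990000, 1000000]
print([int(b) for b in log_linear_bin(1_050_000)])  # [1000000, 1100000]

# Every integer from 1,000 to 9,999 lands in one of exactly 90 bins.
print(len({log_linear_bin(v)[0] for v in range(1000, 10000)}))  # 90
```

The first two calls reproduce the transition described above: a 10k wide bin just below 1M, and a 100k wide bin just above it.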
Here’s an alternate look at the bin size transition:
You can see the transition to larger bins at the 1,000 sample value boundary.
Bin Data Structure
The C implementation of each bin is a struct containing a value and exponent struct paired with the count of samples in that bin. The diagram above shows the overall memory footprint of each bin. The value is one byte representing the value of the data sample times 10. The Exponent is the power of 10 of the bin, and ranges from -128 to +127. The sample count is an unsigned 64 bit integer. This field is variable bit encoded, and as such occupies a maximum of 8 bytes.
Many of the existing time series data stores out there store a single data point as an average using the value as a uint64. That’s one value for every eight bytes – this structure can store virtually an infinite count of samples in this bin range with the same data storage requirements.
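To make the 10 byte upper bound concrete, here is a rough sketch that packs one bin at full width using Python's `struct` module. The layout string, the choice of signed bytes for both small fields, and the helper names are assumptions for illustration; the real implementation is a C struct and encodes the count with a variable number of bytes, which this sketch does not attempt.

```python
import struct

# Assumed fixed-width layout: 1 byte value (sample value times 10),
# 1 byte exponent (-128..127), 8 byte unsigned count. "<" disables padding.
BIN_LAYOUT = struct.Struct("<bbQ")

def pack_bin(value_times_10, exponent, count):
    """Serialize one bin into its maximum-size form."""
    return BIN_LAYOUT.pack(value_times_10, exponent, count)

def unpack_bin(raw):
    """Inverse of pack_bin: returns (value_times_10, exponent, count)."""
    return BIN_LAYOUT.unpack(raw)

# A bin read here as 1.0 x 10^6 holding a trillion samples (made-up numbers).
raw = pack_bin(10, 6, 1_000_000_000_000)
print(len(raw))         # 10
print(unpack_bin(raw))  # (10, 6, 1000000000000)
```

However large the count grows, the bin still fits in those 10 bytes, which is the point of the comparison with storing one 8 byte average per data point.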
Let’s take a look at what the storage footprint is in practice for this type of bin data structure. In practice, we have not seen more than a 300 bin span for operational sample sets. Bins are not pre-allocated linearly; they are allocated as samples are recorded for that span, so the histogram data structure can have gaps between bins containing data.
To calculate storage efficiency for 1 month, that’s 30 days of one minute histograms:
30 days * 24 hours/day * 60 histograms/hour * 300 bins/histogram * 10 bytes/bin * 1 kB/1,024 bytes * 1 MB/1,024 kB = 123.6 MB
These calculations show that we can store 30 days of one minute distributions in a maximum space of 123.6 megabytes. Less than a second of disk read operations, if that.
Now, 30 days of one minute averages only takes about a third of a megabyte – but that data is essentially useless for any sort of analysis.
Let’s examine what a year’s worth of data looks like in five minute windows. That’s 365 days of five minute histograms.
365 days * 24 hours/day * 12 histograms/hour * 300 bins/histogram * 10 bytes/bin * 1 kB/1,024 bytes * 1 MB/1,024 kB = 300.8 MB
The same calculation with different values yields a maximum of about 300 megabytes to represent a year’s worth of data in five minute windows.
Note that this is invariant to the total number of samples; the biggest factors in the actual size of the data are the span of bins covered, and the compression factor per bin.
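The storage arithmetic above is easy to replay. The helper below is a hypothetical convenience function; its defaults (a 300 bin span and 10 bytes per bin) are the worst case figures used in this post.

```python
def histogram_storage_mb(days, histograms_per_hour, bin_span=300, bytes_per_bin=10):
    """Upper bound on storage, in MB, for a series of histograms."""
    total_bytes = days * 24 * histograms_per_hour * bin_span * bytes_per_bin
    return total_bytes / (1024 * 1024)

# 30 days of one minute histograms.
print(f"{histogram_storage_mb(30, 60):.1f}")   # 123.6

# 365 days of five minute histograms: a little over 300 MB.
print(f"{histogram_storage_mb(365, 12):.0f}")  # 301

# For comparison, 30 days of one minute averages at 8 bytes each.
print(f"{30 * 24 * 60 * 8 / (1024 * 1024):.2f}")  # 0.33
```

The sample count never appears in the function, which is the invariance noted above: only the number of histograms, the bin span, and the bytes per bin matter.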
Let’s talk about performing quantile calculations.
- Given a quantile q(X) where 0 ≤ X ≤ 1.
- Sum up the counts of all the bins, C.
- Multiply X * C to get count Q.
- Walk bins, sum bin boundary counts until > Q.
- Interpolate quantile value q(X) from bin.
The quantile notation q of X is just a fancy way of specifying a percentile. The 90th percentile would be q of 0.9, the 95th would be q of 0.95, and the maximum would be q of 1.
So say we wanted to calculate the median, q of 0.5. First we iterate over the histogram and sum up the counts in all of the bins. Remember that cumulative histogram we talked about earlier? Guess what – the far right bin already contains the total count, so if you are using a cumulative histogram variation, that part is already done for you. It is a small optimization for quantile calculation, since you have an O(1) operation instead of O(n).
Now we multiply that count by the quantile value. If we say our count is 1,000 and our quantile is 0.5, we get a Q of 500.
Next iterate over the bins, summing the left and right bin boundary counts, until Q is between those counts. If the count Q matches the left bin boundary count, our quantile value is that bin boundary value. If Q falls in between the left and right boundary counts, we use linear interpolation to calculate the sample value.
Let’s go over the linear interpolation part of quantile calculation. Once we have walked the bins to where Q is located in a certain bin, we use this formula:
q(X) = left_value + (Q - left_count) / (right_count - left_count) * bin_width
Little q of X is the left bin value, plus big Q minus the left bin boundary count, divided by the difference between the right and left bin boundary counts, times the bin width.
Using this approach we can determine the quantile for a log linear histogram to a high degree of accuracy.
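The steps above can be sketched as a short function. This is an illustration under stated assumptions rather than the open source implementation: bins are represented as hypothetical `(left_value, right_value, count)` tuples sorted by value, and the example counts are invented.

```python
def quantile(bins, x):
    """Estimate q(x) from sorted bins of (left_value, right_value, count)."""
    if not 0 <= x <= 1:
        raise ValueError("x must be between 0 and 1")
    total = sum(count for _, _, count in bins)      # C
    if total == 0:
        raise ValueError("histogram is empty")
    target = x * total                              # Q
    left_count = 0
    for left_value, right_value, count in bins:
        right_count = left_count + count
        if count > 0 and target <= right_count:
            # q(X) = left_value + (Q - left_count) / (right_count - left_count) * bin_width
            fraction = (target - left_count) / (right_count - left_count)
            return left_value + fraction * (right_value - left_value)
        left_count = right_count
    return bins[-1][1]  # safety net; not reached when total > 0

# Made-up histogram with 1,000 samples in four bins.
example_bins = [(100, 110, 200), (110, 120, 400), (120, 130, 300), (130, 140, 100)]
print(quantile(example_bins, 0.5))  # 117.5
print(quantile(example_bins, 1.0))  # 140.0
```

For the median, Q is 500, which lands in the 110-120 bin between boundary counts 200 and 600, so the result is 110 + (300 / 400) * 10 = 117.5.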
As for the error levels we can expect from quantile calculation: with one sample in a bin, the reported value is the bin midpoint, so the worst case error is about 5%. In a bin bounding 100-110 the reported value would be 105, which is 5% off for a sample of 100 and about 4.5% off for a sample of 109.9. The error shrinks toward the top of each power of 10: a single sample of 959.9 in a bin spanning 950-960 would be reported as 955, an error of about 0.5%.
However, our use of histograms is geared towards very large sets of data. With bins that contain dozens, hundreds, or more samples, accuracy should be expected to 3 or 4 nines.
We can also use histograms to calculate inverse quantiles. Humans can reason about thresholds more naturally than they can about quantiles.
If my web service gets a surge in requests, and my 99th percentile response time doesn’t change, that not only means that I just got a bunch of angry users whose requests took too long, but even worse I don’t know by how much those requests exceeded that percentile. I don’t know how bad the bleeding is.
Inverse quantiles allow me to set a threshold sample value, then calculate what percentage of values exceeded it.
To calculate the inverse quantile, we start with the target sample and work backwards towards the target count Q.
- Given a sample value X, locate its bin.
- Using the previous linear interpolation equation, solve for Q given X.
Given the previous equation we had, we can use some middle school level algebra (well, it was when I was in school) and solve for Q.
X = left_value + (Q - left_count) / (right_count - left_count) * bin_width
X - left_value = (Q - left_count) / (right_count - left_count) * bin_width
(X - left_value) / bin_width = (Q - left_count) / (right_count - left_count)
(X - left_value) / bin_width * (right_count - left_count) = Q - left_count
Q = (X - left_value) / bin_width * (right_count - left_count) + left_count
Solving for Q, we get 700, which is expected for our value of 1.05.
Now that we know Q, we can add up the counts to the left of it, subtract that from the total and then divide by the total to get the percentage of sample values which exceeded our sample value X.
- Sum the bin counts up to Q as Qleft.
- Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal.
- For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3.
- 30% of sample values exceeded X.
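Here is a minimal sketch of the inverse quantile, again assuming bins are hypothetical `(left_value, right_value, count)` tuples sorted by value. The bin values are invented so that the result lines up with the Qleft = 700 and Qtotal = 1,000 figures above; they are not taken from a real data set.

```python
def inverse_quantile(bins, threshold):
    """Fraction of samples above threshold, from sorted (left, right, count) bins."""
    total = sum(count for _, _, count in bins)      # Qtotal
    if total == 0:
        raise ValueError("histogram is empty")
    q_left = 0.0                                    # samples at or below threshold
    for left_value, right_value, count in bins:
        if threshold >= right_value:
            q_left += count                         # whole bin is below the threshold
        elif threshold > left_value:
            # Q = (X - left_value) / bin_width * (right_count - left_count) + left_count
            q_left += (threshold - left_value) / (right_value - left_value) * count
            break
        else:
            break                                   # threshold is below this bin
    return (total - q_left) / total

# Made-up request times in milliseconds, with a 1,050 ms threshold.
example_bins = [(990, 1000, 600), (1000, 1100, 200), (1100, 1200, 200)]
print(inverse_quantile(example_bins, 1050))  # 0.3
```

The threshold falls halfway through the 1,000-1,100 bin, so half of its 200 samples are added to the 600 on its left, giving Qleft = 700 and an inverse quantile of 0.3.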
So if we are running a website, and we know from industry research that we’ll lose users if our requests take longer than three seconds, we can set X at 3 seconds, calculate the inverse quantile for our request times, and figure out what percentages of our users are getting mad at us. Let’s take a look at how we have been doing that in real time with Circonus.
This is a heatmap of one year’s worth of latency data for a web service. It contains about 300 million samples. Each histogram window in the heatmap is one day’s worth of data; five minute histograms are merged together to create that window. The overlay window shows the distribution for the day the mouse is hovering over.
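Merging works because every histogram shares the same bin boundaries, so combining windows is just adding counts bin by bin. A rough sketch, assuming each histogram is a hypothetical dict that maps a bin's lower bound to its sample count:

```python
from collections import Counter

def merge_histograms(histograms):
    """Add up per-bin counts across histograms that share bin boundaries."""
    merged = Counter()
    for histogram in histograms:
        merged.update(histogram)   # Counter.update adds counts, it does not replace
    return dict(merged)

# Two made-up five minute windows, keyed by bin lower bound.
window_a = {100: 2, 110: 1}
window_b = {110: 4, 120: 3}
print(merge_histograms([window_a, window_b]))  # {100: 2, 110: 5, 120: 3}
```

Bins that appear in only one window are carried over unchanged, so sparse histograms with gaps merge without any special handling.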
Here we added a 99th percentile overlay, which you can see implemented as the green lines. It’s pretty easy to spot the monthly recurring rises in the percentile to around 10 seconds. That looks like a network timeout issue; those usually default to around 10 seconds. Most of the time the 99th percentile is relatively low, a few hundred milliseconds.
Here we can see the inverse quantile shown for 500 millisecond request times. As you can see, for most of the graph, at least 90% of the requests are finishing within the 500 millisecond service level objective. We can still see the monthly increases which we believe are related to network client timeouts, but when they spike, they only affect about 25% of requests – not great, but at least we know the extent to which our SLO is exceeded.
We can take the percentage of requests that violated our SLO of 500 milliseconds and multiply it by the total number of requests to get the number of requests which exceeded 500 milliseconds. This has direct bearing on your business if each of these failed requests is costing you money.
Note that we’ve dropped the range here to a month to get a closer look at the data presented by these calculations.
What if we sum up over time the number of requests that exceeded 500 milliseconds? Here we integrate the number of requests that exceeded the SLO over time, and plot that as the increasing line. You can clearly see now where things get bad with this service by the increase in slope of the blue line. What if we had a way to automatically detect when that is happening and then page whoever is on call?
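The multiplication and the running total described above can be sketched as follows. The window values are invented, and `slo_violations` is a hypothetical helper that takes the inverse quantile and request count for each time window.

```python
from itertools import accumulate

def slo_violations(windows):
    """windows: list of (fraction_over_slo, total_requests), one per time window."""
    return [round(fraction * total) for fraction, total in windows]

# Made-up windows: (inverse quantile at 500 ms, requests in the window).
windows = [(0.02, 10_000), (0.25, 12_000), (0.05, 9_000)]

per_window = slo_violations(windows)
print(per_window)                    # [200, 3000, 450]

# Integrating over time is a running sum of the per-window counts.
print(list(accumulate(per_window)))  # [200, 3200, 3650]
```

A jump in the running total from one window to the next, like the 3,000 added in the second window here, is what shows up as a steeper slope on the graph.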
Here we can see the red line on the graph is the output of a constant anomaly detection algorithm. When the number of SLO violations increases, the code identifies it as an anomaly and rates it on a 0-100 scale. There are several anomaly detection algorithms available out there, and most don’t use a lot of complicated machine learning, just statistics.
This video of our presentation at DataEngConf 2018 covers the material explained in this post.
You can also view the slides here.
We looked at a few different implementations of histograms, an overview of Circonus’ open source implementation of a log linear histogram, what data structures it uses to codify bin data, and the algorithm used to calculate quantiles. Reading the code (in C or in Golang) will demonstrate how essential avoiding locks, memory allocations, waits, and copies is to making these calculations highly optimized. Good algorithms are still limited by how they are implemented on the metal itself.
One problem to solve might be to collect all of the syscall latency data on your systems via eBPF, and compare the 99th percentile syscall latencies once you’ve applied the Spectre patch which disabled CPU branch prediction. You could find that your 99th percentile syscall overhead has gone from 10 nanoseconds to 30!
While percentiles are a good first step for analyzing your Service Level Objectives, what if we could look at the change in the number of requests that exceeded that objective? Say 10% of your syscall requests exceeded 10 nanoseconds before the Spectre patch, but after patching 50% of those syscalls exceeded 10 nanoseconds.
Soon, in a later post, we’ll talk more about Service Level Objectives and SLO analysis.