For the last several years, I've been speaking about the lies that graphs tell us. We all spend time looking at data, commonly through line graphs, that actually show us averages. A great example of this is showing average response times for API requests.
The above graph shows the average response time for calls made to a HTTP REST endpoint. Each pixel in this line graph is the average of thousands of samples. Each of these samples represents a real user of the API. Thousand of users distilled down to a single value sounds ideal until you realize that you have no idea what the distribution of the samples looks like. Basically, this graph only serves to mislead you. Having been misled for years by the graphs with little recourse, we decided to do something about it and give Circonus users more insight into their data.
Each of these pixels is the average of many samples. If we were to take those samples and put them in a histogram, it would provide dramatically improved insight into the underlying data. But a histogram is a visually bulky representation of data, and we have a lot of data to show (over time, no less). When I say visually bulky what do I mean? A histogram takes up space on the screen and since we have a histogram of data for each period of time and hundreds of periods of time in the time series we'd like to visualize... well, I can't very well show you hundreds of histograms at once and expect you to be able to make any sense of them; or can I?
Enter heat maps. Heat maps are a way of displaying histograms using color saturation instead of bar heights. So heat maps remove the "bulkiness" and provide sufficient visual density of information, but the rub is that people have trouble grasping them at first sight. Once you look at them for a while, they start to make sense. The question we faced is: how do we tie it all together and make it more accessible? The journey started for us about six months ago, and we've arrived at a place that I find truly enlightening.
Instead of a tutorial on histograms, I think throwing you into the interface is far more constructive.
The above graph provides a very deep, rich understanding the same data that powered the first line graph. This graph shows all of the API response times for the exact same service over the same time period.
In my first (#1) point of interest, I am hovering the pointer over a specific bit of data. This happens to be August 31st at 8pm. I'll note that not only does our horizontal position matter (affecting time), but my vertical position indicates the actual service times. I'm hovering between 23 and 24 on the y-axis (23-24 milliseconds). The legend shows me that there were 1383 API calls made at that time and 96 of them took between 23 and 24 milliseconds. Highlighted at #3, I also have some invaluable information about where these samples sit in our overall distribution: these 96 samples constitute only 7% of our dataset, 61% of the samples are less than 23ms and the remaining 32% are greater than or equal to 24ms. If I move the pointer up and down, I can see this all dynamically change on-screen. Wow.
As if that wasn't enough, a pop-up histogram of the data from the time interval over which I'm hovering is available (#2) that shows me the precise distribution of samples. This histogram changes as I move my pointer horizontally to investigate different points in time.
Now that I've better prepared you for the onslaught of data, poke around a live interactive visualization of a histogram with similar data.
With these visualizations at my disposal, I am now able to ask more intelligent questions about how our systems behave and how our business reacts to that. All of these tools are available to Circonus users and you should be throwing every piece data you have at Circonus... just be prepared to have your eyes opened.