Numbers, numbers, numbers; we’re all about numbers here at Circonus. We have trillions of data points which we feed into a slew of algorithms and processes to help our users identify problems with their data. But what are these numbers? It turns out that isn’t an easy question to answer.
Like most monitoring systems, Circonus performs an action from which it extracts one or more “metrics.” A common example is running a database query and measuring both the correctness of the result (as a boolean: good vs. bad) and the latency with which the answer was delivered. Similarly, it could load a web page, ensure that some specified content is successfully returned and measure the time it took. More concretely, when performing an HTTP transaction, it could obtain the following useful metrics: time to establish the TCP connection, time until the first byte of data is received, and time until the last byte of data is received. These measurements can reveal a variety of problems on the surface of your architecture and can also hint at issues deep within it.
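To make those three HTTP timings concrete, here is a minimal sketch of one synthetic check in Python. It is not how Circonus implements its checks; the function name `time_http_get` and the returned field names are made up for illustration, and it speaks plain HTTP/1.0 over a raw socket so that each timestamp is easy to see.

```python
import socket
import time


def time_http_get(host, port=80, path="/", timeout=10.0):
    """Run one synthetic GET and return its timings in seconds.

    Hypothetical helper for illustration only; names are assumptions.
    """
    start = time.perf_counter()
    # Metric 1: time to establish the TCP connection.
    sock = socket.create_connection((host, port), timeout=timeout)
    connected = time.perf_counter()
    first_byte = None
    last_byte = None
    body = bytearray()
    try:
        request = (
            f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # server closed the connection
            now = time.perf_counter()
            if first_byte is None:
                first_byte = now  # Metric 2: first byte received
            last_byte = now       # Metric 3: updated until the final chunk
            body.extend(chunk)
    finally:
        sock.close()
    if first_byte is None:
        raise ConnectionError("connection closed before any data arrived")
    # The boolean "good vs. bad" metric: did the status line say 200?
    status_parts = bytes(body).split(b"\r\n", 1)[0].split()
    return {
        "ok": len(status_parts) >= 2 and status_parts[1] == b"200",
        "connect_s": connected - start,
        "first_byte_s": first_byte - start,
        "last_byte_s": last_byte - start,
    }

# Usage (needs network access):
#   print(time_http_get("example.com"))
```

Each call produces exactly one synthetic event: a boolean and three latencies, measured from wherever the check happens to run.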
While most monitoring systems (and parts of Circonus) work this way, what is most interesting about these metrics is what they are missing. In other words, it is vital to understand what they do not tell you. You are not observing real information; instead you are producing a single synthetic event and measuring it. The data are not real (and worse, may be far from representative). Before I dive in and talk about why these data aren’t “good,” I’ll talk a bit about why they are “good enough” for many things.
Synthetic measurements work very well for components that can be measured in terms of quantities or rates. How many of something do you have? How quickly is it increasing or decreasing? Simple examples include disk space, I/O operations per second, the number of HTTP requests serviced, CPU usage, and memory usage. The most important factor is that these things are one-dimensional.
Data like these are both easy to visualize and critically important for things like anomaly detection and capacity planning. Being of a single dimension, understanding patterns in the data is easier for both humans and computers. However, as we start combining these data points, the world goes quickly out of focus.
For the moment, let’s assume we measure total money spent on an e-commerce site (you’d be crazy not to measure this). In addition to that, we measure total transactions performed (number of sales). With these metrics, we have some clear data: total dollars and dollars/hour (by deriving a rate from successive samples) and total sales and sales/hour (again by deriving). These numbers are pretty clear and we can make some good judgments about what to expect from day to day. However, you might ask, “What is the average transaction size?” The answer to this question is simple: total money spent divided by total sales. Unfortunately, the average is not a useful number; just ask any statistician.
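As a rough sketch of what “deriving” means here, the Python below turns cumulative totals into per-hour rates and then computes the average transaction size. The sample values and the helper name `derive_rate` are invented for illustration; they are not real sales data.

```python
def derive_rate(samples):
    """Turn cumulative (timestamp_seconds, total) samples into per-hour rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        hours = (t1 - t0) / 3600.0
        rates.append((v1 - v0) / hours)
    return rates


# Hypothetical cumulative counters sampled once an hour.
dollars = [(0, 0.0), (3600, 1200.0), (7200, 3000.0)]
sales = [(0, 0), (3600, 40), (7200, 70)]

dollars_per_hour = derive_rate(dollars)  # [1200.0, 1800.0]
sales_per_hour = derive_rate(sales)      # [40.0, 30.0]

# Average transaction size: total money spent divided by total sales.
average_sale = dollars[-1][1] / sales[-1][1]
print(dollars_per_hour, sales_per_hour, round(average_sale, 2))
```

The single `average_sale` figure says nothing about whether those 70 sales were all similar in size or a handful of large orders among many tiny ones.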
When you start looking at averages, you start losing information. We use averages to zoom out on graphs; you might notice that when you have a sudden spike (let’s say in traffic) you will see a much higher spike when zoomed in than when zoomed out. Why? If you were serving between 2900 and 3300 requests per second between 7pm and 8pm except for a sudden spike of 5400 requests per second between 7:40 and 7:45, you would see that spike at full height on a graph showing 5-minute averages. However, on a graph zoomed out far enough to show only 20-minute averages, you’d see a deceptively small spike of about 3700 rps at that time period. As long as you can zoom in on the time series, averaging can be an acceptable compromise that reduces the data volume to something consumable by a mere human being. Then the obvious question is: when does this go horribly wrong?
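The arithmetic behind that flattening can be checked with a short Python sketch. It assumes a flat baseline of 3100 rps (an arbitrary value inside the 2900 to 3300 range) with one sample per minute; `window_means` is a hypothetical helper name.

```python
def window_means(values, size):
    """Average consecutive, non-overlapping windows of `size` samples."""
    return [sum(values[i:i + size]) / size for i in range(0, len(values), size)]


# One sample per minute from 7:00 to 8:00; assumed flat 3100 rps baseline.
per_minute = [3100] * 60
for minute in range(40, 45):  # the spike between 7:40 and 7:45
    per_minute[minute] = 5400

five_minute = window_means(per_minute, 5)
twenty_minute = window_means(per_minute, 20)

print(max(five_minute))    # 5400.0: the spike is fully visible
print(max(twenty_minute))  # 3675.0: the same spike, diluted by 15 quiet minutes
```

The 20-minute window holds five minutes at 5400 and fifteen at 3100, so its mean is (5 × 5400 + 15 × 3100) / 20 = 3675.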
Let’s look at something like web page load times. If you run a synthetic transaction, always from the same location, you can track measurements in that single dimension. Things should be somewhat consistent and these numbers are useful. However, they do not tell you how fast your site is. Only your users know that. Interestingly, since your users access your web site, you can actually have them report that information back to you. In fact, this is how most web analytics systems work. The interesting part here is that you have a wide variety of data coming in representing a distribution of perceived load times. Some people load your pages quickly and others load them slowly. That’s the nature of the Internet: inconsistency. The key is that they don’t “trend” as a single datapoint that is the average of all.
The inconsistency in these data is interesting: it can be leveraged for improvements and advantage. Understanding (and eventually changing) the distribution of these data can radically change your business. There have been many articles written about web page load times, so in order to keep this fresh, I’ll discuss database transactions. The reason I’m jumping around here is that data are just data; this applies to every metric you can observe.
Understanding that your average database query takes 1.92ms to complete is, I’m sorry to say, useless. The problem is that you are likely running thousands or tens of thousands of queries per second and none of them are average. To illustrate this, here are three (contrived) database query latency histograms, each of 39 samples.
The interesting (and perhaps deceptive) part is that all three have an average latency across all queries of 1.92ms. Quite clearly, all depict radically different situations. The truth is, when you have a lot of data (thousands to hundreds of thousands of data points), the histogram reveals the information you seek and the average hides it.
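To see how easily this happens, here is a small Python sketch that builds three invented sets of 39 latencies, each averaging exactly 1.92ms. These are not the data behind the histograms in this post; the values are made up (and stored as whole microseconds to keep the arithmetic exact).

```python
from collections import Counter
from statistics import median

# Three hypothetical workloads, 39 samples each, in microseconds.
workloads = {
    "tight": [1820] * 13 + [1920] * 13 + [2020] * 13,
    "bimodal": [500] * 26 + [4760] * 13,    # fast cache hits, slow disk reads
    "long tail": [1000] * 38 + [36880],     # one pathological query
}


def mean_ms(samples):
    return sum(samples) / len(samples) / 1000


def histogram(samples):
    """Count samples per 1ms bucket, keyed by the bucket's lower bound in ms."""
    return dict(sorted(Counter(v // 1000 for v in samples).items()))


for name, samples in workloads.items():
    print(f"{name}: mean={mean_ms(samples):.2f}ms "
          f"median={median(samples) / 1000:.2f}ms")
    for bucket, count in histogram(samples).items():
        print(f"  {bucket:>2}ms {'#' * count}")
```

All three report a mean of 1.92ms, yet their medians are 1.92ms, 0.50ms and 1.00ms, and the bucket counts look nothing alike.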
Why is this so interesting? In computing, there are a lot of things we can witness by actively measuring them; this is what the Circonus you know and love has done. We figured it was time to change the game a bit and help you visualize, in real-time, the things that happen in your business: enter BizEKG.
BizEKG allows you to analyze events (like webpage loads, database queries, customer service telephone calls, etc.). Not just some, not just a sample, but all the events. From there, you can break them apart, run statistical analysis (including histograms, of course) and understand your data. There are a handful of real-time web analytics companies out there, but answering these questions in “Circonus style” changes the game entirely. What’s Circonus style?
We at Circonus believe that all data are important, not just web data. We believe that if you can’t see what’s happening right now, you are as good as blind. So take this real-time, multi-dimensional statistical analysis engine, feed it any data you want, and see it all in real-time.
With our snazzy new BizEKG service you can actually do what some might consider a sufficient level of black magic. You can decompose these events in real time and visualize these histograms in real time. Not only is this pretty cool… it’s pretty damn enlightening. BizEKG is a new service we’ve launched and deserves its own announcement; we’ll get to that soon.
The above histogram shows the last 60 seconds of page load times, in milliseconds, for a subsection of a current Alexa top 1000 site. Yes, 10,000ms is 10 seconds of page load time. Even on today’s Internet, loading a complex site over wireless from another country is... slow.