Monitoring Elasticsearch

With the much anticipated announcement of the Elasticsearch 1.0.0 release, we thought we’d mention that several of the features that you use within Circonus are powered by Elasticsearch behind the scenes.

We could never, in good conscience, run a product or service that we couldn’t extensively monitor. So, when it comes to monitoring things we say once again, “Yeah, we do that too.”

Adding Elasticsearch telemetry collection in Circonus is as easy as selecting the Elasticsearch check type and entering the node name. What comes back is a plethora of statistics from the cluster node.

{
  "cluster_name": "elasticsearch",
  "nodes": {
    "zB3lYhArQJCJgJ5szVr4uA": {
      "timestamp": 1392415145096,
      "name": "Hawkeye II",
      "transport_address": "inet[/10.8.3.13:9300]",
      "host": "client-10-8-3-13.dev.circonus.net",
      "indices": {
        "docs": {
          "count": 0,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 0,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 0,
          "index_time_in_millis": 0,
          "index_current": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
...

On an instance here, 382 gratuitous lines of JSON ensue, all of which we turn into metrics for trending and alerting.
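For the curious, the stats above come straight from Elasticsearch's node stats API, and flattening nested JSON like that into metric names takes only a few lines. Here's a rough Python sketch of the idea; the localhost URL and the backtick-joined metric names are illustrative assumptions, not our actual collection code:

import json
import urllib.request

def flatten(obj, prefix=""):
    """Recursively turn nested JSON into metric-name/value pairs."""
    metrics = {}
    for key, value in obj.items():
        name = f"{prefix}`{key}" if prefix else key
        if isinstance(value, dict):
            metrics.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            metrics[name] = value
    return metrics

# Assumed node address; point this at whichever node you configured in the check.
with urllib.request.urlopen("http://localhost:9200/_nodes/stats") as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    for name, value in sorted(flatten(node["indices"], "indices").items()):
        print(f"{node['name']} {name} = {value}")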

We use this to track the inserts, deletes, and searches performed on each node:

We’d also like to give a shout-out to the Elasticsearch crew for their successful release. As “metrics people,” I’m pleased to see that the old “*_time” metrics, which were not easily machine readable, have gone the way of the Dodo and “*_time_in_millis” style metrics have prevailed. You all made the most of the 1.0.0 opportunity to break things in a good way!

Understanding service latencies via TCP analysis

Latency is the root of all that is evil on the Internet… or so the saying goes. CPUs get faster, storage gets cheaper, IOPS are more plentiful, yet we feeble engineers have done close to nothing to improve the speed of light. While there is certainly a lower bound to latency rooted in the physical world, many of today’s services (delivered both globally and within the datacenter) have latencies that are affected by things under our control. From the low level to the high level, latencies suffer from service capacity issues, routing issues, systems and network tuning issues, bad code, bad deployments, human errors and even usage errors. You can’t improve what you can’t measure, so let’s get to measuring some latencies.

Most Internet services are pretty simple; a client calls you up and asks a question and you provide an answer. Due to this incredibly common pattern and the nuances of HTTP, it turns out to be trivial to measure service response times with nothing more than a packet sniffer (e.g. tcpdump).

“What is this nuance of HTTP?” you might ask. It is that HTTP requires the client to speak first. While the rest of this article can be applied to more complicated services where that is not true, doing so oftentimes requires a deeper understanding of the protocol at hand and more adroit tcpdump manipulation.

“Client speaks first” makes it very easy to blindly assess the service latency on the initial HTTP request on a TCP connection. Returning to the phone call metaphor, I can simply watch the phones and measure the time between the first ring and the first word spoken by the receiver; I don’t care much about what happens in between. Granted, there is a wealth of information in the packets themselves: client (IP), agent (User-Agent), endpoint (requested URI), etc. Yet, to measure overall service latency, I simply need the first TCP SYN packet in and the first non-zero payload packet out on a given TCP/IP connection.

What are we looking at?

At this point we should stop and understand a bit better what we are actually measuring. Wouldn’t it be easier to just ask the service we’re running to measure its service times and report them? The answer to this varies from architecture to architecture, but if the answer is “no,” you should have a birds-and-bees conversation with the software engineers or operations folk responsible for running the service. It should be absolutely trivial to track service times within the architecture.

Here, we’re asking something a tad more complex. Our measurements on the network side include an element of latency between the client and the server. What we are trying to estimate here is the elapsed time between the client’s attempt to retrieve content and the first byte of that content arriving.

If we measure from SYN receipt to data write, we miss two vital times from the user’s perspective. We didn’t get the SYN until sometime after they sent it, and they won’t see our data until sometime after we’ve sent it. Because each of these is a single packet, we just need to know the time it takes to roundtrip a packet. Where can we get that information?

; ping 2.15.226.34
PING 2.15.226.34 (2.15.226.34): 56 data bytes
64 bytes from 2.15.226.34: icmp_seq=0 ttl=249 time=39.022 ms
64 bytes from 2.15.226.34: icmp_seq=1 ttl=249 time=40.013 ms
64 bytes from 2.15.226.34: icmp_seq=2 ttl=249 time=34.300 ms
64 bytes from 2.15.226.34: icmp_seq=3 ttl=249 time=40.373 ms
64 bytes from 2.15.226.34: icmp_seq=4 ttl=249 time=33.129 ms
^C

It isn’t like we can ping the remote machine. Instead we need to calculate this passively. TCP has this really inconvenient 3-way handshake that starts up a session that goes something like:

  1. Client: “I’d like to ask you to tell me a joke.”
  2. Server: “Okay, ask away”
  3. Client: “Okay, I’d like you to tell me a joke about TCP.”
  4. Server: “These two packets walked into a bar…”

From the TCP nitty-gritty, if we measure the time from the first SYN to the subsequent ACK packet (before any data has transited), we have a rough estimation of the roundtrip time between the client and server. If we measure between observing (1) and sending (4), the part that we’re missing is the time between (1) being sent by the client and arriving at the server and (4) being sent and arriving at the client. That is approximately one roundtrip.

So, long story summarized: if we take the time between observing (1) and (4) and add the time between observing (2) and (3), we should have a rough approximation of the total time the client witnessed between making the request and receiving the first byte of the response. We’ll take a further shortcut and note that the latency between (1) and (2) is nominal on (at least our) systems, so we only track the timings of (1), (3) and (4) and calculate (4) – (1) + (3) – (1).
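To make the arithmetic concrete, here's a tiny Python sketch of that estimate, using timestamps lifted from the tcpdump capture shown further below (the helper function is mine, purely for illustration):

# Estimated time-to-first-byte as the client saw it:
#   (t_first_data - t_syn)  SYN arriving at the server -> first data byte leaving it
# + (t_ack - t_syn)         roughly one round trip, standing in for the two network
#                           legs we cannot observe from the server side
def estimated_ttfb(t_syn, t_ack, t_first_data):
    return (t_first_data - t_syn) + (t_ack - t_syn)

# Timestamps from the capture shown below (seconds since the epoch).
print(estimated_ttfb(t_syn=1370816555.510014,
                     t_ack=1370816555.542407,
                     t_first_data=1370816555.585086))   # ~0.107s to first byte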

Actually doing this.

We’ll use tcpdump on our server to try to find what we’re looking for: the SYN from the client that will start the session, the ACK from the client that completes the three-way TCP handshake (in response to the server’s SYN-ACK) and the first non-zero-length data payload from the server to the client.

For our example, let’s look at web traffic on port 80. First we need to know how much data is in the packet. The IP header has the total packet length (a 2-byte short starting at the second byte, which in BPF syntax is 'ip[2:2]'). However, we’re looking for the payload length, so we need to subtract the length of the IP header, '((ip[0]&0xf)*4)', and the length of the TCP header, '((tcp[12]&0xf0)/4)'. We expect one of the SYN or ACK bits to be set in the TCP flags and (because these are from the client) a destination port of 80:

'((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80)'

The data payload from the server to the client is simply a non-zero payload length with a source port of 80:

'(((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80)'
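If you want to convince yourself the header arithmetic is right, here's a small Python sketch that applies the same math to a hand-built IPv4/TCP SYN packet (the packet bytes are contrived for illustration):

import struct

def tcp_payload_len(packet: bytes) -> int:
    """Same arithmetic as the BPF filters: ip[2:2] minus the IP and TCP header lengths."""
    total_len = struct.unpack("!H", packet[2:4])[0]    # ip[2:2]
    ip_hdr = (packet[0] & 0x0F) * 4                    # ((ip[0]&0xf)*4)
    tcp = packet[ip_hdr:]
    tcp_hdr = (tcp[12] & 0xF0) // 4                    # ((tcp[12]&0xf0)/4)
    return total_len - ip_hdr - tcp_hdr

# A hand-built 40-byte SYN: 20-byte IP header + 20-byte TCP header, no payload.
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0, bytes(4), bytes(4))
tcp_header = struct.pack("!HHIIBBHHH", 57300, 80, 0, 0, 0x50, 0x02, 0, 0, 0)
print(tcp_payload_len(ip_header + tcp_header))         # 0 -> matches the first filter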

To test this, I’ll further restrict this to the host 10.10.10.10 (my laptop) and run tcpdump. I don’t want to resolve IP addresses to names and I would like to keep the output terse with high-precision timestamps: (-n -ttq). Additionally, I need at least 54 bytes of the payload to get the TCP headers (this is left as an exercise for the reader). Putting this all together in a tcpdump command (and limiting the scope to my client at 10.10.10.10), we can see something like this:

/opt/omni/sbin/tcpdump -n -ttq ' ((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80) or (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80) and host 10.10.10.10'
listening on us340, link-type EN10MB (Ethernet), capture size 384 bytes
1370816555.510014 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.542407 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.585086 IP 2.15.226.34.80 > 10.10.10.10.57300: tcp 458
1370816555.617389 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.617398 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.617622 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0

The first three lines are exactly what we’re looking for. The subsequent lines would be (in this case) the shutting down of this TCP connection. These packets are harder to avoid using BPF syntax and we’ll figure out a way to ignore them as needed.

Doing something useful.

So, it is clear that on any high-volume service, these packets are going to be coming in quickly and we’ll need to process these latency calculations programmatically.

This is a bit more complicated, so I whipped together a small node.js program that runs tcpdump, consumes its output, calculates the estimated time-to-first-byte latencies and securely pushes them up to a specific Circonus httptrap check.

The only two arguments you need to supply are the UUID and secret from your httptrap in Circonus. Once running, the script will submit all latencies to one metric named “aggregate`latency” and another named “<hostname>`latency” (where <hostname> is the hostname of the machine on which it is running). This allows you to point a handful of machines at the same httptrap check and collect both their individual latencies and a recalculated aggregate set of latencies.
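For the curious, the submission side of that script is nothing exotic. Here's a hedged Python sketch of reporting a latency sample to an httptrap check; the trap URL shape and payload layout shown here are assumptions for illustration, so copy the exact submission URL from your check's page rather than trusting this snippet:

import json
import socket
import urllib.request

CHECK_UUID = "your-httptrap-uuid"      # from the httptrap check in Circonus
SECRET = "your-httptrap-secret"

def submit(latency_seconds):
    """Report one latency sample under both a per-host and an aggregate metric name."""
    host = socket.gethostname()
    payload = {
        f"{host}`latency": latency_seconds,
        "aggregate`latency": latency_seconds,
    }
    # Assumed submission URL shape -- copy the real one from your check's page.
    url = f"https://trap.noit.circonus.net/module/httptrap/{CHECK_UUID}/{SECRET}"
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status

submit(0.107)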

Now for the magic:


In the above graph, we can see three distinct nodes and the estimated latency (in seconds) on delivering the first byte of data to the client. Every single client TCP connection is accounted for. There is no sampling or missing data; we see a comprehensive and complete picture of our service latencies. Talk about an eye-opener.

This requires only 4 metrics in Circonus, which costs less than $2/month. Unparalleled value.

Sometimes you just need a different hammer

Circonus has a lot of powerful tools inside, but as anyone who has worked with real data knows: if you can’t get your data out into the tool you need, you’re going to suffer. We do all sorts of advanced analysis on telemetry data that is sent our way, but the systems we use to do that are somewhat isolated from the end user. While simple composites and math including population statistics are available in “point-and-click” fashion, more complicated, ad-hoc analysis on the “back-end” can only be done by Circonus engineers.

The “back-end”/”front-end” delineation is quite important here. In an API-driven service like Circonus, the front-end is you or, more specifically, anything you want to write against our APIs. Oftentimes, when using services, there are two types of API consumers: third-party tools and internal tools. Internal tools are the Python, Ruby, or Java programs you write to turn knobs, register checks, configure alerts and otherwise poke and prod at your Circonus account. Third-party tools are simply the tools that someone else authored and released for your use.

On to data languages.

So, where is this going? The traditional scripting languages aren’t really the limit of API consumers when it comes to big data services like Circonus. If you think about the fact that you’ve been dumping millions or billions of telemetry points at Circonus, some of our backend systems look a lot more like a data storage service than a telemetry collection orchestration service. More languages enter the fray when the APIs return data… such as R.

Working with Circonus data from within R is easy. After installing the circonus R package, you simply create a token to use the API with your Circonus account and then use that UUID to establish a connection within R:

api <- circonus.default('06178a18-0dff-0b0e-0794-0cc3fd350f04')

It's worth noting that the first time you run this command after creating your token, you'll get an error message saying that you need to authorize it for use with this application. Just go back to the API tokens page and you'll see a button to allow access to R. Once authorized, it will continue to work going forward.

Getting at data.

In the following Circonus graph, I'm looking at inbound and outbound network traffic from a node in Germany over a two month period starting on 2012/09/01.

If I click on the "view check" link in the legend, it takes me to check 111. Using that little bit of information, I can pull some of that data right into R. Assuming I want to pull the outbound network traffic, Circonus tells me the metric name is "outoctets."

> checkid <- 111
> metric <- "outoctets"
> period <- 300
> start_time <- '2012-09-01 00:00:00'
> end_time <- '2012-11-01 00:00:00'
> data <- circonus.fetch_numeric(api,checkid,metric,start_time,end_time,period)

The above R session is fairly self-explanatory. I'm pulling the "outoctets" metric from check 111 over the two-month time period starting on September 1st, 2012 at a granularity of 300 seconds (5 minutes).

Working with data.

This will give me several columns of numerical aggregates that I can explore. Specifically, I get a column that describes the time intervals, and columns corresponding to the number of samples, the average value, and the standard deviation of those samples, as well as the same information for the first-order derivatives over that time series data. All of this should look familiar if you have ever created a graph in Circonus, as the same information is available for use there.

> names(data)
[1] "whence"            "count"             "value"            
[4] "derivative"        "counter"           "stddev"           
[7] "derivative_stddev" "counter_stddev"

September and October combined have 61 days. 61 days of 5 minute intervals (288 per day) should result in 17568 data points in each column.

> length(data$value)
[1] 17568

Our values of outoctets are (you guessed it) in octets and I'd like those in bits, so I need to multiply all the values by 8 (in the networking world, octets and bytes are different names for the same thing, and there are 8 bits in a byte). The $value column holds byte counts, and to see bandwidth we want the first-order derivative, which is the $counter column. Let's now ask a question that would be somewhat tricky via the point-and-click interface of Circonus: "Over the 5 minute samples in question, what was the minimum bandwidth, maximum bandwidth and, while we're at it, the 50th (median), 95th, 99th and 99.9th percentiles?"

> quantile(data$counter * 8, probs=c(0,0.5,0.95,0.99,0.999,1))
        0%        50%        95%        99%      99.9%       100% 
  221682.7  4069960.6  9071063.1 10563452.4 14485084.9 17172582.0 

For those that don't use R, everything statistical is just about this easy... After all, it is a language designed for statisticians crunching data. We can also quickly visualize the same graph in R.

We can see it has the same contour and values as our Circonus graph albeit far less beautiful!

I often pull data into R to perform discrete Fourier transforms (fft) to extract the frequency domain. It can help me programmatically determine if the graph has hourly/daily/weekly trends. That, however, is a bit too complicated to dive into here.
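As a small taste, though, here's a rough sketch (in Python/numpy rather than R, purely for illustration) of pulling the dominant period out of an evenly spaced series like the one above:

import numpy as np

def dominant_period_seconds(values, sample_period=300):
    """Return the period (in seconds) of the strongest non-DC frequency component."""
    x = np.asarray(values, dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))       # drop the flat (DC) component
    freqs = np.fft.rfftfreq(len(x), d=sample_period)
    peak = spectrum[1:].argmax() + 1                   # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Synthetic stand-in: a daily cycle sampled every 5 minutes for 61 days, plus noise.
t = np.arange(61 * 288) * 300.0
series = 5e6 + 1e6 * np.sin(2 * np.pi * t / 86400) + np.random.normal(0, 1e5, t.size)
print(dominant_period_seconds(series))                 # ~86400, i.e. a daily trend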

As a Circonus user, plug in some R! Please drop us a note to let us know what awesome things you are doing with your data and we will work to find a way to cross-pollinate within the Circonus ecosystem.

Understanding Data with Histograms

For the last several years, I’ve been speaking about the lies that graphs tell us. We all spend time looking at data, commonly through line graphs, that actually show us averages. A great example of this is showing average response times for API requests.

The above graph shows the average response time for calls made to an HTTP REST endpoint. Each pixel in this line graph is the average of thousands of samples. Each of these samples represents a real user of the API. Thousands of users distilled down to a single value sounds ideal until you realize that you have no idea what the distribution of the samples looks like. Basically, this graph only serves to mislead you. Having been misled for years by these graphs with little recourse, we decided to do something about it and give Circonus users more insight into their data.

Each of these pixels is the average of many samples. If we were to take those samples and put them in a histogram, it would provide dramatically improved insight into the underlying data. But a histogram is a visually bulky representation of data, and we have a lot of data to show (over time, no less). When I say visually bulky what do I mean? A histogram takes up space on the screen and since we have a histogram of data for each period of time and hundreds of periods of time in the time series we’d like to visualize… well, I can’t very well show you hundreds of histograms at once and expect you to be able to make any sense of them; or can I?

Enter heat maps. Heat maps are a way of displaying histograms using color saturation instead of bar heights. So heat maps remove the “bulkiness” and provide sufficient visual density of information, but the rub is that people have trouble grasping them at first sight. Once you look at them for a while, they start to make sense. The question we faced is: how do we tie it all together and make it more accessible? The journey started for us about six months ago, and we’ve arrived at a place that I find truly enlightening.
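If heat maps are new to you, it may help to see how little machinery is underneath. Here's a toy Python sketch of turning (timestamp, latency) samples into the per-time-bucket histograms a heat map shades; the bin sizes are arbitrary and this is illustrative, not how Circonus stores data:

from collections import Counter

def heatmap_bins(samples, time_bin=60, value_bin=1.0):
    """samples: (epoch_seconds, latency_ms) pairs.
    Each column of a heat map is just the histogram for one time bucket,
    drawn with color saturation instead of bar height."""
    counts = Counter()
    for ts, latency in samples:
        counts[(int(ts // time_bin), int(latency // value_bin))] += 1
    return counts

samples = [(0, 22.5), (10, 23.1), (20, 23.7), (65, 24.2), (70, 95.0)]
for (t_bucket, ms_bucket), n in sorted(heatmap_bins(samples).items()):
    print(f"t={t_bucket * 60}s  {ms_bucket}-{ms_bucket + 1}ms  count={n}")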

Instead of a tutorial on histograms, I think throwing you into the interface is far more constructive.

The above graph provides a very deep, rich understanding of the same data that powered the first line graph. This graph shows all of the API response times for the exact same service over the same time period.

In my first (#1) point of interest, I am hovering the pointer over a specific bit of data. This happens to be August 31st at 8pm. I’ll note that not only does our horizontal position matter (affecting time), but my vertical position indicates the actual service times. I’m hovering between 23 and 24 on the y-axis (23-24 milliseconds). The legend shows me that there were 1383 API calls made at that time and 96 of them took between 23 and 24 milliseconds. Highlighted at #3, I also have some invaluable information about where these samples sit in our overall distribution: these 96 samples constitute only 7% of our dataset, 61% of the samples are less than 23ms and the remaining 32% are greater than or equal to 24ms. If I move the pointer up and down, I can see this all dynamically change on-screen. Wow.

As if that wasn’t enough, a pop-up histogram of the data from the time interval over which I’m hovering is available (#2) that shows me the precise distribution of samples. This histogram changes as I move my pointer horizontally to investigate different points in time.

Now that I’ve better prepared you for the onslaught of data, poke around a live interactive visualization of a histogram with similar data.

With these visualizations at my disposal, I am now able to ask more intelligent questions about how our systems behave and how our business reacts to that. All of these tools are available to Circonus users and you should be throwing every piece of data you have at Circonus… just be prepared to have your eyes opened.

What’s in a number?

Numbers, numbers, numbers; we’re all about numbers here at Circonus. We have trillions of data points which we feed into a slew of algorithms and processes to help our users identify problems with their data. But what are these numbers? It turns out that isn’t an easy question to answer.

Like most monitoring systems, Circonus performs an action from which it extracts one or more “metrics.” A common example is running a database query and measuring both the correctness of the result (as a boolean: good vs. bad) and the latency with which the answer was delivered. Similarly, it could load a web page, ensure that some specified content is successfully returned and measure the time it took. More concretely, when performing an HTTP transaction, it could obtain the following useful metrics: time to establish the TCP connection, time until the first byte of data is received, and time until the last byte of data is received. These measurements can reveal a variety of problems both on the surface of your architecture as well as provide indications of issues deep within.
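As an illustration of the kind of synthetic measurement we're talking about, here's a rough Python sketch that times the TCP connect, first byte and last byte of a single HTTP GET (example.com is a stand-in, and a real check also handles TLS, redirects, timeouts and so on):

import socket
import time

def http_timings(host, port=80, path="/"):
    """Return (connect_seconds, first_byte_seconds, last_byte_seconds) for one GET."""
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=10)
    connect = time.monotonic() - start
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode())
    sock.recv(1)                                  # block until the first byte arrives
    first_byte = time.monotonic() - start
    while sock.recv(65536):                       # drain until the server closes
        pass
    last_byte = time.monotonic() - start
    sock.close()
    return connect, first_byte, last_byte

print(http_timings("example.com"))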

While most monitoring systems (and parts of Circonus) work this way, the nature of these metrics is most interesting in what it is missing. In other words, it is vital to understand what they do not tell you. You are not observing real information; instead you are producing a single synthetic event and measuring it. The data are not real (and worse, may be far from representative.) Before I dive in and talk about why these data aren’t “good,” I’ll talk a bit about why they are “good enough” for many things.

Synthetic measurements work very well for components that can be measured in terms of quantities or rates. How many of something do you have? How quickly is it increasing or decreasing? Simple things like this are: disk space, I/O operations per second, the number of HTTP requests serviced, CPU usage, memory usage, etc. The most important factor is that these things are one-dimensional.

Data like these are both easy to visualize and critically important for things like anomaly detection and capacity planning. Being of a single dimension, understanding patterns in the data is easier for both humans and computers. However, as we start combining these data points, the world goes quickly out of focus.

For the moment, let’s assume we measure total money spent on an e-commerce site (you’d be crazy to not measure this.) In addition to that, we measure total transactions performed (number of sales.) With these metrics, we have some clear data: total dollars and dollars/hour (by deriving the samples) and total sales and sales/hour (again by deriving.) These numbers are pretty clear and we can make some good judgments about what to expect from day to day. However, you might ask, “How much is the average transaction size?” The answer to this question is simple: total money spent divided by total sales. Unfortunately, the average is not a useful number; just ask any statistician.

When you start looking at averages, you start losing information. We use averages to zoom out on graphs; you might notice that when you have a sudden spike (let’s say in traffic) you will see a much higher spike when zoomed in than when zoomed out. Why? If you were serving between 2900 and 3300 requests per second between 7pm and 8pm except for a sudden spike of 5400 requests per second between 7:40 and 7:45, you would see that on a graph showing 5 minute averages. However, on a graph zoomed out far enough to show only 20 minute averages, you’d see a deceptively small spike of around 3600 rps at that time period. As long as you can zoom in on the time series, it can be an acceptable compromise to reduce the data volume down to something consumable by a mere human being. Then the obvious question is: when does this go horribly wrong?
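Before answering that, a quick back-of-the-envelope sketch of the zoom-out arithmetic above; the flat ~3000 rps baseline is an assumption for illustration:

# The same hour of traffic seen at two averaging windows (assumed ~3000 rps baseline).
per_minute = [3000] * 60
per_minute[40:45] = [5400] * 5               # the 7:40-7:45 spike

def window_averages(values, minutes):
    return [sum(values[i:i + minutes]) / minutes
            for i in range(0, len(values), minutes)]

print(max(window_averages(per_minute, 5)))    # 5400.0 -- the spike is obvious
print(max(window_averages(per_minute, 20)))   # 3600.0 -- the spike all but disappears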

Let’s look at something like web page load times. If you run a synthetic transaction, always from the same location, you can track measurements in that single dimension. Things should be somewhat consistent and these numbers are useful. However, they do not tell you how fast your site is. Only your users know that. Interestingly, since your users access your web site, you can actually have them report that information back to you. In fact, this is how most web analytics systems work. The interesting part here is that you have a wide variety of data coming in representing a distribution of perceived load times. Some people load your pages quickly and others load them slowly. That’s the nature of the Internet: inconsistency. The key is that they don’t “trend” as a single datapoint that is the average of all.

The inconsistency in these data is interesting: it can be leveraged for improvements and advantage. Understanding (and eventually changing) the distribution of these data can radically change your business. There have been many articles written about web page load times, so in order to keep this fresh, I’ll discuss database transactions. The reason I’m jumping around here is because data are just data — this applies to every metric you can observe.

Understanding that your average database query takes 1.92ms to complete is, I’m sorry to say, useless. The problem is that you are likely running thousands or tens of thousands of queries per second and none of them are average. To illustrate this, here are three (contrived) database query latency histograms each of 39 samples.

The interesting (and perhaps deceptive) part is that all three have an average latency across all queries of 1.92ms. Quite clearly, all depict radically different situations. The truth is, when you have a lot of data (thousands to hundreds of thousands of data points), the histogram reveals the information you seek and the average hides it.
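To make that concrete, here are three contrived sets of 39 latencies, each built to average exactly 1.92ms while describing completely different services (these numbers are mine for illustration, not the ones behind the charts above):

from statistics import mean

tight     = [1.92] * 39                       # every query takes about the same time
bimodal   = [0.96] * 26 + [3.84] * 13         # fast cache hits plus slow misses
long_tail = [1.00] * 38 + [36.88]             # almost all fast, one awful outlier

for name, samples in [("tight", tight), ("bimodal", bimodal), ("long tail", long_tail)]:
    print(f"{name:9s} mean={mean(samples):.2f}ms  max={max(samples):.2f}ms  "
          f"over 3ms={sum(s > 3 for s in samples)}")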

Why is this so interesting? In computing, there are a lot of things we can witness by actively measuring them; this is what the Circonus you know and love has done. We figured it was time to change the game a bit and help you visualize, in real-time, the things that happen in your business: enter BizEKG.

BizEKG allows you to analyze events (like webpage loads, database queries, customer service telephone calls, etc.). Not just some, not just a sample, but all the events. From there, you can break them apart, run statistical analysis (including histograms, of course) and understand your data. There are a handful of real-time web analytics companies out there, but answering these questions in “Circonus style” changes the game entirely. What’s Circonus style?

We at Circonus believe that all data are important, not just web data. We believe that if you can’t see what’s happening right now, you are as good as blind. So take this real-time, multi-dimensional statistical analysis engine, feed it any data you want, and see it all in real-time.

With our snazzy new BizEKG service you can actually do what some might consider a sufficient level of black magic. You can decompose these events in realtime and visualize these histograms in realtime. Not only is this pretty cool… it’s pretty damn enlightening. BizEKG is a new service we’ve launched and deserves its own announcement; we’ll get to that soon.

The above histogram shows the last 60 seconds of page load times, in milliseconds, for a subsection of a current Alexa top-1000 site. Yes, 10,000ms is 10 seconds of page load time. Even on today's Internet, loading a complex site over wireless from another country is... slow.

Lost In Translation

For more than ten years, OmniTI has been making large-scale critical Internet infrastructure work. It is, obviously, not black magic or voodoo. Perhaps not so obviously, it is not technical competence that leads to success here. I like to think our team has technical competence in spades, as we have an impeccable track record, authored books and a laundry list of speaking engagements to justify it. However, technical competence alone would fall short of the mark, far short.

Without exception, it is expected that proper monitoring and trending are as much a part of the process as setting up networking, backups, and more recently, change management. And yet, when you ask someone to explain why monitoring and trending are vital, you’d be lucky to get a response other than “to be sure things are working”. Something here is lost in translation.

Disconnected Viewpoints

Every business owner knows that watching the books is part of the job. You need to know P&L, you need to understand the outputs and costs of your various business units and you track efficiencies everywhere. All of these metrics play a part in both strategic and tactical decisions made every day. Each business unit reports these things and while in good organizations each manager knows what is important to each other manager, something is still lost in translation. Far too often, managers don’t understand that what they produce, what they consume and how they work changes the game for other business units. While the word is overused and abused, every business is an ecosystem. It is obvious that a new marketing campaign will increase resource utilization on the sales teams. It should be obvious that a new marketing campaign will increase resource utilization on IT infrastructure as well.

Every systems administrator knows (or should know) that monitoring your architecture is fundamental. On the other hand, very few can explain in any detail why this is so important. “Because you lose money when systems are offline”, they’ll quote disparagingly. Ask how much and you might catch them at a loss. From my own experience in operations, as well as countless conversations with customers and vendors, very few individuals recognize the relationship between IT and Business. Systems people know that they have to keep systems and services running to support their business, but rarely do they understand that relationship completely.

Owners that foster a transparent and cohesive organization around key performance indicators in every business unit (even those that are cost centers) will change their organizations in two critically useful ways:

  • Efficiencies between business units. With increased transparency, staff in all positions will see the effects of their actions across the business as a whole. This produces an atmosphere of self-reinforcing efficiency.
  • Accountability to the overall business. The hokey old question: “Is what you’re doing good for the company?” changes form. With increased cohesiveness, the answer to that question is a more obvious outcome to every action and no one can call it hokey, because it is always answered without being asked.

A Call To Arms

Technology is no longer underneath the products you sell and the processes by which you deliver them. It is, for at least the immediate future, intertwined. Creativity on the technology side doesn’t only deliver cost savings, it creates new audiences and increases interaction with your customers. You have to do more than embrace technology, you need to leverage it and let new opportunities catapult your business forward.

As intertwined as technology is, we can no longer afford to have its operational details hidden away in the bowels of the “tech ops” or “web ops” group. We need visibility and we need cohesion. Infrastructure/application engineering and other business units are now, more than ever before, on the same team marching towards success. Communication and accountability are critical to success.

Here is where I leave you and hope that you will think about the metrics you monitor in a different light. They represent something more. They are there to make the business run, increase shareholder value, make your customers happier and more prosperous.

Past Performance: does this look right to you?

If you are like me, you look at a lot of data. I look at data in spreadsheets, I look at data on P&L statements, I look at term sheets, I look at systems data — a lot of systems data. I find the best way to look at data is to visualize it because it is the fastest way to get data into the amazing pattern matcher that is the human brain.

The human brain is quite good at saying “this is abnormal” and can usually even articulate why. This curve has a periodicity, that one a monotonic behavior, another is simply always flat… then they “change.” When we say “this visualization looks wrong,” we are almost always onto something real in the numbers. I’ll give you a simple visual example:

While there is obviously something starting at 8pm, we are only left with another question: “is it out of the ordinary?” It doesn’t resemble anything else today, and it doesn’t appear to resemble the day before. What about last week? Let’s start the graph one week earlier:

This tells us a lot. It looks like we had a very similar event last week at this time. With most analysis tools, you stop here (or you hover with your mouse and try to correlate start/end times and magnitude to better understand how these two events resemble each other).

With Circonus, we don’t leave it here. Instead, we provide tools to help compare time-separated events using our data overlay feature. We can take our original two-day view and overlay the data from last week right on top of it (or, in this case, underneath it).

Just two clicks and we’ve got a one-week offset data overlay and the visualization lends a little insight into what is going on. We can see the start times are identical, but the event from this week ends about 30 minutes before the one from last week — largely the same though.

Again, we find that visuals help. Understanding how these graphs differ even when they are right on top of each other can be a bit challenging. Never fear! We’ve added help in the legend.

The legend takes on some new features when data overlays are in use. You now get a very clear, side-by-side read-out of the data in the graph, including percentage differences. Additionally, the arrows that say “you’re higher than you were last week” become more saturated (redder) as the difference in the data increases and fade to light grey if the two values are more similar. This makes it simple to quickly understand how current performance really compares to past performance. So, the interesting part of this graph is actually the subsequent spike of inbound traffic that is up 95% over last week. That’s something to look into.

Capacity Planning Made Easy

Okay, so capacity planning will never be foolproof. You simply cannot predict the future. However, some of the time you have a darn good idea of what the future will hold. Since someone knows what is likely to happen, why is it so hard to plan marketing initiatives, funnels and IT provisioning?

The reason is that things aren’t always linearly correlated. What’s that mean? Linear correlation goes something like this: if A depends upon B and I want twice as much A, I’ll need twice as much B. While correlating non-linear systems is what I like to call BFM, a lot can be done with linear regressions. The problem with any regression is that you need to put real numbers in, get real numbers out and understand how good they are.

When we look at how something grows, one of the most common tools in the statistics arsenal is a least-squares linear regression. That is: given a set of datapoints, what line best fits them? So, let’s say we have a lot of datapoints (boy do we have a lot of datapoints!). Now what does a linear regression tell us?

Let’s assume we’re looking at some traffic data over the month of December.

In this graph, it can be very hard to answer questions about the nature of the data. Two common questions are:

  1. are we growing or shrinking and by how much?
  2. if we stay on the current growth path, where will we be some point in the future?

Enter the linear regression:

Answering the first question is pretty simple now. We can look at the value on the left side of the graph and on the right side of the graph and do the math. You can’t see it in the screenshot, but the left and right values are 5.49M and 5.88M, which is roughly a 6.6% growth over 4 weeks. Now, any statistician will scream bloody murder about confidences in the data and model, and any engineer will simply ask: “does that make sense?” Maybe we’ll look over 8 weeks and twelve weeks also to make sure that we build our confidence (this can be easier, though far less scientific, than understanding R² values – which are, of course, available as well). Honestly, I personally find that reconciling this with my expectations is one of the better methods of trusting the model.

Let’s assume that we expected some increase in resource usage during this time frame and that 6% is reasonable. Now on to the next question: where will we be in the future? In Circonus, we just jump up and extend our view window out one year and we can see what our model looks like in the future:

Next December we’ll be using 10.91M (this just happens to be MBits/s of network bandwidth to serve origin dynamic content on one of the sites managed over at OmniTI). We’ll revisit this month by month to ensure that we are indeed heading where we expected. It allows engineers, marketers and executives alike to put real numbers into (what we call) napkin math, which adds peace of mind and clarity and makes what-if pontification easier for most people. I can tell you one thing… we sleep better at night knowing specific numbers about a probable future.
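If you want to do the same napkin math outside of Circonus, a least-squares fit and a one-year extrapolation are only a few lines of numpy; the series below is synthetic and merely shaped like the graph above:

import numpy as np

days = np.arange(28)                                   # four weeks of daily samples
traffic = np.linspace(5.49e6, 5.88e6, 28) + np.random.normal(0, 5e4, 28)

slope, intercept = np.polyfit(days, traffic, 1)        # least-squares line of best fit

def fit(day):
    return slope * day + intercept

print(f"growth over the window: {(fit(27) - fit(0)) / fit(0):.1%}")
print(f"projected a year out:   {fit(27 + 365) / 1e6:.2f}M")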

Enterprise Agents

If you’re like me, your first response to SaaS monitoring was: “You can’t see my machines/services/metrics from your cloud. That won’t be too useful.” With a little bit of thought, it’s pretty easy to arrive at the conclusion that you must run something on your infrastructure to bridge the divide. It was a fun and exciting project here to build that magic something called the Circonus Enterprise Agent.


The Circonus Enterprise Agent (we’ll call it our “EA” from here on) is all of our magic monitoring software bundled into a maintainable VMWare virtual appliance that can be run on your internal networks to track stuff that the public shouldn’t be seeing. We had some interesting choices to make during development and I thought I’d share what they were and why we made them.

Choosing a platform

Most of our internal infrastructure runs on some variant of OpenSolaris technology. We chose this for a variety of reasons. Most importantly, storing your precious data on ZFS seemed like the right thing to do. After that, the fault management architecture (FMA) available in OpenSolaris allows us to keep our machines and services running more reliably. Reliability and data permanence are the two most important factors in technology selection here at Circonus (a fact our customers respect).

So, with all this talk about OpenSolaris and its advantages you’d imagine we built our EA on the same technology, right? Not so simple. For a virtual appliance image that is easy to administer and easy to upgrade in the field you need a good package management system. OpenSolaris simply falls on its face there. Oracle’s promises of IPS (the new and coming package management system for Solaris 11) are quite compelling, but that is just a promise today. Instead, we turned to the tried and true CentOS Linux-based platform for our EA.

CentOS provides all the features we need to run our agent software, manage package upgrades and distribution seamlessly and simply, and the core operating system is both stable and secure. In an interesting later development, we provide Joyent customers the ability to run an EA on one of their Joyent SmartMachines. Joyent’s operating architecture is actually derived from OpenSolaris — so we ended up porting our EA back to our core platform as well.

Today, the EA is available in two forms: a CentOS 5 VMWare-based appliance and a Joyent SmartMachine.

From where do you manage the appliance?


While most appliances have a web console that allows a variety of management tasks, we made a simple choice to have the appliance administrable via the main circonus.com web application. This is where Circonus users interface with all their data and set up their monitors, so it only made sense to also administer their EA from the same place.

After using the system for a while now, I can say that I’m really pleased with this decision. The cohesiveness of scheduling checks on your private EA and/or the world-wide Circonus agents through the same check creation interface is a simple pleasure. One single world-wide view of all the agents on which you can schedule checks makes it simple to understand how the monitoring system works.

What to automate

Generally speaking, when you think appliance, you think self-maintaining. That’s not an unreasonable expectation. However, this directly conflicts with our experience in operations. In operations, automatic upgrades of software are strictly taboo. Typically, the operations crew wants to schedule precisely when an upgrade will occur, be present and have a bulletproof evaluation and rollback plan. When you start talking about critical infrastructure like monitoring, “typically” becomes “always.”

With this in mind, we made the upgrade process on the EA completely automated, but not automatic. One click and the appliance will self-upgrade. Currently, this is the only ongoing task that is done from the appliance itself (rather than the circonus.com portal), but we’re looking to make some nice enhancements there as well. Soon, you’ll be able to trigger remote EA upgrades directly from the web application.

What you get

With an EA you get to leverage the power of Circonus against all of your private data. Networks, systems, applications and business systems that are only accessible via internal infrastructure can be monitored via an Enterprise Agent. The data is fed back to the Circonus cloud in real-time. All of that data can be alerted on, and is available for correlation, trending and planning purposes through the excellent Circonus tools you already know and love.

Finding Needles in a Worksheet

Traditional graphing tools can help you plan for growth or even narrow down root causes after a failure. But they have a reputation for being difficult to set up, navigate or customize. It’s nice to be able to just point Cacti at some switches or routers and have it gracefully poll each device for SNMP data. Yet when you need a custom perspective of the data (or collections of data), it can be an arduous experience setting up templates and graphs.

When we started to engineer Reconnoiter into a SaaS offering, one of the major driving forces was a desire to not suck like the others. Like you, we don’t understand why it has to be so damn hard (or require a dedicated IT staff) to take a handful of data points and correlate them into graphs that make sense of the noise. I like to think we’ve been successful. Customers have been overwhelmingly positive about our efforts, calling it “a graph nerd’s paradise”. Even still, we eat our own dog food and are constantly revisiting the service to look for better ways to get our work done. This is why we’re working hard on upcoming features like Graph Overlays and Timeline Annotations. And it’s also why we made recent changes to the workflow for graphs and worksheets.

If you’re a Circonus user, you already know how easy it is to create and view graphs. Adding them to worksheets gives you a page full of data to compare and relate. Choose a zoom preset (2 days, 2 weeks, etc) or select a date range, and all of the thumbnails are instantly redrawn in unison. It might sound basic, but it can be very useful if you’re not sure what you’re looking for. Unexpected patterns jump out at you pretty quickly.


However, most of the time you want to work with a single graph. Clicking on a thumbnail previously loaded a graph in “lightbox” view, hiding all other graphs from sight and letting you focus on the work at hand. This worked well most of the time, but had one big drawback… you couldn’t (easily) bookmark it. So we’ve moved the default view into its own page, sans lightbox, that can be bookmarked and shared with others. Miss the lightbox view? No worries, we’ve kept that as the new preview mode. Try it out in a worksheet for “flickr-style” navigation.

Here’s a short video I threw together to demonstrate some of these changes. There was some audio lag introduced by the YouTube processing, but it should be easy enough to follow along. If you’d like to see more examples like this one, shoot us an email and we’ll try to keep them coming.