Updates From The Tech Team

Now that it is fall and the conference season is just about over, I thought it would be a good time to give you an update on some items that didn’t make our change log (and some that did), what is coming down the road shortly, and generally what we have been up to.

CEP woes and engineering salvation.

The summer started out with some interesting challenges involving our streaming event processor. When we first started working on Circonus, we decided to go with Esper as a complex event processor (CEP) to drive fault detection. Esper offers some great benefits and a low barrier to entry for stream processing: it places your events into windows that are analogous to database tables, and then gives you the ability to query them with a language akin to SQL. Our initial setup worked well and was designed to scale horizontally (federated by account) if needed. Due to demand, we started to act on this horizontal build-out in mid-March. However, as more and more events were fed in, we quickly realized that even when giving an entire server to one account, the volume of data could still overload the engine. We worked on our queries, tweaking them to get more performance, but every gain was wiped away by a slight bump in data volume. This came to a head near the end of May when the engine started generating late alerts and alerts with incorrect data. At that point, too much work had been put into making Esper work for not enough gain, so we started on a complete overhaul.
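For readers who have not used Esper, a toy version of that model might look like the sketch below. The event shape, window size, and query are invented for illustration, the client API shown is the classic com.espertech.esper.client one (which differs across Esper releases), and none of this is our actual fault-detection logic.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;

/** Toy illustration of the Esper model: events land in a time window and are queried like a table. */
public final class CepSketch {

    /** Hypothetical event bean; real check events carry far more than this. */
    public static class MetricEvent {
        private final String checkId;
        private final double value;
        public MetricEvent(String checkId, double value) { this.checkId = checkId; this.value = value; }
        public String getCheckId() { return checkId; }
        public double getValue()   { return value; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("MetricEvent", MetricEvent.class);
        EPServiceProvider ep = EPServiceProviderManager.getDefaultProvider(config);

        // Events from the last five minutes behave like rows in a table we can query with EPL.
        EPStatement stmt = ep.getEPAdministrator().createEPL(
            "select checkId, avg(value) as avgValue "
          + "from MetricEvent.win:time(5 min) group by checkId");

        // Fire whenever the rolling aggregate changes.
        stmt.addListener((newEvents, oldEvents) -> {
            if (newEvents == null) return;
            for (EventBean e : newEvents) {
                System.out.printf("check %s rolling avg %.2f%n", e.get("checkId"), e.get("avgValue"));
            }
        });

        // Feed the stream.
        ep.getEPRuntime().sendEvent(new MetricEvent("check-1", 42.0));
    }
}
```

The appeal is obvious: you describe the aggregate you care about and the engine maintains the windows for you. That generality is also the bookkeeping you pay for on every event, and at our volumes it is where we hit the wall.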

The new system was still in Java, but this time we wrote all the processing code ourselves. The improvement was incredible: events that once took 60ms to process now took on the order of 10µs. To validate the system, we split the incoming data stream onto both the old and new systems and compared the data coming out. The new system, as expected, found alerts faster, and whenever we saw a discrepancy, the new system was found to be correct. We launched this behind the scenes for the majority of users on May 31st and completed the rollout on June 7th. Unless you were one of the two customers affected by the delinquency of the old system, this mammoth amount of work got rolled out right under your nose and you never even noticed; just the way we like it. In the end we collapsed our CEP system from 3 (rather saturated) nodes back to 1 (almost idle) node and have a lot more faith in the new code. Here is some eye candy showing CEP processing time in microseconds over the last year. The green, purple and blue lines are the old CEP being split out, and the single remaining green line is the current system.
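We have not published the internals of the new engine, but as a purely conceptual sketch of what “writing the processing code ourselves” means, here is hand-rolled per-check sliding-window evaluation in plain Java, with no general query layer in the hot path. The class name, window size, and threshold rule are all invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Conceptual sketch: evaluate a rolling-average threshold per check, one event at a time. */
public final class WindowedChecker {
    private static final long WINDOW_MS = 5 * 60 * 1000L;

    // One small deque of (timestamp, value) samples per check id.
    private final Map<String, ArrayDeque<long[]>> windows = new HashMap<>();

    /** Returns true if this check's rolling average now exceeds the given threshold. */
    public boolean offer(String checkId, long timestampMs, double value, double threshold) {
        ArrayDeque<long[]> w = windows.computeIfAbsent(checkId, k -> new ArrayDeque<>());
        w.addLast(new long[] { timestampMs, Double.doubleToRawLongBits(value) });

        // Evict samples that have fallen out of the window.
        while (!w.isEmpty() && timestampMs - w.peekFirst()[0] > WINDOW_MS) {
            w.removeFirst();
        }

        double sum = 0;
        for (long[] sample : w) {
            sum += Double.longBitsToDouble(sample[1]);
        }
        return (sum / w.size()) > threshold;
    }
}
```

A production engine would keep running aggregates instead of rescanning the window on every event; the point of the sketch is simply that a purpose-built path does only the work a specific rule needs, with no generic query machinery in between.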

We tend to look at this data internally on a logarithmic scale to better see the minor changes in usage. Here is the same graph but with a log base 10 y-axis.

Distributed database updates.

Next up were upgrades to our metric storage system. To briefly describe the setup: it is based on Amazon’s Dynamo. We have a ring of nodes, and as data is fed in, we hash the ids and names to find which node the data goes on, insert it, and use a rather clever means of deterministically finding subsequent nodes to meet our redundancy requirements. All data is stored at least twice and never on the same node. Theo gave a talk at last year’s Surge conference that is worth checking out for more details. Numeric data is stored in a highly compact, proprietary format, while text data was placed into a Berkeley DB whenever it changed.
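Our actual placement logic is a bit more involved, but the general Dynamo-style idea can be sketched as follows; the hash function, ring walk, and key format below are illustrative assumptions, not our implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Illustrative Dynamo-style ring: hash a metric key, then walk the ring for N distinct owners. */
public final class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    /** Pick `replicas` distinct nodes for a metric, never the same node twice. */
    public List<String> ownersFor(String checkId, String metricName, int replicas) {
        long h = hash(checkId + "/" + metricName);

        // Walk clockwise from the hash position, wrapping around the ring.
        List<String> walk = new ArrayList<>(ring.tailMap(h).values());
        walk.addAll(ring.headMap(h).values());

        List<String> owners = new ArrayList<>();
        for (String node : walk) {
            if (!owners.contains(node)) {
                owners.add(node);
            }
            if (owners.size() == replicas) {
                break;
            }
        }
        return owners;
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a replication factor of two or more, every metric lands on at least two distinct nodes, so any single node’s contents can be reconstructed from the others, which is what the rebuilds described below rely on.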

The Berkeley DB decision was haunting us. We started to notice potential locking issues as the data size grew, and the performance and disk usage weren’t quite where we wanted them to be. To solve this we wanted to move to leveldb. The code changes went smoothly, but then the problem arose: how do we get the data from one on-disk format to the other?

The storage system was designed from the beginning to allow one node to be destroyed and rebuilt from the others. Of course a lot of systems are like this, but who ever actually wants to try it with production data? We do. With the safeguard of ZFS snapshotting, over the course of the summer we would destroy a node, bring it up to date with the most recent code, and then have the other nodes rebuild it. Each destroy, rebuild, and bring-online cycle took the better part of a work day, and each exercise got faster and more reliable as we cleaned up some problem areas. During the process, user requests were simply served from the active nodes in the cluster, and outside of a few minor delays in data ingestion, no users were impacted. Doing these “game day” rebuilds has given us a huge confidence boost that should a server go belly up, we can quickly be back to full capacity.
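For a sense of how such a rebuild can work under that kind of placement, here is a conceptual outline; it is not our actual tooling, the peer-fetch and store hooks are hypothetical, and Ring refers to the sketch above.

```java
import java.util.List;

/** Conceptual outline of rebuilding one node from its peers; the hooks below are hypothetical. */
public abstract class Rebuilder {

    /** Hypothetical RPC: ask a peer node for its copy of one metric's data. */
    protected abstract byte[] fetchFromPeer(String peer, String checkId, String metricName);

    /** Hypothetical local write into the node's new on-disk store (e.g. leveldb for text data). */
    protected abstract void store(String checkId, String metricName, byte[] data);

    /** Repopulate rebuiltNode by pulling every key it owns from the other owners. */
    public void rebuild(Ring ring, String rebuiltNode, Iterable<String[]> allKeys, int replicas) {
        for (String[] key : allKeys) {             // key = { checkId, metricName }
            List<String> owners = ring.ownersFor(key[0], key[1], replicas);
            if (!owners.contains(rebuiltNode)) {
                continue;                          // this node does not own the key
            }
            for (String peer : owners) {
                if (peer.equals(rebuiltNode)) {
                    continue;
                }
                byte[] data = fetchFromPeer(peer, key[0], key[1]);
                if (data != null) {
                    store(key[0], key[1], data);   // write into the new on-disk format
                    break;
                }
            }
        }
    }
}
```

A real rebuild also has to move data in bulk and reconcile writes that arrive while it runs, details this sketch deliberately ignores, but the ownership walk is the part the ring makes cheap.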

More powerful visualizations.

Histograms were another big addition to our product. I won’t speak much about them in this post; instead, head to Theo’s post on them here. We’ve been showing these off at various conferences, and gave attendees at this year’s Velocity and Surge insight into the conference wireless networks with real-time dashboards showing client signal strengths, downloads and uploads, and total client counts.

API version 2.

Lastly, we’ve received a lot of feedback on our API, some good, some indifferent, but a lot of requests to make it better, so we did. This rewrite was mostly from the ground up, but we did try to keep a lot of code the same underneath since we knew it worked (some is shared by the web UI and the current API). It more tightly conforms to what one comes to expect from a RESTful API, and for our automation-enabled users, we have added some idempotence so that consecutive Chef or Puppet runs on the same server won’t create duplicate checks, rules, etc. We are excited about getting this out; stay tuned.
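As one hedged illustration of what that idempotence buys you, consider a configuration-management run that PUTs a check to a stable identifier instead of POSTing a new one every time; the base URL, path, and payload below are made up for the example and are not the documented API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Illustration only: an idempotent create/update as a Chef or Puppet run might drive it. */
public final class IdempotentCheckUpsert {
    public static void main(String[] args) throws IOException {
        String base = "https://api.example.com/v2";                    // hypothetical base URL
        String payload = "{\"target\":\"web01\",\"name\":\"http\"}";   // hypothetical check definition

        // PUT to a deterministic resource: running this twice updates the same check
        // instead of creating a duplicate, which is exactly what repeated config runs need.
        URL url = new URL(base + "/check/web01-http");                 // hypothetical stable identifier
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```

Run it twice against the same server and you end up with one check, not two.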

It was a busy summer, and we are looking forward to an equally busy fall and winter. We will be giving you more updates, hopefully once a month or so, with more behind-the-scenes information. Until then, keep an eye on the change log.