Fault Detection: New Features and Fixes

One of the trickier problems when detecting faults is detecting the absence of data. Did the check run and not produce data? Did we lose connection and miss the data? The latter problems are where we lost a bit of insight, which we sought to correct.

The system is down

A loss of connection to the broker happens for one of two reasons. First, the broker itself might be down, the software restarted, machine crashed, etc. Second, there was a loss of connectivity in the network between the broker and the Circonus NOC. Note that for our purposes, a failure in our NOC would look identical to the broker running but having network problems.

Let’s start with a broker being down. Since we aren’t receiving any data, it looks to the system like all of the metrics just went absent. In the event that a broker goes down, the customer owning that broker would be inundated with absence alerts.

Back in July, we solved this by adding the ability to set a contact group on a broker. If the broker disconnects, you will get a single alert notifying you that the broker is down. While it is disconnected, the system automatically puts all metrics on the broker into an internal maintenance mode. When it reconnects, we flip them out of maintenance and then ask for the current state of the world, so anything that is bad will alert. Note that if you do not set a contact group, we have no way to tell you the broker is disconnected, so we fall back to not putting metrics in maintenance and you will get paged about each one as it goes absent. Even though this feature isn’t brand new, it is worth pointing out.

Can you hear me now?

It is important to know a little about how the brokers work. When a broker restarts, all the checks configured on it are scheduled to run within the first minute; after that they follow their normal frequency settings. To this end, when we reestablish connectivity with a broker, we look at its internal uptime monitor. If the uptime is >= 60 seconds, we know all the checks have run and we can again use the data for alerting purposes.

This presented a problem when an outage was caused by a network interruption or a problem in our NOC. Such a network problem happened late one night, and connections to a handful of brokers were lost temporarily. When they came back online, because they had never restarted, we saw the uptime was good and immediately started using the data. This posed a problem if we reconnected at the very end of an absence window. A given check might not run again for 1 – 5 minutes, so we would potentially trigger absences, and then recover them when the check ran.

We made two changes to fix this. First, we now have two criteria for a stable / connected broker:

  • Uptime >= 60 seconds
  • Connected to the NOC for >= 60 seconds

Since the majority of the checks run every minute, this meant that we would see the data again before declaring the data absent. This, however, doesn’t account for any checks with a larger period. To that end, we changed the absence alerting to first check to see how long the broker has been connected. If it has been connected for less than the absence window length, we push out the absence check to another window in order to first ensure the check would have run. A small change but one that took a lot of testing and should drastically cut down on false absence alerts due to network problems.
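To make the two changes concrete, here is a minimal sketch of the decision logic in Python. It is not our production code: the names BrokerState, broker_is_stable and absence_decision are made up for illustration, and the only values taken from the description above are the two 60 second thresholds and the comparison against the absence window length.

```python
from dataclasses import dataclass

STABLE_SECONDS = 60  # both thresholds described above


@dataclass
class BrokerState:
    uptime_seconds: float     # time since the broker process started
    connected_seconds: float  # time since the NOC connection was established


def broker_is_stable(broker: BrokerState) -> bool:
    """Both criteria must hold before the broker's data is used for alerting."""
    return (broker.uptime_seconds >= STABLE_SECONDS
            and broker.connected_seconds >= STABLE_SECONDS)


def absence_decision(broker: BrokerState,
                     absence_window_seconds: float,
                     seconds_since_last_value: float) -> str:
    """Return 'defer', 'alert' or 'ok' for a single metric's absence check."""
    if not broker_is_stable(broker):
        return "defer"
    # Connected for less than one full window: the check may not have had a
    # chance to run yet, so push the decision out to the next window.
    if broker.connected_seconds < absence_window_seconds:
        return "defer"
    if seconds_since_last_value >= absence_window_seconds:
        return "alert"
    return "ok"
```

With a 300 second absence window, a broker that reconnected 120 seconds ago gets "defer" even if the last value is 400 seconds old, while the same metric on a broker connected for 600 seconds gets "alert".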

Updates From The Tech Team

Now that it is fall and the conference season is just about over, I thought it would be a good time to give you an update on some items that didn’t make our change log (and some that did), what is coming shortly down the road and just generally what we have been up to.

CEP woes and engineering salvation.

The summer started out with some interesting challenges involving our streaming event processor. When we first started working on Circonus, we decided to go with Esper as a complex event processor to drive fault detection. Esper offers some great benefits and a low barrier of entry to stream processing by placing your events into windows that are analogous to database tables, and then gives you the ability to query them with a language akin to SQL. Our initial setup worked well, and was designed to scale horizontally (federated by account) if needed. Due to demand, we started to act on this horizontal build out in mid-March. However, as more and more events were fed in, we quickly realized that even when giving an entire server to one account, the volume of data could still overload the engine. We worked on our queries, tweaking them to get more performance, but every gain was wiped away with a slight bump in data volume. This came to a head near the end of May when the engine started generating late alerts and alerts with incorrect data. At this point, too much work was put into making Esper work for not enough gain, so we started on a complete overhaul.

The new system was still in Java, but this time we wrote all the processing code ourselves. The improvement was incredible: events that once took 60ms to process now took on the order of 10µs. To validate the system, we split the incoming data stream onto the old and new systems and compared the data coming out. The new system, as expected, found alerts faster, and when we saw a discrepancy, the new system was found to be correct. We launched this behind the scenes for the majority of the users on May 31st, and completed the rollout on June 7th. Unless you were one of the two customers affected by the delinquency of the old system, this mammoth amount of work got rolled out right under your nose and you never even noticed; just the way we like it. In the end we collapsed our CEP system from 3 (rather saturated) nodes back to 1 (almost idle) node and have a lot more faith in the new code. Here is some eye candy that shows the CEP processing time in microseconds over the last year. The green, purple and blue lines are the old CEP being split out, and the single remaining green line is the current system.

We tend to look at this data internally on a logarithmic scale to better see the minor changes in usage. Here is the same graph but with a log base 10 y-axis.

Distributed database updates.

Next up were upgrades to our metric storage system. To briefly describe the setup: it is based on Amazon’s Dynamo. We have a ring of nodes, and as data is fed in, we hash the ids and names to find which node it goes on, insert the data, and use a rather clever means to deterministically find subsequent nodes to meet our redundancy requirements. All data is stored at least twice and never twice on the same node. Theo gave a talk at last year’s Surge conference that is worth checking out for more details. The numeric data is stored in a proprietary, highly compact format, while text data was placed into a Berkeley DB whenever it changed.
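For readers who have not worked with a Dynamo-style ring, here is a rough Python sketch of the general idea. It is a textbook consistent-hashing ring, not our actual placement code: the "rather clever means" we use to pick the subsequent nodes is not described here, so the clockwise walk below is a stand-in, and the Ring class, the vnodes parameter and the key format are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]
        self._node_count = len(set(nodes))

    def owners(self, check_id: str, metric_name: str, copies: int = 2):
        """Return `copies` distinct nodes for a metric, walking clockwise."""
        if copies > self._node_count:
            raise ValueError("not enough nodes for the requested redundancy")
        start = bisect.bisect(self._hashes, _hash(f"{check_id}/{metric_name}"))
        found = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in found:  # never place two copies on the same node
                found.append(node)
                if len(found) == copies:
                    break
        return found


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.owners("check-1", "duration")
```

Because the placement depends only on the hash of the id and name, any node can work out where a metric lives without asking a central directory, and the same lookup always gives the same answer.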

The Berkeley DB decision was haunting us. We started to notice potential issues with locking as the data size grew, and the performance and disk usage weren’t quite where we wanted them to be. To solve this we wanted to move to leveldb. The code changes went smoothly, but a problem arose: how do we get the data from one on-disk format to another?

The storage system was designed from the beginning to allow one node to be destroyed and rebuilt from the others. Of course a lot of systems are like this, but who ever actually wants to try it with production data? We do. With the safeguards of ZFS snapshotting, over the course of the summer we would destroy a node, bring it up to date with the most recent code, and then have the other nodes rebuild it. Each destroy, rebuild, bring online cycle took the better part of a work day, and got faster and more reliable after each exercise as we cleaned up some problem areas. During the process user requests were simply served from the active nodes in the cluster, and outside of a few minor delays in data ingestion, no users were impacted. Doing these “game day” rebuilds has given us a huge confidence boost that should a server go belly up, we can quickly be back to full capacity.

More powerful visualizations.

Histograms were another big addition to our product. I won’t speak much about them here, instead you should head to Theo’s post on them here. We’ve been showing these off at various conferences, and have given attendees at this year’s Velocity and Surge insight into the wireless networks with real time dashboards showing client signal strengths, download and uploads and total clients.

API version 2.

Lastly, we’ve received a lot of feedback on our API, some good, some indifferent but a lot of requests to make it better, so we did. This rewrite was mostly from the ground up, but we did try to keep a lot of code the same underneath since we knew it worked (some is shared by the web UI and the current API). It more tightly conforms to what one comes to expect from a RESTful API, and for our automation enabled users, we have added in some idempotence so your consecutive Chef or Puppet runs on the same server won’t create duplicate checks, rules, etc. We are excited about getting this out, stay tuned.
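If idempotence is a new term for you, the following Python sketch shows the behavior we are after. It is an in-memory toy, not the API itself: the CheckStore class, the ensure_check method and the fields used to identify a check are all hypothetical and only illustrate the "same request twice, one check" property.

```python
class CheckStore:
    """Toy stand-in for a server that creates checks idempotently."""

    def __init__(self):
        self._ids_by_identity = {}
        self._next_id = 1

    def ensure_check(self, target: str, check_type: str, config: dict) -> int:
        """Create the check if it does not exist; otherwise return the existing id."""
        # Hypothetical identity: what the check looks at and how it is configured.
        identity = (target, check_type, tuple(sorted(config.items())))
        if identity not in self._ids_by_identity:
            self._ids_by_identity[identity] = self._next_id
            self._next_id += 1
        return self._ids_by_identity[identity]

    def count(self) -> int:
        return len(self._ids_by_identity)


store = CheckStore()
first_run = store.ensure_check("web1.example.com", "http", {"port": 80})
second_run = store.ensure_check("web1.example.com", "http", {"port": 80})
```

Here second_run is the same id as first_run and the store still holds a single check, which is what you want when a configuration management run repeats itself.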

It was a busy summer and we are looking forward to an equally busy fall and winter. We will be giving you more updates, hopefully once a month or so, with more behind the scenes information. Until then keep an eye on the change log.

Understanding Data with Histograms

For the last several years, I’ve been speaking about the lies that graphs tell us. We all spend time looking at data, commonly through line graphs, that actually show us averages. A great example of this is showing average response times for API requests.

The above graph shows the average response time for calls made to an HTTP REST endpoint. Each pixel in this line graph is the average of thousands of samples. Each of these samples represents a real user of the API. Thousands of users distilled down to a single value sounds ideal until you realize that you have no idea what the distribution of the samples looks like. Basically, this graph only serves to mislead you. Having been misled for years by the graphs with little recourse, we decided to do something about it and give Circonus users more insight into their data.

Each of these pixels is the average of many samples. If we were to take those samples and put them in a histogram, it would provide dramatically improved insight into the underlying data. But a histogram is a visually bulky representation of data, and we have a lot of data to show (over time, no less). When I say visually bulky what do I mean? A histogram takes up space on the screen and since we have a histogram of data for each period of time and hundreds of periods of time in the time series we’d like to visualize… well, I can’t very well show you hundreds of histograms at once and expect you to be able to make any sense of them; or can I?

Enter heat maps. Heat maps are a way of displaying histograms using color saturation instead of bar heights. So heat maps remove the “bulkiness” and provide sufficient visual density of information, but the rub is that people have trouble grasping them at first sight. Once you look at them for a while, they start to make sense. The question we faced is: how do we tie it all together and make it more accessible? The journey started for us about six months ago, and we’ve arrived at a place that I find truly enlightening.
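If it helps to see the idea in code, here is a small Python sketch of turning one histogram per time period into heat map cells. It is only an illustration: the function name heat_map_rows, the fixed bucket width, and the choice to scale saturation against the busiest cell in the whole map are my assumptions, not a description of how Circonus renders its graphs.

```python
from collections import Counter


def heat_map_rows(periods, bucket_width=1.0):
    """Turn per-period samples into {bucket_start: saturation} dictionaries.

    `periods` is a list of sample lists, one list per time period.
    Saturation is 0..1, scaled against the busiest cell in the whole map.
    """
    rows = []
    for samples in periods:
        # One histogram per time period: count the samples in each bucket.
        rows.append(Counter(int(s // bucket_width) * bucket_width for s in samples))
    peak = max((count for row in rows for count in row.values()), default=0)
    if peak == 0:
        return [{} for _ in rows]
    # Color saturation replaces bar height.
    return [{bucket: count / peak for bucket, count in row.items()} for row in rows]


cells = heat_map_rows([[1.2, 1.7, 3.1], [1.5]])
```

Each period becomes one column of cells, so hundreds of histograms fit side by side in the space of a single line graph.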

Instead of a tutorial on histograms, I think throwing you into the interface is far more constructive.

The above graph provides a very deep, rich understanding of the same data that powered the first line graph. This graph shows all of the API response times for the exact same service over the same time period.

At my first point of interest (#1), I am hovering the pointer over a specific bit of data. This happens to be August 31st at 8pm. I’ll note that not only does my horizontal position matter (affecting time), but my vertical position indicates the actual service times. I’m hovering between 23 and 24 on the y-axis (23-24 milliseconds). The legend shows me that there were 1383 API calls made at that time and 96 of them took between 23 and 24 milliseconds. Highlighted at #3, I also have some invaluable information about where these samples sit in our overall distribution: these 96 samples constitute only 7% of our dataset, 61% of the samples are less than 23ms and the remaining 32% are greater than or equal to 24ms. If I move the pointer up and down, I can see this all dynamically change on-screen. Wow.
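The arithmetic behind that legend is simple enough to sketch in Python. The function name bucket_breakdown is made up, and the sample values below are synthetic: only the totals (1383 calls, 96 of them in the 23-24ms bucket) come from the graph, while the counts of 844 below and 443 above are chosen so the rounded percentages match the ones shown.

```python
def bucket_breakdown(samples_ms, low_ms, high_ms):
    """Split samples into below / inside / at-or-above one histogram bucket."""
    total = len(samples_ms)
    if total == 0:
        raise ValueError("no samples in this time period")
    below = sum(1 for s in samples_ms if s < low_ms)
    inside = sum(1 for s in samples_ms if low_ms <= s < high_ms)
    above = total - below - inside

    def pct(count):
        return round(100 * count / total)

    return {
        "total": total,
        "in_bucket": inside,
        "pct_below": pct(below),
        "pct_in_bucket": pct(inside),
        "pct_above": pct(above),
    }


# Synthetic data shaped like the hovered time period (values are invented).
samples = [10.0] * 844 + [23.5] * 96 + [40.0] * 443
summary = bucket_breakdown(samples, 23, 24)
print(summary)
```

Moving the pointer vertically amounts to calling this again with a different bucket, and moving it horizontally amounts to swapping in the samples from a different time period.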

As if that wasn’t enough, a pop-up histogram of the data from the time interval over which I’m hovering is available (#2) that shows me the precise distribution of samples. This histogram changes as I move my pointer horizontally to investigate different points in time.

Now that I’ve better prepared you for the onslaught of data, poke around a live interactive visualization of a histogram with similar data.

With these visualizations at my disposal, I am now able to ask more intelligent questions about how our systems behave and how our business reacts to that. All of these tools are available to Circonus users and you should be throwing every piece of data you have at Circonus… just be prepared to have your eyes opened.

Web Portal Outage

Last night circonus.com became unavailable for 34 minutes because the primary database server became unavailable. Here is a breakdown of events; times are US/Eastern.

  • 8:23 pm kernel panic on primary DB machine, system rebooted but did not start up properly
  • 8:25 -> 8:27 first set of pages went out about DB being down and other dependent systems not operating
  • 8:30 work began on migrating to the backup DB
  • 8:57 migration complete and systems were back online

In addition to the web portal being down during this time, alerts were delayed. The fault detection system continued to operate, however we have discovered some edge cases in the case management portion that will be addressed soon.

Because of the highly decoupled nature of Circonus, metric collection, ingestion and long term storage were not impacted by this event. Other services like search, streaming, and even fault detection (except as outlined above) receive their updates over a message queue and continued to operate as normal.

After the outage we discussed why recovery took so long and boiled it down to inadequate documentation on the failover process. Not all the players on call that night knew all they needed about the system. This is something that is being addressed so recovery in an event like this in the future can be handled much faster.

Dashboards: Redux (or What to Look for in a Performance Monitoring Dashboard)

Last autumn we launched our customizable dashboards for Circonus, and we happen to think they’re pretty sweet. In this post, I’m not going to get into specifics about our dashboards (for more on that, you can check out my previous post, “One Dashboard to Rule Them All”), but instead I’ll talk more generally about what you should look for in your performance monitoring dashboard of choice.

Your dashboard shouldn’t limit its focus to technical data; it should help you track what really matters: business success.

A lot of data analysis done today is technical analysis for technical benefit. But the real value comes when we are able to take this expertise and start empowering better business decisions. As such, a performance monitoring dashboard which is solely focused on networks, systems, and applications is limiting because it doesn’t address what really matters: business.

While your purpose for monitoring may be to make your company’s web business operate smoothly, you can influence your overall business through what you operate and control, including releases, performance, stability, computing resources, networking, and availability. Thus, your dashboard should be designed to enable this kind of cross-pollination. By understanding which of your business metrics are critical to your success, you will be able to effectively use a dashboard to monitor those elements that are vital to your business.

Your dashboard should be able to handle multiple data sources.

There are many technologies in use across the web today. Chances are good that you have many different data sources across your business, so you need a dashboard that can handle them. It’s no good for a dashboard to only be able to gather part of your business data, because you’ll be viewing an incomplete picture. You need a dashboard that can handle all of your data sources, preferably on a system that’s under active development and continuing to integrate the best new technologies coming down the pike.

Your dashboard should provide access to real-time data.

The value of real-time data should not be underestimated; having real-time data capabilities on your dashboard is critical. Rather than requiring you to hit the refresh button, it should use real-time data to show you what is going on right now. Having this up-to-date picture makes your analysis of the data more valuable because it’s based on what’s happening in the moment. Some sectors already embracing this type of real-time analysis include finance, stock trading, and high-frequency trading.

Your dashboard should provide visualizations to match different types of data.

Your dashboard should provide different visualizations, because the visualization method you choose should fit the data you’re displaying. It’s easy to gravitate towards the slickest, shiniest visualizations, but they don’t always make the most sense for your data.

One popular visualization design is the rotary needle (dial) gauge. Gauges look cool, but they can be misleading if you don’t know their limits. Also, because of their opaque nature, the picture they give you of the current state is without context. Gauges can be great for monitoring certain data like percentages, temperature, power per rack, or bandwidth per uplink, but visualizations like graphs are generally better because they can give you context and history in a compact space. Graphs not only show you what’s going on now but also what happened before, and they allow you to project historic data (e.g. last day/week) alongside current data or place multiple datasets side-by-side so you can compare trends.

It’s also easy to forget that sometimes you may not need a visualization at all. Some data is most easily understood in text form (perhaps formatted in a table). Your dashboard should provide different ways of viewing data so you can choose the best method for your particular data sets.

Your dashboard’s interface shouldn’t be over-designed.

Designers tend to show off their design chops by creating slick, shiny user interfaces for their dashboards, but these are frequently just eye-candy and can get in the way of “scannability.” You need to be able to understand your dashboard at a glance, so design should stay away from being too graphics-heavy and should not have too much information crammed into tiny spaces. These lead to visual clutter and make you have to “work too hard” whenever you try to read your dashboard. The design should help you focus on your data, not the interface.

Everybody’s idea of a “perfect dashboard” will vary somewhat, but by following these guidelines you will be well on your way to selecting a dashboard that lets you monitor your data however you want. Remember, the goal is informed, data-driven decision-making, and it’s not unreachable.

Failing Forward While Stumbling, Eventually You Regain Your Balance

First I want to start by saying I sincerely apologize to anyone adversely affected by yesterday’s false alerts. That is something that we are very conscious of when rolling out new changes and clearly something I hope never to repeat.

How did it happen? First, a quick run down of the systems involved. As data is streamed into the system from the brokers, it is sent over RabbitMQ to a group of Complex Event Processors (CEP) running Esper and additionally the last collected value for each unique metric is stored in Redis for quick lookups. The CEPs are responsible for identifying when a value has triggered an alert, and then tell the notification system about it.

Yesterday we were working on a bug in the CEP system where under certain conditions, if a value went from bad to good, and we were restarting the service, it was possible we would never trigger an “all clear” event and as such your alert would never clear. After vigorously testing in our development environment, we thought we had it fixed and all our (known) corner cases tested.

So the change was deployed to one of the CEP systems to verify it in production. For the first few minutes all was well, stale alerts were clearing, I was a happy camper. Then roughly 5 minutes after the restart, all hell broke loose, every “on absence” alert fired, and then cleared within 1 minute, pagers went off around the office, happiness aborted.

Digging into the code, we thought we spotted the problem: when we load the last value into the CEP from Redis, we need to do so in a particular order. Because we used multiple threads to load the data and let them do so asynchronously, some was being loaded in the proper order, but the vast majority was being loaded too late. Strike one for our dev environment. It doesn’t have near the volume of data, so everything was loaded in order by chance. We fixed the concurrency issue, tested, redeployed, BOOM same behavior as before.

The next failure was a result of the grouping that we do in the Esper queries. We were grouping by the check id, the name of the metric and the target host being observed, but the preload data was missing the target field. As a result, the initial preload event was inserted ok, and new data was also inserted just fine as it came in, but the two were being grouped differently. Our absence windows currently have a 5 minute timeout, so 5 minutes after boot, all the preload data would exit the window, which would now be empty, and we triggered an alert. Then, as the newly collected data filled its window, we would send an all clear for that metric and at this point we would be running normally, albeit with a lot of false alerts getting cleaned up.
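Here is a stripped-down Python model of that grouping bug. It is not Esper and not our queries; the event dictionaries, the group_key and absent_groups names, and the timestamps are invented for illustration. The only details taken from the description above are the three grouping fields and the 5 minute window.

```python
WINDOW_SECONDS = 300  # the 5 minute absence window


def group_key(event):
    """Group by check id, metric name and target host."""
    return (event["check_id"], event["metric"], event.get("target"))


def absent_groups(events, now):
    """Return the groups that were seen once but have nothing left in the window."""
    latest = {}
    for event in events:
        key = group_key(event)
        latest[key] = max(latest.get(key, event["ts"]), event["ts"])
    return {key for key, ts in latest.items() if now - ts >= WINDOW_SECONDS}


# Preloaded at boot (ts=0) without a target, then live data once a minute.
preload = {"check_id": 1, "metric": "duration", "ts": 0}
live = [
    {"check_id": 1, "metric": "duration", "target": "10.0.0.1", "ts": ts}
    for ts in (60, 120, 180, 240, 300)
]

buggy = absent_groups([preload] + live, now=300)
fixed = absent_groups([dict(preload, target="10.0.0.1")] + live, now=300)
```

In the buggy case the preload lands in its own group keyed with a missing target, nothing ever refreshes it, and it goes absent at the 5 minute mark even though live data for the same metric is flowing. With the target included, the preload and live events share one group and nothing is absent.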

Unfortunately at this point, the Redis servers didn’t have the target information in their store, so a quick change was made to push that data into them. That rollout was a success, and a little happiness was restored since something went right. After they had enough time to populate all the check data, changes were again rolled out to the CEP to add the target to the preload, and hopes were high. We still at this point had only rolled the changes to the first CEP machine, so that was updated again, rebooted, and after 5 minutes things still looked solid, then the other systems were updated. BOOM.

The timing of this failure didn’t make sense. CEP one had been running for 15 minutes now, and there are no timers in the system that would explain this behavior. Code was reviewed and looked correct. Upon review of the log files, we saw failures and recoveries on each CEP system; however, they were being generated by different machines.

The reason for this was a recent scale out of the CEP infrastructure. Each CEP is connected to RabbitMQ to receive events; to split the processing amongst them, each binds a set of routing keys for the events it cares about. This splitting of events wasn’t mimicked in the preload code, so each CEP would be preloaded with all events. Since each system only cared about its share, the events it wasn’t receiving would trigger an absence alert, as it would see them in the preload and then never again. Since the CEP systems are decoupled, an event A on CEP one wouldn’t be relayed to any other system, so they would not know that they needed to send a clear event since as far as they were concerned, everything was ok. Strike two for dev; we don’t use that distributed setup there.

Once again the CEP was patched; this time the preloader was given the intelligence to construct the routing key for each metric. At boot it would pull the list of keys it cared about from its config, and then as it pulled the data from Redis, it would compare what that metric’s key would be to its list; if the key was there, it would preload the data. One last time, roll changes, restart, wait, wait, longest 5 minutes in recent memory, wait some more… no boom!!!
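In code, the fix boils down to filtering the preload with the same keys the node binds on the queue. The Python sketch below shows the shape of it. The routing key scheme here is hypothetical (our real key format is not shown in this post), as are the function names and the use of a CRC to spread metrics across shards.

```python
import zlib


def routing_key(check_id: int, metric: str, shards: int = 4) -> str:
    """Hypothetical key scheme: hash the metric identity into one of N shards."""
    shard = zlib.crc32(f"{check_id}.{metric}".encode("utf-8")) % shards
    return f"metrics.shard{shard}"


def preload(last_values: dict, bound_keys) -> dict:
    """Keep only the last-known values this CEP node will also receive live."""
    bound = set(bound_keys)
    return {
        ident: value
        for ident, value in last_values.items()
        if routing_key(*ident) in bound
    }


# Last collected values as they might come out of the cache, keyed by
# (check id, metric name). The data is invented.
last_values = {(check_id, "duration"): 42.0 for check_id in range(1, 21)}

cep_one = preload(last_values, ["metrics.shard0", "metrics.shard1"])
cep_two = preload(last_values, ["metrics.shard2", "metrics.shard3"])
```

Every metric is preloaded on exactly one node, the same node that will see its live events, so no node is left holding a preloaded value that it will never hear about again.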

At this point though, one of the initial problems I set out to solve was still an issue. Because data streaming in looked good, the CEP won’t emit an all clear for no reason, it has to be bad first, so we had a lot of false alerts hanging out and people being reminded about them. To rectify this, I went into the primary DB, cleared all the alerts with a few updates, and rebooted the notification system so it would no longer see them as an issue. This stopped the reminders and brought us back to a state of peace. And this is where we sit now.

What are the lessons learned and how do we hope to prevent this in the future? Step 1 is, of course, always making sure dev matches production; not just in code, but in data volume and topology. Outside of the CEP setup it does, so we need a few new zones brought into the mix today and that will resolve that. Next, a better staging and rollout procedure for this system. We can bring up a new CEP in production and give it a workload but have its events not generate real errors. Going forward we will be verifying production traffic like this before a rollout.

Once again, sorry for the false positives. Disaster porn is a wonderful learning experience, and if any of the problems mentioned in this post hit home, I hope it gets you thinking about what changes you might need to be making. For updates on outages or general system information, remember to follow circonusops on Twitter.

Graph Annotations and Events

This feature has been a long time in coming: the ability to annotate your graphs! With the new annotations timeline sitting over the graph, not only can you create custom events to mark points in time, but you can also view alerts and see how they fit (or don’t fit) your metric data.

Annotations Timeline

First, let’s go to a graph and take a look at the annotations timeline to see how it works. When you choose a graph and view it, you will immediately see the new Annotation controls to the left side of the date tools, and the timeline itself will render in between the date tools and the graph itself. The timeline defaults to collapsed mode and by default will only show alerts from metrics on the current graph, so you may have an empty timeline at first. If you take a look at the controls, however, you will see three items: the Annotation menu, the show/hide toggle button, and the expand/collapse toggle button. The show/hide button does just what it says: it shows or hides the timeline. The expand/collapse button toggles between the space-saving collapsed timeline view and the more informative expanded timeline view.

If you open the Annotation menu, you will see a list of all the items you can possibly show in your timeline (or hide from it). Any selections you make here (as well as your show/hide and expand/collapse state changes) will be saved as site-wide user preferences in your current browser. All the items are separated into three groups:

Event Categories

This is a list of all the Event categories under the current account (these are seen and managed in the Events section of the site; we’ll get to that new section in a minute). If you have uncategorized events (due to deleting a category that was still in use), they will appear grouped under the “–” pseudo-category label.

Alerts

By default, the only alerts that will be shown will be alerts of all severity (sev) levels triggered by metrics on the current graph. If you wish, you may also show all alerts, and both categories of alerts may be filtered by sev levels. To do so, click one of the alert labels to expand a sev filter row with more checkboxes.

Text Metrics

This third group is not shown by default, but is represented by the checkbox at the bottom labeled “Include text metrics.” If you check this box, the page will refresh, and any text metrics on the current graph will then be rendered as a part of the timeline (and will be excluded from the graph plot and legend).

Once you have some annotations rendering on the timeline, take a look at the timeline itself. Hovering over a point will show a detail tooltip with the annotation title, date, and description, and hovering over either a point or a line segment will highlight the corresponding date range on the graph itself.

Now for the question on everyone’s minds: “Can I create events here, or do I have to go to the Events section to do that?” The answer is, yes, you can create events straight from the view graph page! To do so, simply use your right mouse button to drag-select a time range on the graph itself. A dialog will then pop up for you to input your info and create the event.

Events Section

Now let’s head over to the Events section where you can manage your events and event categories. Simply click on the new Events tab (below the Graphs tab) and you’re there! To create an event, click the standard “+” tab at the upper left of the page. This will give you the New Event dialog. Most of the dialog inputs are pretty straightforward, with the exception of the category dropdown. This is a new hybrid “editable” dropdown input.

You may select any of its options if you’d like, or you can add new ones. To add a new option, simply select the last option (it’s labeled “+ ADD Category”). Your cursor will immediately be placed in a standard text input where you can enter your new category. When you’re finished, hit enter to create the new option and have it selected as your category of choice.

After you have created your event, you may need to edit it later. To edit any of its details, simply click on the pertinent detail of the event (when changing the event category, you will see it also has the new hybrid “editable” dropdown input which works exactly like the one in the New Event dialog).

In addition to start and end points (which may be the same date if you don’t want more than a single point), you may also add midpoints to your event. Click the Show details button for an event (the arrow button at the right end of an event row), and you will see the Midpoints list taking up the right half of the event details panel. Simply click the Add Midpoint button to get the New Midpoint dialog where you enter a title, description and choose a date for your point.

The one last element of the Events section that’s good to know about is the Categories menu at the upper right of the page. This allows you to delete categories as well as filter the Events list to only show a single category of events at a time. To do this, just click the name of a category in the Categories menu.

Insights from a Data Center Conference

At the beginning of this month, I attended the Gartner Data Center Conference in Las Vegas, and wanted to share with you some of my impressions and insights from the event.

First, I have to say that I have seldom seen a group of more conscientious conference attendees (aside from Surge, of course, and a physics conference I once attended). Networking breakfasts were busy, sessions were well attended, and both lunch and topic-specific networking gatherings had lively discussions. Each of the Solution Center hours, which ran well into the evening, was full of people who were not only partaking of the food and giveaways but were primarily and voraciously soaking up information from the various exhibitors. Even in hallways during the day, while people were sitting or standing, there was a steady exchange of opinions and information. This is what I saw throughout the conference; attendees there were very serious about learning…from the speakers, vendors, and from their peers. Relatedly, it’s interesting that many organizations outright bar their employees from attending any events in Vegas. While boondoggle may be an appropriate term for some shows in that or any other location, it certainly wasn’t the case with this conference.

Now let’s get to what was most often foremost on the minds of attendees. I was somewhat surprised to find that it was not something that usually appears on the top-ten lists of CIO/IT initiatives. Rather, what repeatedly came out first in terms of attendees’ pressing interest were the interrelated topics of avoiding IT outages and increasing the speed of service recovery, along with monitoring to help with both.

Granted, this was a datacenter-specific conference, so it’s natural that avoidance of and recovery from operational failures is of paramount importance. But note that there are lots of other overarching datacenter initiatives we all hear much more about, such as virtualization, cloud migration, and datacenter consolidation. Many of these headline-grabbing topics are certainly both important and getting done. However, what affects datacenter operations leaders’ daily lives and careers, and so is of primary importance to them, has not received much if any notice or press.

Why is that? It’s pretty simple. Some of these other initiatives are new. Monitoring has been around seemingly forever, plus (to an extent) outages are taken as being somewhat unavoidable. Yet, while zero failures is indeed not possible, markedly increased reliability is certainly attainable. Look at the historical telecom service provider side, where five-nines reliability is the expected level of service. When expectations are high, and commensurate investment is made, higher levels of reliability are not at all out of reach.

As for monitoring solutions themselves, nowadays you don’t have to be limited to old-school systems. There are young companies, like Circonus, who have a fresh approach that breaks down the silos of stand-alone toolsets of the past.

Let’s take a step back now and visualize what outages look like from a datacenter ops team’s perspective, i.e. what happens when things ‘blow up’ in a datacenter. For the most part, it’s not external constituents such as clients that directly impact the datacenter. External clients touch the business units, and it’s then the business units that put the heat on the datacenter leaders.

And what about SLAs for keeping business units apprised of the benefit IT delivers to them? As I heard loud and clear at the conference, internal SLAs are for the most part useless. Why? Because they don’t mean much to the business units, who are only interested in “When are you going to get my service back up?!” In other words, this is a variation on “What have you done for me lately?”

So let’s look at an option for resolution. If the problem occurs on a virtual machine, you just spin up a new instance, right? Wrong, but that’s what usually happens. When a hammer dangling off a shelf hits you on the head, do you replace it with another dangling hammer and think you’ve solved the problem? Obviously, the thing to do in a datacenter is the work to avoid repetition of the issue (we’re talking root-cause analysis); otherwise you’re putting out fires repeatedly…the same fires.

Now, a good monitoring system is going to help, and in several ways. First, as just mentioned, it’s going to assist in identifying the underlying issue, including its location: is it in the app, the database, the server, etc.? You don’t want to do that by testing blindly; you’ll want the capability to create graphs on the fly, and you’ll similarly want to be able to correlate your metrics quickly and easily.

Okay, so that’s good for remediating a problem along with reducing the chance of it recurring, but you’ll also want to do anticipatory actions like capacity planning to forestall avoidable bottlenecks. For this you also want an easy-to-use tool so that you don’t have to muck around with spreadsheets. And you’ll want to be able to have a ‘play’ function so that when you do things such as code-pushes, you’ll be able to see in real-time the effect of these changes. This way, if the effect of the code-push is negative, you can quickly reverse the action without impacting your internal or external clients.

The good news is that new solutions with all these functionalities are out there in the marketplace. Of course, before you buy one, be sure to insist on testing the solution in a trial to see how it performs in your current and anticipated (read: hybrid physical and virtual/cloud) environments. This includes seeing how the solution handles your scale, both on the backend and from a UI perspective. Such an evaluation will require an investment of your time, but the result will be well worth it in the increased avoidance of outages and faster recovery from them.

Monitoring your Vitals During the Critical Holiday Retail Season

As with Brick & Mortar stores, the Holiday season is a critical time for many E-Commerce sites. Like their off-line brethren, these sites also see large increases in both traffic and revenue, sometimes substantially so. Of course, these changes in user behavior don’t just affect E-Commerce sites. Consider a social-networking site like Foursquare, where a person might normally check into 3 or 4 places a week; during the Holiday season that might double as they visit more stores and end up eating out more often while rushing between those stores. On an individual basis it doesn’t sound that significant, but if a large percentage of your user base doubles their traffic, you had better hope you have planned accordingly.

On the technical side, many sites will actually change their regular development process in order to handle these changes in user behavior. Starting early in November, many sites will stop rolling out new features and halt large projects that might be disruptive to the site or the underlying infrastructure. As focus shifts away from features, it most often turns back towards infrastructure and optimization. Adding new monitoring, from improved logging to new metrics and graphs, becomes critical as you seek a comprehensive view of your site’s operations so that you can better understand the changes in traffic that are happening, and hopefully be proactive about solving problems before they turn into outages.

Profiling and optimization work also receives more attention during this time. Studies continue to show that faster page loads and better website responsiveness correlate with increased revenue, and these areas can typically be improved without having to change how things behave. Bugfixes are also a popular target during these times, as corner cases are more likely to show up as traffic increases, especially if you tend to see new users as well as an increase in use by existing users.

This brings us to a good question: just what are you monitoring? For most shops there tend to be standard graphs that get generated for things like disk space or memory usage. These are good to have, but they only scratch the surface. Your operations staff probably knows all kinds of metrics about the system that they need to monitor, but how about your application developers? They should know the code that runs your site inside and out, so challenge them to find key metrics in your application stack that are important for their work. Maybe that’s messages delivered to a queuing system, or the time it takes to process the shipping costs module, or the responsiveness of a 3rd party API like Facebook or Twitter. But don’t stop there; everyone in your company should be asking themselves, “What analytics could I use to make better informed decisions?” For example, do you know if your increased traffic is due to new users or existing users? If you are monitoring new user sign ups, this will start to give you some insight. If you are doing E-Commerce, you should also be tracking revenue-related numbers. Those types of monitors are more business focused, but they are critical to everyone at your company. So much so that at Etsy, a top 100 website commonly known as “the world’s handmade marketplace”, they project these types of metrics right out in public.
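As a rough illustration of the kind of application-level metric a developer might add, here is a minimal Python sketch that times one piece of code and records the result. Everything in it is an assumption made for the example: the `timed` helper, the in-memory `timings` store, the metric name, and the `calculate_shipping` function with its made-up pricing are all hypothetical, and a real setup would send the measurements to your monitoring system rather than keep them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory store: metric name -> list of observed durations in seconds.
# A real setup would ship these to a monitoring system instead.
timings = defaultdict(list)


@contextmanager
def timed(metric_name):
    """Record how long the wrapped block takes under metric_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so failures are timed too.
        timings[metric_name].append(time.perf_counter() - start)


def calculate_shipping(order_weight_kg):
    """Hypothetical stand-in for a shipping costs module."""
    with timed("shipping.calculate_seconds"):
        return 5.00 + 1.25 * order_weight_kg


if __name__ == "__main__":
    calculate_shipping(4)
    print(dict(timings))
```

Wrapping a block in `timed(...)` keeps the measurement next to the code it describes, which makes it easy for the developers who know that code best to decide what is worth measuring.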

Ideally, once you have this type of information being logged, you can collect the information for analytical reports and historical trending via graphs. You want to be able to take the data you are collecting and correlate between metrics. Given a 10% increase in new users in the past week, we’ve seen a 15% spike in web server traffic. If we project those numbers out, can we make it through Black Friday? Cyber Monday? Will we make it all the way to New Year’s, or do we need to start provisioning new machines *NOW*? Or what happens if our business model changes, and we are required to live through a “Black Friday” event every day? That’s the kind of challenge that social shopping site Gilt faces, with its daily turnover of inventory. It’s worth saying that you won’t need all of this information in real time, but ideally you’ll be able to get a mix of real time, near-time (5-minute aggregated data is common), as well as daily analytical reports. Additionally, you should talk with your operations staff about which of these metrics are mission critical enough that you should be alerting on them, to make sure you have the operational and organizational focus that is appropriate.
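To make the “project those numbers out” question concrete, here is a small Python sketch. It assumes traffic keeps compounding at the same weekly rate, which is a simplification, and the figures in the usage example (a current peak of 400 requests per second against a capacity of 1,000) are invented for illustration; only the 15% weekly growth comes from the example above.

```python
def weeks_until_over_capacity(current_load, capacity, weekly_growth, max_weeks=520):
    """Return the first week in which projected load exceeds capacity.

    Assumes load compounds by weekly_growth (0.15 means 15%) every week.
    Returns 0 if load is already over capacity, and None if capacity is
    not exceeded within max_weeks (for example, when growth is zero).
    """
    if current_load > capacity:
        return 0
    if weekly_growth <= 0:
        return None
    load = current_load
    for week in range(1, max_weeks + 1):
        load *= 1 + weekly_growth
        if load > capacity:
            return week
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 400 req/s today, 1000 req/s capacity, 15% growth.
    print(weeks_until_over_capacity(400, 1000, 0.15))
```

With those made-up numbers the projected load stays under capacity through week 6 (about 925 requests per second) and passes it in week 7 (about 1,064). If that week lands before the end of your holiday season, it is time to start provisioning.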

While nothing beats preparation, even the best-laid plans need good feedback loops to be successful. Measuring, collecting, analyzing, and acting upon data as it comes into your organization is critical in today’s online environments. You may not be able to predict the future, but having solid monitoring systems in place will help you to recognize problems before they become critical, and help give you a “snowball’s chance” during the holiday season.

Template Web UI

Back in October we released the first version of our new Templating API, allowing you to easily replicate sets of bundles across multiple hosts. Now we bring you the time-saving sweetness of Templates in the web interface as well; if you have multiple servers that you want to monitor in exactly the same way, Templates are your friend. The idea behind them is pretty simple: you choose your master host, and select one or more of its check bundles to be used as master bundles. Then when you select your target hosts, the master bundles are copied and applied to the target hosts.

Creating A Template

So let’s look at how the Templating process works. Before you create a template, you first need to ensure that you have your master check bundles set up and active on your master host. Once that’s the case, start by going to the new “data” section of Circonus and visiting the “Templates” tab at the left. Create a new template via the “+” tab (or the “Create A Template” button in the middle of the page if you have no templates yet). In the resulting dialog, type a name for your template and choose a master host, and when you click “OK” you will see the templates table appear with a row for your new template (as usual, click the summary row to view the expanded details of the template).

When you first create a template, it’s in “draft” mode. This means that it’s only saved in your browser’s memory until you apply it. Nothing has been saved to the system yet, and the master bundles haven’t been replicated. This allows you to lay out templates and modify them or discard them before making any changes to the system. If you wish to save changes to a draft you may do so via the “Save” button; the draft is not applied as a regular Template until you click the “Apply” button. To aid in visually scanning the list of Templates for drafts, drafts will always appear at the top of your list, and will always be green. (If at any point you wish to change your template name or master host, you may click them in the summary row to edit them in-place. Please note: when changing your master host, you may only choose among the target hosts currently saved in the Template.)

Choosing Bundles

Once you’ve created your draft, you need to choose your master check bundles. Under the “Check Bundles” section at the left, click “Add Bundle” to bring up the new bundle dialog. All the bundles available for your master host will be shown here in a scrollable list. This is a selectable list, so when you select a bundle, it’s shown as selected in the list until you remove it from the Template. If you have a long list of bundles and are having a difficult time finding the ones you want, you may use the field above the list to filter the shown bundles by a filter string or regular expression (if you’re using a regular expression, don’t include the leading and trailing slashes, just use the desired RegEx syntax). After you have chosen a bundle, you may change its name by clicking on it in the list of chosen bundles. (Please Note: the reserved string “{target}” will be replaced by the current hostname/IP as the bundle is replicated across the target hosts.)
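For readers who want to picture the filtering and naming behavior described above, here is a minimal Python sketch. It is not the Circonus implementation: the function names, the case-insensitive matching, and the fallback to a literal match when the filter is not a valid regular expression are all assumptions made for the example. The only detail taken from the text is that “{target}” is replaced by the target hostname or IP.

```python
import re


def filter_bundles(bundle_names, filter_text):
    """Return the bundle names matching filter_text.

    The filter is tried as a regular expression (written without leading
    and trailing slashes). If it is not valid regex syntax, it is treated
    as a plain string instead. Matching is case-insensitive (an assumption).
    """
    if not filter_text:
        return list(bundle_names)
    try:
        pattern = re.compile(filter_text, re.IGNORECASE)
    except re.error:
        pattern = re.compile(re.escape(filter_text), re.IGNORECASE)
    return [name for name in bundle_names if pattern.search(name)]


def expand_bundle_name(name_template, target_host):
    """Replace the reserved string {target} with the target hostname or IP."""
    return name_template.replace("{target}", target_host)


if __name__ == "__main__":
    bundles = ["http {target}", "ping {target}", "disk usage"]
    chosen = filter_bundles(bundles, "^(http|ping)")
    print([expand_bundle_name(name, "10.0.0.5") for name in chosen])
```

Using `str.replace` rather than `str.format` for the substitution means other braces in a bundle name are left alone.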

Choosing Hosts

Choosing your target hosts works mostly the same way as choosing the master check bundles. The “Add Host” button brings up a dialog with a scrollable, selectable, filterable list of available hosts on your account, and you may choose one or more of those hosts. There is an additional feature, however, which is the “Enter a new host” field below the list. This allows you to enter new hosts (either IP addresses or domain names are acceptable) that aren’t currently used on your account. When you enter a new host and hit return/enter, the new host will be subject to a DNS check to ensure that it really exists; if it passes the DNS check, it will then be added to your list of target hosts.
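The DNS check on a newly entered host can be pictured with a short Python sketch like the one below. This is only an illustration of the idea using the standard library resolver; the function names and the rule of skipping blank or duplicate entries are assumptions, and the actual check Circonus performs may differ.

```python
import socket


def host_resolves(host):
    """Return True if host is an IP address or a name that resolves."""
    try:
        socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name could not be resolved.
        # UnicodeError: the name is malformed (e.g. an over-long label).
        return False
    return True


def add_target_host(targets, host, resolves=host_resolves):
    """Append host to targets if it is new and passes the check.

    Returns True if the host was added. The resolves parameter lets a
    caller (or a test) swap in a different check.
    """
    host = host.strip()
    if not host or host in targets or not resolves(host):
        return False
    targets.append(host)
    return True


if __name__ == "__main__":
    targets = []
    print(add_target_host(targets, "127.0.0.1"), targets)
```

IP addresses pass straight through `getaddrinfo` without a lookup, so the same check works for both kinds of input the field accepts.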

Once you’re satisfied with your bundle and host choices, clicking the “Apply” button will replicate your master bundles across each target host and will save the template in the database.

Modifying A Template

Once your Template is saved you will see several things change in the details panel. Each bundle and host will get checkboxes, and two “Action” dropdown selects will appear, one above the check bundles list and one above the target hosts list. Now that the bundles and hosts are a part of the template, if you wish to modify or remove them, you will need to check their checkboxes and choose an action from the appropriate dropdown before saving. There are four actions available:

The first action, when used on a bundle, will delete the target bundles and remove them from the Template. When used on a host, it will delete the host’s bundles and show the host as inactive in the host list.
The second action, when used on a bundle, will leave the target bundles in place but will break their synchronization with the template and show them as inactive in the bundle list. When used on a host, it will leave the host’s bundles in place but will break their synchronization with the template and show the host as inactive in the host list.
The third action, when used on a bundle, will deactivate the target bundles and show them as inactive in the bundle list. When used on a host, it will deactivate the host’s bundles and show the host as inactive in the host list.
The fourth action, when used on a bundle, will reactivate, rebind, or recreate target bundles as necessary, to restore them to active status and synchronization with the template. When used on a host, it will reactivate, rebind, or recreate the host’s bundles as necessary, to restore them to active status and synchronization with the template.

Staying In-Sync

After creating and applying a Template, you are still allowed to edit the master check bundles. If you do so, any Templates using those check bundles as master bundles will be out-of-sync. When you go to your Templates page, the out-of-sync Templates will have their sync buttons activated and the buttons will say “Re-Sync.” Simply click the “Re-Sync” button to replicate the bundle changes across all the target bundles, and the Template will be in-sync again.
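One way to picture how a template could be recognized as out-of-sync is to compare each master bundle’s current configuration against a fingerprint recorded at the last sync. The Python sketch below is purely illustrative and makes no claim about how Circonus detects changes internally; the function names, the bundle configuration fields, and the hashing approach are all assumptions, and the actual copying of changes to target bundles is left out.

```python
import hashlib
import json


def bundle_fingerprint(bundle_config):
    """Return a stable hash of a bundle's configuration dictionary."""
    canonical = json.dumps(bundle_config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_resync(master_bundles, synced_fingerprints):
    """Return names of master bundles that changed since the last sync."""
    return [
        name
        for name, config in master_bundles.items()
        if synced_fingerprints.get(name) != bundle_fingerprint(config)
    ]


def resync(master_bundles, synced_fingerprints):
    """Record the current state of every changed bundle as synced.

    A real system would also replicate the changes to the target
    bundles here; this sketch only updates the bookkeeping.
    """
    changed = needs_resync(master_bundles, synced_fingerprints)
    for name in changed:
        synced_fingerprints[name] = bundle_fingerprint(master_bundles[name])
    return changed


if __name__ == "__main__":
    master = {"http": {"period": 60, "timeout": 10}}  # hypothetical fields
    synced = {}
    resync(master, synced)
    master["http"]["period"] = 30  # edit the master bundle
    print(needs_resync(master, synced))
```

Serializing with `sort_keys=True` keeps the fingerprint stable regardless of the order in which configuration keys were set.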

(Please Note: if at any point you wish to delete the template, any active bundles that are still a part of the template will be deleted from the target hosts. If you wish to keep the bundles on the target hosts but just delete the template, you will need to unbind all the bundles you wish to keep on the target hosts and then delete the template.)