Finding Needles in a Worksheet

Traditional graphing tools can help you plan for growth or even narrow down root causes after a failure. But they have a reputation for being difficult to set up, navigate, or customize. It’s nice to be able to just point Cacti at some switches or routers and have it gracefully poll each device for SNMP data. Yet when you need a custom perspective on the data (or collections of data), setting up the templates and graphs can be an arduous experience.

When we started to engineer Reconnoiter into a SaaS offering, one of the major driving forces was a desire to not suck like the others. Like you, we don’t understand why it has to be so damn hard (or require a dedicated IT staff) to take a handful of data points and correlate them into graphs that make sense of the noise. I like to think we’ve been successful. Customers have been overwhelmingly positive about our efforts, calling it “a graph nerd’s paradise”. Even still, we eat our own dog food and are constantly revisiting the service to look for better ways to get our work done. This is why we’re working hard on upcoming features like Graph Overlays and Timeline Annotations. And it’s also why we made recent changes to the workflow for graphs and worksheets.

If you’re a Circonus user, you already know how easy it is to create and view graphs. Adding them to worksheets gives you a page full of data to compare and relate. Choose a zoom preset (2 days, 2 weeks, etc) or select a date range, and all of the thumbnails are instantly redrawn in unison. It might sound basic, but it can be very useful if you’re not sure what you’re looking for. Unexpected patterns jump out at you pretty quickly.

However, most of the time you want to work with a single graph. Clicking on a thumbnail previously loaded a graph in “lightbox” view, hiding all other graphs from sight and letting you focus on the work at hand. This worked well most of the time, but had one big drawback… you couldn’t (easily) bookmark it. So we’ve moved the default view into its own page, sans lightbox, that can be bookmarked and shared with others. Miss the lightbox view? No worries, we’ve kept that as the new preview mode. Try it out in a worksheet for “flickr-style” navigation.

Here’s a short video I threw together to demonstrate some of these changes. There was some audio lag introduced by the YouTube processing, but it should be easy enough to follow along. If you’d like to see more examples like this one, shoot us an email and we’ll try to keep them coming.

Access Tokens with the Circonus API

When we rolled out our initial API months ago, we took a first stab at getting the most useful features exposed to help customers get up to speed with the service. A handful of our users expressed displeasure with having to use their login credentials for basic access to the management API. Starting today, we’re pleased to announce support for access tokens within the Circonus API.

Tokens offer fine-grained access for each user to a specific service account, at your permission role or lower. For example, if Bob is a normal user on the Acme Inc. account, he can create tokens allowing normal or read-only access. Multiple applications can use the same token, but each application has to be approved by Bob in the token management page, diabolically named My Tokens. To get started, browse over to this page inside your user profile, select your account from the drop-down and click the “plus tab” to create your first token.

The first time you try to connect with a new application using your token, the API service will hand back an HTTP/1.1 401 Authorization Required response. When you visit the My Tokens page again, you’ll see a button to approve the new application-token request. Once it has been approved, you’ll be able to connect to the API with your new application-token.

Using the token is even easier. Just pass the token as X-Circonus-Auth-Token and your application name as X-Circonus-App-Name in your request headers. Here’s a basic example using curl from the command line (the API endpoint URL is omitted here):

$ curl -H "X-Circonus-Auth-Token: ec45e8a2-d6d9-624c-c21c-a83f573731c1" \
       -H "X-Circonus-App-Name: testapp" \
       <API endpoint URL>
{
   "account_description":"Monitoring for The Social Network.",
   "account_name":"Social Networks"
}

One of the more convenient features of our tokens is how well they integrate with user roles. A token will never have higher access permissions than its owner. In fact, if you lower a user’s role on your account, their tokens automatically reflect this as well. Changing a “normal” user to “read-only” drops their tokens to the same access level. But if you restore their original role, the tokens will also have their original privileges restored. Secure and convenient.
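To make that interplay concrete, here’s a minimal sketch of the capping rule (Python, purely illustrative — the role names and ordering below are just the two roles discussed above, not Circonus internals):

```python
# Illustrative sketch: a token's effective access is capped by its
# owner's current role. Role names and ordering are assumptions.
ROLES = ["read-only", "normal"]  # least to most privileged

def effective_access(token_role, owner_role):
    """Return the lesser of the token's role and its owner's role."""
    return min(token_role, owner_role, key=ROLES.index)

# Bob's "normal" token is capped while he is read-only...
assert effective_access("normal", "read-only") == "read-only"
# ...and regains full strength when his original role is restored.
assert effective_access("normal", "normal") == "normal"
```

The point of the `min` over role rank is that nothing about the token itself changes when a user is demoted and later restored; only the owner’s side of the comparison moves.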

If you have any questions about our new API tokens or would like to see more examples with the Circonus API, drop us a line at

Annotating Alerts and Recoveries

In the last couple of posts, Brian introduced our new WebHook notifications feature and I demonstrated how Circonus can graph text metrics for Visualizing Regressions. Both of these features are interesting enough on their own, but let’s not stop there. Today I have an easy demonstration showing how you can re-import your alert information to your trends. The end goal is an annotation on our graph that can be used to help identify, at a glance, which alert(s) correspond with anomalies on your graphs.

First, let’s set up a WebHook notification in our Circonus account profile. Choose the contact group that it should belong to, or create a new contact group specifically for this exercise. Type the URL where you want to POST your alert details in the custom contact field and hit enter to save the new contact.

Now we need something to act as a recipient for our webhook. For this example I have a simple Perl CGI script that listens for the POST notification, parses the contents, and writes out Circonus-compatible XML. It doesn’t matter which language you use, as long as you can extract the necessary information and write it back out in the correct XML format (Resmon DTD).

# alert.cgi

use strict;
use warnings;
use CGI;
use HTML::Template;

my $cgi = CGI->new;
my $template = HTML::Template->new(
  filename => 'resmon.tmpl',
  die_on_bad_params => 0
);

# check for existence of alerts from webhook POST
if ($cgi->param('alert_id')) {

  # open XML output for writing
  open (OUT, ">/path/to/alert.xml") ||
    die "unable to write to file: $!";

  # loop through alerts
  for my $alert_id ($cgi->param('alert_id')) {

    # check for valid alert id format
    if ($alert_id =~ /^\d+$/) {

      # craft our XML content
      $template->param(
        last_update  => time,
        alert_id     => $alert_id,
        account_name => $cgi->param('account_name'),
        check_name   => $cgi->param("check_name_${alert_id}"),
        metric_name  => $cgi->param("metric_name_${alert_id}"),
        agent        => $cgi->param("agent_${alert_id}"),
        severity     => $cgi->param("severity_${alert_id}"),
        alert_url    => $cgi->param("alert_url_${alert_id}"),
      );

      # only print RECOVERY if available
      if ($cgi->param("clear_time_${alert_id}")) {
        $template->param(
          clear_time  => $cgi->param("clear_time_${alert_id}"),
          clear_value => $cgi->param("clear_value_${alert_id}"),
        );

      # otherwise print ALERT details
      } else {
        $template->param(
          alert_time  => $cgi->param("alert_time_${alert_id}"),
          alert_value => $cgi->param("alert_value_${alert_id}"),
        );
      }

      print OUT $template->output;
    }
  }

  close (OUT);
}

# acknowledge the webhook POST
print $cgi->header('text/plain'), "OK\n";

Here is the template file used for the XML output.

<!-- resmon.tmpl -->
<ResmonResult module="ALERT" service="aarp_web">
  <last_update><TMPL_VAR name="last_update"></last_update>
  <metric name="account_name" type="s">
    <TMPL_VAR name="account_name">
  </metric>
  <metric name="alert_id" type="s">
    <TMPL_VAR name="alert_id">
  </metric>
  <TMPL_IF name="alert_value">
    <metric name="message" type="s">
      <TMPL_VAR name="check_name">`<TMPL_VAR name="metric_name">
      alerted <TMPL_VAR name="alert_value"> from <TMPL_VAR name="agent">
      at <TMPL_VAR name="alert_time"> (sev <TMPL_VAR name="severity">)
    </metric>
  </TMPL_IF>
  <TMPL_IF name="clear_value">
    <metric name="message" type="s">
      <TMPL_VAR name="check_name">`<TMPL_VAR name="metric_name">
      cleared <TMPL_VAR name="clear_value"> from <TMPL_VAR name="agent">
      at <TMPL_VAR name="clear_time"> (sev <TMPL_VAR name="severity">)
    </metric>
  </TMPL_IF>
  <metric name="alert_url" type="s">
    <TMPL_VAR name="alert_url">
  </metric>
</ResmonResult>

When everything is running live, the alert.cgi script will accept webhook POST notifications from Circonus and write the alert details out to /path/to/alert.xml. This file should be available over HTTP so that we can import it back into Circonus using the Resmon check. Once you’ve begun capturing this data you can add it to any graph, just like any other metric.

This might take you 30 minutes to set up the first time. But once you have it, this data can be really useful for troubleshooting or Root Cause Analysis. We plan to add native support for alert annotations within Circonus over the next few months, but this is a handy workaround to have until then.

Visualizing Regressions

We’ve heard a lot of talk about Continuous Deployment strategies over the last 12-18 months. Timothy Fitz was one of the earliest proponents, publishing stories of his success at IMVU last year. One of the greatest benefits of continually pushing your changes to production is that it takes less time and effort to find bugs when something goes wrong, since you have fewer commits in-between to navigate. But even with this style of release management, it helps to know which versions of code are running live on your components at any point. What happens when your newest code is enough to alter the normal behavior of the system, but not so drastic as to trigger an alert?

One of the nicer trending features in Circonus (or its open-source relative, Reconnoiter) is the ability to correlate unrelated datasets. I can take any collection of metrics on my account and group them together on a single graph. But what if you could view isolated events on the same graph, as an orthogonal data point? Check out these two graphs displaying some recent activity on one of our fault detection systems. The vertical lines represent the point at which a text metric’s value changed. Circonus renders them this way so you can easily recognize that specific moment in time.


In the first graph, I’m hovering over a dip in performance caused by the most recent release to that component (svn r6230). In the second graph, we’re running a fix (svn r6232) for the regression introduced in the previous commit. Could I have done the same level of correlation manually? Of course, but it’s nice to be able to zoom out and study the long-term effects of our release strategy on our overall stability. This is an enormously helpful tool for performing Root Cause Analysis on our live systems, especially if you perform releases many times a week (like we do). If you’re one of many using automation and Configuration Management suites like Puppet, Chef and the Marionette Collective, no doubt you’ll find it even more useful.

If you’d like to start trending your own text metrics, check out the Resmon DTD. Circonus can pull in your custom metrics in this format. Although the version numbers I mentioned earlier look like integers (well, they are integers), I can explicitly cast them as a string metric using the Resmon DTD. Here is what that might look like:

  <ResmonResult module="Site::CircProd" service="vers">
    <metric name="ernie" type="s">6297</metric>
  </ResmonResult>
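If you’re generating that XML from a script, a minimal sketch might look like this (Python here purely for illustration — any language that can print a string will do, and the function name is made up):

```python
# Illustrative sketch: emit a version number as a Resmon string metric.
# Casting type="s" tells Circonus to treat the value as text, not a number.
from xml.sax.saxutils import escape, quoteattr

def version_metric(module, service, host, revision):
    """Render one host's current revision as Resmon-style XML."""
    return (
        f"<ResmonResult module={quoteattr(module)} service={quoteattr(service)}>\n"
        f'  <metric name={quoteattr(host)} type="s">{escape(str(revision))}</metric>\n'
        f"</ResmonResult>"
    )

print(version_metric("Site::CircProd", "vers", "ernie", 6297))
```

Serve the result over HTTP and point a Resmon check at it, and the revision string trends alongside your numeric metrics.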

As you might imagine, you can get pretty creative with the sort of data you can pull into Circonus. In our next post I plan to look at how you can combine WebHook Notifications (that Brian announced last week) with these text metrics to start trending your alert history. Stay tuned!

WebHook Notifications

This week we added support for webhook notifications in Circonus. For those who are unsure what a webhook is, it’s simply an HTTP POST with all the information about an alert that you would normally get via email, XMPP or AIM.

Webhooks can be added to any contact group. Unlike other methods, you can’t add one to an individual user and then add that user to a group; however, this might be supported in the future based on feedback. Simply go to your account profile, click on the field “Type to Add New Contact” on the group you would like to add the hook to, and enter the URL you would like us to contact. The contact type will then display as your URL with the method HTTP (for brevity).

Now that your hook is set up, what will it look like when the data is posted to you? Here is a Perl Data::Dumper example, grouped by alert for readability, of the parameters posted for two alerts:

%post = (
   'alert_id' => [ '21190', '21191' ],
   'account_name' => 'My Account',
   'severity_21190' => '1',
   'metric_name_21190' => 'A',
   'check_name_21190' => 'My Check',
   'agent_21190' => 'Ashburn, VA, US',
   'alert_value_21190' => '91.0',
   'clear_value_21190' => '0.0',
   'alert_time_21190' => 'Thu, 21 Oct 2010 16:35:49',
   'clear_time_21190' => 'Thu, 21 Oct 2010 16:36:49',
   'alert_url_21190' => '...',
   'severity_21191' => '1',
   'metric_name_21191' => 'B',
   'check_name_21191' => 'My Other Check',
   'agent_21191' => 'Ashburn, VA, US',
   'alert_value_21191' => '91.0',
   'alert_time_21191' => 'Thu, 21 Oct 2010 16:36:21',
   'alert_url_21191' => '...',
);

So let’s look at what we have here. The first thing to notice is that we pass multiple alert_id parameters, giving you the ID of each alert in the payload. From there, every other parameter is suffixed with _<alert_id> so you know which alert that parameter is associated with. In this example, 21190 is a recovery and 21191 is an alert; recoveries get the additional parameters clear_value and clear_time.
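The suffix scheme above is easy to unpack mechanically. Here’s a minimal sketch (Python, purely illustrative — not Circonus code) that folds the flat parameters into one record per alert and classifies each as an alert or a recovery:

```python
# Illustrative sketch: group the flat, _<alert_id>-suffixed webhook
# parameters into one dict per alert; clear_time marks a recovery.
def group_alerts(post):
    alerts = {}
    for alert_id in post.get("alert_id", []):
        suffix = f"_{alert_id}"
        fields = {
            key[: -len(suffix)]: value
            for key, value in post.items()
            if key.endswith(suffix)
        }
        fields["kind"] = "recovery" if "clear_time" in fields else "alert"
        alerts[alert_id] = fields
    return alerts

post = {
    "alert_id": ["21190", "21191"],
    "severity_21190": "1",
    "clear_time_21190": "Thu, 21 Oct 2010 16:36:49",
    "severity_21191": "1",
    "alert_time_21191": "Thu, 21 Oct 2010 16:36:21",
}
grouped = group_alerts(post)
assert grouped["21190"]["kind"] == "recovery"
assert grouped["21191"]["kind"] == "alert"
```

From a structure like this, dispatching to a pager, a chat bot, or a Resmon XML writer is straightforward.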

Webhooks open up all sorts of possibilities both inside and outside of Circonus. Maybe you have a crazy complicated paging schedule, or prefer a contact method that we don’t natively support yet. Fair enough: let us post the data to you and you can integrate it however you like. Want to graph your alerts? We are working on a way to overlay alerts on any graph, but in the meantime you can set up your webhook and feed the data back to Circonus via Resmon XML; now you have data for your graphs.

If you are curious about other features and would like to see an in-depth post on them, please contact us at

Good Times in Charm City

It’s been a while since I had time to enjoy the technical conference scene. Thanks to my involvement with Circonus, I have plenty of action scheduled between RailsConf, Velocity and the Surge Scalability Conference. We attended RailsConf in Baltimore a couple weeks ago and had a great time. Circonus had an exhibition booth and we gave out tons of demonstrations, free swag and t-shirts. But the best part of any con is catching up with old friends and making new ones.

I finally met Mark Imbriaco of 37signals in person. Mark has been a valued user for us, giving plenty of awesome feedback during the beta and after our production launch. If you haven’t seen it already, check out Mark’s interview on He offers a lot of insight into 37signals’ operations and architecture. Good stuff.

Last but not least, a nice relationship blossomed out of our participation at RailsConf. I’ve been aware of the RPM service over at NewRelic for a while now. Although they sometimes market it as monitoring software for Rails, a more apt description would be to call it a kickass profiling tool for Ruby and Java applications. It’s very useful for tracking down performance issues within your application code. But what happens when the problem isn’t in your source code… or maybe you’re just not sure? Fortunately for NewRelic RPM users, the solution just became very clear.

We recently rolled out support for importing your NewRelic RPM metrics directly into Circonus! All of the application statistics available over NewRelic’s data API are now easily accessible inside your Circonus account. Correlate your application CPU and Response Time with HTTP first-byte and total duration. See the impact of optimizations on your end user experience! All you need to create your first RPM check in Circonus is the account id and license key (available in your application’s newrelic.yml).

Ironically, the one question we kept hearing over and over again at RailsConf was:

How is Circonus different from NewRelic?

The answer is simple; we’re perfect complements. Circonus offers a holistic view of your networks, architectures, systems and services. NewRelic RPM provides a detailed view of your application internals. Both support real-time analysis of their individual focus areas. It’s really the perfect monitoring combination for any serious Rails shop.

If you’d like more information on Circonus or how it can support your architecture, shoot us a line or stop by Booth 103 at Velocity this week. We still have plenty of black t-shirts to give away. 🙂

Circonus at Velocity 2010

Hot on the heels of our RailsConf ticket giveaway, we have another contest for a free pass to Velocity 2010! I’m really excited to attend this year’s Velocity. It’s the Web Performance event to attend, and a great place to see the sharpest whips in the industry.

Like before, the rules of this giveaway are simple. Just tweet a message about Circonus being at Velocity and ask your friends to retweet it. The original "twitterer" with the most retweets by Friday, June 14 at noon (12pm EDT) wins. Here’s an example:

The @Circonus stuff is hot and it looks like they’ll be at #velocityconf this year:

That’s an easy way to earn a free 2010 Velocity sessions pass ($1295 value). Feel free to get creative with your tweet message. Our only requirements are that it’s a positive message that mentions @Circonus and #velocityconf, and that it includes the link.

Yay, free stuff!

Circonus at RailsConf 2010

We’re anxious to meet and greet everyone at RailsConf next month in Baltimore. This will be our first conference appearance since the production launch. Some of our customers, including 37signals, will be visiting Charm City for this big event. I’m excited to see so many talented Web developers and operations folk in one conference. Having it in our hometown is icing on the cake.

As if that wasn’t enough, we have a couple of fun things to announce. First, Circonus will be giving away a free RailsConf sessions pass! All you have to do is tweet a message about Circonus at RailsConf to your friends and ask them to retweet it for you. The individual with the most retweets by noon (12pm EDT) on Monday, May 31, 2010 wins. Here’s an example tweet:

The @Circonus stuff is hot and it looks like they’ll be at #railsconf this year:

If you’re keeping score at home, that’s a free 2010 RailsConf sessions pass ($795 value) for the price of a few clicks. Feel free to get creative with your tweet message. Our only requirements are that it’s a positive message that mentions @Circonus and #railsconf, and that it includes the link.

Why are you still reading this? Go off and start tweeting for your free RailsConf pass (Conference Sessions Only).

See you in Baltimore!

Your Visitors Don’t Matter

Consider me old-fashioned, but I remember a time when an alert notification meant something. Drives failed, servers ran short on memory, or a cage monkey pulled the wrong cable at 3 A.M. Regardless of the circumstance, it demanded attention. Those were the days.

Today, operations is all about doing more with less. No more dedicated hardware or late-night maintenance windows. Everything is virtual, cloud-based, or filling up squares in the grid. Automation reigns supreme, limitless scalability at our disposal. Abstraction at its finest.

But woe unto you, the flapping anomaly.

That visitor who tried to load your website was turned away, timed out and left to wither. Poor Jane wanted to view your site. She needed to view your site. She’d already submitted her order, only to be ignored. Forgotten. Disconnected with nary a trace to route nor a cookie to favor.

Jane was a victim of a numbers game. Someone, somewhere, decided that some problems don’t matter. Which ones? Who cares? They don’t matter. And because she happened to visit when this problem reared its head, you ignored her request. Who would ever make the silly presumption that one failure is less important than another? What criteria are used to determine the worthiness of this alert or that one? Pure random circumstance, it would appear.

Many “uptime” services and monitoring suites promote the concept of selective or flapping failures. Vendors sell these features as a convenience, ostensibly as a sleep aid. The administrator’s snooze-bar. I can’t think of any other reason that ignoring a faulty condition would be considered a good thing. Perhaps they reason that only the check is affected. If it responds after the third attempt, it was probably ok for visitors all along. Right?

It’s disappointing how many vendors embrace this broken methodology. It probably seemed innocent at a glance. But the damage has been done; recklessness has taken root. We’ve been conditioned to accept these transient malfunctions as mere operational speed bumps. Rather than address the problem, we nudge the threshold a tad higher. Throw additional nodes into the cluster. Increase capacity, while decreasing exposure.

But there is a more responsible alternative. What ever happened to purposeful, iterative corrections and Root Cause Analysis? Notifications may be annoying at times, but they serve a crucial function in a healthy production architecture. Ignored alerts lead to stagnant bugs, lost traffic and missed opportunities. Stop treating your visitors like they don’t matter. There’s no such thing as a flapping customer.