Annotating Alerts and Recoveries

In the last couple of posts, Brian introduced our new WebHook notifications feature and I demonstrated how Circonus can graph text metrics for Visualizing Regressions. Both features are interesting on their own, but let’s not stop there. Today I have an easy demonstration of how to re-import your alert information into your trends. The end goal is an annotation on your graphs that helps you identify, at a glance, which alerts correspond with which anomalies.

First, let’s set up a WebHook notification in our Circonus account profile. Choose the contact group that it should belong to, or create a new contact group specifically for this exercise. Type the URL where you want Circonus to POST your alert details in the custom contact field and press Enter to save the new contact.

Now we need something to receive the webhook. For this example I have a simple Perl CGI script that listens for the POST notification, parses the contents, and writes out Circonus-compatible XML. It doesn’t matter which language you use, as long as you can extract the necessary information and write it back out in the correct XML format (Resmon DTD).

#!/usr/bin/perl
#
# alert.cgi

use strict;
use warnings;
use CGI;
use HTML::Template;

my $cgi = CGI->new;
my $template = HTML::Template->new(
  filename => 'resmon.tmpl',
  die_on_bad_params => 0,
  # escape &, < and > in values so the generated XML stays well-formed
  default_escape => 'HTML'
);

# check for existence of alerts from webhook POST
if ($cgi->param('alert_id')) {

  # open XML output for writing
  open (OUT, '>', '/path/to/alert.xml') or
    die "unable to write to file: $!";

  # loop through alerts
  for my $alert_id ($cgi->param('alert_id')) {

    # check for valid alert id format
    if ($alert_id =~ /^\d+$/) {

      # craft our XML content
      # scalar() keeps a missing field from shifting the key/value pairs
      $template->param(
        last_update => time,
        alert_id => $alert_id,
        account_name => scalar $cgi->param('account_name'),
        check_name => scalar $cgi->param("check_name_${alert_id}"),
        metric_name => scalar $cgi->param("metric_name_${alert_id}"),
        agent => scalar $cgi->param("agent_${alert_id}"),
        severity => scalar $cgi->param("severity_${alert_id}"),
        alert_url => scalar $cgi->param("alert_url_${alert_id}"),
      );

      # only print RECOVERY if available
      if ($cgi->param("clear_time_${alert_id}")) {
        $template->param(
          clear_time => scalar $cgi->param("clear_time_${alert_id}"),
          clear_value => scalar $cgi->param("clear_value_${alert_id}"),
        );

      # otherwise print ALERT details
      } else {
        $template->param(
          alert_time => scalar $cgi->param("alert_time_${alert_id}"),
          alert_value => scalar $cgi->param("alert_value_${alert_id}"),
        );
      }
    }
  }

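  print OUT $template->output;
  close (OUT) or die "unable to close file: $!";
}

# acknowledge the POST so the web server returns a valid response
print $cgi->header('text/plain'), "OK\n";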
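
The per-alert field naming that alert.cgi relies on (check_name_<id>, severity_<id>, and so on) can be exercised without a web server. What follows is a minimal sketch rather than part of the script above: extract_alerts is a hypothetical helper that takes the POST fields as a plain hash reference, with alert_id as an array reference, and returns one record per alert. The sample field values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the per-alert fields from a webhook POST.
# $params maps field names to values; alert_id holds an array reference
# because a single notification may carry several alerts.
sub extract_alerts {
    my ($params) = @_;
    my @fields = qw(check_name metric_name agent severity alert_url
                    alert_time alert_value clear_time clear_value);
    my @alerts;

    for my $alert_id (@{ $params->{alert_id} || [] }) {
        # skip anything that is not a purely numeric id
        next unless $alert_id =~ /^\d+$/;

        my %alert = (
            alert_id     => $alert_id,
            account_name => $params->{account_name},
        );
        for my $field (@fields) {
            my $value = $params->{"${field}_${alert_id}"};
            $alert{$field} = $value if defined $value;
        }

        # a clear_time field marks a recovery, as in alert.cgi
        $alert{state} = exists $alert{clear_time} ? 'RECOVERY' : 'ALERT';
        push @alerts, \%alert;
    }
    return @alerts;
}

# Made-up sample data, for illustration only
my %post = (
    alert_id        => [ 101, 102 ],
    account_name    => 'example',
    check_name_101  => 'web check',
    metric_name_101 => 'duration',
    alert_time_101  => 'sample alert time',
    alert_value_101 => '1500',
    check_name_102  => 'db check',
    metric_name_102 => 'connections',
    clear_time_102  => 'sample clear time',
    clear_value_102 => '12',
);

printf "%s: %s (%s)\n", $_->{alert_id}, $_->{check_name}, $_->{state}
    for extract_alerts(\%post);
```

Keeping one record per alert also makes it easier to render each alert separately if a single notification carries more than one.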

Here is the template file used for the XML output.

<!-- resmon.tmpl -->
<ResmonResults>
  <ResmonResult module="ALERT" service="aarp_web">
    <last_runtime_seconds>0.000238</last_runtime_seconds>
    <last_update><TMPL_VAR name="last_update"></last_update>
    <metric name="account_name" type="s">
      <TMPL_VAR name="account_name">
    </metric>
    <metric name="alert_id" type="s">
      <TMPL_VAR name="alert_id">
    </metric>
  <TMPL_IF name="alert_value">
    <metric name="message" type="s">
      <TMPL_VAR name="check_name">`<TMPL_VAR name="metric_name"> 
      alerted <TMPL_VAR name="alert_value"> from <TMPL_VAR name="agent">
      at <TMPL_VAR name="alert_time"> (sev <TMPL_VAR name="severity">)
    </metric>
  </TMPL_IF>
  <TMPL_IF name="clear_value">
    <metric name="message" type="s">
      <TMPL_VAR name="check_name">`<TMPL_VAR name="metric_name"> 
      cleared <TMPL_VAR name="clear_value"> from <TMPL_VAR name="agent"> 
      at <TMPL_VAR name="clear_time"> (sev <TMPL_VAR name="severity">)
    </metric>
  </TMPL_IF>
    <metric name="alert_url" type="s">
      <TMPL_VAR name="alert_url">
    </metric>
  </ResmonResult>
</ResmonResults>

When everything is running live, the alert.cgi script will accept webhook POST notifications from Circonus and write the alert details out to /path/to/alert.xml. This file should be available over HTTP so that we can import it back into Circonus using the Resmon check. Once you’ve begun capturing this data you can add it to any graph, just like any other metric.
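
Because the Resmon check may fetch alert.xml at the same moment the CGI script is rewriting it, it can help to write the file to a temporary name and rename it into place. The following is a minimal sketch under that assumption and is not how the script above works. write_atomically and the alert-XXXXXX template are hypothetical names, and the 0644 mode assumes the web server needs world-readable files.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile tempdir);

# Hypothetical helper: write $content to $path via a temporary file in the
# same directory, then rename it into place so a reader never sees a
# half-written file.
sub write_atomically {
    my ($path, $content) = @_;

    # the temp file must live on the same filesystem for rename to be atomic
    my ($fh, $tmp) = tempfile('alert-XXXXXX', DIR => dirname($path));
    print {$fh} $content or die "unable to write to $tmp: $!";
    close($fh)           or die "unable to close $tmp: $!";

    # tempfile creates the file with mode 0600; open it up for the web server
    chmod 0644, $tmp   or die "unable to chmod $tmp: $!";
    rename $tmp, $path or die "unable to rename $tmp to $path: $!";
    return $path;
}

# Demonstration in a scratch directory that is removed on exit
my $dir = tempdir(CLEANUP => 1);
write_atomically("$dir/alert.xml", "<ResmonResults></ResmonResults>\n");
```

In alert.cgi this would stand in for the open/print/close sequence, for example write_atomically('/path/to/alert.xml', $template->output).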

This might take you 30 minutes to set up the first time, but once you have it, this data can be really useful for troubleshooting or root cause analysis. We plan to add native support for alert annotations within Circonus over the next few months, but this is a handy workaround to have until then.