Site Maintenance Oct 23rd, 2017 10:00 EDT

We will be performing site-wide maintenance on Monday, October 23rd, 2017 at 10:00am EDT (14:00 UTC). This maintenance is expected to last 15 minutes. During this window, the UI and API will be unavailable as we fail over to a new primary datacenter. This maintenance also includes the promotion of a new primary DB and the movement of alerting services.

Over the past few weeks, we have spun up data collection to this new DC, and have been serving graph, worksheet, and dashboard data from it. During the maintenance window, data collection will continue uninterrupted. There will be an alerting outage as we switch services to their new home. Alerts that would have fired in the window will fire when we come out of maintenance; alerts that would have cleared will clear when we come out of maintenance.

Please double-check that our IPs listed in out.circonus.net are permitted through any firewalls, especially if you have rules to permit webhook notifications.

We expect no major issues with this move. If you have any questions, please contact our support at support@circonus.com for further clarification.

Customizable Alerts and Ruleset Groups

Today we are releasing two new features to make your on-call life easier.

Customizable Alerts

Our default format was created and modified over the years based on user feedback, but of course it was never going to make everyone happy. How do we solve this? By letting you create your own alerts! If you head to your contact groups page, you will notice a new checkbox to “enable custom alert formats”. Check that and some new fields will appear that let you modify both the long and short format bodies, add a summary for each type, and even change the subject line of your emails.

Alerts are customized using macros and conditionals. Macros take the form {alert_id} and are replaced with the appropriate values. Conditionals look like %(cleared != null) Cleared: {cleared}%; in this example, if the alert has cleared you will get a line like Cleared: Mon, 15 July 2013 11:49:58 on your alert, and if it has not cleared the line will not be added.
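To make the mechanics concrete, here is a minimal Python sketch of how macro and conditional expansion along these lines could work. The {macro} and %(field != null) ... % syntax comes from the description above; the expander function and the sample data are illustrative, not the actual Circonus implementation.

import re

def expand(template, values):
    # Keep a conditional's body only when the named field is present (non-null)
    def conditional(match):
        field, body = match.group(1), match.group(2)
        return body if values.get(field) is not None else ""
    out = re.sub(r"%\((\w+) != null\)(.*?)%", conditional, template)
    # Replace the remaining {macro} tokens with their values
    return re.sub(r"\{(\w+)\}", lambda m: str(values.get(m.group(1), "")), out)

alert = {"severity": 2, "cleared": "Mon, 15 July 2013 11:49:58"}
print(expand("Severity {severity}%(cleared != null) Cleared: {cleared}%", alert))
# -> Severity 2 Cleared: Mon, 15 July 2013 11:49:58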

As a further example, the current alerts in the new customizable format would look like this:

Subject:
[{account}] Severity {severity} {status} {check_name}

Body:
Account: {account}

{status} Severity {severity} %(new_severity != null) [new Severity {new_severity}]%
Check: {check_name}
Host: {host}
Metric: {metric_name} ({value})
Agent: {broker_name}
Occurred: {occurred}
%(cleared != null) Cleared: {cleared}%
{link}
{metric_link}
{metric_notes}

The ? help bubble beside the Alert Formats section header has a full list of the macros available for use. Our default alert format will be changing slightly as well: we are going to put your account name in the subject line instead of Circonus, and we will be adding the metric notes to the body.

Ruleset Groups

Another feature we often get asked for is a way to alert only when, say, 3 out of 5 servers are down, or only when a CPU spike coincides with a rise in memory. To make this a reality, we’ve added the concept of “rule groups”, located under Alerts -> Groups. These groups take rulesets you’ve already created, like cpu > 80 or an HTTP 500 response code, and let you combine as many as you would like into a group. You then define a formula on which the group alerts: either a threshold (“when X of Y rules are in alert, trigger a sev 1”) or an expression (“when (A and B) or C alerts, trigger a sev 2”).
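To illustrate the two kinds of formulas, here is a small Python sketch of how a threshold (“X of Y”) and an expression (“(A and B) or C”) could be evaluated against the current alert state of the rulesets in a group. The ruleset names and helper functions are hypothetical; the real evaluation happens inside Circonus.

def threshold_severity(states, x, sev):
    # Trigger `sev` when at least `x` of the rulesets in the group are alerting
    return sev if sum(states.values()) >= x else None

def expression_severity(states, sev):
    # Trigger `sev` when (A and B) or C is alerting
    return sev if (states["A"] and states["B"]) or states["C"] else None

states = {"A": True, "B": False, "C": True}    # current alert state per ruleset
print(threshold_severity(states, x=2, sev=1))  # 1: two of the three rulesets are alerting
print(expression_severity(states, sev=2))      # 2: C alone satisfies the expression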

Let’s look at a complete example. I’ve created 3 checks, one for each webserver in my infrastructure, and each check collects the HTTP response code. On that response code metric I’ve added 2 rules: if the value is not 200, or the metric goes absent, send me an email.


Since I have redundancy in my servers, I choose to only get an email when one goes down. This way I don’t get woken up, and I can just take care of the problem the next day.


However, I know I always want at least 2 servers up and running. So now I will go to the groups page and create a webserver group. I first add my rulesets via the “add rulesets+” button, selecting all 3 webserver code rules. Then I add a formula and decide that if 2 (or more) out of the 3 servers go bad, trigger a sev 1 alert. Then I add my page group to get these sev 1s. Now I’ll still get emails if the servers go down, but I’ll get woken up with a page when I hit my group threshold.

What’s New Q1 2013 Edition

Navigation and URL Changes

I’ll start this update by talking about the most obvious changes to the UI: our new, completely horizontal navigation and the new URL structure.

We had received a lot of feedback about the mix of horizontal and vertical navigation. We were told the tabs were hard to read, especially with the text rotated 90 degrees; users weren’t sure the tabs were part of the navigation once they got to the page; and the ever-confusing + button had to be explained to almost everyone. It was our take on what I would call a classic site layout, but in the end it didn’t really resonate with our customers.

The new layout is much easier to understand and use: main nav at the top, sub nav below, the search box in the upper right where one would expect it, and the + button replaced with text like “+ New Check” or “+ New Graph”. An added benefit of this change is that we were able to reclaim some vertical whitespace, making the page a bit tighter. The changes weren’t purely aesthetic either: we recently read an article by Ben Kamens about breaking down Amazon’s mega dropdown, and incorporated some of those ideas around mouse acceleration and direction so we don’t flip to another sub nav on hover as you try to navigate to a new page.

As you probably also noticed, our URL structure changed to account.circonus.com. This should have been as transparent as possible: old URLs will continue to work for the foreseeable future, though in-app links will take you to the new format.

Simplified Maintenance

In the middle of February we launched the much-awaited maintenance window rework. Prior to this, the controls were fine-grained, requiring users to mark individual metrics as in maintenance. While this provided a ton of flexibility, it made it more difficult to work at a higher level when machines were taken offline for work, etc.

The new maintenance windows solve this by providing multiple layers of control. Beyond the metric level, you can now also set maintenance on a check, a check bundle, a host, or your entire account. Along with this change, items in maintenance no longer trigger new alerts in the UI; previously you would see each alert but notifications would be silenced. This reduction in noise should help you find problems faster.
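As a rough sketch of the layered idea, a metric can be treated as silenced if any enclosing scope has an active maintenance window. The scope names mirror the list above; the data structure and function here are illustrative, not the Circonus schema.

# Hypothetical set of currently active maintenance windows
ACTIVE_MAINTENANCE = {
    ("host", "web01.example.com"),
    ("account", "my-account"),
}

def in_maintenance(metric):
    scopes = [
        ("metric", metric["metric_id"]),
        ("check", metric["check_id"]),
        ("check_bundle", metric["bundle_id"]),
        ("host", metric["host"]),
        ("account", metric["account"]),
    ]
    return any(scope in ACTIVE_MAINTENANCE for scope in scopes)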

MySQL From C to Java

Our last item involves a broker update that is coming down the pike later this week, and one we wanted to make sure was announced a little ahead of time.

Currently we communicate with MySQL through a C library, which we build as a module to our broker. This library can at times cause stability issues in certain scenarios, typically when a server stops responding correctly. Additionally, MySQL does some tricky things internally with the various “show X” commands that make the nuances hard to maintain in C. Because of this we are switching out the C lib for a JDBC implementation; JDBC is already how we talk to Oracle and MS SQL Server.

We’ve been running this new version of the check internally and with some of our customers for a while, taking special care to make sure metric names don’t change, so we are pretty confident that this will be a transparent switch. Additionally, we have been able to add much improved support for commands like “show status” and “show master/slave status”, among others. Should you notice any issues after the upgrade, please contact support.

Update (3/21): Due to some unrelated changes, the release of the new broker code has been rescheduled to next week.

2013 has already proven busy for us and doesn’t look to be slowing down. As always, give us your feedback, user input helps drive our development.

PagerDuty Integration Improvements and Alert Formats

Recently we got burned by ignoring a page because the actual message we received lacked detail: it looked like an alert that was known to clear itself. At 3am it is hard to bring yourself to get out of bed when you have seen this alert page and clear time and time again, so it was forgotten. Four hours later the alert was spotted by another admin and resolved, and an analysis was done to determine how this happened.

We determined the root cause to be the aforementioned lack of detail. When Circonus sent an alert to PagerDuty, we did so in our “long format”, which is the alert format you get when you receive email notifications (more on this and the “short format” later). PagerDuty then truncates this message to fit a standard 160-character SMS. This truncation led to a lot of alerts looking like each other, so some that were more critical were assumed to be of lesser importance and ignored.

Improvements to PagerDuty Messages

To solve this, we just pushed out a change to include both the short format and long format in a PagerDuty message. The short format is what we use for SMS alerts, and is now the first line of the message. When the truncation happens on their end, you should receive as much detail as possible about the alert. This does lead to redundant information in the message body in their UI and email alerts, but we feel it is for the better.

Secondly, we are providing details about the alerts to PagerDuty’s API. These details currently are:

  • account
  • host
  • check_name
  • metric_name
  • severity
  • alert_value
  • client_url

These details are useful if you are pulling alerts from the PagerDuty API: instead of parsing the message, you should receive a JSON object with these keys and their associated values.
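As an illustration, the details object for the example alert discussed later in this post might look roughly like this; the keys are the ones listed above, and the values are filled in for illustration only.

details = {
    "account":     "Circonus Testing",
    "host":        "development.internal",
    "check_name":  "Test Check",
    "metric_name": "cpu_used",
    "severity":    "2",
    "alert_value": "89.65",
    "client_url":  "https://circonus.com/...",   # link back to the alert in the UI
}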

How Circonus Alerts are Formatted

As mentioned before, Circonus has two alert formats: a long format, which is used for email, XMPP, AIM, and PagerDuty alerts, and a short format, which is used for SMS, Twitter, and now PagerDuty.

The short format is intended to compress as much detail about the alert as possible while remaining readable and useful. An example of this type of alert:

[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)

I’ll break this alert up into its various sections to describe it; a small sketch assembling the same string follows the list.

  • [Circonus Testing] is the name of the account
  • A = Alert, 2 = Severity 2, 1 = The number of sev 2 alerts. The “A” here could also be R for Recovery
  • development.internal is the hostname or IP this alert was triggered on
  • “Test Check” is the name of the check bundle in the Circonus UI
  • cpu_used is our metric name and (89.65) is the value that triggered the alert
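Putting those pieces back together, here is a small Python sketch that assembles the same string; the field names are illustrative, and the layout follows the example above.

def short_format(account, kind, severity, count, host, check, metric, value):
    # kind is "A" for an alert or "R" for a recovery
    return f'[{account}] {kind}{severity}:{count} {host} "{check}" {metric} ({value})'

print(short_format("Circonus Testing", "A", 2, 1,
                   "development.internal", "Test Check", "cpu_used", 89.65))
# [Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)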

The long format is more self-explanatory since we have many more characters to work with.

Account: Circonus Testing

ALERT Severity 2
Check: Test Check
Host: development.internal
Metric: cpu_used (89.65)
Agent: Ashburn, VA, US
Occurred: Tue, 8 Jan 2013 2:25:53

This is the same alert as above, so breaking it apart we have:

  • Account name
  • Alert of Severity 2, this could also be RECOVERY. The alert count is missing because in the long format we will list out each alert separately.
  • The check name
  • The host / IP
  • The metric name and alert value
  • The broker / agent that the alert was triggered from
  • The time that the alert was triggered, if this is a recovery you will also have a cleared time.
  • The Circonus URL to view the alert in the UI

In the future we intend to allow the alert formats to be customized for each contact group, or use these current formats as the default.

Thanks to redundancy built into Circonus, our users were never impacted by the outage that precipitated this change, but if it can happen to us it will happen to others, so we hope these minor changes bring improvements to your response times.

Fault Detection: New Features and Fixes

One of the trickier problems in detecting faults is detecting the absence of data. Did the check run and not produce data? Did we lose the connection and miss the data? The latter case is where we lacked a bit of insight, which we sought to correct.

The system is down

A loss of connection to a broker happens for one of two reasons. First, the broker itself might be down: the software restarted, the machine crashed, etc. Second, there was a loss of connectivity in the network between the broker and the Circonus NOC. Note that for our purposes, a failure in our NOC would look identical to the broker running but having network problems.

Let’s start with a broker being down. Since we aren’t receiving any data, it looks to the system like all of the metrics just went absent. In the event that a broker goes down, the customer owning that broker would be inundated with absence alerts.

Back in July, we solved this by adding the ability to set a contact group on a broker. If the broker disconnects, you will get a single alert notifying you that the broker is down. While it is disconnected, the system automatically puts all metrics on the broker into an internal maintenance mode; when it reconnects, we flip them out of maintenance and then ask for the current state of the world, so anything that is bad will alert. Note that if you do not set a contact group, we have no way to tell you the broker is disconnected, so we fall back to not putting metrics in maintenance and you will get paged about each one as it goes absent. Even though this feature isn’t brand new, it is worth pointing out.

Can you hear me now?

It is important to know a little about how the brokers work. When a broker restarts, all the checks configured on it are scheduled to run within the first minute; after that they follow their normal frequency settings. To this end, when we reestablish connectivity with a broker, we look at its internal uptime monitor: if it is >= 60 seconds, we know all the checks have run and we can again use the data for alerting purposes.

This presented a problem when an outage was caused by a network interruption or a problem in our NOC. Such a network problem happened late one night and connections to a handful of brokers were lost temporarily. When they came back online, because the brokers had never restarted, we saw the uptime was good and immediately started using the data. This poses a problem if we reconnect at the very end of an absence window: a given check might not run again for 1–5 minutes, so we would potentially trigger absences and then recover them when the check ran.

We made two changes to fix this. First, we now have two criteria for a stable / connected broker:

  • Uptime >= 60 seconds
  • Connected to the NOC for >= 60 seconds

Since the majority of the checks run every minute, this meant that we would see the data again before declaring the data absent. This, however, doesn’t account for any checks with a larger period. To that end, we changed the absence alerting to first check to see how long the broker has been connected. If it has been connected for less than the absence window length, we push out the absence check to another window in order to first ensure the check would have run. A small change but one that took a lot of testing and should drastically cut down on false absence alerts due to network problems.
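Here is a minimal Python sketch of those two criteria and the deferral logic; the 60-second threshold comes from the description above, while the function names and time handling are illustrative, not the actual fault-detection code.

STABLE_AFTER = 60  # seconds: broker uptime and NOC connection must both exceed this

def broker_stable(uptime_s, connected_s):
    return uptime_s >= STABLE_AFTER and connected_s >= STABLE_AFTER

def absence_check_due(connected_s, absence_window_s):
    # If the broker has not been connected for a full absence window yet,
    # push the absence check out by another window so the check has a chance to run
    return connected_s >= absence_window_s

print(broker_stable(uptime_s=86400, connected_s=45))            # False: just reconnected
print(absence_check_due(connected_s=45, absence_window_s=300))  # False: defer one more window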

Updates From The Tech Team

Now that it is fall and the conference season is just about over, I thought it would be a good time to give you an update on some items that didn’t make our change log (and some that did), what is coming shortly down the road and just generally what we have been up to.

CEP woes and engineering salvation.

The summer started out with some interesting challenges involving our streaming event processor. When we first started working on Circonus, we decided to go with Esper as a complex event processor to drive fault detection. Esper offers some great benefits and a low barrier to entry for stream processing by placing your events into windows that are analogous to database tables, and then giving you the ability to query them with a language akin to SQL. Our initial setup worked well, and was designed to scale horizontally (federated by account) if needed. Due to demand, we started to act on this horizontal build-out in mid-March. However, as more and more events were fed in, we quickly realized that even when giving an entire server to one account, the volume of data could still overload the engine. We worked on our queries, tweaking them to get more performance, but every gain was wiped away by a slight bump in data volume. This came to a head near the end of May when the engine started generating late alerts and alerts with incorrect data. At that point, too much work had been put into making Esper work for too little gain, so we started on a complete overhaul.

The new system was still in Java, but this time we wrote all the processing code ourselves. The improvement was incredible: events that once took 60ms to process now took on the order of 10µs. To validate the system we split the incoming data stream onto the old and new systems and compared the data coming out. The new system, as expected, found alerts faster, and when we saw a discrepancy, the new system was found to be correct. We launched this behind the scenes for the majority of users on May 31st, and completed the rollout on June 7th. Unless you were one of the two customers affected by the delinquency of the old system, this mammoth amount of work got rolled out right under your nose and you never even noticed; just the way we like it. In the end we collapsed our CEP system from 3 (rather saturated) nodes back to 1 (almost idle) node and have a lot more faith in the new code. Here is some eye candy that shows the CEP processing time in microseconds over the last year. The green, purple and blue lines are the old CEP being split out, and the single remaining green line is the current system.


We tend to look at this data internally on a logarithmic scale to better see the minor changes in usage. Here is the same graph but with a log base 10 y-axis.


Distributed database updates.

Next up were upgrades to our metric storage system. To briefly describe the setup: it is based on Amazon’s Dynamo. We have a ring of nodes, and as data is fed in, we hash the ids and names to find which node it goes on, insert the data, and use a rather clever means to deterministically find subsequent nodes to meet our redundancy requirements. All data is stored at least twice and never twice on the same node. Theo gave a talk at last year’s Surge conference that is worth checking out for more details. The numeric data is stored in a highly compact proprietary format, while text data was placed into a Berkeley DB whenever it changed.
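As a rough sketch of that placement scheme (not the actual storage code), hashing a metric’s check id and name onto a ring of nodes and walking forward for the replica copies might look like this in Python; the node list and replica count are made up.

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical ring members
REPLICAS = 2                                      # data is stored at least twice

def nodes_for(check_id, metric_name):
    key = f"{check_id}:{metric_name}".encode()
    start = int(hashlib.sha256(key).hexdigest(), 16) % len(NODES)
    # Walk forward from the primary position; with REPLICAS <= len(NODES)
    # this never places two copies on the same node
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

print(nodes_for(1234, "cpu_used"))  # e.g. ['node-c', 'node-d']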

The Berkeley DB decision was haunting us. We started to notice potential issues with locking as the data size grew, and the performance and disk usage weren’t quite where we wanted them to be. To solve this we wanted to move to LevelDB. The code changes went smoothly, but then the problem arose: how do we get the data from one on-disk format to another?

The storage system was designed from the beginning to allow one node to be destroyed and rebuilt from the others. Of course a lot of systems are like this, but who ever actually wants to try it with production data? We do. With the safeguards of ZFS snapshotting, over the course of the summer we would destroy a node, bring it up to date with the most recent code, and then have the other nodes rebuild it. Each destroy, rebuild, bring-online cycle took the better part of a work day, and got faster and more reliable after each exercise as we cleaned up some problem areas. During the process user requests were simply served from the active nodes in the cluster, and outside of a few minor delays in data ingestion, no users were impacted. Doing these “game day” rebuilds has given us a huge confidence boost that should a server go belly up, we can quickly be back to full capacity.

More powerful visualizations.

Histograms were another big addition to our product. I won’t speak much about them here; instead you should head to Theo’s post on them here. We’ve been showing these off at various conferences, and gave attendees at this year’s Velocity and Surge insight into the wireless networks with real-time dashboards showing client signal strengths, downloads and uploads, and total clients.

API version 2.

Lastly, we’ve received a lot of feedback on our API: some good, some indifferent, but with a lot of requests to make it better, so we did. The rewrite was mostly from the ground up, but we did try to keep a lot of code the same underneath since we knew it worked (some is shared by the web UI and the current API). It more tightly conforms to what one expects from a RESTful API, and for our automation-enabled users we have added some idempotence, so consecutive Chef or Puppet runs on the same server won’t create duplicate checks, rules, etc. We are excited about getting this out; stay tuned.

It was a busy summer and we are looking forward to an equally busy fall and winter. We will be giving you more updates, hopefully once a month or so, with more behind the scenes information. Until then keep an eye on the change log.

Web Portal Outage

Last night circonus.com became unavailable for 34 minutes due to the primary database server becoming unavailable. Here is a breakdown of events; times are US/Eastern.

  • 8:23 pm kernel panic on primary DB machine, system rebooted but did not start up properly
  • 8:25 -> 8:27 first set of pages went out about DB being down and other dependent systems not operating
  • 8:30 work began on migrating to the backup DB
  • 8:57 migration complete and systems were back online

In addition to the web portal being down during this time, alerts were delayed. The fault detection system continued to operate; however, we have discovered some edge cases in the case management portion that will be addressed soon.

Because of the highly decoupled nature of Circonus, metric collection, ingestion, and long-term storage were not impacted by this event. Other services like search, streaming, and even fault detection (except as outlined above) receive their updates over a message queue and continued to operate as normal.

After the outage we discussed why recovery took so long and boiled it down to inadequate documentation of the failover process. Not all the players on call that night knew all they needed to about the system. This is being addressed so that recovery from an event like this can be handled much faster in the future.

Failing Forward While Stumbling, Eventually You Regain Your Balance

First I want to start by saying I sincerely apologize to anyone adversely affected by yesterday’s false alerts. That is something that we are very conscious of when rolling out new changes, and clearly something I hope never to repeat.

How did it happen? First, a quick rundown of the systems involved. As data is streamed into the system from the brokers, it is sent over RabbitMQ to a group of Complex Event Processors (CEPs) running Esper; additionally, the last collected value for each unique metric is stored in Redis for quick lookups. The CEPs are responsible for identifying when a value has triggered an alert and then telling the notification system about it.

Yesterday we were working on a bug in the CEP system where, under certain conditions, if a value went from bad to good while we were restarting the service, it was possible we would never trigger an “all clear” event, and as such your alert would never clear. After vigorous testing in our development environment, we thought we had it fixed and all our (known) corner cases covered.

So the change was deployed to one of the CEP systems to verify it in production. For the first few minutes all was well: stale alerts were clearing, and I was a happy camper. Then, roughly 5 minutes after the restart, all hell broke loose. Every “on absence” alert fired and then cleared within 1 minute, pagers went off around the office, happiness aborted.

Digging into the code, we thought we spotted the problem: when we load the last values into the CEP from Redis, we need to do so in a particular order. Because we used multiple threads to load the data asynchronously, some of it was being loaded in the proper order, but the vast majority was being loaded too late. Strike one for our dev environment: it doesn’t have anywhere near the volume of data, so everything was loaded in order by chance. We fixed the concurrency issue, tested, redeployed, and BOOM, same behavior as before.

The next failure was a result of the grouping that we do in the Esper queries: we were grouping by the check id, the name of the metric, and the target host being observed, and the preload data was missing the target field. This caused the initial preload events to be inserted okay, and as new data came in it would also be inserted just fine, but it was being grouped differently. Our absence windows currently have a 5-minute timeout, so 5 minutes after boot all the preload data would exit its window, the window would now be empty, and we triggered an alert. Then, as the newly collected data filled its window, we would send an all clear for that metric, and at that point we would be running normally, albeit with a lot of false alerts getting cleaned up.

Unfortunately, at this point the Redis servers didn’t have the target information in their store, so a quick change was made to push that data into them. That rollout was a success, and a little happiness was restored since something went right. After they had enough time to populate all the check data, changes were again rolled out to the CEP to add the target to the preload; hopes were high. At this point we had still only rolled the changes to the first CEP machine, so that was updated again and rebooted, and after 5 minutes things still looked solid. Then the other systems were updated. BOOM.

The timing of this failure didn’t make sense. CEP one had been running for 15 minutes now, and there are no timers in the system that would explain this behavior. The code was reviewed and looked correct. Upon review of the log files, we saw failures and recoveries on each CEP system; however, they were being generated by different machines.

The reason for this was a recent scale-out of the CEP infrastructure. Each CEP is connected to RabbitMQ to receive events; to split the processing amongst them, each binds a set of routing keys for the events it cares about. This splitting of events wasn’t mimicked in the preload code: each CEP was preloaded with all events. Since each system only cared about its share, the events it wasn’t receiving would trigger an absence alert, as it would see them in the preload and then never again. And since the CEP systems are decoupled, an event on CEP one wouldn’t be relayed to any other system, so the others would not know they needed to send a clear event; as far as they were concerned, everything was ok. Strike two for dev: we don’t use that distributed setup there.

Once again the CEP was patched; this time the preloader was given the intelligence to construct the routing key for each metric. At boot it would pull the list of keys it cared about from its config, and then, as it pulled the data from Redis, compare each metric’s key to that list and preload the data only if it matched. One last time: roll changes, restart, wait, wait, longest 5 minutes in recent memory, wait some more… no boom!!!
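A minimal sketch of that fix in Python: at boot, the preloader builds the routing key for each stored metric and only preloads the ones this CEP node is bound to. The key scheme, field names, and Redis layout here are illustrative, not the actual implementation.

def routing_key(account_id, check_id):
    return f"alerts.{account_id}.{check_id}"   # hypothetical routing key scheme

def preload(last_values, my_routing_keys):
    # last_values: one record per metric, pulled from Redis at boot
    for item in last_values:
        key = routing_key(item["account_id"], item["check_id"])
        if key in my_routing_keys:             # only metrics this node consumes
            yield item                         # feed into the local CEP windows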

At this point, though, one of the initial problems I set out to solve was still an issue. Because the data streaming in looked good, the CEP won’t emit an all clear for no reason (something has to be bad first), so we had a lot of false alerts hanging around and people being reminded about them. To rectify this, I went into the primary DB, cleared all the alerts with a few updates, and rebooted the notification system so it would no longer see them as an issue. This stopped the reminders and brought us back to a state of peace. And this is where we sit now.

What are the lessons learned, and how do we hope to prevent this in the future? Step 1 is, of course, always making sure dev matches production, not just in code but in data volume and topology. Outside of the CEP setup it does, so we need a few new zones brought into the mix today, and that will resolve that. Next, a better staging and rollout procedure for this system: we can bring up a new CEP in production and give it a workload whose events don’t generate real alerts, and going forward we will be verifying production traffic like this before a rollout.

Once again, sorry for the false positives. Disaster porn is a wonderful learning experience, and if any of the problems mentioned in this post hit home, I hope it gets you thinking about what changes you might need to be making. For updates on outages or general system information, remember to follow circonusops on Twitter.

Template API

Setting up a monitoring system can be a lot of work, especially if you are a large corporation with hundreds or thousands of hosts. Regardless of the size of your business, it still takes time to figure out what you want to monitor, how you are going to get at the data, and then to start collecting, but in the end it is very rewarding to know you have insight.

When we launched Circonus, we had an API to do nearly everything that could be done via the web UI (within reason), and expected that it would make it easy for people to program against and get their monitoring off the ground quickly. Quite a few customers did just that, but still wanted an easier way to get started.

Today we are releasing the first version of our templating API to help you get going (templating will also be available via the web UI in the near future). With this new API you can create a service template by choosing a host and a group of check bundles as “masters.” Then you simply attach new hosts to the template, and the checks are created for you and deployed on the agents. Check out the documentation for full details.

Once a check is associated with a template, it cannot be changed on its own; you must alter the master check first and then re-sync the template. To re-sync, you just need to GET the current template definition and then POST it back; the system will take care of it from there.

To remove bundles or hosts, just remove them from the JSON payload before POSTing, and choose a removal method. Likewise, to add a host or bundle back to a template, just add it into the payload and then POST. We offer a few different removal and reactivation methods to make it easy to keep or remove your data and to start collecting it again. These methods are documented in the notes section of the documentation.
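For example, a re-sync round trip could look roughly like the following Python sketch using the requests library. The base URL, template id, auth header, and payload field names are placeholders; consult the templating API documentation for the real resource paths and fields.

import requests

AUTH = {"X-Auth-Token": "..."}                      # placeholder credentials
TEMPLATE = "https://circonus.com/api/template/123"  # hypothetical template resource

tmpl = requests.get(TEMPLATE, headers=AUTH).json()  # GET the current definition

tmpl["hosts"].append("10.0.0.12")                   # attach another host (or drop a bundle)
tmpl["removal_method"] = "..."                      # placeholder; see the notes in the docs

requests.post(TEMPLATE, headers=AUTH, json=tmpl)    # POST it back; the system re-syncs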

Future plans for templates include syncing rules across checks and adding templated graphs so that adding a new host will automatically add the appropriate metrics to a graph. Keep an eye on our change log for enhancements.

WebHook Notifications

This week we added support for webhook notifications in Circonus. For those who are unsure what a webhook is, it’s simply an HTTP POST with all the information about an alert that you would normally get via email, XMPP, or AIM.

Webhooks can be added to any contact group. Unlike other methods, you can’t add one to an individual user and then add that user to a group; however, this might be supported in the future based on feedback. Simply go to your account profile, click on the “Type to Add New Contact” field on the group you would like to add the hook to, and enter the URL you would like us to contact. The contact type will then display as your URL with the method HTTP (for brevity).

Now that your hook is set up, what will it look like when the data is posted to you? Here is a Perl Data::Dumper example, grouped by alert for readability, of the parameters posted for 2 alerts:

%post = (
    'alert_id' => [
        '21190',
        '21191'
    ],
    'account_name' => 'My Account',
    'severity_21190' => '1',
    'metric_name_21190' => 'A',
    'check_name_21190' => 'My Check',
    'agent_21190' => 'Ashburn, VA, US',
    'alert_value_21190' => '91.0',
    'clear_value_21190' => '0.0',
    'alert_time_21190' => 'Thu, 21 Oct 2010 16:35:49',
    'clear_time_21190' => 'Thu, 21 Oct 2010 16:36:49',
    'alert_url_21190' =>
        'https://circonus.com/account/my_account/fault-detection?alert_id=21190',
    'severity_21191' => '1',
    'metric_name_21191' => 'B',
    'check_name_21191' => 'My Other Check',
    'agent_21191' => 'Ashburn, VA, US',
    'alert_value_21191' => '91.0',
    'alert_time_21191' => 'Thu, 21 Oct 2010 16:36:21',
    'alert_url_21191' =>
        'https://circonus.com/account/my_account/fault-detection?alert_id=21191',
);

So let’s look at what we have here. The first thing to notice is that we pass multiple alert_id parameters, giving you the ID of each alert in the payload. From there, every other parameter is suffixed with _<alert_id> so you know which alert that parameter is associated with. In this example 21190 is a recovery and 21191 is an alert; recoveries get the additional parameters clear_value and clear_time.
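On the receiving side, a minimal sketch of a handler that walks the payload this way might look like the following Python, using Flask as a stand-in web framework; Circonus only defines the POST parameters themselves, so the rest of the scaffolding here is illustrative.

from flask import Flask, request

app = Flask(__name__)

@app.route("/circonus-hook", methods=["POST"])
def circonus_hook():
    for aid in request.form.getlist("alert_id"):
        # Every other field is suffixed with _<alert_id>
        metric = request.form.get(f"metric_name_{aid}")
        value = request.form.get(f"alert_value_{aid}")
        cleared = request.form.get(f"clear_time_{aid}")  # present only for recoveries
        status = "RECOVERY" if cleared else "ALERT"
        app.logger.info("%s %s %s=%s", status, aid, metric, value)
    return "", 204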

Webhooks open up all sorts of possibilities both inside and outside of Circonus. Maybe you have a crazy complicated paging schedule, or prefer a contact method that we don’t natively support yet; fair enough, let us post the data to you and you can integrate it however you like. Want to graph your alerts? We are working on a way to overlay alerts on any graph, but in the meantime, set up your webhook and feed the data back to Circonus via Resmon XML, and now you have data for your graphs.

If you are curious about other features and would like to see an in depth post on them, please contact us at hello@circonus.com.