Recently we got burned by ignoring a page because the message we received lacked detail and looked like an alert that was known to clear itself. At 3am it is hard to bring yourself to get out of bed when you have seen this alert page and clear time and time again, so it was forgotten. Four hours later the alert was spotted by another admin and resolved, and an analysis was done to determine how this happened.
We determined the root cause to be the aforementioned lack of detail. When Circonus sent an alert to PagerDuty, it did so in our “long format”, which is the alert format you get when you receive email notifications (more on this and the “short format” later). PagerDuty then truncated this message to fit a standard 160 character SMS. This truncation of detail led to a lot of alerts looking like each other, and some that were more critical were assumed to be of lesser importance and ignored.
Improvements to PagerDuty Messages
To solve this, we just pushed out a change to include both the short format and long format in a PagerDuty message. The short format is what we use for SMS alerts, and is now the first line of the message. When the truncation happens on their end, you should receive as much detail as possible about the alert. This does lead to redundant information in the message body in their UI and email alerts, but we feel it is for the better.
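To illustrate why the ordering matters, here is a minimal sketch of the idea, not the actual Circonus or PagerDuty code. The function names and the simple character-count truncation are assumptions made for the example; only the 160 character limit and the "short format first" ordering come from the description above.

```python
SMS_LIMIT = 160  # standard single-SMS length mentioned above


def build_message(short_format: str, long_format: str) -> str:
    """Put the short format on the first line, followed by the long format.

    Hypothetical helper: the real message layout may differ in detail.
    """
    return f"{short_format}\n{long_format}"


def truncate_for_sms(message: str, limit: int = SMS_LIMIT) -> str:
    """Assumed truncation: keep only the first `limit` characters."""
    return message[:limit]


if __name__ == "__main__":
    short = '[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)'
    long_ = (
        "Account: Circonus Testing ALERT Severity 2 Check: Test Check "
        "Host: development.internal Metric: cpu_used (89.65) "
        "Agent: Ashburn, VA, US Occurred: Tue, 8 Jan 2013 2:25:53"
    )
    # The short format survives truncation intact because it comes first.
    print(truncate_for_sms(build_message(short, long_)))
```

Because the short format is well under the limit, whatever gets cut off is only the tail of the redundant long format.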
Secondly, we are providing details about the alerts to PagerDuty’s API. These details currently are:
These details are useful if you are pulling alerts from the PagerDuty API. Instead of parsing the message, you should receive a JSON object with these keys and their associated values.
How Circonus Alerts are Formatted
As mentioned before, Circonus has two alert formats: a long format, which is used for email, XMPP, AIM and PagerDuty alerts, and a short format, which is used for SMS, Twitter and now PagerDuty.
The short format is intended to compress as much detail about the alert as possible while remaining readable and useful. An example of this type of alert:
[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)
I’ll break this alert up into its various sections to describe it:
- [Circonus Testing] is the name of the account
- A = Alert, 2 = Severity 2, 1 = The number of sev 2 alerts. The “A” here could also be R for Recovery
- development.internal is the hostname or IP this alert was triggered on
- “Test Check” is the name of the check bundle in the Circonus UI
- cpu_used is our metric name and (89.65) is the value that triggered the alert
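If you need to pick these sections apart in a script, the breakdown above can be sketched as a regular expression. This is an illustration based only on the single example shown here. The pattern, the field names and the `parse_short` function are assumptions, not an official grammar, and it assumes recovery messages (R) share the same layout and that the check name is wrapped in straight double quotes.

```python
import re

# Assumed layout: [account] <A|R><severity>:<count> host "check" metric (value)
SHORT_RE = re.compile(
    r'^\[(?P<account>[^\]]+)\]\s+'
    r'(?P<kind>[AR])(?P<severity>\d+):(?P<count>\d+)\s+'
    r'(?P<host>\S+)\s+'
    r'"(?P<check>[^"]*)"\s+'
    r'(?P<metric>\S+)\s+'
    r'\((?P<value>[^)]*)\)$'
)


def parse_short(message: str) -> dict:
    """Split a short format alert into its sections (hypothetical helper)."""
    match = SHORT_RE.match(message.strip())
    if match is None:
        raise ValueError(f"not a recognised short format alert: {message!r}")
    fields = match.groupdict()
    # A = Alert, R = Recovery, as described in the list above.
    fields["kind"] = "ALERT" if fields["kind"] == "A" else "RECOVERY"
    fields["severity"] = int(fields["severity"])
    fields["count"] = int(fields["count"])
    return fields


if __name__ == "__main__":
    example = '[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)'
    print(parse_short(example))
```

The value is kept as a string on purpose, since nothing here guarantees that every metric value is numeric.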
The long format is more self-explanatory since we have many more characters to work with.
Account: Circonus Testing
ALERT Severity 2
Check: Test Check
Host: development.internal
Metric: cpu_used (89.65)
Agent: Ashburn, VA, US
Occurred: Tue, 8 Jan 2013 2:25:53
This is the same alert as above, so breaking it apart we have:
- Account name
- Alert of Severity 2; this could also be RECOVERY. The alert count is missing because the long format lists each alert separately.
- The check name
- The host / IP
- The metric name and alert value
- The broker / agent that the alert was triggered from
- The time that the alert was triggered. If this is a recovery, you will also have a cleared time.
- The Circonus URL to view the alert in the UI
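The labelled fields in the long format also lend themselves to simple parsing. The sketch below is an assumption-laden illustration built from the one example above: the label list, the `parse_long` function and the output keys are hypothetical. It does not handle the cleared time or the URL, since their exact labels are not shown here, and it would be confused by a check name that itself contains one of the labels.

```python
import re

# Labels taken from the example alert above; any others are not handled.
LABELS = ("Account", "Check", "Host", "Metric", "Agent", "Occurred")
LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r"):\s*")
STATUS_RE = re.compile(r"\b(ALERT|RECOVERY) Severity (\d+)\b")


def parse_long(text: str) -> dict:
    """Extract the labelled fields from a long format alert (hypothetical helper)."""
    fields = {}
    status = STATUS_RE.search(text)
    if status:
        fields["kind"] = status.group(1)
        fields["severity"] = int(status.group(2))
        # Remove the status so it does not end up inside a neighbouring value.
        text = text[:status.start()] + text[status.end():]
    # re.split with a capturing group yields [prefix, label, value, label, value, ...]
    parts = LABEL_RE.split(text)
    for label, value in zip(parts[1::2], parts[2::2]):
        # Collapse whitespace so single-line and multi-line alerts parse alike.
        fields[label.lower()] = " ".join(value.split())
    return fields


if __name__ == "__main__":
    example = (
        "Account: Circonus Testing ALERT Severity 2 Check: Test Check "
        "Host: development.internal Metric: cpu_used (89.65) "
        "Agent: Ashburn, VA, US Occurred: Tue, 8 Jan 2013 2:25:53"
    )
    print(parse_long(example))
```

Splitting on the known labels rather than on line breaks keeps the sketch working whether or not the alert text has been wrapped in transit.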
In the future we intend to allow the alert formats to be customized for each contact group, with these current formats remaining the default.
Thanks to redundancy built into Circonus, our users were never impacted by the outage that precipitated this change, but if it can happen to us it will happen to others, so we hope these minor changes bring improvements to your response times.