Ways to Collect Systems Data in Circonus

When you decide to monitor your systems with Circonus, there are quite a few options for how to collect your metrics. We believe Circonus should be a tool that does what you need, when you need it. Circonus does not force you into a specific approach or method. Since there are so many different ways to gather telemetry via Circonus, we thought we would take a moment to outline some of the different approaches.

In addition to application-specific checks, you might like to get baseline information about things like memory, CPU, file systems, and interfaces of your servers and network equipment. We’ve listed out the main options that can be used for system performance metrics below, along with a brief description and our recommendations for each. We’ve roughly ordered these based on our best practices, but the tool that should be used depends on many variables that you’ll need to take into account.

For instance, some users may prefer to use a single agent on all of their devices, which may mean that some options won’t be available. Available plugins and ability to expand should also be considered. Some agents allow Circonus to reach out to the endpoint and gather metrics, while others require the data to be pushed (these agents mention push requirements in the description below). In some cases, the language that the agent was written in can have an effect on your decision.

Standard protocols

SNMP – SNMP is a standard that has been around for years, and allows monitoring of many types of network equipment, servers, and appliances. There is a good chance you already have SNMP configured on most of your hosts, which would significantly lower the up-front setup time. You’ll need to know the OIDs you want to monitor, but check bundle templates can make this process a little easier for you.

HTTPTrap – Circonus can accept JSON payloads via an HTTP PUT or POST request. This data is not polled regularly from the Circonus Broker, but is pushed to the Broker from the monitored target. This is the easiest way to get arbitrary data into Circonus, but you’ll have to figure out where to get the data.
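
As an illustration, a push of arbitrary metrics might look something like the sketch below (a Node.js example; the check UUID, secret, and metric names are placeholders, and you should use the submission URL Circonus displays for your own HTTPTrap check):

    // Minimal HTTPTrap push sketch. The trap host, UUID, and secret below are
    // placeholders; substitute the submission URL from your own httptrap check.
    var https = require('https');

    var payload = JSON.stringify({
        "cpu_used": 89.65,                         // a simple numeric metric
        "disk": { "free_gb": 120, "used_gb": 80 }  // nested values are allowed
    });

    var req = https.request({
        host: 'trap.noit.circonus.net',
        path: '/module/httptrap/CHECK_UUID/SECRET',
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' }
    }, function (res) {
        console.log('HTTPTrap responded with status', res.statusCode);
    });
    req.on('error', console.error);
    req.end(payload);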

Third-Party agents

collectd – Collectd is a lightweight C-based tool that has a variety of plugins available for data collection. There are two main ways to use collectd with Circonus: either push the information from your device over UDP (similar to statsd and HTTP Traps) or use the write_http plugin.

Gollector – Gollector is a new monitoring agent that relies on the proc filesystem and C POSIX calls such as sysconf to determine your machine’s profile. This avoids the performance penalty that some other agents incur by shelling out to do the collection work.

NRPE – Circonus can utilize existing NRPE checks from your Nagios or Icinga installation. NRPE allows you to remotely call Nagios scripts to collect information. If you want to monitor a non-standard metric, there’s probably a Nagios script for it.

statsd – Similar to an HTTP Trap, statsd allows your hosts to send information to Circonus Enterprise Brokers, rather than the Broker reaching out to the host to poll it. One downside is that this information cannot be played in real-time, but it can be useful for metrics that may not have regular intervals of available information or are particularly high volume.
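
As a rough sketch of what that push looks like, a statsd-style submission is just a small UDP datagram in the usual "metric:value|type" wire format. The broker host and port below are placeholders; point them at the statsd listener configured on your Circonus Enterprise Broker:

    // statsd-style UDP push sketch (Node.js). BROKER_HOST and port 8125 are
    // placeholders for wherever your broker's statsd listener is running.
    var dgram = require('dgram');
    var sock = dgram.createSocket('udp4');

    var lines = [
        'web.requests:1|c',        // counter increment
        'web.response_time:42|ms'  // timing sample in milliseconds
    ].join('\n');

    var buf = new Buffer(lines);
    sock.send(buf, 0, buf.length, 8125, 'BROKER_HOST', function (err) {
        if (err) console.error(err);
        sock.close();
    });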

Internally-developed agents

nad – Nad is a lightweight, simply managed host agent written in Node.js. Nad is the first choice of Circonus due to its easy extensibility and its ability to work on almost any platform, including Windows, RHEL, Ubuntu, and illumos derivatives. Nad comes with enough plugins to let you monitor any of the basics, while allowing you to add your own checks to fit your environment.

Resmon – Resmon is a Perl-based agent created by OmniTI. New modules can be created quickly and easily, but must be written in Perl; that’s a make-or-break factor for many.

Windows Agent – If you’d rather not use nad on your Windows servers, there is a Windows agent that can be used to collect performance metrics from Windows servers.

Which is right for me?

The choice of agent to use depends on many factors. Current operating system, existing monitoring setup, and network layout can all have an effect on which agent you choose. You may also need to incorporate several choices in order to best monitor your environment.

That covers the main ways to get system information into Circonus. There are plenty of other methods of getting data, such as Google Analytics, a variety of database connections, Memcached, Varnish, NewRelic, and more. A combination of these collection types can enable you to have data on every piece of your infrastructure, so you can always find the information you need.

Tags: A Long Time Coming

Ok, we know a lot of you have been asking for tags in Circonus for a long time. Well, they’re finally here! The tags feature is currently in beta, and will be released to all customers very soon. (Tags have actually been available to API users for a while, just without UI support in the application.) Let’s jump right in and I’ll give you a quick overview of how tags will work in Circonus.

First Things First: What’s Taggable?

For this initial implementation of tags, you can tag Graphs, Worksheets, Check Bundles, Templates, and Maintenance Windows. You will also see tags on some other pages, such as Alerts, Rulesets, Hosts, and Metrics, but these items aren’t taggable. The tags you see on those pages are inherited from the associated Check Bundles.

In the near future, we’ll be adding to these lists. We are planning on making Annotations and Dashboards taggable, and have some other unique ways we’re planning on using tags to help categorize and group items in the UI.

So, How Does This Work?

First, you’ll need to add tags to some items. All tags are categorized in the system, but if you don’t want to categorize your tags, that’s ok. Simply use the “uncategorized” category for all your tags and the tags will be presented solo (without category) throughout the UI. We have a couple of categories which are already created in the system, but you can create as many custom categories as you wish.

Let’s go to one of the list pages containing taggable items (e.g. the Checks list page) and look for the tags toolbar under an item (it will have an outlined tag with a plus icon). Click the “+” tag to open the “Add Tag” dialog. First choose a category or use the “+ ADD Category” option to enter a new one, then the tags dropdown will be populated with the tags under that category. Choose a tag or enter a new one by choosing the “+ ADD Tag” option, then use the “Add Tag +” button to add the tag to the item.


When the tag is added to the UI, you’ll notice right away that each tag category has its own color. There is a limited set of pre-selected colors which will be assigned to categories as they are created. These particular colors have been chosen to maximize your eye’s ability to distinguish the categories at a glance, and also because they work well under both light and dark icons. You’ll also notice that the tag you added has its own icon. There’s a set of twelve icons which will be assigned to the first twelve tags in each category. Once a category has twelve tags, any further tags added to that category will receive blank icons. This system of colors and icons creates fairly distinctive combinations that should help you recognize tags at a glance without needing to read the tag every time. Note: taggable items can have unlimited tags.

After you add a tag to an item, you’ll also notice that a small set of summary tags is added (usually off to the right of the item). This shows the first few tags on the item, providing a way for you to quickly scan down the page and get a glimpse of the tags that are assigned to each item on the page.


One more note about tags and categories. Although you select them separately in the UI, when using the API the categories and tags are joined with a colon (“:”) as the separator. So a tag “windows” in the category of “os” would be represented as “os:windows” in the API.
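
As an illustration, a check bundle fetched from the API might carry its tags in a field like this (fragment only; the other fields are omitted):

    {
        "display_name": "Production Web Server",
        "tags": [ "os:windows", "datacenter:ashburn" ]
    }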

Tag Filtering

The power of tags is apparent once you start using tag filters. Look in the upper right corner of the page and you’ll see an outlined tag with a funnel icon and beside it a similar menu button. These are for setting tag filters and saving tag filter sets for easy application later. Click the funnel tag to open the “Tag Filters” dialog, and click the “Add +” button to add a filter to the dialog. You may add as many filters as you wish, and in each one all you have to do is choose a category and tag from the choices (you may not enter new tags or categories here; these are simply the tags you’ve already added to the system). Use the “x” buttons at the right to remove filters, or use the “Clear” button to remove all filters and start with a clean slate. Note: none of your changes in this dialog are applied until you click the “Apply” button. After clicking “Apply,” the page will refresh and you’ll see your newly applied filters at the top of the page. You can then use the “Tag Filters” dialog to change these filters, or you can use the menu button on the right to open the “Tag Filter Sets” dialog, where you may save & apply sets of tag filters for easy switching.


One important feature to note is the “sticky” checkbox that appears when you have one or more tag filters applied. By default (with “sticky” turned off), the tag filters you apply are only visible in the current tab. If you close the tab or open a new one, it will not retain the current tag filters. The benefit of this is that we’ve developed a system to allow you to have multiple concurrent tag filter views open side-by-side. So with the “sticky” setting off, you can open several tabs, use different tag filters in all of them, and each tab will retain its own tag filters as you navigate Circonus in that tab. If at any point you turn the “sticky” setting on, the tag filters from that tab will be applied universally and will override all the other tabs. And not only are “sticky” tag filters applied across all tabs, they’re remembered across all of your user sessions, so they will remain applied until you choose to change or remove them.


Host Grouping

One unique feature we’ve already completed is Host Grouping. Head on over to the Hosts page and open the “Layout Options” by clicking on the grid icon at the right side of the page. You’ll see a new option labeled “Group By Tag Category.” If you choose a tag category there, the page will reorganize itself. You’ll now see a subtitle for each tag in the selected category, and under each subtitle you’ll see the Hosts which have Check Bundles with that tag. Because each Host can have many tags, including more than one tag in the same category, you may see a Host appear in more than one group. At the bottom of the page you’ll also see a grouping subtitled “Not in Category.” Under this group you’ll see all the Hosts which don’t have any Check Bundles with tags in the chosen category.


Customizable Alerts and Ruleset Groups

Today we are releasing two new features to make your on-call life easier.

Customizable Alerts

Our default format was created and modified over the years based on user feedback, but of course it was never going to make everyone happy. How do we solve this? By letting you create your own alerts! If you head to your contact groups page, you will notice a new checkbox to “enable custom alert formats”. Check that and some new fields will appear, letting you modify both the long and short format bodies, add a summary for each type, and even change the subject line of your emails.

Alerts are customized by the use of macros and conditionals. Macros take the form {alert_id} and are replaced with the appropriate values. Conditionals look like %(cleared != null) Cleared: {cleared}%; in this example, if the alert has cleared, you will get a line like “Cleared: Mon, 15 July 2013 11:49:58” on your alert, and if it has not cleared, this line will not be added.

As a further example, the current alerts in the new customizable format would look like this:

Subject:
[{account}] Severity {severity} {status} {check_name}

Body:
Account: {account}

{status} Severity {severity} %(new_severity != null) [new Severity {new_severity}]%
Check: {check_name}
Host: {host}
Metric: {metric_name} ({value})
Agent: {broker_name}
Occurred: {occurred}
%(cleared != null) Cleared: {cleared}%
{link}
{metric_link}
{metric_notes}

The ? help bubble beside the Alert Formats section header has a full list of all the macros available for use. Our default alert format will be changing slightly as well: we are going to put your account name in the subject line instead of “Circonus,” and we will be adding the metric notes to the body.

Ruleset Groups

Another feature we get asked for is some way of only alerting when 3 out of 5 servers are down, or only when CPU spikes in conjunction with a rise in memory. To make this a reality, we’ve added the concept of “rule groups”, located under Alerts -> Groups. These groups take rulesets you’ve already created, like cpu > 80 or http 500 response code, and let you combine as many as you would like to form a group. You then define a formula on which the group alerts: this can be a threshold (when X of Y rules are in alert, trigger a sev 1), or you can create an expression (when (A and B) or C alerts, trigger a sev 2).

Let’s look at a complete example. I’ve created 3 checks, one for each webserver in my infrastructure; each check collects the HTTP response code. On that response code metric I’ve added 2 rules: if the value is not 200, or the metric goes absent, send me an email.


Since I have redundancy in my servers, I choose to only get an email when one goes down. This way I don’t get woken up, and I can just take care of the problem the next day.


However, I know I always want at least 2 servers up and running. So now I will go to the groups page and create a webserver group. I first add my rulesets via the “add rulesets+” button, selecting all 3 webserver code rules. Then I add a formula and decide that if 2 (or more) out of the 3 servers go bad, trigger a sev 1 alert. Then I add my page group to get these sev 1s. Now I’ll still get emails if the servers go down, but I’ll get woken up with a page when I hit my group threshold.

Understanding service latencies via TCP analysis

Latency is the root of all that is evil on the Internet… or so the saying goes. CPUs get faster, storage gets cheaper, IOPS are more plentiful, yet we feeble engineers have done close to nothing to improve the speed of light. While there is certainly a lower bound to latencies rooted in the physical world, many of today’s services (delivered both globally and within the datacenter) have latencies that are affected by things that are under our control. From a low level to a high level, latencies suffer from service capacity issues, routing issues, systems and network tuning issues, bad code, bad deployments, human errors and even usage errors. You can’t improve what you can’t measure, so let’s get to measuring some latencies.

Most Internet services are pretty simple; a client calls you up and asks a question and you provide an answer. Due to this incredibly common pattern and the nuances of HTTP, it turns out to be trivial to measure service response times with nothing more than a packet sniffer (e.g. tcpdump).

“What is this nuance of HTTP?” you might ask. It is that HTTP requires the client to speak first. While all of the rest of the article can be used for more complicated services where that is not true, it often requires a deeper understanding of the protocol at hand and more adroit tcpdump manipulation.

“Client speaks first” makes it very easy to blindly assess the service latency on the initial HTTP request on a TCP connection. Returning to the phone call metaphor, I can simply watch the phones and measure the time between the first ring and the first word spoken by the receiver; I don’t care much about what happens in between. Granted, there is a wealth of information in the packets themselves: client (IP), agent (User-Agent), endpoint (requested URI), etc. Yet, to measure overall service latency, I simply need the first TCP SYN packet in and the first non-zero payload packet out on a given TCP/IP connection.

What are we looking at?

At this point we should stop and understand a bit better what we are actually measuring. Wouldn’t it be easier to just ask the service we’re running to measure its service times and report them? The answer to this varies from architecture to architecture, but if the answer is “no,” you should have a birds-and-bees conversation with the software engineers or operations folk responsible for running the service. It should be absolutely trivial to track service times within the architecture.

Here, we’re asking something a tad more complex. Our measurements on the network side include an element of latency between the client and the server. What we are trying to estimate here is the elapsed time between the client’s attempt to retrieve content and the first byte of that content arriving.

If we measure from SYN receipt to data write, we miss two vital times from the user’s perspective. We didn’t get the SYN until sometime after they sent it, and they won’t see our data until sometime after we’ve sent it. Because each of these is a single packet, we just need to know the time it takes to roundtrip a packet. Where can we get that information?

; ping 2.15.226.34
PING 2.15.226.34 (2.15.226.34): 56 data bytes
64 bytes from 2.15.226.34: icmp_seq=0 ttl=249 time=39.022 ms
64 bytes from 2.15.226.34: icmp_seq=1 ttl=249 time=40.013 ms
64 bytes from 2.15.226.34: icmp_seq=2 ttl=249 time=34.300 ms
64 bytes from 2.15.226.34: icmp_seq=3 ttl=249 time=40.373 ms
64 bytes from 2.15.226.34: icmp_seq=4 ttl=249 time=33.129 ms
^C

It isn’t like we can ping the remote machine. Instead we need to calculate this passively. TCP has this really inconvenient 3-way handshake that starts up a session that goes something like:

  1. Client: “I’d like to ask you to tell me a joke.”
  2. Server: “Okay, ask away”
  3. Client: “Okay, I’d like you to tell me a joke about TCP.”
  4. Server: “These two packets walked into a bar…”

From the TCP nitty-gritty, if we measure the time from the first SYN to the subsequent ACK packet (before any data has transited), we have a rough estimation of the roundtrip time between the client and server. If we measure between observing (1) and sending (4), the part that we’re missing is the time between (1) being sent by the client and arriving at the server, and (4) being sent and arriving at the client. That is approximately one roundtrip.

So, long story summarized: if we take the time between observing (1) and (4) and add the time between observing (2) and (3), we should have a rough approximation of the total time the client witnessed between making the request and receiving the first byte of the response. We’ll take a further shortcut and note that the latency between (1) and (2) is nominal on (at least our) systems, so we only track the timings of (1), (3) and (4) and calculate (4) – (1) + (3) – (1).

Actually doing this.

We’ll use tcpdump on our server to try to find what we’re looking for: the SYN from the client that will start the session, the ACK from the client that completes the three-way TCP handshake (in response to the server’s SYN-ACK) and the first non-zero-length data payload from the server to the client.

For our example, let’s look at web traffic on port 80. First we need to know how much data is in the packet. The IP header has the total packet length (a 2-byte short starting at the second byte, which in BPF syntax is ‘ip[2:2]’). However, we’re looking for the payload length, so we need to subtract from that the length of the IP header, ‘((ip[0]&0xf)*4)’, and the length of the TCP header, ‘((tcp[12]&0xf0)/4)’. We expect one of the SYN or ACK bits to be set in the TCP flags and (because these are from the client) a destination port of 80:

'((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80)'

The data payload from the server to the client is simply a non-zero payload length with a source port of 80:

'(((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80)'

To test this, I’ll further restrict it to the host 10.10.10.10 (my laptop) and run tcpdump. I don’t want to resolve IP addresses to names, and I would like to keep the output terse with high-precision timestamps (-n -ttq). Additionally, I need at least 54 bytes of the payload to get the TCP headers (this is left as an exercise for the reader). Putting this all together in a tcpdump command (and limiting the scope to my client at 10.10.10.10), we can see something like this:

/opt/omni/sbin/tcpdump -n -ttq ' ((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80) or (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80) and host 10.10.10.10'
listening on us340, link-type EN10MB (Ethernet), capture size 384 bytes
1370816555.510014 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.542407 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.585086 IP 2.15.226.34.80 > 10.10.10.10.57300: tcp 458
1370816555.617389 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.617398 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0
1370816555.617622 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0

The first three lines are exactly what we’re looking for. The subsequent lines would be (in this case) the shutting down of this TCP connection. These packets are harder to avoid using BPF syntax and we’ll figure out a way to ignore them as needed.
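
Plugging in the numbers from that capture: the SYN arrived at 1370816555.510014, the handshake ACK at 1370816555.542407, and the first data packet left at 1370816555.585086. That gives roughly 0.0324 seconds for the client-to-server roundtrip ((3) – (1)) and 0.0751 seconds from SYN receipt to data write ((4) – (1)), so the client saw approximately 0.0324 + 0.0751 ≈ 0.107 seconds between making the request and receiving the first byte of the response.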

Doing something useful.

So, it is clear that on any high-volume service, these packets are going to be coming in quickly and we’ll need to process these latency calculations programmatically.

This is a bit more complicated, so I whipped together a small node.js program that runs tcpdump, consumes its output, calculates the estimated time-to-first-byte latencies and securely pushes them up to a specific Circonus httptrap check.

The only two arguments you need to supply are the UUID and secret from your httptrap in Circonus. Once running, the script will submit all latencies to one metric named “aggregate`latency” and another named “<hostname>`latency” (where <hostname> is the hostname of the machine on which it is running). This allows you to point a handful of machines at the same httptrap check and collect both their individual latencies and a recalculated aggregate set of latencies.
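
As a rough sketch of the approach (this is not the actual script; the trap hostname, the output parsing, and the connection bookkeeping are simplified, and the real program also handles connection teardown and error cases), the core loop looks something like this:

    // Sketch: run tcpdump, estimate time-to-first-byte per connection, and
    // push the results to a Circonus httptrap check every half second.
    // Usage: node latency-sketch.js <check-uuid> <secret>
    var spawn = require('child_process').spawn;
    var https = require('https');
    var os = require('os');

    var uuid = process.argv[2];
    var secret = process.argv[3];

    var conns = {};     // keyed by "client-ip.port": { syn: ts, ack: ts }
    var latencies = []; // estimated time-to-first-byte values, in seconds

    var filter =
        "((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && " +
        "((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80) or " +
        "(((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80)";

    var tcpdump = spawn('tcpdump', ['-l', '-n', '-ttq', filter]);

    tcpdump.stdout.on('data', function (chunk) {
        chunk.toString().split('\n').forEach(function (line) {
            // e.g. "1370816555.510014 IP 10.10.10.10.57300 > 2.15.226.34.80: tcp 0"
            var m = line.match(/^(\d+\.\d+) IP (\S+) > (\S+): tcp (\d+)/);
            if (!m) return;
            var ts = parseFloat(m[1]), src = m[2], dst = m[3], len = parseInt(m[4], 10);
            if (len === 0) {
                // zero-length packet from the client: the first is the SYN (1),
                // the second is the handshake ACK (3)
                var c = conns[src] || (conns[src] = {});
                if (c.syn === undefined) c.syn = ts;
                else if (c.ack === undefined) c.ack = ts;
            } else {
                // first data payload from the server (4): (4) - (1) + (3) - (1)
                var conn = conns[dst];
                if (conn && conn.syn !== undefined && conn.ack !== undefined) {
                    latencies.push((ts - conn.syn) + (conn.ack - conn.syn));
                    delete conns[dst];
                }
            }
        });
    });

    setInterval(function () {
        if (latencies.length === 0) return;
        var payload = {};
        payload['aggregate`latency'] = { _type: 'n', _value: latencies };
        payload[os.hostname() + '`latency'] = { _type: 'n', _value: latencies };
        latencies = [];
        var req = https.request({
            host: 'trap.noit.circonus.net',   // placeholder; use your check's URL
            path: '/module/httptrap/' + uuid + '/' + secret,
            method: 'PUT'
        });
        req.on('error', function (e) { console.error(e); });
        req.end(JSON.stringify(payload));
    }, 500);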

The Result:


In the above graph, we can see three distinct nodes and the estimated latency (in seconds) of delivering the first byte of data to the client. Every single client TCP connection is accounted for. There is no sampling or missing data; we see a comprehensive and complete picture of our service latencies.

This requires only 4 metrics in Circonus.

Manage Metrics with circonus-cmi

As an environment grows, so does the headache of managing checks and metrics. Every new machine and cluster node requires a new set of checks and metrics. The Circonus API and the circonusvi tool help, but those still require manual intervention. Wouldn’t it be nice if the same tools used for provisioning could handle Circonus setup as well? With circonus-cmi, that gap has now been bridged. Circonus-cmi is a lightweight command line application written in Node.js, which reads configuration information from provisioners and updates Circonus accordingly.

Circonus-cmi currently supports Amazon AWS and Chef Server. The AWS integration is used to tag existing checks, based on their target, with EC2 region, zone, ID, and instance type information. This becomes very useful when one wants to compare performance between regions, for instance. The Chef Server integration, on the other hand, is able to do the heavy lifting of check management. By specifying checks, tags, and metrics within the Chef attribute hierarchy, Circonus metrics can be both added and updated. It is additive by default, merging the check specification in Chef with an existing check configuration in Circonus, unless explicitly told to do otherwise. As a result, there’s no worry that circonus-cmi will wipe out changes made manually.

The goal of circonus-cmi is to make check management as easy and flexible as possible. If there are any features that you’d like to see added, please let us know.

On the Move: Circonus Mobile

It’s been a long time coming, but it’s finally here: a Circonus that’s optimized for low-resolution devices like your smartphone or tablet! For the past few weeks, eagle-eyed Circonus customers may have noticed links to the mobile site creeping into various places, such as the login page (login.circonus.com) and the application footer. Although we officially “flipped the switch” a few weeks ago to enable the new Circonus Mobile experience, it is definitely a work in progress. To visit Circonus Mobile, visit m.circonus.com or use the “Switch to Mobile Site” link in the main site footer.

The most common need when using Circonus on a smartphone is being able to easily and quickly respond to alerts, so we made alerting functionality a priority when developing this new interface. The Current Alerts and Alert History pages are complete, already having feature parity with the main site. Also fully implemented are the Account Profile and Account Brokers pages, along with the User Profile and User Preferences pages, but most other pages are being implemented in a phased manner, beginning with read-only versions. TIP: when viewing a paginated list of items in Circonus Mobile, you may swipe left and right to page through the list (if you’re using a browser which supports touch events).

Our long-term goal for Circonus Mobile is not only achieving feature parity with the main site, but going further with push notifications and other native app functionality. But to give you a peek into our short-term goals, we’re currently working on making the Maintenance page fully functional, and after that we will move on to read-only versions of the Graphs and Worksheets pages.

If you use Circonus Mobile and don’t see a feature from the main site that you want, please let us know! We’re continually implementing features on the mobile site, but we’re more than willing to rearrange our implementation roadmap based on our customers’ needs.

What’s New Q1 2013 Edition

Navigation and URL Changes

I’ll start this update by talking about the most obvious changes to the UI: our new, completely horizontal navigation and new URL structure.

We had received a lot of feedback about the mix of horizontal and vertical navigation. We were told the tabs were hard to read, especially with the text being rotated 90 degrees; users weren’t sure that the tabs were part of the navigation once they got to the page; and the ever-confusing + button had to be explained to almost everyone. It was our take on what I would call a classic site layout, but in the end it didn’t really resonate with our customers.

The new layout is much easier to understand and use. Main nav at the top, sub nav below, search box in the upper right where one would expect it, and the + button has been replaced with text like “+ New Check” or “+ New Graph”. An added benefit of this change is that we were able to reclaim some vertical whitespace, making the page a bit tighter. The changes aren’t purely aesthetic, either: we recently read Ben Kamens’ article on breaking down Amazon’s mega dropdown, and we incorporated some of those ideas, looking at mouse acceleration and direction so we don’t flip to another sub nav on hover as you try to navigate to a new page.

As you probably also noticed, our URL structure changed to account.circonus.com. This should be as transparent as possible: old URLs will continue to work for the foreseeable future, though in-app links will take you to the new format.

Simplified Maintenance

In the middle of February we launched the much awaited maintenance window rework. Previously, the controls were fine-grained, requiring users to mark individual metrics as in maintenance. While this provided a ton of flexibility, it made it more difficult to work at a higher level as machines were taken offline for work, etc.

The new maintenance windows solve this by providing multiple layers of control. Starting at the metric level, you can now also set maintenance on a check, a check bundle, a host, or your entire account. Along with this change, items in maintenance no longer trigger new alerts in the UI. Previously you would see each alert, but notifications would be silenced; this reduction in noise should help you find problems faster.

MySQL From C to Java

Our last item involves a broker update that is coming down the pike later this week, and one we wanted to make sure was announced a little ahead of time.

Currently, the way that we communicate with MySQL is through the C library that we build as a module to our broker. This library can at times cause stability issues when certain scenarios occur; typically we see this when a server stops responding correctly. Additionally, MySQL does some tricky things internally with the various “show X” commands that make it hard to maintain the various nuances in C. Because of this, we are switching out the C lib for a JDBC implementation; JDBC is already how we talk to Oracle and MS SQL Server.

We’ve been running this new version of the check internally and with some of our customers for a while, taking special care to make sure metric names don’t change, and we are pretty confident that this will be a transparent switch. Additionally, we have been able to add much improved support for commands like “show status” and “show master/slave status”, among others. Should you notice any issues after an upgrade, please contact support.

Update (3/21): Due to some unrelated changes, the release of the new broker code has been rescheduled to next week.

2013 has already proven busy for us and doesn’t look to be slowing down. As always, give us your feedback, user input helps drive our development.

Web Application Stat Collection with Node.js and Circonus

Collecting and carefully monitoring statistics is an essential practice for catching and debugging issues early, as well as minimizing degraded performance and downtime. When running a web service, two of the most fundamental and useful stats are request/response rates (throughput) and response times (latency). This includes both interactions between a user and the service as well as between the backend components. Fortunately, they’re also two of the most straightforward metrics to aggregate and track.

The rates are perhaps the easiest to track, as they are not much more than incrementing counters. In essence, one increments a counter each time a request is made and each time a response is sent. Individual requests and responses are unrelated as far as counts are concerned. In a multi-process or clustered environment, a tool such as Redis, with its atomic increments, can be used for global aggregation. Once aggregated, it can be checked and monitored by Circonus, with rates being derived from count changes over time.
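
As a sketch of that counting pattern (using the node_redis client; the hash and field names here are arbitrary examples for illustration):

    var redis = require('redis');
    var client = redis.createClient(6379, 'localhost');

    // Count one request in a shared Redis hash. HINCRBY is atomic, so many
    // worker processes can update the same counters safely.
    function countRequest(uri) {
        client.hincrby('myapp_requestTotal', 'total', 1); // overall counter
        client.hincrby('myapp_requestTotal', uri, 1);     // per-URI counter
    }

    // The same idea on the way out, keyed by response status.
    function countResponse(uri, status) {
        client.hincrby('myapp_responseTotal', 'total', 1);
        client.hincrby('myapp_response_' + status, uri, 1); // e.g. status = "success"
    }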

Response times are a bit trickier. Time needs to be tracked between a specific request being made and the response being sent for that request. Each time needs to be tracked for use beyond simple averages, as histograms and heatmaps often tell a very different story. Sending these values to Circonus via an HTTP trap will provide that level of detail. Local collection of times before communication to Circonus should be used to prevent an unnecessarily large number of calls being made to store the data.

[Image: Circonus line plot of latency]

The above image is a line graph of the approximate average latency in ms over time for a particular URI. According to that data, average latency is more or less consistent, just below 1k. That conclusion, however, is not an entirely accurate description of the system. With a heatmap overlay, this becomes clear.

[Image: Circonus histogram plot of latency with line overlay]

The picture painted here is quite different than with the simple line graph. The first key element revealed is that the latency is multimodal, with high frequency around both 1k and 0.5k. The second is that there is a comparatively small, though non-negligible, frequency well in excess of 1k. Perhaps these two features of the data are expected, but may indicate an issue with the system. This kind of insight is often essential to the identification and remedy of problems.

The process of collecting all of this information will become clearer with an example. The following implementation is based on a Node.js web API service using the Express.js framework, but the principles apply equally well to other frameworks and to any internal queries or flows. A module has been created to provide this sample functionality and can be installed with:

npm install stat-collector

A simple app utilizing the module to collect both stat types:

    var express = require('express');
    var StatCollector = require('StatCollector');

    var statCollector = new StatCollector();
    // Global debug logging (default false)
    statCollector.setDebug(true);

    var queryCounterConfig = {
        // Configuration options for node-redis
        redisConfig: {
            host: 'localhost',
            port: 1234,
            options: {}
        },
        // Prefix for stat hashes in Redis (e.g. 'foo_' for foo_requestTotal)
        statPrefix: 'foo_',
        // Interval (ms) in which to update Redis (default 30000)
        pushInterval: 30000,
        // Debug logging (default false)
        debug: false
    };

    var responseTimerConfig = {
        // Circonus HTTP Trap check config
        circonusConfig: {
            host: 'foo',
            port: 1234,
            path: '/'
        },
        // Interval (ms) in which to push to Circonus (default 500)
        pushInterval: 500,
        // Debug logging for this module (default false)
        debug: false
    };

    statCollector.configureQueryCounter(queryCounterConfig);
    statCollector.configureResponseTimer(responseTimerConfig);

    var app = express();

    app.configure( function(){
        app.use(statCollector.middleware());
        app.use(app.router);
    });

    app.get('/foo',function(req,res){
        res.jsend('success',{'foo':'bar'});
    });

    app.listen(9000);

Only the components of StatCollector that are configured will be enabled. The QueryCounter stats contain counter sets for requests and responses, as well as for each response status, with a counter for each URI. It establishes a connection to Redis, and a method for pushing data to Redis is set on an interval. The ResponseTimer stats contain one object of the following form for each URI:

{ _type: "n", _value: [] }

A method to push this data to Circonus is also set on an interval.

The middleware takes care of three things: enhancing the response object, registering the URI with enabled components, and telling the enabled components to track the request/response. In the basic preparation, res is given a method jsend to be used instead of res.send, standardizing the JSON output structure and saving the response status (“success”, “fail”, “error”) to res.jsendStatusSent. The res.end method, which is called automatically after the response is handled, is wrapped so that it will emit an end event. The URI registration adds URI-specific counters and objects to the QueryCounter and ResponseTimer stats, respectively, if they don’t already exist. The request and response are then fed through a tracking method that will record statistics both at the time of the request and after a response, on the end event. This is just a detailed way of saying: “when a request completes, note how long it took and what kind of response it was.”

[Image: Circonus plot of some data]

The general flow for request and response count statistics is from the Node.js processes to Redis to Circonus via Redis-type checks. On a request, the total request and URI-specific request counters are incremented. When res.end fires, the corresponding URI-specific counters are incremented in QueryCounter, with res.jsendStatusResult determining the appropriate response type counter. Periodically, these stats are used to increment a set of Redis hashes and are then reset. In Redis, the request, response, and each response type are separate hashes, with ‘total’ and specific URIs as fields, and the corresponding counters as values. A set of Redis-type checks are set up in Circonus to run an hgetall query on each of those hashes. Since these are monotonically increasing values, when graphing rates, the “counter” display type should be used to show the first order derivative (rate of change). When graphing total values, gauge (raw value) is appropriate.

For response times, data goes from Node.js processes to Circonus via an HTTP Trap. On the res.end event, the overall response time is calculated as the difference between the time saved when the req came in and the current time. The difference is pushed into the _value array of the URI-specific object. Periodically, the entire set of stat objects is converted to JSON and pushed to a Circonus HTTP trap, being subsequently reset. For optimal histogram live view, this push is done at sub-second intervals. In Circonus, the metrics must be enabled twice for histograms; the first time will enable collection, the second time will enable histogram support via the icon that resembles three boxes. Both are collected, distinguished by the icons. Graphing the standard metric will yield the normal line/area graph while graphing the histogram metric will yield the histogram/heatmap.
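
For example, the JSON document pushed to the HTTP trap at the end of an interval might look roughly like this (the URIs and values are illustrative):

    {
        "/foo": { "_type": "n", "_value": [412.3, 9.8, 1021.6, 502.2] },
        "/bar": { "_type": "n", "_value": [250.1] }
    }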

Request & response rates and response times are so valuable and straightforward to track, there’s absolutely no reason not to.

Brian Bickerton is a Web Engineer at OmniTI, working on applications where scalability and uptime are critical, using a variety of tools and technologies.

Interacting with Circonus through your text editor – circonusvi

I’m a big fan of command line tools for many tasks. There are just some tasks that can be done quicker and easier with a little typing rather than pointing and clicking. A little while ago, I discovered a gem of a command line tool called ldapvi. Without going into too much detail, this tool lets you run a query against an LDAP server where the results show up in your favorite text editor. The magic part is that when you edit that file and save it, whatever changes you made are then pushed back to the server. This method of working can be extremely flexible, as you have all of the power of your text editor at your fingertips to make both small and sweeping changes quickly and easily.

When the new Circonus API was released, I realized that it was now possible to do the same thing for Circonus with relative ease. It would have been impossible to do with the old API, but the new one was designed with consistency in mind, and a simple (dare I say the word REST) interface. The result is circonusvi. Circonusvi essentially allows you to interact directly with the Circonus API via your text editor, and it works as follows:

  • You decide what you want to work with: check bundles, rule sets, worksheets and so on, and pick that as your endpoint.
  • Circonusvi queries the API and lists all resources, which are returned as JSON.
  • Optionally, circonusvi filters this output to only show the resources you want.
  • This JSON (slightly modified) is shown in your text editor, where you can make changes (edit checks, remove checks, add new checks) as needed.
  • Circonusvi identifies what changes were made, and pushes them back to the server.

Here’s a simple example, taken from an actual situation we came across: needing to switch contact groups for a series of rules. We had some rules which were a severity 2 (our convention is that sev2 alerts are mail only), and we realized that we really should have them be sev1 (in other words, wake up the on call SA). To make this change in the web interface, you need to go into each rule individually, change the criteria from sev2 to sev1, and change the contact group associated with sev1 alerts. This is time consuming and error prone, but at the same time, it’s a one-off task that you probably don’t want to write a script for.

First, we identify the contact groups in circonusvi:

./circonusvi.py -a example_com -e contact_group

{
    # Mail only
    "/contact_group/123": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "sa@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "Mail only",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    },
    # On call
    "/contact_group/124": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "oncall@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "On call",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    }
}

This shows two contact groups, with the ID of the mail only contact group being /contact_group/123, and the on call contact group being /contact_group/124.

Next, we need to do the actual editing:

./circonusvi.py -a example_com -e rule_set 'metric_name=^foo$'

Here we specified a filter on the rules to match only those rules that apply to a metric named ‘foo’. The part before the equals sign is the key you filter on. You can specify any key in the results for the filter, as long as it has a simple value (i.e. string or number). The right hand side of the filter is a regular expression. Any entry that matches the regular expression will be displayed. A small section of the results is shown below:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [],
            "2": [
                # Mail only
                "/contact_group/123"
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "2",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

There are two things we want to change here. First, the severity when the rule fires. This is a simple search/replace in your text editor. I’m a vim fan, so:

:%s/"severity": "2"/"severity": "1"/

Next, we need to make sure that severity 1 alerts go to the on call group. They don’t currently:

:%s/"1": [],/"1": [ "/contact_group/124" ],/

If you want, you can remove the sev2 contact group by deleting that section:

:g/"/contact_group/123"/d

The ‘comment’ line above can be left alone, or deleted too with a similar command.

Once you are finished, your file will look something like this:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [ "/contact_group/124" ],
            "2": [
                # Mail only
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "1",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

Save and quit, and you’ll be presented with a prompt asking you to confirm your changes, and give you a chance to preview the changes before submitting them:

0 additions, 0 deletions, 85 edits
Do you want to proceed? (YyNnEeSs?)

I would recommend previewing exactly what has changed before submitting. Pressing S will show a unified diff of what changed. Once you proceed, circonusvi will apply the changes via the API, and you’re all done!

This method of interacting with the API is most useful when making edits or deletions in bulk (if you delete an entry in the text editor, circonusvi will delete it from the API). Additions are possible also, but are more cumbersome, as you have to create the JSON precisely from scratch. For bulk additions, there are other scripts available, such as the circus scripts. However, using your editor to make bulk changes is a very powerful way of interacting with an application. Give it a try and see how well it works for you.

Mark Harrison leads the Site Reliability Engineering team at OmniTI where he manages people and networks, systems and complex architectures for some of today’s largest Internet properties.

PagerDuty Integration Improvements and Alert Formats

Recently we got burned by ignoring a page because the actual message we received lacked detail; it looked like an alert that was known to clear itself. At 3am it is hard to bring yourself to get out of bed when you have seen this alert page and clear time and time again, so it was forgotten. Four hours later the alert was spotted by another admin and resolved, and an analysis was done to determine how this happened.

The root cause, we determined, was the aforementioned lack of detail. When Circonus would send an alert to PagerDuty, we would do so in our “long format,” which is the alert format you get when you receive email notifications (more on this and the “short format” later). PagerDuty then truncates this message to fit a standard 160-character SMS. This truncation of detail led to a lot of alerts looking like each other, and some which were more critical were assumed to be of lesser importance and ignored.

Improvements to PagerDuty Messages

To solve this, we just pushed out a change to include both the short format and long format in a PagerDuty message. The short format is what we use for SMS alerts, and is now the first line of the message. When the truncation happens on their end, you should receive as much detail as possible about the alert. This does lead to redundant information in the message body in their UI and email alerts, but we feel it is for the better.

Secondly, we are providing details about the alerts to PagerDuty’s API. These details currently are:

  • account
  • host
  • check_name
  • metric_name
  • severity
  • alert_value
  • client_url

These details are useful if you are pulling alerts from the PagerDuty API: instead of parsing the message, you receive a JSON object with these keys and their associated values.
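
So an alert’s details object might look roughly like this (the values are illustrative and mirror the example alert below; the URL is a placeholder):

    {
        "account": "Circonus Testing",
        "host": "development.internal",
        "check_name": "Test Check",
        "metric_name": "cpu_used",
        "severity": "2",
        "alert_value": "89.65",
        "client_url": "https://<your-account>.circonus.com/..."
    }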

How Circonus Alerts are Formatted

As mentioned before, Circonus has two alert formats: a long format, which is used for email, XMPP, AIM and PagerDuty alerts, and a short format, which is used for SMS, Twitter and now PagerDuty.

The short format is intended to compress as much detail about the alert as possible while remaining readable and useful. An example of this type of alert:

[Circonus Testing] A2:1 development.internal "Test Check" cpu_used (89.65)

I’ll break this alert up into its various sections to describe it:

  • [Circonus Testing] is the name of the account
  • A = Alert, 2 = Severity 2, 1 = The number of sev 2 alerts. The “A” here could also be R for Recovery
  • development.internal is the hostname or IP this alert was triggered on
  • “Test Check” is the name of the check bundle in the Circonus UI
  • cpu_used is our metric name and (89.65) is the value that triggered the alert

The long format is more self explanatory since we have many more characters to work with.

Account: Circonus Testing

ALERT Severity 2
Check: Test Check
Host: development.internal
Metric: cpu_used (89.65)
Agent: Ashburn, VA, US
Occurred: Tue, 8 Jan 2013 2:25:53

This is the same alert as above, so breaking it apart we have:

  • Account name
  • Alert of Severity 2, this could also be RECOVERY. The alert count is missing because in the long format we will list out each alert separately.
  • The check name
  • The host / IP
  • The metric name and alert value
  • The broker / agent that the alert was triggered from
  • The time that the alert was triggered, if this is a recovery you will also have a cleared time.
  • The Circonus URL to view the alert in the UI

In the future we intend to allow the alert formats to be customized for each contact group, or use these current formats as the default.

Thanks to redundancy built into Circonus, our users were never impacted by the outage that precipitated this change, but if it can happen to us it will happen to others, so we hope these minor changes bring improvements to your response times.