Circonus Instrumentation Packs

In our Circonus Labs public GitHub repo, we have started a project called Circonus Instrumentation Packs, or CIP. This is a series of libraries that make it even easier to submit telemetry data from your application.

Currently there are CIP directories for Go, Java, and Node.js. Each language directory contains useful resources to help instrument applications written in that language.

Some languages have a strong leaning toward frameworks, while others are about patterns, and still others are about tooling. These packs are intended to “meld in” with the common way of doing things in each language, so that developer comfort is high and integration time and effort are minimal.

Each of these examples utilizes the HTTP Trap check, which you can create within Circonus. Simply create a new JSON push (HTTPTrap) check in Circonus using the HTTPTRAP broker, and the CheckID, UUID, and secret will then be available on the check details page.

[Screenshot: check details page showing the CheckID, UUID, and secret]

This can be done via the user interface or via the API. The “target” for the check does not need to be an actual hostname or IP address; the name of your service might be a good substitute.
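
If you prefer to script that step, the sketch below shows roughly what check creation looks like against the v2 REST API. This is a hedged sketch, not a drop-in recipe: the token, broker, target, and metric names are placeholders, and the payload fields should be confirmed against the API documentation.

import json
import urllib.request

# Placeholder credentials -- substitute your own API token.
HEADERS = {
    "X-Circonus-Auth-Token": "00000000-0000-0000-0000-000000000000",
    "X-Circonus-App-Name": "cip-setup",
    "Content-Type": "application/json",
}

# A minimal HTTPTrap (JSON push) check bundle with one histogram metric.
bundle = {
    "type": "httptrap",
    "target": "myservice",              # a service name works as the target
    "brokers": ["/broker/1"],           # the HTTPTRAP broker on your account
    "config": {"asynch_metrics": "true"},
    "metrics": [{"name": "latency", "type": "histogram"}],
}

req = urllib.request.Request(
    "https://api.circonus.com/v2/check_bundle",
    json.dumps(bundle).encode(),
    HEADERS,
)
created = json.load(urllib.request.urlopen(req))

# The submission URL, which embeds the UUID and secret the CIPs need,
# should come back in the created bundle's config.
print(created["config"]["submission_url"])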

We suggest that you use a different trap for each of your apps, as well as for production, staging, and testing.

Below is a bit more detail on each of the currently available CIPs:

Java

Java has a very popular instrumentation library called “metrics,” originally written by Coda Hale and later adopted by Dropwizard. Metrics has some great ideas that we support wholeheartedly; in particular, the use of histograms for more insightful reporting. Unfortunately, the way these measurements are captured and reported makes calculating service level agreements and other such analytics impossible. Furthermore, the implementations of the underlying histograms (“Reservoirs” in metrics terminology) are opaque to the reporting tools. The Circonus metrics support in this CIP is designed to layer (non-disruptively) on top of the Dropwizard metrics packages.

Go

This library supports named counters, gauges, and histograms. It also provides convenience wrappers for registering latency-instrumented functions with Go’s built-in HTTP server.

Initializing only requires that you set the AuthToken (which you generate on your API Tokens page) and the CheckId, and then “Start” the metrics reporter.

You’ll need two GitHub repos: circonus-labs/circonus-gometrics and its histogram dependency, circonus-labs/circonusllhist.

Here is the sample code (also found in the circonus-gometrics readme):

package main

import (
    "fmt"
    "net/http"

    metrics "github.com/circonus-gometrics"
)

func main() {
    // Get your Auth Token at https://login.circonus.com/user/tokens
    metrics.WithAuthToken("cee5d8ec-aac7-cf9d-bfc4-990e7ceeb774")
    // Get your CheckId on the check details page
    metrics.WithCheckId(163063)
    metrics.Start()

    // TrackHTTPLatency wraps the handler and records request latency.
    http.HandleFunc("/", metrics.TrackHTTPLatency("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
    }))
    http.ListenAndServe(":8080", http.DefaultServeMux)
}

After you start the app (go run the_file_name.go), load http://localhost:8080 in your browser, or curl http://localhost:8080. You’ll need to approve access to the API Token (if it is the first time you have used it), and then you can create a graph (make sure you are collecting histogram data) and you’ll see something like this:

[Graph: Go HTTPTrap latency histogram example]

Node.js

This instrumentation pack is designed to allow node.js applications to easily report telemetry data to Circonus using the UUID and Secret (instead of an API Token and CheckID). It has special support for sample-free (100% sampling) collection of service latencies for submission to Circonus, where they can be visualized and alerted on.

Here is a basic example to measure latency:

First, some setup – making the app:

% mkdir restify-circonus-example
% cd restify-circonus-example
% npm init

(Accepting the defaults npm init offers works fine.) Then:

% npm install --save restify
% npm install --save probability-distributions
% npm install --save circonus-cip

Next, edit index.js and include:

var restify = require('restify'),
    PD = require('probability-distributions'),
    circonus_cip = require('circonus-cip');

var circonus_uuid = '33e894e6-5b94-4569-b91b-14bda9c650b1';
var circonus_secret = 'ssssssssh_its_oh_so_quiet';

var server = restify.createServer();

// Record the latency of every completed request and ship it to Circonus.
server.on('after', circonus_cip.restify(circonus_uuid, circonus_secret));

server.get('/', function (req, res, next) {
    // Simulate variable service time with a gamma-distributed delay.
    setTimeout(function () {
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end("Hello to a new world of understanding.\n");
        next();
    }, PD.rgamma(1, 3, 2) * 200);
});

server.listen(8888);

Now just start up the app:

node index.js

Then go to your browser and load localhost:8888, or at the prompt, curl http://localhost:8888.

You’ll then go and create the graph in your account. Make sure to enable collection of the “httptrap: restify `GET /` latency” metric as a histogram, and you’ll end up with a graph like this:

[Graph: Restify GET / latency histogram]

Let us know what you think; more examples and languages will follow. Community participation is encouraged, and feedback of any kind is more than welcome. If you want a demo, or have a specific question, we’re happy to work with you.


Interacting with Circonus through your text editor – circonusvi

I’m a big fan of command line tools for many tasks. There are just some tasks that can be done quicker and easier with a little typing rather than pointing and clicking. A little while ago, I discovered a gem of a command line tool called ldapvi. Without going into too much detail, this tool lets you run a query against an LDAP server, with the results showing up in your favorite text editor. The magic part is that when you edit that file and save it, whatever changes you made are pushed back to the server. This method of working can be extremely flexible, as you have all the power of your text editor at your fingertips to make both small and sweeping changes quickly and easily.

When the new Circonus API was released, I realized that it was now possible to do the same thing for Circonus with relative ease. It would have been impossible with the old API, but the new one was designed with consistency in mind and a simple (dare I say the word REST) interface. The result is circonusvi. Circonusvi essentially allows you to interact directly with the Circonus API via your text editor, and it works as follows:

  • You decide what you want to work with: check bundles, rule sets, worksheets and so on, and pick that as your endpoint.
  • Circonusvi queries the API and lists all resources, which are returned as JSON.
  • Optionally, circonusvi filters this output to only show the resources you want.
  • This JSON (slightly modified) is shown in your text editor, where you can make changes (edit checks, remove checks, add new checks) as needed.
  • Circonusvi identifies what changes were made, and pushes them back to the server (the loop is sketched below).
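
The core of that loop is simple enough to sketch. Below is a condensed, hypothetical version (not circonusvi’s actual code): dump the resources to a temporary file, hand the file to $EDITOR, and compare what comes back against what went in.

import json
import os
import subprocess
import tempfile

def edit_resources(resources):
    # Write the id -> attributes mapping out as pretty-printed JSON.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(resources, f, indent=4, sort_keys=True)
        path = f.name
    # Let the user edit it with whatever $EDITOR is set to.
    subprocess.call([os.environ.get("EDITOR", "vi"), path])
    with open(path) as f:
        edited = json.load(f)
    os.unlink(path)
    # Differing values are edits; missing keys are deletions; new keys
    # are additions.
    changed = {k: v for k, v in edited.items()
               if k in resources and resources[k] != v}
    deleted = [k for k in resources if k not in edited]
    added = {k: v for k, v in edited.items() if k not in resources}
    return changed, deleted, added

Against a RESTful API like the new Circonus one, edits map naturally to updates, deletions to DELETEs, and additions to creations.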

Here’s a simple example, taken from an actual situation we came across: needing to switch contact groups for a series of rules. We had some rules which were severity 2 (our convention is that sev2 alerts are mail only), and we realized that we really should have them be sev1 (in other words, wake up the on-call SA). To make this change in the web interface, you need to go into each rule individually, change the criteria from sev2 to sev1, and change the contact group associated with sev1 alerts. This is time-consuming and error-prone, but at the same time, it’s a one-off task that you probably don’t want to write a script for.

First, we identify the contact groups in circonusvi:

./circonusvi.py -a example_com -e contact_group

{
    # Mail only
    "/contact_group/123": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "sa@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "Mail only",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    },
    # On call
    "/contact_group/124": {
        "aggregation_window": 300,
        "contacts": {
            "external": [
                {
                    "contact_info": "oncall@example.com",
                    "method": "email"
                }
            ],
            "users": []
        },
        "name": "On call",
        "reminders": [
            0,
            0,
            0,
            0,
            0
        ]
    }
}

This shows two contact groups, with the ID of the mail only contact group being /contact_group/123, and the on call contact group being /contact_group/124.

Next, we need to do the actual editing:

./circonusvi.py -a example_com -e rule_set 'metric_name=^foo$'

Here we specified a filter on the rules to match only those rules that apply to a metric named ‘foo’. The part before the equals sign is the key you filter on; you can filter on any key in the results, as long as it has a simple value (i.e., a string or number). The right-hand side of the filter is a regular expression, and any entry whose value matches it is displayed.
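
The filter semantics are easy to emulate. Here is a hedged sketch in Python of roughly what such a filter does (a hypothetical helper, not circonusvi’s code):

import re

def filter_resources(resources, key, pattern):
    # Keep only resources whose value for `key` matches the regex `pattern`.
    regex = re.compile(pattern)
    return {rid: attrs for rid, attrs in resources.items()
            if isinstance(attrs.get(key), (str, int, float))
            and regex.search(str(attrs[key]))}

rule_sets = {
    "/rule_set/1234_foo": {"metric_name": "foo", "check": "/check/1234"},
    "/rule_set/5678_bar": {"metric_name": "bar", "check": "/check/5678"},
}
print(filter_resources(rule_sets, "metric_name", "^foo$"))
# -> only /rule_set/1234_foo survives

A small section of the results is shown below: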

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [],
            "2": [
                # Mail only
                "/contact_group/123"
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "2",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

There are two things we want to change here. First, the severity at which the rule fires. This is a simple search/replace in your text editor. I’m a vim fan, so:

:%s/"severity": "2"/"severity": "1"/

Next, we need to make sure that severity 1 alerts go to the on call group. They don’t currently:

:%s/"1": [],/"1": [ "/contact_group/124" ],/

If you want, you can remove the sev2 contact group by deleting that section:

:g/"/contact_group/123"/d

The ‘comment’ line above can be left alone, or deleted too with a similar command.

Once you are finished, your file will look something like this:

{
    "/rule_set/1234_foo": {
        "check": "/check/1234",
        "contact_groups": {
            "1": [ "/contact_group/124" ],
            "2": [
                # Mail only
            ],
            "3": [],
            "4": [],
            "5": []
        },
        "derive": "counter",
        "link": null,
        "metric_name": "foo",
        "metric_type": "numeric",
        "notes": null,
        "parent": null,
        "rules": [
            {
                "criteria": "on absence",
                "severity": "1",
                "value": "t",
                "wait": 10
            }
        ]
    },
    ...

Save and quit, and you’ll be presented with a prompt asking you to confirm your changes, giving you a chance to preview them before submitting:

0 additions, 0 deletions, 85 edits
Do you want to proceed? (YyNnEeSs?)

I would recommend previewing exactly what has changed before submitting. Pressing S will show a unified diff of what changed. Once you proceed, circonusvi will apply the changes via the API, and you’re all done!

This method of interacting with the API is most useful when making edits or deletions in bulk (if you delete an entry in the text editor, circonusvi will delete it via the API). Additions are possible too, but are more cumbersome, as you have to create the JSON precisely from scratch. For bulk additions, there are other scripts available, such as the circus scripts. However, using your editor to make bulk changes is a very powerful way of interacting with an application. Give it a try and see how well it works for you.

Mark Harrison leads the Site Reliability Engineering team at OmniTI where he manages people and networks, systems and complex architectures for some of today’s largest Internet properties.

Updates From The Tech Team

Now that it is fall and the conference season is just about over, I thought it would be a good time to give you an update on some items that didn’t make our change log (and some that did), what is coming down the road shortly, and just generally what we have been up to.

CEP woes and engineering salvation.

The summer started out with some interesting challenges involving our streaming event processor. When we first started working on Circonus, we decided to go with Esper as a complex event processor to drive fault detection. Esper offers some great benefits and a low barrier to entry for stream processing: you place your events into windows that are analogous to database tables, and it then gives you the ability to query them with a language akin to SQL. Our initial setup worked well, and was designed to scale horizontally (federated by account) if needed. Due to demand, we started to act on this horizontal build-out in mid-March. However, as more and more events were fed in, we quickly realized that even when giving an entire server to one account, the volume of data could still overload the engine. We worked on our queries, tweaking them to get more performance, but every gain was wiped away by a slight bump in data volume. This came to a head near the end of May, when the engine started generating late alerts and alerts with incorrect data. At that point, too much work had been put into making Esper work for not enough gain, so we started on a complete overhaul.

The new system was still in Java, but this time we wrote all the processing code ourselves. The improvement was incredible: events that once took 60ms to process now took on the order of 10µs. To validate the system, we split the incoming data stream onto the old and new systems and compared the data coming out. The new system, as expected, found alerts faster, and whenever we saw a discrepancy, the new system was found to be correct. We launched this behind the scenes for the majority of users on May 31st, and completed the rollout on June 7th. Unless you were one of the two customers affected by the delinquency of the old system, this mammoth amount of work got rolled out right under your nose and you never even noticed; just the way we like it. In the end we collapsed our CEP system from 3 (rather saturated) nodes back to 1 (almost idle) node, and we have a lot more faith in the new code. Here is some eye candy that shows the CEP processing time in microseconds over the last year. The green, purple, and blue lines are the old CEP being split out, and the single remaining green line is the current system.

We tend to look at this data internally on a logarithmic scale to better see the minor changes in usage. Here is the same graph but with a log base 10 y-axis.

Distributed database updates.

Next up were upgrades to our metric storage system. To briefly describe the setup: it is based on Amazon’s Dynamo. We have a ring of nodes, and as data is fed in, we hash the ids and names to find which node it goes on, insert the data, and use a rather clever means of deterministically finding subsequent nodes to meet our redundancy requirements. All data is stored at least twice, and never on the same node. Theo gave a talk at last year’s Surge conference that is worth checking out for more details. The numeric data is stored in a proprietary, highly compact format, while text data was placed into a Berkeley DB whenever it changed.
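
To make that placement scheme concrete, here is a hypothetical sketch of Dynamo-style placement. The node names, hash function, and replica count are illustrative, not our actual implementation:

import hashlib
from bisect import bisect

NODES = ["node-a", "node-b", "node-c", "node-d"]

def ring_position(key):
    # Hash a key to a (large) integer position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Each node's position on the ring, sorted so we can walk it clockwise.
RING = sorted((ring_position(n), n) for n in NODES)

def replicas(check_id, metric_name, copies=2):
    # Hash the id/name to a starting point, then walk the ring to find
    # `copies` distinct nodes -- never two copies on the same node.
    pos = ring_position("%s/%s" % (check_id, metric_name))
    i = bisect(RING, (pos, "")) % len(RING)
    chosen = []
    while len(chosen) < copies:
        node = RING[i % len(RING)][1]
        if node not in chosen:
            chosen.append(node)
        i += 1
    return chosen

print(replicas("1234", "foo"))  # e.g. ['node-c', 'node-a']

The appeal of a scheme like this is that every piece of data lives on multiple deterministic nodes, which is exactly the property the rebuilds described below rely on.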

The Berkeley DB decision was haunting us. We started to notice potential issues with locking as the data size grew, and the performance and disk usage weren’t quite where we wanted them to be. To solve this we wanted to move to leveldb. The code changes went smoothly, but then the problem arose: how do we get the data from one on-disk format to the other?

The storage system was designed from the beginning to allow one node to be destroyed and rebuilt from the others. Of course, a lot of systems are like this, but who ever actually wants to try it with production data? We do. With the safeguards of ZFS snapshotting, over the course of the summer we would destroy a node, bring it up to date with the most recent code, and then have the other nodes rebuild it. Each destroy, rebuild, and bring-online cycle took the better part of a work day, and got faster and more reliable with each exercise as we cleaned up some problem areas. During the process, user requests were simply served from the active nodes in the cluster, and outside of a few minor delays in data ingestion, no users were impacted. Doing these “game day” rebuilds has given us a huge confidence boost that, should a server go belly up, we can quickly be back to full capacity.

More powerful visualizations.

Histograms were another big addition to our product. I won’t speak much about them here; instead you should head to Theo’s post on them here. We’ve been showing these off at various conferences, and have given attendees at this year’s Velocity and Surge insight into the wireless networks, with real-time dashboards showing client signal strengths, downloads and uploads, and total clients.

API version 2.

Lastly, we’ve received a lot of feedback on our API: some good, some indifferent, but a lot of requests to make it better. So we did. This rewrite was mostly from the ground up, but we did try to keep a lot of the code underneath the same, since we knew it worked (some is shared by the web UI and the current API). It conforms more tightly to what one comes to expect from a RESTful API, and for our automation-enabled users, we have added in some idempotence, so your consecutive Chef or Puppet runs on the same server won’t create duplicate checks, rules, etc. We are excited about getting this out; stay tuned.
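
To illustrate what that idempotence buys you, here is a hypothetical sketch; the endpoint, payload, and `_cid` field are illustrative guesses at the new API, not its final shape. A configuration-management run that fires twice should end up referencing the same check, not two:

import json
import urllib.request

HEADERS = {
    "X-Circonus-Auth-Token": "00000000-0000-0000-0000-000000000000",
    "X-Circonus-App-Name": "provisioning",
    "Content-Type": "application/json",
}

bundle = {
    "type": "httptrap",
    "target": "webapp.example.com",
    "brokers": ["/broker/1"],
    "metrics": [{"name": "latency", "type": "histogram"}],
}

def create_bundle(payload):
    req = urllib.request.Request(
        "https://api.circonus.com/v2/check_bundle",
        json.dumps(payload).encode(),
        HEADERS,
    )
    return json.load(urllib.request.urlopen(req))

# Two identical runs, as consecutive Chef/Puppet converges would produce.
first = create_bundle(bundle)
second = create_bundle(bundle)

# With idempotent creation, both resolve to the same resource id.
assert first["_cid"] == second["_cid"]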

It was a busy summer and we are looking forward to an equally busy fall and winter. We will be giving you more updates, hopefully once a month or so, with more behind the scenes information. Until then keep an eye on the change log.

Template API

Setting up a monitoring system can be a lot of work, especially if you are a large corporation with hundreds or thousands of hosts. Regardless of the size of your business, it still takes time to figure out what you want to monitor, how you are going to get at the data, and then to start collecting, but in the end it is very rewarding to know you have insight.

When we launched Circonus, we had an API to do nearly everything that could be done via the web UI (within reason) and expected it to make it easy for people to program against and get their monitoring off the ground quickly. Quite a few customers did just that, but still wanted an easier way to get started.

Today we are releasing the first version of our templating API to help you get going (templating will also be available via the web UI in the near future). With this new API you can create a service template by choosing a host and a group of check bundles as “masters.” Then you simply attach new hosts to the template, and the checks are created for you and deployed on the agents. Check out the documentation for full details.

Once a check is associated with a template, it cannot be changed on its own; you must alter the master check first and then re-sync the template. To re-sync, you just need to GET the current template definition and then POST it back; the system will take care of it from there.

To remove bundles or hosts, just remove them from the JSON payload before POSTing, and choose a removal method. Likewise, to add a host or bundle back to a template, just add it into the payload and then POST. We offer a few different removal and reactivation methods to make it easy to keep or remove your data and to start collecting it again. These methods are documented in the notes section of the documentation.
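
Putting the last two paragraphs together, a re-sync is a GET-modify-POST cycle. The sketch below is hypothetical: the resource path, field names, and removal-method value are illustrative, so consult the documentation for the real ones.

import json
import urllib.request

BASE = "https://api.circonus.com/api/json"
HEADERS = {
    "X-Circonus-Auth-Token": "00000000-0000-0000-0000-000000000000",
    "X-Circonus-App-Name": "template-sync",
}

def get_json(path):
    req = urllib.request.Request(BASE + path, headers=HEADERS)
    return json.load(urllib.request.urlopen(req))

def post_json(path, payload):
    headers = dict(HEADERS)
    headers["Content-Type"] = "application/json"
    req = urllib.request.Request(
        BASE + path, json.dumps(payload).encode(), headers)
    return json.load(urllib.request.urlopen(req))

# Re-sync after changing a master check: GET the definition, POST it back.
template = get_json("/template/42")          # hypothetical resource path
post_json("/template/42", template)

# Removing a host: drop it from the payload, pick a removal method, POST.
template["hosts"] = [h for h in template["hosts"] if h != "10.1.2.3"]
template["removal_method"] = "deactivate"    # illustrative value
post_json("/template/42", template)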

Future plans for templates include syncing rules across checks and adding templated graphs so that adding a new host will automatically add the appropriate metrics to a graph. Keep an eye on our change log for enhancements.

Access Tokens with the Circonus API

When we rolled out our initial API months ago, we took a first stab at getting the most useful features exposed to help customers get up to speed with the service. A handful of our users expressed displeasure with having to use their login credentials for basic access to the management API. Starting today, we’re pleased to announce support for access tokens within the Circonus API.

Tokens offer fine-grained access for each user to a specific service account, at your permission role or lower. For example, if Bob is a normal user on the Acme Inc. account, he can create tokens allowing normal or read-only access. Multiple applications can use the same token, but each application has to be approved by Bob in the token management page, diabolically named My Tokens. To get started, browse over to this page inside your user profile, select your account from the drop-down and click the “plus tab” to create your first token.

[Screenshot: creating a new token on the My Tokens page]

The first time you try to connect with a new application using your token, the API service will hand back an HTTP/1.1 401 Authorization Required. When you visit the My Tokens page again, you’ll see a button to approve the new application-token request. Once this has been approved, you’ll be able to connect to the API with your new application-token.

[Screenshot: approving a pending application-token request]

Using the token is even easier. Just pass the token as X-Circonus-Auth-Token and your application name as X-Circonus-App-Name in your request headers. Here’s a basic example using curl from the command-line:

$ curl -H "X-Circonus-Auth-Token: ec45e8a2-d6d9-624c-c21c-a83f573731c1" \
       -H "X-Circonus-App-Name: testapp" \
       https://api.circonus.com/api/json/list_accounts

[{
   "account":"social_networks",
   "account_description":"Monitoring for The Social Network.",
   "account_name":"Social Networks",
   "circonus_metric_limit":500,
   "circonus_metrics_used":124
}]
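
Here is the same call in Python, with the first-use approval flow handled: if the API answers 401, approve the application on the My Tokens page and the loop below simply retries. This is a sketch; only the endpoint and headers above are taken from this post.

import json
import time
import urllib.error
import urllib.request

HEADERS = {
    "X-Circonus-Auth-Token": "ec45e8a2-d6d9-624c-c21c-a83f573731c1",
    "X-Circonus-App-Name": "testapp",
}
URL = "https://api.circonus.com/api/json/list_accounts"

def list_accounts():
    req = urllib.request.Request(URL, headers=HEADERS)
    while True:
        try:
            return json.load(urllib.request.urlopen(req))
        except urllib.error.HTTPError as err:
            if err.code != 401:
                raise
            # First use of this application-token pair: approve it on the
            # My Tokens page, then this loop retries.
            print("Awaiting approval for this application-token; retrying...")
            time.sleep(30)

for account in list_accounts():
    print(account["account"], account["circonus_metrics_used"])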

One of the more convenient features of our tokens is how well they integrate with user roles. A token will never have higher access permissions than its owner. In fact, if you lower a user’s role on your account, their tokens automatically reflect this as well: changing a “normal” user to “read-only” drops their tokens to the same access level. But if you restore their original role, the tokens will also have their original privileges restored. Secure and convenient.

If you have any questions about our new API tokens or would like to see more examples with the Circonus API, drop us a line at hello@circonus.com.