One Dashboard to Rule Them All

<

four icons representing a dashboard

Ever dream of having a systems monitoring dashboard that was actually useful? One where you could move things around, resize them, and even choose what information you wanted to display? Large enterprise software packages may have decent dashboards, but what if you’re not a large enterprise or you don’t want to pay an arm and a leg for bloatware? Perhaps you have a good dashboard that came with a specific server or piece of hardware, but it’s narrowly-focused and inflexible. You’ve probably thought about (or even tried) creating your own dashboard, but it’s a significant undertaking that’s not for the faint-of-heart. What’s the solution? Should we just learn to live with sub-optimal monitoring tools?

Here at Circonus, we decided that this was one problem we could eliminate. Since we’ve built a SaaS offering that’s flexible enough to handle multiple different data sources, why shouldn’t we build a dashboard that’s flexible enough to display them? So we created a configurable dashboard that lets you monitor your data however you want. Do you want to show graphs side-by-side but at different sizes? Done. Want an up-to-date list of alerts beside those graphs? Easy. How about some real-time metric charts that automatically refresh? No problem. Our new configurable dashboards allow you to add all these items and more. Let’s dig in and see how these new dashboards work.

Dashboard Basics

Start by going to the standard ‘Dashboard’ and clicking the new ‘My Dashboards’ tab. These dashboards are truly yours; any dashboards you create are only visible to you (by default) and are segregated by account. If you want to share a custom dashboard with everyone else on an account, check that dashboard’s ‘share’ checkbox in your list of custom dashboards.

After you have created a custom dashboard, you may set it to be your default dashboard by using the radio buttons down the left side of your custom dashboards list. If you do this, you will be greeted with your selected dashboard when you login to Circonus. By selecting the ‘Standard Circonus Dashboard’ as your default dashboard, you will revert to being greeted with the old dashboard you’re already used to seeing.

part of the interface for creating a new dashboard layout

To create a new custom dashboard, click the ‘+’ tab and choose a layout. At first you will see only a couple predefined layouts available, but after you create a dashboard, its layout will then be available to choose when creating other new dashboards.

Now a note about working with these dashboards: every action auto-saves so you never have to worry about losing changes you’ve made. However, if you haven’t given your dashboard a title, the dashboard isn’t permanently saved yet. If you forget to title your dashboard and go off to do other things, don’t worry, the dashboard you created is saved in your browser’s memory. All you have to do is visit the ‘My Dashboards’ page and your dashboard will be listed there. With two clicks you can give your dashboard a title and save it permanently. (Please note our minimum browser requirements ‘Firefox 4+ or Chrome’ which are especially applicable for these new custom dashboards, since we’re using some features which are not available in older browsers.)

So let’s create a dashboard. Choose a layout, click ‘Create Dashboard,’ and you will be taken to the new dashboard with the ‘Add A Widget’ panel extended. To begin, let’s check out the title area. Notice that when you hover over the title, a dropdown menu appears. This lists your other dashboards on the current account (as well as dashboards shared by other account members) and is useful for quickly switching between dashboards.

the dashboard interface showing the dashboard controls icons

To the right of the title are some icons. The first icon opens the grid options dialog, which lets you change the dimensions of the dashboard grid, hide the grid (it’s still active and usable, though), enable or disable text scaling, and choose whether or not to auto-hide the title bar in fullscreen mode. The second icon toggles fullscreen mode on and off. Once you enter fullscreen mode a third icon will appear, and this icon toggles the ‘Black Dash’ theme (this theme is only available in fullscreen mode). The current states of both fullscreen mode and the ‘Black Dash’ theme are saved with your dashboard.

One other note about the dashboard interface: if you leave a dashboard sitting for more than ten or fifteen seconds and notice that parts of the interface disappear (along with the mouse cursor), don’t worry, it’s just gone to sleep! A move of the mouse will make everything visible again. (If there are any widget settings panels open, though, the sleep timer will not activate.)

Widgets

Now for the meat of it all: widgets. We currently have ten widgets which can be added to the dashboard grid to show various types of data, and we’ll be adding more widget types and contents in the future. Following is a quick rundown of the currently available widgets:

Graph

Graph widgets let you add existing graphs to your dashboard. You may choose any graph from the “My Graphs” section under your current account. Graph widgets are refreshed every few minutes to ensure they’re always up-to-date.

Beacon Map

Map widgets let you add existing Beacon maps to your dashboard. You may choose any map query from the “Beacons” page (under the “Checks” section of your current account). Map widgets are updated in real-time.

Beacon Table

Table widgets let you add existing Beacon tables to your dashboard. You may choose any table query from the “Beacons” page (under the “Checks” section of your current account). Table widgets are updated in real-time.

Chart

Chart widgets let you select multiple metrics to monitor and compare in a bar or pie chart. Chart widgets are updated in real-time.

Gauge

Gauge widgets let you monitor the current state of a single numeric metric in a graphical manner, displaying the most recent value on a bar gauge (dial gauges are coming soon). Gauge widgets are updated in real-time.

Status

Status widgets let you monitor the current state of one or more metrics, displaying the most recent value with custom formatting. This is most useful for text metrics, but it may be used for numeric metrics as well. Status widgets are updated in real-time.

HTML

HTML widgets let you embed arbitrary HTML content on your dashboard. It can be used for just about anything, from displaying a logo or graphic to using an iframe to embed more in-depth content. Everything is permissible except Javascript. HTML widgets are refreshed every few minutes to ensure they’re always up-to-date.

List

List widgets let you add lists of graphs and worksheets to your dashboard, ordered by their last modified date. You may specify how many items to list and (optionally) a search string to limit the list. List widgets are refreshed every few minutes to ensure they’re always up-to-date.

Alerts

Alerts widgets let you monitor your checks by showing the most recent alerts on your current account. You may filter the alerts by their age (how long ago they occurred), by particular search terms, by severity levels, or other status criteria. Alerts widgets are refreshed every few minutes to ensure they’re always up-to-date.

Admin

Admin widgets let you monitor selected administrative information, including the status of all Circonus agents on your current account. Admin widgets are refreshed every few minutes to ensure they’re always up-to-date.

icons representing some of the current widget types

To add widgets to the dashboard grid, there are two methods: you may use the ‘drag-and-drop’ method (dragging from the “Add a Widget” panel), or you may first click the target grid cell and then select the widget you want to place there. (Note: in fullscreen mode only the latter method is available.) After a widget has been added, some types of widgets will automatically activate with default settings, but most will be inactive. If the widget is inactive, click it to open the settings panel and get started. Once the widget is activated, the settings panel is available by clicking the settings icon in the upper right corner of the widget. In the lower right corner of the widget is the resize handle, so you can resize the widget as frequently as you want. And let’s not forget being able to rearrange the widgets, every widget has a transparent ‘title bar’ at its top which you can use to drag it around. I won’t get into the details of settings for every type of widget, because they should be self-explanatory (and that would make this one super-long blog post). But suffice it to say, there are plenty of options for everyone.

We’ve been working hard to create a configurable dashboard that will be as flexible as Circonus itself is, and we believe we’ve hit pretty close to the mark. Here’s a sample dashboard showing the power of these new dashboards:


What’s in a number?

Numbers, numbers, numbers; we’re all about numbers here at Circonus. We have trillions of data points which we feed into a slew of algorithms and processes to help our users identify problems with their data. But what are these numbers? It turns out that isn’t an easy question to answer.

Like most monitoring systems, Circonus performs an action from which it extracts one or more “metrics.” A common example is running a database query and measuring both the correctness of the result (as a boolean: good vs. bad) and the latency with which the answer was delivered. Similarly, it could load a web page, ensure that some specified content is successfully returned and measure the time it took. More concretely, when performing an HTTP transaction, it could obtain the following useful metrics: time to establish the TCP connection, time until the first byte of data is received, and time until the last byte of data is received. These measurements can reveal a variety of problems both on the surface of your architecture as well as provide indications of issues deep within.

While most monitoring systems (and parts of Circonus) work this way, the nature of these metrics is most interesting in what it is missing. In other words, it is vital to understand what they do not tell you. You are not observing real information; instead you are producing a single synthetic event and measuring it. The data are not real (and worse, may be far from representative.) Before I dive in and talk about why these data aren’t “good,” I’ll talk a bit about why they are “good enough” for many things.

Synthetic measurements work very well for components that can be measured in terms of quantities or rates. How many of something do you have? How quickly is it increasing or decreasing? Simple things like this are: disk space, I/O operations per second, the number of HTTP requests serviced, CPU usage, memory usage, etc. The most important factor is that these things are one-dimensional.

Data like these are both easy to visualize and critically important for things like anomaly detection and capacity planning. Being of a single dimension, understanding patterns in the data is easier for both humans and computers. However, as we start combining these data points, the world goes quickly out of focus.

For the moment, let’s assume we measure total money spent on an e-commerce site (you’d be crazy to not measure this.) In addition to that, we measure total transactions performed (number of sales.) With these metrics, we have some clear data: total dollars and dollars/hour (by deriving the samples) and total sales and sales/hour (again by deriving.) These numbers are pretty clear and we can make some good judgments about what to expect from day to day. However, you might ask, “How much is the average transaction size?” The answer to this question is simple: total money spent divided by total sales. Unfortunately, the average is not a useful number; just ask any statistician.

When you start looking at averages, you start losing information. We use averages to zoom out on graphs; you might notice that when you have a sudden spike (let’s say in traffic) you will see a much higher spike when zoomed in than when zoomed out. Why? If you were serving between 2900 and 3300 requests per second between 7pm and 8pm except for a sudden spike of 5400 requests per second between 7:40 and 7:45, you would see that on a graph showing 5 minute averages. However, on a graph zoomed out far enough to show only 20 minute averages, you’d see a deceptively small spike of about 3400 rps at that time period. As long as you can zoom in on the time series, it can be an acceptable compromise to reduce the data volume down to something consumable by a mere human being. Then the obvious question is: when does this go horribly wrong?


Let’s look at something like web page load times. If you run a synthetic transaction, always from the same location, you can track measurements in that single dimension. Things should be somewhat consistent and these numbers are useful. However, they do not tell you how fast your site is. Only your users know that. Interestingly, since your users access your web site, you can actually have them report that information back to you. In fact, this is how most web analytics systems work. The interesting part here is that you have a wide variety of data coming in representing a distribution of perceived load times. Some people load your pages quickly and others load them slowly. That’s the nature of the Internet: inconsistency. The key is that they don’t “trend” as a single datapoint that is the average of all.

The inconsistency in these data is interesting: it can be leveraged for improvements and advantage. Understanding (and eventually changing) the distribution of these data can radically change your business. There have been many articles written about web page load times, so in order to keep this fresh, I’ll discuss database transactions. The reason I’m jumping around here is because data are just data — this applies to every metric you can observe.

Understanding that your average database query takes 1.92ms to complete is, I’m sorry to say, useless. The problem is that you are likely running thousands or tens of thousands of queries per second and none of them are average. To illustrate this, here are three (contrived) database query latency histograms each of 39 samples.


The interesting (and perhaps deceptive) part is that all three have an average latency across all queries of 1.92ms. Quite clearly, all depict radically different situations. The truth is, when you have a lot of data (thousands to hundreds of thousands of data points), the histogram reveals the information you seek and the average hides it.

Why is this so interesting? In computing, there are a lot of things we can witness by actively measuring them; this is what the Circonus you know and love has done. We figured it was time to change the game a bit and help you visualize, in real-time, the things that happen in your business: enter BizEKG.

BizEKG allows you to analyze events (like webpage loads, database queries, customer service telephone calls, etc.). Not just some, not just a sample, but all the events. From there, you can break them apart, run statistical analysis (including histograms, of course) and understand your data. There are a handful of real-time web analytics companies out there, but answering these questions in “Circonus style” changes the game entirely. What’s Circonus style?

We at Circonus believe that all data are important, not just web data. We believe that if you can’t see what’s happening right now, you are as good as blind. So take this real-time, multi-dimensional statistical analysis engine, feed it any data you want, and see it all in real-time.

With our snazzy new BizEKG service you can actually do what some might consider a sufficient level of black magic. You can decompose these events in realtime and visualize these histograms in realtime. Not only is this pretty cool… it’s pretty damn enlightening. BizEKG is a new service we’ve launched and deserves its own announcement, we’ll get to that soon.

The above histogram show the last 60 seconds of page load times of a subsection of a current Alexa top 1000 site in milliseconds. Yes, 10,000ms is 10 seconds of page load time. Even on today’s Internet, loading a complex site over wireless from another country is… slow.

A Lotta Love for Keyboard Users

All web users who bemoan the general lack of support for keyboard accessibility in web apps, take heart! Circonus has some great features for keyboard lovers. We know there are many web users out there for whom keyboard shortcuts are a quicker and easier way to use applications, particularly web apps. This is especially true if you use a specific app heavily, or are a full-time computer user in general.

Anywhere in Circonus, you can always see the keyboard help screen by typing “?” so you’ll have an ever-present “cheat sheet” as you learn the shortcuts. As soon as new keyboard functionality is added, the keyboard help screen will be updated immediately to reflect the new shortcuts, thanks to the magic of continuous deployment.

Jump Navigation

To jump to a particular section in Circonus, all you have to do is type the proper keyword and you will jump there immediately. For example, type the keyword “dash” (d-a-s-h) and you will jump to the current account’s dashboard. It’s that easy! Here’s a list of the current jump keywords:

  • “dash” (jump to the dashboard)
  • “alerts” (jump to the fault dashboard)
  • “rules” (jump to rules)
  • “checks” (jump to checks)
  • “metrics” (jump to metrics)
  • “trends” (jump to the trending dashboard)
  • “graphs” (jump to graphs)
  • “worksheets” (jump to worksheets)

The shortcut for opening the feedback dialog also works the same way: simply type “feedback” and the feedback dialog will open for you. Another quick shortcut is the forward-slash (/), which focuses on any search field that may be on the page.

Graph & Worksheet Shortcuts

Here’s where we get to the good stuff. We’ve added some great shortcuts to work with graphs and an enhanced zoom tool which is only available via keyboard shortcuts.

To start off, you can now see the legend on any thumbnail graph view (on “My Graphs,” “Trending Dashboard,” and all worksheets) not only on the large graph views as before. To do so, simply hold down the shift key, and the legend will appear for whichever graph you’re hovering over. On a worksheet, the shift key also inverts the legend hover option. So if you have enabled the new worksheet option to show legends upon graph hovering, holding down the shift key will disable the hovering legends.


Back in January we launched an enhanced graph zoom toolbar that relies on keyboard shortcuts. Normally the zoom toolbar is labeled “Past” because its buttons will set the graph zoom level to view data from the past one week, two weeks, etc. However, if you hold down either comma or period, the zoom tool will be enhanced and the label will change to “shift.” You will also see an orange bar at the end of the graph(s) which indicates the end that will be shifted (and if you hold both keys, you will get two orange bars, indicating that you can pan the entire graph date range into the past or the future). While holding one or both keys, click one of the new arrow icons that appear inside the “shift” buttons—the graph date(s) will be shifted by the specified amount in the specified direction. Not only does this work when viewing or editing graphs, it works almost everywhere there is one or more graphs, whether large or thumbnail sized.

One last set of useful shortcuts applies when viewing a worksheet. Among the newly added worksheet options is the ability to resize worksheet graphs to one of three sizes. In addition to being able to do this by clicking the buttons in the worksheet options dialog, you can instantly change the size of your worksheet graphs by pressing alt+1, alt+2, or alt+3.

Being avid keyboard users ourselves, we are excited to build keyboard support into more areas of Circonus as we are able to do so. Keep watching for more keyboard info and if you have ideas for some useful shortcuts, please let us know!

Lost In Translation

For more than ten years, OmniTI has been making large-scale critical Internet infrastructure work. It is, obviously, not black magic or voodoo. Perhaps not so obviously, it is not technical competence that leads to success here. I like to think our team has technical competence in spades as we have an impeccable track record, authored books and a laundry list of speaking engagements to justify it. However, technical competence alone would fall short of the mark— far short.

Without exception, it is expected that proper monitoring and trending are as much a part of the process as setting up networking, backups, and more recently, change management. And yet, when you ask someone to explain why monitoring and trending were vital, you’d be lucky to get a response other than “to be sure things are working”. Something here is lost in translation.

Disconnected Viewpoints

Every business owner knows that watching the books is part of the job. You need to know P&L, you need to understand the outputs and costs of your various business units and you track efficiencies everywhere. All of these metrics play a part in both strategic and tactical decisions made every day. Each business unit reports these things and while in good organizations each manager knows what is important to each other manager, something is still lost in translation. Far too often, managers don’t understand that what they produce, what they consume and how they work changes the game for other business units. While the word is overused and abused, every business is an ecosystem. It is obvious that a new marketing campaign will increase resource utilization on the sales teams. It should be obvious that a new marketing campaign will increase resource utilization on IT infrastructure as well.

Every systems administrator knows (or should know) that monitoring your architecture is fundamental. On the other hand, very few can explain in any detail why this is so important. “Because you lose money when systems are offline”, they’ll quote disparagingly. Ask how much and you might catch them at a loss. From my own experience in operations, as well as countless conversations with customers and vendors, very few individuals recognize the relationship between IT and Business. Systems people know that they have to keep systems and services running to support their business, but rarely do they understand that relationship completely.

Owners that foster a transparent and cohesive organization around key performance indicators in every business unit (even those that are cost centers) will change their organizations in two critically useful ways:

  • Efficiencies between business units. With increased transparency, staff in all positions will see the effects of their actions across the business as a whole. This produces an atmosphere of self-reinforcing efficiency.
  • Accountability to the overall business. The hokey old question: “Is what you’re doing good for the company?” changes form. With increased cohesiveness, the answer to that question is a more obvious outcome to every action and no one can call it hokey, because it is always answered without being asked.

A Call To Arms

Technology is no longer underneath the products you sell and the process in which you deliver them. It is, for at least the immediate future, intertwined. Creativity on the technology side doesn’t only deliver cost savings, it creates new audiences and increases interaction with your customers. You have to do more than embrace technology, you need to leverage it and let new opportunities catapult your business forward.

As intertwined as technology is, we can no longer afford to have its operational details hidden away in the bowels of the “tech ops” or “web ops” group. We need visibility and we need cohesion. Infrastructure/application engineering and other business units are now, more than ever before, on the same team marching towards success. Communication and accountability are critical to success.

Here is where I leave you and hope that you will think about the metrics you monitor in a different light. They represent something more. They are there to make the business run, increase shareholder value, make your customers happier and more prosperous.

Past Performance: does this look right to you?

If you are like me, you look at a lot of data. I look at data in spreadsheets, I look at data on P&L statements, I look at term sheets, I look at systems data — a lot of systems data. I find the best way to look at data is to visualize it because it is the fastest way to get data into the amazing pattern matcher that is the human brain.

The human brain is quite good at saying “this is abnormal” and can usually even articulate why. This curve has a periodicity, that one a monotonic behavior, another is simply always flat… then they “change.” When we say “this visualization looks wrong,” we are almost always onto something real in the numbers. I’ll give you a simple visual example:


While there is obviously something starting at 8pm, we are only left with another question: “is it out of the ordinary?” It doesn’t look like that today, and it doesn’t appear to resemble the day before. What about last week? Let’s start the graph one week earlier:


This tells us a lot. It looks like we have a very similar event last week at this time. With most analysis tools, you stop here (or you hover with you mouse and try to correlate start/end times and magnitude to better understand how these two events resemble each other).

With Circonus, we don’t leave it here. Instead, we provide tools to help compare time separated events using our data overlay feature. We can take our original two-day view and overlay the data from last week right on top of (or in this case underneath).


Just two clicks and we’ve got a one-week offset data overlay and the visualization lends a little insight into what is going on. We can see the start times are identical, but the event from this week ends about 30 minutes before the one from last week — largely the same though.


Again, we find that visuals help. Understanding how these graph differ even when they are right on top of each other can be a bit challenging. Never fear! We’ve added help in the legend.


The legend takes on some new features when data overlays are in use. You now get a very clear, side-by-side read-out of the data in the graph including percentage differences. Additionally, the arrows that say “you’re higher than you were last week” become more saturated (redder) as the different in the data increases and fade to light grey if the two values are more similar. This makes it simple to quickly understand how current performance really compares to past performance. So, the interesting part of this graph is actually the subsequent spike of inbound traffic this is up 95% over last week. That’s something to look into.

Capacity Planning Made Easy

Okay, so capacity planning will never be fool proof. You simply cannot predict the future. However, some of the time you have a darn good idea of what the future will hold. Since someone knows what is likely to happen, why is it so hard to plan marketing initiatives, funnels and IT provisioning?

The reason is that things aren’t always linearly correlated. What’s that mean? Linear correlation goes something like this: if A depends upon B and I want twice as much A, I’ll need twice as much B. While correlating non-linear systems can be tricky, a lot can be done with linear regressions. The problem with any regression is that you need to put real numbers in, get real numbers out and understand how good they are.

When we look at how something grows, one of the most common tools in the statistics arsenal is a least-squared linear regression. That is: given a set of datapoints, what line best fits them? So, let’s say we have a lot of datapoints (boy do we have a lot of datapoints!). Now what does a linear regression tell us?

Let’s assume we’re looking at some traffic data over the month of December.


In this graph, it can be very hard to answer questions about the nature of the data. Two common questions are:

  1. are we growing or shrinking and by how much?
  2. if we stay on the current growth path, where will we be some point in the future?

Enter the linear regression:



Answering the first question is pretty simple now. We can look at the value on the left side of the graph, and the right side of the graph and do the math. You can’t see it in the screenshot, but the left the values are 5.49M and 5.88M which is roughly a 6.6% growth over 4 weeks. Now, any statistician will scream bloody murder about confidences in the data and model and any engineer will simply ask: “does that make sense?” Maybe we’ll look over 8 weeks and twelve weeks also to make sure that we build our confidence (this can be easier, though far less scientific, than understanding R2 values – which are, of course, available as well). Honestly, I personally find that reconciling this with my expectations is one of the better methods of trusting the model.

Let’s assume that we we expected some increase in resource usage during this time frame and that 6% is reasonable. Now onto the next question: where will we be in the future. In Circonus, we just jump up and extend our view window out one year and we can see what our model looks like in the future:


Next December we’ll be using 10.91M (this just happens to be MBits/s of network bandwidth to serve origin dynamic content on one of the sites managed over at OmniTI). We’ll revisit this month by month to ensure that we are indeed heading where we expected. It allows engineers and marketers and executives alike put real numbers into (what we call) napkin math which adds peace, clarity and allows most people to do easier what-if pontification. I can tell you one thing… we sleep better at night knowing specific numbers about a probable future.

Enterprise Agents

If you’re like me, your first response to SaaS monitoring was: “You can’t see my machines/services/metrics from your cloud. That won’t be too useful.” With a little bit of thought, it’s pretty easy to arrive at the conclusion that you must run something on your infrastructure to bridge the divide. It was a fun and exciting project here to build that magic something called the Circonus Enterprise Agent.

circ-ea-login

The Circonus Enterprise Agent (we’ll call it our “EA” from here on) is all of our magic monitoring software bundled into a maintainable VMWare virtual appliance than can be run on your internal networks to track stuff that the public shouldn’t be seeing. We had some interesting choices to make during development and I thought I’d share what they were and why we made them.

Choosing a platform

Most of our internal infrastructure runs on some variant of OpenSolaris technology. We chose this for a variety of reasons. Most importantly, storing your precious data on ZFS seemed like the right thing to do. After that, the fault management architecture (FMA) available in OpenSolaris allows us to keep our machines and services running more reliably. Reliability and data permanence are the two most important factors in technology selection here at Circonus (a fact our customers respect).

So, with all this talk about OpenSolaris and its advantages you’d imagine we built our EA on the same technology, right? Not so simple. For a virtual appliance image that is easy to administer and easy to upgrade in the field you need a good package management system. OpenSolaris simply falls on its face there. Oracle’s promises of IPS (the new and coming package management system for Solaris 11) are quite compelling, but that is just a promise today. Instead, we turned to the tried and true CentOS Linux-based platform for our EA.

CentOS provides all the features we need to run our agent software, manage package upgrades and distribution seamlessly and simply, and the core operating system is both stable and secure. In an interesting later development, we provide Joyent customers the ability to run an EA on one of their Joyent SmartMachines. Joyent’s operating architecture is actually derived from OpenSolaris — so we ended up porting our EA back to our core platform as well.

Today, the EA is available in two forms: a CentOS 5 VMWare-based appliance and a Joyent SmartMachine.

From where do you manage the appliance

circ-ea-manage-thumb

While most appliances have a web console that allows a variety of management tasks, we made a simple choice to have the appliance administrable via the main circonus.com web application. This is where Circonus users interface with all their data and set up their monitors, sp it only made sense to also administer their EA from the same place.

After using the system for a while now, I can say that I’m really pleased with this decision. The cohesiveness of scheduling checks on either your private EA and/or the world-wide Circonus agents through the same check creation interface is a simple pleasure. One single world-wide view of all the agents on which you can schedule checks makes it simple to understand how the monitoring system works.

What to automate

Generally speaking, when you think appliance, you think self-maintaining. That’s not an unreasonable expectation. However, this directly conflicts with our experience in operations. In operations, automatic upgrades of software are strictly taboo. Typically, the operations crew wants to schedule precisely when an upgrade will occur, be present and have a bulletproof evaluation and rollback plan. When you start talking about critical infrastructure like monitoring, “typically” becomes “always.”

With this in mind, we made the upgrade process on the EA completely automated, but not automatic. One click and the appliance will self-upgrade. Currently, this is the only ongoing task that is done from the appliance itself (rather than the circonus.com portal), but we’re looking to make some nice enhancements there as well. Soon, you’ll be able to trigger remote EA upgrades directly from the web application.

What you get

With an EA you get to leverage the power of Circonus against all of your private data. Networks, systems, applications and business systems that are only accessible via internal infrastructure can be monitored via an Enterprise Agent. The data is fed back to the Circonus cloud in real-time. All of that data can be alerted on, and is available for correlation, trending and planning purposes through the excellent Circonus tools you already know and love.

Finding Needles in a Worksheet

Traditional graphing tools can help you plan for growth or even narrow down root causes after a failure. But they’ have a reputation for being difficult to setup, navigate or customize. It’s nice to be able to just point Cacti at some switches or routers and have it gracefully poll each device for SNMP data. Yet when you need a custom perspective of the data (or collections of data), it can be an arduous experience setting up templates and graphs.

When we started to engineer Reconnoiter into a SaaS offering, one of the major driving forces was a desire to not suck like the others. Like you, we don’t understand why it has to be so damn hard (or require a dedicated IT staff) to take a handful of data points and correlate them into graphs that make sense of the noise. I like to think we’ve been successful. Customers have been overwhelmingly positive about our efforts, calling it “a graph nerd’s paradise”. Even still, we eat our own dog food and are constantly revisiting the service to look for better ways to get our work done. This is why we’re working hard on upcoming features like Graph Overlays and Timeline Annotations. And it’s also why we made recent changes to the workflow for graphs and worksheets.

If you’re a Circonus user, you already know how easy it is to create and view graphs. Adding them to worksheets gives you a page full of data to compare and relate. Choose a zoom preset (2 days, 2 weeks, etc) or select a date range, and all of the thumbnails are instantly redrawn in unison. It might sound basic, but it can be very useful if you’re not sure what you’re looking for. Unexpected patterns jump out at you pretty quickly.


However, most of the time you want to work with a single graph. Clicking on a thumbnail previously loaded a graph in “lightbox” view, hiding all other graphs from sight and letting you focus on the work at hand. This worked well most of the time, but had one big drawback… you couldn’t (easily) bookmark it. So we’ve moved the default view into its own page, sans lightbox, that can be bookmarked and shared with others. Miss the lightbox view? No worries, we’ve kept that as the new preview mode. Try it out in a worksheet for “flickr-style” navigation.

Here’s a short video I threw together to demonstrate some of these changes. There was some audio lag introduced by the YouTube processing, but it should be easy enough to follow along. If you’d like to see more examples like this one, shoot us an email and we’ll try to keep them coming.

Access Tokens with the Circonus API

When we rolled out our initial API months ago, we took a first stab at getting the most useful features exposed to help customers get up to speed with the service. A handful of our users expressed displeasure with having to use their login credentials for basic access to the management API. Starting today, we’re pleased to announce support for access tokens within the Circonus API.

Tokens offer fine-grained access for each user to a specific service account, at your permission role or lower. For example, if Bob is a normal user on the Acme Inc. account, he can create tokens allowing normal or read-only access. Multiple applications can use the same token, but each application has to be approved by Bob in the token management page, diabolically named My Tokens. To get started, browse over to this page inside your user profile, select your account from the drop-down and click the “plus tab” to create your first token.


The first time you try to connect with a new application using your token, the API service will hand back a HTTP/1.1 401 Authorization Required. When you visit the My Tokens page again you’ll see a button to approve the new application-token request. Once this has been approved you’ll be able to connect to the API with your new application-token.


Using the token is even easier. Just pass the token as X-Circonus-Auth-Token and your application name as X-Circonus-App-Name in your request headers. Here’s a basic example using curl from the command-line:

$ curl -H "X-Circonus-Auth-Token: ec45e8a2-d6d9-624c-c21c-a83f573731c1" 
       -H "X-Circonus-App-Name: testapp" 
           https://api.circonus.com/api/json/list_accounts

[{
   "account":"social_networks",
   "account_description":"Monitoring for The Social Network.",
   "account_name":"Social Networks"
   "circonus_metric_limit":500,
   "circonus_metrics_used":124,
}]

One of the more convenient features with our tokens is how well they integrate with user roles. A token will never have higher access permissions than its owner. In fact, if you lower a user’s role on your account, their tokens automatically reflect this as well. Changing a “normal” user to “read-only” will render their tokens the same access level. But if you restore their original role, the token will also have its original privileges restored. Secure and convenient.

If you have any questions about our new API tokens or would like to see more examples with the Circonus API, drop us a line at hello@circonus.com.

Annotating Alerts and Recoveries

In the last couple of posts, Brian introduced our new WebHook notifications feature and I demonstrated how Circonus can graph text metrics for Visualizing Regressions. Both of these features are interesting enough on their own, but let’s not stop there. Today I have an easy demonstration showing how you can re-import your alert information to your trends. The end goal is an annotation on our graph that can be used to help identify, at a glance, which alert(s) correspond with anomalies on your graphs.

First, let’s set a WebHook Notification in our Circonus account profile. Choose the contact group that it should belong to, or create a new contact group specifically for this exercise. Type the URL where you want to POST your alert details in the custom contact field and hit enter to save the new contact.


Now we need something for our webhook to act as a recipient. For this example I have a simple Perl CGI script that listens for the POST notification, parses the contents, and writes out Circonus-compatible XML. It doesn’t matter which language you use, as long as you can extract the necessary information and write it back out in the correct XML format (Resmon DTD).

#!/usr/bin/perl
#
# alert.cgi

use strict;

my $cgi = CGI->new;
my $template = HTML::Template->new(
  filename => 'resmon.tmpl',
  die_on_bad_params => 0
);

# check for existence of alerts from webhook POST
if ($cgi->param('alert_id')) {

  # open XML output for writing
  open (OUT, ">/path/to/alert.xml") || 
    die "unable to write to file: $!";

  # loop through alerts
  for my $alert_id ($cgi->param('alert_id')) {

    # check for valid alert id format
    if ($alert_id =~ /^d+$/) {

      # craft our XML content
      $template->param(
        last_update => time,
         alert_id => $alert_id,
         account_name => $cgi->param('account_name'),
         check_name => $cgi->param("check_name_${alert_id}"),
         metric_name => $cgi->param("metric_name_${alert_id}"),
         agent => $cgi->param("agent_${alert_id}"),
         severity => $cgi->param("severity_${alert_id}"),
         alert_url => $cgi->param("alert_url_${alert_id}"),
      );

      # only print RECOVERY if available
      if ($cgi->param("clear_time_${alert_id}")) {
        $template->param(
          clear_time => $cgi->param("clear_time_${alert_id}"),
          clear_value => $cgi->param("clear_value_${alert_id}"),
        );

      # otherwise print ALERT details
      } else {
        $template->param(
          alert_time => $cgi->param("alert_time_${alert_id}"),
          alert_value => $cgi->param("alert_value_${alert_id}"),
        );
      }
    }
  }

  print OUT $template->output;
}

close (OUT);

Here is the template file used for the XML output.

<!-- resmon.tmpl -->
<ResmonResults>
  <ResmonResult module="ALERT" service="aarp_web">
    <last_runtime_seconds>0.000238</last_runtime_seconds>
    <last_update><TMPL_VAR name="last_update"></last_update>
    <metric name="account_name" type="s">
      <TMPL_VAR name="account_name">
    </metric>
    <metric name="alert_id" type="s">
      <TMPL_VAR name="alert_id">
    </metric>
  <TMPL_IF name="alert_value">
    <metric name="message" type="s">
      <TMPL_VAR name="check_name">`<TMPL_VAR name="metric_name"> 
      alerted <TMPL_VAR name="alert_value"> from <TMPL_VAR name="agent">
      at <TMPL_VAR name="alert_time"> (sev <TMPL_VAR name="severity">)
    </metric>
  </TMPL_IF>
  <TMPL_IF name="clear_value">
    <metric name="message" type="s">
      <TMPL_VAR name="check_name">`<TMPL_VAR name="metric_name"> 
      cleared <TMPL_VAR name="clear_value"> from <TMPL_VAR name="agent"> 
      at <TMPL_VAR name="clear_time"> (sev <TMPL_VAR name="severity">)
    </metric>
  </TMPL_IF>
    <metric name="alert_url" type="s">
      <TMPL_VAR name="alert_url">
    </metric>
  </ResmonResult>
</ResmonResults>

When everything is running live, the alert.cgi script will accept webhook POST notifications from Circonus and write the alert details out to /path/to/alert.xml. This file should be available over HTTP so that we can import it back into Circonus using the Resmon check. Once you’ve begun capturing this data you can add it to any graph, just like any other metric.

This might take you 30 minutes to setup the first time. But once you have it, this data can be really useful for troubleshooting or Root Cause Analysis. We plan to add native support for alert annotations within Circonus over the next few months, but this is a handy workaround to have until then.