Monitoring your Vitals During the Critical Holiday Retail Season

As with Brick & Mortar stores, the Holiday season is a critical time for many E-Commerce sites. Like their off-line brethren, these sites also see increases in both traffic and revenue, sometimes substantial ones. Of course these changes in user behavior don’t just affect E-Commerce sites. Consider a social-networking site like Foursquare, where a person might normally check into 3 or 4 places a week; during the Holiday season that might double as they visit more stores and end up eating out more often while rushing between those stores. On an individual basis it doesn’t sound that significant, but if a large percentage of your user base doubles their traffic, you had better hope you have planned accordingly.

On the technical side, many sites will actually change their regular development process in order to handle these changes in user behavior. Starting early in November, many sites will stop rolling out new features and halt large projects that might be disruptive to the site or the underlying infrastructure. As focus shifts away from features, it most often turns back towards infrastructure and optimization. Adding new monitoring, from improved logging to new metrics and graphs, becomes critical as you seek a comprehensive view of your site’s operations, so that you can better understand the changes in traffic that are happening and, hopefully, be proactive about solving problems before they turn into outages.

Profiling and optimization work also receives more attention during this time; studies continue to show that faster page loads and better website responsiveness correlate with increased revenue, and these areas can typically be improved without changing how things behave. Bug fixes are also a popular target during these times, as corner cases are more likely to show up as traffic increases, especially if you tend to see new users as well as increased use by existing users.

This brings us to a good question: just what are you monitoring? For most shops there tend to be standard graphs that get generated for things like disk space or memory usage. These things are good to have, but they only scratch the surface. Your operations staff probably knows all kinds of metrics about the systems they need to monitor, but how about your application developers? They should know the code that runs your site inside and out, so challenge them to find key metrics in your application stack that are important for their work. Maybe that’s messages delivered to a queuing system, or the time it takes to process the shipping costs module, or the responsiveness of a 3rd party API like Facebook or Twitter. But don’t stop there; everyone in your company should be asking themselves, “What analytics could I use to make better informed decisions?” For example, do you know if your increased traffic is due to new users or existing users? If you are monitoring new user sign ups, this will start to give you some insight. If you are doing E-Commerce, you should also be tracking revenue related numbers. Those types of monitors are more business focused but they are critical to everyone at your company. So much so that at Etsy, a top 100 website commonly known as “the world’s handmade marketplace”, they project these types of metrics right out in public.
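As a minimal sketch of what an application-level metric could look like, the Python below times a block of code and records the duration under a metric name. The names `timed`, `METRICS`, and `shipping_cost.seconds` are hypothetical, the pricing rule is made up, and a real setup would ship the values to a monitoring system rather than keep them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: metric name -> list of observed durations (seconds).
METRICS = defaultdict(list)


@contextmanager
def timed(metric_name):
    """Record how long the wrapped block takes under metric_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failures are timed too.
        METRICS[metric_name].append(time.perf_counter() - start)


def shipping_cost(weight_kg):
    """Stand-in for a real shipping costs module (made-up pricing rule)."""
    with timed("shipping_cost.seconds"):
        return round(4.0 + 1.5 * weight_kg, 2)
```

Each call to `shipping_cost` appends one duration to `METRICS["shipping_cost.seconds"]`, which is the kind of series a developer could then graph or alert on.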

Ideally, once you have this type of information being logged, you can collect it for analytical reports and historical trending via graphs. You want to be able to take the data you are collecting and correlate between metrics. Say that, given a 10% increase in new users in the past week, we’ve seen a 15% spike in web server traffic. If we project those numbers out, can we make it through Black Friday? Cyber Monday? Will we make it all the way to New Year’s, or do we need to start provisioning new machines *NOW*? Or what happens if our business model changes, and we are required to live through a “Black Friday” event every day? Those are the kinds of challenges that social shopping site Gilt faces, with its daily turnover of inventory. It’s worth saying that you won’t need all of this information in real time, but ideally you’ll be able to get a mix of real-time data, near-time data (5-minute aggregates are common), and daily analytical reports. Additionally, you should talk with your operations staff about which of these metrics are mission critical enough that you should be alerting on them, to make sure you have the appropriate operational and organizational focus.
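To make the "project those numbers out" question concrete, here is a rough Python sketch. It assumes traffic compounds at a steady weekly rate, which real holiday traffic rarely does, and the function name and the load and capacity figures in the usage note are invented for illustration.

```python
def weeks_until_capacity(current_load, weekly_growth, capacity):
    """Return how many weeks of growth it takes for load to exceed capacity.

    Assumes load compounds by weekly_growth each week (0.15 means +15%).
    Returns 0 if the load is already over capacity.
    """
    if weekly_growth <= 0:
        raise ValueError("weekly_growth must be positive for this projection")
    weeks = 0
    load = current_load
    while load <= capacity:
        load *= 1 + weekly_growth
        weeks += 1
    return weeks
```

For example, `weeks_until_capacity(3000, 0.15, 5000)` returns 4: a site serving 3000 requests per second and growing 15% a week would pass a 5000 rps ceiling in the fourth week, which tells you whether provisioning can wait until after Black Friday.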

While nothing beats preparation, even the best laid plans need good feedback loops to be successful. Measuring, collecting, analyzing, and acting upon data as it comes into your organization is critical in today’s online environments. You may not be able to predict the future, but having solid monitoring systems in place will help you to recognize problems before they become critical, and help give you a “snowball’s chance” during the holiday season.

Template Web UI

Back in October we released the first version of our new Templating API, allowing you to easily replicate sets of bundles across multiple hosts. Now we bring you the time-saving sweetness of Templates in the web interface as well; if you have multiple servers that you want to monitor in exactly the same way, Templates are your friend. The idea behind them is pretty simple: you choose your master host, and select one or more of its check bundles to be used as master bundles. Then when you select your target hosts, the master bundles are copied and applied to the target hosts.

Creating A Template

So let’s look at how the Templating process works. Before you create a template, you first need to ensure that you have your master check bundles set up and active on your master host. Once that’s the case, start by going to the new “data” section of Circonus and visiting the “Templates” tab at the left. Create a new template via the “+” tab (or the “Create A Template” button in the middle of the page if you have no templates yet). In the resulting dialog, type a name for your template and choose a master host, and when you click “OK” you will see the templates table appear with a row for your new template (as usual, click the summary row to view the expanded details of the template).

When you first create a template, it’s in “draft” mode. This means that it’s only saved in your browser’s memory until you apply it. Nothing has been saved to the system yet, and the master bundles haven’t been replicated. This allows you to lay out templates and modify them or discard them before making any changes to the system. If you wish to save changes to a draft you may do so via the “Save” button; the draft is not applied as a regular Template until you click the “Apply” button. To aid in visually scanning the list of Templates for drafts, drafts will always appear at the top of your list, and will always be green. (If at any point you wish to change your template name or master host, you may click them in the summary row to edit them in-place. Please note: when changing your master host, you may only choose among the target hosts currently saved in the Template.)

Choosing Bundles

Once you’ve created your draft, you need to choose your master check bundles. Under the “Check Bundles” section at the left, click “Add Bundle” to bring up the new bundle dialog. All the bundles available for your master host will be shown here in a scrollable list. This is a selectable list, so when you select a bundle, it’s shown as selected in the list until you remove it from the Template. If you have a long list of bundles and are having a difficult time finding the ones you want, you may use the field above the list to filter the shown bundles by a filter string or regular expression (if you’re using a regular expression, don’t include the leading and trailing slashes, just use the desired RegEx syntax). After you have chosen a bundle, you may change its name by clicking on it in the list of chosen bundles. (Please Note: the reserved string “{target}” will be replaced by the current hostname/IP as the bundle is replicated across the target hosts.)
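The filtering and naming behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Circonus implementation; the function names and the sample bundle names are hypothetical.

```python
import re


def filter_bundles(bundle_names, pattern):
    """Keep the bundle names that match pattern, treated as a regular expression.

    The pattern is written without leading and trailing slashes; a plain
    filter string such as "ping" works as well.
    """
    regex = re.compile(pattern)
    return [name for name in bundle_names if regex.search(name)]


def name_for_target(bundle_name, target_host):
    """Replace the reserved string {target} with the target hostname or IP."""
    return bundle_name.replace("{target}", target_host)
```

With the made-up names `["HTTP {target}", "ping {target}", "disk usage"]`, the pattern `^(HTTP|ping)` keeps the first two, and `name_for_target("HTTP {target}", "10.0.0.5")` yields `"HTTP 10.0.0.5"`.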

Choosing Hosts

Choosing your target hosts works mostly the same way as choosing the master check bundles. The “Add Host” button brings up a dialog with a scrollable, selectable, filterable list of available hosts on your account, and you may choose one or more of those hosts. There is an additional feature, however, which is the “Enter a new host” field below the list. This allows you to enter new hosts (either IP addresses or domain names are acceptable) that aren’t currently used on your account. When you enter a new host and hit return/enter, the new host will be subject to a DNS check to ensure that it really exists; if it passes the DNS check, it will then be added to your list of target hosts.
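The details of the DNS check are not described here, but the general idea can be sketched with Python's standard library. The function name `host_resolves` is hypothetical, and the assumption is that "really exists" means the name resolves to at least one address.

```python
import socket


def host_resolves(host):
    """Return True if host is a literal IP address or resolves through DNS."""
    try:
        socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name cannot be encoded.
        return False
    return True
```

A literal address such as `127.0.0.1` passes without a lookup, while a name that does not resolve is rejected before it reaches the list of target hosts.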

Once you’re satisfied with your bundle and host choices, clicking the “Apply” button will replicate your master bundles across each target host and will save the template in the database.

Modifying A Template

Once your Template is saved you will see several things change in the details panel. Each bundle and host will get checkboxes, and two “Action” dropdown selects will appear, one above the check bundles list and one above the target hosts list. Now that the bundles and hosts are a part of the template, if you wish to modify or remove them, you will need to check their checkboxes and choose an action from the appropriate dropdown before saving. There are four actions available:

When used on a bundle, it will delete the target bundles and remove them from the Template. When used on a host, it will delete the host’s bundles and show the host as inactive in the host list.
When used on a bundle, it will leave the target bundles in place but will break their synchronization with the template and show them as inactive in the bundle list. When used on a host, it will leave the host’s bundles in place but will break their synchronization with the template and show the host as inactive in the host list.
When used on a bundle, it will deactivate the target bundles and show them as inactive in the bundle list. When used on a host, it will deactivate the host’s bundles and show the host as inactive in the host list.
When used on a bundle, it will reactivate, rebind, or recreate target bundles as necessary, to restore them to active status and synchronization with the template. When used on a host, it will reactivate, rebind, or recreate the host’s bundles as necessary, to restore them to active status and synchronization with the template.

Staying In-Sync

After creating and applying a Template, you are still allowed to edit the master check bundles. If you do so, any Templates using those check bundles as master bundles will be out-of-sync. When you go to your Templates page, the out-of-sync Templates will have their sync buttons activated and the buttons will say “Re-Sync.” Simply click the “Re-Sync” button to replicate the bundle changes across all the target bundles, and the Template will be in-sync again.

(Please Note: if at any point you wish to delete the template, any active bundles that are still a part of the template will be deleted from the target hosts. If you wish to keep the bundles on the target hosts but just delete the template, you will need to unbind all the bundles you wish to keep on the target hosts and then delete the template.)

Template API

Setting up a monitoring system can be a lot of work, especially if you are a large corporation with hundreds or thousands of hosts. Regardless of the size of your business, it still takes time to figure out what you want to monitor, how you are going to get at the data, and then to start collecting, but in the end it is very rewarding to know you have insight.

When we launched Circonus, we had an API to do nearly everything that could be done via the web UI (within reason) and expected it to make it easy for people to program against and get their monitoring off the ground quickly. Quite a few customers did just that, but still wanted an easier way to get started.

Today we are releasing the first version of our templating API to help you get going (templating will also be available via the web UI in the near future). With this new API you can create a service template by choosing a host and a group of check bundles as “masters.” Then you simply attach new hosts to the template, and the checks are created for you and deployed on the agents. Check out the documentation for full details.

Once a check is associated with a template, it cannot be changed on its own; you must alter the master check first and then re-sync the template. To re-sync, you just need to GET the current template definition and then POST it back; the system will take care of it from there.

To remove bundles or hosts, just remove them from the JSON payload before POSTing, and choose a removal method. Likewise, to add a host or bundle back to a template, just add it into the payload and then POST. We offer a few different removal and reactivation methods to make it easy to keep or remove your data and to start collecting it again. These methods are documented in the notes section of the documentation.

Future plans for templates include syncing rules across checks and adding templated graphs so that adding a new host will automatically add the appropriate metrics to a graph. Keep an eye on our change log for enhancements.

One Dashboard to Rule Them All


Ever dream of having a systems monitoring dashboard that was actually useful? One where you could move things around, resize them, and even choose what information you wanted to display? Large enterprise software packages may have decent dashboards, but what if you’re not a large enterprise or you don’t want to pay an arm and a leg for bloatware? Perhaps you have a good dashboard that came with a specific server or piece of hardware, but it’s narrowly-focused and inflexible. You’ve probably thought about (or even tried) creating your own dashboard, but it’s a significant undertaking that’s not for the faint-of-heart. What’s the solution? Should we just learn to live with sub-optimal monitoring tools?

Here at Circonus, we decided that this was one problem we could eliminate. Since we’ve built a SaaS offering that’s flexible enough to handle multiple different data sources, why shouldn’t we build a dashboard that’s flexible enough to display them? So we created a configurable dashboard that lets you monitor your data however you want. Do you want to show graphs side-by-side but at different sizes? Done. Want an up-to-date list of alerts beside those graphs? Easy. How about some real-time metric charts that automatically refresh? No problem. Our new configurable dashboards allow you to add all these items and more. Let’s dig in and see how these new dashboards work.

Dashboard Basics

Start by going to the standard ‘Dashboard’ and clicking the new ‘My Dashboards’ tab. These dashboards are truly yours; any dashboards you create are only visible to you (by default) and are segregated by account. If you want to share a custom dashboard with everyone else on an account, check that dashboard’s ‘share’ checkbox in your list of custom dashboards.

After you have created a custom dashboard, you may set it to be your default dashboard by using the radio buttons down the left side of your custom dashboards list. If you do this, you will be greeted with your selected dashboard when you log in to Circonus. By selecting the ‘Standard Circonus Dashboard’ as your default dashboard, you will revert to being greeted with the old dashboard you’re already used to seeing.

To create a new custom dashboard, click the ‘+’ tab and choose a layout. At first you will see only a couple predefined layouts available, but after you create a dashboard, its layout will then be available to choose when creating other new dashboards.

Now a note about working with these dashboards: every action auto-saves so you never have to worry about losing changes you’ve made. However, if you haven’t given your dashboard a title, the dashboard isn’t permanently saved yet. If you forget to title your dashboard and go off to do other things, don’t worry, the dashboard you created is saved in your browser’s memory. All you have to do is visit the ‘My Dashboards’ page and your dashboard will be listed there. With two clicks you can give your dashboard a title and save it permanently. (Please note our minimum browser requirements, Firefox 4+ or Chrome, which are especially applicable for these new custom dashboards, since we’re using some features which are not available in older browsers.)

So let’s create a dashboard. Choose a layout, click ‘Create Dashboard,’ and you will be taken to the new dashboard with the ‘Add A Widget’ panel extended. To begin, let’s check out the title area. Notice that when you hover over the title, a dropdown menu appears. This lists your other dashboards on the current account (as well as dashboards shared by other account members) and is useful for quickly switching between dashboards.

To the right of the title are some icons. The first icon opens the grid options dialog, which lets you change the dimensions of the dashboard grid, hide the grid (it’s still active and usable, though), enable or disable text scaling, and choose whether or not to auto-hide the title bar in fullscreen mode. The second icon toggles fullscreen mode on and off. Once you enter fullscreen mode a third icon will appear, and this icon toggles the ‘Black Dash’ theme (this theme is only available in fullscreen mode). The current states of both fullscreen mode and the ‘Black Dash’ theme are saved with your dashboard.

One other note about the dashboard interface: if you leave a dashboard sitting for more than ten or fifteen seconds and notice that parts of the interface disappear (along with the mouse cursor), don’t worry, it’s just gone to sleep! A move of the mouse will make everything visible again. (If there are any widget settings panels open, though, the sleep timer will not activate.)


Now for the meat of it all: widgets. We currently have ten widgets which can be added to the dashboard grid to show various types of data, and we’ll be adding more widget types and contents in the future. Following is a quick rundown of the currently available widgets:


Graph

Graph widgets let you add existing graphs to your dashboard. You may choose any graph from the “My Graphs” section under your current account. Graph widgets are refreshed every few minutes to ensure they’re always up-to-date.

Beacon Map

Map widgets let you add existing Beacon maps to your dashboard. You may choose any map query from the “Beacons” page (under the “Checks” section of your current account). Map widgets are updated in real-time.

Beacon Table

Table widgets let you add existing Beacon tables to your dashboard. You may choose any table query from the “Beacons” page (under the “Checks” section of your current account). Table widgets are updated in real-time.

Chart

Chart widgets let you select multiple metrics to monitor and compare in a bar or pie chart. Chart widgets are updated in real-time.

Gauge

Gauge widgets let you monitor the current state of a single numeric metric in a graphical manner, displaying the most recent value on a bar gauge (dial gauges are coming soon). Gauge widgets are updated in real-time.

Status

Status widgets let you monitor the current state of one or more metrics, displaying the most recent value with custom formatting. This is most useful for text metrics, but it may be used for numeric metrics as well. Status widgets are updated in real-time.

HTML

HTML widgets let you embed arbitrary HTML content on your dashboard. They can be used for just about anything, from displaying a logo or graphic to using an iframe to embed more in-depth content. Everything is permissible except JavaScript. HTML widgets are refreshed every few minutes to ensure they’re always up-to-date.

List

List widgets let you add lists of graphs and worksheets to your dashboard, ordered by their last modified date. You may specify how many items to list and (optionally) a search string to limit the list. List widgets are refreshed every few minutes to ensure they’re always up-to-date.

Alerts

Alerts widgets let you monitor your checks by showing the most recent alerts on your current account. You may filter the alerts by their age (how long ago they occurred), by particular search terms, by severity levels, or other status criteria. Alerts widgets are refreshed every few minutes to ensure they’re always up-to-date.

Admin

Admin widgets let you monitor selected administrative information, including the status of all Circonus agents on your current account. Admin widgets are refreshed every few minutes to ensure they’re always up-to-date.

To add widgets to the dashboard grid, there are two methods: you may use the ‘drag-and-drop’ method (dragging from the “Add a Widget” panel), or you may first click the target grid cell and then select the widget you want to place there. (Note: in fullscreen mode only the latter method is available.) After a widget has been added, some types of widgets will automatically activate with default settings, but most will be inactive. If the widget is inactive, click it to open the settings panel and get started. Once the widget is activated, the settings panel is available by clicking the settings icon in the upper right corner of the widget. In the lower right corner of the widget is the resize handle, so you can resize the widget as frequently as you want. And let’s not forget being able to rearrange the widgets: every widget has a transparent ‘title bar’ at its top which you can use to drag it around. I won’t get into the details of settings for every type of widget, because they should be self-explanatory (and that would make this one super-long blog post). But suffice it to say, there are plenty of options for everyone.

We’ve been working hard to create a configurable dashboard that will be as flexible as Circonus itself is, and we believe we’ve hit pretty close to the mark. Here’s a sample dashboard showing the power of these new dashboards:

What’s in a number?

Numbers, numbers, numbers; we’re all about numbers here at Circonus. We have trillions of data points which we feed into a slew of algorithms and processes to help our users identify problems with their data. But what are these numbers? It turns out that isn’t an easy question to answer.

Like most monitoring systems, Circonus performs an action from which it extracts one or more “metrics.” A common example is running a database query and measuring both the correctness of the result (as a boolean: good vs. bad) and the latency with which the answer was delivered. Similarly, it could load a web page, ensure that some specified content is successfully returned and measure the time it took. More concretely, when performing an HTTP transaction, it could obtain the following useful metrics: time to establish the TCP connection, time until the first byte of data is received, and time until the last byte of data is received. These measurements can reveal a variety of problems on the surface of your architecture and also provide indications of issues deep within.
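The three HTTP timings named above can be collected with nothing more than a socket and a clock. The sketch below is a simplified illustration using Python's standard library, not how Circonus performs its checks; `time_http_get` is a hypothetical name, and it assumes plain HTTP (no TLS) with an HTTP/1.0 request so that the server closes the connection when the response is complete.

```python
import socket
import time


def time_http_get(host, port=80, path="/", timeout=10.0):
    """Return (connect, first_byte, last_byte) times in seconds for one GET.

    All three values are measured from the moment the connection attempt starts.
    """
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    connected = time.perf_counter()
    try:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        first_byte = None
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: response is complete
                break
            if first_byte is None:
                first_byte = time.perf_counter()
        finished = time.perf_counter()
    finally:
        sock.close()
    if first_byte is None:
        raise RuntimeError("server closed the connection without sending data")
    return connected - start, first_byte - start, finished - start
```

Each call produces one synthetic data point per metric, which is exactly the single-event measurement the next paragraphs examine.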

While most monitoring systems (and parts of Circonus) work this way, the nature of these metrics is most interesting in what it is missing. In other words, it is vital to understand what they do not tell you. You are not observing real information; instead you are producing a single synthetic event and measuring it. The data are not real (and worse, may be far from representative.) Before I dive in and talk about why these data aren’t “good,” I’ll talk a bit about why they are “good enough” for many things.

Synthetic measurements work very well for components that can be measured in terms of quantities or rates. How many of something do you have? How quickly is it increasing or decreasing? Simple examples include disk space, I/O operations per second, the number of HTTP requests serviced, CPU usage, and memory usage. The most important factor is that these things are one-dimensional.

Data like these are both easy to visualize and critically important for things like anomaly detection and capacity planning. Being of a single dimension, understanding patterns in the data is easier for both humans and computers. However, as we start combining these data points, the world goes quickly out of focus.

For the moment, let’s assume we measure total money spent on an e-commerce site (you’d be crazy not to measure this). In addition to that, we measure total transactions performed (number of sales). With these metrics, we have some clear data: total dollars and dollars/hour (by taking the derivative of the samples) and total sales and sales/hour (again by taking the derivative). These numbers are pretty clear and we can make some good judgments about what to expect from day to day. However, you might ask, “How much is the average transaction size?” The answer to this question is simple: total money spent divided by total sales. Unfortunately, the average is not a useful number; just ask any statistician.
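A short Python sketch of both calculations, assuming the totals arrive as cumulative samples with a timestamp in hours. The function names and the sample figures in the usage note are invented for illustration.

```python
def rates_per_hour(samples):
    """Turn cumulative samples into per-hour rates.

    samples is a list of (timestamp_in_hours, cumulative_total) pairs in
    time order; the result holds one rate per consecutive pair.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates


def average_transaction_size(total_dollars, total_sales):
    """Total money spent divided by the number of sales."""
    return total_dollars / total_sales
```

Given dollar totals of `[(0, 1000.0), (1, 1600.0), (3, 2000.0)]`, `rates_per_hour` returns `[600.0, 200.0]` dollars/hour, and `average_transaction_size(2000.0, 80)` returns `25.0`. That second number is simple to compute and, as argued here, says little about any individual sale.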

When you start looking at averages, you start losing information. We use averages to zoom out on graphs; you might notice that when you have a sudden spike (let’s say in traffic) you will see a much higher spike when zoomed in than when zoomed out. Why? If you were serving between 2900 and 3300 requests per second between 7pm and 8pm except for a sudden spike of 5400 requests per second between 7:40 and 7:45, you would see that spike on a graph showing 5 minute averages. However, on a graph zoomed out far enough to show only 20 minute averages, you’d see a deceptively small spike of roughly 3500 to 3800 rps in that time period, because the five-minute burst is averaged with fifteen minutes of normal traffic. As long as you can zoom in on the time series, it can be an acceptable compromise to reduce the data volume down to something consumable by a mere human being. Then the obvious question is: when does this go horribly wrong?
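The flattening effect is easy to reproduce. The Python sketch below rolls twelve 5-minute averages up into 20-minute averages; the flat 3100 rps baseline is an assumed value inside the 2900 to 3300 range, and `bucket_averages` is a hypothetical helper.

```python
def bucket_averages(values, bucket_size):
    """Average consecutive groups of bucket_size values.

    A trailing group with fewer than bucket_size values is dropped.
    """
    usable = len(values) - len(values) % bucket_size
    return [
        sum(values[i:i + bucket_size]) / bucket_size
        for i in range(0, usable, bucket_size)
    ]


# Twelve 5-minute averages covering 7:00pm to 8:00pm, in requests per second.
# The ninth bucket (7:40 to 7:45) carries the 5400 rps spike; every other
# bucket uses the assumed 3100 rps baseline.
five_minute = [3100] * 8 + [5400] + [3100] * 3

twenty_minute = bucket_averages(five_minute, 4)  # [3100.0, 3100.0, 3675.0]
```

With this baseline the 5400 rps spike shrinks to 3675 rps in the 20-minute view; other baselines in the stated range shift that figure by a few hundred rps in either direction, but the peak is always badly understated.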

Let’s look at something like web page load times. If you run a synthetic transaction, always from the same location, you can track measurements in that single dimension. Things should be somewhat consistent and these numbers are useful. However, they do not tell you how fast your site is. Only your users know that. Interestingly, since your users access your web site, you can actually have them report that information back to you. In fact, this is how most web analytics systems work. The interesting part here is that you have a wide variety of data coming in representing a distribution of perceived load times. Some people load your pages quickly and others load them slowly. That’s the nature of the Internet: inconsistency. The key is that they don’t “trend” as a single datapoint that is the average of all.

The inconsistency in these data is interesting: it can be leveraged for improvements and advantage. Understanding (and eventually changing) the distribution of these data can radically change your business. There have been many articles written about web page load times, so in order to keep this fresh, I’ll discuss database transactions. The reason I’m jumping around here is because data are just data — this applies to every metric you can observe.

Understanding that your average database query takes 1.92ms to complete is, I’m sorry to say, useless. The problem is that you are likely running thousands or tens of thousands of queries per second and none of them are average. To illustrate this, here are three (contrived) database query latency histograms each of 39 samples.

The interesting (and perhaps deceptive) part is that all three have an average latency across all queries of 1.92ms. Quite clearly, all depict radically different situations. The truth is, when you have a lot of data (thousands to hundreds of thousands of data points), the histogram reveals the information you seek and the average hides it.
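The same effect can be reproduced with a few lines of Python. The three data sets below are made up for this sketch (they are not the data behind the article's histograms) and are chosen only so that each has 39 samples averaging 1.92ms.

```python
from collections import Counter
from statistics import mean

# Three invented sets of 39 query latencies in milliseconds.
uniform = [1.92] * 39                               # every query takes 1.92ms
long_tail = [1.0] * 36 + [12.96] * 3                # mostly fast, three slow outliers
bimodal = [0.92] * 19 + [1.92] + [2.92] * 19        # two distinct populations


def histogram(latencies_ms):
    """Count samples per whole-millisecond bucket (1.92 falls in bucket 1)."""
    return dict(sorted(Counter(int(v) for v in latencies_ms).items()))


averages = [mean(data) for data in (uniform, long_tail, bimodal)]
shapes = [histogram(data) for data in (uniform, long_tail, bimodal)]
```

All three entries in `averages` come out at 1.92 (up to floating-point rounding), while `shapes` holds `{1: 39}`, `{1: 36, 12: 3}`, and `{0: 19, 1: 1, 2: 19}`: one number, three very different systems.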

Why is this so interesting? In computing, there are a lot of things we can witness by actively measuring them; this is what the Circonus you know and love has done. We figured it was time to change the game a bit and help you visualize, in real-time, the things that happen in your business: enter BizEKG.

BizEKG allows you to analyze events (like webpage loads, database queries, customer service telephone calls, etc.). Not just some, not just a sample, but all the events. From there, you can break them apart, run statistical analysis (including histograms, of course) and understand your data. There are a handful of real-time web analytics companies out there, but answering these questions in “Circonus style” changes the game entirely. What’s Circonus style?

We at Circonus believe that all data are important, not just web data. We believe that if you can’t see what’s happening right now, you are as good as blind. So take this real-time, multi-dimensional statistical analysis engine, feed it any data you want, and see it all in real-time.

With our snazzy new BizEKG service you can actually do what some might consider a sufficient level of black magic. You can decompose these events in realtime and visualize these histograms in realtime. Not only is this pretty cool… it’s pretty damn enlightening. BizEKG is a new service we’ve launched and deserves its own announcement; we’ll get to that soon.

The above histogram shows the last 60 seconds of page load times, in milliseconds, for a subsection of a current Alexa top 1000 site. Yes, 10,000ms is 10 seconds of page load time. Even on today’s Internet, loading a complex site over wireless from another country is… slow.

A Lotta Love for Keyboard Users

All web users who bemoan the general lack of support for keyboard accessibility in web apps, take heart! Circonus has some great features for keyboard lovers. We know there are many web users out there for whom keyboard shortcuts are a quicker and easier way to use applications, particularly web apps. This is especially true if you use a specific app heavily, or are a full-time computer user in general.

Anywhere in Circonus, you can always see the keyboard help screen by typing “?” so you’ll have an ever-present “cheat sheet” as you learn the shortcuts. As soon as new keyboard functionality is added, the keyboard help screen will be updated immediately to reflect the new shortcuts, thanks to the magic of continuous deployment.

Jump Navigation

To jump to a particular section in Circonus, all you have to do is type the proper keyword and you will jump there immediately. For example, type the keyword “dash” (d-a-s-h) and you will jump to the current account’s dashboard. It’s that easy! Here’s a list of the current jump keywords:

  • “dash” (jump to the dashboard)
  • “alerts” (jump to the fault dashboard)
  • “rules” (jump to rules)
  • “checks” (jump to checks)
  • “metrics” (jump to metrics)
  • “trends” (jump to the trending dashboard)
  • “graphs” (jump to graphs)
  • “worksheets” (jump to worksheets)

The shortcut for opening the feedback dialog also works the same way: simply type “feedback” and the feedback dialog will open for you. Another quick shortcut is the forward-slash (/), which focuses on any search field that may be on the page.
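How Circonus detects these keywords is not described here, but one simple approach is to keep a short buffer of recent keystrokes and check whether it ends with a known keyword. The Python sketch below illustrates that idea only; `JumpDetector` and `JUMP_TARGETS` are hypothetical names, and the targets are taken from the list above.

```python
JUMP_TARGETS = {
    "dash": "dashboard",
    "alerts": "fault dashboard",
    "rules": "rules",
    "checks": "checks",
    "metrics": "metrics",
    "trends": "trending dashboard",
    "graphs": "graphs",
    "worksheets": "worksheets",
}


class JumpDetector:
    """Track typed characters and report when a jump keyword is completed."""

    def __init__(self, targets):
        self.targets = targets
        self.longest = max(len(keyword) for keyword in targets)
        self.buffer = ""

    def press(self, char):
        """Feed one typed character; return the jump target or None."""
        # Only the most recent characters can still complete a keyword.
        self.buffer = (self.buffer + char)[-self.longest:]
        for keyword, target in self.targets.items():
            if self.buffer.endswith(keyword):
                self.buffer = ""
                return target
        return None
```

Feeding the characters d, a, s, h one at a time returns `None` three times and then `"dashboard"`, and stray characters typed before a keyword do not prevent the match.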

Graph & Worksheet Shortcuts

Here’s where we get to the good stuff. We’ve added some great shortcuts to work with graphs and an enhanced zoom tool which is only available via keyboard shortcuts.

To start off, you can now see the legend on any thumbnail graph view (on “My Graphs,” “Trending Dashboard,” and all worksheets), not only on the large graph views as before. To do so, simply hold down the shift key, and the legend will appear for whichever graph you’re hovering over. On a worksheet, the shift key also inverts the legend hover option. So if you have enabled the new worksheet option to show legends upon graph hovering, holding down the shift key will disable the hovering legends.

Back in January we launched an enhanced graph zoom toolbar that relies on keyboard shortcuts. Normally the zoom toolbar is labeled “Past” because its buttons will set the graph zoom level to view data from the past one week, two weeks, etc. However, if you hold down either comma or period, the zoom tool will be enhanced and the label will change to “shift.” You will also see an orange bar at the end of the graph(s) which indicates the end that will be shifted (and if you hold both keys, you will get two orange bars, indicating that you can pan the entire graph date range into the past or the future). While holding one or both keys, click one of the new arrow icons that appear inside the “shift” buttons—the graph date(s) will be shifted by the specified amount in the specified direction. Not only does this work when viewing or editing graphs, it works almost everywhere there is one or more graphs, whether large or thumbnail sized.
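The date arithmetic behind the shift tool is simple enough to sketch. The function below is an illustration only: `shift_range` and its `move_start` / `move_end` flags are hypothetical names, and which key controls which end of the graph is deliberately left out.

```python
from datetime import datetime, timedelta


def shift_range(start, end, amount, move_start=False, move_end=False):
    """Return a new (start, end) with the selected end(s) moved by `amount`.

    A negative `amount` moves into the past, a positive one into the future.
    Moving both ends pans the whole window without changing its width.
    """
    new_start = start + amount if move_start else start
    new_end = end + amount if move_end else end
    if new_start >= new_end:
        raise ValueError("shift would produce an empty or inverted date range")
    return new_start, new_end


# Example: pan a one-week view one week into the past (both "keys" held).
view_start = datetime(2011, 1, 10)
view_end = datetime(2011, 1, 17)
panned = shift_range(view_start, view_end, -timedelta(weeks=1),
                     move_start=True, move_end=True)
```

Moving only one end stretches or shrinks the window, which is why the sketch rejects a shift that would leave the start at or after the end.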

One last set of useful shortcuts applies when viewing a worksheet. Among the newly added worksheet options is the ability to resize worksheet graphs to one of three sizes. In addition to being able to do this by clicking the buttons in the worksheet options dialog, you can instantly change the size of your worksheet graphs by pressing alt+1, alt+2, or alt+3.

Being avid keyboard users ourselves, we are excited to build keyboard support into more areas of Circonus as we are able to do so. Keep watching for more keyboard info and if you have ideas for some useful shortcuts, please let us know!

Lost In Translation

For more than ten years, OmniTI has been making large-scale critical Internet infrastructure work. It is, obviously, not black magic or voodoo. Perhaps not so obviously, it is not technical competence alone that leads to success here. I like to think our team has technical competence in spades as we have an impeccable track record, authored books and a laundry list of speaking engagements to justify it. However, technical competence alone would fall short of the mark, far short.

Without exception, it is expected that proper monitoring and trending are as much a part of the process as setting up networking, backups, and more recently, change management. And yet, when you ask someone to explain why monitoring and trending are vital, you’d be lucky to get a response other than “to be sure things are working”. Something here is lost in translation.

Disconnected Viewpoints

Every business owner knows that watching the books is part of the job. You need to know P&L, you need to understand the outputs and costs of your various business units and you track efficiencies everywhere. All of these metrics play a part in both strategic and tactical decisions made every day. Each business unit reports these things and while in good organizations each manager knows what is important to each other manager, something is still lost in translation. Far too often, managers don’t understand that what they produce, what they consume and how they work changes the game for other business units. While the word is overused and abused, every business is an ecosystem. It is obvious that a new marketing campaign will increase resource utilization on the sales teams. It should be obvious that a new marketing campaign will increase resource utilization on IT infrastructure as well.

Every systems administrator knows (or should know) that monitoring your architecture is fundamental. On the other hand, very few can explain in any detail why this is so important. “Because you lose money when systems are offline”, they’ll quote disparagingly. Ask how much and you might catch them at a loss. From my own experience in operations, as well as countless conversations with customers and vendors, very few individuals recognize the relationship between IT and Business. Systems people know that they have to keep systems and services running to support their business, but rarely do they understand that relationship completely.

Owners that foster a transparent and cohesive organization around key performance indicators in every business unit (even those that are cost centers) will change their organizations in two critically useful ways:

  • Efficiencies between business units. With increased transparency, staff in all positions will see the effects of their actions across the business as a whole. This produces an atmosphere of self-reinforcing efficiency.
  • Accountability to the overall business. The hokey old question: “Is what you’re doing good for the company?” changes form. With increased cohesiveness, the answer to that question is a more obvious outcome to every action and no one can call it hokey, because it is always answered without being asked.

A Call To Arms

Technology is no longer underneath the products you sell and the process by which you deliver them. It is, for at least the immediate future, intertwined with them. Creativity on the technology side doesn’t only deliver cost savings, it creates new audiences and increases interaction with your customers. You have to do more than embrace technology; you need to leverage it and let new opportunities catapult your business forward.

As intertwined as technology is, we can no longer afford to have its operational details hidden away in the bowels of the “tech ops” or “web ops” group. We need visibility and we need cohesion. Infrastructure/application engineering and other business units are now, more than ever before, on the same team marching towards success. Communication and accountability are critical to success.

Here is where I leave you and hope that you will think about the metrics you monitor in a different light. They represent something more. They are there to make the business run, increase shareholder value, and make your customers happier and more prosperous.

Past Performance: does this look right to you?

If you are like me, you look at a lot of data. I look at data in spreadsheets, I look at data on P&L statements, I look at term sheets, I look at systems data — a lot of systems data. I find the best way to look at data is to visualize it because it is the fastest way to get data into the amazing pattern matcher that is the human brain.

The human brain is quite good at saying “this is abnormal” and can usually even articulate why. This curve has a periodicity, that one a monotonic behavior, another is simply always flat… then they “change.” When we say “this visualization looks wrong,” we are almost always onto something real in the numbers. I’ll give you a simple visual example:

While there is obviously something starting at 8pm, we are only left with another question: “is it out of the ordinary?” It doesn’t look like that today, and it doesn’t appear to resemble the day before. What about last week? Let’s start the graph one week earlier:

This tells us a lot. It looks like we had a very similar event last week at this time. With most analysis tools, you stop here (or you hover with your mouse and try to correlate start/end times and magnitude to better understand how these two events resemble each other).

With Circonus, we don’t leave it here. Instead, we provide tools to help compare time-separated events using our data overlay feature. We can take our original two-day view and overlay the data from last week right on top of it (or in this case, underneath it).

Just two clicks and we’ve got a one-week offset data overlay and the visualization lends a little insight into what is going on. We can see the start times are identical, but the event from this week ends about 30 minutes before the one from last week — largely the same though.

Again, we find that visuals help. Understanding how these graphs differ even when they are right on top of each other can be a bit challenging. Never fear! We’ve added help in the legend.

The legend takes on some new features when data overlays are in use. You now get a very clear, side-by-side read-out of the data in the graph including percentage differences. Additionally, the arrows that say “you’re higher than you were last week” become more saturated (redder) as the difference in the data increases and fade to light grey as the two values become more similar. This makes it simple to quickly understand how current performance really compares to past performance. So, the interesting part of this graph is actually the subsequent spike of inbound traffic that is up 95% over last week. That’s something to look into.
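The arithmetic behind that read-out is worth spelling out. Here is a rough sketch, assuming the current and week-old series are already aligned point for point; the function names and the linear 0-to-1 saturation scale are our own illustrative assumptions, not a description of the legend's actual internals.

```python
def overlay_compare(current, previous):
    """Pair each current value with the overlaid value and the percent change."""
    rows = []
    for now, then in zip(current, previous):
        if then == 0:
            change = None  # percent change is undefined against a zero baseline
        else:
            change = (now - then) * 100.0 / then
        rows.append((now, then, change))
    return rows


def arrow_intensity(change, full_at=100.0):
    """Map a percent change to 0.0 (light grey) through 1.0 (fully saturated).

    `full_at` is an assumed tuning knob: the change, in percent, at which
    the arrow reaches full saturation.
    """
    if change is None:
        return 0.0
    return min(abs(change) / full_at, 1.0)


# Made-up sample values: the first point nearly doubled, the second is unchanged.
rows = overlay_compare([195.0, 100.0], [100.0, 100.0])
```

With these sample numbers the first row shows a 95% increase and a nearly saturated arrow, while the second row shows no change and an arrow that stays grey.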

Capacity Planning Made Easy

Okay, so capacity planning will never be foolproof. You simply cannot predict the future. However, some of the time you have a darn good idea of what the future will hold. Since someone knows what is likely to happen, why is it so hard to plan marketing initiatives, funnels and IT provisioning?

The reason is that things aren’t always linearly correlated. What’s that mean? Linear correlation goes something like this: if A depends upon B and I want twice as much A, I’ll need twice as much B. While correlating non-linear systems can be tricky, a lot can be done with linear regressions. The problem with any regression is that you need to put real numbers in, get real numbers out and understand how good they are.

When we look at how something grows, one of the most common tools in the statistics arsenal is a least-squares linear regression. That is: given a set of datapoints, what line best fits them? So, let’s say we have a lot of datapoints (boy do we have a lot of datapoints!). Now what does a linear regression tell us?
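To make “what line best fits them” concrete, here is a minimal sketch of an ordinary least-squares fit using the textbook closed-form solution. It illustrates the technique only and says nothing about how Circonus computes its regressions; `least_squares_line` is a name we made up for the example.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means every sample shares one x value.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Co-variation of x and y around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up sample: four readings that wobble around an upward trend.
slope, intercept = least_squares_line([0, 1, 2, 3], [1, 2, 2, 3])
```

For traffic data, `xs` would be timestamps (or sample indexes) and `ys` the metric values, so the slope reads as growth per unit of time.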

Let’s assume we’re looking at some traffic data over the month of December.

In this graph, it can be very hard to answer questions about the nature of the data. Two common questions are:

  1. are we growing or shrinking and by how much?
  2. if we stay on the current growth path, where will we be some point in the future?

Enter the linear regression:

Answering the first question is pretty simple now. We can look at the value on the left side of the graph and the value on the right side, and do the math. You can’t see it in the screenshot, but the values at the left and right edges are 5.49M and 5.88M, which is roughly 7.1% growth over 4 weeks. Now, any statistician will scream bloody murder about confidences in the data and model, and any engineer will simply ask: “does that make sense?” Maybe we’ll look over 8 weeks and 12 weeks as well to build our confidence (this can be easier, though far less scientific, than understanding R² values, which are, of course, available as well). Honestly, I personally find that reconciling this with my expectations is one of the better methods of trusting the model.
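Both checks mentioned here are easy to sketch. Measured against the starting value, going from 5.49M to 5.88M works out to roughly 7.1% (dividing by the ending value instead gives about 6.6%). The helper names below are hypothetical, and the R² sample data is made up purely for illustration.

```python
def percent_growth(start_value, end_value):
    """Growth from start to end, as a percentage of the starting value."""
    return (end_value - start_value) * 100.0 / start_value


def r_squared(ys, fitted):
    """Coefficient of determination: 1.0 means the line explains all variation."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in ys)             # total
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# The two endpoint readings quoted above, in millions.
growth = percent_growth(5.49, 5.88)

# Made-up observations and the values a fitted line predicts for them.
fit_quality = r_squared([1, 2, 2, 3], [1.1, 1.7, 2.3, 2.9])
```

An R² near 1.0 says the straight line explains most of the variation in the window. A low value is a hint to compare several windows, as suggested above, before trusting the growth number.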

Let’s assume that we expected some increase in resource usage during this time frame and that roughly 7% is reasonable. Now onto the next question: where will we be in the future? In Circonus, we just jump up and extend our view window out one year and we can see what our model looks like in the future:

Next December we’ll be using 10.91M (this just happens to be MBits/s of network bandwidth to serve origin dynamic content on one of the sites managed over at OmniTI). We’ll revisit this month by month to ensure that we are indeed heading where we expected. It allows engineers, marketers and executives alike to put real numbers into (what we call) napkin math, which adds peace and clarity and makes what-if pontification easier for most people. I can tell you one thing… we sleep better at night knowing specific numbers about a probable future.
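The napkin math itself can be sketched in a few lines. This assumes the straight line through the two rounded endpoints from earlier (5.49M and 5.88M over a 4-week window) and extends it 52 weeks. The function names and the 10.0M capacity figure are made up for the example. With these rounded inputs the projection lands near 10.95M, in the same neighborhood as the 10.91M read off the graph; a small gap is expected because the real model works from the full dataset rather than two rounded readings.

```python
def project(start_value, end_value, window_weeks, weeks_ahead):
    """Extend the line through two fitted endpoints `weeks_ahead` past the end."""
    slope_per_week = (end_value - start_value) / window_weeks
    return end_value + slope_per_week * weeks_ahead


def weeks_until(start_value, end_value, window_weeks, capacity):
    """What-if helper: weeks until the trend line reaches `capacity`.

    Returns None when the trend is flat or shrinking, since the line
    never reaches a higher capacity in that case.
    """
    slope_per_week = (end_value - start_value) / window_weeks
    if slope_per_week <= 0:
        return None
    return max((capacity - end_value) / slope_per_week, 0.0)


# Endpoints in millions, taken from the 4-week view discussed earlier.
next_december = project(5.49, 5.88, window_weeks=4, weeks_ahead=52)

# Hypothetical what-if: how long until a 10.0M link would be saturated?
runway = weeks_until(5.49, 5.88, window_weeks=4, capacity=10.0)
```

The second helper is the kind of what-if question this makes cheap to ask: on the current path, the hypothetical 10.0M link would last a little over 42 weeks.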

Enterprise Agents

If you’re like me, your first response to SaaS monitoring was: “You can’t see my machines/services/metrics from your cloud. That won’t be too useful.” With a little bit of thought, it’s pretty easy to arrive at the conclusion that you must run something on your infrastructure to bridge the divide. It was a fun and exciting project here to build that magic something called the Circonus Enterprise Agent.


The Circonus Enterprise Agent (we’ll call it our “EA” from here on) is all of our magic monitoring software bundled into a maintainable VMWare virtual appliance that can be run on your internal networks to track stuff that the public shouldn’t be seeing. We had some interesting choices to make during development and I thought I’d share what they were and why we made them.

Choosing a platform

Most of our internal infrastructure runs on some variant of OpenSolaris technology. We chose this for a variety of reasons. Most importantly, storing your precious data on ZFS seemed like the right thing to do. After that, the fault management architecture (FMA) available in OpenSolaris allows us to keep our machines and services running more reliably. Reliability and data permanence are the two most important factors in technology selection here at Circonus (a fact our customers respect).

So, with all this talk about OpenSolaris and its advantages you’d imagine we built our EA on the same technology, right? Not so simple. For a virtual appliance image that is easy to administer and easy to upgrade in the field you need a good package management system. OpenSolaris simply falls on its face there. Oracle’s promises of IPS (the new and coming package management system for Solaris 11) are quite compelling, but that is just a promise today. Instead, we turned to the tried and true CentOS Linux-based platform for our EA.

CentOS provides all the features we need to run our agent software, manage package upgrades and distribution seamlessly and simply, and the core operating system is both stable and secure. In an interesting later development, we provide Joyent customers the ability to run an EA on one of their Joyent SmartMachines. Joyent’s operating architecture is actually derived from OpenSolaris — so we ended up porting our EA back to our core platform as well.

Today, the EA is available in two forms: a CentOS 5 VMWare-based appliance and a Joyent SmartMachine.

From where do you manage the appliance


While most appliances have a web console that allows a variety of management tasks, we made a simple choice to have the appliance administrable via the main web application. This is where Circonus users interface with all their data and set up their monitors, so it only made sense to also administer their EA from the same place.

After using the system for a while now, I can say that I’m really pleased with this decision. The cohesiveness of scheduling checks on your private EA and/or the world-wide Circonus agents through the same check creation interface is a simple pleasure. One single world-wide view of all the agents on which you can schedule checks makes it simple to understand how the monitoring system works.

What to automate

Generally speaking, when you think appliance, you think self-maintaining. That’s not an unreasonable expectation. However, this directly conflicts with our experience in operations. In operations, automatic upgrades of software are strictly taboo. Typically, the operations crew wants to schedule precisely when an upgrade will occur, be present and have a bulletproof evaluation and rollback plan. When you start talking about critical infrastructure like monitoring, “typically” becomes “always.”

With this in mind, we made the upgrade process on the EA completely automated, but not automatic. One click and the appliance will self-upgrade. Currently, this is the only ongoing task that is done from the appliance itself (rather than the portal), but we’re looking to make some nice enhancements there as well. Soon, you’ll be able to trigger remote EA upgrades directly from the web application.

What you get

With an EA you get to leverage the power of Circonus against all of your private data. Networks, systems, applications and business systems that are only accessible via internal infrastructure can be monitored via an Enterprise Agent. The data is fed back to the Circonus cloud in real-time. All of that data can be alerted on, and is available for correlation, trending and planning purposes through the excellent Circonus tools you already know and love.