Understanding Data with Histograms

For the last several years, I’ve been speaking about the lies that graphs tell us. We all spend time looking at data, commonly through line graphs, that actually show us averages. A great example of this is showing average response times for API requests.

The above graph shows the average response time for calls made to an HTTP REST endpoint. Each pixel in this line graph is the average of thousands of samples. Each of these samples represents a real user of the API. Thousands of users distilled down to a single value sounds ideal until you realize that you have no idea what the distribution of the samples looks like. Basically, this graph only serves to mislead you. Having been misled for years by the graphs with little recourse, we decided to do something about it and give Circonus users more insight into their data.
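To make the problem concrete, here is a minimal sketch with made-up numbers (they are not taken from the graph): 90 fast requests and 10 very slow ones. The average lands on a value that no actual user experienced.

```python
from statistics import mean

# Made-up response times in milliseconds for one pixel's worth of requests:
# most calls are fast, a minority are very slow.
fast = [20] * 90
slow = [500] * 10
samples = fast + slow

average = mean(samples)  # the single value a line graph would plot

# How many real requests were anywhere near that average?
near_average = [s for s in samples if abs(s - average) <= 10]

print(f"average: {average} ms")
print(f"requests within 10 ms of the average: {len(near_average)}")
```

The slow tenth of the traffic is invisible in the plotted value, and so is the fact that most requests were much faster than the line suggests.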

Each of these pixels is the average of many samples. If we were to take those samples and put them in a histogram, it would provide dramatically improved insight into the underlying data. But a histogram is a visually bulky representation of data, and we have a lot of data to show (over time, no less). When I say visually bulky what do I mean? A histogram takes up space on the screen and since we have a histogram of data for each period of time and hundreds of periods of time in the time series we’d like to visualize… well, I can’t very well show you hundreds of histograms at once and expect you to be able to make any sense of them; or can I?

Enter heat maps. Heat maps are a way of displaying histograms using color saturation instead of bar heights. So heat maps remove the “bulkiness” and provide sufficient visual density of information, but the rub is that people have trouble grasping them at first sight. Once you look at them for a while, they start to make sense. The question we faced is: how do we tie it all together and make it more accessible? The journey started for us about six months ago, and we’ve arrived at a place that I find truly enlightening.
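As a rough illustration of the idea, the sketch below builds one histogram per time window and converts the counts into saturations between 0.0 and 1.0. This is plain Python over made-up samples, not how Circonus renders its heat maps; the 1 ms bin width and the scaling against the busiest cell are assumptions chosen for the example.

```python
from collections import Counter

def histogram(samples_ms, bin_width_ms=1):
    """Count samples per bin; the key is the bin's lower edge in ms."""
    return Counter(int(s // bin_width_ms) * bin_width_ms for s in samples_ms)

def heat_map(windows, bin_width_ms=1):
    """Turn one histogram per time window into a grid of saturations.

    Returns (bins, rows): rows[i][j] is the saturation (0.0 to 1.0) for time
    window j in latency bin bins[i], scaled against the busiest cell.
    """
    hists = [histogram(w, bin_width_ms) for w in windows]
    bins = sorted({b for h in hists for b in h})
    peak = max((c for h in hists for c in h.values()), default=1)
    rows = [[h[b] / peak for h in hists] for b in bins]
    return bins, rows

# Three made-up time windows of response times (ms).
windows = [
    [21.2, 21.7, 22.1, 23.5],
    [21.0, 21.1, 21.9, 21.4, 24.8],
    [22.3, 22.9, 23.0],
]

bins, rows = heat_map(windows)

# Print with the slowest bin on top, like the y-axis of a heat map.
for b, row in zip(reversed(bins), reversed(rows)):
    print(f"{b:>3} ms  " + " ".join(f"{v:.2f}" for v in row))
```

Each column is a histogram on its side; a darker cell simply means more samples landed in that latency bin during that window.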

Instead of a tutorial on histograms, I think throwing you into the interface is far more constructive.

The above graph provides a very deep, rich understanding of the same data that powered the first line graph. This graph shows all of the API response times for the exact same service over the same time period.

In my first (#1) point of interest, I am hovering the pointer over a specific bit of data. This happens to be August 31st at 8pm. I’ll note that not only does our horizontal position matter (affecting time), but my vertical position indicates the actual service times. I’m hovering between 23 and 24 on the y-axis (23-24 milliseconds). The legend shows me that there were 1383 API calls made at that time and 96 of them took between 23 and 24 milliseconds. Highlighted at #3, I also have some invaluable information about where these samples sit in our overall distribution: these 96 samples constitute only 7% of our dataset, 61% of the samples are less than 23ms and the remaining 32% are greater than or equal to 24ms. If I move the pointer up and down, I can see this all dynamically change on-screen. Wow.
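The numbers in that legend are simple to derive from the raw samples. The sketch below is an assumption about the arithmetic, not the Circonus implementation. The total (1383) and the bin count (96) come from the legend above; the split of the remaining samples into 843 below and 444 above is made up so that the rounded percentages match.

```python
def bin_summary(samples_ms, low_ms, high_ms):
    """Describe how the samples fall around the bin [low_ms, high_ms)."""
    total = len(samples_ms)
    if total == 0:
        raise ValueError("need at least one sample")
    below = sum(1 for s in samples_ms if s < low_ms)
    in_bin = sum(1 for s in samples_ms if low_ms <= s < high_ms)
    above = total - below - in_bin

    def pct(n):
        return round(100 * n / total)

    return {
        "total": total,
        "in_bin": in_bin,
        "pct_in_bin": pct(in_bin),
        "pct_below": pct(below),
        "pct_above": pct(above),
    }

# Synthetic samples standing in for the hovered time slice.
samples = [20.0] * 843 + [23.5] * 96 + [30.0] * 444

summary = bin_summary(samples, 23, 24)
print(summary)
```

Moving the pointer vertically corresponds to calling `bin_summary` with a different `low_ms` and `high_ms`; moving it horizontally corresponds to passing a different time slice of samples.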

As if that wasn’t enough, a pop-up histogram of the data from the time interval over which I’m hovering is available (#2) that shows me the precise distribution of samples. This histogram changes as I move my pointer horizontally to investigate different points in time.

Now that I’ve better prepared you for the onslaught of data, poke around a live interactive visualization of a histogram with similar data.

With these visualizations at my disposal, I am now able to ask more intelligent questions about how our systems behave and how our business reacts to that. All of these tools are available to Circonus users and you should be throwing every piece of data you have at Circonus… just be prepared to have your eyes opened.

Web Portal Outage

Last night circonus.com became unavailable for 34 minutes because the primary database server became unavailable. Here is a breakdown of events; times are US/Eastern.

  • 8:23 pm kernel panic on primary DB machine, system rebooted but did not start up properly
  • 8:25 -> 8:27 first set of pages went out about DB being down and other dependent systems not operating
  • 8:30 work began on migrating to the backup DB
  • 8:57 migration complete and systems were back online

In addition to the web portal being down during this time, alerts were delayed. The fault detection system continued to operate; however, we have discovered some edge cases in the case management portion that will be addressed soon.

Because of the highly decoupled nature of Circonus, metric collection, ingestion and long term storage were not impacted by this event. Other services like search, streaming, and even fault detection (except as outlined above) receive their updates over a message queue and continued to operate as normal.

After the outage we discussed why recovery took so long and boiled it down to inadequate documentation on the failover process. Not all the players on call that night knew all they needed about the system. This is something that is being addressed so recovery in an event like this in the future can be handled much faster.

Dashboards: Redux (or What to Look for in a Performance Monitoring Dashboard)

Last autumn we launched our customizable dashboards for Circonus, and we happen to think they’re pretty sweet. In this post, I’m not going to get into specifics about our dashboards (for more on that, you can check out my previous post, “One Dashboard to Rule Them All”), but instead I’ll talk more generally about what you should look for in your performance monitoring dashboard of choice.

Your dashboard shouldn’t limit its focus to technical data; it should help you track what really matters: business success.

A lot of data analysis done today is technical analysis for technical benefit. But the real value comes when we are able to take this expertise and start empowering better business decisions. As such, a performance monitoring dashboard which is solely focused on networks, systems, and applications is limiting because it doesn’t address what really matters: business.

While your purpose for monitoring may be to make your company’s web business operate smoothly, you can influence your overall business through what you operate and control, including releases, performance, stability, computing resources, networking, and availability. Thus, your dashboard should be designed to enable this kind of cross-pollination. By understanding which of your business metrics are critical to your success, you will be able to effectively use a dashboard to monitor those elements that are vital to your business.

Your dashboard should be able to handle multiple data sources.

There are many technologies in use across the web today. Chances are good that you have many different data sources across your business, so you need a dashboard that can handle them. It’s no good for a dashboard to only be able to gather part of your business data, because you’ll be viewing an incomplete picture. You need a dashboard that can handle all of your data sources, preferably on a system that’s under active development and continuing to integrate the best new technologies coming down the pike.

Your dashboard should provide access to real-time data.

The value of real-time data should not be underestimated; having real-time data capabilities on your dashboard is critical. Rather than requiring you to hit the refresh button, it should use real-time data to show you what is going on right now. Having this up-to-date picture makes your analysis of the data more valuable because it’s based on what’s happening in the moment. Some sectors already embracing this type of real-time analysis include finance, stock trading, and high-frequency trading.

Your dashboard should provide visualizations to match different types of data.

Your dashboard should provide different visualizations, because the visualization method you choose should fit the data you’re displaying. It’s easy to gravitate towards the slickest, shiniest visualizations, but they don’t always make the most sense for your data.

One popular visualization design is the rotary needle (dial) gauge. Gauges look cool, but they can be misleading if you don’t know their limits. Also, because of their opaque nature, the picture they give you of the current state is without context. Gauges can be great for monitoring certain data like percentages, temperature, power per rack, or bandwidth per uplink, but visualizations like graphs are generally better because they can give you context and history in a compact space. Graphs not only show you what’s going on now but also what happened before, and they allow you to project historic data (e.g. last day/week) alongside current data or place multiple datasets side-by-side so you can compare trends.

It’s also easy to forget that sometimes you may not need a visualization at all. Some data is most easily understood in text form (perhaps formatted in a table). Your dashboard should provide different ways of viewing data so you can choose the best method for your particular data sets.

Your dashboard’s interface shouldn’t be over-designed.

Designers tend to show off their design chops by creating slick, shiny user interfaces for their dashboards, but these are frequently just eye-candy and can get in the way of “scannability.” You need to be able to understand your dashboard at a glance, so design should stay away from being too graphics-heavy and should not have too much information crammed into tiny spaces. These lead to visual clutter and make you have to “work too hard” whenever you try to read your dashboard. The design should help you focus on your data, not the interface.

Everybody’s idea of a “perfect dashboard” will vary somewhat, but by following these guidelines you will be well on your way to selecting a dashboard that lets you monitor your data however you want. Remember, the goal is informed, data-driven decision-making, and it’s not unreachable. If you haven’t yet tested the capabilities of Circonus (including our customizable dashboards), why not give it a try with a free one month trial period?

Failing Forward While Stumbling, Eventually You Regain Your Balance

First I want to start by saying I sincerely apologize for anyone adversely affected by yesterday’s false alerts. That is something that we are very conscious of when rolling out new changes and clearly something I hope never to repeat.

How did it happen? First, a quick run down of the systems involved. As data is streamed into the system from the brokers, it is sent over RabbitMQ to a group of Complex Event Processors (CEP) running Esper and additionally the last collected value for each unique metric is stored in Redis for quick lookups. The CEPs are responsible for identifying when a value has triggered an alert, and then tell the notification system about it.

Yesterday we were working on a bug in the CEP system where under certain conditions, if a value went from bad to good, and we were restarting the service, it was possible we would never trigger an “all clear” event and as such your alert would never clear. After vigorously testing in our development environment, we thought we had it fixed and all our (known) corner cases tested.

So the change was deployed to one of the CEP systems to verify it in production. For the first few minutes all was well, stale alerts were clearing, I was a happy camper. Then roughly 5 minutes after the restart, all hell broke loose, every “on absence” alert fired, and then cleared within 1 minute, pagers went off around the office, happiness aborted.

Digging into the code we thought we spotted the problem: when we loaded the last value into the CEP from Redis, we needed to do so in a particular order. Because we used multiple threads to load the data and let it do so asynchronously, some was being loaded in the proper order, but the vast majority was being loaded too late. Strike one for our dev environment. It doesn’t have near the volume of data, so everything was loaded in order by chance. We fixed the concurrency issue, tested, redeployed, BOOM same behavior as before.

The next failure was a result of the grouping that we do in the Esper queries, we were grouping by the check id, the name of the metric and the target host being observed. The preload data was missing the target field. What this caused was the initial preload event to be inserted ok, then as we got new data in it would also be inserted just fine, but was being grouped differently. Our absence windows currently have a 5 minute timeout, so 5 minutes after boot, all the preload data would exit the window, which would now be empty and we triggered an alert. Then, as the newly collected data filled its window, we would send an all clear for that metric and at this point we would be running normally, albeit with a lot of false alerts getting cleaned up.
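The grouping mismatch is easier to see in miniature. The sketch below is a plain-Python stand-in for the absence window, not Esper EPL and not our production code; the `Event` tuple, the `absent_groups` helper and the sample host name are all hypothetical. It groups by check id, metric name and target, and reports any group that has gone a full 5 minutes without data.

```python
from collections import namedtuple

Event = namedtuple("Event", "check_id metric target time_s")

WINDOW_S = 300  # the 5 minute absence timeout described above

def absent_groups(events, now_s, window_s=WINDOW_S):
    """Return the grouping keys whose newest event is at least window_s old."""
    newest = {}
    for e in events:
        key = (e.check_id, e.metric, e.target)  # the three fields we group by
        newest[key] = max(newest.get(key, e.time_s), e.time_s)
    return {key for key, t in newest.items() if now_s - t >= window_s}

def run(preload_target):
    """Preload one value at boot (t=0), then collect live data once a minute."""
    events = [Event(1, "cpu", preload_target, 0)]
    events += [Event(1, "cpu", "web1", t) for t in range(60, 301, 60)]
    return absent_groups(events, now_s=300)

buggy = run(preload_target=None)    # preload is missing the target field
fixed = run(preload_target="web1")  # preload carries the target

print("false absences with buggy preload:", buggy)
print("false absences with fixed preload:", fixed)
```

With the target missing, the preloaded value sits alone in its own group, ages out after 5 minutes, and looks exactly like a metric that stopped reporting, even though live data for the same metric is arriving the whole time.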

Unfortunately at this point, the redis servers didn’t have the target information in their store, so a quick change was made to push that data into them. That rollout was a success, a little happiness was restored since something went right. After they had enough time to populate all the check data, changes were again rolled out to the CEP to add the target to the preload, hopes were high. We still at this point had only rolled the changes to the first CEP machine, so that was updated again, rebooted, and after 5 minutes things still looked solid, then the other systems were updated. BOOM.

The timing of this failure didn’t make sense. CEP one had been running for 15 minutes now, and there are no timers in the system that would explain this behavior. Code was reviewed and looked correct. Upon review of the log files, we saw failures and recoveries on each CEP system, however they were being generated by different machines.

The reason for this was due to a recent scale out of the CEP infrastructure. Each CEP is connected to RabbitMQ to receive events, to split the processing amongst them each binds a set of routing keys for events it cares about. This splitting of events wasn’t mimicked in the preload code, each CEP would be preloaded with all events. Since each system only cared about its share, the events it wasn’t receiving would trigger an absence alert as it would see them in the preload and then never again. Since the CEP systems are decoupled, an event A on CEP one wouldn’t be relayed to any other system, so they would not know that they needed to send a clear event since as far as they were concerned, everything was ok. Strike two for dev, we don’t use that distributed setup there.

Once again the CEP was patched, this time the preloader was given the intelligence to construct the routing keys for each metric. At boot it would pull the list of keys it cared about from its config, and then as it pulled the data from Redis, it would compare what that metric’s key would be to its list, if it had it, preload the data. One last time, roll changes, restart, wait, wait, longest 5 minutes in recent memory, wait some more… no boom!!!
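In outline, the patched preloader behaves like the sketch below. The routing key format here (`check.N`, derived from the check id) is purely hypothetical, as are `NUM_SHARDS` and the sample data; a dict stands in for Redis. The only point is the filter: a node preloads a value only if it is also bound to the key that metric's live events would arrive on.

```python
NUM_SHARDS = 4  # hypothetical number of routing keys across the CEP cluster

def routing_key(check_id):
    """Hypothetical scheme: derive a metric's routing key from its check id."""
    return f"check.{check_id % NUM_SHARDS}"

def preload(last_values, bound_keys):
    """Keep only the last-known values this CEP node will also receive live."""
    return {
        key: value
        for key, value in last_values.items()
        if routing_key(key[0]) in bound_keys
    }

# Stand-in for the last collected values stored in Redis,
# keyed by (check id, metric name, target host).
last_values = {
    (0, "cpu", "web1"): 0.42,
    (1, "cpu", "web2"): 0.37,
    (2, "disk", "db1"): 0.81,
    (3, "disk", "db2"): 0.66,
}

cep_one = preload(last_values, bound_keys={"check.0", "check.1"})
cep_two = preload(last_values, bound_keys={"check.2", "check.3"})

print(sorted(cep_one))
print(sorted(cep_two))
```

Every metric is preloaded on exactly one node, so no node is left waiting on data it will never be sent.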

At this point though, one of the initial problems I set out to solve was still an issue. Because data streaming in looked good, the CEP won’t emit an all clear for no reason, it has to be bad first, so we had a lot of false alerts hanging out and people being reminded about them. To rectify this, I went into the primary DB, cleared all the alerts with a few updates, and rebooted the notification system so it would no longer see them as an issue. This stopped the reminders and brought us back to a state of peace. And this is where we sit now.

What are the lessons learned and how do we hope to prevent this in the future? Step 1 is, of course, always making sure dev matches production; not just in code, but in data volume and topology. Outside of the CEP setup it does, so we need a few new zones brought into the mix today and that will resolve that. Next, better staging and rollout procedure for this system. We can bring up a new CEP in production, give it a workload but have its events not generate real errors, going forward we will be verifying production traffic like this before a roll out.

Once again, sorry for the false positives. Disaster porn is a wonderful learning experience, and if any of the problems mentioned in this post hit home, I hope it gets you thinking about what changes you might need to be making. For updates on outages or general system information, remember to follow circonusops on Twitter.

Graph Annotations and Events

This feature has been a long time in coming: the ability to annotate your graphs! With the new annotations timeline sitting over the graph, not only can you create custom events to mark points in time, but you can also view alerts and see how they fit (or don’t fit) your metric data.

Annotations Timeline

First, let’s go to a graph and take a look at the annotations timeline to see how it works. When you choose a graph and view it, you will immediately see the new Annotation controls to the left side of the date tools, and the timeline itself will render in between the date tools and the graph itself. The timeline defaults to collapsed mode and by default will only show alerts from metrics on the current graph, so you may have an empty timeline at first. If you take a look at the controls, however, you will see three items: the Annotation menu, the show/hide toggle button, and the expand/collapse toggle button. The show/hide button does just what it says: it shows or hides the timeline. The expand/collapse button toggles between the space-saving collapsed timeline view and the more informative expanded timeline view.

If you open the Annotation menu, you will see a list of all the items you can possibly show in your timeline (or hide from it). Any selections you make here (as well as your show/hide and expand/collapse state changes) will be saved as site-wide user preferences in your current browser. All the items are separated into three groups:

Event Categories

This is a list of all the Event categories under the current account (these are seen and managed in the Events section of the site; we’ll get to that new section in a minute). If you have uncategorized events (due to deleting a category that was still in use), they will appear grouped under the “–” pseudo-category label.

Alerts

By default, the only alerts that will be shown will be alerts of all severity (sev) levels triggered by metrics on the current graph. If you wish, you may also show all alerts, and both categories of alerts may be filtered by sev levels. To do so, click one of the alert labels to expand a sev filter row with more checkboxes.

Text Metrics

This third group is not shown by default, but is represented by the checkbox at the bottom labeled “Include text metrics.” If you check this box, the page will refresh, and any text metrics on the current graph will then be rendered as a part of the timeline (and will be excluded from the graph plot and legend).

Once you have some annotations rendering on the timeline, take a look at the timeline itself. Hovering over a point will show a detail tooltip with the annotation title, date, and description, and hovering over either a point or a line segment will highlight the corresponding date range on the graph itself.

Now for the question on everyone’s minds: “Can I create events here, or do I have to go to the Events section to do that?” The answer is, yes, you can create events straight from the view graph page! To do so, simply use your right mouse button to drag-select a time range on the graph itself. A dialog will then popup for you to input your info and create the event.

Events Section

Now let’s head over to the Events section where you can manage your events and event categories. Simply click on the new Events tab (below the Graphs tab) and you’re there! To create an event, click the standard “+” tab at the upper left of the page. This will give you the New Event dialog. Most of the dialog inputs are pretty straightforward, with the exception of the category dropdown. This is a new hybrid “editable” dropdown input. You may select any of its options if you’d like, or you can add new ones. To add a new option, simply select the last option (it’s labeled “+ ADD Category”). Your cursor will immediately be placed in a standard text input where you can enter your new category. When you’re finished, hit enter to create the new option and have it selected as your category of choice.

After you have created your event, you may need to edit it later. To edit any of its details, simply click on the pertinent detail of the event (when changing the event category, you will see it also has the new hybrid “editable” dropdown input which works exactly like the one in the New Event dialog).

In addition to start and end points (which may be the same date if you don’t want more than a single point), you may also add midpoints to your event. Click the Show details button for an event (the arrow button at the right end of an event row), and you will see the Midpoints list taking up the right half of the event details panel. Simply click the Add Midpoint button to get the New Midpoint dialog where you enter a title, description and choose a date for your point.

The one last element of the Events section that’s good to know about is the Categories menu at the upper right of the page. This allows you to delete categories as well as filter the Events list to only show a single category of events at a time. To do this, just click the name of a category in the Categories menu.

Insights from a Data Center Conference

At the beginning of this month, I attended the Gartner Data Center Conference in Las Vegas, and I want to share some of my impressions and insights from the event.

First, I have to say that I have seldom seen a group of more conscientious conference attendees (aside from Surge, of course, and a physics conference I once attended). Networking breakfasts were busy, sessions were well attended, and both lunch and topic-specific networking gatherings had lively discussions. Each of the Solution Center hours, going well into the evening, was full of people who were not only partaking of the food or giveaways but primarily and voraciously soaking up information from the various exhibitors. Even in hallways during the day, while people were sitting or standing, there was a steady exchange of opinions and information. This is what I saw throughout the conference; attendees there were very serious about learning…from the speakers, vendors, and from their peers. Relatedly, it’s interesting that many organizations outright bar their employees from attending any events in Vegas. While boondoggle may be an appropriate term for some shows in that or any other location, it certainly wasn’t the case with this conference.

Now let’s get to what frequently was foremost on the mind of attendees. I was somewhat surprised to find that this was not something that is usually on the top-ten lists of CIO/IT initiatives. Rather, what repeatedly came out first in terms of attendees’ pressing interest were the interrelated topics of avoiding IT outages and increasing speed of service recovery, along with monitoring to help with both of these.

Granted, this was a datacenter-specific conference so it’s natural that avoidance of and recovery from operational failures is of paramount importance. But note that there are lots of other overarching datacenter initiatives we all hear much more about, such as virtualization, cloud migration, datacenter consolidation. Many of these headline-grabbing topics are certainly both important, and getting done. However, what affects datacenter operations leaders’ daily lives and careers, and so is of primary importance, has not received much if any notice or press.

Why is that? It’s pretty simple. Some of these other initiatives are new. Monitoring has been around seemingly forever, plus (to an extent) outages are taken as being somewhat unavoidable. Yet, while zero failures is indeed not possible, markedly increased reliability is certainly attainable. Look at the historical telecom service provider side, where five-9’s reliability is the expected level of service. When expectations are high, and commensurate investment is made, higher levels are not at all out of reach.
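For a sense of what five-9’s means in practice, the arithmetic is short. The sketch below assumes a 365-day year and counts all downtime against the budget; it is standard availability arithmetic, not a figure quoted at the conference.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of N nines (3 -> 99.9%)."""
    return MINUTES_PER_YEAR / 10 ** nines

for nines in (2, 3, 4, 5):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines} nines: {minutes:.2f} minutes of downtime per year")
```

Five nines leaves a little over five minutes of downtime per year, which is why that level of service only comes with commensurate investment.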

As for monitoring solutions themselves, nowadays you don’t have to be limited to old-school systems. There are young companies, like Circonus, who have a fresh approach that breaks down the silos of stand-alone toolsets of the past.

Let’s take a step back now and visualize what outages look like from a datacenter ops team’s perspective, i.e. what happens when things ‘blow up’ in a datacenter. It’s not external constituents such as clients that directly impact the datacenter for the most part. External clients touch the business units and it’s then the business units which put the heat on the datacenter leaders.

And what about SLA’s for keeping business units apprised of the benefit IT delivers to them? As I heard loud and clear in the conference, internal SLA’s are for the most part useless. Why? Because they don’t mean much to the business units—they’re only interested in “When are you going to get my service back up?!” In other words, this is a variation on, “What have you done for me lately?”

So let’s look at an option for resolution. If the problem occurs on a virtual machine, you just spin up a new instance, right? Wrong, but that’s what usually happens. When a hammer dangling off a shelf hits you on the head, do you replace it with another dangling hammer and think you’ve solved the problem? Obviously, the thing to do in a datacenter is do the work to avoid repetition of the issue—we’re talking root-cause-analysis—otherwise you’re putting out fires repeatedly…the same fires.

Now a good monitoring system is going to help and in several ways. First, as just mentioned, it’s going to assist in identifying the underlying issue, including its location: is it in the app, the database, the server, etc. You don’t want to do that by blindly testing; you’ll want the capability to create graphs on the fly and you similarly want to be able to very easily and quickly do correlations of your metrics.

Okay, so that’s good for remediating a problem along with reducing the chance of it recurring, but you’ll also want to do anticipatory actions like capacity planning to forestall avoidable bottlenecks. For this you also want an easy-to-use tool so that you don’t have to muck around with spreadsheets. And you’ll want to be able to have a ‘play’ function so that when you do things such as code-pushes, you’ll be able to see in real-time the effect of these changes. This way, if the effect of the code-push is negative, you can quickly reverse the action without impacting your internal or external clients.

The good news is that new solutions with all these functionalities are out there in the marketplace. Of course, before you buy one be sure to insist on testing the solution in a trial to see how it performs, in your current and anticipated (read: hybrid physical and virtual/Cloud) environments. This includes seeing how the solution handles your scale, both backend and from a UI perspective. Such an evaluation will require an investment in your time, but the result will be well worth it, in the increased avoidance of outages and speeding up of recovery from them.

Monitoring your Vitals During the Critical Holiday Retail Season

As with Brick & Mortar stores, the Holiday season is a critical time for many E-Commerce sites. Like their off-line brethren, these sites also see large increases in both traffic and revenue, sometimes substantially so. Of course these changes in user behavior don’t just affect E-Commerce sites; consider a social-networking site like Foursquare, where a person might normally check into 3 or 4 places a week. During the Holiday season that might double as they visit more stores and end up eating out more often while rushing between those stores. On an individual basis it doesn’t sound that significant, but if a large percentage of your user base doubles their traffic, you better hope you have planned accordingly.

On the technical side, many sites will actually change their regular development process in order to handle these changes in user behavior. Starting early in November, many sites will stop rolling out new features and halt large projects that might be disruptive to the site or the underlying infrastructure. As focus shifts away from features, most often it turns back towards infrastructure and optimization. Adding new monitoring, from improved logging to new metrics and graphs, becomes critical as you seek to have a comprehensive view of your site’s operations so that you can better understand the changes in traffic that are happening, and hopefully be proactive about solving problems before they turn into outages.

Profiling and optimization work also receives more attention during this time; studies continue to show correlations between page load speeds and website responsiveness to increased revenue, and being able to improve these areas is something that can typically be done without having to change the behavior of how things work. Bugfixes are also a popular target during these times as those corner cases are more likely to show up as traffic increases, especially if you tend to see new users as well as an increase in use by existing users.

This brings us to a good question; just what are you monitoring? For most shops there tend to be standard graphs that get generated for things like disk space or memory usage. These things are good to have, but they only scratch the surface. Your operations staff probably knows all kinds of metrics about the system they need to monitor, but how about your application developers? They should know the code that runs your site inside and out, so challenge them to find key metrics in your application stack that are important for their work. Maybe that’s messages delivered to a queuing system, or the time it takes to process the shipping costs module, or measuring the responsiveness of a 3rd party API like Facebook or Twitter. But don’t stop there; everyone in your company should be asking themselves “what analytics could I use to make better informed decisions?” For example, do you know if your increased traffic is due to new users or existing users? If you are monitoring new user sign ups, this will start to give you some insight. If you are doing E-Commerce, you should also be tracking revenue related numbers. Those types of monitors are more business focused but they are critical to everyone at your company. So much so that at Etsy, a top 100 website commonly known as “the world’s handmade marketplace”, they project these types of metrics right out in public.

Ideally, once you have this type of information being logged, you can collect it for analytical reports and historical trending via graphs. You want to be able to take the data you are collecting and correlate between metrics. Given a 10% increase in new users in the past week, we’ve seen a 15% spike in web server traffic. If we project those numbers out, can we make it through Black Friday? Cyber Monday? Will we make it all the way to New Year’s, or do we need to start provisioning new machines *NOW*? Or what happens if our business model changes, and we are required to live through a “Black Friday” event every day? That’s the kind of challenge that social shopping site Gilt faces, with its daily turnover of inventory. It’s worth saying that you won’t need all of this information in real time, but ideally you’ll be able to get a mix of real-time, near-time (5-minute aggregated data is common), and daily analytical reports. Additionally, you should talk with your operations staff about which of these metrics are mission critical enough that you should be alerting on them, to make sure you have the appropriate operational and organizational focus.
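
To make that kind of projection concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that traffic keeps compounding at the observed 15% per week and that your fleet saturates at a known load; the load and capacity figures are made up, and real growth is rarely this tidy.

```python
import math

def projected_load(current_load, weekly_growth, weeks):
    """Load after a number of weeks of compounding growth."""
    return current_load * (1 + weekly_growth) ** weeks

def weeks_until_capacity(current_load, capacity, weekly_growth):
    """Whole weeks of compounding growth that still fit under capacity.

    current_load and capacity share one unit (e.g. requests per second);
    weekly_growth is a fraction, e.g. 0.15 for 15% per week.
    """
    if current_load >= capacity:
        return 0
    if weekly_growth <= 0:
        return math.inf  # load never grows, so capacity is never reached
    # Largest integer n with current_load * (1 + g)^n <= capacity.
    return math.floor(math.log(capacity / current_load) / math.log(1 + weekly_growth))

# Hypothetical numbers: 4,000 req/s today, fleet saturates at 10,000 req/s.
weeks_left = weeks_until_capacity(4000, 10000, 0.15)
print(weeks_left)
```

With these made-up numbers the fleet has room for six more weeks of 15% growth, so the answer to “can we make it through Black Friday?” depends on how far away Black Friday is.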

While nothing beats preparation, even the best-laid plans need good feedback loops to be successful. Measuring, collecting, analyzing, and acting upon data as it comes into your organization is critical in today’s online environments. You may not be able to predict the future, but having solid monitoring systems in place will help you to recognize problems before they become critical, and help give you a “snowball’s chance” during the holiday season.

Template Web UI

Back in October we released the first version of our new Templating API, allowing you to easily replicate sets of bundles across multiple hosts. Now we bring you the time-saving sweetness of Templates in the web interface as well; if you have multiple servers that you want to monitor in exactly the same way, Templates are your friend. The idea behind them is pretty simple: you choose your master host, and select one or more of its check bundles to be used as master bundles. Then when you select your target hosts, the master bundles are copied and applied to the target hosts.

Creating A Template

three check icons enclosed in a box, representing a template

So let’s look at how the Templating process works. Before you create a template, you first need to ensure that you have your master check bundles set up and active on your master host. Once that’s the case, start by going to the new “data” section of Circonus and visiting the “Templates” tab at the left. Create a new template via the “+” tab (or the “Create A Template” button in the middle of the page if you have no templates yet). In the resulting dialog, type a name for your template and choose a master host, and when you click “OK” you will see the templates table appear with a row for your new template (as usual, click the summary row to view the expanded details of the template).

When you first create a template, it’s in “draft” mode. This means that it’s only saved in your browser’s memory until you apply it. Nothing has been saved to the system yet, and the master bundles haven’t been replicated. This allows you to lay out templates and modify them or discard them before making any changes to the system. If you wish to save changes to a draft you may do so via the “Save” button; the draft is not applied as a regular Template until you click the “Apply” button. To aid in visually scanning the list of Templates for drafts, drafts will always appear at the top of your list, and will always be green. (If at any point you wish to change your template name or master host, you may click them in the summary row to edit them in-place. Please note: when changing your master host, you may only choose among the target hosts currently saved in the Template.)

Choosing Bundles

Once you’ve created your draft, you need to choose your master check bundles. Under the “Check Bundles” section at the left, click “Add Bundle” to bring up the new bundle dialog. All the bundles available for your master host will be shown here in a scrollable list. This is a selectable list, so when you select a bundle, it’s shown as selected in the list until you remove it from the Template. If you have a long list of bundles and are having a difficult time finding the ones you want, you may use the field above the list to filter the shown bundles by a filter string or regular expression (if you’re using a regular expression, don’t include the leading and trailing slashes, just use the desired RegEx syntax). After you have chosen a bundle, you may change its name by clicking on it in the list of chosen bundles. (Please Note: the reserved string “{target}” will be replaced by the current hostname/IP as the bundle is replicated across the target hosts.)
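
As a rough illustration of the two behaviors just described, here is a small Python sketch of filtering a bundle list by a plain string or a regular expression, and of expanding the reserved “{target}” string. The function names and sample bundle names are hypothetical, and details such as case sensitivity in the real dialog are assumptions.

```python
import re

def filter_bundles(names, pattern, use_regex=False):
    """Return the bundle names matching a plain substring or a regular expression."""
    if use_regex:
        # The pattern is written without leading and trailing slashes.
        matcher = re.compile(pattern)
        return [name for name in names if matcher.search(name)]
    return [name for name in names if pattern in name]

def expand_bundle_name(bundle_name, host):
    """Replace the reserved string {target} with the target hostname or IP."""
    return bundle_name.replace("{target}", host)

# Hypothetical bundle names on a master host.
bundles = ["HTTP www", "HTTP api", "Ping", "Disk /var"]
http_only = filter_bundles(bundles, r"^HTTP\s", use_regex=True)

# One replicated name per target host.
replicated = [expand_bundle_name("HTTP on {target}", host)
              for host in ["10.0.0.5", "web2.example.com"]]
print(http_only, replicated)
```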

Choosing Hosts

Choosing your target hosts works mostly the same way as choosing the master check bundles. The “Add Host” button brings up a dialog with a scrollable, selectable, filterable list of available hosts on your account, and you may choose one or more of those hosts. There is an additional feature, however, which is the “Enter a new host” field below the list. This allows you to enter new hosts (either IP addresses or domain names are acceptable) that aren’t currently used on your account. When you enter a new host and hit return/enter, the new host will be subject to a DNS check to ensure that it really exists; if it passes the DNS check, it will then be added to your list of target hosts.
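
The DNS check can be pictured with a short Python sketch. This is not how Circonus implements it; it only assumes that “really exists” means the name or address resolves, and the helper names are hypothetical. The resolver is passed in so the logic can be exercised without touching the network.

```python
import socket

def host_resolves(host, resolver=socket.getaddrinfo):
    """Return True if the host name or IP address resolves."""
    try:
        resolver(host, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: malformed host name.
        return False

def add_target_host(targets, host, resolver=socket.getaddrinfo):
    """Append host to targets only if it is new and passes the resolution check."""
    if host in targets or not host_resolves(host, resolver):
        return False
    targets.append(host)
    return True

targets = ["10.0.0.5"]
added = add_target_host(targets, "127.0.0.1")  # numeric addresses need no lookup
print(added, targets)
```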

Once you’re satisfied with your bundle and host choices, clicking the “Apply” button will replicate your master bundles across each target host and will save the template in the database.

Modifying A Template

the action dropdown select

Once your Template is saved you will see several things change in the details panel. Each bundle and host will get checkboxes, and two “Action” dropdown selects will appear, one above the check bundles list and one above the target hosts list. Now that the bundles and hosts are a part of the template, if you wish to modify or remove them, you will need to check their checkboxes and choose an action from the appropriate dropdown before saving. There are four actions available:

Remove
When used on a bundle, it will delete the target bundles and remove them from the Template. When used on a host, it will delete the host’s bundles and show the host as inactive in the host list.
Unbind
When used on a bundle, it will leave the target bundles in place but will break their synchronization with the template and show them as inactive in the bundle list. When used on a host, it will leave the host’s bundles in place but will break their synchronization with the template and show the host as inactive in the host list.
Deactivate
When used on a bundle, it will deactivate the target bundles and show them as inactive in the bundle list. When used on a host, it will deactivate the host’s bundles and show the host as inactive in the host list.
Restore
When used on a bundle, it will reactivate, rebind, or recreate target bundles as necessary, to restore them to active status and synchronization with the template. When used on a host, it will reactivate, rebind, or recreate the host’s bundles as necessary, to restore them to active status and synchronization with the template.
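
One way to keep the four actions straight is to model their effect on a single target bundle. The Python sketch below is only a mental model built from the descriptions above, not Circonus code; the three flags are hypothetical, and it assumes an unbound bundle keeps collecting data on its own.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    REMOVE = "remove"
    UNBIND = "unbind"
    DEACTIVATE = "deactivate"
    RESTORE = "restore"

@dataclass
class TargetBundle:
    exists: bool = True   # the bundle is present on the target host
    active: bool = True   # the bundle is collecting data
    bound: bool = True    # the bundle is kept in sync with the template

def apply_action(bundle, action):
    """Return the state of a target bundle after a template action."""
    if action is Action.REMOVE:
        return TargetBundle(exists=False, active=False, bound=False)
    if action is Action.UNBIND:
        # Left in place, but no longer synchronized with the template.
        return TargetBundle(exists=bundle.exists, active=bundle.active, bound=False)
    if action is Action.DEACTIVATE:
        return TargetBundle(exists=bundle.exists, active=False, bound=bundle.bound)
    if action is Action.RESTORE:
        # Reactivate, rebind, or recreate as necessary.
        return TargetBundle(exists=True, active=True, bound=True)
    raise ValueError(f"unknown action: {action!r}")

unbound = apply_action(TargetBundle(), Action.UNBIND)
restored = apply_action(apply_action(TargetBundle(), Action.REMOVE), Action.RESTORE)
print(unbound, restored)
```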

Staying In-Sync

re-sync button

After creating and applying a Template, you are still allowed to edit the master check bundles. If you do so, any Templates using those check bundles as master bundles will be out-of-sync. When you go to your Templates page, the out-of-sync Templates will have their sync buttons activated and the buttons will say “Re-Sync.” Simply click the “Re-Sync” button to replicate the bundle changes across all the target bundles, and the Template will be in-sync again.

(Please Note: if at any point you wish to delete the template, any active bundles that are still a part of the template will be deleted from the target hosts. If you wish to keep the bundles on the target hosts but just delete the template, you will need to unbind all the bundles you wish to keep on the target hosts and then delete the template.)

Template API

Setting up a monitoring system can be a lot of work, especially if you are a large corporation with hundreds or thousands of hosts. Regardless of the size of your business, it still takes time to figure out what you want to monitor, how you are going to get at the data, and then to start collecting, but in the end it is very rewarding to know you have insight.

When we launched Circonus, we had an API to do nearly everything that could be done via the web UI (within reason) and expected it to make it easy for people to program against and get their monitoring off the ground quickly. Quite a few customers did just that, but still wanted an easier way to get started.

Today we are releasing the first version of our templating API to help you get going (templating will also be available via the web UI in the near future). With this new API you can create a service template by choosing a host and a group of check bundles as “masters.” Then you simply attach new hosts to the template, and the checks are created for you and deployed on the agents. Check out the documentation for full details.

Once a check is associated with a template, it cannot be changed on its own; you must alter the master check first and then re-sync the template. To re-sync, you just need to GET the current template definition and then POST it back; the system will take care of it from there.
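
The GET-then-POST round trip looks roughly like the Python sketch below. The path, the template ID, and the shape of the definition are all hypothetical (the documentation has the real ones), and the HTTP calls are replaced by stand-in functions so the sketch runs without network access or credentials.

```python
def resync_template(template_id, get, post):
    """Re-sync a template by fetching its definition and posting it back unchanged.

    get(path) returns the template definition as a dict;
    post(path, payload) sends it back. Both are supplied by the caller.
    """
    path = f"/template/{template_id}"  # hypothetical path, not the documented one
    definition = get(path)
    return post(path, definition)

# Stand-in transport: an in-memory store instead of real HTTP requests.
store = {"/template/42": {"name": "web servers", "hosts": ["10.0.0.5"]}}
posted = []

def fake_get(path):
    return dict(store[path])

def fake_post(path, payload):
    posted.append((path, payload))
    return payload

result = resync_template(42, fake_get, fake_post)
print(result)
```

In real use, get and post would wrap whatever HTTP client and authentication you already use for the API.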

To remove bundles or hosts, just remove them from the JSON payload before POSTing, and choose a removal method. Likewise, to add a host or bundle back to a template, just add it into the payload and then POST. We offer a few different removal and reactivation methods to make it easy to keep or remove your data and to start collecting it again. These methods are documented in the notes section of the documentation.
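
Editing the payload before POSTing might look like the following Python sketch. The field names ("hosts", "host_removal_method") and the "unbind" value are assumptions made for illustration; check the notes section of the documentation for the actual names and the available removal and reactivation methods.

```python
import copy

def without_hosts(definition, hosts_to_drop, removal_method):
    """Return a copy of a template definition with some hosts removed."""
    updated = copy.deepcopy(definition)
    updated["hosts"] = [h for h in updated["hosts"] if h not in hosts_to_drop]
    updated["host_removal_method"] = removal_method  # hypothetical field name
    return updated

def with_host(definition, host):
    """Return a copy of a template definition with a host added (back)."""
    updated = copy.deepcopy(definition)
    if host not in updated["hosts"]:
        updated["hosts"].append(host)
    return updated

# Hypothetical definition as returned by a GET.
current = {"name": "web servers", "hosts": ["10.0.0.5", "10.0.0.6"]}
trimmed = without_hosts(current, {"10.0.0.6"}, "unbind")
readded = with_host(trimmed, "10.0.0.6")
print(trimmed, readded)
```

Working on copies keeps the definition you fetched intact, which makes it easy to compare what you are about to POST against what the system currently has.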

Future plans for templates include syncing rules across checks and adding templated graphs so that adding a new host will automatically add the appropriate metrics to a graph. Keep an eye on our change log for enhancements.

One Dashboard to Rule Them All

four icons representing a dashboard

Ever dream of having a systems monitoring dashboard that was actually useful? One where you could move things around, resize them, and even choose what information you wanted to display? Large enterprise software packages may have decent dashboards, but what if you’re not a large enterprise or you don’t want to pay an arm and a leg for bloatware? Perhaps you have a good dashboard that came with a specific server or piece of hardware, but it’s narrowly focused and inflexible. You’ve probably thought about (or even tried) creating your own dashboard, but it’s a significant undertaking that’s not for the faint of heart. What’s the solution? Should we just learn to live with sub-optimal monitoring tools?

Here at Circonus, we decided that this was one problem we could eliminate. Since we’ve built a SaaS offering that’s flexible enough to handle multiple different data sources, why shouldn’t we build a dashboard that’s flexible enough to display them? So we created a configurable dashboard that lets you monitor your data however you want. Do you want to show graphs side-by-side but at different sizes? Done. Want an up-to-date list of alerts beside those graphs? Easy. How about some real-time metric charts that automatically refresh? No problem. Our new configurable dashboards allow you to add all these items and more. Let’s dig in and see how these new dashboards work.

Dashboard Basics

Start by going to the standard ‘Dashboard’ and clicking the new ‘My Dashboards’ tab. These dashboards are truly yours; any dashboards you create are only visible to you (by default) and are segregated by account. If you want to share a custom dashboard with everyone else on an account, check that dashboard’s ‘share’ checkbox in your list of custom dashboards.

After you have created a custom dashboard, you may set it to be your default dashboard by using the radio buttons down the left side of your custom dashboards list. If you do this, you will be greeted with your selected dashboard when you log in to Circonus. By selecting the ‘Standard Circonus Dashboard’ as your default dashboard, you will revert to being greeted with the old dashboard you’re already used to seeing.

part of the interface for creating a new dashboard layout

To create a new custom dashboard, click the ‘+’ tab and choose a layout. At first you will see only a couple of predefined layouts available, but after you create a dashboard, its layout will then be available to choose when creating other new dashboards.

Now a note about working with these dashboards: every action auto-saves, so you never have to worry about losing changes you’ve made. However, if you haven’t given your dashboard a title, the dashboard isn’t permanently saved yet. If you forget to title your dashboard and go off to do other things, don’t worry; the dashboard you created is saved in your browser’s memory. All you have to do is visit the ‘My Dashboards’ page and your dashboard will be listed there. With two clicks you can give your dashboard a title and save it permanently. (Please note our minimum browser requirements, Firefox 4+ or Chrome, which are especially applicable for these new custom dashboards, since we’re using some features which are not available in older browsers.)

So let’s create a dashboard. Choose a layout, click ‘Create Dashboard,’ and you will be taken to the new dashboard with the ‘Add A Widget’ panel extended. To begin, let’s check out the title area. Notice that when you hover over the title, a dropdown menu appears. This lists your other dashboards on the current account (as well as dashboards shared by other account members) and is useful for quickly switching between dashboards.

the dashboard interface showing the dashboard controls icons

To the right of the title are some icons. The first icon opens the grid options dialog, which lets you change the dimensions of the dashboard grid, hide the grid (it’s still active and usable, though), enable or disable text scaling, and choose whether or not to auto-hide the title bar in fullscreen mode. The second icon toggles fullscreen mode on and off. Once you enter fullscreen mode a third icon will appear, and this icon toggles the ‘Black Dash’ theme (this theme is only available in fullscreen mode). The current states of both fullscreen mode and the ‘Black Dash’ theme are saved with your dashboard.

One other note about the dashboard interface: if you leave a dashboard sitting for more than ten or fifteen seconds and notice that parts of the interface disappear (along with the mouse cursor), don’t worry, it’s just gone to sleep! A move of the mouse will make everything visible again. (If there are any widget settings panels open, though, the sleep timer will not activate.)

Widgets

Now for the meat of it all: widgets. We currently have ten widgets which can be added to the dashboard grid to show various types of data, and we’ll be adding more widget types and contents in the future. Following is a quick rundown of the currently available widgets:

Graph

Graph widgets let you add existing graphs to your dashboard. You may choose any graph from the “My Graphs” section under your current account. Graph widgets are refreshed every few minutes to ensure they’re always up-to-date.

Beacon Map

Map widgets let you add existing Beacon maps to your dashboard. You may choose any map query from the “Beacons” page (under the “Checks” section of your current account). Map widgets are updated in real-time.

Beacon Table

Table widgets let you add existing Beacon tables to your dashboard. You may choose any table query from the “Beacons” page (under the “Checks” section of your current account). Table widgets are updated in real-time.

Chart

Chart widgets let you select multiple metrics to monitor and compare in a bar or pie chart. Chart widgets are updated in real-time.

Gauge

Gauge widgets let you monitor the current state of a single numeric metric in a graphical manner, displaying the most recent value on a bar gauge (dial gauges are coming soon). Gauge widgets are updated in real-time.

Status

Status widgets let you monitor the current state of one or more metrics, displaying the most recent value with custom formatting. This is most useful for text metrics, but it may be used for numeric metrics as well. Status widgets are updated in real-time.

HTML

HTML widgets let you embed arbitrary HTML content on your dashboard. They can be used for just about anything, from displaying a logo or graphic to using an iframe to embed more in-depth content. Everything is permissible except JavaScript. HTML widgets are refreshed every few minutes to ensure they’re always up-to-date.

List

List widgets let you add lists of graphs and worksheets to your dashboard, ordered by their last modified date. You may specify how many items to list and (optionally) a search string to limit the list. List widgets are refreshed every few minutes to ensure they’re always up-to-date.

Alerts

Alerts widgets let you monitor your checks by showing the most recent alerts on your current account. You may filter the alerts by their age (how long ago they occurred), by particular search terms, by severity levels, or other status criteria. Alerts widgets are refreshed every few minutes to ensure they’re always up-to-date.

Admin

Admin widgets let you monitor selected administrative information, including the status of all Circonus agents on your current account. Admin widgets are refreshed every few minutes to ensure they’re always up-to-date.

icons representing some of the current widget types

To add widgets to the dashboard grid, there are two methods: you may use the ‘drag-and-drop’ method (dragging from the “Add a Widget” panel), or you may first click the target grid cell and then select the widget you want to place there. (Note: in fullscreen mode only the latter method is available.) After a widget has been added, some types of widgets will automatically activate with default settings, but most will be inactive. If the widget is inactive, click it to open the settings panel and get started. Once the widget is activated, the settings panel is available by clicking the settings icon in the upper right corner of the widget. In the lower right corner of the widget is the resize handle, so you can resize the widget as frequently as you want. And let’s not forget being able to rearrange the widgets: every widget has a transparent ‘title bar’ at its top which you can use to drag it around. I won’t get into the details of settings for every type of widget, because they should be self-explanatory (and that would make this one super-long blog post). But suffice it to say, there are plenty of options for everyone.

We’ve been working hard to create a configurable dashboard that will be as flexible as Circonus itself is, and we believe we’ve hit pretty close to the mark. Here’s a sample dashboard showing the power of these new dashboards:

dashboard grid with several rectangular graph, chart, alerts and status widgets arranged in a grid