Ways to Collect Systems Data in Circonus

When you decide to monitor your systems with Circonus, there are quite a few options for collecting your metrics. We believe Circonus should be a tool that does what you need, when you need it. Circonus does not force you into a specific approach or method. Since there are so many different ways to gather telemetry via Circonus, we thought we would take a moment to outline some of the different approaches.

In addition to application-specific checks, you might like to get baseline information about things like memory, CPU, file systems, and interfaces of your servers and network equipment. We’ve listed out the main options that can be used for system performance metrics below, along with a brief description and our recommendations for each. We’ve roughly ordered these based on our best practices, but the tool that should be used depends on many variables that you’ll need to take into account.

For instance, some users may prefer to use a single agent on all of their devices, which may mean that some options won’t be available. Available plugins and ability to expand should also be considered. Some agents allow Circonus to reach out to the endpoint and gather metrics, while others require the data to be pushed (these agents mention push requirements in the description below). In some cases, the language that the agent was written in can have an effect on your decision.

Standard protocols

SNMP – SNMP is a standard that has been around for years, and allows monitoring of many types of network equipment, servers, and appliances. There is a good chance you already have SNMP configured on most of your hosts, which would significantly lower the up-front setup time. You’ll need to know the OIDs you want to monitor, but check bundle templates can make this process a little easier for you.

HTTPTrap – Circonus can accept JSON payloads via an HTTP PUT or POST request. This data is not polled regularly from the Circonus Broker, but is pushed to the Broker from the monitored target. This is the easiest way to get arbitrary data into Circonus, but you’ll have to figure out where to get the data.
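To make the push model a little more concrete, here is a minimal Python sketch that wraps a dictionary of metrics in an HTTP PUT request with a JSON body. The URL, the metric names, and the helper name build_trap_request are placeholders made up for illustration; the real submission URL comes from your own HTTPTrap check.

```python
import json
import urllib.request

# Placeholder URL (an assumption): substitute the submission URL
# of your own HTTPTrap check.
TRAP_URL = "https://broker.example.com/your-httptrap-url"


def build_trap_request(metrics, url=TRAP_URL):
    """Wrap a dict of metrics in an HTTP PUT request with a JSON body."""
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical metric names and values.
    request = build_trap_request({"queue_depth": 12, "build_status": "ok"})
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually push the data: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch easy to inspect; the monitored host decides when to call urlopen, which is the point of a push-style check.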

Third-Party agents

collectd – Collectd is a lightweight C-based tool that has a variety of plugins available for data collection. There are two main ways to use collectd with Circonus: push the information from your device over UDP (similar to statsd and HTTP Traps), or send it via the write_http plugin.

Gollector – Gollector is a new monitoring agent that relies on the proc filesystem and C POSIX calls such as sysconf to determine your machine’s profile. This avoids the performance penalty that some other agents incur by shelling out to external commands to do their collection work.

NRPE – Circonus can utilize existing NRPE checks from your Nagios or Icinga installation. NRPE allows you to remotely call Nagios scripts to collect information. If you want to monitor a non-standard metric, there’s probably a Nagios script for it.

statsd – Similar to an HTTP Trap, statsd allows your hosts to send information to Circonus Enterprise Brokers, rather than the Broker reaching out to the host to poll it. One downside is that this information cannot be played in real-time, but it can be useful for metrics that may not have regular intervals of available information or are particularly high volume.
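For a feel of what pushing a statsd metric looks like, here is a small Python sketch using the standard statsd line format (name:value|type) over UDP. The metric name, host, and port are assumptions for illustration (8125 is the conventional statsd port); point it at wherever your broker is listening.

```python
import socket


def format_statsd(name, value, metric_type="c"):
    """Format one metric in the statsd line format: name:value|type.

    Common types are "c" (counter), "g" (gauge), and "ms" (timer).
    """
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one statsd line over UDP (fire-and-forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name.
    line = format_statsd("checkout.orders", 1, "c")
    print(line)
    # Uncomment to actually send the datagram:
    # send_statsd(line)
```

Because UDP is fire-and-forget, the sender never waits on the collector, which is why this style suits high-volume or irregular metrics.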

Internally-developed agents

nad – Nad is a lightweight, simply managed host agent written in Node.js. Nad is the first choice of Circonus due to its easy extensibility and its ability to work on almost any platform, including Windows, RHEL, Ubuntu, and illumos derivatives. Nad comes with enough plugins to let you monitor any of the basics, while allowing you to add your own checks to fit your environment.

Resmon – Resmon is a Perl-based agent created by OmniTI. New modules can be created quickly and easily, but must be written in Perl; that’s a make-or-break factor for many.

Windows Agent – If you’d rather not use nad on your Windows servers, there is a Windows agent that can be used to collect performance metrics from Windows servers.

Which is right for me?

The choice of agent to use depends on many factors. Current operating system, existing monitoring setup, and network layout can all have an effect on which agent you choose. You may also need to incorporate several choices in order to best monitor your environment.

That covers the main ways to get system information into Circonus. There are plenty of other methods of getting data, such as Google Analytics, a variety of database connections, Memcached, Varnish, NewRelic, and more. A combination of these collection types can enable you to have data on every piece of your infrastructure, so you can always find the information you need.

Tags: A Long Time Coming

Ok, we know a lot of you have been asking for tags in Circonus for a long time. Well, they’re finally here! The tags feature is currently in beta, and will be released to all customers very soon. (Tags have actually been available to API users for a while, just without UI support in the application.) Let’s jump right in and I’ll give you a quick overview of how tags will work in Circonus.

First Things First: What’s Taggable?

For this initial implementation of tags, you can tag Graphs, Worksheets, Check Bundles, Templates, and Maintenance Windows. You will also see tags on some other pages, such as Alerts, Rulesets, Hosts, and Metrics, but these items aren’t taggable. The tags you see on those pages are inherited from the associated Check Bundles.

In the near future, we’ll be adding to these lists. We are planning on making Annotations and Dashboards taggable, and have some other unique ways we’re planning on using tags to help categorize and group items in the UI.

So, How Does This Work?

First, you’ll need to add tags to some items. All tags are categorized in the system, but if you don’t want to categorize your tags, that’s ok. Simply use the “uncategorized” category for all your tags and the tags will be presented solo (without category) throughout the UI. We have a couple of categories which are already created in the system, but you can create as many custom categories as you wish.

Let’s go to one of the list pages containing taggable items (e.g. the Checks list page) and look for the tags toolbar under an item (it will have an outlined tag with a plus icon). Click the “+” tag to open the “Add Tag” dialog. First choose a category or use the “+ ADD Category” option to enter a new one, then the tags dropdown will be populated with the tags under that category. Choose a tag or enter a new one by choosing the “+ ADD Tag” option, then use the “Add Tag +” button to add the tag to the item.


When the tag is added to the UI, you’ll notice right away that each tag category has its own color. There is a limited set of pre-selected colors which will be assigned to categories as they are created. These particular colors have been chosen to maximize your eye’s ability to distinguish the categories at a glance, and also because they work well under both light and dark icons. You’ll also notice that the tag you added has its own icon. There’s a set of twelve icons which will be assigned to the first twelve tags in each category. Once a category has twelve tags, any further tags added to that category will receive blank icons. This system of colors and icons will create fairly unique combinations that should help you recognize tags at a glance without needing to read the tag every time. Note: taggable items can have unlimited tags.

After you add a tag to an item, you’ll also notice that a small set of summary tags is added (usually off to the right of the item). This shows the first few tags on the item, providing a way for you to quickly scan down the page and get a glimpse of the tags that are assigned to each item on the page.


One more note about tags and categories. Although you select them separately in the UI, when using the API the categories and tags are joined with a colon (“:”) as the separator. So a tag “windows” in the category of “os” would be represented as “os:windows” in the API.
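If you script against the API, a pair of tiny helpers can handle the colon-joined form. This is only a sketch; the function names are made up for illustration, and it assumes the category itself contains no colon.

```python
def to_api_tag(category, tag):
    """Join a category and a tag into the colon-separated API form."""
    return f"{category}:{tag}"


def from_api_tag(api_tag):
    """Split an API tag into (category, tag) at the first colon."""
    category, separator, tag = api_tag.partition(":")
    if not separator:
        raise ValueError(f"no ':' separator in tag {api_tag!r}")
    return category, tag


if __name__ == "__main__":
    print(to_api_tag("os", "windows"))
    print(from_api_tag("os:windows"))
```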

Tag Filtering

The power of tags is apparent once you start using tag filters. Look in the upper right corner of the page and you’ll see an outlined tag with a funnel icon and beside it a similar menu button. These are for setting tag filters and saving tag filter sets for easy application later. Click the funnel tag to open the “Tag Filters” dialog, and click the “Add +” button to add a filter to the dialog. You may add as many filters as you wish, and in each one all you have to do is choose a category and tag from the choices (you may not enter new tags or categories here; these are simply the tags you’ve already added to the system). Use the “x” buttons at the right to remove filters, or use the “Clear” button to remove all filters and start with a clean slate. Note: none of your changes in this dialog are applied until you click the “Apply” button. After clicking “Apply,” the page will refresh and you’ll see your newly applied filters at the top of the page. You can then use the “Tag Filters” dialog to change these filters, or you can use the menu button on the right to open the “Tag Filter Sets” dialog, where you may save & apply sets of tag filters for easy switching.


One important feature to note is the “sticky” checkbox that appears when you have one or more tag filters applied. By default (with “sticky” turned off), the tag filters you apply are only visible in the current tab. If you close the tab or open a new one, it will not retain the current tag filters. The benefit of this is that we’ve developed a system to allow you to have multiple concurrent tag filter views open side-by-side. So with the “sticky” setting off, you can open several tabs, use different tag filters in all of them, and each tab will retain its own tag filters as you navigate Circonus in that tab. If at any point you turn the “sticky” setting on, the tag filters from that tab will be applied universally and will override all the other tabs. And not only are “sticky” tag filters applied across all tabs, they’re remembered across all of your user sessions, so they will remain applied until you choose to change or remove them.


Host Grouping

One unique feature we’ve already completed is Host Grouping. Head on over to the Hosts page and open the “Layout Options” by clicking on the grid icon at the right side of the page. You’ll see a new option labeled “Group By Tag Category.” If you choose a tag category there, the page will reorganize itself. You’ll now see a subtitle for each tag in the selected category, and under each subtitle you’ll see the Hosts which have Check Bundles with that tag. Because each Host can have many tags, including more than one tag in the same category, you may see a Host appear in more than one group. At the bottom of the page you’ll also see a grouping subtitled “Not in Category.” Under this group you’ll see all the Hosts which don’t have any Check Bundles with tags in the chosen category.
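The grouping rules described above can be sketched in a few lines of Python. This is an illustration of the logic only, not how the Hosts page is implemented; the data shape (each host mapped to its Check Bundles' tags in category:tag form) and all host and tag names are assumptions.

```python
from collections import defaultdict

NOT_IN_CATEGORY = "Not in Category"


def group_hosts_by_category(host_bundle_tags, category):
    """Group hosts by the tags their check bundles carry in one category.

    host_bundle_tags maps a host name to a list of tag lists, one list
    per check bundle, with each tag written as "category:tag".
    """
    prefix = category + ":"
    groups = defaultdict(set)
    for host, bundles in host_bundle_tags.items():
        tags = {
            tag[len(prefix):]
            for bundle in bundles
            for tag in bundle
            if tag.startswith(prefix)
        }
        if not tags:
            groups[NOT_IN_CATEGORY].add(host)
        for tag in tags:
            # A host with several tags in the category lands in several groups.
            groups[tag].add(host)
    return {name: sorted(hosts) for name, hosts in groups.items()}


if __name__ == "__main__":
    # Hypothetical hosts and tags.
    hosts = {
        "web1": [["os:linux", "role:web"], ["os:linux"]],
        "db1": [["os:linux"], ["os:windows"]],
        "sw1": [["role:network"]],
    }
    print(group_hosts_by_category(hosts, "os"))
```

In the example, db1 shows up under both linux and windows, and sw1 falls into the "Not in Category" group because none of its Check Bundles carry an os tag.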


Fault Detection: New Features and Fixes

One of the trickier problems when detecting faults is detecting the absence of data. Did the check run and not produce data? Did we lose connection and miss the data? The latter problems are where we lost a bit of insight, which we sought to correct.

The system is down

A loss of connection to the broker happens for one of two reasons. First, the broker itself might be down, the software restarted, machine crashed, etc. Second, there was a loss of connectivity in the network between the broker and the Circonus NOC. Note that for our purposes, a failure in our NOC would look identical to the broker running but having network problems.

Let’s start with a broker being down. Since we aren’t receiving any data, it looks to the system like all of the metrics just went absent. In the event that a broker goes down, the customer owning that broker would be inundated with absence alerts.

Back in July, we solved this by adding the ability to set a contact group on a broker. If the broker disconnects, you will get a single alert notifying you that the broker is down. While the broker is disconnected, the system automatically puts all metrics on the broker into an internal maintenance mode. When it reconnects, we flip them out of maintenance and then ask for a current state of the world, so anything that is bad will alert. Note that if you do not set a contact group, we have no way to tell you the broker is disconnected, so we will fall back to not putting metrics in maintenance and you will get paged about each one as it goes absent. Even though this feature isn’t brand new, it is worth pointing out.

Can you hear me now?

It is important to know a little about how the brokers work. When a broker restarts, all the checks configured on it are scheduled to run within the first minute; after that they follow the normal frequency settings. To this end, when we reestablish connectivity with a broker, we look at the internal uptime monitor. If it is >= 60 seconds, we know all the checks have run and we can again use the data for alerting purposes.

This presented a problem when an outage was caused by a network interruption or a problem in our NOC. Such a network problem happened late one night and connections to a handful of brokers were lost temporarily. When they came back online, because they had never restarted we saw the uptime was good and immediately started using the data. This poses a problem if we reconnected at the very end of an absence window. A given check might not run again for 1 – 5 minutes, so we would potentially trigger absences, and then recover them when the check ran.

We made two changes to fix this. First, we now have two criteria for a stable / connected broker:

  • Uptime >= 60 seconds
  • Connected to the NOC for >= 60 seconds

Since the majority of the checks run every minute, this meant that we would see the data again before declaring the data absent. This, however, doesn’t account for any checks with a larger period. To that end, we changed the absence alerting to first check to see how long the broker has been connected. If it has been connected for less than the absence window length, we push out the absence check to another window in order to first ensure the check would have run. A small change but one that took a lot of testing and should drastically cut down on false absence alerts due to network problems.
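Here is a rough Python sketch of the two rules just described: the stable-broker criteria and the deferral of absence checks. It is an illustration of the logic, not our actual implementation; the function and parameter names are invented for the example.

```python
STABLE_SECONDS = 60  # both thresholds from the criteria above


def broker_is_stable(uptime_s, connected_s):
    """A broker is stable once it has been up AND connected for 60 seconds."""
    return uptime_s >= STABLE_SECONDS and connected_s >= STABLE_SECONDS


def should_evaluate_absence(uptime_s, connected_s, absence_window_s):
    """Return True if an absence check may run now.

    False means the absence check is pushed out to a later window so the
    check has had a chance to run since the broker reconnected.
    """
    if not broker_is_stable(uptime_s, connected_s):
        return False
    return connected_s >= absence_window_s


if __name__ == "__main__":
    # Broker never restarted (long uptime) but reconnected 2 minutes ago;
    # the metric has a 5-minute absence window, so the check is deferred.
    print(should_evaluate_absence(86400, 120, 300))
    # Connected for a full window: safe to evaluate.
    print(should_evaluate_absence(86400, 300, 300))
```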

Dashboards: Redux (or What to Look for in a Performance Monitoring Dashboard)

Last autumn we launched our customizable dashboards for Circonus, and we happen to think they’re pretty sweet. In this post, I’m not going to get into specifics about our dashboards (for more on that, you can check out my previous post, “One Dashboard to Rule Them All”), but instead I’ll talk more generally about what you should look for in your performance monitoring dashboard of choice.

Your dashboard shouldn’t limit its focus to technical data; it should help you track what really matters: business success.

A lot of data analysis done today is technical analysis for technical benefit. But the real value comes when we are able to take this expertise and start empowering better business decisions. As such, a performance monitoring dashboard which is solely focused on networks, systems, and applications is limiting because it doesn’t address what really matters: business.

While your purpose for monitoring may be to make your company’s web business operate smoothly, you can influence your overall business through what you operate and control, including releases, performance, stability, computing resources, networking, and availability. Thus, your dashboard should be designed to enable this kind of cross-pollination. By understanding which of your business metrics are critical to your success, you will be able to effectively use a dashboard to monitor those elements that are vital to your business.

Your dashboard should be able to handle multiple data sources.

There are many technologies in use across the web today. Chances are good that you have many different data sources across your business, so you need a dashboard that can handle them. It’s no good for a dashboard to only be able to gather part of your business data, because you’ll be viewing an incomplete picture. You need a dashboard that can handle all of your data sources, preferably on a system that’s under active development and continues to integrate the best new technologies coming down the pike.

Your dashboard should provide access to real-time data.

The value of real-time data should not be underestimated; having real-time data capabilities on your dashboard is critical. Rather than requiring you to hit the refresh button, it should use real-time data to show you what is going on right now. Having this up-to-date picture makes your analysis of the data more valuable because it’s based on what’s happening in the moment. Some sectors already embracing this type of real-time analysis include finance, stock trading, and high-frequency trading.

Your dashboard should provide visualizations to match different types of data.

Your dashboard should provide different visualizations, because the visualization method you choose should fit the data you’re displaying. It’s easy to gravitate towards the slickest, shiniest visualizations, but they don’t always make the most sense for your data.

One popular visualization design is the rotary needle (dial) gauge. Gauges look cool, but they can be misleading if you don’t know their limits. Also, because of their opaque nature, the picture they give you of the current state is without context. Gauges can be great for monitoring certain data like percentages, temperature, power per rack, or bandwidth per uplink, but visualizations like graphs are generally better because they can give you context and history in a compact space. Graphs not only show you what’s going on now but also what happened before, and they allow you to project historic data (e.g. last day/week) alongside current data or place multiple datasets side-by-side so you can compare trends.

It’s also easy to forget that sometimes you may not need a visualization at all. Some data is most easily understood in text form (perhaps formatted in a table). Your dashboard should provide different ways of viewing data so you can choose the best method for your particular data sets.

Your dashboard’s interface shouldn’t be over-designed.

Designers tend to show off their design chops by creating slick, shiny user interfaces for their dashboards, but these are frequently just eye-candy and can get in the way of “scannability.” You need to be able to understand your dashboard at a glance, so design should stay away from being too graphics-heavy and should not have too much information crammed into tiny spaces. These lead to visual clutter and make you have to “work too hard” whenever you try to read your dashboard. The design should help you focus on your data, not the interface.

Everybody’s idea of a “perfect dashboard” will vary somewhat, but by following these guidelines you will be well on your way to selecting a dashboard that lets you monitor your data however you want. Remember, the goal is informed, data-driven decision-making, and it’s not unreachable.

Monitoring your Vitals During the Critical Holiday Retail Season

As with Brick & Mortar stores, the Holiday season is a critical time for many E-Commerce sites. Like their off-line brethren, these sites also see large increases in both traffic and revenue, sometimes substantially so. Of course these changes in user behavior don’t just affect E-Commerce sites. Consider a social-networking site like Foursquare, where a person might normally check into 3 or 4 places a week; during the Holiday season that might double as they visit more stores and end up eating out more often while rushing between those stores. On an individual basis it doesn’t sound that significant, but if a large percentage of your user base doubles their traffic, you had better hope you have planned accordingly.

On the technical side, many sites will actually change their regular development process in order to handle these changes in user behavior. Starting early in November, many sites will stop rolling out new features and halt large projects that might be disruptive to the site or the underlying infrastructure. As focus shifts away from features, most often it turns back towards infrastructure and optimization. Adding new monitoring, from improved logging to new metrics and graphs, becomes critical as you seek to have a comprehensive view of your site’s operations so that you can better understand the changes in traffic that are happening, and hopefully be proactive about solving problems before they turn into outages.

Profiling and optimization work also receives more attention during this time. Studies continue to show that page load speed and website responsiveness correlate with increased revenue, and improving these areas can typically be done without changing how things work. Bugfixes are also a popular target during these times, as those corner cases are more likely to show up as traffic increases, especially if you tend to see new users as well as an increase in use by existing users.

This brings us to a good question: just what are you monitoring? For most shops there tend to be standard graphs that get generated for things like disk space or memory usage. These things are good to have, but they only scratch the surface. Your operations staff probably knows all kinds of metrics about the systems they need to monitor, but how about your application developers? They should know the code that runs your site inside and out, so challenge them to find key metrics in your application stack that are important for their work. Maybe that’s messages delivered to a queuing system, or the time it takes to process the shipping costs module, or the responsiveness of a 3rd party API like Facebook or Twitter. But don’t stop there; everyone in your company should be asking themselves, “What analytics could I use to make better informed decisions?” For example, do you know if your increased traffic is due to new users or existing users? If you are monitoring new user sign ups, this will start to give you some insight. If you are doing E-Commerce, you should also be tracking revenue related numbers. Those types of monitors are more business focused but they are critical to everyone at your company. So much so that at Etsy, a top 100 website commonly known as “the world’s handmade marketplace”, they project these types of metrics right out in public.

Ideally once you have this type of information being logged, you can collect the information for analytical reports and historical trending via graphs. You want to be able to take the data you are collecting and correlate between metrics. Given a 10% increase in new users in the past week, we’ve seen a 15% spike in web server traffic. If we project those numbers out, can we make it through Black Friday? Cyber Monday? Will we make it all the way to New Year’s, or do we need to start provisioning new machines *NOW*? Or what happens if our business model changes, and we are required to live through a “Black Friday” event every day? Those are the kinds of challenges that social shopping site Gilt faces, with its daily turnover of inventory. It’s worth saying that you won’t need all of this information in real time, but ideally you’ll be able to get a mix of real time, near-time (5 minutes aggregated data is common), as well as daily analytical reports. Additionally you should talk with your operations staff about which of these metrics are mission critical enough that you should be alerting on them, to make sure you have the operational and organizational focus that is appropriate.
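As a back-of-the-envelope illustration of "projecting those numbers out", here is a small Python sketch. It assumes, purely for the sake of the example, that the 15% weekly traffic increase keeps compounding week over week; the current load and capacity figures are hypothetical, and real traffic rarely grows this neatly.

```python
def weeks_until_capacity(current_load, capacity, weekly_growth):
    """Count the whole weeks of compound growth that still fit under capacity.

    weekly_growth is a fraction, e.g. 0.15 for 15% per week.
    """
    if weekly_growth <= 0:
        raise ValueError("weekly_growth must be positive")
    weeks = 0
    load = current_load
    while load * (1 + weekly_growth) <= capacity:
        load *= 1 + weekly_growth
        weeks += 1
    return weeks


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 requests/sec today, servers top out at 1,500.
    print(weeks_until_capacity(1000, 1500, 0.15))
```

With those made-up numbers the answer is two weeks (1,150, then roughly 1,322, then over the limit), which is the kind of result that tells you whether provisioning can wait.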

While nothing beats preparation, even the best laid plans need good feedback loops to be successful. Measuring, collecting, analyzing, and acting upon data as it comes into your organization is critical in today’s online environments. You may not be able to predict the future, but having solid monitoring systems in place will help you to recognize problems before they become critical, and help give you a “snowball’s chance” during the holiday season.

One Dashboard to Rule Them All

four icons representing a dashboard

Ever dream of having a systems monitoring dashboard that was actually useful? One where you could move things around, resize them, and even choose what information you wanted to display? Large enterprise software packages may have decent dashboards, but what if you’re not a large enterprise or you don’t want to pay an arm and a leg for bloatware? Perhaps you have a good dashboard that came with a specific server or piece of hardware, but it’s narrowly-focused and inflexible. You’ve probably thought about (or even tried) creating your own dashboard, but it’s a significant undertaking that’s not for the faint-of-heart. What’s the solution? Should we just learn to live with sub-optimal monitoring tools?

Here at Circonus, we decided that this was one problem we could eliminate. Since we’ve built a SaaS offering that’s flexible enough to handle multiple different data sources, why shouldn’t we build a dashboard that’s flexible enough to display them? So we created a configurable dashboard that lets you monitor your data however you want. Do you want to show graphs side-by-side but at different sizes? Done. Want an up-to-date list of alerts beside those graphs? Easy. How about some real-time metric charts that automatically refresh? No problem. Our new configurable dashboards allow you to add all these items and more. Let’s dig in and see how these new dashboards work.

Dashboard Basics

Start by going to the standard ‘Dashboard’ and clicking the new ‘My Dashboards’ tab. These dashboards are truly yours; any dashboards you create are only visible to you (by default) and are segregated by account. If you want to share a custom dashboard with everyone else on an account, check that dashboard’s ‘share’ checkbox in your list of custom dashboards.

After you have created a custom dashboard, you may set it to be your default dashboard by using the radio buttons down the left side of your custom dashboards list. If you do this, you will be greeted with your selected dashboard when you login to Circonus. By selecting the ‘Standard Circonus Dashboard’ as your default dashboard, you will revert to being greeted with the old dashboard you’re already used to seeing.

part of the interface for creating a new dashboard layout

To create a new custom dashboard, click the ‘+’ tab and choose a layout. At first you will see only a couple predefined layouts available, but after you create a dashboard, its layout will then be available to choose when creating other new dashboards.

Now a note about working with these dashboards: every action auto-saves so you never have to worry about losing changes you’ve made. However, if you haven’t given your dashboard a title, the dashboard isn’t permanently saved yet. If you forget to title your dashboard and go off to do other things, don’t worry: the dashboard you created is saved in your browser’s memory. All you have to do is visit the ‘My Dashboards’ page and your dashboard will be listed there. With two clicks you can give your dashboard a title and save it permanently. (Please note our minimum browser requirements, Firefox 4+ or Chrome, which are especially applicable for these new custom dashboards, since we’re using some features which are not available in older browsers.)

So let’s create a dashboard. Choose a layout, click ‘Create Dashboard,’ and you will be taken to the new dashboard with the ‘Add A Widget’ panel extended. To begin, let’s check out the title area. Notice that when you hover over the title, a dropdown menu appears. This lists your other dashboards on the current account (as well as dashboards shared by other account members) and is useful for quickly switching between dashboards.

the dashboard interface showing the dashboard controls icons

To the right of the title are some icons. The first icon opens the grid options dialog, which lets you change the dimensions of the dashboard grid, hide the grid (it’s still active and usable, though), enable or disable text scaling, and choose whether or not to auto-hide the title bar in fullscreen mode. The second icon toggles fullscreen mode on and off. Once you enter fullscreen mode a third icon will appear, and this icon toggles the ‘Black Dash’ theme (this theme is only available in fullscreen mode). The current states of both fullscreen mode and the ‘Black Dash’ theme are saved with your dashboard.

One other note about the dashboard interface: if you leave a dashboard sitting for more than ten or fifteen seconds and notice that parts of the interface disappear (along with the mouse cursor), don’t worry, it’s just gone to sleep! A move of the mouse will make everything visible again. (If there are any widget settings panels open, though, the sleep timer will not activate.)

Widgets

Now for the meat of it all: widgets. We currently have ten widgets which can be added to the dashboard grid to show various types of data, and we’ll be adding more widget types and contents in the future. Following is a quick rundown of the currently available widgets:

Graph

Graph widgets let you add existing graphs to your dashboard. You may choose any graph from the “My Graphs” section under your current account. Graph widgets are refreshed every few minutes to ensure they’re always up-to-date.

Beacon Map

Map widgets let you add existing Beacon maps to your dashboard. You may choose any map query from the “Beacons” page (under the “Checks” section of your current account). Map widgets are updated in real-time.

Beacon Table

Table widgets let you add existing Beacon tables to your dashboard. You may choose any table query from the “Beacons” page (under the “Checks” section of your current account). Table widgets are updated in real-time.

Chart

Chart widgets let you select multiple metrics to monitor and compare in a bar or pie chart. Chart widgets are updated in real-time.

Gauge

Gauge widgets let you monitor the current state of a single numeric metric in a graphical manner, displaying the most recent value on a bar gauge (dial gauges are coming soon). Gauge widgets are updated in real-time.

Status

Status widgets let you monitor the current state of one or more metrics, displaying the most recent value with custom formatting. This is most useful for text metrics, but it may be used for numeric metrics as well. Status widgets are updated in real-time.

HTML

HTML widgets let you embed arbitrary HTML content on your dashboard. It can be used for just about anything, from displaying a logo or graphic to using an iframe to embed more in-depth content. Everything is permissible except Javascript. HTML widgets are refreshed every few minutes to ensure they’re always up-to-date.

List

List widgets let you add lists of graphs and worksheets to your dashboard, ordered by their last modified date. You may specify how many items to list and (optionally) a search string to limit the list. List widgets are refreshed every few minutes to ensure they’re always up-to-date.

Alerts

Alerts widgets let you monitor your checks by showing the most recent alerts on your current account. You may filter the alerts by their age (how long ago they occurred), by particular search terms, by severity levels, or other status criteria. Alerts widgets are refreshed every few minutes to ensure they’re always up-to-date.

Admin

Admin widgets let you monitor selected administrative information, including the status of all Circonus agents on your current account. Admin widgets are refreshed every few minutes to ensure they’re always up-to-date.

[Image: icons representing some of the current widget types]

There are two ways to add widgets to the dashboard grid: you may drag a widget from the “Add a Widget” panel, or you may first click the target grid cell and then select the widget you want to place there. (Note: in fullscreen mode, only the latter method is available.) After a widget has been added, some types of widgets will automatically activate with default settings, but most will be inactive. If the widget is inactive, click it to open the settings panel and get started.

Once the widget is activated, the settings panel is available by clicking the settings icon in the upper right corner of the widget. In the lower right corner of the widget is the resize handle, so you can resize the widget as often as you want. And let’s not forget being able to rearrange the widgets: every widget has a transparent ‘title bar’ at its top which you can use to drag it around. I won’t get into the details of settings for every type of widget, because they should be self-explanatory (and that would make this one super-long blog post). Suffice it to say, there are plenty of options for everyone.

We’ve been working hard to create a configurable dashboard that will be as flexible as Circonus itself is, and we believe we’ve hit pretty close to the mark. Here’s a sample dashboard showing the power of these new dashboards:


Lost In Translation

For more than ten years, OmniTI has been making large-scale critical Internet infrastructure work. It is, obviously, not black magic or voodoo. Perhaps not so obviously, it is not technical competence that leads to success here. I like to think our team has technical competence in spades, with an impeccable track record, authored books, and a laundry list of speaking engagements to justify it. However, technical competence alone would fall short of the mark, far short.

Without exception, it is expected that proper monitoring and trending are as much a part of the process as setting up networking, backups, and more recently, change management. And yet, when you ask someone to explain why monitoring and trending are vital, you’d be lucky to get a response other than “to be sure things are working”. Something here is lost in translation.

Disconnected Viewpoints

Every business owner knows that watching the books is part of the job. You need to know P&L, you need to understand the outputs and costs of your various business units, and you need to track efficiencies everywhere. All of these metrics play a part in both strategic and tactical decisions made every day. Each business unit reports these things, and while in good organizations each manager knows what is important to every other manager, something is still lost in translation. Far too often, managers don’t understand that what they produce, what they consume, and how they work changes the game for other business units. While the word is overused and abused, every business is an ecosystem. It is obvious that a new marketing campaign will increase resource utilization on the sales teams. It should be obvious that a new marketing campaign will increase resource utilization on IT infrastructure as well.

Every systems administrator knows (or should know) that monitoring your architecture is fundamental. On the other hand, very few can explain in any detail why this is so important. “Because you lose money when systems are offline”, they’ll quote disparagingly. Ask how much and you might catch them at a loss. From my own experience in operations, as well as countless conversations with customers and vendors, very few individuals recognize the relationship between IT and Business. Systems people know that they have to keep systems and services running to support their business, but rarely do they understand that relationship completely.

Owners that foster a transparent and cohesive organization around key performance indicators in every business unit (even those that are cost centers) will change their organizations in two critically useful ways:

  • Efficiencies between business units. With increased transparency, staff in all positions will see the effects of their actions across the business as a whole. This produces an atmosphere of self-reinforcing efficiency.
  • Accountability to the overall business. The hokey old question: “Is what you’re doing good for the company?” changes form. With increased cohesiveness, the answer to that question is a more obvious outcome to every action and no one can call it hokey, because it is always answered without being asked.

A Call To Arms

Technology is no longer underneath the products you sell and the processes by which you deliver them. It is, for at least the immediate future, intertwined with them. Creativity on the technology side doesn’t only deliver cost savings; it creates new audiences and increases interaction with your customers. You have to do more than embrace technology; you need to leverage it and let new opportunities catapult your business forward.

As intertwined as technology is, we can no longer afford to have its operational details hidden away in the bowels of the “tech ops” or “web ops” group. We need visibility and we need cohesion. Infrastructure/application engineering and other business units are now, more than ever before, on the same team marching towards success. Communication and accountability are critical to success.

Here is where I leave you and hope that you will think about the metrics you monitor in a different light. They represent something more. They are there to make the business run, increase shareholder value, and make your customers happier and more prosperous.

Your Visitors Don’t Matter

Consider me old-fashioned, but I remember a time when an alert notification meant something. Drives failed, servers ran short on memory, or a cage monkey pulled the wrong cable at 3 A.M. Regardless of the circumstance, it demanded attention. Those were the days.

Today, operations is all about doing more with less. No more dedicated hardware or late-night maintenance windows. Everything is virtual, cloud-based, or filling up squares in the grid. Automation reigns supreme, limitless scalability at our disposal. Abstraction at its finest.

But woe unto you, the flapping anomaly.

That visitor who tried to load your website was turned away, timed out and left to wither. Poor Jane wanted to view your site. She needed to view your site. She’d already submitted her order, only to be ignored. Forgotten. Disconnected with nary a trace to route nor a cookie to favor.

Jane was a victim of a numbers game. Someone, somewhere, decided that some problems don’t matter. Which ones? Who cares? They don’t matter. And because she happened to visit when this problem reared its head, you ignored her request. Who would ever make such a silly presumption that one failure is less important than another? What criteria are used to determine the worthiness of this alert or that one? Pure random circumstance, it would appear.

Many “uptime” services and monitoring suites promote the concept of selective or flapping failures. Vendors sell these features as a convenience, ostensibly as a sleep aid. The administrator’s snooze-bar. I can’t think of any other reason that ignoring a faulty condition would be considered a good thing. Perhaps they reason that only the check is affected. If it responds after the third attempt, it was probably ok for visitors all along. Right?
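To put rough numbers on that reasoning, here is a minimal sketch of what a "consecutive failures before alerting" rule does to a series of check results. It doesn't model any particular vendor's feature. The function names, the threshold parameter, and the sample data are all made up for illustration.

```python
def count_failures(results):
    """Count every failed check; each one could be a visitor like Jane."""
    return sum(1 for ok in results if not ok)


def alerts_with_suppression(results, threshold):
    """Return the indices at which an alert would fire when a notification
    is only sent after `threshold` consecutive failures (hypothetical rule)."""
    alerts = []
    streak = 0
    for i, ok in enumerate(results):
        if ok:
            streak = 0  # any success resets the streak, "un-flapping" the check
        else:
            streak += 1
            if streak == threshold:
                alerts.append(i)
    return alerts


# Made-up check history: True = check passed, False = check failed.
checks = [True, False, False, True, True, False, True, False, False, True]

failed = count_failures(checks)                            # 5 failed checks
suppressed = alerts_with_suppression(checks, threshold=3)  # [] -> nobody is told
strict = alerts_with_suppression(checks, threshold=1)      # [1, 5, 7]
```

In this made-up history, half of the checks failed, yet a three-strikes rule never sends a single notification. Alerting on the first failure of each run surfaces all three incidents.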

It’s disappointing how many vendors embrace this broken methodology. It probably seemed innocent at a glance. But the damage has been done; recklessness has taken root. We’ve been conditioned to accept these transient malfunctions as mere operational speed bumps. Rather than address the problem, we nudge the threshold a tad higher. Throw additional nodes into the cluster. Increase capacity, while decreasing exposure.

But there is a more responsible alternative. Whatever happened to purposeful, iterative corrections and Root Cause Analysis? Notifications may be annoying at times, but they serve a crucial function in a healthy production architecture. Ignored alerts lead to stagnant bugs, lost traffic and missed opportunities. Stop treating your visitors like they don’t matter. There’s no such thing as a flapping customer.