At the beginning of this month, I attended the Gartner Data Center Conference in Las Vegas, and I want to share some of my impressions and insights from the event.
First, I have to say that I have seldom seen a group of more conscientious conference attendees (aside from Surge, of course, and a physics conference I once attended). Networking breakfasts were busy, sessions were well attended, and both the lunches and the topic-specific networking gatherings had lively discussions. Each of the Solution Center hours, which ran well into the evening, was full of people who were not only partaking of the food and giveaways but, above all, voraciously soaking up information from the various exhibitors. Even in the hallways during the day, wherever people were sitting or standing, there was a steady exchange of opinions and information. This is what I saw throughout the conference: attendees were very serious about learning from the speakers, from the vendors, and from their peers. Relatedly, it’s interesting that many organizations outright bar their employees from attending any events in Vegas. While boondoggle may be an appropriate term for some shows in that or any other location, it certainly wasn’t the case with this conference.
Now let’s get to what was most often foremost on attendees’ minds. I was somewhat surprised to find that it was not something that usually appears on the top-ten lists of CIO/IT initiatives. Rather, what repeatedly came out first among attendees’ pressing interests were the interrelated topics of avoiding IT outages and speeding up service recovery, along with monitoring to help with both.
Granted, this was a datacenter-specific conference, so it’s natural that avoiding and recovering from operational failures is of paramount importance. But note that there are lots of other overarching datacenter initiatives we all hear much more about, such as virtualization, cloud migration, and datacenter consolidation. Many of these headline-grabbing topics are certainly important, and they are getting done. However, what affects datacenter operations leaders’ daily lives and careers, and so is of primary importance to them, has received little if any notice or press.
Why is that? It’s pretty simple. Some of these other initiatives are new. Monitoring has been around seemingly forever, and outages are, to an extent, accepted as unavoidable. Yet while zero failures is indeed not possible, markedly increased reliability is certainly attainable. Look at the historical telecom service provider side, where five-nines (99.999%) reliability is the expected level of service. When expectations are high, and commensurate investment is made, higher levels of reliability are not at all out of reach.
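To put a number on what a reliability target of N nines means, here is a minimal sketch of the arithmetic. The function name is my own, and I’m assuming a 365-day year and treating “reliability” as simple uptime percentage.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines.

    Three nines is 99.9%, five nines is 99.999%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailable_fraction = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

By this arithmetic, five nines leaves a budget of a little over five minutes of downtime per year, versus nearly nine hours for three nines.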
As for monitoring solutions themselves, nowadays you don’t have to be limited to old-school systems. There are young companies, like Circonus, that take a fresh approach and break down the silos of the stand-alone toolsets of the past.
Let’s take a step back now and visualize what outages look like from a datacenter ops team’s perspective, i.e., what happens when things ‘blow up’ in a datacenter. For the most part, it’s not external constituents such as clients that directly impact the datacenter. External clients touch the business units, and it’s then the business units that put the heat on the datacenter leaders.
And what about SLAs for keeping business units apprised of the benefit IT delivers to them? As I heard loud and clear at the conference, internal SLAs are for the most part useless. Why? Because they don’t mean much to the business units, who are only interested in “When are you going to get my service back up?!” In other words, this is a variation on “What have you done for me lately?”
So let’s look at an option for resolution. If the problem occurs on a virtual machine, you just spin up a new instance, right? Wrong, but that’s what usually happens. When a hammer dangling off a shelf hits you on the head, do you replace it with another dangling hammer and think you’ve solved the problem? Obviously, the thing to do in a datacenter is to do the work to avoid a repetition of the issue (we’re talking root-cause analysis); otherwise you’re putting out fires repeatedly, and they’re the same fires.
Now, a good monitoring system is going to help, and in several ways. First, as just mentioned, it’s going to assist in identifying the underlying issue, including its location: is it in the app, the database, the server, etc.? You don’t want to do that by testing blindly; you’ll want the capability to create graphs on the fly, and you’ll similarly want to be able to correlate your metrics easily and quickly.
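To make “correlating your metrics” concrete, here is a rough sketch of the idea, not how any particular product does it. It ranks candidate metrics by how strongly they move with a symptom metric, using the Pearson correlation coefficient. The metric names and sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(symptom, candidates):
    """Sort candidate metrics by absolute correlation with the symptom."""
    scores = {name: pearson(symptom, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical samples taken at the same six points in time.
    response_ms = [120, 125, 130, 400, 420, 410]
    candidates = {
        "db_query_ms": [30, 32, 31, 300, 310, 305],
        "cpu_pct": [40, 42, 41, 40, 43, 41],
    }
    for name, score in rank_by_correlation(response_ms, candidates):
        print(f"{name}: {score:+.2f}")
```

In this made-up data the database metric tracks the response-time spike far more closely than CPU does, which is the kind of hint that tells you where to look first. Correlation alone doesn’t prove cause, of course; it narrows the search.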
Okay, so that’s good for remediating a problem and reducing the chance of it recurring, but you’ll also want to take anticipatory actions like capacity planning to forestall avoidable bottlenecks. For this you also want an easy-to-use tool so that you don’t have to muck around with spreadsheets. And you’ll want a ‘play’ function so that when you do things such as code pushes, you can see the effect of these changes in real time. This way, if the effect of a code push is negative, you can quickly reverse the action without impacting your internal or external clients.
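As a simple illustration of the capacity-planning side, the sketch below fits a straight line to daily usage readings and projects when the trend would reach capacity. This is only one naive approach, assuming roughly linear growth; the function name and the sample figures are hypothetical.

```python
def days_until_full(daily_usage, capacity):
    """Project days remaining until a linear usage trend reaches capacity.

    daily_usage holds one reading per day, oldest first. Returns None when
    usage is flat or shrinking, since the trend never reaches capacity.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    # Ordinary least-squares slope and intercept, with x = day index 0..n-1.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    # Count from the most recent reading; never report a negative number.
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    disk_gb = [100, 110, 120, 130, 140]  # hypothetical daily readings
    print(days_until_full(disk_gb, capacity=200))
```

Real growth is rarely this tidy, so treat a projection like this as an early warning, not a forecast to bet on.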
The good news is that new solutions with all these capabilities are out there in the marketplace. Of course, before you buy one, be sure to insist on testing the solution in a trial to see how it performs in your current and anticipated (read: hybrid physical and virtual/cloud) environments. This includes seeing how the solution handles your scale, both on the backend and from a UI perspective. Such an evaluation will require an investment of your time, but the result will be well worth it in fewer outages and faster recovery from them.