Five Signs Your Monitoring Solution is Failing You

In a recent post I talked about the strain being placed on IT infrastructure by the current surge in demand for online services driven by the COVID-19 pandemic. I also described how this sudden migration online has exposed weaknesses in, and in some cases a total lack of, adequate monitoring practices. Unfortunately, many online sites have experienced degradation of service, poor customer experiences, and even complete outages. Operations teams are scrambling to keep up with demand while flying nearly blind from a lack of key metric data.

Infrastructure monitoring has been around for decades, so why the chaos? I think it depends a lot on your monitoring philosophy. Too many of us view monitoring like an insurance policy: something to purchase so we can check the box and say we have it in place. It’s easy to get mesmerized by charts and graphs, but are we really using monitoring to run the business or just giving ourselves a false sense of security? If things go sideways (and they always do), it’s going to be little comfort to say, “well, we had monitoring in place.”

So here are five sure signs that your current monitoring solution is letting you down.

Being Blindsided. It’s never fun when the first indication you have of problems in production is complaints from customers. This is a classic scenario that most of us have faced at some point, and likely the most frustrating and embarrassing one. It could be that your monitoring solution just needs to be better configured, but it could also mean that it has limits on the amount of data it can collect, that data has been lost, or that data is delayed on its way into analytics. Also, out-of-the-box graphs have the allure of getting you up and running quickly but could miss what’s most meaningful to you. Be sure you have the ability to configure your monitoring accordingly and to update it continually as the needs of the business evolve.
Preventable Outages and False Positives. In this scenario, you are experiencing too many preventable outages on the one hand or too many false positive alerts on the other. Either you are not monitoring what you actually care about, or you know exactly what you care about but your monitoring solution can’t express it. For example, “tell me when a disk is 90% full” is not nearly as useful as “tell me when a disk is within 6 hours of running out of space.” Doing the latter, however, requires more advanced functionality such as forecasting.
Bad Data. You spend hours studying a graph to isolate a problem only to find out the data you’ve been analyzing is either wrong or outdated. This can happen when you’ve had to compromise and store summary data to save space and/or cost, and you lose the drill-down granularity you need for root cause analysis and other analytics. It can also happen if there is a delay between ingestion and availability of data.
Missing Data. You have urgent operational and business impact questions, but it turns out you haven’t been collecting the data you need to answer them. This typically happens when your monitoring system is limited in its ability to collect and retain the massive amount of telemetry data generated by modern infrastructure. It’s a basic tenet of DevOps to measure everything, but most likely you had to make trade-offs because of limitations in your monitoring solution’s capacity and capabilities.
Monitoring Crashes. Your monitoring solution is actually less dependable than the systems it monitors, and you lose data when it crashes. Be careful of solutions with single points of failure and/or that need to be deployed within the actual “blast zone” (of a potential outage) in order to collect data. Collection technology should have the ability to “store and forward” to eliminate data loss from data center outages.
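
To make the disk example from the second sign concrete, here is a minimal sketch of a forecast-based alert. It assumes usage samples arrive as (hours, bytes used) pairs and fits a simple least-squares line to them. The function names, the sample format, and the 6-hour default are illustrative assumptions, not any particular product's API, and a real forecaster would likely need something more robust than a straight line.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in hours, bytes used); assumed format


def hours_until_full(samples: Sequence[Sample], capacity: float) -> Optional[float]:
    """Estimate hours until the disk fills, using a least-squares linear trend.

    Returns None when there is no upward trend to extrapolate.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Slope of the fitted line: growth in bytes per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # usage is flat or shrinking
    _, latest_used = samples[-1]
    return max(0.0, (capacity - latest_used) / slope)


def should_alert(samples: Sequence[Sample], capacity: float,
                 horizon_hours: float = 6.0) -> bool:
    """Alert only if the disk is projected to fill within the horizon."""
    eta = hours_until_full(samples, capacity)
    return eta is not None and eta <= horizon_hours


# A disk at 80% but growing fast, and one at 90% that is barely growing.
fast_growing = [(0, 50.0), (1, 60.0), (2, 70.0), (3, 80.0)]
nearly_static = [(0, 90.0), (1, 90.1), (2, 90.2)]
print(should_alert(fast_growing, capacity=100.0))
print(should_alert(nearly_static, capacity=100.0))
```

Under these assumptions, a static “90% full” threshold would fire on the second disk and stay silent on the first, while the forecast does the opposite, which is the behavior the example in the list is asking for.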

If any of these symptoms sound like you, there’s a good chance that your monitoring solution needs a tune-up. The good news is that these issues can indeed be fixed, and the return on investment is well worth the time and effort.

The systems that run your business need to be the best they can be, but the system you use to monitor them needs to be even better. You need complete visibility and command of all the data from all your infrastructure to make the best possible decisions and provide the best possible service.

If you’ve been putting off changes to your operational and monitoring practices that you’ve known you need to make, now is the time. One could be forgiven for not predicting a pandemic, but there is no excuse for failing to learn from it and improve.