How to Elevate From Basic to Advanced Infrastructure Monitoring

Times are changing fast and technology continues to advance at an unrelenting pace. An explosion of systems and devices, complex architectures, pressures to deploy faster, and demand for optimal performance have placed greater and greater strain on monitoring teams. For many, their current monitoring strategy and tools are just not enough. (See my recent post, “Five Signs Your Monitoring System is Failing You.”)

Developing ever more sophisticated monitoring practices and capabilities is a journey. At Circonus, we’ve developed a capability maturity model to help companies make that journey. It maps out levels of capability and the steps to move from one level to the next, beginning with “basic monitoring” and progressing through full machine data intelligence. If you’re ready to get more value from your monitoring efforts and regain control of your operations, the first step is to move from basic to advanced monitoring.

Basic Monitoring: Challenges and Limitations

One of the hallmark characteristics of basic monitoring is that it’s very tactical and reactive. There’s a line about Winnie-the-Pooh as he’s bouncing down a staircase on his head that says in essence, “I know there must be a better way, if I could just stop and think of it.” Sound familiar?

At the basic monitoring level, the organization has multiple teams all using disparate monitoring tools for their specific purpose and creating silos of metric data. It’s a patchwork environment where there is a lack of standards and consistent processes and as a result, there’s no ability to share information in a clear and cohesive way among different teams within the organization.

Systems and applications are configured differently, monitored differently, and measured differently. Analyses like comparing the KPIs of one Kubernetes cluster to another are impossible because there’s no way to do a true “apples to apples” comparison. There’s also no ability to search and correlate all the metrics that the various monitoring tools are collecting because metrics are formatted differently and/or lack sufficient context. Critically for executive management, there is no way to get a comprehensive and consolidated view of the health and performance of the systems that underpin the business.

And operational inefficiencies abound. Services are provisioned ad hoc by various teams and, without centralized tracking, they fall off the radar screen. Over time, services get shut down but the associated infrastructure is forgotten. Costs and metrics continue to pile up unabated.

At this stage, all your charts and graphs may look great, but you’re only one misstep away from a potential catastrophe. And when that catastrophe happens, only then do companies realize that an event threatening their organization’s brand or bottom line could have easily been surfaced with unified, comprehensive monitoring. It’s usually a painful outage like this that leads companies to re-evaluate their monitoring practices and solutions. It may even be what led you to read this post.

Moving From Basic to Advanced Monitoring

In the advanced monitoring stage, operations are far more strategic and proactive. Organizations move from fire-fighting to driving measurable business performance and results. In an advanced monitoring environment, an organization has established organization-wide monitoring by consolidating and rationalizing its monitoring and data collection capabilities across the enterprise. It has built a solid foundation on which to begin deriving additional value from monitoring data for use cases such as streaming analytics, fault/anomaly detection, root cause analysis, SLOs/SLAs, and error-budgeting.

There are five foundational components that are absolutely critical to achieving this level of operations.

1. Organizational Buy-In

The importance of this aspect cannot be overstated. You will be limited in your ability to generate results to the degree that everyone in the organization buys into the vision and goals of a comprehensive, consistent, and unified approach to monitoring. It is imperative that leadership establishes a data-driven culture that embraces and values the benefits of unified monitoring. Leaders up to and including the CEO need to make it a clear mandate and priority but they also need to take the time to impress upon all team members the strategic importance to the business, to explain the “why.” Why monitoring is so critical to business success and why decisions are being (or will be) made to change the way monitoring has been done up until now.

Team members have likely become invested in the current approach. Ensure each team has leaders who have technical expertise in monitoring and truly believe in its value. They need to be fluent in the technology and processes so they can train others, help drive consistency, and enforce procedures throughout the organization. It’s a hearts and minds campaign for sure and nothing speaks louder than the actions of leadership. If leaders do not fully embrace the value of monitoring in their daily actions, you can be sure that attitude will cascade throughout the organization.

2. A Comprehensive Inventory of Services and Infrastructure

You can’t change what you don’t monitor and you can’t monitor what you don’t know exists. You may be surprised to learn how many services are running in your organization that may have been forgotten. Services get deployed over time but at some point they get shut down. Unfortunately no one tells the operations team and the underlying infrastructure continues to run and generate costs.

At the heart of a robust monitoring program is an always up-to-date inventory. Start by identifying all the services you are running and the resources they depend upon. Document what services are running, where they are running, how they’re running, why they’re running, what they do, what they connect with, etc. Develop a plan and procedures so that new services and infrastructure are automatically added to the inventory and monitored by default as soon as they are provisioned. (Over time, your monitoring platform should in essence become the system of record for all services and related infrastructure.)

It may take some time to complete the inventory, but it will be well worth the effort. It is incredibly valuable just knowing what you have. It’s not essential to have an exhaustive inventory of all your infrastructure and services to move into advanced monitoring. It’s more important to get started and instill the practice of keeping track of inventory and then monitor that inventory consistently.
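As a rough illustration of the practice described above, here is a minimal Python sketch of an inventory in which resources are registered at provisioning time and forgotten infrastructure can be surfaced later. Everything in it is hypothetical (the `Inventory` and `Resource` classes, the field names, the service names); a real inventory would live in your monitoring platform or a database, not in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One piece of infrastructure (hypothetical fields, for illustration only)."""
    name: str
    service: str            # the service this resource supports
    monitored: bool = True  # monitored by default from the moment it is registered


@dataclass
class Inventory:
    """Toy in-memory inventory of services and the resources behind them."""
    active_services: set = field(default_factory=set)
    resources: dict = field(default_factory=dict)

    def provision(self, resource_name: str, service: str) -> Resource:
        # Registering at provisioning time means nothing starts out untracked.
        self.active_services.add(service)
        resource = Resource(name=resource_name, service=service)
        self.resources[resource_name] = resource
        return resource

    def retire_service(self, service: str) -> None:
        # Retiring a service does not remove its resources;
        # that gap is exactly how infrastructure gets forgotten.
        self.active_services.discard(service)

    def orphaned_resources(self) -> list:
        # Resources whose service is no longer active are candidates for shutdown.
        return sorted(
            r.name for r in self.resources.values()
            if r.service not in self.active_services
        )


inventory = Inventory()
inventory.provision("web-01", service="checkout")
inventory.provision("db-01", service="checkout")
inventory.provision("batch-01", service="legacy-reports")
inventory.retire_service("legacy-reports")

print(inventory.orphaned_resources())  # ['batch-01']
```

The point of the sketch is the shape of the process, not the data structure: provisioning and registration happen in one step, and a periodic check for orphans catches what people forget to report.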

3. A Monitoring Plan Linked to Business Success

What should we be monitoring? Believe it or not, that simple question can nearly bring about an existential crisis in the business, causing companies to really question what they do and why. It’s a great mental exercise to ask, “If we could only monitor one KPI, one metric, one telemetry point, what would it be?” (Not that you ever would.) Do this exercise with a cross-functional team across the business and you’ll get a range of answers. You’re on your way to really understanding what’s important to the business and therefore what you should be monitoring to ensure you meet those goals. (By the way, if no one is up for doing this exercise, you probably need to revisit step one.)

For example, if you’re an online retailer, your goal is most likely to drive the sale of products to customers. It’s useful to know “can they purchase?” (i.e., is your ecommerce platform up and available), but probably even more useful to know “are they purchasing?” Or perhaps you’ll want to know the amount of sales being generated per second. Which is most critical to business success?
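To make the difference between those questions concrete, here is a small Python sketch that derives a sales-per-second figure from order events. The event format, the sample data, and the `sales_per_second` helper are all assumptions made for illustration; in practice these numbers would come from your order pipeline or monitoring platform.

```python
def sales_per_second(orders, window_start, window_end):
    """Return (orders per second, revenue per second) for orders
    with a timestamp in the half-open window [window_start, window_end)."""
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window_end must be after window_start")
    in_window = [amount for ts, amount in orders if window_start <= ts < window_end]
    return len(in_window) / duration, sum(in_window) / duration


# Hypothetical order events: (unix timestamp in seconds, order amount)
orders = [(100, 20.0), (101, 35.0), (104, 10.0), (109, 15.0), (112, 50.0)]

# "Can they purchase?" would typically come from an availability check.
platform_is_up = True

# "Are they purchasing?" comes from the business activity itself.
order_rate, revenue_rate = sales_per_second(orders, window_start=100, window_end=110)
customers_are_purchasing = order_rate > 0

print(order_rate, revenue_rate)  # 0.4 8.0
```

A platform can be up while the order rate sits at zero, which is why the second signal is often the more telling one.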

Once again, it’s more important to get started than to be perfect out of the gate. You’ll want to iterate your plan with business leaders over time to refine what metrics are most important to focus on. Best practice these days is to “monitor everything.” You’ll want full observability of all your infrastructure and all your metrics. Not only will this help accelerate problem resolution, but once you have these metrics collected, it’s possible for your teams to surface additional business value within this sea of data.

4. A Unified Monitoring Platform that is Metrics 2.0 Compliant

All of the above is critically dependent on implementing a centralized monitoring platform that has the capacity to consistently collect, correlate, share, and present all your metric data from all of your infrastructure in use by all of your teams in real-time. A centralized platform consolidates the monitoring efforts across all teams within the organization and enables the business to extract the maximum value from its monitoring efforts.

This resolves one issue with basic monitoring: the use of disparate tools and the resulting inability to correlate and understand metrics. Another issue, however, is that many monitoring tools are not Metrics 2.0 compliant. Metrics 2.0 is a set of “conventions, standards and concepts around time series metrics metadata” with the goal of generating metrics in a format that is self-describing and standardized.

The fundamental premise of Metrics 2.0 is that metrics without context do not have a lot of value. Metrics 2.0 requires metrics be tagged with associated “metadata” or context about the metric that is being collected. For example, collecting CPU utilization from a hundred servers without any context is not particularly useful. But with Metrics 2.0 tags, you will know that this particular CPU metric is from this particular server within this particular rack at this specific data center doing this particular type of work. Much more useful.

When all metrics are tagged in this manner, queries and analytics become quite powerful. You can search based on these tags and you are able to slice and dice the data in many ways to glean insights and intelligence about your operations and performance.
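The following Python sketch shows what tag-based searching and slicing can look like. It is only an illustration: the tag keys (`host`, `rack`, `datacenter`, `role`), the sample values, and the `select` and `mean_by` helpers are assumptions, not the Metrics 2.0 schema or any particular platform’s query API.

```python
from collections import defaultdict

# Hypothetical CPU samples, each carrying its own context as tags.
metrics = [
    {"name": "cpu.utilization", "value": 70.0,
     "tags": {"host": "web-01", "rack": "r12", "datacenter": "us-east", "role": "web"}},
    {"name": "cpu.utilization", "value": 90.0,
     "tags": {"host": "web-02", "rack": "r12", "datacenter": "us-east", "role": "web"}},
    {"name": "cpu.utilization", "value": 40.0,
     "tags": {"host": "db-01", "rack": "r14", "datacenter": "us-east", "role": "db"}},
    {"name": "cpu.utilization", "value": 50.0,
     "tags": {"host": "web-03", "rack": "r03", "datacenter": "eu-west", "role": "web"}},
]


def select(samples, name, **tags):
    """Return the samples with the given metric name whose tags match all filters."""
    return [
        m for m in samples
        if m["name"] == name
        and all(m["tags"].get(key) == value for key, value in tags.items())
    ]


def mean_by(samples, tag):
    """Group samples by the value of one tag and average each group."""
    groups = defaultdict(list)
    for m in samples:
        groups[m["tags"].get(tag, "unknown")].append(m["value"])
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Search: CPU on web servers in one data center.
us_east_web = select(metrics, "cpu.utilization", datacenter="us-east", role="web")
print([m["tags"]["host"] for m in us_east_web])  # ['web-01', 'web-02']

# Slice: average CPU by role across every data center.
print(mean_by(select(metrics, "cpu.utilization"), "role"))  # {'web': 70.0, 'db': 40.0}
```

Without the tags, the same four numbers could only be averaged together; with them, the same data answers questions by host, rack, data center, or role.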

5. A Commitment to Learn and Iterate

Finally, it’s also critical that the business adopt a philosophy of continuous improvement. As noted several times above, it’s more important to get started and iterate over time than to shoot for immediate perfection. Start with an initial inventory and then ensure new services and infrastructure are automatically added to your inventory and monitored by default. Build your initial monitoring plan in collaboration with business leaders, set what you believe to be acceptable performance levels, and measure results. Then meet regularly with business leaders to share data and results and further refine your monitoring plan. It’s a journey and an iterative process – one in which there is always room for improvement.

The Rewards of Advanced Monitoring

So, you might rightly ask, what do I get if I go through the trouble of making all these changes? What’s the upside? Well first and foremost, you will have moved from being a reactive service provider, no doubt viewed as only a cost center in the business, to a strategic business partner able to help drive tangible business results. But you’ll get a host of other benefits as well:

  • Avoid being blindsided by preventable outages
  • Full visibility and command of all your infrastructure and metrics
  • Faster problem identification and resolution time
  • The ability to answer any question at any time
  • More confidence and speed in your decision making
  • Better sleep at night

For some organizations, basic monitoring processes and tools may be enough. But for others, it’s just not sufficient. The power and value of monitoring grows exponentially the more you can harness all your metric data to confidently make the best decisions for your organization. Any monitoring tool will work until something goes wrong. But if you’re tired of bouncing down the proverbial staircase on your head, it’s time to elevate your game to advanced monitoring.