Our previous post, “Monitoring for Success: What All SREs Need to Know,” discusses how today’s complex IT environments (virtualization, cloud computing, continuous delivery and integration), coupled with pressure to deploy faster while meeting “always on” customer expectations, have placed greater strain on monitoring teams. Unfortunately, many organizations still have legacy monitoring tools and processes in place that are no longer effective in today’s world. At Circonus, we speak with many companies that are looking to “modernize” their monitoring. They want to embrace and implement SRE principles, and fully harness all the powerful data they are generating so they can gain insights and make decisions that have a big impact on the company.
If you’re looking to advance your monitoring and elevate its impact and significance to your organization, then you need the following four essential characteristics of modern monitoring, none of which can be achieved without the right monitoring platform.
1. Democratization of Data Through Centralized Monitoring
One of the hallmarks of conventional monitoring is having disparate monitoring tools that each have a specific purpose and create silos of metric data. It’s a patchwork environment where there is a lack of consistent standards and processes and as a result, there’s no ability to share information in a clear and cohesive way among different teams within the organization.
Having disparate tools often requires more costs and resources, and knowledge of how to use them can reside in just a few individuals. This not only creates the potential for serious disruptions if people leave the organization, but it also prevents teams within the IT organization from being able to find answers on their own. For example, an engineer responsible for application performance monitoring cannot get the information they require on network health without relying on someone from the network team to get it for them, which increases the time needed for essential tasks like troubleshooting. At the strategic level, there is no way to get a comprehensive and consolidated view of the health and performance of the systems that underpin the business.
By centralizing all of your metrics into one monitoring and observability platform, your organization gains a consistent metrics framework across teams and services. You democratize your data so that anybody can immediately access that data any time and use it in a way that is correlated to the other parts of your business – eliminating the time-consuming barriers associated with legacy monitoring tools. A centralized platform that consistently presents and correlates all data in real-time consolidates monitoring efforts across all teams within the organization and enables the business to extract the maximum value from its monitoring efforts.
For example, Major League Baseball (MLB) is using Circonus as the centralized monitoring and analytics platform that underpins applications, systems infrastructure, cloud infrastructure, and network infrastructure.
In a recent interview with Network World, Jeremy Schulman, Principal Network Automation Software Engineer at MLB, stated, “All this very rich information is being put into a common observability platform, and that democratizes the data in a very important way at MLB. Enabling other IT disciplines to access network data will potentially speed troubleshooting and improve performance.”
“It’s amazing to have a seat at that table,” Schulman continued. “We don’t have to make isolated tool decisions. We get to work with a group of very sophisticated engineers across all these other domains in cloud infrastructure, systems infrastructure, and we get to use their tools, along with their technology.”
2. Compliance with Metrics 2.0
Metrics 2.0 is a set of “conventions, standards and concepts around time series metrics metadata” with the goal of generating metrics in a format that is self-describing and standardized.
The fundamental premise of Metrics 2.0 is that metrics without context do not have a lot of value. Metrics 2.0 requires metrics be tagged with associated “metadata” or context about the metric that is being collected. For example, collecting CPU utilization from a hundred servers without any context is not particularly useful. But with Metrics 2.0 tags, you will know that this particular CPU metric is from this particular server within this particular rack at this specific data center doing this particular type of work. Much more useful.
When all metrics are tagged in this manner, queries and analytics become quite powerful. You can search based on these tags and you are able to slice and dice the data in many ways to glean insights and intelligence about your operations and performance.
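To make the idea of tagged metrics concrete, here is a minimal Python sketch of the CPU example above. The `Metric` class, the `select` and `mean_by` helpers, and the tag names (`host`, `rack`, `datacenter`, `role`) are all hypothetical and chosen for illustration. They are not taken from the Metrics 2.0 conventions or from any particular monitoring product.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Metric:
    """One measurement plus the context (tags) that describes it. Hypothetical type."""
    name: str
    value: float
    tags: dict = field(default_factory=dict)


def select(metrics, **wanted):
    """Return the metrics whose tags contain every requested key/value pair."""
    return [
        m for m in metrics
        if all(m.tags.get(key) == val for key, val in wanted.items())
    ]


def mean_by(metrics, tag):
    """Group metric values by a single tag and average each group."""
    groups = {}
    for m in metrics:
        groups.setdefault(m.tags.get(tag, "untagged"), []).append(m.value)
    return {key: mean(values) for key, values in groups.items()}


# Illustrative sample data; hosts, racks and data centers are made up.
SAMPLES = [
    Metric("cpu.utilization", 91.0,
           {"host": "web-01", "rack": "r12", "datacenter": "us-east", "role": "frontend"}),
    Metric("cpu.utilization", 35.0,
           {"host": "web-02", "rack": "r12", "datacenter": "us-east", "role": "frontend"}),
    Metric("cpu.utilization", 88.0,
           {"host": "db-01", "rack": "r07", "datacenter": "us-west", "role": "database"}),
]

# "Slice and dice": narrow by tags, then aggregate along another tag.
frontend_east = select(SAMPLES, datacenter="us-east", role="frontend")
by_datacenter = mean_by(SAMPLES, "datacenter")
```

Without the tags, the three values are just three numbers. With them, the same data can answer questions such as "how busy are the frontend servers in one data center?" without anyone needing to know which tool collected them.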
Many monitoring tools, however, are not Metrics 2.0 compliant. Today’s SREs are swimming in data, and without metrics that have sufficient context, identifying the source of a performance issue can take hours, and executing core SRE functions like dynamically creating SLOs is difficult if not impossible.
3. SLOs and Error Budgets
As more companies transform into service-centric, “always on” environments, they are implementing SRE functions that are responsible for defining ways to measure availability and uptime, accelerate releases, and reduce the costs of failures. Enter Service Level Objectives (SLOs). SLOs are an agreement on an acceptable level of availability and performance and are essential to modern monitoring because they help SREs determine how to properly balance risk and innovation.
As noted above, creating SLOs depends on highly precise, granular data. Having the right monitoring and analytics platform in place, one that provides correct math, historical metrics, and the ability to correlate metrics, is critical to calculating your SLOs correctly and avoiding the costly mistakes that follow from a miscalculated objective.
Once you have SLOs in place, you can define your “error budget.” An error budget is based on your SLOs and is essentially the difference between the level at which your systems are capable of performing and the level that still provides an acceptable experience to your customers. For example, if you have an SLO of 99.5% uptime and actually reach 99.99% in a typical month, consider the delta to be an error budget: time that your team can use to take risks.
Having an error budget will force you to have metrics in place to know how well you’re meeting goals and if there’s room in the budget for additional risk. If you’re consistently not meeting or getting close to not meeting your SLOs, then it’s time to dial back. Conversely, if you’re exceeding goals, then dial up innovation and deploy more features. Like SLOs, the error budget ensures teams are aligned on when to slow down or speed up.
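The arithmetic behind the 99.5% example, and the dial-back or dial-up decision, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed policy: the 30-day month, the function names, and the `caution_fraction` threshold (how much unused budget counts as "getting close") are all hypothetical choices.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes; assumes a 30-day month


def allowed_downtime_minutes(slo_percent, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Downtime the SLO tolerates over the period."""
    return (100.0 - slo_percent) / 100.0 * period_minutes


def remaining_budget_minutes(slo_percent, measured_percent,
                             period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Tolerated downtime minus the downtime actually observed."""
    used = (100.0 - measured_percent) / 100.0 * period_minutes
    return allowed_downtime_minutes(slo_percent, period_minutes) - used


def release_posture(slo_percent, measured_percent, caution_fraction=0.25):
    """Return 'dial back' when the budget is spent or nearly spent, else 'dial up'.

    caution_fraction is an assumed threshold: with less than this share of
    the budget left, the team is treated as close to missing the SLO.
    """
    allowed = allowed_downtime_minutes(slo_percent)
    remaining = remaining_budget_minutes(slo_percent, measured_percent)
    if remaining < caution_fraction * allowed:
        return "dial back"
    return "dial up"


# A 99.5% SLO tolerates about 216 minutes of downtime in a 30-day month.
budget = allowed_downtime_minutes(99.5)
# Reaching 99.99% uses only a few of those minutes, leaving most of the budget.
left_over = remaining_budget_minutes(99.5, 99.99)
posture = release_posture(99.5, 99.99)
```

Expressing the budget in minutes rather than percentage points tends to make the trade-off easier to discuss: a risky deployment can be weighed against a concrete amount of downtime the team can still afford.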
4. Ability to “Monitor Everything”
Best practice these days is to “monitor everything,” not just samples. You need full observability of all your infrastructure and all your metrics. Not only will this help accelerate problem resolution, but once you have these metrics collected, it’s possible for your teams to surface additional business value within this sea of data.
But accurately monitoring everything requires the built-in ability to continuously aggregate ALL metrics from ALL infrastructure on-demand at extremely high granularity – which can amount to millions of measurements per second. This requires a platform that can run at scale, meaning there should be no compromise on performance, regardless of the infrastructure environment size or the amount of data collected to run analytics in real-time.
Traditional monitoring was not built to handle this type of scale, so it handles data in a way that leads to inaccurate analysis. Take, for example, latency SLOs – just highlighted as a critical characteristic of modern monitoring. Traditional tools will reduce all latency measurements to a single number – the average latency over an arbitrarily determined time window, typically a minute. This can result in wildly inaccurate latency SLOs, which could end up costing organizations significant money and resources. At Circonus, we collect all source data and store latency measurements as OpenHistograms. This enables customers to aggregate millions of latency measurements a second, so that they can accurately calculate SLOs.
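A small Python sketch shows why an average can hide a missed latency SLO while a histogram does not. It uses a deliberately simplified histogram with a handful of fixed buckets. The bucket bounds, function names, and sample data are hypothetical, and the code does not reproduce the OpenHistogram bucket layout or API.

```python
from bisect import bisect_left
from collections import Counter

# Assumed bucket upper bounds in milliseconds; the last bucket catches everything else.
BUCKET_UPPER_BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, float("inf")]


def to_histogram(latencies_ms):
    """Count measurements per bucket; bucket i holds values <= BUCKET_UPPER_BOUNDS_MS[i]."""
    counts = Counter()
    for value in latencies_ms:
        counts[bisect_left(BUCKET_UPPER_BOUNDS_MS, value)] += 1
    return counts


def merge(*histograms):
    """Histograms over the same buckets combine by adding their counts."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


def fraction_within(histogram, threshold_ms):
    """Share of measurements in buckets whose upper bound is <= threshold_ms.

    Exact only when threshold_ms coincides with a bucket boundary.
    """
    total = sum(histogram.values())
    good = sum(count for bucket, count in histogram.items()
               if BUCKET_UPPER_BOUNDS_MS[bucket] <= threshold_ms)
    return good / total


# Two one-minute windows of made-up request latencies (ms).
minute_one = [20] * 990 + [900] * 10   # busy minute, mostly fast
minute_two = [20] * 50 + [900] * 50    # quiet minute, half the requests slow

all_requests = minute_one + minute_two
overall_mean_ms = sum(all_requests) / len(all_requests)

merged = merge(to_histogram(minute_one), to_histogram(minute_two))
share_under_100ms = fraction_within(merged, 100)
```

With this sample data the overall mean sits comfortably under 100 ms, yet the merged histogram shows that fewer than 95% of requests actually completed within 100 ms. An SLO such as "99% of requests under 100 ms" would therefore be missed even though the average looks healthy. Because the per-minute histograms merge by simple addition, the same calculation works across any number of windows or hosts.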
Take Your Monitoring to the Next Level
Whether you’ve already started implementing more modern monitoring or are just starting the journey, it’s important to remember that it’s a gradual process. But as you begin to achieve even one of these modern monitoring characteristics, you’ll immediately realize how all of the powerful data you’re harnessing elevates the relevance of monitoring to your business’ success. And you’ll gain lots of other benefits as well, like faster problem identification and resolution; full visibility into all your metrics; better performance; reduced costs; and importantly, more confidence in the accuracy of your decisions.