In the dynamic world of IT, the way we monitor systems has seen a remarkable evolution. Gone are the days when monitoring was limited to basic server checks or infrastructure health. With the rise of cloud-native applications, serverless architectures, and container orchestration platforms like Kubernetes, the digital landscape has become a multi-dimensional maze.
Just as technology has advanced, so too have the tools and methodologies we use to keep watch over our systems and ensure they are both effective and efficient. Today, we are in the fourth generation of monitoring, a sophisticated era in which complexity has multiplied, but so have our capabilities.
This article provides business leaders with an understanding of the four generations of monitoring and observability platforms, along with a framework that businesses of any size can use to plan a monitoring strategy focused on efficient provisioning of resources and directly addressing user pain points.
The Four Generations of Monitoring
As infrastructures, applications, and networks have evolved, so too have our methods of monitoring them. To truly understand where we stand today, it’s essential to look back at the journey of monitoring across what we consider to be its four distinct generations.
First Generation: The Basics of Health Checks (~2005)
In the early days, monitoring was quite rudimentary. Tools like Pingdom offered basic health checks that posed a simple question: Is your website up or down? This binary perspective lacked nuance. There was no consideration for issues like degradation or partial outages; it was purely about site accessibility.
Second Generation: The Dawn of Infrastructure Monitoring (~2010)
By the late 2000s, the concept of monitoring had evolved. With the advent of tools like Datadog, the focus shifted from mere site accessibility to infrastructure health. Engineers could now identify things like high CPU usage on databases or low disk space on servers. While this provided a more in-depth look into system health, it was still limited. High CPU usage, for instance, could be a sign of efficient infrastructure usage rather than a problem, and it is certainly not something an on-call engineer wants to be woken up about in the middle of the night if it isn’t impacting the customer experience. So, while this generation added depth, it didn’t always correlate directly to user experience or platform usability. Often, this resulted in false alarms, commonly referred to as “false positives.”
Third Generation: Application Metrics Take Center Stage (~2013-2018)
The third generation, ushered in by tools like New Relic, marked a significant shift towards application-centric monitoring. No longer was monitoring just about infrastructure; it was about understanding how an application performed (i.e. your infrastructure may be fine but your application is struggling). With New Relic’s introspective Ruby monitoring, for instance, businesses could identify slow functions, database request hang-ups, and other application-specific issues. This level of granularity was a game-changer, allowing for more sophisticated and actionable insights.
Fourth Generation: Embracing Complexity in the Era of Kubernetes and Containers
Today, we find ourselves in the fourth generation of monitoring, a complex era dominated by Kubernetes, containers, pods, and more.
What does “infrastructure” even mean in this new environment? Is it the VM, the container on the VM, or the pod that the containers are part of? With numerous levels of abstraction, the challenge now is not just monitoring but understanding the myriad components that, despite surface-level figures (such as the high CPU percentage in the earlier example), could be healthy, auto-healing, or transient. Fourth-generation environments are ephemeral, with containers scaling up or down in mere minutes, so monitoring is not just about capturing data but about interpreting it in light of that transience.
Today’s enterprises need a single dashboard to analyze data from all of the aforementioned levels: infrastructure, infrastructure abstractions, applications, and even customer success metrics such as SLOs. They need both metrics and logs from each of those levels. And because of the complexity of today’s applications, traces are becoming even more important: not only traces through the different microservices for an application, but traces through various levels of infrastructure as well.
While the tools and techniques have evolved, the end goal remains the same—ensuring optimal user experience and system performance. It just becomes a lot harder to do that as the complexity of your system increases. As we venture further into this fourth generation of monitoring and observability, the importance of following the right strategy, informed by history and geared for the future, cannot be overstated.
Simply Put: Understanding Traces and Logs for Application Monitoring
Think of logs as a “vertical” look at everything a particular application has done across all requests. Traces, on the other hand, are “horizontal”; they track a single user request across all the applications it has touched.
In simpler applications, tracking requests is straightforward as there may be only one server, one database, and one application function. You don’t need traces since there’s only one path for a request. However, as systems grow in complexity, with multiple microservices, databases, and load balancers, tracking becomes complex. In such cases, when a user interacts with your website, it’s essential to know which services are touched. Traces help answer this.
With increasing complexity, both logs (vertical) and traces (horizontal) are needed to get a full picture of system health.
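To make the vertical and horizontal views concrete, here is a minimal Python sketch. The event records, service names, and the `trace_id` field are all made up for illustration, and real tracing systems record much richer data (such as timings and parent/child relationships). The point is only that the same events can be sliced per service (logs) or per request (traces).

```python
from collections import defaultdict

# Hypothetical events: each one records which service handled which request.
EVENTS = [
    {"trace_id": "req-1", "service": "api", "ts": 1, "msg": "received order"},
    {"trace_id": "req-2", "service": "api", "ts": 2, "msg": "received login"},
    {"trace_id": "req-1", "service": "billing", "ts": 3, "msg": "charged card"},
    {"trace_id": "req-1", "service": "email", "ts": 4, "msg": "sent receipt"},
    {"trace_id": "req-2", "service": "auth", "ts": 5, "msg": "issued token"},
]


def group_by(events, key):
    """Group events by the given field, keeping each group in time order."""
    groups = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        groups[event[key]].append(event)
    return dict(groups)


def services_touched(trace):
    """List the services one request passed through, in order."""
    return [event["service"] for event in trace]


# "Vertical" view: everything one service did, across all requests.
logs_by_service = group_by(EVENTS, "service")

# "Horizontal" view: everything that happened to one request, across services.
traces_by_request = group_by(EVENTS, "trace_id")

print(services_touched(traces_by_request["req-1"]))
print([event["msg"] for event in logs_by_service["api"]])
```

In this toy data set, the trace for `req-1` answers the question "which services did this user's request touch?", while the log for `api` shows everything that one service handled regardless of which user triggered it.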
Building Your Monitoring Strategy
Keep it Simple.
When first starting out, it’s common to either have no monitoring at all or to overdo it. Many organizations generate too many metrics and end up with so much data that they cannot find the signal. It’s like trying to find a needle in a haystack, and they pay a lot of money for the privilege. The key is to find a balance.
Focus on Customer Experience.
At the end of the day, the type of monitoring that’s most effective is that which highly correlates with customer pain and the overall customer experience. This has always been the case, going all the way back to first generation monitoring platforms—and this is where business leaders should focus when developing a monitoring strategy.
Start at a System-Wide Level.
This is where the “Four Golden Signals” come in, which were introduced by Google SREs:
- Latency: How long it takes to process a request.
- Traffic: How much demand is placed on the system, such as requests per second.
- Errors: The rate of requests that fail.
- Saturation: How close the network and servers are to full capacity.
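As a rough illustration of how these four signals could be computed for a single time window, consider the Python sketch below. The `Request` record, the choice of the 95th percentile for latency, and the definition of saturation as traffic divided by an assumed capacity figure are all simplifying assumptions for this example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarise one time window of requests as the four golden signals.

    capacity_rps is an assumed figure for how many requests per second
    the system can handle; real saturation measures vary by system.
    """
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": 0.0}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * len(durations) + 99) // 100)
    traffic_rps = len(requests) / window_seconds
    failures = sum(1 for r in requests if not r.ok)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": traffic_rps,
        "error_rate": failures / len(requests),
        "saturation": traffic_rps / capacity_rps,
    }


# Hypothetical window: 18 fast successes and 2 slow failures in 10 seconds.
sample = [Request(100, True)] * 18 + [Request(900, False)] * 2
signals = golden_signals(sample, window_seconds=10, capacity_rps=4)
print(signals)
```

A percentile is used for latency rather than an average because a small number of very slow requests, which is exactly what users complain about, can hide behind a healthy-looking mean.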
Your primary system-wide concerns should revolve around a couple of key issues. Firstly, consider latency at the system-wide level, evaluating both synchronous and asynchronous latency.
Synchronous latency measures how quickly the network or application responds to a request, while asynchronous latency measures how quickly a background process completes: for example, how swiftly a data input results in notifications being sent out. The longer this takes, the worse the user experience, leading to what we term “degradation.” While not a complete outage, a degraded system still impairs user satisfaction.
Errors are our second system-wide concern. Synchronous errors, such as issues with the user interface, directly hinder usage and can be equated to outages. On the other hand, asynchronous errors are concerning because they often indicate data loss. In such cases, the data input encounters an error and fails to produce the desired output, which could have serious consequences. Because it is an asynchronous process, however, you may not find out about it immediately. For example, an HR system that automatically pays employees every two weeks could fail in the middle of the night and, if your team is not alerted, your employees won’t get paid.
Using a fourth generation platform to base notifications on actual customer pain ensures fewer false positives (e.g. waking engineers up in the middle of the night for no reason, eventually leading to a “boy who cried wolf” scenario) and fewer false negatives (e.g. not receiving an alert for an actual issue). Going back to the high CPU example: in this scenario an alert may fire, but it would not page the on-call engineer in the middle of the night because the end user experience is not affected. This is an ideal scenario, made possible by fourth generation monitoring and observability platforms.
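The difference between paging on customer pain and paging on infrastructure figures can be sketched in a few lines of Python. The thresholds, field names, and the `evaluate` function below are hypothetical and chosen only for illustration. The idea is that only customer-facing symptoms decide whether someone is woken up, while infrastructure readings such as database CPU are kept as context.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p95_latency_ms: float
    error_rate: float
    db_cpu_percent: float


# Hypothetical thresholds; real values depend on your own service objectives.
MAX_P95_LATENCY_MS = 800
MAX_ERROR_RATE = 0.01
CPU_NOTE_PERCENT = 90


def evaluate(snapshot):
    """Decide whether to page, based only on customer-facing symptoms."""
    symptoms = []
    if snapshot.p95_latency_ms > MAX_P95_LATENCY_MS:
        symptoms.append("latency")
    if snapshot.error_rate > MAX_ERROR_RATE:
        symptoms.append("errors")

    # Infrastructure readings never page on their own; they are recorded
    # as context for whoever investigates.
    notes = []
    if snapshot.db_cpu_percent > CPU_NOTE_PERCENT:
        notes.append("high database CPU")

    return {"page": bool(symptoms), "symptoms": symptoms, "debug_notes": notes}


# High CPU, but customers are unaffected: nobody is woken up.
quiet_night = evaluate(Snapshot(p95_latency_ms=200, error_rate=0.001, db_cpu_percent=97))

# Same CPU reading, but customers now see slow responses: page the engineer.
bad_night = evaluate(Snapshot(p95_latency_ms=1500, error_rate=0.001, db_cpu_percent=97))

print(quiet_night)
print(bad_night)
```

In both cases the high CPU reading is retained, so it is still available as a clue once a customer-facing symptom has justified the page.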
By focusing on system-wide latency and error issues that actually impact customer experience, your team can accurately identify when customers are facing challenges, and alert the proper engineer or technical team accordingly. It’s then the responsibility of the technical team to determine the root causes, but these metrics will ensure the proper people are aware of actual issues—and that they won’t get bogged down by false negatives and positives, which impact their ability to solve root-cause problems (and get a good night’s sleep).
Note: These are alerting metrics. They won’t help debug the problem, but they signal that it exists, so they are what should trigger alerts and prompt immediate investigation. Metrics like database CPU belong in a different category: high CPU may be correlated with an issue, but it’s not necessarily causing it. This is where debuggability comes in: give your engineers debugging metrics and tools so they can diagnose faster when they are justifiably woken up in the middle of the night by a credible issue.
All combined, you can now more accurately identify when there are real customer issues, and achieve faster mean time to resolution (MTTR).
Continue to Subsystems and Services.
Here’s the beauty of this framework: Now that we have addressed alerting and debugging at the system level, we can simply replicate this approach for each subsystem—be it the alerting, ingestion, UI, or API systems. It’s important to note that these subsystems may be composed of various microservices, each managed by different teams.
For each subsystem:
- Determine how customer pain manifests for that specific subsystem.
- Establish alerts based on these pain points.
- Develop both application and infrastructure debugging metrics to accompany these alerts.
And, once again, you can repeat this very same process at the service level.
During disruptions, this nested alert framework enables rapid issue assessment and swift identification of the core problem. You will know with a high level of accuracy if and when the entire system exhibits issues. From there, your on-call person can pinpoint the specific subsystem that is alerting and promptly notify the responsible team. Once alerted, this team can discern the problematic service and examine the application, if needed. Leveraging debugging metrics, they can triage, consult the appropriate expert, and reduce MTTR.
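The drill-down described above can be pictured as walking a tree from the system to the subsystem to the service. The Python sketch below assumes a made-up hierarchy, made-up team names, and a simple `alerting` flag on each level; it follows only the first alerting branch, whereas a real incident may involve several at once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    owner: str
    alerting: bool = False
    children: list = field(default_factory=list)


def drill_down(node):
    """Follow alerting nodes from the top level to the deepest alerting one."""
    path = []
    current = node
    while current is not None and current.alerting:
        path.append(current)
        # Move to the first child that is also alerting, if there is one.
        current = next((c for c in current.children if c.alerting), None)
    return path


# Hypothetical hierarchy: system -> subsystems -> services.
system = Node("platform", "on-call", alerting=True, children=[
    Node("ui", "frontend-team"),
    Node("ingestion", "data-team", alerting=True, children=[
        Node("parser", "data-team"),
        Node("writer", "storage-team", alerting=True),
    ]),
])

path = drill_down(system)
print(" -> ".join(node.name for node in path))
print("notify:", path[-1].owner)
```

Because every level uses the same kind of customer-pain alert, the on-call person only needs to follow the alerting branch downward to find which team to notify.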
If your system is not complex, you may only need to focus on the system as a whole. However, you should still employ a fourth generation monitoring and observability platform that can cost-effectively scale its capabilities and pricing as your company grows.
Monitoring and observability tools have transitioned from basic health checks to sophisticated, multi-dimensional platforms capable of interpreting the complexities of today’s infrastructures. However, they cannot be correctly or efficiently leveraged without a savvy approach.
For businesses of any size, and every level of monitoring complexity, the key to success lies in understanding the shifts across the four “generations” of monitoring, focusing on actual customer pain points, and employing a layered, systemic approach. By doing so, businesses can ensure optimal user experiences, swift problem resolutions, and ultimately, a more resilient digital infrastructure for the future.