Correlation in monitoring and observability refers to the process of analyzing different types of data to identify and understand relationships between application, network, and infrastructure behavior. Correlating these data sets can help IT teams identify all technology components contributing to or impacted by a performance or reliability issue, thereby empowering them to identify root cause and troubleshoot faster. It is a way of stitching together pieces of information from different sources to create a complete picture of what is happening.
Often this information comes in the form of metrics, semi-structured logs, and traces. The ability to efficiently correlate the metrics, logs, and traces from different sources leads to shorter downtimes, decreased outage costs, and ultimately better user experiences. The converse is also true and unfortunately a reality for many organizations – the inability to quickly correlate data is a primary reason outages are lasting longer, resulting in unhappy customers, lost revenue, damage to the organization’s reputation, and lost opportunities for growth.
In this post, I’ll highlight why correlation is such a challenge today, and how we’re addressing this at Circonus.
Swivel chair monitoring
It’s harder than ever to correlate metrics, traces, and logs, particularly because many organizations are suffering from monitoring tool sprawl. Tools from different vendors are often not compatible, creating silos of data in different formats. Each tool is often owned by different teams and has different approaches to tagging and other contextual metadata, making it difficult to compare and correlate.
As a result, organizations often resort to cross-organizational war rooms, manually correlating and stitching together the data from the different tools. This manual approach to correlating metrics, traces, and logs across multiple screens and tools is what we call “swivel chair correlation.” It’s time-consuming, with engineers frantically switching back and forth across screens to determine the cause of an outage or significant performance degradation. The process is error-prone, makes it hard to distinguish correlation from causation, and makes it easy to mistake symptoms for causes – all of which ultimately leads to longer outages.
As an example, if you were to see a drop in transaction throughput and an increase in response times for the same or related transactions, you could infer that longer transaction response times would reduce the number of transactions that could be processed in a given interval. Therefore, you’d want to correlate the supporting application and infrastructure components to understand what caused the latency increase.
If the slowdown is occurring across the entire application, you would typically start by triaging the common or shared components looking for anomalous behavior or resource exhaustion across application and infrastructure components.
If it’s isolated to one or two serverless functions, containers, or servers, then depending on the nature of the issue – whether it’s a performance, reliability, or availability problem – you would triage those components and analyze service, trace, and span response times, any errors reported in the logs, the code version or configuration deployed to those components, and the health of the underlying infrastructure, such as CPU, memory, and disk performance.
By this time, with a collection of tools, you’re well into swivel chair triage and correlation.
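The inverse relationship described above – latency rising as throughput falls – can be sanity-checked numerically once both series are on a shared timeline. A minimal sketch in plain Python (the sample values and series names are hypothetical, not Circonus output):

```python
# Hypothetical per-minute samples from two different monitoring sources.
throughput = [120, 118, 121, 95, 70, 62, 60, 58]       # transactions/min
latency_ms = [210, 215, 208, 340, 520, 580, 610, 640]  # avg response time

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(throughput, latency_ms)
print(f"correlation: {r:.2f}")  # strongly negative: latency up, throughput down
```

A strongly negative coefficient supports (but does not prove) the inferred relationship; confirming causation still requires triaging the supporting components.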
One of the holy grails of the industry has been to build correlation engines. In practice, correlation engines are prone to false positives because they typically rely purely on time, have limited context, and have poor or non-existent metadata to support correlation. The tools often see the same component under different names or IP addresses, making it difficult to establish that teams are even talking about the same component across the accumulated tools, let alone to establish clear relationships across distributed components.
Some tools use topology-based correlation, which relies on periodic discovery and dependency mapping to establish correlations. However, these tools create largely static views of your topology that break down in today’s ephemeral, highly dynamic Kubernetes and cloud environments, where resources may exist for mere seconds.
Circonus Speeds Correlation Across the Full IT Stack
Circonus is unique in that we take a multi-layer approach to correlation – ultimately helping to ensure accuracy and eliminate false positives.
Circonus provides the ability to aggregate metrics, traces, and logs from across your applications, infrastructure, and network and correlate this data using unified dashboards, time, and search.
Unified Dashboards and Time-Based Correlation
Within one dashboard, users can correlate related metrics, traces, and logs across charts in a single click and view to quickly identify root cause. All data is updated in real-time, and if you specify a time period, all of the charts on your dashboard will update instantly to that particular time period.
The following Circonus unified dashboard animation shows how you can highlight a specific time period to correlate log errors with related throughput metrics and application latency all in one view.
While providing a consolidated dashboard correlated across time is powerful, Circonus recognizes that to meet the challenges of dynamic, ephemeral environments, you need to do more. We also enable search-based correlation, which allows you to identify and search for patterns in unstructured or semi-structured data such as logs that correlate to changes in metric behavior.
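To illustrate the idea behind search-based correlation, the sketch below scans semi-structured log records for an error pattern that falls inside a metric anomaly window. The record shape, timestamps, and search term are hypothetical – this is the concept, not the Circonus query API:

```python
from datetime import datetime, timedelta

# Hypothetical semi-structured log records.
logs = [
    {"ts": datetime(2024, 5, 1, 12, 1), "msg": "connection pool exhausted"},
    {"ts": datetime(2024, 5, 1, 12, 3), "msg": "request ok"},
    {"ts": datetime(2024, 5, 1, 12, 4), "msg": "connection pool exhausted"},
    {"ts": datetime(2024, 5, 1, 14, 0), "msg": "request ok"},
]

# Anomaly window detected in a metric (e.g., a latency spike).
anomaly_start = datetime(2024, 5, 1, 12, 0)
anomaly_end = anomaly_start + timedelta(minutes=10)

# Search the logs for an error pattern inside the anomalous window.
suspects = [
    rec for rec in logs
    if anomaly_start <= rec["ts"] <= anomaly_end and "exhausted" in rec["msg"]
]
print(len(suspects))  # 2 candidate log lines correlated with the metric anomaly
```

Narrowing the search to the anomaly window is what keeps this tractable in high-volume, ephemeral environments where full-history scans are too slow.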
Circonus is Metrics 2.0 compliant, meaning that we tag metadata to each item we monitor. By doing this, we’re able to quickly establish relationships between components and metrics, traces, and log data that used to be established through dependency management, as well as compare data in an “apples to apples” way.
It’s important to note that the Elastic Common Schema (ECS) is the metadata schema we chose because it’s a standard already familiar to many engineers. In April 2023, Elastic announced it was contributing ECS to the OpenTelemetry (OTel) project, which is the second highest-velocity project in the Cloud Native Computing Foundation (CNCF).
Altogether, the search and tagging capabilities we have built into our backend enable efficient, accurate contextualization and correlation, and let users query all their data using the same schema.
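To illustrate how shared metadata makes “apples to apples” comparison possible, here is a minimal sketch that treats two records as referring to the same component when their identity tags match. The record shapes are hypothetical; the field names `service.name` and `host.name` are taken from ECS:

```python
# Hypothetical tagged records in the spirit of Metrics 2.0 / ECS-style metadata.
metric = {"name": "cpu.utilization", "value": 97.5,
          "tags": {"service.name": "checkout", "host.name": "node-7"}}
log = {"message": "OOMKilled container restarted",
       "tags": {"service.name": "checkout", "host.name": "node-7"}}
trace_span = {"span": "POST /cart", "duration_ms": 1840,
              "tags": {"service.name": "checkout", "host.name": "node-3"}}

def related(a, b, keys=("service.name", "host.name")):
    """Two records refer to the same component if their identity tags match."""
    return all(a["tags"].get(k) == b["tags"].get(k) for k in keys)

print(related(metric, log))         # True: same service and host
print(related(metric, trace_span))  # False: different host
```

Because the join happens on stable tags rather than on discovered names or IP addresses, it holds up even when the underlying containers are replaced every few seconds.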
As an engineer, this means you and your team can correlate data from across your full stack in ways that were previously impossible. Importantly, it also eliminates the risk of false positives and of correlating things that are in fact unrelated.
No more jumping from one tool to the next and back again. No more jumping from one tab to another and between various dashboards. No more manual calculations.
This streamlined process offers several advantages, including cost savings, standardized data formats from multiple sources, reduced errors, and faster mean time to resolution (MTTR) by eliminating manual effort. The result is often a 5x–10x improvement in efficiency.
Despite the industry having more monitoring tools than ever before, the reality according to the Uptime Institute’s Annual Outages Analysis for 2023 is that IT outage times are increasing, and a primary reason is the lack of automated, comprehensive correlation solutions. By consolidating all data into one platform with a unified dashboard, and by using multiple methods of correlation to ensure accuracy and eliminate false positives, Circonus enables a faster, more accurate approach to full-stack data correlation.
Swivel chair not included.