The Uptime Institute recently released its Annual Outage Analysis 2023 report. Overall, the report highlights the increasing costs, frequency, and duration of outages, the prominent role of cloud and digital services in outages, the shortcomings of service providers, and the need to address human error and management failures. It also underscores the ongoing challenges of handling failures in complex distributed architectures.

In the following post, we highlight some of the headlines from the report, followed by three solutions we believe are essential to decreasing outage times (or preventing outages in the first place), yet are missing from many modern monitoring platforms today.

Report summary

Here are the headline issues from the analysis, followed by our take on solutions:

Outages are becoming more frequent:

According to EMA, 41% of organizations experience at least one significant outage per month.

Outages are becoming more expensive:

70% of outages in 2022 cost over $100,000, compared to 40% in 2019. This trend is expected to continue as reliance on digital services grows.

Outages are taking longer to resolve:

In 2022, 32% of outages lasted longer than 12 hours, while 16% exceeded 48 hours in duration. These extended outages can have severe consequences for businesses and their customers.

Legacy service providers are missing the mark:

According to the report, “The frequency/duration of outages strongly suggests that the actual performance of many service providers falls short of SLAs. Customers should not consider SLAs (or 99.9x% availability figures) as reliable predictors of future availability of service providers.”

Cloud and digital services represent the lion’s share:

Cloud, Software as a Service (SaaS), and digital services accounted for 80% of public outages in 2022, up from 66% in 2016. This growing share underscores the expanding role and significance of these services in modern infrastructure.

Software and humans share responsibility:

Software or configuration errors were responsible for 65% of outages in cloud services. Additionally, more than 87% of outages were attributed to human error and/or management failures. These findings emphasize the need for improved tools and processes to mitigate such errors.

Our take

Despite decades of industry development and increasingly “sophisticated” monitoring solutions, outage resolution times continue to rise as the industry grapples with effectively managing failures in multi-cloud distributed architectures and networks. We think this can be attributed to the following factors:

Disparate tools and manual correlation

Today’s IT teams face the challenge of integrating information from multiple vendors’ monitoring tools in order to obtain the data they need to be effective. This tool sprawl results in data silos, the lack of a single source of truth, and visibility gaps. This forces IT teams to employ “swivel chair correlation” practices that make it increasingly difficult to accurately and efficiently correlate metrics, traces, and logs, thereby slowing mean time to resolution (MTTR).

Solution:

Consolidate monitoring into one platform so that you can visualize and correlate metrics, traces, and logs from your full environment – infrastructure, cloud, applications, and network – in a single view. But be sure to understand how a unified platform actually automates correlation: it should take a multi-layered approach to ensure accuracy and eliminate false positives, and combining time-based and search-based correlation is optimal, as sketched below.
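As a rough illustration of that multi-layered idea, here is a minimal Python sketch. The event records, field names, and 30-second window are hypothetical and greatly simplified; a real platform would pull candidates from its unified metric, trace, and log stores.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified event records spanning metrics, logs, and traces.
events = [
    {"type": "metric", "name": "api.latency.p99", "ts": datetime(2023, 5, 1, 12, 0, 5),
     "tags": {"service": "checkout"}},
    {"type": "log",    "name": "ERROR timeout",    "ts": datetime(2023, 5, 1, 12, 0, 7),
     "tags": {"service": "checkout"}},
    {"type": "trace",  "name": "span:charge-card", "ts": datetime(2023, 5, 1, 12, 3, 0),
     "tags": {"service": "payments"}},
]

def correlate(anchor, candidates, window=timedelta(seconds=30), key="service"):
    """Two-layer correlation: keep candidates that are close in time
    (time-based) AND share a searchable attribute (search-based)."""
    related = []
    for event in candidates:
        in_window = abs(event["ts"] - anchor["ts"]) <= window
        same_key = event["tags"].get(key) == anchor["tags"].get(key)
        if in_window and same_key:
            related.append(event)
    return related

# Correlate everything against the latency metric that triggered an alert.
print(correlate(events[0], events[1:]))  # -> only the checkout ERROR log
```

Requiring both layers to agree is what cuts down false positives: a log line that merely happens to land in the same time window, but belongs to an unrelated service, is filtered out.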

Data collection and storage issues

The amount of telemetry data is exploding. Unfortunately, most tools limit data volume by sampling data rather than collecting it all. This significantly impacts accuracy and causes insights to be missed.

Moreover, storing logs is so expensive that organizations delete them at regular intervals, particularly during performance issues, when more logs are generated. So the data is gone precisely when it is needed most, whether for live troubleshooting or for post-mortems.

Solution:

Instead of sampling telemetry data, IT teams can use histograms to collect all of their data and store it indefinitely at low cost. This enables more accurate anomaly detection, trend analysis, historical analysis, and SLO calculation.
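To make the histogram idea concrete, here is a minimal Python sketch, assuming made-up latency samples and a deliberately coarse log-scale bucketing scheme. Production log-linear histograms use far finer buckets, but the principle is the same: every sample is counted, yet only bucket counts need to be stored.

```python
import math
from collections import Counter

def bucket(value):
    """Map a raw latency sample (ms) to a coarse log-scale bucket edge."""
    if value <= 0:
        return 0.0
    exp = math.floor(math.log10(value))
    return float(math.floor(value / 10**exp) * 10**exp)

def to_histogram(samples):
    """Aggregate every sample into bucket counts instead of sampling."""
    return Counter(bucket(s) for s in samples)

def quantile(hist, q):
    """Approximate a quantile directly from the bucket counts."""
    total = sum(hist.values())
    target = q * total
    seen = 0
    for edge in sorted(hist):
        seen += hist[edge]
        if seen >= target:
            return edge
    return max(hist)

samples = [3.2, 4.1, 5.0, 47.0, 52.3, 48.8, 610.0, 7.5, 6.1, 5.9]
hist = to_histogram(samples)   # e.g. Counter({5.0: 2, 40.0: 2, 3.0: 1, ...})
print(quantile(hist, 0.99))    # approximate p99 computed from buckets alone
```

Because the stored footprint grows with the number of buckets rather than the number of samples, the full distribution can be kept indefinitely and re-queried later for SLOs or historical analysis.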

Your monitoring solution should also be able to convert logs to metrics. This lets you store the data cost-effectively over the long term and run trend and behavioral analyses on metrics that are not possible with raw logs. The insights from this analysis are invaluable for preventing future issues and shifting from reactive to proactive monitoring.
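As a simple illustration of the log-to-metric pattern, the sketch below counts ERROR lines per minute from hypothetical log lines. The log format and pattern are assumptions for the example; the point is that the derived counts are cheap to keep long after the raw logs are rotated away.

```python
import re
from collections import Counter

# Hypothetical raw log lines; in practice these stream from your log pipeline.
log_lines = [
    "2023-05-01T12:00:03Z ERROR checkout timeout connecting to payments",
    "2023-05-01T12:00:41Z INFO  checkout request completed",
    "2023-05-01T12:01:15Z ERROR checkout timeout connecting to payments",
]

def logs_to_metric(lines, pattern=r"\bERROR\b"):
    """Derive an 'errors per minute' metric series from raw log lines."""
    series = Counter()
    for line in lines:
        if re.search(pattern, line):
            minute = line[:16] + ":00Z"  # truncate the timestamp to the minute
            series[minute] += 1
    return dict(series)

print(logs_to_metric(log_lines))
# {'2023-05-01T12:00:00Z': 1, '2023-05-01T12:01:00Z': 1}
```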

Alert lag

At a time when the average time to resolve major incidents and outages is increasing across the industry, every minute counts in detecting incidents. Unfortunately, most monitoring solutions today delay alerts by at least ten minutes (often much longer) after an incident begins. This detection delay comes from periodically polling the database where metrics are stored.

Solution:

Ensure your monitoring platform supports real-time streaming alerting. It should evaluate every metric as it arrives, rather than waiting for it to be processed and stored in the database. This lets you surface issues immediately so you can resolve them before they become outages – and before customers notice them.
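Here is a minimal sketch of what evaluating on ingest looks like, assuming a hypothetical rule that fires when a latency metric stays above a threshold for three consecutive samples. The rule, threshold, and metric name are illustrative only; the key difference from polling is that the check runs the moment each sample arrives.

```python
from collections import defaultdict, deque

THRESHOLD_MS = 500   # hypothetical alert threshold
CONSECUTIVE = 3      # consecutive breaching samples required to fire

# Rolling window of the most recent samples per metric.
recent = defaultdict(lambda: deque(maxlen=CONSECUTIVE))

def fire_alert(metric_name, values):
    print(f"ALERT: {metric_name} above {THRESHOLD_MS}ms for {CONSECUTIVE} samples: {values}")

def on_ingest(metric_name, value):
    """Called for every metric sample the moment it arrives, before storage."""
    window = recent[metric_name]
    window.append(value)
    if len(window) == CONSECUTIVE and all(v > THRESHOLD_MS for v in window):
        fire_alert(metric_name, list(window))

# Simulated ingest stream; the alert fires on the fourth sample, not minutes later.
for v in [320, 610, 720, 680, 410]:
    on_ingest("api.latency.p99", v)
```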

Final thoughts

The ability to collect ALL data, analyze and alert on it in real time, store it indefinitely for historical analysis, and, critically, correlate it efficiently in a way that accurately pinpoints root cause is possible today. Together, these capabilities are critical not only for reacting faster, but for preventing outages before they happen in the first place.

Check out these capabilities for yourself by signing up for a free Circonus trial.
