This past spring, Ron DeSantis used Twitter Spaces to launch his presidential campaign.

At least, he tried to.

As you may remember, the event was marred by technical difficulties, resulting in false starts, confused hosts, glitches, echoes, and the “melting” of servers. Of the more than 600,000 Twitter users who initially tuned in, fewer than half remained by the time the hosts relaunched the event from a different account.

Outage Costs

Such outages result in significant real costs, including the opportunity cost of lost business. In fact, according to NetBlocks, an outage in the United States alone costs Twitter $13,962,513 per hour.

There’s also internal and external reputational damage to consider. Judging by the reaction on Twitter at the time—indicated by the trending hashtag, #Desaster—what should have been a shining moment for the platform turned into a reputational disaster for Twitter, Elon Musk, and DeSantis.

Not to mention, more than 300,000 voters left in the lurch.

These types of outages can often be traced back to two causes: tech debt and visibility gaps.

Common Causes of Outages

Tech Debt

Originally coined in the 1990s by Agile Manifesto co-author Ward Cunningham, the term “tech debt” refers to the latent expense incurred by failing to address issues that will impact a business in the future. Allowing technical problems to persist leads to their worsening over time, and the longer this debt accumulates, the more expensive it becomes to resolve.

Tech debt most commonly accrues when companies try to “move fast and break things.” Cunningham theorized that shipping first-time code is like going into debt. When managed well, a small debt can expedite development, but it must be repaid promptly through ongoing refactoring. The real risk arises when the debt remains unpaid: every moment invested in imperfect code accrues as interest on that debt, and the burden of an unconsolidated implementation can bring an entire engineering organization to a halt.

One way to mitigate the impact of tech debt as new products and features are rolled out is to supplement the system with human labor. Though a temporary crutch, implementing a robust workforce capable of handling manual tasks is often seen as a “good enough” solution in the short term.

However, if an organization fails to hire enough employees to handle the technical management of the system, tech debt can accrue very quickly.

Visibility Gaps

Another common reason such outages happen is the inability to correlate data between disparate monitoring systems.

Correlation in monitoring refers to the process of analyzing different types of data – often metrics, traces, and logs – to identify and understand relationships between application, network, and infrastructure behavior. Correlating these data sets can help IT teams identify the root cause of a performance issue, so it can be resolved quickly before turning into a major (or long-term) outage.

However, correlating metrics, traces, and logs is challenging when using multiple monitoring tools (a reality for many organizations), as each tool is often owned by a different team and takes a different approach to tagging and other contextual metadata.

As a result, organizations often resort to cross-organizational war rooms, manually correlating and stitching together data from the different tools. It’s time-consuming and error-prone, with engineers often frantically switching back and forth across screens to determine the cause of an outage or significant performance degradation.
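To make the stitching problem concrete, here is a minimal Python sketch that uses hypothetical data shapes rather than any particular vendor’s format. When logs and spans share a trace ID, joining them is a simple lookup; when they don’t, teams fall back to fuzzy time-window matching, which is essentially what a war room does by hand.

```python
from datetime import datetime, timedelta

# Spans exported by a (hypothetical) tracing tool.
traces = [
    {"trace_id": "abc123", "service": "checkout",
     "start": datetime(2023, 5, 24, 18, 5, 2), "duration_ms": 4800},
]

# Error logs from a separate (hypothetical) logging tool.
logs = [
    {"trace_id": "abc123", "ts": datetime(2023, 5, 24, 18, 5, 4), "msg": "upstream timeout"},
    {"trace_id": None, "ts": datetime(2023, 5, 24, 18, 5, 5), "msg": "connection pool exhausted"},
]

def correlate(spans, log_records, window=timedelta(seconds=30)):
    """Attach logs to spans, preferring exact trace ID matches over time proximity."""
    for span in spans:
        exact = [rec for rec in log_records if rec["trace_id"] == span["trace_id"]]
        nearby = [rec for rec in log_records
                  if rec["trace_id"] is None
                  and abs((rec["ts"] - span["start"]).total_seconds()) <= window.total_seconds()]
        yield span, exact, nearby

for span, exact, nearby in correlate(traces, logs):
    print(span["service"], "->", [r["msg"] for r in exact],
          "| matched by time only:", [r["msg"] for r in nearby])
```

The fuzzy branch is exactly where errors creep in: without shared tags or trace context, a 30-second window is a guess, and every tool in the mix has its own guess.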

An Ounce of Prevention Is Worth a Pound of Cure

Things break and outages happen. It’s a matter of when, not if. In fact, according to EMA, 41% of organizations experience at least one significant outage per month.

The most important factors in outage prevention and correction are ensuring there are enough skilled personnel to support the existing system (including those with key institutional knowledge) and practicing continual improvement, which includes making time for regular, blameless retros. Beyond these factors, there are a few additional things organizations can do to mitigate such situations.

Make the system more adaptable and elastic

Embrace safeguards such as capacity automation, which minimizes issues like service or disk capacity shortages, and auto-scaling, which enables organizations to scale cloud services such as server capacity or virtual machines up or down automatically, based on defined conditions.

Doing so can mitigate the risk of human error—which is responsible for 87% of outages, according to the Uptime Institute 2023 Annual Outage Analysis—and minimize vulnerabilities arising from fragile and poorly supported architecture.
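As an illustration only, here is a minimal Python sketch of the kind of scaling rule such safeguards codify. The replica math and thresholds are hypothetical, and real cloud auto-scalers implement far more robust versions of this loop; the point is that the decision is encoded and repeatable rather than left to whoever happens to be on call.

```python
def desired_replicas(current: int, cpu_utilization: float, target: float = 0.6,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Scale in proportion to observed load, clamped to safe bounds."""
    if cpu_utilization <= 0:
        return max(current, min_replicas)
    proposed = round(current * (cpu_utilization / target))
    return max(min_replicas, min(max_replicas, proposed))

# Example: 4 replicas running at 90% CPU against a 60% target -> scale out to 6.
print(desired_replicas(current=4, cpu_utilization=0.9))
```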

Release carefully, prioritize capabilities, and handle bugs as they occur

Today’s companies increasingly release beta versions of products in an effort to move quickly. By releasing in a focused, rigorous manner and building observability into your pipelines, organizations can detect and trace the origins of bugs as they occur. Reacting to a bug a day after it ships, rather than months or years later, lets you keep changes small and more accurately track their effects.
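One lightweight way to get that traceability, sketched below in Python, is to stamp every log event with the version of the code that produced it; the RELEASE_VERSION environment variable and the checkout example are hypothetical stand-ins for whatever your deploy pipeline provides. Error spikes can then be pinned to the release that introduced them.

```python
import logging
import os

# Hypothetical: assume the deploy pipeline exports the running release version.
RELEASE = os.environ.get("RELEASE_VERSION", "unknown")

logging.basicConfig(format="%(asctime)s release=%(release)s %(levelname)s %(message)s")
log = logging.LoggerAdapter(logging.getLogger("checkout"), {"release": RELEASE})

def process_order(order_id: str) -> None:
    try:
        raise RuntimeError("inventory service unavailable")  # stand-in for a real failure
    except Exception:
        # Any dashboard that groups errors by the `release` field now shows
        # which deploy the regression shipped in.
        log.exception("order processing failed order_id=%s", order_id)

process_order("ord-42")
```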

It’s also possible that fixing a particular bug is not the best course of action at the time. However, as an organization, it is important to intentionally dedicate some development resources to analyzing root causes, prioritizing the capabilities you most need to improve, and shoring up the most fundamental pieces.

If your customers are the ones making you aware of problems, you might want to prioritize earlier detection. Or you may want to prioritize better observability, so that you actually know when something has gone wrong and can direct targeted manual effort at it. There are several ways to improve mean time to resolution (MTTR) and contain a problem until you can get around to actually fixing it.

Choose a consolidated monitoring and observability platform

Employing multiple monitoring tools is no longer realistic. Today’s IT environments generate significantly more data than ever before. Combine this with the expectations users have for performance and the brand damage caused by outages, and it’s clear that more sophisticated monitoring platforms are an essential investment. Choose solutions that can ingest and analyze all of your data – metrics, traces, and logs – from across your full environment. Having this single source of truth is essential to quickly identifying root cause and solving issues before users ever notice them.
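As one example of what a single source of truth looks like at the instrumentation level, the sketch below uses the vendor-neutral OpenTelemetry Python API (an assumption; the post doesn’t prescribe a specific toolkit) so that a log line carries the same trace ID as the span around it, letting a consolidated platform pivot from a latency spike directly to the relevant logs. A real deployment would also configure an SDK and exporter to actually ship the data.

```python
import logging

from opentelemetry import trace  # pip install opentelemetry-api

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout")

def charge_card(order_id: str) -> None:
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)  # hypothetical attribute name
        trace_id = format(span.get_span_context().trace_id, "032x")
        # The same trace ID appears on the span and in the log line, so a
        # consolidated backend can join the two signals automatically.
        log.info("charging card trace_id=%s order_id=%s", trace_id, order_id)

charge_card("ord-42")
```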

Final Thoughts

The Ron DeSantis / Twitter Spaces mishap is just one example of the kinds of technical outages that occur quite commonly across organizations of all sizes.

Thankfully, engineering teams can take a common sense approach to preventing such situations that includes capabilities prioritization, technical safeguards, more intentional releases and bug tracking, and using a monitoring platform that can unify and correlate all observability data.

Given the costs—in this case, $13,962,513 per hour, damaged reputations, and 300,000+ disenchanted voters—deciding against putting such safeguards in place is most often a “pennywise and pound foolish” proposition.
