10 Challenges to Expect When You Scale Your Monitoring

This post includes contributions from Riley Berton, Principal SRE at Major League Baseball.

You started the year with one Kubernetes cluster and now you have 100. How do you deal with that? This is a reality for many SREs, and as organizations scale their monitoring to address the growing complexity of their IT environments, SREs will inevitably encounter challenges. The key is to know what challenges to expect, so you can be prepared rather than surprised.

In this post, we share a quick checklist of the ten monitoring and observability challenges we have faced as SREs when our organizations scaled. We experienced these challenges in seriously painful ways, and unfortunately for us, we weren’t aware these would be issues until they had already gone wrong.

1. No data governance. As organizations grow, you need processes that define how data should be collected, how it should be named, and who has access to it. Without data governance, SREs can’t find important data amid all the junk that’s been submitted, and data that should be thrown away sits around longer than it should, increasing both financial and legal liability. When you grow, you need both good governance and a good data discovery tool, which leads to the next item.
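To make the governance idea concrete, here is a minimal sketch of a check that could run before a new metric is accepted. Everything specific in it is an assumption for illustration: the `<team>.<service>.<metric>` naming convention, the data classes, the retention limits, and the `MetricSpec` and `governance_violations` names are hypothetical and do not come from any particular tool.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: <team>.<service>.<metric>, lowercase, dot-separated.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2}$")

# Hypothetical retention limits in days, keyed by data class.
MAX_RETENTION_DAYS = {"debug": 7, "operational": 90, "business": 395}


@dataclass
class MetricSpec:
    name: str
    owner: str
    data_class: str
    retention_days: int


def governance_violations(spec: MetricSpec) -> list[str]:
    """Return human-readable policy violations for one metric (empty if none)."""
    problems = []
    if not NAME_PATTERN.match(spec.name):
        problems.append(f"name '{spec.name}' does not match <team>.<service>.<metric>")
    if not spec.owner:
        problems.append("no owner recorded")
    limit = MAX_RETENTION_DAYS.get(spec.data_class)
    if limit is None:
        problems.append(f"unknown data class '{spec.data_class}'")
    elif spec.retention_days > limit:
        problems.append(f"retention {spec.retention_days}d exceeds {limit}d limit")
    return problems
```

A check like this is only useful if it runs where metrics are registered, so that badly named or unowned data is rejected before it becomes part of the noise.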

2. Lack of data discovery tools. As you scale and collect more data, it will become more difficult to know where to find data or to understand the context behind it, which will quickly become a significant issue when troubleshooting, setting alerts, and planning capacity. In addition to the processes, you need the tooling in place to quickly identify the location of your data and know its context. If you’re running a large-scale architecture without this, you’ll waste time and may even “double up” on monitoring data that was already there, simply because you couldn’t find it.

3. Onboarding new engineers. There is no organized or bite-sized way for new engineers to learn about SRE topics. As your monitoring scales and you grow your team, expect, and account for, time to learn. For example, each engineer brings a certain level of experience in particular tooling, so any effort to unify different tools under one platform will make some of that expertise less useful. Anything that eases and facilitates education and knowledge-sharing is critical. Good organizational processes that let engineers know where educational information lives and where to look for it are essential.

4. Uncertainty on how to handle incident response. Teams cannot be siloed in their approach to incident response when their organization scales. Scaling requires that organizations have a unified approach across all teams on how to respond to incidents. Having a centralized platform that uses one common language and is the single source of truth for all data makes this easier and allows different teams to understand how issues in one department may impact business services in another.

5. New business metrics. As your organization scales, SREs need to understand how their work improves the business as a whole. This means that SREs will now need to pull non-systems and non-application data into their observability platform to do more sensible reporting and correlation, even if it’s just visual. Correlating your KPIs with the overall business becomes imperative to prioritizing what’s critical, allowing you to focus time and resources on only those projects that optimize the entire organization.

6. Acquisition of new technologies that have different tooling. Inevitably, when you scale, you will purchase new solutions. Handling this effectively requires homogenization of policy, process, and approach. This will temporarily compromise engineers’ productivity and capabilities because you’re taking away functionality that was useful for them. Be prepared for a painful (at first) compromise.

7. Incomplete planning. Capacity planning and resource projections, like projected usage and projected billing, will require more sophistication as you grow. Capacity planning is often more complicated than a simple regression analysis on a single trending metric; it often requires complex compositing and filtering of information prior to analysis. You need robust data science tooling to answer the “real life” questions. If your data lives in a tool that doesn’t make planning and projections easy, it’s a lot of “not fun” work, but doing that work is more important than having the system guess at behavior. Simple capacity planning on an ongoing basis should always come before sophisticated anomaly detection like AIOps, because everyone needs to be on the same page about growth.
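As a rough illustration of "compositing and filtering prior to analysis", the sketch below sums usage across several clusters, drops days you know are unrepresentative, and only then fits a straight line to project when capacity runs out. It is deliberately the simplest possible version: the function names, the input shape, and the choice of a linear trend are all assumptions for the example, and real planning will usually need seasonality and more careful modeling.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(daily_usage_by_cluster, capacity, exclude_days=()):
    """Composite per-cluster usage, drop known-bad days, project the trend.

    daily_usage_by_cluster: dict of cluster name -> list of daily usage values
    capacity: total capacity, in the same unit as the usage values
    exclude_days: day indexes to ignore (for example, a one-off load test)
    Returns the number of days after the last observation at which the fitted
    trend reaches capacity, or None if usage is flat or shrinking.
    """
    series = list(daily_usage_by_cluster.values())
    total = [sum(day) for day in zip(*series)]  # composite across clusters
    points = [(d, v) for d, v in enumerate(total) if d not in exclude_days]
    xs, ys = zip(*points)
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (len(total) - 1))
```

For example, two clusters whose combined usage grows by 4 units a day from a base of 30 would reach a capacity of 62 five days after a four-day observation window, and excluding a load-test spike keeps that one day from skewing the projection.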

8. Availability of your observability platform. As you scale up, you’re going to put more into your observability platform – can it handle this amount of data? This is critical to avoiding data holes, lack of visibility, and performance issues. Your observability platform absolutely must be able to run at scale – meaning there is no compromise on performance, regardless of the infrastructure environment size or the amount of data being collected.

9. Relaxing your concept of real-time. It’s unrealistic to expect that everything can be delivered or executed with the same measure of real-time as you scale. SREs will need to change their policies and approaches to adapt. As you scale and combine systems (networks, databases, applications), those systems will all use the term real-time, but each will mean something different. A group that monitors all their apps every 10s might be frustrated that data from their third-party advertising platform is available only on a 5-minute cadence. Meanwhile, the networking group is amazed anyone can get anything done at 10s, because all their data is available on a sub-second cadence. Each group has a very different concept of real-time and will perceive delay because of it. If you don’t recognize that, you will under-deliver or over-engineer your systems.
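One practical consequence is that "is this data late?" has to be answered per source rather than with one global threshold. The sketch below assumes a hypothetical `Source` record and an arbitrary rule of thumb (data is stale after three missed reporting intervals); the cadences are the ones from the example above, while the source names and the multiplier are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    cadence_seconds: float  # how often this source is expected to report


def is_stale(source: Source, seconds_since_last_point: float,
             missed_intervals: float = 3.0) -> bool:
    """Treat data as stale only after several expected intervals were missed."""
    return seconds_since_last_point > source.cadence_seconds * missed_intervals


# Cadences taken from the example in the text; the names are illustrative.
apps = Source("apps", 10)
ads = Source("third_party_ads", 300)
network = Source("network", 0.5)
```

With these settings, a two-minute gap is a problem for the app and network sources but perfectly normal for the advertising platform, which is the kind of difference a single global "real-time" threshold would hide.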

10. Requirements that simply violate the boundaries of your tools. SREs are going to draw abstractions in places where their vendor didn’t think that was a good idea – and they’ll have to cope with that. For example, you may have organization-specific alerting requirements that don’t fit your alert management tooling, or you may define an SLO in terms that the tool does not support, which can lead to disastrous outcomes.

Every organization will address these issues in different ways, but we hope this quick conceptual checklist helps you plan ahead. You will still encounter challenges, often frustrating ones, but even some preparation can save time and resources and protect productivity.