How Major League Baseball Scales Kubernetes Monitoring

Millions of baseball fans tuned into the World Series last week, and we at Circonus were proud to help our customer, Major League Baseball, provide those fans with seamless viewing experiences. To celebrate our partnership, we’re rolling the replay on how MLB has leveraged Circonus to overcome Kubernetes observability challenges as the league quickly scaled its Kubernetes deployment.

The following are six lessons MLB has shared from its experience monitoring 200 Kubernetes clusters across 30 ballparks, including insights that can save organizations time and money.

1. Centralized monitoring makes managing alerts easier.

The most important lesson MLB has learned since deploying Kubernetes is to centralize all Kubernetes monitoring metrics into one platform. At first, the organization installed Prometheus on each cluster, which created a siloed monitoring tool running within each of its clusters — a tactic that may work if you have a finite number of clusters you can easily manage. But when you have a larger Kubernetes footprint, it’s challenging to filter through the sheer volume of data the clusters are generating using federated tooling. MLB sees close to a million time series metrics per minute — so much data that deriving value from it is difficult. By consolidating its metrics into one platform, MLB has one set of alerting rules that manages the 15-20 actionable issues it actually cares about and can address, such as pod crash looping, memory pressure, and PID pressure.
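The idea can be sketched in a few lines: one shared rule set evaluated against metrics from every cluster, instead of per-cluster Prometheus instances each carrying their own copies. The rule names and thresholds below are illustrative, not MLB's actual configuration.

```python
# One set of alert rules shared by all clusters: metric name -> threshold.
# These names and thresholds are made up for illustration.
ALERT_RULES = {
    "pod_restart_count_5m": 3,   # pod crash looping
    "node_memory_pressure": 1,   # MemoryPressure condition active
    "node_pid_pressure": 1,      # PIDPressure condition active
}

def evaluate(cluster_metrics):
    """Return (cluster, rule) pairs that breach a shared threshold.

    cluster_metrics: {cluster_name: {metric_name: value}}
    """
    alerts = []
    for cluster, metrics in cluster_metrics.items():
        for rule, threshold in ALERT_RULES.items():
            if metrics.get(rule, 0) >= threshold:
                alerts.append((cluster, rule))
    return alerts

# Two hundred clusters, one rule set: a breach anywhere surfaces centrally.
metrics = {
    "ballpark-01": {"pod_restart_count_5m": 5},
    "ballpark-02": {"node_memory_pressure": 0},
}
print(evaluate(metrics))  # [('ballpark-01', 'pod_restart_count_5m')]
```

Because the rules live in one place, tuning a threshold changes behavior everywhere at once, rather than requiring a rollout to each cluster's own Prometheus.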

2. The “pull model” of receiving data using Prometheus creates gaps in visibility.

The “pull model” of Prometheus doesn’t work for MLB because the league has a complex network that spans offices, ballparks, and cloud environments that inevitably have occasional connectivity issues. As a result, pulling the metrics would sometimes fail, and MLB would see huge holes in its data. The only way it could solve this issue was to figure out some kind of backfill process, which is extremely hard to manage. By moving to a push model with store and forward capabilities, MLB is able to overcome this challenge. If there ever is a connectivity issue, the data is safely buffered and sent when connectivity returns.
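The store-and-forward behavior described above can be sketched as follows, assuming a `send()` callable that raises on connectivity failure. Metrics that fail to send are buffered locally and flushed once the link returns, so no backfill process is needed; all names here are illustrative.

```python
from collections import deque

class PushForwarder:
    """Push metrics upstream, buffering locally while the link is down."""

    def __init__(self, send):
        self.send = send        # callable(metric) -> None, raises on failure
        self.buffer = deque()   # metrics awaiting delivery, oldest first

    def push(self, metric):
        self.buffer.append(metric)
        self.flush()

    def flush(self):
        # Deliver oldest-first; stop at the first failure and keep the rest.
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return          # link is down; data stays safely buffered
            self.buffer.popleft()

# Simulate a ballpark losing connectivity mid-stream.
sent = []
link_up = False

def send(metric):
    if not link_up:
        raise ConnectionError("uplink down")
    sent.append(metric)

fwd = PushForwarder(send)
fwd.push({"ts": 1, "cpu": 0.4})   # link down: buffered, not lost
link_up = True
fwd.push({"ts": 2, "cpu": 0.5})   # link restored: both delivered, in order
print(sent)
```

The key property is that a connectivity gap produces delayed data rather than missing data, which is exactly the hole the pull model leaves.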

3. Understanding resource utilization prevents overprovisioning and saves significant costs.

Overprovisioning clusters is a common problem in Kubernetes. In fact, most organizations likely have a lot of waste running in their clusters that they’re unaware of. Understanding resource utilization is key to preventing this.

What’s challenging to understand is how much CPU and RAM your pods are requesting, versus how much they are actually using. It’s important to be able to surface this information and make decisions about spend based on what you’re actually using. Many organizations make decisions to scale up because they see that their CPU utilization is high. This is because it is fairly easy to implement a horizontal pod autoscaler using simple pod utilization metrics. However, an increase in CPU does not necessarily mean that your service is degraded — it just means you’re using the cores that you said you would use. MLB originally had auto-scaling set up based on CPU, but now it auto-scales based on metrics derived from actual utilization, using a custom pod autoscaler. For example, MLB might analyze the requests per second and decide not to scale up, because it knows its performance requirements are met at that request rate with the current level of scaling. This saves significant costs over time.
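The requested-versus-used comparison can be made concrete with a short sketch. The pod names and numbers below are made up: the point is to flag pods whose actual CPU usage sits far below what they request, since that reserved-but-idle capacity is what drives overprovisioning costs.

```python
def overprovisioned(pods, max_ratio=0.5):
    """Return pods using less than max_ratio of their requested CPU cores."""
    flagged = []
    for name, p in pods.items():
        ratio = p["cpu_used"] / p["cpu_requested"]
        if ratio < max_ratio:
            flagged.append((name, round(ratio, 2)))
    return flagged

# Hypothetical pods: requests (cores reserved) vs. actual usage.
pods = {
    "stats-api":  {"cpu_requested": 2.0, "cpu_used": 0.3},  # mostly idle
    "video-edge": {"cpu_requested": 1.0, "cpu_used": 0.9},  # well sized
}
print(overprovisioned(pods))  # [('stats-api', 0.15)]
```

A pod like `stats-api` above looks busy to a CPU-based autoscaler only because its request is inflated; right-sizing the request is often cheaper than adding replicas.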

Most organizations overestimate the resources needed to keep their clusters running optimally. If you do a bit more digging into resource utilization metrics, you might find that building scaling rules under the assumption that your service is CPU or memory bound is too simplistic. You may need to reference data specific to your application’s function, such as latency or queue depth. It’s more work, but there’s a huge opportunity for cost savings.
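As a sketch of what scaling on an application-specific signal might look like, the function below sizes a deployment from requests per second against a known per-replica capacity, rather than from raw CPU. The capacity and headroom values are illustrative assumptions, not MLB's actual autoscaler logic.

```python
import math

def desired_replicas(rps, capacity_per_replica, headroom=1.2):
    """Replicas needed to serve rps with some headroom, never below one."""
    return max(1, math.ceil(rps * headroom / capacity_per_replica))

# 900 rps at 500 rps per replica with 20% headroom needs 3 replicas.
# CPU might look "high" at this load, but if latency targets are met,
# a request-rate rule sees no reason to scale further.
print(desired_replicas(900, 500))  # 3
```

The same shape works for queue depth or latency: pick the signal that actually bounds your service, measure per-replica capacity against it, and scale to that instead of to CPU.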

4. Kubernetes knowledge sharing is a worthwhile investment.

Take the time to invest in transferring knowledge about how Kubernetes operates to the development teams who are actually running the clusters. Unless you have the resources to hire a very large SRE team, it’s not possible to manage all of the Kubernetes sprawl with one small team. Everyone needs to be involved and learn from common mistakes that provide insights into how Kubernetes operates.

Knowledge sharing is critical because you cannot apply traditional IT expertise to Kubernetes — it’s too inherently different. Software engineers who have not built software to run on Kubernetes will approach the technology with a set of assumptions based on past experiences. By applying this past knowledge without altering it for Kubernetes, organizations will have clusters that are less reliable and more difficult to operate. By prioritizing knowledge transfer about the nuances of how Kubernetes operates, teams will better understand how to build software that is operable on Kubernetes. If you silo or isolate Kubernetes expertise into a single team, you’ll just end up producing software that is less reliable and more costly.

5. Democratizing Kubernetes creates efficiency.

MLB has a core team of engineers who handle the CI/CD pipeline. Rather than have this core team manage Kubernetes on its own, clusters are owned by the teams that use them. This is good for knowledge sharing, but it can be challenging because Kubernetes is complex.

For example, at any given time many of MLB’s clusters are broken in some way, so inevitably questions about these issues come back to the core team. Since there are so many clusters, there’s no way for a centralized team to manage all of them and understand the context of every deployment. MLB therefore uses Terraform to set up rules that route alerts directly to the team that owns the cluster. If that team gets stuck, it leverages the core team for help. In addition, MLB has centralized tooling for the deployment and monitoring of Kubernetes (as stated in Lesson #1), which enables consistency among the teams.
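The routing logic amounts to a lookup from cluster ownership to notification target, with the core team only as a fallback. The sketch below shows the shape of that rule; the team and cluster names are hypothetical, and MLB manages the equivalent rules declaratively with Terraform rather than in application code.

```python
# Hypothetical ownership map: cluster -> owning team's alert channel.
OWNERS = {
    "ballpark-01": "scoreboard-team",
    "ballpark-02": "video-team",
}

def route(alert):
    """Send each alert to the cluster's owning team; core team is the fallback."""
    return OWNERS.get(alert["cluster"], "core-platform-team")

print(route({"cluster": "ballpark-02", "rule": "pod_crash_loop"}))   # video-team
print(route({"cluster": "ballpark-99", "rule": "memory_pressure"}))  # core-platform-team
```

Keeping the ownership map in infrastructure-as-code means adding a cluster and wiring its alerts to the right team happen in the same change.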

6. Using turnkey monitoring solutions that cover 80% of use cases saves time.

MLB finds that about 80% of use cases run extremely well in Kubernetes. By applying a turnkey monitoring solution like Circonus for that 80%, the league saves significant time. MLB simply drops in the Circonus Kubernetes agent and immediately gets alert rules, visualizations, and data insights with no extra work required. With a turnkey solution, MLB can launch a cluster and already have the 15-20 important alert rules set up. If something is broken, the team is alerted and can view dashboards to identify what the issue is and how to fix it. And if MLB wants to customize further, it can.

Whether you’ve already deployed Kubernetes or are considering it, you’re likely well aware of the benefits of Kubernetes — but also its complexity. Major League Baseball has gained a few insights along the way that can hopefully help reduce your learning curve in a way that allows you to get the maximum value out of your current or future Kubernetes deployments.

Learn why more enterprises are using Circonus for Kubernetes Monitoring