We recently spoke with Major League Baseball (MLB) executives about their Kubernetes deployment, including the challenges they’ve faced and the lessons they’ve learned, particularly around Kubernetes monitoring. Currently, MLB has about 70 Kubernetes clusters running across 30 ballparks, most in GKE and some in EKS. The following are six lessons MLB shared from that experience, with insights you can likely apply to your own clusters to save your organization substantial time and money.
Lesson #1: Centralize monitoring data to ease alert configurations and better manage metrics
The most important lesson MLB has learned since deploying Kubernetes is to centralize all Kubernetes monitoring metrics into one platform. At first, the organization had silos of monitoring tools within each of its clusters, a tactic that may work if you have a small number of clusters you can easily manage. But with a larger Kubernetes footprint, it’s challenging to sift through the sheer volume of data the clusters generate using disparate tools. MLB sees about three quarters of a million time series coming out of each cluster every minute, which is too much data to manage. By consolidating its metrics into one platform, the league can filter out 90% of the data and keep about 15,000 to 30,000 time series per larger cluster. This allows the team to focus on the 15 to 20 actionable issues they actually care about and can address, such as pod crash looping, memory pressure and PID pressure.
Centralizing all data has also eased alert management. Previously, they had to configure alerting separately in every cluster, and many clusters were in various states of non-compliance. By putting all of the data the clusters generate into one place, they now have one set of alerts that covers the 15-20 major issues.
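To make the filtering step concrete, here is a minimal Python sketch of an allowlist filter that keeps only the time series tied to actionable alerts. The article does not say how MLB implements its filtering, so treat this purely as an illustration: the function name keep_series, the dictionary shape of a series, and the specific patterns in ALLOWED_PATTERNS are all assumptions, not MLB's actual list.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: metric-name patterns that feed actionable alerts.
# These names are illustrative assumptions, not MLB's actual configuration.
ALLOWED_PATTERNS = [
    "kube_pod_container_status_restarts_total",  # crash looping
    "kube_node_status_*",                        # memory / PID pressure conditions
    "container_memory_working_set_bytes",        # memory usage
]


def keep_series(series, patterns=ALLOWED_PATTERNS):
    """Return only the series whose metric name matches an allowed pattern."""
    return [
        s for s in series
        if any(fnmatchcase(s["name"], p) for p in patterns)
    ]


if __name__ == "__main__":
    incoming = [
        {"name": "kube_pod_container_status_restarts_total", "value": 7},
        {"name": "go_gc_duration_seconds", "value": 0.002},
        {"name": "kube_node_status_condition", "value": 1},
        {"name": "apiserver_request_duration_seconds_bucket", "value": 120},
    ]
    kept = keep_series(incoming)
    print(f"kept {len(kept)} of {len(incoming)} series")
```

The design point is that the allowlist lives in one place: adding or removing a pattern changes what every cluster forwards, rather than requiring a change in each cluster.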
Lesson #2: The “pull model” of receiving data using Prometheus leaves holes in visibility
The “pull model” of Prometheus doesn’t work for MLB, because its complex network spans ballparks all over the country, along with VPNs, different clouds and on-premises environments, so “drops” are inevitable. When the network dropped, the “pulls” would fail and they would see huge holes in their data. The only way to fill those holes was some kind of backfill process, which is extremely hard to manage. By moving to a “push model” for receiving data and centralizing the processing of that data, they were able to overcome this challenge.
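One reason a push model tolerates network drops better is that the sender can hold samples locally and deliver them once the link returns. The Python sketch below illustrates that idea under stated assumptions: PushBuffer, its methods, and the send callback are hypothetical names, and this is not how Prometheus or any particular agent is implemented. The article does not describe MLB's actual push mechanism.

```python
from collections import deque


class PushBuffer:
    """Buffers samples locally and pushes them in order when the link is up.

    Hypothetical sketch: `send` is any callable that delivers one sample and
    raises ConnectionError when the network is unavailable.
    """

    def __init__(self, send, max_samples=10_000):
        self._send = send
        # Bounded queue: once full, the oldest samples are discarded first.
        self._pending = deque(maxlen=max_samples)

    def record(self, sample):
        self._pending.append(sample)

    def flush(self):
        """Try to push everything pending; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # keep the sample and retry on the next flush
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self):
        return len(self._pending)
```

With a pull model, a scrape that fails during an outage is simply missing. Here, samples recorded during the outage stay in the buffer and arrive late instead of leaving a hole, up to the limit set by max_samples.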
Lesson #3: Understand resource utilization to prevent overprovisioning and save significant costs
Overprovisioning clusters is a common problem in Kubernetes. In fact, most organizations likely have a lot of waste running in their clusters that they’re unaware of. Understanding resource utilization is key to preventing this.
What’s challenging to understand is how much CPU and RAM your pods are requesting versus how much they are actually using. It’s important to surface this information and make spending decisions based on what you’re actually using. Many organizations decide to scale up because they see that their CPU utilization is high. However, an increase in CPU utilization does not necessarily mean that your latency has gone up; it just means you’re using the cores that you said you would use. MLB originally had auto-scaling set up based on CPU, but it now auto-scales based on metrics derived from actual utilization. For example, MLB might look at requests per second and decide not to scale up, because they know performance requirements are already met at that request rate with the current level of scaling. This will save them significant costs over time.
Most organizations overestimate the resources needed to keep their clusters running optimally. If you do a bit more digging into resource utilization metrics, you might see that you’re not using the resources you thought you were — so there’s a huge opportunity for cost savings.
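The two ideas in this lesson, comparing requested resources with actual usage and scaling on request rate rather than CPU, can be sketched in a few lines of Python. The function names, the per-replica capacity figure and the sample numbers are all assumptions made for illustration; the article does not give MLB's actual thresholds or autoscaling configuration.

```python
import math


def request_utilization(requested, used):
    """Fraction of the requested resource that is actually used."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return used / requested


def desired_replicas(total_rps, rps_per_replica, min_replicas=1):
    """Replicas needed to serve total_rps, given a per-replica capacity.

    rps_per_replica is assumed to be a rate at which one replica is known
    to meet its performance requirements (for example, from load testing).
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    return max(min_replicas, math.ceil(total_rps / rps_per_replica))


if __name__ == "__main__":
    # A pod requesting 4 cores but using 1 is using 25% of its request.
    print(request_utilization(requested=4.0, used=1.0))
    # 450 requests/s at an assumed 100 requests/s per replica needs 5 replicas.
    print(desired_replicas(total_rps=450, rps_per_replica=100))
```

If six replicas are already running in the second case, the request rate says there is no need to scale up, even if CPU utilization looks high.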
Lesson #4: Invest in Kubernetes knowledge sharing
Take the time to invest in transferring knowledge about how Kubernetes operates to the development teams who are actually running the clusters. Unless you have the resources to hire a very large SRE team, it’s not possible to manage all of the Kubernetes sprawl with one small team. Everyone needs to be involved and learn from common mistakes that provide more insights into how Kubernetes operates.
Knowledge sharing is critical, because you cannot apply traditional IT expertise to Kubernetes; it’s inherently too different. Software engineers who have not built software to run on Kubernetes will approach the technology with a set of assumptions based on past experience. If teams apply this past knowledge without adapting it for Kubernetes, the organization will end up with clusters that are less reliable and more difficult to operate. By pushing knowledge about the nuances of how Kubernetes works into those teams, you help them understand how to build software that is operable on Kubernetes. If you silo Kubernetes knowledge in a single team, you’ll just end up producing software that burns money.
Lesson #5: Democratize Kubernetes
MLB has a core central team of 8-10 engineers who handle the CI/CD pipeline. Rather than have this central team manage Kubernetes on its own, clusters are owned by the teams that use them. This is good for knowledge sharing, but it can be challenging because Kubernetes is complex.
For example, at any given time most of MLB’s 70-plus clusters are in some state of broken, so questions about these issues inevitably come back to the core team. With so many clusters, there’s no way for a centralized team to manage all of them and understand the context of every deployment. MLB therefore uses Terraform to set up rules that send alerts directly to the team that owns each cluster. If the team gets stuck, they lean on the core team for help. In addition, MLB has centralized tooling for the deployment and monitoring of Kubernetes (as described in Lesson #1), which enables consistency among the teams.
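The routing rule itself is simple: an alert goes to whoever owns the cluster it came from, with the core team as a fallback. MLB defines these rules in Terraform; the Python sketch below is not that configuration, only an illustration of the logic. The cluster names, team names and the shape of an alert are hypothetical.

```python
# Hypothetical ownership map; in practice this would be generated from the
# same source of truth that provisions the clusters.
CLUSTER_OWNERS = {
    "ballpark-a-prod": "team-stats",
    "ballpark-b-prod": "team-video",
}
CORE_TEAM = "core-platform"


def route_alert(alert, owners=CLUSTER_OWNERS, fallback=CORE_TEAM):
    """Return the team that should receive this alert.

    Alerts from clusters with a known owner go to that owner; anything
    else (unknown or missing cluster label) falls back to the core team.
    """
    return owners.get(alert.get("cluster"), fallback)


if __name__ == "__main__":
    alert = {"cluster": "ballpark-a-prod", "name": "PodCrashLooping"}
    print(route_alert(alert))
```

Keeping the fallback explicit means an alert from a newly created or mislabeled cluster still reaches someone instead of being dropped.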
Lesson #6: Use turnkey monitoring solutions that cover 80% of use cases, to save oodles of time
MLB finds that about 80% of use cases run extremely well in Kubernetes. By applying a turnkey monitoring solution for that 80%, they save significant time. All they do is drop in an agent, and they immediately get alerts, visualizations and data insights with no extra work on their end. With a turnkey solution, MLB can launch a cluster and already have the 15-20 alerts that matter set up. If something is broken, they’re alerted and can view dashboards to identify what the issue is and how to fix it. And if they want to customize further, they can.
Hit Your Kubernetes Deployment Out of the Park
Whether you’ve already deployed Kubernetes clusters or are considering it, you’re well aware of the benefits Kubernetes can bring — but you also know just how complex it is. Major League Baseball has gained a few insights along the way to ensure that they continue to maximize the value they get out of their Kubernetes deployments. Hopefully, some of what they have shared can help your organization do the same.