Guide to Kubernetes Monitoring: Part 1

Kubernetes is one of the hottest topics in IT right now, but what exactly is it and where did it come from? As DevOps and Infrastructure-as-Code practices arose and took hold in the IT/OPS community over the past decade, the logical continuation of those ideas was a system for automating the management of the software itself. So Google stepped in and offered its own software as a solution, releasing Kubernetes as open-source in 2014.

Kubernetes can manage most aspects of an application: service discovery and load balancing, storage orchestration, automated rollouts/rollbacks, automatic bin packing, and more. And its popularity hasn’t yet seen a peak — usage continues to climb. The number of organizations using container-based deployments in production environments has increased by over 260% in the past four years.

Despite the explosive popularity of Kubernetes, operating a Kubernetes cluster is challenging. With its impressive capabilities comes a high level of complexity that few people can truly master. This leads to one of the hardest aspects of operating Kubernetes clusters: monitoring their health and performance.

When monitoring a system as complex as Kubernetes, you need to ensure that your nodes’ resources are being allocated properly and the nodes are not experiencing memory or process ID (PID) pressure. You need to watch for delayed pod creation and monitor your jobs for failures. And if there is a problem with your statefulsets or daemonsets or a glitch with a deployment, you have to determine the cause and find a resolution.
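As a rough illustration of the node checks described above, here is a minimal Python sketch that scans node objects for active pressure conditions. It assumes the input is shaped like Node objects from the Kubernetes API (for example, the "items" list from `kubectl get nodes -o json`). The function name and the sample data are hypothetical and exist only for this example.

```python
# Condition types reported on a Node that indicate resource pressure.
PRESSURE_CONDITIONS = {"MemoryPressure", "PIDPressure", "DiskPressure"}


def nodes_under_pressure(nodes):
    """Return {node name: [pressure condition types whose status is "True"]}.

    Nodes without any active pressure condition are left out of the result.
    """
    flagged = {}
    for node in nodes:
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        active = [
            c["type"]
            for c in conditions
            if c["type"] in PRESSURE_CONDITIONS and c["status"] == "True"
        ]
        if active:
            flagged[name] = active
    return flagged


# Hypothetical sample data, for illustration only.
sample_nodes = [
    {
        "metadata": {"name": "node-a"},
        "status": {"conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "PIDPressure", "status": "False"},
            {"type": "Ready", "status": "True"},
        ]},
    },
    {
        "metadata": {"name": "node-b"},
        "status": {"conditions": [
            {"type": "MemoryPressure", "status": "True"},
            {"type": "PIDPressure", "status": "True"},
            {"type": "Ready", "status": "True"},
        ]},
    },
]

print(nodes_under_pressure(sample_nodes))
# → {'node-b': ['MemoryPressure', 'PIDPressure']}
```

A real monitoring system would evaluate these conditions continuously rather than from a one-off snapshot, but the shape of the check is the same.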

But you cannot successfully monitor Kubernetes using conventional approaches because almost everything within Kubernetes clusters is ephemeral. Pods and containers may come or go at any time, while nodes may be added as you scale up your cluster size to meet increased demand. Only with a modern monitoring solution that addresses the unique challenges of Kubernetes can you successfully ensure the health of your clusters. Operating a complex system like Kubernetes without a modern monitoring solution is like stumbling around in the dark: you might eventually get where you want to go, but the journey will likely be a painful one.

This post will highlight why Kubernetes requires a different approach to reap the benefits of monitoring, and how to manage Kubernetes monitoring challenges.

Kubernetes Monitoring Goals

Before we dive into the challenges of Kubernetes, let’s first discuss why we need to monitor Kubernetes in the first place. I’ve broken this into three main goals:

Goal #1: Understand and improve overall cluster health

Your first goal is to understand the overall health of your Kubernetes cluster. You could install a simple agent on your Kubernetes nodes and collect their CPU, memory, and disk usage, but then what? Those metrics alone don’t give you the “big picture.” You want to monitor the cluster itself and know whether it’s healthy. How do you even monitor an environment like this with a conventional monitoring system? It’s difficult to pull metrics from IP addresses or hostnames when your cluster is constantly changing. These questions can only be answered with a modern monitoring solution, one that takes into account the specifics of Kubernetes.

Goal #2: Identify the root cause of issues

Once you know that your cluster is unhealthy, you need to be able to point to a cause of the unhealthy state. Is a container crashlooping? Has a node run out of memory? Maybe all network responses are slow, but is it a DNS issue? Once you can pinpoint the cause, you can do something about it.
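To make the crashlooping question concrete, here is a small sketch of how it could be answered from pod status data. It assumes pod objects shaped like the Kubernetes API output (for example, from `kubectl get pods -o json`), where a crashlooping container reports a waiting state with the reason `CrashLoopBackOff`. The helper name and the sample pods are hypothetical.

```python
def crashlooping_containers(pods):
    """Return (pod name, container name, restart count) for each container
    currently waiting with reason CrashLoopBackOff."""
    found = []
    for pod in pods:
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # "waiting" is only present while the container is not running.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((pod_name, cs["name"], cs.get("restartCount", 0)))
    return found


# Hypothetical sample data, for illustration only.
sample_pods = [
    {
        "metadata": {"name": "api-5c9d"},
        "status": {"containerStatuses": [
            {"name": "api", "restartCount": 0,
             "state": {"running": {}}},
        ]},
    },
    {
        "metadata": {"name": "worker-7f2a"},
        "status": {"containerStatuses": [
            {"name": "worker", "restartCount": 12,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
        ]},
    },
]

print(crashlooping_containers(sample_pods))
# → [('worker-7f2a', 'worker', 12)]
```

This only narrows down where to look; the actual cause still has to come from the container's logs and events.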

Goal #3: Reduce MTTR

If you don’t have sufficient monitoring and something goes wrong in your cluster, it can take a while to figure out exactly what’s happening, and these types of investigations are daunting to even the most experienced operators. When something breaks, you need to be able to fix it quickly, and having proper monitoring in place significantly shortens that process.

Additional Benefits

The ability to get a good view of your cluster health has other benefits, too. Once you can actually monitor your cluster, you can dive into your Kubernetes application structure itself. Are some pod types overallocated or underperforming? Could your application be reworked to provide a better, smoother user experience? Perhaps you need to restructure part of your application to be able to support more users.

You’ll also gain insights to save on costs. With ephemeral infrastructures, it’s easy to always scale up but never take the time to scale back down when necessary, and some cloud providers don’t even offer bidirectional autoscaling. It may be that some nodes are no longer needed now that a traffic surge has passed, or perhaps an application’s pods are over-provisioned. Being able to look at resource usage across all aspects of your cluster, and to compare actual usage to desired usage, may help you find cost savings.
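The comparison of actual usage to desired usage can be sketched in a few lines. The example below assumes you already have, per workload, the CPU it requested and the CPU it actually used, both in millicores. The field names, the 50% threshold, and the sample figures are all hypothetical choices for illustration.

```python
def overprovisioned(workloads, threshold=0.5):
    """Return (name, usage ratio) for workloads whose actual CPU usage is
    below `threshold` of what they requested."""
    result = []
    for w in workloads:
        requested = w["cpu_requested_millicores"]
        if requested <= 0:
            # Nothing was requested, so there is nothing to compare against.
            continue
        ratio = w["cpu_used_millicores"] / requested
        if ratio < threshold:
            result.append((w["name"], round(ratio, 2)))
    return result


# Hypothetical sample data, for illustration only.
sample_workloads = [
    {"name": "checkout", "cpu_requested_millicores": 2000, "cpu_used_millicores": 300},
    {"name": "search", "cpu_requested_millicores": 1000, "cpu_used_millicores": 800},
]

print(overprovisioned(sample_workloads))
# → [('checkout', 0.15)]
```

Here the hypothetical "checkout" workload uses only 15% of the CPU it reserved, which makes it a candidate for a smaller request.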

Kubernetes Monitoring Complexities

Millions of metrics, constant changes, and a lack of observability are three complexities that make Kubernetes monitoring challenging and drive the need for more tailored solutions.

Complexity #1: So many metrics

Kubernetes is a multi-layered solution. The entire deployment is called a cluster. Inside the cluster you have worker machines called nodes, and they run your containerized applications. Each node runs one or more pods, which are the units that wrap and run your containers, and the nodes and pods in turn are managed by the Control Plane. Inside the Control Plane are many smaller pieces such as the kube-controller-manager, cloud-controller-manager, kube-apiserver, kube-scheduler, and etcd, and these can be distributed across multiple machines for fault-tolerance and resilience.

Above: Diagram of Kubernetes cluster with control plane

These abstractions all work to help Kubernetes efficiently support your container deployments, pod scaling, scheduling, updating, service discovery, etc., and while they’re all very helpful, they’re also complex and they generate many metrics.

But which metrics should you watch? You can’t watch all of them, and you don’t need to watch all of them. Any comprehensive Kubernetes monitoring solution needs to keep tabs on the important metrics relating to the Control Plane, and it needs to constantly adapt to new versions of Kubernetes as they’re released.

In addition to Control Plane metrics are “Pod Churn” metrics. Real-world pod usage varies wildly between different organizations. Some organizations design systems where pods may last days, weeks, or even months, while other organizations consider this to be a flawed application design and have systems where pods only last for minutes or seconds. In Kubernetes, a given pod produces a collection of metrics which are unique to that particular pod. These metrics all contain tags, labels, or dimensions that contain contextual information for that particular pod: the pod name, what node it was running on, its associated namespace, component type, etc.

Above: Example pod metrics

“Pod churn” refers to the cycle through which pods and containers are created, destroyed, and later recreated, and every time a pod is created you have a new set of metrics being created for it. This results in a large volume of high-cardinality metrics, meaning metrics with a very large number of unique label combinations. A high level of “pod churn” can result in millions upon millions of new metrics being created every single day. This is a very significant challenge for many conventional monitoring systems, including commercial systems. They just can’t keep up with the sheer volume of unique metrics needed to monitor such a cluster. Simply put: it’s just too much data.
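A quick back-of-the-envelope calculation shows how pod churn reaches those volumes. The sketch below estimates how many new unique metric series appear per day, assuming every new pod brings a fresh set of series for each of its containers. All of the input numbers are hypothetical and chosen only to show the arithmetic.

```python
def new_series_per_day(running_pods, pod_lifetime_minutes,
                       containers_per_pod, metrics_per_container):
    """Estimate the number of new unique metric series created per day.

    Assumes a steady state: each running pod is replaced by a new pod
    (with a new name, and so a new set of series) once per lifetime.
    """
    minutes_per_day = 24 * 60
    pods_created_per_day = running_pods * (minutes_per_day / pod_lifetime_minutes)
    return int(pods_created_per_day * containers_per_pod * metrics_per_container)


# Hypothetical cluster: 500 pods running at any time, each living 10 minutes,
# 2 containers per pod, 50 metrics per container.
short_lived = new_series_per_day(500, 10, 2, 50)
print(short_lived)
# → 7200000

# The same cluster with pods that live for a week instead.
long_lived = new_series_per_day(500, 7 * 24 * 60, 2, 50)
print(long_lived < short_lived)
# → True
```

With these assumed numbers, short-lived pods produce about 7.2 million new series a day, while week-long pods in the same cluster produce only a few thousand.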

Above: Pod churn due to horizontal pod and node scaling

Complexity #2: Ephemerality

In addition to the system Control Plane, there are your deployment elements which constantly change. Deployments, DaemonSets, Jobs, and StatefulSets all can generate new pods to monitor, and sometimes it’s even necessary to scale down; then pods or nodes will disappear forever. The Kubernetes scheduler schedules all of these elements to ensure that resources are always available and allocated where you want them to be. As new deployments are scheduled, Kubernetes may decide that it needs to move a pod in order to free up resources on a given node. This results in pods being moved and recreated: effectively the same pod, just with a different name and in a different place. A monitoring solution needs to be able to detect these changes automatically and continue monitoring without interruption.
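One way to keep monitoring continuous when pods are recreated under new names is to aggregate metrics by a stable identity, such as namespace plus a workload label, instead of by pod name. The sketch below shows the idea on plain dictionaries. The sample format and the `app` label key are assumptions made for this example, not a prescribed convention.

```python
from collections import defaultdict


def usage_by_workload(samples, workload_label="app"):
    """Sum metric values per (namespace, workload), ignoring pod and node names."""
    totals = defaultdict(float)
    for s in samples:
        labels = s["labels"]
        key = (labels["namespace"], labels.get(workload_label, "unknown"))
        totals[key] += s["value"]
    return dict(totals)


# Hypothetical snapshots taken before and after a pod was rescheduled.
before = [
    {"labels": {"namespace": "shop", "app": "web",
                "pod": "web-7d9f-abcde", "node": "node-1"}, "value": 0.25},
]
after = [
    {"labels": {"namespace": "shop", "app": "web",
                "pod": "web-7d9f-xyz12", "node": "node-2"}, "value": 0.5},
]

print(usage_by_workload(before))
# → {('shop', 'web'): 0.25}
print(usage_by_workload(after))
# → {('shop', 'web'): 0.5}
```

The pod name and node changed between the two snapshots, but the series keyed by ("shop", "web") continues without a gap.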

Complexity #3: Lack of Observability

Organizations which adopt Kubernetes tend to also follow modern software practices, including using microservices and/or stateless application design. These ultimately lead to application architectures which are very dynamic and hinder observability.

In a microservice-based application, engineers break down the application into components representing the core functions or services of the application. These components are intended to be loosely coupled, so the services are operated independently and designed in such a way that a change to one service won’t significantly affect other services. Modern applications can be composed of dozens of microservices, and Kubernetes keeps track of the state of these various components, ensuring they are available and that there are enough of them to handle the appropriate workload. The microservices themselves are in constant communication with each other, and that communication takes place through a virtual network within the Kubernetes cluster itself.

In a stateless application, the application avoids storing any client session data on the server. Any session data storage (if it needs to occur at all) is handled on the client side. Since there is no session data stored on the server, there is also no need for any particular client connection to be favored over any other client connection. This allows the application to treat each connection as the first connection and easily balance the processing load across multiple instances. The biggest benefit of stateless application design is that it enables applications to be horizontally scaled simply by deploying instances of the application on multiple servers and then distributing all incoming client requests amongst the available servers.

Microservices are not required to be stateless (and stateless applications are not required to be built from microservices), but you do tend to find these two practices being leveraged together for the sake of being able to easily scale the application. This means Kubernetes becomes an ideal platform upon which to deploy this type of software. However, these types of services are (by design) expected to be ephemeral; they scale up to handle a workload and subsequently disappear when no longer needed. As a result, all operational information present within a pod disappears with it when it’s torn down. Nothing is left; it’s all gone.

Above: Microservices and statelessness lead to ephemerality

How does this affect the observability of Kubernetes, then? Since observability is the ability to infer the state of a system through knowledge of that system’s outputs, it sure seems like Kubernetes is a system with minimal observability, and yet its popularity continues to climb. This limited observability is why it’s so difficult to troubleshoot problems with Kubernetes. It’s not uncommon to hear stories of Kubernetes operators finding major software problems months or even years after having migrated to the ecosystem. Kubernetes itself does such a fantastic job of ensuring that services stay running, that given its limited outputs you can easily find yourself in just such a situation without realizing it. On the surface, this is a great success story for Kubernetes (the fact that it’s so resilient) but sooner or later those software problems need to be found, and that’s going to be a problem when the system seems to be a “black box.”

Conclusion

Rarely has a new technology seen such explosive growth as Kubernetes has over the past few years. Its capabilities are robust and beneficial for modern cloud-based applications, and yet it is different enough from traditional server architectures that it has created the need for a new approach to monitoring. To gain useful insights into cluster health and to properly monitor cluster resource allocation, conventional monitoring techniques are insufficient. And most monitoring solutions are incapable of handling the sheer volume of data that’s needed to monitor Kubernetes properly. While this is a daunting challenge, it is definitely manageable; if done correctly, you can ensure your cluster is healthy and your applications are running as expected. In my next post, I’ll dive deeper into successfully understanding the health and performance of your Kubernetes clusters.