Guide to Monitoring Kubernetes, Part 2: Which Metrics and Health Conditions You Should be Monitoring

Welcome back to our series of Kubernetes monitoring guides. In part 1 of this series, we discussed the difficulties of managing a Kubernetes cluster, the challenges of conventional monitoring approaches in ephemeral environments, and what our goals should be as we think about how to approach Kubernetes monitoring. Kubernetes can generate so many types of new metrics that one of the most challenging aspects of monitoring your cluster’s health is filtering through these metrics to decide which ones are important to collect and pay attention to.

In fact, in a recent survey that Circonus conducted of Kubernetes operators, uncertainty about which metrics to collect was one of the top monitoring challenges that operators reported. This isn’t surprising, given the millions of metrics that Kubernetes can generate on a daily basis.

Here in part 2 of the series, we’re going to share which health metrics are most critical for Kubernetes operators to collect and analyze. We’ll look at three sources of these metrics, name and define the metrics from each source, and explain which health conditions they’re associated with that you should monitor and alert on. We’ll also discuss four more health conditions concerning the Control Plane that you should monitor if you’re managing your own cluster.

Sources of Health Data

There are three sources of the metrics this article will examine. The first source is resource and utilization metrics, which are provided by the kubelets themselves. The second source, which provides most of the crucial metrics, is the kube-state-metrics component (an optional installation). The third source is the Control Plane, which is composed of multiple components that each serve their own metrics. Those metrics may not be enabled by default, so if you’re managing your own Kubernetes cluster you may need to configure those components to publish them.

#1: Resource and Utilization Metrics

Resource and utilization metrics come from the built-in metrics API and are provided by the kubelets themselves. We treat only CPU usage as a critical health condition, but monitoring memory usage and network traffic is also important.

Metric | Name | Description
CPU Usage | usageNanoCores | The CPU usage of a node or pod, in nanocores (billionths of a CPU core).
CPU Capacity | capacity_cpu | The number of CPU cores available on a node (not available for pods).
Memory Usage | used{resource:memory,units:bytes} | The amount of memory used by a node or pod, in bytes.
Memory Capacity | capacity_memory{units:bytes} | The total memory available for a node (not available for pods), in bytes.
Network Traffic | rx{resource:network,units:bytes}, tx{resource:network,units:bytes} | The total network traffic seen for a node or pod, both received (incoming) traffic and transmitted (outgoing) traffic, in bytes.

Utilization Health Conditions

High CPU

This is the easiest to understand: you should track how many CPU cycles your nodes are using. This is important to monitor for two reasons. First, you don’t want to run out of processing resources for your application. If your application becomes CPU-bound, you need to increase your CPU allocation or add more nodes to the cluster. Second, you don’t want your CPU to sit unused. If your CPU usage is consistently low, you may have over-allocated your resources and could be wasting money. To determine whether your CPU usage is getting too high, compare utilization{resource:cpu} to a pre-decided threshold over a particular window of time (e.g. has it stayed above the threshold for more than 5 minutes).
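The "above the threshold for a whole window" check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cpu_sustained_high` is hypothetical, utilization is assumed to be a fraction between 0 and 1, and the samples are assumed to have already been fetched from your monitoring system as time-ordered `(timestamp, value)` pairs.

```python
from typing import Sequence, Tuple


def cpu_sustained_high(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    window_seconds: float,
) -> bool:
    """Return True if every sample in the trailing window exceeds the threshold.

    samples: (timestamp_seconds, utilization) pairs, sorted by timestamp.
    Hypothetical helper; the 0-1 utilization scale is an assumption.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    # Without data covering the whole window we cannot claim it was sustained.
    if samples[0][0] > window_start:
        return False
    return all(value > threshold for ts, value in samples if ts >= window_start)
```

Requiring every sample in the window to exceed the threshold avoids alerting on short spikes. The same shape of check, with the comparison reversed, could flag consistently low usage.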

#2: State Metrics

kube-state-metrics is a component that provides data on the state of cluster objects (nodes, pods, DaemonSets, namespaces, and so on). It serves its metrics through the same metrics API from which the resource and utilization metrics are served.

Metric | Name | Description
Node Status | kube_node_status_condition{status:true,condition:OutOfDisk|MemoryPressure|PIDPressure|DiskPressure|NetworkUnavailable} | A numeric boolean (0 or 1) for each node/condition combination, indicating if that node is currently experiencing that condition.
Crash Loops | kube_pod_container_status_waiting_reason{reason:CrashLoopBackOff} | A numeric boolean (0 or 1) for each container, indicating if it’s experiencing a crash loop.
Job Status (Failed) | kube_job_status_failed | A numeric boolean (0 or 1) for each job, indicating if it has failed.
Persistent Volume Status (Failed) | kube_persistentvolume_status_phase{phase:Failed} | A numeric boolean (0 or 1) for each persistent volume, indicating if it has failed.
Pod Status (Pending) | kube_pod_status_phase{phase:Pending} | A numeric boolean (0 or 1) for each pod indicating if it’s in a pending state.
Latest Deployment Generation | kube_deployment_metadata_generation | Sequence number representing the latest generation of a Deployment.
Observed Deployment Generation | kube_deployment_status_observed_generation | Sequence number representing the current generation of a Deployment as observed by the controller.
Desired DaemonSet Nodes | kube_daemonset_status_desired_number_scheduled | Number of nodes that should be running each pod in the DaemonSet.
Current DaemonSet Nodes | kube_daemonset_status_current_number_scheduled | Number of nodes that are running each pod in the DaemonSet.
Desired StatefulSet Replicas | kube_statefulset_status_replicas | Number of replicas desired per StatefulSet.
Ready StatefulSet Replicas | kube_statefulset_status_replicas_ready | Number of replicas which are ready per StatefulSet.

State Health Conditions

Crash Loops

A crash loop is when a container starts, crashes, and is restarted by the kubelet, only to crash again (so it keeps crashing and restarting in a loop). This can be caused either by an application within the container crashing, or by a misconfiguration in the deployment process, which makes debugging a crash loop rather tricky.

Crash loops are obviously bad because they may render your services unreachable. However, sometimes clusters have hidden crash loops: Kubernetes is so good at its job that crash loops may be occurring while your application continues to be available.

Even in a best case scenario, the crash loop is wasting CPU cycles, which may impact your cluster provisioning and waste money. When a crash loop happens, you need to know quickly so that you can figure out what’s happening and whether you need to take emergency measures to keep your application available. Summing all of your kube_pod_container_status_waiting_reason{reason:CrashLoopBackOff} metrics will give you the total number of crash loops currently happening, and the metrics are tagged with the names of the containers that are crashing.
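As a rough sketch of that summing step, the Python below counts crash-looping containers from a list of already-scraped series and collects their names. The input shape (a list of dicts with `labels` and `value` keys) and the function name `crash_looping_containers` are assumptions made for illustration; the `namespace`, `pod`, and `container` label names are assumed to be present on each series.

```python
from typing import Dict, List, Tuple


def crash_looping_containers(series: List[Dict]) -> Tuple[int, List[str]]:
    """Count containers currently reporting CrashLoopBackOff.

    series: hypothetical pre-parsed samples of
    kube_pod_container_status_waiting_reason, each shaped like
    {"labels": {...}, "value": 0 or 1}.
    """
    names = sorted(
        "{namespace}/{pod}/{container}".format(**s["labels"])
        for s in series
        if s["labels"].get("reason") == "CrashLoopBackOff" and s["value"] == 1
    )
    return len(names), names
```

Returning the names alongside the count means an alert can say which containers to look at, not just how many are affected.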

Disk Pressure

Disk pressure is a condition indicating that a node is using too much disk space or is using disk space too fast, according to the thresholds you have set in your Kubernetes configuration (this pertains to the disk space on the node itself, not on other volumes like PersistentVolumes). This is important to monitor because it might mean that you need to add more disk space, if your application legitimately needs more space. Or it might mean that an application is misbehaving and filling up the disk prematurely in an unanticipated manner. Either way, it’s a condition that needs your attention.

Memory Pressure

Memory pressure is another resourcing condition indicating that your node is running out of memory. Similar to CPU resourcing, you don’t want to run out of memory — but you also don’t want to over-allocate memory resources and waste money. You especially need to watch for this condition because it could mean there’s a memory leak in one of your applications.

PID Pressure

PID pressure is a rare condition where a pod or container spawns too many processes and starves the node of available process IDs. Each node has a limited number of process IDs to distribute amongst running processes; and if it runs out of IDs, no other processes can be started. Kubernetes lets you set PID thresholds for pods to limit their ability to perform runaway process-spawning, and a PID pressure condition means that one or more pods are using up their allocated PIDs and need to be examined.

Network Unavailable

All your nodes need network connections, and this status indicates that there’s something wrong with a node’s network connection. Either it wasn’t set up properly (due to route exhaustion or a misconfiguration), or there’s a physical problem with the network connection to your hardware.

Job Failures

Jobs are designed to run pods for a limited amount of time and tear them down when they’ve completed their intended functions. If a job doesn’t complete successfully due to a node crashing or being rebooted, or due to resource exhaustion, you need to know that the job failed. That’s why you need to monitor job failures: they don’t usually mean your application is inaccessible, but if left unfixed they can lead to problems down the road. Summing all of your kube_job_status_failed metrics will give you a total of how many jobs are currently failing.

Persistent Volume Failures

Persistent Volumes are storage resources that are specified on the cluster and are available as persistent storage to any pod which requests it. During their lifecycle, they are bound to a pod and then reclaimed when no longer needed by that pod. If that reclamation fails for whatever reason, you need to know that there’s something wrong with your persistent storage. Summing all of your kube_persistentvolume_status_phase{phase:Failed} metrics will give you a total of how many persistent volumes are currently failing.

Pod Pending Delays

During a pod’s lifecycle, its state is “pending” if it’s waiting to be scheduled on a node. If it’s stuck in the “pending” state, it usually means there aren’t enough resources to get the pod scheduled and deployed. You will need to either update your CPU and memory allocations, remove pods, or add more nodes to your cluster. You should watch your kube_pod_status_phase{phase:Pending} metrics over a particular window of time (e.g. has a pod stayed pending for over 15 minutes) to determine if you’re having pod scheduling problems.
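One way to turn "has a pod stayed pending for over 15 minutes" into a check is sketched below. It assumes you already have, for each pod, a time-ordered history of its kube_pod_status_phase{phase:Pending} value; the function name `pods_stuck_pending` and the input shape are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def pods_stuck_pending(
    history: Dict[str, Sequence[Tuple[float, int]]],
    max_pending_seconds: float = 900,
) -> List[str]:
    """Return pods whose current, unbroken pending streak exceeds the limit.

    history: pod name -> (timestamp_seconds, pending_flag) samples sorted by
    time, where pending_flag is the 0/1 metric value. Hypothetical input shape.
    """
    stuck = []
    for pod, samples in history.items():
        pending_since = None
        for ts, value in samples:
            if value == 1:
                if pending_since is None:
                    pending_since = ts  # start of a new pending streak
            else:
                pending_since = None  # streak broken; pod left Pending
        if pending_since is not None and samples[-1][0] - pending_since > max_pending_seconds:
            stuck.append(pod)
    return sorted(stuck)
```

Only the streak that is still open at the latest sample counts, so a pod that was pending earlier but has since been scheduled is not reported.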

Deployment Glitches

Deployments are used to manage stateless applications, where the pods are interchangeable and clients don’t need to reach any specific pod, just a particular type of pod. You need to keep an eye on your Deployments to make sure they finish properly. The best way is to make sure the latest Deployment Generation matches the observed Deployment Generation. If there’s a mismatch, then one or more Deployments have likely failed and not been rolled back.

DaemonSets Not Ready

DaemonSets are used to manage the services or applications that need to be run on all nodes in a cluster. If you have a log collection daemon or monitoring service that you want to run on every node, you’ll want to use a DaemonSet. Monitoring is similar to that of Deployments: you need to make sure that the number of desired DaemonSet Nodes matches the number of current DaemonSet Nodes. If there’s a mismatch, then one or more DaemonSets have failed to fully deploy.

StatefulSets Not Ready

StatefulSets are used to manage stateful applications, where the pods have specific roles and need to reach other specific pods, rather than just needing a particular type of pod as with Deployments. Monitoring is similar, though: you need to make sure that the number of desired StatefulSet Replicas matches the number of ready StatefulSet Replicas. If there’s a mismatch, then one or more StatefulSets have failed to fully deploy.
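The Deployment, DaemonSet, and StatefulSet checks all reduce to comparing a "desired" value with an "actual" one per object. A minimal Python sketch of that comparison follows; the function name `find_mismatches` and the plain-dict inputs (object name mapped to metric value) are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def find_mismatches(
    desired: Dict[str, int],
    actual: Dict[str, int],
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {object name: (desired, actual)} for every object that differs.

    An object with no "actual" value at all is reported with None.
    Hypothetical helper; inputs are assumed to be pre-fetched metric values.
    """
    return {
        name: (want, actual.get(name))
        for name, want in desired.items()
        if actual.get(name) != want
    }
```

For example, `desired` could hold kube_daemonset_status_desired_number_scheduled per DaemonSet and `actual` the matching kube_daemonset_status_current_number_scheduled values. A brief mismatch is expected while a rollout is in progress, so in practice you would probably alert only when the mismatch persists.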

#3: Control Plane Metrics

The Kubernetes Control Plane encompasses the portions of Kubernetes that are considered “system components” for helping with cluster management. In a managed environment, such as those that Google or Amazon provide, the Control Plane is managed by the cloud provider and you typically don’t have to worry about monitoring these metrics. However, if you manage your own cluster, you’ll want to know how to monitor your Control Plane. When they’re available, most of these metrics can be found via the metrics API.

Metric | Name | Description
etcd Leader | etcd_server_has_leader | A numeric boolean (0 or 1) for each etcd cluster member, indicating whether that member knows who its leader is.
etcd Leader Changes | etcd_server_leader_changes_seen_total | The count of the total number of leader changes which have happened in the etcd cluster.
API Latency Count | apiserver_request_latencies_count | The count of the total number of API requests; used to calculate average latency per request.
API Latency Sum | apiserver_request_latencies_sum | The total of all API request durations; used to calculate average latency per request.
Queue Waiting Time | workqueue_queue_duration_seconds | The total time that action items have spent waiting in each of the controller manager’s work queues.
Queue Work Time | workqueue_work_duration_seconds | The total time that has been taken to process action items from each of the controller manager’s work queues.
Unsuccessful Pod Scheduling Attempts | scheduler_schedule_attempts_total{result:unschedulable} | The total number of attempts made by the scheduler to schedule pods on nodes which ended up being unsuccessful.
Pod Scheduling Latency | scheduler_e2e_scheduling_latency_microseconds (< v1.14) or scheduler_e2e_scheduling_duration_seconds | The total length of time that has been taken to schedule pods onto nodes.

Control Plane Health Conditions

etcd Leaders

The etcd cluster should always have a leader (except during the process of changing leaders, which should be infrequent). You should keep an eye on all of your etcd_server_has_leader metrics because if too many cluster members don’t recognize their leader, your cluster performance will be degraded. Also, if you’re seeing a high number of leader changes reflected in etcd_server_leader_changes_seen_total, it could indicate issues with connectivity or resourcing in the etcd cluster.
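Both etcd checks can be sketched with two small Python helpers. The names `members_without_leader` and `counter_increase` are hypothetical, and the inputs are assumed to be values you have already scraped. Because etcd_server_leader_changes_seen_total is a running total, the sketch measures how much it grew over a window rather than looking at its absolute value, and it treats a drop in the value as a counter reset (an assumption about how a restarted member behaves).

```python
from typing import Dict, List, Sequence, Tuple


def members_without_leader(has_leader: Dict[str, int]) -> List[str]:
    """List etcd members whose etcd_server_has_leader value is 0."""
    return sorted(member for member, value in has_leader.items() if value == 0)


def counter_increase(samples: Sequence[Tuple[float, float]]) -> float:
    """Total growth of a counter across time-ordered (timestamp, value) samples.

    If the value drops, assume the counter restarted from zero and count the
    new value as growth since the reset.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase
```

You might alert when `members_without_leader` is non-empty for more than a short grace period, or when `counter_increase` over the last hour exceeds a small number that you choose for your cluster.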

API Request Latency

If you divide apiserver_request_latencies_sum by apiserver_request_latencies_count, you’ll get your API server’s average latency per request. Tracking the average request latency over time can let you know when your server is getting overwhelmed.
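Since both metrics are running totals, dividing them directly gives the average over the server’s whole lifetime, which reacts slowly to recent changes. A common alternative, sketched here as an assumption rather than a prescribed method, is to divide the change in the sum by the change in the count between two scrapes. The function name `average_latency` is hypothetical.

```python
from typing import Optional


def average_latency(
    prev_sum: float,
    prev_count: float,
    cur_sum: float,
    cur_count: float,
) -> Optional[float]:
    """Average latency per request between two scrapes of the sum/count pair.

    Returns None when no new requests were observed (or the counters reset),
    because the average is undefined in that case. The result is in whatever
    unit the sum metric uses.
    """
    delta_count = cur_count - prev_count
    if delta_count <= 0:
        return None
    return (cur_sum - prev_sum) / delta_count
```

For instance, if the sum grew from 1000 to 1600 while the count grew from 10 to 14, the four new requests averaged 150 units each.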

Work Queue Latency

The work queues are action queues managed by the controller manager, and are used to handle all automated processes in the cluster. Watching for increases in either workqueue_queue_duration_seconds or workqueue_work_duration_seconds will let you know when the queue latency is increasing. If this happens, you may want to dig into the controller manager logs to see what’s going on.

Scheduler Problems

There are two aspects of the scheduler that are worth watching. First, you should monitor scheduler_schedule_attempts_total{result:unschedulable} because an increase in unschedulable pods may mean you have a resourcing issue with your cluster. Second, you should keep an eye on the scheduler latency using one of the latency metrics indicated above (the metric name and units changed with v1.14). An increase in pod scheduling latency may cause other problems, and may also indicate resourcing issues in your cluster.

Events

In addition to collecting numeric metrics from your Kubernetes cluster, collecting and tracking events from your cluster can also be useful. Cluster events will let you monitor the pod lifecycle and watch for significant pod failures, and watching the rate of events flowing from your cluster can be an excellent early warning indicator. If the rate of events changes suddenly or significantly, it may be an indicator that something is going wrong.
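A sudden change in event rate can be detected in many ways. The sketch below uses one simple approach, chosen here purely for illustration: compare the newest interval’s event count with the average of the preceding intervals and flag it when it differs by more than a chosen factor. The function name `event_rate_anomaly` and the default factor of 3 are assumptions.

```python
from statistics import mean
from typing import Sequence


def event_rate_anomaly(counts_per_interval: Sequence[int], factor: float = 3.0) -> bool:
    """Flag the latest interval if its event count jumped or dropped sharply.

    counts_per_interval: event counts for equal-length intervals, oldest first.
    The last entry is compared with the mean of all earlier entries.
    """
    if len(counts_per_interval) < 2:
        return False  # not enough history to form a baseline
    *baseline, latest = counts_per_interval
    base = mean(baseline)
    if base == 0:
        return latest > 0
    ratio = latest / base
    return ratio >= factor or ratio <= 1 / factor
```

Checking for drops as well as spikes matters because a cluster that suddenly goes quiet can be as much of a warning sign as one that gets noisy.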

Application Metrics

Unlike the rest of the metrics and events we’ve examined above, application metrics aren’t emitted from Kubernetes itself, but rather from your workloads which are run by the cluster. This telemetry can be anything that you consider important from the point of view of your application: error responses, request latency, processing time, etc.

There are two philosophies of how to collect application metrics. The first (which has been widely preferred until recently) is that metrics should be “pushed” out from the application to a collection endpoint. This means a client like StatsD has to be bundled with each application to provide a mechanism with which to push metric data out of that application. This technique requires more management overhead to ensure every application running in your cluster is instrumented properly, so it’s begun falling out of favor with cluster managers.

The second metric collection philosophy (which is becoming more widely adopted) is that metrics should be “pulled” from applications by a collection agent. This makes applications easier to write because all they have to do is publish their metrics appropriately, but the application doesn’t have to worry about how those metrics are pulled or scraped. This is how OpenMetrics works and is the way Kubernetes cluster metrics are collected. When this technique is combined with service discovery by your collection agent, it creates a powerful method for collecting any kind of metrics you need from your cluster applications.
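To make the pull model concrete, here is a minimal sketch of an application publishing its own counters over HTTP for a collection agent to scrape, using only the Python standard library. It writes a Prometheus-style text format, which is the style OpenMetrics builds on; it is not a complete OpenMetrics implementation. The metric name `http_requests_total`, the `/metrics` path, the port, and the `APP_COUNTERS` dict are all illustrative assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

LabelPairs = Tuple[Tuple[str, str], ...]

# Hypothetical in-process counters: (metric name, label pairs) -> value.
APP_COUNTERS: Dict[Tuple[str, LabelPairs], int] = {
    ("http_requests_total", (("code", "200"),)): 42,
    ("http_requests_total", (("code", "500"),)): 3,
}


def render_metrics(counters: Dict[Tuple[str, LabelPairs], int]) -> str:
    """Render counters as text: one TYPE line per metric, one line per series."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The application only publishes; the agent decides when to scrape.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(APP_COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` from the application’s startup code would expose the endpoint. In a real service you would more likely use an established metrics client library than hand-roll the format, but the division of labor is the same: the application exposes current values, and the agent handles discovery and scheduling.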

Final Thoughts

Kubernetes can generate millions upon millions of new metrics daily. This can present two big challenges. First, many conventional monitoring systems just can’t keep up with the sheer volume of unique metrics needed to properly monitor Kubernetes clusters. Second, all this data “noise” makes it hard to keep up with and know which metrics are most important.

Your Kubernetes monitoring solution must have the ability to handle all of this data, as well as automatically analyze, graph and alert on the most critical metrics to pay attention to. This way, you know you’ve collected everything you need, filtered out the unnecessary data, and automatically narrowed in on the most relevant data. As a result, you can save substantial time and rest assured that everything is working as it should.