Comprehensive Container-Based Service Monitoring with Kubernetes and Istio
Operating containerized infrastructure brings with it a new set of challenges. You need to instrument your containers, evaluate your API endpoint performance, and identify bad actors within your infrastructure. The Istio service mesh enables instrumentation of APIs without code change and provides service latencies for free. But how do you make sense of all that data? With math, that’s how.
Circonus is the first third party adapter for Istio. In a previous post, we talked about the first Istio community adapter to monitor Istio based services. This post will expand on that. We’ll explain how to get a comprehensive understanding of your Kubernetes infrastructure. We will also explain how to get an Istio service mesh implementation for your container based infrastructure.
Istio is a service mesh for Kubernetes, which means that it takes care of all of the intercommunication and facilitation between services, much like network routing software does for TCP/IP traffic. In addition to Kubernetes, Istio can also interact with Docker and Consul based services. It’s similar to Linkerd, which has been around for a while.
Istio is an open source project developed by teams from Google, IBM, Cisco, and Lyft’s Envoy. The project recently turned one year old, and Istio has found its way into a couple of production environments at scale. At the time of this post, the current version is 0.8.
So, how does Istio fit into the Kubernetes ecosystem? Kubernetes acts as the data plane and Istio acts as the control plane. Kubernetes carries the application traffic, handling container orchestration, deployment, and scaling. Istio routes the application traffic, handling policy enforcement, traffic management and load balancing. It also handles telemetry syndication such as metrics, logs, and tracing. Istio is the crossing guard and reporting piece of the container based infrastructure.
The diagram above shows the service mesh architecture. Istio uses an Envoy sidecar proxy for each service. Envoy proxies inbound requests to the Istio Mixer service via a gRPC call. Mixer then applies rules for traffic management and syndicates request telemetry. Mixer is the brains of Istio. Operators can write YAML files that specify how Envoy should redirect traffic. They can also specify what telemetry to push to monitoring and observability systems. Rules can be applied as needed at run time without needing to restart any Istio components.
Istio supports a number of adapters to send data to a variety of monitoring tools, including Prometheus, Circonus, and StatsD. You can also enable both Zipkin and Jaeger tracing, and you can generate graphs to visualize the services involved.
Istio is easy to deploy. Way back when, around 7 to 8 months ago, you had to install Istio onto a Kubernetes cluster with a series of kubectl commands. And you still can today. But now you can just hop into Google Cloud Platform and deploy an Istio enabled Kubernetes cluster with a few clicks, including monitoring, tracing, and a sample application. You can get up and running very quickly, and then use the istioctl command to start having fun.
Another benefit is that we can gather data from services without requiring developers to instrument their services to provide that data. This has a multitude of benefits. It reduces maintenance. It removes points of failure in the code. It provides vendor agnostic interfaces, which reduces the chance of vendor lock-in.
With Istio, we can deploy different versions of individual services and weight the traffic between them. Istio itself makes use of a number of different pods to operate itself, as shown here:
> kubectl get pods -n istio-system
NAME                       READY   STATUS    RESTARTS   AGE
istio-ca-797dfb66c5        1/1     Running   0          2m
istio-ingress-84f75844c4   1/1     Running   0          2m
istio-egress-29a16321d3    1/1     Running   0          2m
istio-mixer-9bf85fc68      3/3     Running   0          2m
istio-pilot-575679c565     2/2     Running   0          2m
grafana-182346ba12         2/2     Running   0          2m
prometheus-837521fe34      2/2     Running   0          2m
Istio is not exactly lightweight. The power and flexibility of Istio come with the cost of some overhead for operation. However, if you have more than a few microservices in your application, your application containers will soon surpass the system provisioned containers.
Service Level Objectives
This brief overview of Service Level Objectives will set the stage for how we should measure our service health. The concept of Service Level Agreements (SLAs) has been around for at least a decade. But just recently, the amount of online content related to Service Level Objectives (SLOs) and Service Level Indicators (SLIs) has been increasing rapidly.
In addition to the well-known Google SRE book, two new books that talk about SLOs are being published soon. The Site Reliability Workbook has a dedicated chapter on SLOs, and Seeking SRE has a chapter on defining SLO goals by Circonus founder and CEO, Theo Schlossnagle. We also recommend watching the YouTube video “SLIs, SLOs, SLAs, oh my!” from Seth Vargo and Liz Fong-Jones to get an in depth understanding of the difference between SLIs, SLOs, and SLAs.
To summarize: SLIs drive SLOs, which inform SLAs.
A Service Level Indicator (SLI) is a metric derived measure of health for a service. For example, I could have an SLI that says my 95th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.
A Service Level Objective (SLO) is a goal or target for an SLI. We take an SLI, and extend its scope to quantify how we expect our service to perform over a strategic time interval. Using the SLI from the previous example, we could say that we want to meet the criteria set by that SLI for 99.9% of a trailing year window.
A Service Level Agreement (SLA) is an agreement between a business and a customer, defining the consequences for failing to meet an SLO. Generally, the SLOs which your SLA is based upon will be more relaxed than your internal SLOs, because we want our internal facing targets to be more strict than our external facing targets.
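To make the SLO arithmetic concrete, here is a minimal sketch of how much room a 99.9% target over a trailing year leaves. It assumes the SLI is evaluated once per 5 minute window; the function name `allowed_violations` and its parameters are hypothetical and not part of any Istio or Circonus API.

```python
import math
from fractions import Fraction


def allowed_violations(slo_target: str, window_minutes: int = 5,
                       period_days: int = 365) -> int:
    """Number of SLI windows that may miss the target without breaking the SLO.

    slo_target is passed as a string (for example "0.999") so that it can be
    converted to an exact fraction instead of a rounded float.
    """
    total_windows = period_days * 24 * 60 // window_minutes
    budget = total_windows * (1 - Fraction(slo_target))
    return math.floor(budget)


budget_windows = allowed_violations("0.999")
print(budget_windows, "windows, or", budget_windows * 5, "minutes per year")
```

Under these assumptions, a year contains 105,120 five minute windows, so the 99.9% objective tolerates 105 missed windows, which is a little under nine hours of the year.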
What combinations of SLIs are best for quantifying both host and service health? Over the past several years, there have been a number of emerging standards. The top standards are the USE method, the RED method, and the “four golden signals” discussed in the Google SRE book.
Brendan Gregg introduced the USE method, which seeks to quantify health of a system host based on utilization, saturation, and errors metrics. For something like a CPU, we can use common utilization metrics for user, system, and idle percentages. We can use load average and run queue for saturation. The UNIX perf profiler is a good tool for measuring CPU error events.
Tom Wilkie introduced the RED method a few years ago. With RED, we monitor request rate, request errors, and request duration. The Google SRE book talks about using latency, traffic, errors, and saturation metrics. These “four golden signals” are targeted at service health and are similar to the RED method, but extend it with saturation. In practice, it can be difficult to quantify service saturation.
So, how are we monitoring the containers? Containers are short lived entities. Monitoring them directly to discern our service health presents a number of complex problems, such as the high cardinality issue. It is easier and more effective to monitor the service outputs of those containers in aggregate. We don’t care if one container is misbehaving if the service is healthy. Chances are that our orchestration framework will reap that container anyway and replace it with a new one.
Let’s consider how best to integrate SLIs from Istio as part of a RED dashboard. To compose our RED dashboard, let’s look at what telemetry is provided by Istio:
- Request Count by Response Code
- Request Duration
- Request Size
- Response Size
- Connection Received Bytes
- Connection Sent Bytes
- Connection Duration
- Template Based MetaData (Metric Tags)
Istio provides several metrics about the requests it receives, the latency to generate a response, and connection level data. Note the first two items from the list above; we’ll want to include them in our RED dashboard.
Istio also gives us the ability to add metric tags, which it calls dimensions. So we can break down the telemetry by host, cluster, etc. We can get the rate in requests per second by taking the first derivative of the request count. We can get the error rate by taking the derivative of the request count of unsuccessful requests. Istio also provides us with the request latency of each request, so we can record how long each service request took to complete.
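As a rough illustration of the derivative step, the sketch below turns cumulative request counts into a per second rate. The sample format (timestamp, cumulative count) and the function name are assumptions for this example, and treating a drop in the counter as a restart from zero is one common convention rather than something Istio prescribes.

```python
from typing import List, Tuple


def per_second_rate(samples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Discrete first derivative of a cumulative counter.

    samples: (unix_timestamp_seconds, cumulative_request_count), sorted by time.
    Returns (timestamp, requests_per_second) for each pair of adjacent samples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0
        if delta < 0:
            # Assumption: the counter was reset (for example a pod restart),
            # so everything counted since the reset is new traffic.
            delta = c1
        rates.append((t1, delta / dt))
    return rates


# Hypothetical samples taken every 10 seconds; the last one follows a reset.
request_counts = [(0, 0), (10, 50), (20, 150), (30, 20)]
rates = per_second_rate(request_counts)
print(rates)
```

The same function applied to the count of unsuccessful requests gives the error rate.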
In addition, Istio provides us with a Grafana dashboard out of the box that contains the pieces we want:
The components we want are circled in red in the screenshot above. We have the request rate in operations per second in the upper left, the number of failed requests per second in the upper right, and a graph of response time in the bottom. There are several other indicators on this graph, but let’s take a closer look at the ones we’ve circled:
The above screenshot shows the rate component of the dashboard. This is pretty straightforward. We count the number of requests which returned a 200 response code and graph the rate over time.
The Istio dashboard does something similar for responses that return a 5xx error code. In the above screenshot, you can see how it breaks down the errors by either the ingress controller, or by errors from the application product page itself.
This screenshot shows the request duration graph. This graph is the most informative about the health of our service. This data is provided by a Prometheus monitoring system, so we see request time percentiles graphed here, including the median, 90th, 95th, and 99th percentiles.
These percentiles give us some overall indication of how the service is performing. However, there are a number of deficiencies with this approach that are worth examining. During times of low activity, these percentiles can skew wildly because of limited numbers of samples. This could mislead you about the system performance in those situations. Let’s take a look at the other issues that can arise with this approach:
- The percentiles are aggregated metrics over fixed time windows.
- The percentiles cannot be re-aggregated for cluster health.
- The percentiles cannot be averaged (and this is a common mistake).
- The stored aggregates are outputs, not inputs.
- It is difficult to measure cluster SLIs with this method.
Percentiles often provide deeper insight than averages as they express the range of values with multiple data points instead of just one. But like averages, percentiles are an aggregated metric. They are calculated over a fixed time window for a fixed data set. If we calculate a duration percentile for one cluster member, we cannot merge that with another one to get an aggregate performance metric for the whole cluster.
It is a common misconception that percentiles can be averaged; they cannot, except in rare cases where the distributions that generated them are nearly identical. If you only have the percentile, and not the source data, you cannot know whether that is the case. It is a chicken and egg problem.
This also means that you cannot set service level indicators for an entire service due to the lack of mergeability, if you are measuring percentile based performance only for individual cluster members.
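A small worked example shows how far off averaging can be. The two cluster members and their latencies are made up for illustration, and the percentile function uses the simple nearest-rank definition rather than any particular monitoring system's method.

```python
from typing import List


def percentile(values: List[int], q: int) -> int:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q/100 * n)
    return ordered[rank - 1]


# Hypothetical cluster: one busy, fast member and one quiet, slow member.
busy_member_ms = [10] * 900    # 900 requests at 10 ms
quiet_member_ms = [500] * 100  # 100 requests at 500 ms

averaged_p90 = (percentile(busy_member_ms, 90) + percentile(quiet_member_ms, 90)) / 2
true_p90 = percentile(busy_member_ms + quiet_member_ms, 90)

print("average of per-member p90s:", averaged_p90)
print("p90 of all requests:", true_p90)
```

Here the averaged value is 255 ms, while the 90th percentile of all requests served by the cluster is 10 ms. The average ignores how many requests each member handled, which is exactly the information the percentiles discarded.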
Our ability to set meaningful SLIs (and as a result, meaningful SLOs) is limited here, due to only having four latency data points over fixed time windows. So when you are working with percentile based duration metrics, you have to ask yourself if your SLIs are really good SLIs. We can do better by using math to determine the SLIs that we need to give us a comprehensive view of our service’s performance and health.
Above is a visualization of latency data for a service in microseconds using a histogram. The number of samples is on the Y-Axis, and the sample value, in this case microsecond latency, is on the X-Axis. This is the open source histogram we developed at Circonus. (See the open source in both C and Golang, or read more about histograms here.) There are a few other histogram implementations out there that are open source, such as Ted Dunning’s t-digest histogram and the HDR histogram.
The Envoy project recently adopted the C implementation of Circonus’s log linear histogram library. This allows Envoy data to be collected as distributions. They found a very minor bug in the implementation, which Circonus was quite happy to fix. That’s the beauty of open source: the more eyes on the code, the better it gets over time.
Histograms are mergeable. Any two or more histograms can be merged as long as the bin boundaries are the same. That means that we can take this distribution and combine it with other distributions. Mergeable metrics are great for monitoring and observability. They allow us to combine outputs from similar sources, such as service members, and get aggregate service metrics.
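As a minimal sketch of what merging means in practice, the example below represents each histogram as a mapping from a bin's lower boundary to its sample count. That representation and the pod names are assumptions for illustration; it is not the data structure used by the Circonus library.

```python
from collections import Counter


def merge(*histograms: Counter) -> Counter:
    """Merge histograms that share bin boundaries by adding counts per bin."""
    merged = Counter()
    for histogram in histograms:
        merged.update(histogram)  # Counter.update adds counts, it does not replace them
    return merged


# Hypothetical per-pod latency histograms: {bin lower bound in ms: sample count}
pod_a = Counter({100: 5, 110: 2})
pod_b = Counter({110: 3, 250: 1})

service = merge(pod_a, pod_b)
print(sorted(service.items()))
```

Because the merge is just addition per bin, it can be applied across service members, across time windows, or both, and the result is still a valid histogram.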
As indicated in the image above, this log linear histogram contains 90 bins for each power of 10. You can see 90 bins between 100,000 and 1M. At each power of 10, the bin size increases by a factor of 10. This allows us to record a wide range of values with high relative accuracy without needing to know the data distribution ahead of time. Let’s see what this looks like when we overlay some percentiles:
Now you can see where we have the average, and the 50th percentile (also known as the median), and the 90th percentile. The 90th percentile is the value at which 90% of the samples are below that value.
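The log linear binning scheme described above can be sketched in a few lines. This is a simplified illustration for positive integer values (such as microsecond latencies), not the API of the Circonus histogram library; the function name `bin_bounds` is hypothetical.

```python
from typing import Tuple


def bin_bounds(value: int) -> Tuple[int, int]:
    """Return the [lower, upper) bounds of the log linear bin holding value.

    Each bin keeps the two leading digits of the value, so every power of 10
    is divided into 90 bins (leading digits 10 through 99).
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive integers")
    width = 1
    while value // width >= 100:
        width *= 10  # bin width grows by a factor of 10 at each power of 10
    lower = (value // width) * width
    return lower, lower + width


print(bin_bounds(123456))   # a sample between 100,000 and 1M
print(bin_bounds(1000000))  # the first bin of the next power of 10
```

For a 123,456 microsecond sample this yields the bin from 120,000 to 130,000, one of the 90 bins between 100,000 and 1M. For values of 10 and above, a bin is never wider than 10% of its lower bound, which is where the high relative accuracy comes from.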
Consider our example SLI from earlier. With latency data displayed in this format, we can easily calculate that SLI for a service by merging histograms together to get a 5 minute view of data, and then calculating the 95th percentile value for that distribution. If it is less than 300 milliseconds, we met our target.
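A minimal sketch of that calculation follows, using the SLI defined earlier in this post (95th percentile under 300 milliseconds over 5 minutes). The histogram layout, keyed by each bin's upper bound in milliseconds, and the sample counts are assumptions for illustration. Reporting the upper bound of the bin that contains the percentile is a deliberately conservative choice.

```python
from collections import Counter


def percentile_upper_bound(hist: Counter, q: int) -> int:
    """Upper bound of the bin that contains the q-th percentile (nearest rank)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = (q * total + 99) // 100  # integer ceil(q/100 * total)
    seen = 0
    for upper in sorted(hist):
        seen += hist[upper]
        if seen >= rank:
            return upper
    raise AssertionError("unreachable: rank never exceeds total")


# Hypothetical one minute histograms: {bin upper bound in ms: request count}
one_minute = [
    Counter({100: 90, 250: 8, 400: 2}),
    Counter({100: 85, 250: 12, 400: 3}),
    Counter({100: 95, 250: 4, 400: 1}),
    Counter({100: 88, 250: 10, 400: 2}),
    Counter({100: 92, 250: 7, 400: 1}),
]

five_minutes = sum(one_minute, Counter())  # merge by adding counts per bin
p95 = percentile_upper_bound(five_minutes, 95)
sli_met = p95 < 300
print("p95 is at most", p95, "ms; SLI met:", sli_met)
```

The percentile is computed after merging, so the same code works whether the histograms being combined come from different minutes, different pods, or both.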
The RED dashboard duration graph from our screenshot above has four percentiles, the 50th, 90th, 95th, and 99th. We could overlay those percentiles on this distribution as well. Even without data, we can see the rough shape of what the request distribution might look like, but that would be making a lot of assumptions. To see just how misleading those assumptions based on just a few percentiles can be, let’s look at a distribution with additional modes:
This histogram shows a distribution with two distinct modes. The leftmost mode could be fast responses due to serving from a cache, and the right mode from serving from disk. Just measuring latency using four percentiles would make it nearly impossible to discern a distribution like this. This gives us a sense of the complexity that percentiles can mask. Consider a distribution with more than two modes:
This distribution has at least four visible modes. If we do the math on the full distribution, we will find 20+ modes here. How many percentiles would you need to record to approximate a latency distribution like the one above? What about a distribution like the one below?
Complex systems composed of many services will generate latency distributions that cannot be accurately represented by using percentiles. You have to record the entire latency distribution to be able to fully represent it. This is one reason it is preferable to store the complete distributions of the data in histograms and calculate percentiles as needed, rather than just storing a few percentiles.
This type of histogram visualization shows a distribution over a fixed time window. We can store multiple distributions to get a sense of how it changes over time, as shown below:
This is a heatmap, which represents a set of histograms over time. Imagine that each column in this heatmap is a separate bar chart viewed from above, with color being used to indicate the height of each bin. This is a Grafana visualization of the response latency from a cluster of 10 load balancers. It gives us deep insight into the behavior of the entire cluster over a week; there are over 1 million data samples here. The median here centers around 500 microseconds, shown in the red colored bands.
Above is another type of heatmap. Here, saturation is used to indicate the “height” of each bin (the darker tiles are more “full”). Also, this time we’ve overlaid percentile calculations over time on top of the heatmap. Percentiles are robust metrics and very useful, but not by themselves. We can see here how the 90+ percentiles increase as the latency distribution shifts upwards.
Let’s take these distribution based duration maps and see if we can generate something more informative than the sample Istio dashboard:
The above screenshot is a RED dashboard revised to show distribution based latency data. In the lower left, we show a histogram of latencies over a fixed time window. To the right of it, we use a heat map to break that distribution down into smaller time windows. With this layout of RED dashboard, we can get a complete view of how our service is behaving with only a few panels of information. This particular dashboard was implemented using Grafana served from an IRONdb time series database which stores the latency data natively as log linear histograms.
We can extend this RED dashboard a bit further and overlay our SLIs onto the graphs as well:
For the rate panel, our SLI might be to maintain a minimum level of requests per second. For the errors panel, our SLI might be to stay under a certain number of errors per second. And as we have previously examined duration SLIs, we might want the 99th percentile for our entire service, which is composed of several pods, to stay under a certain latency over a fixed window. Using Istio telemetry stored as histograms enables us to set these meaningful service wide SLIs. Now we have a lot more to work with and we’re better able to interrogate our data (see below).
Asking the Right Questions
So now that we’ve put the pieces together and have seen how to use Istio to get meaningful data from our services, let’s see what kinds of questions we can answer with it.
We all love being able to solve technical problems, but not everyone has that same focus. The folks on the business side want to answer questions on how the business is doing. You need to be able to answer business-centric questions. Let’s take the toolset we’ve assembled and align the capabilities with a couple of questions that the business might ask its SREs:
- How many users got angry on the Tuesday slowdown after the big marketing promotion?
- Are we over-provisioned or under-provisioned on our purchasing checkout service?
Consider the first example. Everyone has been through a big slowdown. Let’s say Marketing did a big push, traffic went up, performance speed went down, and users complained that the site got slow. How can we quantify the extent of how slow it was for everyone? How many users got angry? Let’s say that Marketing wants to know this so that they can send out a 10% discount email to the users affected and also because they want to avoid a recurrence of the same problem. Let’s craft an SLI and assume that users noticed the slowdown and got angry if requests took more than 500 milliseconds. How can we calculate how many users got angry with this SLI of 500 ms?
First, we need to already be recording the request latencies as a distribution. Then we can plot them as a heatmap. We can use the distribution data to calculate the percentage of requests that exceeded our 500ms SLI by using inverse percentiles. We take that answer, multiply it by the total number of requests in that time window, and integrate over time. Then we can plot the result overlaid on the heatmap:
In this screenshot, we’ve circled the part of the heatmap where the slowdown occurred. The increased latency distribution is fairly indicative of a slowdown. The line on the graph indicates the total number of requests affected over time.
In this example, we managed to miss our SLI for 4 million requests. Whoops. What isn’t obvious are the two additional slowdowns on the right because they are smaller in magnitude. Each of those cost us an additional 2 million SLI violations. Ouch.
We can do these kinds of mathematical analyses because we are storing data as distributions, not aggregations like percentiles.
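The inverse percentile calculation used for the slowdown question can be sketched as follows. The histogram layout (bin lower bound in milliseconds mapped to a count), the sample data, and the function names are assumptions for illustration, and the sketch assumes the 500 ms threshold falls on a bin boundary.

```python
from itertools import accumulate
from typing import Dict, List


def fraction_slower_than(hist: Dict[int, int], threshold_ms: int) -> float:
    """Inverse percentile: fraction of requests in bins at or above threshold_ms."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    slow = sum(count for lower, count in hist.items() if lower >= threshold_ms)
    return slow / total


def cumulative_violations(windows: List[Dict[int, int]],
                          threshold_ms: int = 500) -> List[int]:
    """Running total of requests that missed the SLI, one entry per time window."""
    per_window = [
        round(fraction_slower_than(hist, threshold_ms) * sum(hist.values()))
        for hist in windows
    ]
    return list(accumulate(per_window))  # discrete integration over time


# Hypothetical histograms for three consecutive time windows.
windows = [
    {100: 900, 500: 100},
    {100: 500, 500: 400, 1000: 100},
    {100: 950, 500: 50},
]
affected = cumulative_violations(windows)
print(affected)
```

The final entry of the running total is the number of requests that exceeded the threshold over the whole period, which is the figure plotted as a line over the heatmap.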
Let’s consider another common question. Is my service under provisioned, or over provisioned?
The answer is often “it depends.” Loads vary based on the time of day and the day of week, in addition to varying because of special events. That’s before we even consider how the system behaves under load. Let’s put some math to work and use latency bands to visualize how our system can perform:
The visualization above shows latency distribution broken down by latency bands over time. The bands here show the number of requests that took under 25ms, 25-100ms, 100-250ms, 250-1000ms, and over 1000ms. The colors range from green for fast requests to red for slow requests.
What does this visualization tell us? It shows that requests to our service started off very quickly, then the percentage of fast requests dropped off after a few minutes, and the percentage of slow requests increased after about 10 minutes. This pattern repeated itself for two traffic sessions. What does that tell us about provisioning? It suggests that initially the service was over provisioned, but then became under provisioned over the course of 10-20 minutes. Sounds like a good candidate for auto-scaling.
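As a rough sketch, the latency bands in this kind of visualization can be derived by counting requests between fixed band edges. The edges follow the bands named above; the function name, the labels, and the sample latencies are hypothetical, and a latency that lands exactly on an edge is counted in the slower band.

```python
from bisect import bisect_right
from typing import Dict, List

BAND_EDGES_MS = [25, 100, 250, 1000]
BAND_LABELS = ["under 25ms", "25-100ms", "100-250ms", "250-1000ms", "1000ms and over"]


def band_counts(latencies_ms: List[float]) -> Dict[str, int]:
    """Count requests per latency band for one time window."""
    counts = [0] * len(BAND_LABELS)
    for latency in latencies_ms:
        # bisect_right returns how many edges are <= latency,
        # which is the index of the band the request falls into.
        counts[bisect_right(BAND_EDGES_MS, latency)] += 1
    return dict(zip(BAND_LABELS, counts))


# Hypothetical request latencies observed in one window.
window = [10, 25, 99, 100, 300, 1000, 5000]
bands = band_counts(window)
print(bands)
```

Computing these counts for each time window and stacking them produces the banded view. With latencies already stored as histograms, the same counts can be read from the bins instead of from raw samples.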
We can also add this type of visualization to our RED dashboard. This type of data is excellent for business stakeholders. And it doesn’t require a lot of technical knowledge investment to understand the impact on the business.
We should monitor services, not containers. Services are long lived entities; containers are not. Your users don’t care how your containers are performing. They care about how your services are performing.
You should record distributions instead of aggregates. But then you should generate your aggregates from those distributions. Aggregates are very valuable sources of information. But they are unmergeable and so they are not well suited to statistical analysis.
Istio gives you a lot of stuff for free. You don’t have to instrument your code either. You don’t need to go and build a high quality application framework from scratch.
Use math to ask and answer questions about your services that are important to the business. That’s what this is all about, right? When we can make systems reliable by answering questions that the business values, we achieve the goals of the organization.