Comprehensive Container-Based Service Monitoring with Kubernetes and Istio

Operating containerized infrastructure brings with it a new set of challenges. You need to instrument your containers, evaluate your API endpoint performance, and identify bad actors within your infrastructure. The Istio service mesh enables instrumentation of APIs without code changes and provides service latencies for free. But how do you make sense of all that data? With math, that’s how.

Circonus built the first third party adapter for Istio. In a previous post, we talked about that first Istio community adapter and how it monitors Istio based services. This post will expand on that. We’ll explain how to get a comprehensive understanding of your Kubernetes infrastructure, and how to get an Istio service mesh implementation working for your container based infrastructure.

Istio Overview

Istio is a service mesh for Kubernetes, which means that it takes care of all of the intercommunication and facilitation between services, much like network routing software does for TCP/IP traffic. In addition to Kubernetes, Istio can also interact with Docker and Consul based services. It’s similar to Linkerd, which has been around for a while.

Istio is an open source project developed by teams from Google, IBM, Cisco, and Lyft (the team behind Envoy). The project recently turned one year old, and Istio has found its way into a couple of production environments at scale. At the time of this post, the current version is 0.8.

So, how does Istio fit into the Kubernetes ecosystem? Kubernetes acts as the data plane and Istio acts as the control plane. Kubernetes carries the application traffic, handling container orchestration, deployment, and scaling. Istio routes the application traffic, handling policy enforcement, traffic management and load balancing. It also handles telemetry syndication such as metrics, logs, and tracing. Istio is the crossing guard and reporting piece of the container based infrastructure.

The diagram above shows the service mesh architecture. Istio uses an Envoy sidecar proxy for each service. Envoy proxies inbound requests to the Istio Mixer service via a gRPC call. Then Mixer applies rules for traffic management and syndicates request telemetry. Mixer is the brains of Istio. Operators can write YAML files that specify how Envoy should redirect traffic. They can also specify what telemetry to push to monitoring and observability systems. Rules can be applied as needed at run time without needing to restart any Istio components.

Istio supports a number of adapters to send data to a variety of monitoring tools, including Prometheus, Circonus, and StatsD. You can also enable both Zipkin and Jaeger tracing, and you can generate graphs to visualize the services involved.

Istio is easy to deploy. Not so long ago, around 7 to 8 months back, you had to install Istio onto a Kubernetes cluster with a series of kubectl commands, and you still can today. But now you can just hop into Google Cloud Platform and deploy an Istio enabled Kubernetes cluster with a few clicks, including monitoring, tracing, and a sample application. You can get up and running very quickly, and then use the istioctl command to start having fun.

Another benefit is that we can gather data from services without requiring developers to instrument their services to provide that data. This has a multitude of benefits. It reduces maintenance. It removes points of failure in the code. It provides vendor agnostic interfaces, which reduces the chance of vendor lock-in.

With Istio, we can deploy different versions of individual services and weight the traffic between them. Istio itself makes use of a number of different pods to operate itself, as shown here:

> kubectl get pods -n istio-system
NAME                     READY STATUS  RESTARTS AGE
istio-ca-797dfb66c5      1/1   Running  0       2m
istio-ingress-84f75844c4 1/1   Running  0       2m
istio-egress-29a16321d3  1/1   Running  0       2m
istio-mixer-9bf85fc68    3/3   Running  0       2m
istio-pilot-575679c565   2/2   Running  0       2m
grafana-182346ba12       2/2   Running  0       2m
prometheus-837521fe34    2/2   Running  0       2m

Istio is not exactly lightweight. The power and flexibility of Istio come with the cost of some operational overhead. However, if you have more than a few microservices in your application, your application containers will soon outnumber the system provisioned containers.

Service Level Objectives

This brief overview of Service Level Objectives will set the stage for how we should measure our service health. The concept of Service Level Agreements (SLAs) has been around for at least a decade. But just recently, the amount of online content related to Service Level Objectives (SLOs) and Service Level Indicators (SLIs) has been increasing rapidly.

In addition to the well-known Google SRE book, two new books that talk about SLOs are being published soon. The Site Reliability Workbook has a dedicated chapter on SLOs, and Seeking SRE has a chapter on defining SLO goals by Circonus founder and CEO, Theo Schlossnagle. We also recommend watching the YouTube video “SLIs, SLOs, SLAs, oh my!” from Seth Vargo and Liz Fong-Jones to get an in-depth understanding of the differences between SLIs, SLOs, and SLAs.

To summarize: SLIs drive SLOs, which inform SLAs.

A Service Level Indicator (SLI) is a metric-derived measure of health for a service. For example, I could have an SLI that says that the 95th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.

A Service Level Objective (SLO) is a goal or target for an SLI. We take an SLI, and extend its scope to quantify how we expect our service to perform over a strategic time interval. Using the SLI from the previous example, we could say that we want to meet the criteria set by that SLI for 99.9% of a trailing year window.

A Service Level Agreement (SLA) is an agreement between a business and a customer, defining the consequences for failing to meet an SLO. Generally, the SLOs which your SLA is based upon will be more relaxed than your internal SLOs, because we want our internal facing targets to be more strict than our external facing targets.
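
To make these definitions concrete, here is a minimal Go sketch (the helper names are hypothetical, not code from Istio or Circonus) that evaluates the example SLI against 5 minute windows of latency samples, and then computes SLO compliance as the fraction of windows that met it:

package main

import (
    "fmt"
    "sort"
    "time"
)

// percentile returns an approximate p-th percentile (0-100) of the samples.
func percentile(samples []time.Duration, p float64) time.Duration {
    if len(samples) == 0 {
        return 0
    }
    sorted := append([]time.Duration(nil), samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    return sorted[int(float64(len(sorted)-1)*p/100.0)]
}

// sliMet reports whether one 5 minute window of homepage latencies met the
// example SLI: 95th percentile latency under 300 milliseconds.
func sliMet(window []time.Duration) bool {
    return percentile(window, 95) < 300*time.Millisecond
}

// sloCompliance is the SLO view: the fraction of windows that met the SLI.
// Over a trailing year, the target might be 0.999.
func sloCompliance(windows [][]time.Duration) float64 {
    if len(windows) == 0 {
        return 0
    }
    met := 0
    for _, w := range windows {
        if sliMet(w) {
            met++
        }
    }
    return float64(met) / float64(len(windows))
}

func main() {
    windows := [][]time.Duration{
        {120 * time.Millisecond, 250 * time.Millisecond, 290 * time.Millisecond},
        {150 * time.Millisecond, 400 * time.Millisecond, 900 * time.Millisecond},
    }
    fmt.Printf("SLO compliance: %.3f\n", sloCompliance(windows))
}

In practice, the windows would come from your telemetry store, and the compliance target would be something like 0.999 over a trailing year.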

RED Dashboard

What combinations of SLIs are best for quantifying both host and service health? Over the past several years, there have been a number of emerging standards. The top standards are the USE method, the RED method, and the “four golden signals” discussed in the Google SRE book.

Brendan Gregg introduced the USE method, which seeks to quantify the health of a system host based on utilization, saturation, and errors metrics. For something like a CPU, we can use the common utilization metrics for user, system, and idle percentages. We can use load average and run queue for saturation. The Linux perf profiler is a good tool for measuring CPU error events.
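
As a rough illustration of the utilization piece, here is a minimal, Linux-only Go sketch that derives user, system, and idle percentages from two samples of the aggregate cpu line in /proc/stat (the field handling is simplified; this is illustrative, not production code):

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// cpuTimes holds selected fields of the aggregate "cpu" line in /proc/stat:
// time spent in user, system, and idle, in clock ticks, plus the total.
type cpuTimes struct {
    user, system, idle, total uint64
}

// readCPUTimes parses the aggregate cpu line from /proc/stat (Linux only).
func readCPUTimes() (cpuTimes, error) {
    data, err := os.ReadFile("/proc/stat")
    if err != nil {
        return cpuTimes{}, err
    }
    fields := strings.Fields(strings.SplitN(string(data), "\n", 2)[0])
    var t cpuTimes
    for i, f := range fields[1:] {
        v, _ := strconv.ParseUint(f, 10, 64)
        t.total += v
        switch i {
        case 0:
            t.user = v
        case 2:
            t.system = v
        case 3:
            t.idle = v
        }
    }
    return t, nil
}

func main() {
    a, err := readCPUTimes()
    if err != nil {
        fmt.Println("could not read /proc/stat:", err)
        return
    }
    time.Sleep(time.Second)
    b, _ := readCPUTimes()

    total := float64(b.total - a.total)
    fmt.Printf("user %.1f%% system %.1f%% idle %.1f%%\n",
        100*float64(b.user-a.user)/total,
        100*float64(b.system-a.system)/total,
        100*float64(b.idle-a.idle)/total)
}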

Tom Wilkie introduced the RED method a few years ago. With RED, we monitor request rate, request errors, and request duration. The Google SRE book talks about using latency, traffic, errors, and saturation metrics. These “four golden signals” are targeted at service health; they are similar to the RED method, but extend it with saturation. In practice, it can be difficult to quantify service saturation.

So, how are we monitoring the containers? Containers are short lived entities. Monitoring them directly to discern our service health presents a number of complex problems, such as the high cardinality issue. It is easier and more effective to monitor the service outputs of those containers in aggregate. We don’t care if one container is misbehaving if the service is healthy. Chances are that our orchestration framework will reap that container anyway and replace it with a new one.

Let’s consider how best to integrate SLIs from Istio as part of a RED dashboard. To compose our RED dashboard, let’s look at what telemetry is provided by Istio:

  • Request Count by Response Code
  • Request Duration
  • Request Size
  • Response Size
  • Connection Received Bytes
  • Connection Sent Bytes
  • Connection Duration
  • Template Based Metadata (Metric Tags)

Istio provides several metrics about the requests it receives, the latency to generate a response, and connection level data. Note the first two items from the list above; we’ll want to include them in our RED dashboard.

Istio also gives us the ability to add metric tags, which it calls dimensions. So we can break down the telemetry by host, cluster, etc. We can get the rate in requests per second by taking the first derivative of the request count. We can get the error rate by taking the derivative of the request count of unsuccessful requests. Istio also provides us with the request latency of each request, so we can record how long each service request took to complete.
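
As a rough sketch of that arithmetic (the sample values and type names below are hypothetical, not the actual Istio metric names), the request rate is just the change in a monotonically increasing counter divided by the elapsed time, and the error rate is the same calculation applied to the count of unsuccessful requests:

package main

import (
    "fmt"
    "time"
)

// counterSample is one observation of a monotonically increasing counter,
// such as a request count broken down by response code.
type counterSample struct {
    at    time.Time
    value float64
}

// rate computes the first derivative of a counter between two samples,
// i.e. requests (or errors) per second.
func rate(prev, curr counterSample) float64 {
    elapsed := curr.at.Sub(prev.at).Seconds()
    if elapsed <= 0 {
        return 0
    }
    return (curr.value - prev.value) / elapsed
}

func main() {
    t0 := time.Now()
    t1 := t0.Add(10 * time.Second)

    // Total requests and 5xx responses observed 10 seconds apart.
    reqRate := rate(counterSample{t0, 120000}, counterSample{t1, 121500})
    errRate := rate(counterSample{t0, 310}, counterSample{t1, 325})

    fmt.Printf("request rate: %.1f req/s, error rate: %.1f err/s\n", reqRate, errRate)
}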

In addition, Istio provides us with a Grafana dashboard out of the box that contains the pieces we want:

The components we want are circled in red in the screenshot above. We have the request rate in operations per second in the upper left, the number of failed requests per second in the upper right, and a graph of response time in the bottom. There are several other indicators on this graph, but let’s take a closer look at the ones we’ve circled:

The above screenshot shows the rate component of the dashboard. This is pretty straightforward: we count the number of requests that returned a 200 response code and graph the rate over time.

The Istio dashboard does something similar for responses that return a 5xx error code. In the above screenshot, you can see how it breaks down the errors by either the ingress controller, or by errors from the application product page itself.

This screenshot shows the request duration graph. This graph is the most informative about the health of our service. This data is provided by a Prometheus monitoring system, so we see request time percentiles graphed here, including the median, 90th, 95th, and 99th percentiles.

These percentiles give us some overall indication of how the service is performing. However, there are a number of deficiencies with this approach that are worth examining. During times of low activity, these percentiles can skew wildly because of limited numbers of samples. This could mislead you about the system performance in those situations. Let’s take a look at the other issues that can arise with this approach:

Duration Problems:

  • The percentiles are aggregated metrics over fixed time windows.
  • The percentiles cannot be re-aggregated for cluster health.
  • The percentiles cannot be averaged (and this is a common mistake).
  • This method stores aggregates as outputs, not inputs.
  • It is difficult to measure cluster SLIs with this method.

Percentiles often provide deeper insight than averages because they express the range of values with multiple data points instead of just one. But like averages, percentiles are aggregated metrics. They are calculated over a fixed time window for a fixed data set. If we calculate a duration percentile for one cluster member, we cannot merge it with the percentile from another member to get an aggregate performance metric for the whole cluster.

It is a common misconception that percentiles can be averaged; they cannot, except in rare cases where the distributions that generated them are nearly identical. If you only have the percentiles, and not the source data, you cannot know whether that is the case. It is a chicken-and-egg problem.
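
Here is a quick numeric sketch of why averaging percentiles goes wrong: two cluster members with very different latency profiles can have individual 95th percentiles whose average looks nothing like the true 95th percentile of the combined traffic (the numbers below are made up for illustration):

package main

import (
    "fmt"
    "sort"
)

// p95 returns the 95th percentile of a sample set (nearest-rank method).
func p95(ms []float64) float64 {
    sorted := append([]float64(nil), ms...)
    sort.Float64s(sorted)
    return sorted[int(0.95*float64(len(sorted)-1))]
}

func main() {
    // Member A serves mostly fast cache hits; member B mostly disk reads.
    a := []float64{5, 6, 7, 8, 9, 10, 11, 12, 13, 400} // mostly fast, one outlier
    b := []float64{200, 210, 220, 230, 240, 250, 260, 270, 280, 290}

    avgOfP95s := (p95(a) + p95(b)) / 2
    trueP95 := p95(append(append([]float64(nil), a...), b...))

    // The two numbers differ substantially; only the merged
    // distribution yields the correct cluster-wide percentile.
    fmt.Printf("average of p95s: %.0f ms, true cluster p95: %.0f ms\n", avgOfP95s, trueP95)
}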

This also means that if you are measuring percentile based performance only for individual cluster members, you cannot set service level indicators for the entire service, because those percentiles cannot be merged.

Our ability to set meaningful SLIs (and as a result, meaningful SLOs) is limited here, due to only having four latency data points over fixed time windows. So when you are working with percentile based duration metrics, you have to ask yourself if your SLIs are really good SLIs. We can do better by using math to determine the SLIs that we need to give us a comprehensive view of our service’s performance and health.

Histogram Telemetry

Above is a visualization of latency data for a service, in microseconds, using a histogram. The number of samples is on the Y-axis, and the sample value, in this case microsecond latency, is on the X-axis. This is the open source histogram we developed at Circonus. (See the open source implementations in both C and Go, or read more about histograms here.) There are a few other open source histogram implementations out there, such as Ted Dunning’s t-digest and the HDR histogram.

The Envoy project recently adopted the C implementation of Circonus’s log linear histogram library. This allows Envoy data to be collected as distributions. They found a very minor bug in the implementation, which Circonus was quite happy to fix. That’s the beauty of open source: the more eyes on the code, the better it gets over time.

Histograms are mergeable. Any two or more histograms can be merged as long as the bin boundaries are the same. That means that we can take this distribution and combine it with other distributions. Mergeable metrics are great for monitoring and observability. They allow us to combine outputs from similar sources, such as service members, and get aggregate service metrics.
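
As a rough sketch of the idea (a toy implementation, not the actual Circonus or Envoy library API), a log linear histogram can be represented as a map of bin lower bounds to sample counts. Merging two histograms is just summing the counts of matching bins, and quantiles can then be read off the merged distribution:

package main

import (
    "fmt"
    "math"
    "sort"
)

// Hist is a toy log linear histogram: sample counts keyed by bin lower bound.
type Hist map[float64]uint64

// binLower returns the lower bound of the log linear bin containing v:
// bins keep two significant digits, giving 90 bins per power of ten.
func binLower(v float64) float64 {
    if v <= 0 {
        return 0
    }
    scale := math.Pow(10, math.Floor(math.Log10(v))-1)
    return math.Floor(v/scale) * scale
}

// Record adds one sample to the histogram.
func (h Hist) Record(v float64) { h[binLower(v)]++ }

// Merge folds another histogram into this one. Because the bin boundaries
// are defined identically everywhere, merging is just adding counts.
func (h Hist) Merge(other Hist) {
    for b, c := range other {
        h[b] += c
    }
}

// Quantile returns an approximate value at quantile q (0..1), using each
// bin's lower bound as its representative value.
func (h Hist) Quantile(q float64) float64 {
    type bc struct {
        bound float64
        count uint64
    }
    bins := make([]bc, 0, len(h))
    var total uint64
    for b, c := range h {
        bins = append(bins, bc{b, c})
        total += c
    }
    sort.Slice(bins, func(i, j int) bool { return bins[i].bound < bins[j].bound })
    target := uint64(math.Ceil(q * float64(total)))
    var seen uint64
    for _, b := range bins {
        seen += b.count
        if seen >= target {
            return b.bound
        }
    }
    return 0
}

func main() {
    // Latency histograms (in microseconds) from two members of one service.
    a, b := Hist{}, Hist{}
    for _, v := range []float64{110, 230, 260, 900, 1200} {
        a.Record(v)
    }
    for _, v := range []float64{95, 310, 450, 7000} {
        b.Record(v)
    }
    a.Merge(b) // aggregate, service wide distribution
    fmt.Printf("service p90 is roughly %.0f microseconds\n", a.Quantile(0.90))
}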

As indicated in the image above, this log linear histogram contains 90 bins for each power of 10. You can see 90 bins between 100,000 and 1M. At each power of 10, the bin size increases by a factor of 10. This allows us to record a wide range of values with high relative accuracy without needing to know the data distribution ahead of time. Let’s see what this looks like when we overlay some percentiles:

Now you can see where we have the average, and the 50th percentile (also known as the median), and the 90th percentile. The 90th percentile is the value at which 90% of the samples are below that value.

Consider our example SLI from earlier. With latency data displayed in this format, we can easily calculate that SLI for a service by merging histograms together to get a 5 minute view of the data, and then calculating the 95th percentile value for that distribution. If it is less than 300 milliseconds, we met our target.

The RED dashboard duration graph from our screenshot above has four percentiles: the 50th, 90th, 95th, and 99th. We could overlay those percentiles on this distribution as well. Even without the underlying data, we could guess at the rough shape of the request distribution from those four percentiles, but that would require making a lot of assumptions. To see just how misleading assumptions based on just a few percentiles can be, let’s look at a distribution with additional modes:

This histogram shows a distribution with two distinct modes. The leftmost mode could be fast responses served from a cache, and the rightmost mode responses served from disk. Measuring latency with only four percentiles would make it nearly impossible to discern a distribution like this. This gives us a sense of the complexity that percentiles can mask. Consider a distribution with more than two modes:

This distribution has at least four visible modes. If we do the math on the full distribution, we will find 20+ modes here. How many percentiles would you need to record to approximate a latency distribution like the one above? What about a distribution like the one below?

Complex systems composed of many services will generate latency distributions that cannot be accurately represented by a handful of percentiles. You have to record the entire latency distribution to be able to fully represent it. This is one reason it is preferable to store the complete distributions of the data in histograms and calculate percentiles as needed, rather than storing just a few percentiles.

This type of histogram visualization shows a distribution over a fixed time window. We can store multiple distributions to get a sense of how it changes over time, as shown below:

This is a heatmap, which represents a set of histograms over time. Imagine each column in this heatmap as a separate histogram viewed from above, with color used to indicate the height of each bin. This is a Grafana visualization of the response latency from a cluster of 10 load balancers. It gives us deep insight into the system behavior of the entire cluster over a week; there are over 1 million data samples here. The median centers around 500 microseconds, shown in the red colored bands.

Above is another type of heatmap. Here, saturation is used to indicate the “height” of each bin (the darker tiles are more “full”). Also, this time we’ve overlaid percentile calculations over time on top of the heatmap. Percentiles are robust metrics and very useful, but not by themselves. We can see here how the 90th and higher percentiles increase as the latency distribution shifts upwards.

Let’s take these distribution based duration maps and see if we can generate something more informative than the sample Istio dashboard:

The above screenshot is a RED dashboard revised to show distribution based latency data. In the lower left, we show a histogram of latencies over a fixed time window. To the right of it, we use a heat map to break that distribution down into smaller time windows. With this layout of RED dashboard, we can get a complete view of how our service is behaving with only a few panels of information. This particular dashboard was implemented using Grafana served from an IRONdb time series database which stores the latency data natively as log linear histograms.

We can extend this RED dashboard a bit further and overlay our SLIs onto the graphs as well:

For the rate panel, our SLI might be to maintain a minimum level of requests per second. For the error panel, our SLI might be to stay under a certain number of errors per second. And, as we examined previously with duration SLIs, we might want the 99th percentile for our entire service, which is composed of several pods, to stay under a certain latency over a fixed window. Using Istio telemetry stored as histograms enables us to set these meaningful, service wide SLIs. Now we have a lot more to work with, and we’re better able to interrogate our data (see below).

Asking the Right Questions

So now that we’ve put the pieces together and have seen how to use Istio to get meaningful data from our services, let’s see what kinds of questions we can answer with it.

We all love being able to solve technical problems, but not everyone has that same focus. The folks on the business side want to know how the business is doing, so you need to be able to answer business-centric questions. Let’s take the toolset we’ve assembled and align its capabilities with a couple of questions that the business might ask its SREs:

Example Questions:

  • How many users got angry on the Tuesday slowdown after the big marketing promotion?
  • Are we over-provisioned or under-provisioned on our purchasing checkout service?

Consider the first example. Everyone has been through a big slowdown. Let’s say Marketing did a big push, traffic went up, performance went down, and users complained that the site got slow. How can we quantify how slow it was, and for whom? How many users got angry? Let’s say that Marketing wants to know this so they can send a 10% discount email to the users affected, and also because they want to avoid a recurrence of the same problem. Let’s craft an SLI and assume that users noticed the slowdown and got angry if requests took more than 500 milliseconds. How can we calculate how many users got angry with this SLI of 500 ms?

First, we need to already be recording the request latencies as a distribution. Then we can plot them as a heatmap. We can use the distribution data to calculate the percentage of requests that exceeded our 500ms SLI by using inverse percentiles. We take that answer, multiply it by the total number of requests in that time window, and integrate over time. Then we can plot the result overlayed on the heatmap:
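
Here is a minimal sketch of that inverse percentile arithmetic (the bin layout and request counts are hypothetical): for each time window, count the portion of the latency distribution above the 500ms threshold, then accumulate those counts across windows:

package main

import "fmt"

// bin is one histogram bin from a latency distribution: an upper bound in
// milliseconds and the number of requests that fell into that bin.
type bin struct {
    upperMs float64
    count   float64
}

// violations returns the number of requests in one time window whose latency
// exceeded the SLI threshold: the inverse percentile at the threshold
// multiplied by the total request count for the window.
func violations(window []bin, thresholdMs float64) float64 {
    var above, total float64
    for _, b := range window {
        total += b.count
        if b.upperMs > thresholdMs {
            above += b.count
        }
    }
    if total == 0 {
        return 0
    }
    fractionAbove := above / total // the inverse percentile at the threshold
    return fractionAbove * total
}

func main() {
    // Three one-minute windows during the slowdown (hypothetical data).
    windows := [][]bin{
        {{100, 9000}, {250, 700}, {500, 200}, {1000, 100}},
        {{100, 2000}, {250, 1500}, {500, 3000}, {1000, 3500}},
        {{100, 4000}, {250, 2500}, {500, 1500}, {1000, 2000}},
    }

    var affected float64
    for _, w := range windows {
        affected += violations(w, 500)
    }
    fmt.Printf("requests over the 500ms SLI: %.0f\n", affected)
}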

In this screenshot, we’ve circled the part of the heatmap where the slowdown occurred. The increased latency distribution is fairly indicative of a slowdown. The line on the graph indicates the total number of requests affected over time.

In this example, we managed to miss our SLI for 4 million requests. Whoops. What isn’t obvious are the two additional slowdowns on the right because they are smaller in magnitude. Each of those cost us an additional 2 million SLI violations. Ouch.

We can do these kinds of mathematical analyses because we are storing data as distributions, not aggregations like percentiles.

Let’s consider another common question. Is my service under provisioned, or over provisioned?

The answer is often “it depends.” Loads vary based on the time of day and the day of week, in addition to varying because of special events. That’s before we even consider how the system behaves under load. Let’s put some math to work and use latency bands to visualize how our system can perform:

The visualization above shows the latency distribution broken down into latency bands over time. The bands here show the number of requests that took under 25ms, between 25 and 100ms, 100-250ms, 250-1000ms, and over 1000ms. The colors range from green for fast requests to red for slow requests.
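
Here is a rough sketch of how such bands can be derived from raw latencies (the band boundaries are the ones described above; everything else is illustrative):

package main

import (
    "fmt"
    "time"
)

// bandCounts buckets request latencies into the fixed bands used in the
// visualization: <25ms, 25-100ms, 100-250ms, 250-1000ms, >1000ms.
func bandCounts(latencies []time.Duration) map[string]int {
    bands := map[string]int{}
    for _, l := range latencies {
        switch {
        case l < 25*time.Millisecond:
            bands["<25ms"]++
        case l < 100*time.Millisecond:
            bands["25-100ms"]++
        case l < 250*time.Millisecond:
            bands["100-250ms"]++
        case l < time.Second:
            bands["250-1000ms"]++
        default:
            bands[">1000ms"]++
        }
    }
    return bands
}

func main() {
    sample := []time.Duration{
        8 * time.Millisecond, 40 * time.Millisecond, 90 * time.Millisecond,
        180 * time.Millisecond, 600 * time.Millisecond, 1500 * time.Millisecond,
    }
    // One such map per time slice, stacked over time, yields the band chart.
    fmt.Println(bandCounts(sample))
}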

What does this visualization tell us? It shows that requests to our service started off very quickly, then the percentage of fast requests dropped off after a few minutes, and the percentage of slow requests increased after about 10 minutes. This pattern repeated itself for two traffic sessions. What does that tell us about provisioning? It suggests that initially the service was over provisioned, but then became under provisioned over the course of 10-20 minutes. Sounds like a good candidate for auto-scaling.

We can also add this type of visualization to our RED dashboard. This type of data is excellent for business stakeholders, and it doesn’t require a lot of technical knowledge to understand the impact on the business.

Conclusion

We should monitor services, not containers. Services are long lived entities; containers are not. Your users don’t care how your containers are performing; they care about how your services are performing.

You should record distributions instead of aggregates, and then generate your aggregates from those distributions. Aggregates are very valuable sources of information, but they are unmergeable, and so they are not well suited to further statistical analysis.

Istio gives you a lot of stuff for free. You don’t have to instrument your code either. You don’t need to go and build a high quality application framework from scratch.

Use math to ask and answer questions about your services that are important to the business. That’s what this is all about, right? When we can make systems reliable by answering questions that the business values, we achieve the goals of the organization.

The Circonus Istio Mixer Adapter

Here at Circonus, we have a long heritage of open source software involvement. So when we saw that Istio provided a well designed interface to syndicate service telemetry via adapters, we knew that a Circonus adapter would be a natural fit. Istio has been designed to provide a highly performant, highly scalable application control plane, and Circonus has been designed with performance and scalability as core principles.

Today we are happy to announce the availability of the Circonus adapter for the Istio service mesh. This blog post will go over the development of this adapter, and show you how to get up and running with it quickly. We know you’ll be excited about this, because Kubernetes and Istio give you the ability to scale to the level that Circonus was engineered to perform at, above other telemetry solutions.

If you don’t know what a service mesh is, you aren’t alone, but odds are you have been using one for years. The routing infrastructure of the Internet is a service mesh; it facilitates TCP retransmission, access control, dynamic routing, traffic shaping, etc. The monolithic applications that have dominated the web are giving way to applications composed of microservices. Istio provides control plane functionality for container based distributed applications via a sidecar proxy. It provides the service operator with a rich set of functionality to control a Kubernetes orchestrated set of services, without requiring the services themselves to implement any control plane feature sets.

Istio’s Mixer provides an adapter model which allowed us to develop an adapter by creating handlers for interfacing Mixer with external infrastructure backends. Mixer also provides a set of templates, each of which expose different sets of metadata that can be provided to the adapter. In the case of a metrics adapter such as the Circonus adapter, this metadata includes metrics like request duration, request count, request payload size, and response payload size. To activate the Circonus adapter in an Istio-enabled Kubernetes cluster, simply use the istioctl command to inject the Circonus operator configuration into the K8s cluster, and the metrics will start flowing.

Here’s an architectural overview of how Mixer interacts with these external backends:

Istio also contains metrics adapters for StatsD and Prometheus. However, a few things differentiate the Circonus adapter from those other adapters. First, the Circonus adapter allows us to collect the request durations as a histogram, instead of just recording fixed percentiles. This allows us to calculate any quantile over arbitrary time windows, and perform statistical analyses on the histogram which is collected. Second, data can be retained essentially forever. Third, the telemetry data is retained in a durable environment, outside the blast radius of any of the ephemeral assets managed by Kubernetes.

Let’s take a look at the guts of how data gets from Istio into Circonus. Istio’s adapter framework exposes a number of methods which are available to adapter developers. The HandleMetric() method is called for a set of metric instances generated from each request that Istio handles. In our operator configuration, we can specify the metrics that we want to act on, and their types:

spec:
  # HTTPTrap url, replace this with your account submission url
  submission_url: "https://trap.noit.circonus.net/module/httptrap/myuuid/mysecret"
  submission_interval: "10s"
  metrics:
  - name: requestcount.metric.istio-system
    type: COUNTER
  - name: requestduration.metric.istio-system
    type: DISTRIBUTION
  - name: requestsize.metric.istio-system
    type: GAUGE
  - name: responsesize.metric.istio-system
    type: GAUGE

Here we configure the Circonus handler with a submission URL for an HTTPTrap check and an interval at which to send metrics. In this example, we specify four metrics to gather, and their types. Notice that we collect the requestduration metric as a DISTRIBUTION type, which will be processed as a histogram in Circonus. This retains fidelity over time, as opposed to averaging that metric, or calculating a percentile before recording the value (both of those techniques lose the value of the signal).

The HandleMetric() method is called for each request that Istio handles, with instances of the metrics we have specified. Let’s take a look at the code:

// HandleMetric submits metrics to Circonus via circonus-gometrics
func (h *handler) HandleMetric(ctx context.Context, insts []*metric.Instance) error {

    for _, inst := range insts {

        metricName := inst.Name
        metricType := h.metrics[metricName]

        switch metricType {

        case config.GAUGE:
            value, _ := inst.Value.(int64)
            h.cm.Gauge(metricName, value)

        case config.COUNTER:
            h.cm.Increment(metricName)

        case config.DISTRIBUTION:
            value, _ := inst.Value.(time.Duration)
            h.cm.Timing(metricName, float64(value))
        }

    }
    return nil
}

Here we can see that HandleMetric() is called with a Mixer context, and a set of metric instances. We iterate over each instance, determine its type, and call the appropriate circonus-gometrics method. The metric handler contains a circonus-gometrics object which makes submitting the actual metric trivial to implement in this framework. Setting up the handler is a bit more complex, but still not rocket science:

// Build constructs a circonus-gometrics instance and sets up the handler
func (b *builder) Build(ctx context.Context, env adapter.Env) (adapter.Handler, error) {

    bridge := &logToEnvLogger{env: env}

    cmc := &cgm.Config{
        CheckManager: checkmgr.Config{
            Check: checkmgr.CheckConfig{
                SubmissionURL: b.adpCfg.SubmissionUrl,
            },
        },
        Log:      log.New(bridge, "", 0),
        Debug:    true, // enable [DEBUG] level logging for env.Logger
        Interval: "0s", // flush via ScheduleDaemon based ticker
    }

    cm, err := cgm.NewCirconusMetrics(cmc)
    if err != nil {
        err = env.Logger().Errorf("Could not create NewCirconusMetrics: %v", err)
        return nil, err
    }

    // create a context with cancel based on the istio context
    adapterContext, adapterCancel := context.WithCancel(ctx)

    env.ScheduleDaemon(
        func() {

            ticker := time.NewTicker(b.adpCfg.SubmissionInterval)

            for {
                select {
                case <-ticker.C:
                    cm.Flush()
                case <-adapterContext.Done():
                    ticker.Stop()
                    cm.Flush()
                    return
                }
            }
        })
    metrics := make(map[string]config.Params_MetricInfo_Type)
    ac := b.adpCfg
    for _, adpMetric := range ac.Metrics {
        metrics[adpMetric.Name] = adpMetric.Type
    }
    return &handler{cm: cm, env: env, metrics: metrics, cancel: adapterCancel}, nil
}

Mixer provides a builder type on which we define the Build method. Again, a Mixer context is passed, along with an environment object representing Mixer’s configuration. We create a new circonus-gometrics object, and deliberately disable automatic metrics flushing. We do this because Mixer requires us to wrap all goroutines in its panic handler with the env.ScheduleDaemon() method. You’ll notice that we’ve created our own adapterContext via context.WithCancel. This allows us to shut down the metrics flushing goroutine by calling h.cancel() in the Close() method handler provided by Mixer. We also want to send any log events from CGM (circonus-gometrics) to Mixer’s log. Mixer provides an env.Logger() interface which is based on glog, but CGM uses the standard Golang logger. How did we resolve this impedance mismatch? With a logger bridge: any logging statements that CGM generates are passed to Mixer.

// logToEnvLogger converts CGM log package writes to env.Logger()
func (b logToEnvLogger) Write(msg []byte) (int, error) {
    if bytes.HasPrefix(msg, []byte("[ERROR]")) {
        b.env.Logger().Errorf(string(msg))
    } else if bytes.HasPrefix(msg, []byte("[WARN]")) {
        b.env.Logger().Warningf(string(msg))
    } else if bytes.HasPrefix(msg, []byte("[DEBUG]")) {
        b.env.Logger().Infof(string(msg))
    } else {
        b.env.Logger().Infof(string(msg))
    }
    return len(msg), nil
}

For the full adapter codebase, see the Istio GitHub repo here.

Enough with the theory though, let’s see what this thing looks like in action. I set up a Google Kubernetes Engine deployment, loaded a development version of Istio with the Circonus adapter, and deployed the sample BookInfo application that is provided with Istio. The image below is a heatmap of the distribution of request durations from requests made to the application. You’ll notice the histogram overlay for the time slice highlighted. I added an overlay showing the median, 90th, and 95th percentile response times; it’s possible to generate these at arbitrary quantiles and time windows because we store the data natively as log linear histograms. Notice that the median and 90th percentile are relatively fixed, while the 95th percentile tends to fluctuate over a range of a few hundred milliseconds. This type of deep observability can be used to quantify the performance of Istio itself across versions as it continues its rapid growth. Or, more likely, it will be used to identify issues within the application deployed. If your 95th percentile isn’t meeting your internal Service Level Objectives (SLOs), that’s a good sign you have some work to do. After all, if 1 in 20 users is having a sub-par experience on your application, don’t you want to know about it?

That looks like fun, so let’s lay out how to get this stuff up and running. First thing we’ll need is a Kubernetes cluster. Google Kubernetes Engine provides an easy way to get a cluster up quickly.

There are a few other ways documented in the Istio docs if you don’t want to use GKE, but these are the notes I used to get up and running. I used the gcloud command line utility as follows after deploying the cluster in the web UI.

# set your zones and region
$ gcloud config set compute/zone us-west1-a
$ gcloud config set compute/region us-west1

# create the cluster
$ gcloud alpha container clusters create istio-testing --num-nodes=4

# get the credentials and put them in kubeconfig
$ gcloud container clusters get-credentials istio-testing --zone us-west1-a --project istio-circonus

# grant cluster admin permissions
$ kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value core/account)

Poof, you have a Kubernetes cluster. Let’s install Istio (refer to the Istio docs for details):

# grab Istio and setup your PATH
$ curl -L https://git.io/getLatestIstio | sh -
$ cd istio-0.2.12
$ export PATH=$PWD/bin:$PATH

# now install Istio
$ kubectl apply -f install/kubernetes/istio.yaml

# wait for the services to come up
$ kubectl get svc -n istio-system

Now set up the sample BookInfo application:

# Assuming you are using manual sidecar injection, use `kube-inject`
$ kubectl apply -f <(istioctl kube-inject -f samples/bookinfo/kube/bookinfo.yaml)

# wait for the services to come up
$ kubectl get services

If you are on GKE, you’ll need to set up the gateway and firewall rules:

# get the worker address
$ kubectl get ingress -o wide

# get the gateway url
$ export GATEWAY_URL=<workerNodeAddress>:$(kubectl get svc istio-ingress -n istio-system -o jsonpath='{.spec.ports[0].nodePort}')

# add the firewall rule
$ gcloud compute firewall-rules create allow-book --allow tcp:$(kubectl get svc istio-ingress -n istio-system -o jsonpath='{.spec.ports[0].nodePort}')

# hit the url - for some reason GATEWAY_URL is on an ephemeral port, use port 80 instead
$ wget http://<workerNodeAddress>/productpage

The sample application should be up and running. If you are using Istio 0.3 or earlier, you’ll need to install the Docker image we built with the Circonus adapter embedded.

Load the Circonus resource definition (you only need to do this with Istio 0.3 or earlier). Save this content as circonus_crd.yaml:

kind: CustomResourceDefinition
apiVersion: apiextensions.k8s.io/v1beta1
metadata:
  name: circonuses.config.istio.io
  labels:
    package: circonus
    istio: mixer-adapter
spec:
  group: config.istio.io
  names:
    kind: circonus
    plural: circonuses
    singular: circonus
  scope: Namespaced
  version: v1alpha2

Now apply it:

$ kubectl apply -f circonus_crd.yaml

Edit the Istio deployment to pull in the Docker image with the Circonus adapter built in (again, not needed if you’re using Istio v0.4 or greater):

$ kubectl -n istio-system edit deployment istio-mixer

Change the image for the Mixer binary to use the istio-circonus image:

image: gcr.io/istio-circonus/mixer_debug
imagePullPolicy: IfNotPresent
name: mixer

Ok, we’re almost there. Grab a copy of the operator configuration, and insert your HTTPTrap submission URL into it. You’ll need a Circonus account to get that; just sign up for a free account if you don’t have one and create an HTTPTrap check.

Now apply your operator configuration:

$ istioctl create  -f circonus_config.yaml

Make a few requests to the application, and you should see the metrics flowing into your Circonus dashboard! If you run into any problems, feel free to contact us on the Circonus Labs Slack, or reach out to me directly on Twitter at @phredmoyer.

This was a fun integration; Istio is definitely on the leading edge of Kubernetes, but it has matured significantly over the past few months and should be considered ready to use to deploy new microservices. I’d like to extend thanks to some folks who helped out on this effort. Matt Maier is the maintainer of Circonus gometrics and was invaluable on integrating CGM within the Istio handler framework. Zack Butcher and Douglas Reid are hackers on the Istio project, and a few months ago gave an encouraging ‘send the PRs!’ nudge when I talked to them at a local meetup about Istio. Martin Taillefer gave great feedback and advice during the final stages of the Circonus adapter development. Shriram Rajagopalan gave a hand with the CircleCI testing to get things over the finish line. Finally, a huge thanks to the team at Circonus for sponsoring this work, and the Istio community for their welcoming culture that makes this type of innovation possible.

Learn More