A Guide To Service Level Objectives, Part 2: It All Adds Up

A simple primer on the complicated statistical analysis behind setting your Service Level Objectives.

This is the second in a multi-part series about Service Level Objectives. The first part can be found here.

Statistical analysis is a critical –  but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. However, we aim to provide enough information here to give you a basic understanding of the math behind a smart SLO – and why it’s so important that you get it right.

Auditable, measurable data is the cornerstone of setting and meeting your SLOs. As stated in part one, Availability and Quality of Service (QoS) are the indicators that help quantify what you’re delivering to your customers, via time quantum and/or transaction availability. The better data you have, the more accurate the analysis, and the more actionable insight you have to work with.  

So yes, it’s complicated. But understanding the importance of the math of SLOs doesn’t have to be.

Functions of SLO Analysis

SLO analysis is based on probability, the likelihood that an event will — or will not — take place. As such, it primarily uses two types of functions: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).

Simply put, the analysis behind determining your SLO is driven by the basic concept of probability.

For example, the PDF answers questions like "What is the probability that the next transaction will have a latency of X?" As the integral of the PDF, the CDF answers questions like "What's the probability that the next transaction will have a latency less than X?" or "What's the probability that the next transaction will have a latency greater than X?"

  • Probability Density Function (PDF). Input: any measurement. Output: the probability that a given sample of data will have the input measurement.
  • Cumulative Distribution Function (CDF). Input: any measurement. Output: the probability that X will take a value less than or equal to x.
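
As a quick illustration (not from the original analysis), here is a minimal sketch that evaluates a PDF and a CDF for a hypothetical gamma-shaped latency model; the shape and scale parameters are invented for the example.

# Minimal sketch: evaluating a PDF and CDF for a hypothetical latency model.
# The gamma shape/scale values here are made-up illustration parameters.
from scipy import stats

latency_model = stats.gamma(a=2.0, scale=25.0)   # hypothetical latency model, in ms

x = 50.0  # latency value of interest, in ms
print("PDF at %gms: %.4f" % (x, latency_model.pdf(x)))         # relative likelihood of a latency near 50ms
print("P(latency <= %gms): %.4f" % (x, latency_model.cdf(x)))  # CDF: probability of 50ms or less
print("P(latency >  %gms): %.4f" % (x, 1 - latency_model.cdf(x)))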

Percentiles and Quantiles

Before we get further into expressing these functions, let’s have a quick sidebar about percentiles vs. quantiles. Unfortunately, this is a simple concept that has gotten quite complicated.

A percentile is measured on a 0-100 scale, and expressed as a percentage. For example: the “99th percentile” means “as good or better than” 99% of the distribution.

A quantile is the same data, expressed on a 0-1 scale. So as a quantile, that “99th percentile” above would be expressed as “.99.”

That's basically it. While scientists prefer using percentiles, the only differences from a quantile are a decimal point and a percentage symbol. However, for SLO analysis, the quantile function is important because it is the inverse of the CDF we discussed earlier.

Remember, this is an overview of basic concepts to provide “top-level” understanding of the math behind a smart SLO.
For a deeper dive, check out David N. Blank-Edelman's book "Seeking SRE."

The Data Volume Factor

As any analyst will tell you, the sheer volume of data (or lack thereof) can dramatically impact your results, leading to uninformed insight, inaccurate reporting, and poor decisions. So, it’s imperative that you have enough data to support your analysis. For example, low volumes in the time quantum can produce incredibly misleading results if you don’t specify your SLOs well.

With large amounts of data, the error levels in quantile approximations tend to be low; the worst possible case, a single sample per bin with the sample value at the edge of the bin, can produce errors of about 5%. In practice, with log linear histograms, we tend to see data sets span roughly 300 bins, so sets that contain thousands of data points tend to provide sufficient data for accurate statistical analyses.

Inverse quantiles can also come into play. For example, consider defining an SLO such that our 99th percentile request latency completes within 200ms. At low sample volumes, this approach is likely to be meaningless: with only a dozen or so samples, the 99th percentile can be far out of band compared to the median. And the percentile and time quantum approach doesn't tell us how many samples exceeded that 200ms threshold.

We can use inverse percentiles to define an SLO that says we want 80 percent of our requests to be faster than that 200ms threshold. Alternatively, we can set our SLO as a fixed count within the assurance window; say, "I want fewer than 100 requests to exceed my 200ms threshold over a span of 10 minutes."
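
To make that concrete, here is a minimal sketch (the latency samples and window size are made up) that evaluates both styles of SLO over a single window.

# Sketch: evaluating the two inverse-quantile style SLOs described above.
def fraction_faster_than(samples_ms, threshold_ms):
    """Inverse percentile: fraction of samples at or below the threshold."""
    return sum(1 for s in samples_ms if s <= threshold_ms) / float(len(samples_ms))

def count_exceeding(samples_ms, threshold_ms):
    """Fixed-count SLO: how many samples in this window exceeded the threshold."""
    return sum(1 for s in samples_ms if s > threshold_ms)

window = [120, 95, 310, 180, 205, 150, 90, 800, 160, 140]  # 10 minutes of request latencies (ms), made up
print(fraction_faster_than(window, 200) >= 0.80)   # "80% of requests faster than 200ms"
print(count_exceeding(window, 200) < 100)          # "fewer than 100 requests over 200ms"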

The actual implementations can vary, so it is incumbent upon the implementer to choose one which suits their business needs appropriately.

Defining Formulas and Analysis

Based on the information you're trying to get, and your sample set, the next step is determining the right formulas or functions for analysis. For SLO-related data, most practitioners use open-source histogram libraries. There are many implementations out there, ranging from log-linear, to t-digest, to fixed bin. These libraries often provide functions to execute quantile calculations, inverse quantile calculations, bin counts, and other mathematical operations needed for statistical data analysis.

Some analysts use approximate histograms, such as t-digest. However, those implementations often exhibit double digit error rates near median values. With any histogram-based implementation, there will always be some level of error, but implementations such as log linear can generally minimize that error to well under 1%, particularly with large numbers of samples.
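
To see why log linear bins keep that error bounded, here is a small sketch of the binning idea itself; it mirrors the general two-significant-digit approach, not any particular library's API.

# Sketch of log-linear binning (not a specific library's API): each bin keeps
# two significant digits, giving 90 bins per power of ten.
import math

def bin_bounds(value):
    """Return the [lower, upper) bounds of the log-linear bin containing value."""
    exp = math.floor(math.log10(value))
    width = 10 ** (exp - 1)                  # bin width is 1/10th of the leading power of ten
    lower = math.floor(value / width) * width
    return lower, lower + width

lo, hi = bin_bounds(123456.0)                # -> (120000, 130000)
worst_case_error = (hi - lo) / 2.0 / lo      # ~4.2% here; at most ~5% at the bottom of a decade
print(lo, hi, worst_case_error)

A single sample sitting at the edge of a bin near the bottom of a decade gives the roughly 5% worst case mentioned earlier; with many samples spread across many bins, the practical quantile error drops well below 1%.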

Common Distributions in SLO Analysis

Once you’ve begun analysis, there are several different mathematical models you will use to describe the distribution of your measurement samples, or at least how you expect them to be distributed.

  • Normal distributions: The common “bell-curve” distribution often used to describe random variables whose distribution is not known.

Normal distribution histogram and curve fit

  • Gamma distributions: A two-parameter family of continuous probability distributions whose PDF and CDF are convenient to work with, and which often describe latency data well.

Gamma distribution histogram and curve fit

  • Pareto distributions: Most of the samples are concentrated near one end of the distribution. Often useful for describing how system resources are utilized.

 

Pareto distribution histogram and curve fit

In real life, our networks, systems, and computers are all complex entities, and you will almost never see something that perfectly fits any of these distributions. You may have spent a lot of time discussing normal distributions in Statistics 101, but you will probably never come across one as an SRE.

While you may often see distributions that resemble the Gamma or Pareto model, it’s highly unusual to see a distribution that’s a perfect fit.

Instead, most of your sample distributions will be a composition of different models, which is completely normal and expected. While a single-mode latency distribution is often well represented by a Gamma distribution, it is exceptionally rare to see a single latency distribution in isolation. What we usually see is multiple latency distributions "jammed together," which results in multi-modal distributions.

That could be the result of a few different common code paths (each with a different distribution), a few different types of clients each with a different usage pattern or network connection… Or both. So most of the latency distributions we’ll see in practice are actually a handful (and sometimes a bucket full) of different gamma-like distributions stacked atop each other. The point being, don’t worry too much about any specific model – it’s the actual data that’s important.

Histograms in SLO Analysis

A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.

 

Log Linear Bimodal Histogram

If for any reason your sample volume is on the low end, this is where the data volume issue we mentioned above could rear its ugly head and distort your results.

However, histograms are ideal for SLO analysis, or any high-frequency, high-volume data, because they allow us to store the complete distribution of data at scale. You can describe a histogram with between 3 and 10 bytes per bin, depending on the varbit encoding of 8 of those bytes, and compression reduces that further. That is an efficient approach to storing a large number of bounded sample values. So instead of storing a handful of quantiles, we can store the complete distribution of data and calculate arbitrary quantiles and inverse quantiles on demand, as well as apply more advanced modeling techniques.
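
As a sketch of what "quantiles on demand" means, here is how an arbitrary quantile can be estimated from nothing more than stored bin counts; the bin layout and counts below are made up.

# Sketch: estimating an arbitrary quantile from stored (bin_lower_bound, count)
# pairs instead of raw samples. Bin layout and counts are invented.
def quantile_from_bins(bins, q):
    """bins: sorted list of (lower_bound, count); q in [0, 1]. Returns the bin
    lower bound at which the cumulative count crosses the requested quantile."""
    total = sum(count for _, count in bins)
    target = q * total
    cumulative = 0
    for lower_bound, count in bins:
        cumulative += count
        if cumulative >= target:
            return lower_bound
    return bins[-1][0]

latency_bins = [(0, 120), (10, 830), (20, 400), (50, 90), (100, 25), (250, 5)]  # ms buckets
print(quantile_from_bins(latency_bins, 0.99))   # 99th percentile estimate -> 100 (ms bucket)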

We’ll dig deeper into histograms in part 3.

Conclusions

In summary, analysis plays a critical role in setting your Service Level Objectives, because raw data is just that — raw and unrefined. To put yourself in a good position when setting SLOs, you must:

  • Know the data you're analyzing. Choose data structures that are appropriate for your samples, ones that provide the needed precision and robustness for analysis, and be aware of the expected cardinality and distribution of your data set.
  • Understand how you're analyzing the data and reporting your results. Ensure your analyses are mathematically correct. Recognize whether your data fits known distributions, and the implications that arise from that.
  • Set realistic expectations for results. Your outputs are only as good as the data you provide as inputs. Aggregates are excellent tools but it is important to understand their limitations.
  • And always be sure that you have enough data to support the analysis. A 99th percentile calculated with a dozen samples will likely vary significantly from one with hundreds of samples. Outliers can exert great influence over aggregates on small sets of data, but larger data sets are robust and not as susceptible.

With each of those pieces in place, you’ll gain the insight you need to make the smartest decision possible.

That concludes the basic overview of SLO analysis. As mentioned above, part 3 will focus, in more detail, on how to use histograms in SLO analysis.

 

A Guide To Service Level Objectives, Part 1: SLOs & You

Four steps to ensure that you hit your targets – and learn from your successes.

This is the first in a multi-part series about Service Level Objectives. The second part can be found here.

Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals?

You can’t succeed at your goals without first identifying them
– AND answering “What does success look like?”

Your goals are more than high-level mission statements or an inspiring vision for your company. They must be quantified, measured, and reconciled, so you can compare the end result with the desired result.

For example, to promote system reliability we use Service Level Indicators (SLIs), set Service Level Objectives (SLOs1), and create Service Level Agreements (SLAs) to clarify goals and ensure that we’re on the same page as our customers. Below, we’ll define each of these terms and explain their relationships with each other, to help you identify, measure, and meet your goals.

Whether you’re a Site Reliability Engineer (SRE), developer, or executive, as a service provider you have a vested interest in (or responsibility for) ensuring system reliability. However, “system reliability” in and of itself can be a vague and subjective term that depends on the specific needs of the enterprise. So, SLOs are necessary because they define your Quality of Service (QoS) and reliability goals in concrete, measurable, objective terms.

But how do you determine fair and appropriate measures of success, and define these goals? We’ll look at four steps to get you there:

  1. Identify relevant SLIs
  2. Measure success with SLOs
  3. Agree to an SLA based on your defined SLOs
  4. Use gained insights to restart the process

Before we jump into the four steps, let’s make sure we’re on the same page by defining SLIs, SLOs, and SLAs.

So, What’s the Difference?

For the purposes of our discussion, let’s quickly differentiate between an SLI, an SLO, and an SLA. For example, if your broad goal is for your system to “…run faster,” then:

  • A Service Level Indicator is what we’ve chosen to measure progress towards our goal. E.g., “Latency of a request.”
  • A Service Level Objective is the stated objective of the SLI – what we’re trying to accomplish for either ourselves or the customer. E.g., “99.5% of requests will be completed in 5ms.”
  • A Service Level Agreement, generally speaking2, is a contract explicitly stating the consequences of failing to achieve your defined SLOs. E.g., “If 99% of your system requests aren’t completed in 5ms, you get a refund.”

Although most SLOs are defined in terms of what you provide to your customer, as a service provider you should also have separate internal SLOs that are defined between components within your architecture. For example, your storage system is relied upon by other components in your architecture for availability and performance, and these dependencies are similar to the promise represented by the SLOs within your SLA. We’ll call these internal SLOs out later in the discussion.

What Are We Measuring?: SLIs

Before you can build your SLOs, you must determine what it is you’re measuring. This will not only help define your objectives, but will also help set a baseline to measure against.

In general, SLIs help quantify the service that will be delivered to the customer — what will eventually become the SLO. These terms will vary depending on the nature of the service, but they tend to be defined in terms of either Quality of Service (QoS) or in terms of Availability.

Defining Availability and QoS

  • Availability means that your service is there if the consumer wants it. Either the service is up or it is down. That’s it.
  • Quality of Service (QoS) is usually related to the performance of service delivery (measured in latencies).

Availability and QoS tend to work best together. For example, picture a restaurant that’s always open, but has horrible food and service; or one that has great food and service but is only open for one hour, once a week. Neither is optimal. If you don’t balance these carefully in your SLA, you could either expose yourself to unnecessary risk or end up making a promise to your customer that effectively means nothing. The real path to success is in setting a higher standard and meeting it. Now, we’ll get into some common availability measurement strategies.

Traditionally, availability is measured by counting failures. That means the SLI for availability is the percentage of uptime or downtime. While you can use time quantum or transactions to define your SLAs, we’ve found that a combination works best.

Time quantum availability is measured by splitting your assurance window into pieces. If we split a day into minutes (1440), each minute represents a time quantum we could use to measure failure. A time quantum is marked as bad if any failures are detected, and your availability is then measured by dividing the number of good time quanta by the total number of time quanta. Simple enough, right?

The downside of this relatively simple approach is that it doesn’t accurately measure failure unless you have an even distribution of transactions throughout the day – and most services do not. You must also ensure that your time quantum is large enough to prevent a single bad transaction from ruining your objective. For example, a 0.001% error rate threshold makes no sense applied to less than 10k requests.

Transaction availability is measured using raw transactions: divide the count of all successful transactions by the count of all attempted transactions over the course of each window. This method:

  • Provides a much stronger guarantee for the customer than the time quantum method.
  • Helps service providers avoid being penalized for SLA violations caused by short periods of anomalous behavior that affect a tiny fraction of transactions.

However, this method only works if you can measure attempted transactions… which is actually impossible. If data doesn’t show up, how could we know if it was ever sent? We’re not offering the customer much peace of mind if the burden of proof is on them.

So, we combine these approaches by dividing the assurance window into time quanta and counting transactions within each quantum. We then use the transaction method to define part of our SLO, but we also mark any time quantum where transactions cannot be counted as failed, and incorporate that into our SLO as well. We're now able to compensate for the inherent weakness of each method.

For example, if we have 144 million transactions per day with a 99.9% uptime SLO, our combined method would give this service an SLO that defines 99.9% uptime something like this:

"The service will be available and process requests for at least 1439 out of 1440 minutes each day. Each minute, at least 99.9% of the attempted transactions will be processed. A given minute will be considered unavailable if a system outage prevents the number of attempted transactions during that minute from being measured, unless the system outage is outside of our control."

Using this example, we would violate this SLO if the system is down for 2 minutes (consecutive or non-consecutive) in a day, or if we fail more than 100 transactions in a minute (assuming 100,000 transactions per minute).
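
Here is a minimal sketch of how that combined check could be evaluated in code; the per-minute numbers are hypothetical and the thresholds mirror the example above.

# Sketch of the combined method: a minute is "bad" if its transaction success
# ratio falls below 99.9% or its transactions could not be counted at all.
def evaluate_day(minutes, per_minute_target=0.999, minutes_allowed_bad=1):
    """minutes: list of (attempted, succeeded) per minute, or None if unmeasurable."""
    bad = 0
    for minute in minutes:
        if minute is None:                       # outage prevented measurement
            bad += 1
            continue
        attempted, succeeded = minute
        if attempted > 0 and succeeded / attempted < per_minute_target:
            bad += 1
    return bad <= minutes_allowed_bad

ok_day  = [(100000, 100000)] * 1439 + [None]                     # one unmeasurable minute
bad_day = [(100000, 100000)] * 1438 + [(100000, 99850), None]    # a slow minute AND an outage minute
print(evaluate_day(ok_day))    # True: within the 1439-of-1440 budget
print(evaluate_day(bad_day))   # False: two bad minutes violate the SLO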

This way you’re covered, even if you don’t have consistent system use throughout the day, or can’t measure attempted transactions. However, your indicators often require more than just crunching numbers.

Remember, some indicators are more than calculations. We’re often too focused on performance criteria instead of user experience.

Looking back to the example from the “What’s the Difference” section, if we can guarantee latency below the liminal threshold for 99% of users, then improving that to 99.9% would obviously be better because it means fewer users are having a bad experience. That’s a better goal than just improving upon an SLI like retrieval speed. If retrieval speed is already 5 ms, would it be better if it were 20% faster? In many cases the end user may not even notice an improvement.

We could gain better insight by analyzing the inverse quantile of our retrieval speed SLI. The 99th quantile for latency just tells us how slow the experience is for the 99th percentile of users. But the inverse quantile tells us what percentage of user experiences meet or exceed our performance goal.

This example SLI graph shows the inverse quantile calculation of request latency, where our SLO specifies completion within 500 milliseconds. We’ll explore how this is derived and used in a later post.

Defining Your Goals: SLOs

Once you’ve decided on an SLI, an SLO is built around it. Generally, SLOs are used to set benchmarks for your goals. However, setting an SLO should be based on what’s cost-effective and mutually beneficial for your service and your customer. There is no universal, industry-standard set of SLOs. It’s a “case-by-case” decision based on data, what your service can provide and what your team can achieve.

That being said, how do you set your SLO? Knowing whether or not your system is up no longer cuts it. Modern customers expect fast service. High latencies will drive people away from your service almost as quickly as your service being unavailable. Therefore it’s highly probable that you won’t meet your SLO if your service isn’t fast enough.

Since “slow” is the new “down,” many speed-related
SLOs are defined using SLIs for service latency.

We track the latencies on our services to assess the success of both our external promises and our internal goals. For your success, be clear and realistic about what you’re agreeing to — and don’t lose sight of the fact that the customer is focused on “what’s in it for me.” You’re not just making promises, you’re showing commitment to your customer’s success.

For example, let’s say you’re guaranteeing that the 99th percentile of requests will be completed with latency of 200 milliseconds or less. You might then go further with your SLO and establish an additional internal goal that 80% of those requests will be completed in 5 milliseconds.

Next, you have to ask the hard question: “What’s the lowest quality and availability I can possibly provide and still provide exceptional service to users?” The spread between this service level and 100% perfect service is your budget for failure. The answer that’s right for you and your service should be based on an analysis of the underlying technical requirements and business objectives of the service.

Base your goals on data. As an industry, we too often select arbitrary SLOs.
There can be big differences between 99%, 99.9%, and 99.99%.
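
To make those differences concrete, here is a quick back-of-the-envelope sketch that converts availability targets into downtime budgets (the 30-day month is an assumption).

# Sketch: what a "number of nines" target means in allowed downtime.
MINUTES_PER_30_DAYS = 30 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.99, 0.999, 0.9999):
    budget = 1.0 - target
    print("%.2f%% -> %6.1f min/month, %7.1f min/year"
          % (target * 100, budget * MINUTES_PER_30_DAYS, budget * MINUTES_PER_YEAR))

Going from 99% to 99.99% shrinks the failure budget from roughly seven hours a month to under five minutes, which is why the choice should be driven by data rather than by how the number sounds.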

Setting an SLO is about setting the minimum viable service level that will still deliver acceptable quality to the consumer. It’s not necessarily the best you can do, it’s an objective of what you intend to deliver. To position yourself for success, this should always be the minimum viable objective, so that you can more easily accrue error budgets to spend on risk.

Agreeing to Success: The SLA

As you see, defining your objectives and determining the best way to measure against them requires a significant amount of effort. However, well-planned SLIs and SLOs make the SLA process smoother for you and your customer.

While commonly built on SLOs, the SLA is driven by two factors:
the promise of customer satisfaction, and the best service you can deliver.

The key to defining fair and mutually beneficial SLAs (and limiting your liability) is calculating a cost-effective balance between these two needs.

SLAs also tend to be defined by multiple, fixed time frames to balance risks. These time frames are called assurance windows. Generally, these windows will match your billing cycle, because these agreements define your refund policy.

Breaking promises can get expensive when an SLA is in place
– and that’s part of the point – if you don’t deliver, you don’t get paid.

As mentioned earlier, you should give yourself some breathing room by setting the minimum viable service level that will still deliver acceptable quality to the consumer. You’ve probably heard the advice “under-promise and over-deliver.” That’s because exceeding expectations is always better than the alternative. Using a tighter internal SLO than what you’ve committed to gives you a buffer to address issues before they become problems that are visible — and disappointing — to users. So, by “budgeting for failure” and building some margin for error into your objectives, you give yourself a safety net for when you introduce new features, load-test, or otherwise experiment to improve system performance.

Learn, Innovate, and Start Over

Your SLOs should reflect the ways you and your users expect your service to behave. Your SLIs should measure them accurately. And your SLA must make sense for you, your client, and your specific situation. Use all available data to avoid guesswork. Select goals that fit you, your team, your service, and your users. And:

  • Identify the SLIs that are relevant to your goals
  • Measure your goals precisely with SLOs
  • Agree to an SLA based on your defined SLOs
  • Use any gained insights to set new goals, improve, and innovate

Knowing how well you're meeting your goals allows you to budget for the risks inherent to innovation. If you're in danger of violating an SLA or falling short of your internal SLO, it's time to take fewer risks. On the other hand, if you're comfortably exceeding your goals, it's time to either set more ambitious ones, or to use that extra breathing room to take more risks. This enables you to deploy new features, innovate, and move faster!

That’s the overview. In part 2, we’ll take a closer look at the math used to set SLOs.

 

1Although SLO still seems to be the favored term at the time of this writing, the Information Technology Infrastructure Library (ITIL) v3 has deprecated “SLO” and replaced it with Service Level Target (SLT).
2There has been much debate as to whether an SLA is a collection of SLOs or simply an outward-facing SLO. Regardless, it is universally agreed that an SLA is a contract that defines the expected level of service and the consequences for not meeting it.

Air Quality Sensors and IoT Systems Monitoring

2017 was a bad year for fires in California. The Tubbs Fire in Sonoma County in October destroyed whole neighborhoods and sent toxic smoke south through most of the San Francisco Bay Area. The Air Quality Index (AQI) for parts of that area went up past the unhealthy level (101–150) to the hazardous level (301–500) at certain points during the fire. Once word got out that N99 dust masks were needed to keep the harmful particles out of the lungs, they became a common sight.

The EPA maintains the AirNow website, which displays the AQI for the entire US. The weather app on my Pixel phone conveniently displays a summary of the local air quality from the EPA's source. This was my go-to source for information about the outdoor air quality once the fires started. However, I started to notice that the observed air quality often didn't match what the app reported. I realized that the data shown in the app was often delayed by an hour or more, and the local air quality could change much more quickly than was reported by the AirNow resource.

Pixel Weather App air quality

 

I started to look into how often the data was updated, and where the sensors that collected it were located. Unfortunately, I wasn't able to find a lot of details. However, I did come across a link to PurpleAir.com while browsing a local news site. PurpleAir reports on the same air quality metrics as the EPA source, but uses sensors hosted by individuals. They have a network of over a thousand sensors across the planet, and the sensor density in the SF Bay Area is quite good, as can be seen from their map. Best of all, the data is reported in real time: one hour averages, twenty-four hour averages, particle counts, and so on. This made it easy to check the air quality reported in real time from a sensor close by, letting us know when it was okay to go out, and when things had gotten bad in our area.

PurpleAir.com map

 

I considered obtaining one of these sensors for myself at the time, but as the Tubbs fire faded, I stopped checking the site as often. However, shortly after the Thomas fire in December, I decided to purchase a PurpleAir PA-II sensor. The sensor was easy to set up. I connected it to my WiFi network, gave it a name, location, and some other metadata. It also allowed me to give the sensor a custom URL (KEY3) to PUT sensor data to. KEY4 allowed me to set a custom HTTP header.

Circonus allows you to post data to it as a JSON object using an HTTPTrap endpoint. I didn’t know what format the PA-II would use to send the data, but I thought this was a good guess. So I created an HTTPTrap, grabbed the data submission URL, and put it into the sensor configuration. About thirty seconds later, metrics started flowing into Circonus. PurpleAir shared a helpful document that described each of these data points.

HTTPTrap configuration

 

I wanted to create my own dashboard, but first I needed to understand the data. It turns out that AQI is calculated from the concentration of 2.5 micron and 10 micron particles in micrograms per cubic meter. I wasn’t able to find an equation that allows AQI calculation from these concentrations; it appears that AQI is linearly interpolated between different particulate concentration ranges.

In addition to PM2.5 and PM10, the PA-II provided a number of other particle measurements from its dual laser sensors; in particular, particles per deciliter for particles ranging from 0.3 to 10 microns. It seems that the small particles under 1 micron are exhaled, whereas the 1–10 micron particles are the ones that become lodged in the lungs. The PA-II sensor also provides other metrics such as humidity, barometric pressure, dewpoint, temperature, RSSI signal strength to the WiFi access point, and free heap memory. I put together a dashboard to track these metrics.

Temperature, Humidity, Air Pressure, Dewpoint

 

PM2.5 (2.5 micron) and PM10, micrograms per cubic meter. Log 20 scale. AQI levels added as horizontal lines.

 

PM1 micrograms per cubic meter, and 0.3/0.5/1.0/2.5 micron particles per deciliter

 

Free heap memory, WiFi signal strength, and 5.0/10.0 micron particles per deciliter

 

Now that I had a dashboard up and running, I could keep a good watch on local air quality, for my neighborhood specifically, in addition to some simple weather measurements. This was quite useful, but I wanted to get a handle on when the air quality started to go bad. So I created a rule to send an alert whenever the PM 2.5 count went over 12, from Good to Moderate.

Circonus rule

 

It took a bit of digging, but I was able to find the AQI breakpoints, which correlate air quality index values to PM 2.5 µg/m3. The relation between AQI and particle concentration isn't linear across the categories, so I couldn't apply a single formula to calculate the conversion directly. I settled on adding threshold lines to the graphs for each different AQI category. However, I was able to easily set alerts for each threshold for the particulate count boundaries.

AQI breakpoints
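
The breakpoint table implies a piecewise linear conversion within each category, so a rough sketch of that interpolation is possible; the breakpoint values below are the commonly published 24-hour PM2.5 breakpoints at the time and should be treated as illustrative assumptions rather than authoritative.

# Rough sketch of the piecewise-linear AQI conversion for PM2.5.
# Breakpoint values are assumed from the commonly published (pre-2024) table.
PM25_BREAKPOINTS = [
    # (conc_lo, conc_hi, aqi_lo, aqi_hi)
    (0.0,   12.0,    0,  50),    # Good
    (12.1,  35.4,   51, 100),    # Moderate
    (35.5,  55.4,  101, 150),    # Unhealthy for Sensitive Groups
    (55.5,  150.4, 151, 200),    # Unhealthy
    (150.5, 250.4, 201, 300),    # Very Unhealthy
    (250.5, 350.4, 301, 400),    # Hazardous
    (350.5, 500.4, 401, 500),    # Hazardous
]

def pm25_to_aqi(conc):
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    return None   # above the scale

print(pm25_to_aqi(7.25))   # an example PM2.5 concentration of 7.25 µg/m3 -> AQI 30 ("Good")

Strictly speaking, AQI is defined on a 24-hour average concentration, so applying this to instantaneous sensor readings is only an approximation.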

 

If the air quality changed, I got a text message. I created several rules with varying levels of severity, so that I could get an idea of how fast the air quality was changing.

At this point I had a pretty good setup; if the air got bad, I got a text message. The air sensor itself was pretty sensitive to local air quality fluctuations; if I fired up the meat smoker, I’d get an alert. Overall the system was fairly stable, but I did run into some issues where data wasn’t sent to the HTTPTrap at certain times. As a former WiFi firmware engineer, I decided to use tcpdump to look at the traffic directly. To do this, I had to get a host between the sensor and the internet, so I shared my iMac internet connection over WiFi and connected the air sensor to it. The air sensor has a basic web interface that you can use to specify the WiFi connection, and also get a real time readout of the laser air sensors.

Once I had the sensor bridged through my iMac I was able to take a look at the network traffic. The sensor used HTTP GET requests to update the PurpleAir map, and as I had specified in the configuration interface, PUT requests to the Circonus HTTPTrap. Oddly enough, things worked just fine when requests were being routed through the iMac. I came to the conclusion that the Airport Extreme that the sensor was normally associated with might be the source of the failed PUT requests at the TCP level somehow. This is something I need to put some more energy into at some point, but these types of network level issues can be tricky to debug.

me@myhost ~ $ sudo tcpdump -AvvvXX -i bridge100 dst host 192.168.2.2 or src host 192.168.2.2

10:58:11.781392 IP (tos 0x0, ttl 128, id 3301, offset 0, flags [none], proto TCP (6), length 1498)
192.168.2.2.csdmbase > 199.69.201.35.bc.googleusercontent.com.http: Flags [P.], cksum 0xf46e (correct), seq 1:1459, ack 1, win 5840, length 1458: HTTP, length: 1458
PUT /module/httptrap/xxxxxxxx-adf9-40ec-bcc8-yyyyy3194/xx HTTP/1.1
Host: trap.noit.circonus.net
X-PurpleAir: 1
Content-Type: application/json
User-Agent: PurpleAir/2.50i
Content-Length: 1231
Connection: close

{"SensorId":"5c:cf:7f:4b:f8:c4","DateTime":"2018/03/12T17:58:10z","Geo":"AirMonitor_f8c4","Mem":26536,"Id":156,"Adc":0.03,"lat":37.745113,"lon":-122.421211,"accuracy":152,"elevation":67.58,"version":"2.50i","uptime":4160,"rssi":-77,"hardwareversion":"2.0","hardwarediscovered":"2.0+BME280+PMSX003A+PMSX003B","current_temp_f":87,"current_humidity":26,"current_dewpoint_f":47.99,"pressure":1016.27,"pm1_0_atm_b":4.82,"pm2_5_atm_b":7.25,"pm10_0_atm_b":7.59,"pm1_0_cf_1_b":4.82,"pm2_5_cf_1_b":7.25,"pm10_0_cf_1_b":7.59,"p_0_3_um_b":1116.89,"p_0_5_um_b":313.32,"p_1_0_um_b":45.95,"p_2_5_um_b":1.89,"p_5_0_um_b":0.32,"p_10_0_um_b":0.32,"pm1_0_atm":3.98,"pm2_5_atm":6.02,"pm10_0_atm":6.30,"pm1_0_cf_1":3.98,"pm2_5_cf_1":6.02,"pm10_0_cf_1":6.30,"p_0_3_um":983.05,"p_0_5_um":280.57,"p_1_0_um":37.09,"p_2_5_um":2.23,"p_5_0_um":0.52,"p_10_0_um":0.25,"key1_responseCode":"200","key1_responseCode_date":1520877431,"key1_count":55387,"key2_responseCode":"200","key2_responseCode_date":1520877441,"key2_count":54417,"responseCode_b":"200","responseCode_date_b":1520877411,"key1_responseCode_b":"200","key1_responseCode_date_b":1520877461,"key1_count_b":55450,"key2_responseCode_b":"200","key2_responseCode_date_b":1520877471,"key2_count_b":55679}[!http]
0x0000: ca2a 14f1 e064 5ccf 7f4b f8c4 0800 4500 .*...d\..K....E.
0x0010: 05da 0ce5 0000 8006 fbfe c0a8 0202 23c9 ..............#.
0x0020: 45c7 05bb 0050 0054 d1fa 9ab3 12ce 5018 E....P.T......P.
0x0030: 16d0 f46e 0000 5055 5420 2f6d 6f64 756c ...n..PUT./modul
0x0040: 652f 6874 7470 7472 6170 2f31 6430 3131 e/httptrap/1d011
0x0050: 6339 332d 6164 6639 2d34 3065 632d 6263 xxx-xxx-40ec-bc
0x0060: 6338 2d35 3866 6634 3036 3733 3139 342f c8-xxxxxxxxxxx/
0x0070: 6d79 7333 6372 3374 2048 5454 502f 312e xxxxxxxx.HTTP/1.
0x0080: 310d 0a48 6f73 743a 2074 7261 702e 6e6f 1..Host:.trap.no
0x0090: 6974 2e63 6972 636f 6e75 732e 6e65 740d it.circonus.net.
0x00a0: 0a58 2d50 7572 706c 6541 6972 3a20 310d .X-PurpleAir:.1.
0x00b0: 0a43 6f6e 7465 6e74 2d54 7970 653a 2061 .Content-Type:.a
0x00c0: 7070 6c69 6361 7469 6f6e 2f6a 736f 6e0d pplication/json.
0x00d0: 0a55 7365 722d 4167 656e 743a 2050 7572 .User-Agent:.Pur
0x00e0: 706c 6541 6972 2f32 2e35 3069 0d0a 436f pleAir/2.50i..Co
0x00f0: 6e74 656e 742d 4c65 6e67 7468 3a20 3132 ntent-Length:.12
0x0100: 3331 0d0a 436f 6e6e 6563 7469 6f6e 3a20 31..Connection:.
0x0110: 636c 6f73 650d 0a0d 0a7b 2253 656e 736f close....{"Senso
0x0120: 7249 6422 3a22 3563 3a63 663a 3766 3a34 rId":"5c:cf:7f:4
0x0130: 623a 6638 3a63 3422 2c22 4461 7465 5469 b:f8:c4","DateTi

Overall, I’m pleased with the result. There are a couple more things I want to try with the sensor, such as putting a second order derivative alert on the air pressure metric to tell when a low pressure region is moving in. The folks at PurpleAir.com were kind and helpful in responding to any questions I had. I’m looking forward to trying out some other sensors that I can plug into a monitoring system. Amazon has a C02 sensor, so that might be next on my list.

Posted in IoT

Comprehensive Container-Based Service Monitoring with Kubernetes and Istio

Operating containerized infrastructure brings with it a new set of challenges. You need to instrument your containers, evaluate your API endpoint performance, and identify bad actors within your infrastructure. The Istio service mesh enables instrumentation of APIs without code change and provides service latencies for free. But how do you make sense of all that data? With math, that's how.

Circonus is the first third party adapter for Istio. In a previous post, we talked about the first Istio community adapter to monitor Istio based services. This post will expand on that. We’ll explain how to get a comprehensive understanding of your Kubernetes infrastructure. We will also explain how to get an Istio service mesh implementation for your container based infrastructure.

Istio Overview

Istio is a service mesh for Kubernetes, which means that it takes care of all of the intercommunication and facilitation between services, much like network routing software does for TCP/IP traffic. In addition to Kubernetes, Istio can also interact with Docker and Consul based services. It’s similar to LinkerD, which has been around for a while.

Istio is an open source project developed by teams from Google, IBM, Cisco, and Lyft (the team behind Envoy). The project recently turned one year old, and Istio has found its way into a couple of production environments at scale. At the time of this post, the current version is 0.8.

So, how does Istio fit into the Kubernetes ecosystem? Kubernetes acts as the data plane and Istio acts as the control plane. Kubernetes carries the application traffic, handling container orchestration, deployment, and scaling. Istio routes the application traffic, handling policy enforcement, traffic management and load balancing. It also handles telemetry syndication such as metrics, logs, and tracing. Istio is the crossing guard and reporting piece of the container based infrastructure.

The diagram above shows the service mesh architecture. Istio uses an envoy sidecar proxy for each service. Envoy proxies inbound requests to the Istio Mixer service via a GRPC call. Then Mixer applies rules for traffic management, and syndicates request telemetry. Mixer is the brains of Istio. Operators can write YAML files that specify how Envoy should redirect traffic. They can also specify what telemetry to push to monitoring and observability systems. Rules can be applied as needed at run time without needing to restart any Istio components.

Istio supports a number of adapters to send data to a variety of monitoring tools. That includes Prometheus, Circonus, or Statsd. You can also enable both Zipkin and Jaeger tracing. And, you can generate graphs to visualize the services involved.

Istio is easy to deploy. Way back when, around 7 to 8 months ago, you had to install Istio onto a Kubernetes cluster with a series of kubectl commands. And you still can today. But now you can just hop into Google Cloud platform, and deploy an Istio enabled Kubernetes cluster with a few clicks, including monitoring, tracing, and a sample application. You can get up and running very quickly, and then use the istioctl command to start having fun.

Another benefit is that we can gather data from services without requiring developers to instrument their services to provide that data. This has a number of benefits: it reduces maintenance, it removes points of failure in the code, and it provides vendor agnostic interfaces, which reduces the chance of vendor lock-in.

With Istio, we can deploy different versions of individual services and weight the traffic between them. Istio itself makes use of a number of different pods to operate itself, as shown here:

> kubectl get pods -n istio-system
NAME                     READY STATUS  RESTARTS AGE
istio-ca-797dfb66c5      1/1   Running  0       2m
istio-ingress-84f75844c4 1/1   Running  0       2m
istio-egress-29a16321d3  1/1   Running  0       2m
istio-mixer-9bf85fc68    3/3   Running  0       2m
istio-pilot-575679c565   2/2   Running  0       2m
grafana-182346ba12       2/2   Running  0       2m
prometheus-837521fe34    2/2   Running  0       2m

Istio is not exactly lightweight. The power and flexibility of Istio come with the cost of some overhead for operation. However, if you have more than a few microservices in your application, your application containers will soon surpass the system provisioned containers.

Service Level Objectives

This brief overview of Service Level Objectives will set the stage for how we should measure our service health. The concept of Service Level Agreements (SLAs) has been around for at least a decade. But just recently, the amount of online content related to Service Level Objectives (SLOs) and Service Level Indicators (SLIs) has been increasing rapidly.

In addition to the well-known Google SRE book, two new books that talk about SLOs are being published soon. The Site Reliability Workbook has a dedicated chapter on SLOs, and Seeking SRE has a chapter on defining SLO goals by Circonus founder and CEO, Theo Schlossnagle. We also recommend watching the YouTube video "SLIs, SLOs, SLAs, oh my!" from Seth Vargo and Liz Fong-Jones to get an in depth understanding of the difference between SLIs, SLOs, and SLAs.

To summarize: SLIs drive SLOs, which inform SLAs.

A Service Level Indicator (SLI) is a metric derived measure of health for a service. For example, I could have an SLI that says my 95th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.

A Service Level Objective (SLO) is a goal or target for an SLI. We take an SLI, and extend its scope to quantify how we expect our service to perform over a strategic time interval. Using the SLI from the previous example, we could say that we want to meet the criteria set by that SLI for 99.9% of a trailing year window.

A Service Level Agreement (SLA) is an agreement between a business and a customer, defining the consequences for failing to meet an SLO. Generally, the SLOs which your SLA is based upon will be more relaxed than your internal SLOs, because we want our internal facing targets to be more strict than our external facing targets.

RED Dashboard

What combinations of SLIs are best for quantifying both host and service health? Over the past several years, there have been a number of emerging standards. The top standards are the USE method, the RED method, and the “four golden signals” discussed in the Google SRE book.

Brendan Gregg introduced the USE method, which seeks to quantify health of a system host based on utilization, saturation, and errors metrics. For something like a CPU, we can use common utilization metrics for user, system, and idle percentages. We can use load average and run queue for saturation. The UNIX perf profiler is a good tool for measuring CPU error events.

Tom Wilkie introduced the RED method a few years ago. With RED, we monitor request rate, request errors, and request duration. The Google SRE book talks about using latency, traffic, errors, and saturation metrics. These "four golden signals" are targeted at service health and are similar to the RED method, but extend it with saturation. In practice, it can be difficult to quantify service saturation.

So, how are we monitoring the containers? Containers are short lived entities. Monitoring them directly to discern our service health presents a number of complex problems, such as the high cardinality issue. It is easier and more effective to monitor the service outputs of those containers in aggregate. We don’t care if one container is misbehaving if the service is healthy. Chances are that our orchestration framework will reap that container anyway and replace it with a new one.

Let’s consider how best to integrate SLIs from Istio as part of a RED dashboard. To compose our RED dashboard, let’s look at what telemetry is provided by Istio:

  • Request Count by Response Code
  • Request Duration
  • Request Size
  • Response Size
  • Connection Received Bytes
  • Connection Sent Bytes
  • Connection Duration
  • Template Based MetaData (Metric Tags)

Istio provides several metrics about the requests it receives, the latency to generate a response, and connection level data. Note the first two items from the list above; we’ll want to include them in our RED dashboard.

Istio also gives us the ability to add metric tags, which it calls dimensions. So we can break down the telemetry by host, cluster, etc. We can get the rate in requests per second by taking the first derivative of the request count. We can get the error rate by taking the derivative of the request count of unsuccessful requests. Istio also provides us with the request latency of each request, so we can record how long each service request took to complete.
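
As a small sketch of that derivative step, rates can be derived from the cumulative counters like so; the sample values are made up.

# Sketch: deriving request and error rates from cumulative counters, which is
# what "taking the first derivative of the request count" amounts to in practice.
def rate_per_second(samples):
    """samples: list of (timestamp_seconds, cumulative_count), oldest first."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates

request_count = [(0, 0), (10, 1200), (20, 2500), (30, 3600)]   # hypothetical scrape samples
error_count   = [(0, 0), (10, 3),    (20, 3),    (30, 14)]
print(rate_per_second(request_count))   # requests/second per interval
print(rate_per_second(error_count))     # errors/second per interval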

In addition, Istio provides us with a Grafana dashboard out of the box that contains the pieces we want:

The components we want are circled in red in the screenshot above. We have the request rate in operations per second in the upper left, the number of failed requests per second in the upper right, and a graph of response time in the bottom. There are several other indicators on this graph, but let’s take a closer look at the ones we’ve circled:

The above screenshot shows the rate component of the dashboard. This is pretty straightforward. We count the number of requests which returned a 200 response code and graph the rate over time.

The Istio dashboard does something similar for responses that return a 5xx error code. In the above screenshot, you can see how it breaks down the errors by either the ingress controller, or by errors from the application product page itself.

This screenshot shows the request duration graph. This graph is the most informative about the health of our service. This data is provided by a Prometheus monitoring system, so we see request time percentiles graphed here, including the median, 90th, 95th, and 99th percentiles.

These percentiles give us some overall indication of how the service is performing. However, there are a number of deficiencies with this approach that are worth examining. During times of low activity, these percentiles can skew wildly because of limited numbers of samples. This could mislead you about the system performance in those situations. Let’s take a look at the other issues that can arise with this approach:

Duration Problems:

  • The percentiles are aggregated metrics over fixed time windows.
  • The percentiles cannot be re-aggregated for cluster health.
  • The percentiles cannot be averaged (and this is a common mistake).
  • This method stores aggregates as outputs, not inputs.
  • It is difficult to measure cluster SLIs with this method.

Percentiles often provide deeper insight than averages as they express the range of values with multiple data points instead of just one. But like averages, percentiles are an aggregated metric. They are calculated over a fixed time window for a fixed data set. If we calculate a duration percentile for one cluster member, we can not merge that with another one to get an aggregate performance metric for the whole cluster.

It is a common misconception that percentiles can be averaged; they cannot, except in rare cases where the distributions that generated them are nearly identical. If you only have the percentile, and not the source data, you cannot know whether that is the case. It is a chicken and egg problem.
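
A quick numeric illustration with made-up data shows how far off the "average of percentiles" can be.

# Averaging two members' p90s does not give the p90 of the combined traffic,
# especially when request volumes differ. All sample data is invented.
def p90(samples):
    s = sorted(samples)
    return s[int(0.90 * len(s)) - 1]   # simple nearest-rank percentile

member_a = [10] * 9500 + [50] * 500        # busy member: p90 = 10ms
member_b = [200] * 900 + [400] * 100       # struggling, low-traffic member: p90 = 200ms
print(p90(member_a), p90(member_b))        # 10, 200
print((p90(member_a) + p90(member_b)) / 2) # "averaged" p90: 105ms
print(p90(member_a + member_b))            # true cluster p90: 50ms, not the average

The averaged value of 105ms bears no resemblance to the true cluster p90 of 50ms, because the two members served very different request volumes.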

This also means that if you are measuring percentile-based performance only for individual cluster members, you cannot set service level indicators for the entire service, due to this lack of mergeability.

Our ability to set meaningful SLIs (and as a result, meaningful SLOs) is limited here, due to only having four latency data points over fixed time windows. So when you are working with percentile based duration metrics, you have to ask yourself if your SLIs are really good SLIs. We can do better by using math to determine the SLIs that we need to give us a comprehensive view of our service’s performance and health.

Histogram Telemetry

Above is a visualization of latency data for a service in microseconds using a histogram. The number of samples is on the Y-Axis, and the sample value, in this case microsecond latency, is on the X-Axis. This is the open source histogram we developed at Circonus. (See the open source in both C and Golang, or read more about histograms here.) There are a few other histogram implementations out there that are open source, such as Ted Dunning’s t-digest histogram and the HDR histogram.

The Envoy project recently adopted the C implementation of Circonus's log linear histogram library. This allows Envoy data to be collected as distributions. They found a very minor bug in the implementation, which Circonus was quite happy to fix. That's the beauty of open source: the more eyes on the code, the better it gets over time.

Histograms are mergeable. Any two or more histograms can be merged as long as the bin boundaries are the same. That means that we can take this distribution and combine it with other distributions. Mergeable metrics are great for monitoring and observability. They allow us to combine outputs from similar sources, such as service members, and get aggregate service metrics.
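
As a sketch of why that works, merging two histograms with identical bin boundaries is nothing more than summing counts bin by bin; the bins and counts below are invented.

# Sketch: merging histograms from two service members is just summing matching
# bins, which is why distributions aggregate cleanly across a cluster.
from collections import Counter

member_a = Counter({100000: 40, 110000: 25, 120000: 5})   # bin lower bound (microseconds) -> sample count
member_b = Counter({100000: 10, 120000: 30, 130000: 8})

cluster = member_a + member_b    # Counter addition sums counts per bin
print(cluster)                   # Counter({100000: 50, 120000: 35, 110000: 25, 130000: 8})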

As indicated in the image above, this log linear histogram contains 90 bins for each power of 10. You can see 90 bins between 100,000 and 1M. At each power of 10, the bin size increases by a factor of 10. This allows us to record a wide range of values with high relative accuracy without needing to know the data distribution ahead of time. Let’s see what this looks like when we overlay some percentiles:

Now you can see where we have the average, and the 50th percentile (also known as the median), and the 90th percentile. The 90th percentile is the value at which 90% of the samples are below that value.

Consider our example SLI from earlier. With latency data displayed in this format, we can easily calculate that SLI for a service by merging histograms together to get a 5 minute view of data, and then calculating the 95th percentile value for that distribution. If it is less than 300 milliseconds, we met our target.

The RED dashboard duration graph from our screenshot above has four percentiles, the 50th, 90th, 95th, and 99th. We could overlay those percentiles on this distribution as well. Even without data, we can see the rough shape of what the request distribution might look like, but that would be making a lot of assumptions. To see just how misleading those assumptions based on just a few percentiles can be, let’s look at a distribution with additional modes:

This histogram shows a distribution with two distinct modes. The leftmost mode could be fast responses due to serving from a cache, and the right mode from serving from disk. Just measuring latency using four percentiles would make it nearly impossible to discern a distribution like this. This gives us a sense of the complexity that percentiles can mask. Consider a distribution with more than two modes:

This distribution has at least four visible modes. If we do the math on the full distribution, we will find 20+ modes here. How many percentiles would you need to record to approximate a latency distribution like the one above? What about a distribution like the one below?

Complex systems composed of many services will generate latency distributions that cannot be accurately represented by using percentiles. You have to record the entire latency distribution to be able to fully represent it. This is one reason it is preferable to store the complete distributions of the data in histograms and calculate percentiles as needed, rather than just storing a few percentiles.

This type of histogram visualization shows a distribution over a fixed time window. We can store multiple distributions to get a sense of how it changes over time, as shown below:

This is a heatmap, which represents a set of histograms over time. Imagine each column in this heatmap as a separate bar chart viewed from above, with color used to indicate the height of each bin. This is a Grafana visualization of the response latency from a cluster of 10 load balancers. It gives us deep insight into the system behavior of the entire cluster over a week; there are over 1 million data samples here. The median centers around 500 microseconds, shown in the red colored bands.

Above is another type of heatmap. Here, saturation is used to indicate the “height” of each bin (the darker tiles are more “full”). Also, this time we’ve overlayed percentile calculations over time on top of the heatmap. Percentiles are robust metrics and very useful, but not by themselves. We can see here how the 90+ percentiles increase as the latency distribution shifts upwards.

Let’s take these distribution based duration maps and see if we can generate something more informative than the sample Istio dashboard:

The above screenshot is a RED dashboard revised to show distribution based latency data. In the lower left, we show a histogram of latencies over a fixed time window. To the right of it, we use a heat map to break that distribution down into smaller time windows. With this layout of RED dashboard, we can get a complete view of how our service is behaving with only a few panels of information. This particular dashboard was implemented using Grafana served from an IRONdb time series database which stores the latency data natively as log linear histograms.

We can extend this RED dashboard a bit further and overlay our SLIs onto the graphs as well:

For the rate panel, our SLI might be to maintain a minimum level of requests per second. For the error panel, our SLI might be to stay under a certain number of errors per second. And, as we previously examined duration SLIs, we might want the 99th percentile of our entire service, which is composed of several pods, to stay under a certain latency over a fixed window. Using Istio telemetry stored as histograms enables us to set these meaningful, service-wide SLIs. Now we have a lot more to work with and we're better able to interrogate our data (see below).

Asking the Right Questions

So now that we've put the pieces together and have seen how to use Istio to get meaningful data from our services, let's see what kinds of questions we can answer with it.

We all love being able to solve technical problems, but not everyone has that same focus. The folks on the business side want to answer questions about how the business is doing, so you need to be able to answer business-centric questions. Let's take the toolset we've assembled and align its capabilities with a couple of questions that the business might ask its SREs:

Example Questions:

  • How many users got angry on the Tuesday slowdown after the big marketing promotion?
  • Are we over-provisioned or under-provisioned on our purchasing checkout service?

Consider the first example. Everyone has been through a big slowdown. Let’s say Marketing did a big push, traffic went up, performance speed went down, and users complained that the site got slow. How can we quantify the extent of how slow it was for everyone? How many users got angry? Let’s say that Marketing wants to know this so that they can send out a 10% discount email to the users affected and also because they want to avoid a recurrence of the same problem. Let’s craft an SLI and assume that users noticed the slowdown and got angry if requests took more than 500 milliseconds. How can we calculate how many users got angry with this SLI of 500 ms?

First, we need to already be recording the request latencies as a distribution. Then we can plot them as a heatmap. We can use the distribution data to calculate the percentage of requests that exceeded our 500ms SLI by using inverse percentiles. We take that answer, multiply it by the total number of requests in that time window, and integrate over time. Then we can plot the result overlayed on the heatmap:

In this screenshot, we’ve circled the part of the heatmap where the slowdown occurred. The increased latency distribution is fairly indicative of a slowdown. The line on the graph indicates the total number of requests affected over time.

In this example, we managed to miss our SLI for 4 million requests. Whoops. What isn’t obvious are the two additional slowdowns on the right because they are smaller in magnitude. Each of those cost us an additional 2 million SLI violations. Ouch.

We can do these kinds of mathematical analyses because we are storing data as distributions, not aggregations like percentiles.
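
A sketch of that calculation, with made-up per-window data, looks something like this.

# Per time window, multiply the inverse percentile (fraction of requests over
# the 500ms SLI) by the request count in that window, then sum over the incident.
def affected_requests(windows, threshold_ms=500):
    """windows: list of (request_count, latency_samples_ms) per time window."""
    total = 0
    for request_count, samples in windows:
        over = sum(1 for s in samples if s > threshold_ms) / float(len(samples))
        total += over * request_count
    return int(total)

incident = [
    (200000, [120, 300, 450, 700, 900, 650, 380, 520, 610, 240]),   # 50% over threshold
    (180000, [100, 200, 250, 300, 510, 420, 330, 280, 150, 190]),   # 10% over threshold
]
print(affected_requests(incident))   # ~118,000 requests missed the SLI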

Let’s consider another common question. Is my service under provisioned, or over provisioned?

The answer is often “it depends.” Loads vary based on the time of day and the day of week, in addition to varying because of special events. That’s before we even consider how the system behaves under load. Let’s put some math to work and use latency bands to visualize how our system can perform:

The visualization above shows latency distribution broken down by latency bands over time. The bands here show the number of requests that took under 25ms, between 25 and 100ms, 100-250ms, 250-1000ms, and over 1000ms. The colors are grouped from fast requests, shown in green, to slow requests, shown in red.

What does this visualization tell us? It shows that requests to our service started off very quickly, then the percentage of fast requests dropped off after a few minutes, and the percentage of slow requests increased after about 10 minutes. This pattern repeated itself for two traffic sessions. What does that tell us about provisioning? It suggests that initially the service was over provisioned, but then became under provisioned over the course of 10-20 minutes. Sounds like a good candidate for auto-scaling.

We can also add this type of visualization to our RED dashboard. This type of data is excellent for business stakeholders. And it doesn’t require a lot of technical knowledge investment to understand the impact on the business.

Conclusion

We should monitor services, not containers. Services are long-lived entities; containers are not. Your users don’t care how your containers are performing, they care about how your services are performing.

You should record distributions instead of aggregates, and then generate your aggregates from those distributions. Aggregates are very valuable sources of information, but they cannot be merged, so they are not well suited to further statistical analysis.

Istio gives you a lot for free. You don’t have to instrument your code, and you don’t need to build a high quality application framework from scratch.

Use math to ask and answer questions about your services that are important to the business. That’s what this is all about, right? When we can make systems reliable by answering questions that the business values, we achieve the goals of the organization.

Cassandra Query Observability with Libpcap and Protocol Observer

Opinions vary in recent online discussions regarding systems and software observability. Some state that observability is a replacement for monitoring. Others that they are parallel mechanisms, or that one is a subset of another (not to mention where tracing fits into such a hierarchy). Monitoring Weekly recently provided a helpful list of resources for an overview of this discussion, as well as some practical applications of observability. Without attempting to put forth yet another opinion, let’s take a step back and ask, what is observability?

What is Observability?

Software observability wasn’t invented yesterday, and has a history going back at least 10 years, to the birth of DTrace. One can go back even further to the Apollo 11 source code to see some of the earliest implementations of software observability, and also systems monitoring from mission control. If we ask Wikipedia what Observability is, we get an answer along the lines of how well internal states of a system can be inferred from knowledge of its external outputs.

To understand this distinction between monitoring and observability, consider what a doctor does with a stethoscope. Anyone who has watched one of the many TV medical dramas, or been in a doctor’s office themselves, should be familiar with this. One uses the external outputs of the stethoscope to determine the internal state of the patient. The patient is monitored with instruments such as the stethoscope, and it is the observable traits of these instruments that allow this monitoring to occur. The doctor observes the instruments in order to monitor the patient, because this is often more effective than observing the patient directly, which might not even be possible for some of a patient’s internal states.

Observing Wirelatency and Monitoring Cassandra

So now that we have a baseline understanding of these terms, let’s dive into a practical use case that we’ve implemented here at Circonus. Our patient will be the Apache Cassandra wide column distributed data store. Our stethoscope will be the wirelatency tool, which uses the libpcap library to grab a copy of packets off of the wire before they are processed by the operating system. We are going to infer the internal state of Cassandra by observing its external outputs (the query latency data).

Let’s take a quick look at libpcap. Pcap allows one to get a copy of packets off the ethernet interface at the link layer prior to their being handled by the kernel networking code. The details of the implementation vary between operating systems, but packets essentially bypass the kernel network stack and are made available to user space. Linux uses PF_PACKET sockets to accomplish this; these are often referred to as raw sockets. BSD based systems use the Berkeley Packet Filter. For more detail, refer to the publication “Introduction to RAW-sockets” (Heuschkel, Hofmann, Hollstein, & Kuepper, 2017).
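As a simplified illustration of that capture step, here is roughly what grabbing CQL packets looks like in Go using the gopacket bindings to libpcap. This is not the wirelatency code itself; the interface name and BPF filter are assumptions for the example.

package main

import (
    "fmt"
    "log"

    "github.com/google/gopacket"
    "github.com/google/gopacket/pcap"
)

func main() {
    // Open the interface in promiscuous mode with a full snap length.
    handle, err := pcap.OpenLive("eth0", 65536, true, pcap.BlockForever)
    if err != nil {
        log.Fatal(err)
    }
    defer handle.Close()

    // Only capture Cassandra CQL traffic (native protocol port 9042).
    if err := handle.SetBPFFilter("tcp port 9042"); err != nil {
        log.Fatal(err)
    }

    source := gopacket.NewPacketSource(handle, handle.LinkType())
    for packet := range source.Packets() {
        // Each packet carries a kernel timestamp; pairing request and
        // response packets is what lets us compute query latency.
        fmt.Println(packet.Metadata().Timestamp, packet.Metadata().Length, "bytes")
    }
}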

 

So now that we can grab packets off the wire in a computationally efficient way, we can reassemble bidirectional TCP streams, decode application specific protocols, and collect telemetry. Enter wirelatency, a utility developed in Go. The gopacket library handles the reassembly of bidirectional data from the packets provided by pcap. Wirelatency provides a TCPProtocolInterpreter interface, which allows us to define protocol specific functions in modules for HTTP, PostgreSQL, Cassandra CQL, and Kafka application service calls. The circonus-gometrics library allows us to send that telemetry upstream to Circonus for analysis.

It’s time to do some observability. First let’s use `tcpdump` to take a look at some actual Cassandra CQL network traffic. I used Google’s Compute Engine to set up a Cassandra 3.4 server and client node. Initially, I used an Ubuntu 17.04 host and installed Cassandra 3.11, but the server failed to start up and threw an exception. Fortunately, this was a documented issue. I could have downgraded the JDK or recompiled Cassandra from source (and probably would have done so 10 years ago), but I decided to take the easy route and lit up two new hosts running RHEL. Getting Cassandra up and running was a cakewalk by comparison, so I used the excellent DataStax documentation to get a simple schema up and insert some data. At this point, I was able to grab some network traffic.

[fredmoyer@instance-1 wirelatency]$ sudo tcpdump -AvvvX dst port 9042 or src port 9042
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
06:55:32.586018 IP (tos 0x0, ttl 64, id 60628, offset 0, flags [DF], proto TCP (6), length 109)
instance-2.c.deft-reflection-188505.internal.38428 > instance-1.c.deft-reflection-188505.internal.9042: Flags [
P.], cksum 0xacad (correct), seq 2362125213:2362125270, ack 3947549198, win 325, options [nop,nop,TS val 435889631
ecr 435892523], length 57
0x0000: 4500 006d ecd4 4000 4006 3896 0a8e 0003 E..m..@.@.8.....
0x0010: 0a8e 0002 961c 2352 8ccb 2b9d eb4a d20e ......#R..+..J..
0x0020: 8018 0145 acad 0000 0101 080a 19fb 25df ...E..........%.
0x0030: 19fb 312b 0400 000b 0700 0000 3000 0000 ..1+........0...
0x0040: 1b73 656c 6563 7420 2a20 6672 6f6d 2063 .select.*.from.c
0x0050: 7963 6c69 7374 5f6e 616d 653b 0001 3400 yclist_name;..4.
0x0060: 0000 6400 0800 0565 2698 f00e 08 ..d....e&....
06:55:32.593339 IP (tos 0x0, ttl 64, id 43685, offset 0, flags [DF], proto TCP (6), length 170)
instance-1.c.deft-reflection-188505.internal.9042 > instance-2.c.deft-reflection-188505.internal.38428: Flags [
P.], cksum 0x15bd (incorrect -> 0x1e61), seq 1:119, ack 57, win 220, options [nop,nop,TS val 435912925 ecr 43588963
1], length 118
0x0000: 4500 00aa aaa5 4000 4006 7a88 0a8e 0002 E.....@.@.z.....
0x0010: 0a8e 0003 2352 961c eb4a d20e 8ccb 2bd6 ....#R...J....+.
0x0020: 8018 00dc 15bd 0000 0101 080a 19fb 80dd ................
0x0030: 19fb 25df 8400 000b 0800 0000 6d00 0000 ..%.........m...
0x0040: 0200 0000 0100 0000 0300 0763 7963 6c69 ...........cycli
0x0050: 6e67 000c 6379 636c 6973 745f 6e61 6d65 ng..cyclist_name
0x0060: 0002 6964 000c 0009 6669 7273 746e 616d ..id....firstnam
0x0070: 6500 0d00 086c 6173 746e 616d 6500 0d00 e....lastname...
0x0080: 0000 0100 0000 105b 6962 dd3f 904c 938f .......[ib.?.L..
0x0090: 61ea bfa4 a803 e200 0000 084d 6172 6961 a..........Maria
0x00a0: 6e6e 6500 0000 0356 4f53 nne....VOS
06:55:32.593862 IP (tos 0x0, ttl 64, id 60629, offset 0, flags [DF], proto TCP (6), length 52)
instance-2.c.deft-reflection-188505.internal.38428 > instance-1.c.deft-reflection-188505.internal.9042: Flags [
.], cksum 0x55bd (correct), seq 57, ack 119, win 325, options [nop,nop,TS val 435889639 ecr 435912925], length 0
0x0000: 4500 0034 ecd5 4000 4006 38ce 0a8e 0003 E..4..@.@.8.....
0x0010: 0a8e 0002 961c 2352 8ccb 2bd6 eb4a d284 ......#R..+..J..
0x0020: 8010 0145 55bd 0000 0101 080a 19fb 25e7 ...EU.........%.
0x0030: 19fb 80dd

Here we can observe the query issued and the result sent back from the server. And because the packets are timestamped, we can calculate the query latency, which was about 7.3 milliseconds (06:55:32.593339 – 06:55:32.586018 = 0.007321 seconds).
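That arithmetic is easy to reproduce; here is a tiny Go snippet using the two tcpdump timestamps above.

package main

import (
    "fmt"
    "time"
)

func main() {
    // The request and response timestamps from the capture above.
    const layout = "15:04:05.000000"
    query, _ := time.Parse(layout, "06:55:32.586018")
    result, _ := time.Parse(layout, "06:55:32.593339")
    fmt.Println("query latency:", result.Sub(query)) // 7.321ms
}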

Here’s the approach using `protocol-observer`, the wirelatency executable. The golang binary takes an API token, a wire protocol (in this case `cassandra_cql`), and a number of optional debugging arguments. We can see below how it tracks inbound and outbound TCP streams. It reassembles those streams, pulls out the Cassandra queries, and records the query latencies. At regular intervals, it executes an HTTP PUT to the Circonus API endpoint to store the observed metrics. The metrics are recorded as log linear histograms, which means that we can store thousands of query latencies (or many more) in a very compact data structure without losing any accuracy from a statistical analysis standpoint.

[fredmoyer@instance-1 wirelatency]$ sudo API_TOKEN:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -wire cassandra_cql -debug_capture=true -debug_circonus=true -debug_capture_data=true -debug_cql=true -debug_measurements=true

2018/02/14 07:25:07 [DEBUG] New(10.142.0.3->10.142.0.2, 38428->9042) -> true
2018/02/14 07:25:07 [DEBUG] establishing sessions for net:10.142.0.3->10.142.0.2
2018/02/14 07:25:07 [DEBUG] establishing dsessions for ports:38428->9042
2018/02/14 07:25:07 [DEBUG] new inbound TCP stream 10.142.0.3->10.142.0.2:38428->9042 started, paired: false
2018/02/14 07:25:07 [DEBUG] New(10.142.0.2->10.142.0.3, 9042->38428) -> false
2018/02/14 07:25:07 [DEBUG] new outbound TCP stream 10.142.0.2->10.142.0.3:9042->38428 started, paired: true
2018/02/14 07:25:08 [DEBUG] flushing all streams that haven't seen packets, pcap stats: &{PacketsReceived:3 PacketsDropped:0 PacketsIfDropped:0}
2018/02/14 07:25:08 [DEBUG] Packaging metrics
2018/02/14 07:25:08 [DEBUG] PUT URL:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2018/02/14 07:25:09 [DEBUG] 2 stats sent

Check out the Wirelatency repo or ask about it on the Circonus-labs slack to learn more about how it works.

Conclusion

We can get a look at the distribution of the observed latencies by displaying them as a histogram.

This histogram shows that a lot of requests are clustered between 2 and 4 milliseconds. We can also see a much smaller mode between 25 and 30 milliseconds. This tells us that we likely have two different data access patterns going on for this example select query. It’s possible that the first mode indicates queries that returned data from a memory cache, and the second from disk. This is something that can’t be discerned just from this data collection, but we can go one step further and plot an overlay with block I/O activity via eBPF to see if there is a correlation between the higher latency query times and disk activity.

If we had just looked at average query latency with a more blunt instrument, we would have probably concluded that most queries were taking around 10 milliseconds. But looking at the distribution here, we can see that very few queries actually took that long to execute. Your ability to observe your system accurately depends on the correctness of the instruments you are using. Here we can see how libpcap and the histogram visualization give us a detailed view into how our Cassandra instance is really performing.

Observability into key Service Level Objectives is essential for the success of your enterprise, and observability is the first step for successful analysis. Only once a doctor collects their observations of the symptoms can they begin to make their diagnosis. What kind of insights can we gain into the behavior of our systems (of the patient’s internal states) from these observations?

That’s just part of what we will explore next week, when we’ll talk more about Service Level Objectives and dive into SLO analysis in more detail.

Effective Management of High Volume Numeric Data with Histograms

How do you capture and organize billions of measurements per second such that you can answer a rich set of queries effectively (percentiles, counts below X, aggregations across streams), and you don’t blow through your AWS budget in minutes?

To effectively manage billions of data points, your system has to be both performant and scalable. How do you accomplish that? Not only do your algorithms have to be on point, but your implementation of them has to be efficient. You want to avoid allocating memory where possible, avoid copying data (pass pointers around instead), avoid locks, and avoid waits. Lots of little optimizations that add up to being able to run your code as close to the metal as possible.

You also need to be able to scale your data structures. They need to be as size efficient as possible, which means using strongly typed languages with the optimum choice of data types. We’ve found that histograms are the most efficient data structure for storing the data types we care about at scale.

What is a histogram?

A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.

This histogram diagram shows a skewed histogram where the mode is near the minimum value, q(0). The Y axis is the number of samples (or sample density), and the X axis shows the sample value. On this histogram we can see that the median is slightly left of the midpoint between the lowest and highest values. The mode is at a low sample value, so the median is below the mean, or average value. The 90th percentile, also called q(0.9), is the value below which 90 percent of the sample values fall.

This might look like the 2nd generation display in KITT from Knight Rider, but this is a heatmap. A heatmap is essentially a series of histograms over time. This heatmap represents web service request latency. Imagine each column in the heatmap as a bar graph (like the previous histogram) viewed “from above”; the parts that are red are where the sample density is the highest. So we can see that most of the latencies tend to concentrate around 500 nanoseconds. We can overlay quantiles onto this visualization; we’ll cover that in a bit.
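To make the histogram definition above concrete, here is a minimal Go sketch that sorts a handful of hypothetical latency samples into fixed 5 ms bins and counts the samples per bin.

package main

import (
    "fmt"
    "math"
)

func main() {
    // Request latencies in milliseconds (hypothetical samples).
    samples := []float64{2.1, 2.7, 3.3, 3.4, 9.8, 26.5, 27.0, 3.1}

    // Divide the range into fixed 5ms-wide bins and count samples per bin.
    const binWidth = 5.0
    bins := map[int]int{}
    for _, v := range samples {
        bins[int(math.Floor(v/binWidth))]++
    }

    // Print each occupied bin (map iteration order is arbitrary).
    for idx, count := range bins {
        lower := float64(idx) * binWidth
        fmt.Printf("[%g, %g): %d samples\n", lower, lower+binWidth, count)
    }
}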

Types of Histograms

There are five types of histograms:

  • Fixed Bucket – require the user to specify the bucket or bin boundaries.
  • Approximate – use approximations of values.
  • Linear – have one bin at even intervals, such as one bin per integer.
  • Log Linear – have bins at logarithmically increasing intervals.
  • Cumulative – each successive bin contains the sum of the counts of previous bins.

Fixed Bucket Histograms

Fixed bucket histograms require the user to specify the bin boundaries.

Traits of fixed bucket histograms:

  • Histogram bin sizes can be fine tuned for known data sets to achieve increased precision.
  • Cannot be merged with other types of histograms because the bin boundaries are likely uncommon.
  • Less experienced users will likely pick suboptimal bin sizes.
  • If you change your bin sizes, you can’t do calculations across older configurations.

Approximate Histograms

Approximate histograms such as the t-digest histogram (created by Ted Dunning) use approximations of values, such as this example above which displays centroids. The number of samples on each side of the centroid is the same.

Traits of approximate histograms:

  • Space efficient.
  • High accuracy at extreme percentiles (95%, 99%+).
  • Worst case errors ~10% at the median with small sample sizes.
  • Can be merged with other t-digest histograms.

Linear Histograms

Linear histograms have one bin at even intervals, such as one bin per integer. Because the bins are all evenly sized, this type of histogram uses a large number of bins.

Traits of linear histograms:

  • Accuracy dependent on data distribution and bin size.
  • Low accuracy at fractional sample values (though this indicates improperly sized bins).
  • Inefficient bin footprint at higher sample values.

Log Linear Histograms

Log Linear histograms have bins at logarithmically increasing intervals.

Traits of log linear histograms:

  • High accuracy at all sample values.
  • Fits all ranges of data well.
  • Worst case bin error ~5%, but only with absurdly low sample density.
  • Bins are often subdivided into even slices for increased accuracy.
  • HDR histograms (high dynamic range) are a type of log linear histograms.

Cumulative Histograms

Cumulative histograms are different from other types of histograms in that each successive bin contains the sum of the counts of previous bins.

Traits of cumulative histograms:

  • Final bin contains the total sample count, and as such is q(1).
  • Used at Google for their Monarch monitoring system.
  • Easy to calculate bin quantile – just divide the count by the maximum.

Open Source Log Linear Histograms

So what does a programmatic implementation of a log linear histogram look like? Circonus has released its implementation of log linear histograms as open source, in both C and Golang.

We need to ask a few things:

  • Why does it scale?
  • Why is it fast?
  • How does it calculate quantiles?

First let’s examine what the visual representation of this histogram is, to get an idea of the structure:

At first glance this looks like a linear histogram, but take a look at the 1 million point on the X axis. You’ll notice a change in bin size by a factor of 10. Where there was a bin from 990k to 1M, the next bin spans 1M to 1.1M. Each power of 10 contains 90 evenly spaced bins. Why not 100 bins, you might ask? Because the lower bound isn’t zero: with a lower bound of 1 and an upper bound of 10, 10 - 1 = 9, and spacing that range in 0.1 increments yields 90 bins.

Here’s an alternate look at the bin size transition:

You can see the transition to larger bins at the 1,000 sample value boundary.
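Here is a small Go sketch of that binning rule, simplified from the actual open source implementation: 90 bins per power of ten, each one tenth of that power wide. It reproduces the 990k-1M and 1M-1.1M boundaries described above.

package main

import (
    "fmt"
    "math"
)

// logLinearBin returns the [lower, upper) boundaries of the bin a positive
// sample falls into: within each power of ten, bins are one tenth of that
// power wide (1.0, 1.1, ... 9.9, then 10, 11, ... 99, and so on).
func logLinearBin(v float64) (lower, upper float64) {
    exp := math.Floor(math.Log10(v))
    width := math.Pow(10, exp) / 10 // bin width within this power of ten
    lower = math.Floor(v/width) * width
    return lower, lower + width
}

func main() {
    for _, v := range []float64{995000, 1050000} {
        lo, hi := logLinearBin(v)
        fmt.Printf("%.0f falls in [%.0f, %.0f)\n", v, lo, hi)
    }
}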

Bin Data Structure

The C implementation of each bin is a struct containing a value and exponent struct paired with the count of samples in that bin. The diagram above shows the overall memory footprint of each bin. The value is one byte representing the value of the data sample times 10. The Exponent is the power of 10 of the bin, and ranges from -128 to +127. The sample count is an unsigned 64 bit integer. This field is variable bit encoded, and as such occupies a maximum of 8 bytes.
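A simplified Go rendering of that bin layout might look like the following. The field names are illustrative, not the actual declarations from the open source library.

package main

import "fmt"

// bin mirrors the layout described above: a value byte, an exponent byte,
// and a variable-bit-encoded sample count of at most eight bytes.
type bin struct {
    val   int8   // two significant digits of the sample, stored as d.d * 10
    exp   int8   // power of ten of the bin, -128 to +127
    count uint64 // number of samples recorded in this bin
}

func main() {
    // A sample of 4,321 lands in the bin for 4.3 x 10^3, i.e. [4300, 4400).
    b := bin{val: 43, exp: 3, count: 1}
    fmt.Printf("bin %d.%d x 10^%d holds %d sample(s)\n", b.val/10, b.val%10, b.exp, b.count)
}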

Many of the existing time series data stores out there store a single data point as an average using the value as a uint64; that’s one value for every eight bytes. For roughly the same storage requirements, this bin structure can store a virtually unlimited count of samples falling within its range.

Let’s take a look at what the storage footprint is in practice for this type of bin data structure. In practice, we have not seen more than a 300 bin span for operational sample sets. Bins are not pre-allocated linearly, they are allocated as samples are recorded for that span, so the histogram data structure can have gaps between bins containing data.

To calculate storage efficiency for 1 month, that’s 30 days of one minute histograms:

30 days * 24 hours/day * 60 histograms/hour * 300 bins/histogram * 10 bytes/bin = 129,600,000 bytes ≈ 123.6 MB

These calculations show that we can store 30 days of one minute distributions in a maximum space of 123.6 megabytes. Less than a second of disk read operations, if that.

Now, 30 days of one minute averages only takes about a third of a megabyte – but that data is essentially useless for any sort of analysis.

Let’s examine what a year’s worth of data looks like in five minute windows. That’s 365 days of five minute histograms.

365 days * 24 hours/day * 12 histograms/hour * 300 bins/histogram * 10 bytes/bin = 315,360,000 bytes ≈ 300.8 MB

The same calculation with different values yields a maximum of about 300 megabytes to represent a year’s worth of data in five minute windows.
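If you want to plug in your own retention window, the arithmetic is easy to wrap in a small helper. The figures are worst-case, assuming the same 300 bin span and 10 bytes per bin used above.

package main

import "fmt"

// histogramStorageMB estimates worst-case storage, in mebibytes, for a
// retention period of per-window histograms.
func histogramStorageMB(days, windowsPerHour, binSpan, bytesPerBin int) float64 {
    bytes := days * 24 * windowsPerHour * binSpan * bytesPerBin
    return float64(bytes) / 1024 / 1024
}

func main() {
    fmt.Printf("30 days of 1-minute histograms: %.1f MB\n", histogramStorageMB(30, 60, 300, 10))
    fmt.Printf("365 days of 5-minute histograms: %.1f MB\n", histogramStorageMB(365, 12, 300, 10))
}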

Note that this is invariant to the total number of samples; the biggest factors in the actual size of the data are the span of bins covered, and the compression factor per bin.

Quantile Calculations

Let’s talk about performing quantile calculations.

  1. Given a quantile q(X) where 0 < X < 1.
  2. Sum up the counts of all the bins, C.
  3. Multiply X * C to get count Q.
  4. Walk bins, sum bin boundary counts until > Q.
  5. Interpolate quantile value q(X) from bin.

The quantile notation q of X is just a fancy way of specifying a percentile. The 90th percentile would be q of 0.9, the 95th would be q of 0.95, and the maximum would be q of 1.

So say we wanted to calculate the median, q of 0.5. First we iterate over the histogram and sum up the counts in all of the bins. Remember that cumulative histogram we talked about earlier? The far right bin already contains the total count, so if you are using a cumulative histogram variation, that part is already done for you. That’s a small optimization for quantile calculation, since you have an O(1) operation instead of O(n).

Now we multiply that count by the quantile value. If we say our count is 1,000 and our quantile is 0.5, we get a Q of 500.

Next iterate over the bins, summing the left and right bin boundary counts, until Q is between those counts. If the count Q matches the left bin boundary count, our quantile value is that bin boundary value. If Q falls in between the left and right boundary counts, we use linear interpolation to calculate the sample value.

Let’s go over the linear interpolation part of quantile calculation. Once we have walked the bins to where Q is located in a certain bin, we use the formula shown here.

In other words, q(X) = left_value + (Q - left_count) / (right_count - left_count) * bin_width: the value at the left bin boundary, plus the fraction of the bin’s count covered by Q, multiplied by the bin width.

Using this approach we can determine the quantile for a log linear histogram to a high degree of accuracy.
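Here is a minimal Go sketch of that walk-and-interpolate procedure over a toy set of bins. The real implementation operates on the log linear bin structure described earlier; the bin boundaries and counts below are made up, chosen so that there are 1,000 samples in total and the median works out to the bin boundary at 1.0, in line with the worked example above.

package main

import "fmt"

// bin is a toy histogram bin: the half-open interval [lower, upper) and the
// number of samples that fell into it.
type bin struct {
    lower, upper float64
    count        uint64
}

// quantile sums the counts, finds the bin where the target count Q lands,
// and linearly interpolates within that bin, as described above.
func quantile(bins []bin, q float64) float64 {
    var total uint64
    for _, b := range bins {
        total += b.count
    }
    target := q * float64(total) // Q

    var left float64 // running count at the left boundary of the current bin
    for _, b := range bins {
        right := left + float64(b.count)
        if target <= right {
            // q(X) = left_value + (Q - left_count)/(right_count - left_count) * bin_width
            return b.lower + (target-left)/(right-left)*(b.upper-b.lower)
        }
        left = right
    }
    return bins[len(bins)-1].upper // q(1)
}

func main() {
    bins := []bin{
        {0.9, 1.0, 500},
        {1.0, 1.1, 400},
        {1.1, 1.2, 100},
    }
    fmt.Printf("q(0.5) = %.2f\n", quantile(bins, 0.5)) // 1.00
}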

As for the error levels we can experience in quantile calculations: with one sample in a bin it is possible to see a worst case error of about 5%, for example a value of 109.9 in a bin bounding 100-110, where the reported value would be 105. The best case for a single sample is an error of about 0.5%, for example a value at the upper edge of a bin spanning 950-960, which is reported as 955.

However, our use of histograms is geared towards very large sets of data. With bins that contain dozens, hundreds, or more samples, accuracy should be expected to 3 or 4 nines.

Inverse Quantiles

We can also use histograms to calculate inverse quantiles. Humans can reason about thresholds more naturally than they can about quantiles.

If my web service gets a surge in requests and my 99th percentile response time doesn’t change, the absolute number of requests slower than that percentile has still grown. That not only means I just got a bunch of angry users whose requests took too long, but even worse, I don’t know by how much those requests exceeded that percentile. I don’t know how bad the bleeding is.

Inverse quantiles allow me to set a threshold sample value, then calculate what percentage of values exceeded it.

To calculate the inverse quantile, we start with the target sample and work backwards towards the target count Q.

  1. Given a sample value X, locate its bin.
  2. Using the previous linear interpolation equation, solve for Q given X.

Given the previous equation we had, we can use some middle school level algebra (well, it was when I was in school) and solve for Q.

X = left_value+(Q-left_count) / (right_count-left_count)*bin_width

X-left_value = (Q-left_count) / (right_count-left_count)*bin_width

(X-left_value)/bin_width = (Q-left_count)/(right_count-left_count)

(X-left_value)/bin_width*(right_count-left_count) = Q-left_count

Q = (X-left_value)/bin_width*(right_count-left_count)+left_count

Solving for Q, we get 700, which is expected for our value of 1.05.

Now that we know Q, we can add up the counts to the left of it, subtract that from the total and then divide by the total to get the percentage of sample values which exceeded our sample value X.

  1. Sum the bin counts up to Q as Qleft.
  2. Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal.
  3. For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3.
  4. 30% of sample values exceeded X.
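And here is the inverse direction as a Go sketch, using the same toy bins as in the quantile example: locate the bin containing X, solve for Q by interpolation, then divide the count above Q by the total. The numbers are chosen to reproduce the worked example above (Q = 700 for X = 1.05, so 30% of samples exceeded X).

package main

import "fmt"

type bin struct {
    lower, upper float64
    count        uint64
}

// inverseQuantile returns the fraction of samples that exceeded the value x.
func inverseQuantile(bins []bin, x float64) float64 {
    var total float64
    for _, b := range bins {
        total += float64(b.count)
    }

    var q, cum float64
    for _, b := range bins {
        right := cum + float64(b.count)
        if x >= b.lower && x < b.upper {
            // Q = (X - left_value)/bin_width * (right_count - left_count) + left_count
            q = (x-b.lower)/(b.upper-b.lower)*(right-cum) + cum
            break
        }
        cum = right
    }
    return (total - q) / total
}

func main() {
    // 1,000 samples total; 500 below 1.0, 400 between 1.0 and 1.1.
    bins := []bin{{0.9, 1.0, 500}, {1.0, 1.1, 400}, {1.1, 1.2, 100}}
    fmt.Printf("qinv(1.05) = %.2f\n", inverseQuantile(bins, 1.05)) // 0.30
}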

So if we are running a website, and we know from industry research that we’ll lose users if our requests take longer than three seconds, we can set X at 3 seconds, calculate the inverse quantile for our request times, and figure out what percentages of our users are getting mad at us. Let’s take a look at how we have been doing that in real time with Circonus.

Examples

This is a heatmap of one year’s worth of latency data for a web service. It contains about 300 million samples. Each histogram window in the heatmap is one day’s worth of data, five minute histograms are merged together to create that window. The overlay window shows the distribution for the day where the mouse is hovering over.

Here we added a 99th percentile overlay, which you can see implemented as the green lines. It’s pretty easy to spot the monthly recurring rises in the percentile to around 10 seconds. That looks like a network timeout issue; those usually default to around 10 seconds. For most of the time the 99th percentile is relatively low, a few hundred milliseconds.

Here we can see the inverse quantile shown for 500 millisecond request times. As you can see, for most of the graph, at least 90% of the requests are finishing within the 500 millisecond service level objective. We can still see the monthly increases, which we believe are related to network client timeouts, but when they spike, they only affect about 25% of requests – not great, but at least we know the extent to which our SLO is exceeded.

We can take that percentage of requests that violated our SLO of 500 milliseconds, and multiply them by the total number of requests to get the number of requests which exceeded 500 milliseconds. This has direct bearing on your business if each of these failed requests is costing you money.

Note that we’ve dropped the range here to a month to get a closer look at the data presented by these calculations.

What if we sum up, over time, the number of requests that exceeded 500 milliseconds? Here we integrate the number of requests that exceeded the SLO over time, and plot that as the increasing line. You can clearly see where things get bad with this service by the increase in the slope of the blue line. What if we had a way to automatically detect when that is happening and then page whoever is on call?

Here we can see the red line on the graph is the output of a constant anomaly detection algorithm. When the number of SLO violations increases, the code identifies it as an anomaly and rates it on a 0-100 scale. There are several anomaly detection algorithms available out there, and most don’t use a lot of complicated machine learning, just statistics.

Video Presentation

This video of our presentation at DataEngConf 2018 covers the material explained in this post.

You can also view the slides here.

Conclusion

We looked at a few different implementations of histograms, an overview of Circonus’ open source implementation of a log linear histogram, the data structures it uses to codify bin data, and the algorithm used to calculate quantiles. Reading the code (in C or in Golang) will demonstrate how avoiding locks, memory allocations, waits, and copies is essential to making these calculations highly optimized. Good algorithms are still limited by how they are implemented on the metal itself.

One problem to solve might be to collect all of the syscall latency data on your systems via eBPF, and compare the 99th percentile syscall latencies once you’ve applied the Spectre patches, which constrain speculative execution around branch prediction. You could find that your 99th percentile syscall overhead has gone from 10 nanoseconds to 30!

While percentiles are a good first step for analyzing your Service Level Objectives, what if we could look at the change in the number of requests that exceeded that objective? Say 10% of your syscall requests exceeded 10 nanoseconds before the Spectre patch, but after patching, 50% of those syscalls exceeded 10 nanoseconds?

Soon, in a later post, we’ll talk more about Service Level Objectives and SLO analysis.

SREcon 2018 Americas

Getting paged at 11pm on New Year’s Eve because the application code used sprintf %d on a 32 bit system and your ids just passed 4.295 billion, sending the ids negative and crashing your object service. A wakeup call at 2 am (or is it 3 am?) on the ‘spring forward’ Daylight Savings transition because your timezone libraries didn’t incorporate one of the several dozen new politically mandated timezone changes. Sweating a four hour downtime two days in a row due to primary/replica database failover because your kernel raid driver threw the same unhandled exception twice in a row; your backup primary database server naturally uses the same hardware as the active one, of course.

Circonus was created by its founders because they experienced the pain of reliability engineering on large scale systems first hand. They needed tools to efficiently diagnose and resolve problems in distributed systems. And they needed to do it at scale. The existing tools at the time (Nagios, Ganglia, etc.) couldn’t cope with the volume of telemetry, nor provide the insight into systems behaviors that was needed. So they set out to develop tools and methods that would fill the void.

The first of these was using histograms to visualize timing data. Existing solutions would give you the average latency, the 95th percentile, the 99th percentile, and maybe a couple of others. This information was useful for one host, but mathematically useless for aggregate systems metrics. Capturing latency metrics and storing them as log linear histograms allowed users to see the distribution of values over a time window. Moreover, this data could be aggregated across multiple hosts to give a holistic view of the performance of a distributed system or service.

histogram

However, systems are dynamic and constantly changing. Systems that behave well one second and poorly the next are the norm, not the exception, in today’s ephemeral infrastructures. So we added heatmaps, which are histogram representations over discrete windows of time, giving users an overview of the actual performance of their system. If the diagram below were a traditional line graph showing the average latency value, it would be a mostly straight line, hiding the parts where long tail latencies became unbearable for certain customers. The heatmap gives SREs the power to separate ‘works fine’ when testing from ‘this is really slow’ for those outlier large customers (who are generally the ones paying the big bucks).

heatmap

These tools became formative components of standards that had been developing in the SRE community. A few years ago, Brendan Gregg introduced the USE method (Utilization, Saturation, Errors). USE is a set of metrics which are key indicators of host level health. Following on the heels of USE, Tom Wilkie introduced the RED method (Rate, Errors, Duration). RED is a set of metrics which are indicators of service level health. Combining the two gives SREs a powerful set of standard frameworks for quickly identifying bad behavior in both hosts and services.

red dashboard

use

These types of visualizations display a wealth of information, and as a result can put demands on the underlying metrics storage layer. A year ago we released IRONdb, the time series database that we developed in C and Lua. This standalone TSDB can now power Grafana based visualizations, which have become part of the toolset for many SREs. As the complexity of today’s microservice based architectures grows, and the lifetime of individual components falls, the need for high volume time series telemetry continues to increase. Here at Circonus, we are dedicated to bringing you solutions that solve the parts of reliability engineering which have caused us pain in the past and which affect all SREs, so that you can focus your efforts on the parts of your business which you know better than anyone else.


Our Values

Values Create Value

In the tech industry, you read more blog posts on product features than you do on core values. At Circonus, we see them as inextricably linked – values create value – which is exactly what positions us to deliver results for our customers. Values lead you to real solutions, not just resolutions.

Resolutions and reflections are at the top of our minds in the new year. 2017 was a dynamic year in the monitoring and observability space.

This momentum in our space, and the pioneering role that Circonus plays, make me proud to work in technology. But I recall too many tech-related headlines in 2017 that screamed a lack of basic human or corporate values. This week, as we all vow to exercise more (my favorite) and eat less peanut butter ice cream (also my favorite), it’s worth reflecting that resolutions and corporate gamesmanship are like fad diets, but values are a way of life.

As leaders in a rapidly-evolving sector, Circonus believes our values should guide the path we forge. So we’ve decided to share our values publicly here. Leadership in technology depends on a set of principles that serve as a touchstone at times when it might be easier in the short term to take actions that sacrifice that which our customers, partners, and colleagues have come to expect from us. Without further ado, I present the values that Theo has laid out for us here at Circonus.

Respect

Be excellent to each other; everyone; equally. Never participate directly or indirectly in the violation of human rights. Consider others in your actions. Always presume competence and good intention. During disagreements, fall back to shared principles and values, and work toward a solution from there. Communicate honestly, clearly, consistently, and with respect.

What this means for our work: Ideas are not people. We will criticize ideas, tear them down, and seek to ensure they can withstand operating at scale. We don’t do the same to people though; to operate at the highest level, shared discourse demands respect and trust between individuals.

Trust

Trust is the basic fabric of good working relationships with our colleagues, with our customers, with our industry, and with our competitors. Trust is reinforced through honesty, openness, and being transparent by default. It is okay to share tactical mistakes and our shortcomings, both internally and externally. Never break the law or expose the mistakes or shortcomings of our customers.

What this means for our work: We build trust through open communication with our customers when things don’t go as expected.

Integrity

Do not break the law. Do not game the system. If we feel the system is broken, we must act to change the system. Winning isn’t winning if we’ve cheated. It is impossible to win alone.

What this means for our work: When we build on the work of others, we cite prior art and give credit where it is due.

Care

We must care as much or more about our customers than they do for themselves. Customer data: keep it secret, keep it safe, keep it intact and accurate. We also recognize that people are not machines and that human contact and personal care should not be sacrificed for the sake of efficiency. Never miss an opportunity to connect with a customer at the human level.

What this means for our work: We have built our systems to implement data safety as a first class feature.

Value

Leave a room cleaner than when you entered. Leave a customer with more value than they’ve invested in us. Leave your colleagues’, your customers’, and your competitors’ lives more enriched and happier after every interaction. Appreciate and acknowledge the contributions of others.

What this means for our work: We aren’t satisfied with being average. We look to implement the best in class technical solution, even when it means waiting a little bit longer for the market to realize it.

Kindness

Always treat customer organizations as the assembly of humans they are. Treat everyone with kindness; it is the one thing you can always afford to do.

What this means for our work: Kindness at a minimum means helping our customers – even more than they asked, whenever we can. Kindness costs nothing, yet returns so much.

Frugality

Avoid waste. Consider the world. Conserve and protect what we consume – from the environment to people’s time.

What this means for our work: We engineer our systems to be frugal for clock cycles and block reads, as well as network traffic.

Growth

Learn something new every day and encourage responsible risk taking. Experience results in good decision making; experience comes from making poor decisions. Support those around you to help them constructively learn from their mistakes. This is how we build a team with experience. Learn from our failures and celebrate our successes.

What this means for our work: We are always looking to the leading edge of innovation; the value of success is often higher than the cost of failure.

Excellence

Set high standards for our own excellence. Expect more from ourselves than we do from our customers, our colleagues, and others we come in contact with.

What this means for our work: We seek to push the envelope on ideas and practices. As we said earlier, being average isn’t good enough; we expect ourselves to strive for the 99th percentile.

 

We hope you will share in these values going into 2018. If these words resonated with you, and you love to build high quality systems and software, come work with us; we are hiring!

The Circonus Istio Mixer Adapter

Here at Circonus, we have a long heritage of open source software involvement. So when we saw that Istio provided a well designed interface to syndicate service telemetry via adapters, we knew that a Circonus adapter would be a natural fit. Istio has been designed to provide a highly performant, highly scalable application control plane, and Circonus has been designed with performance and scalability as core principles.

Today we are happy to announce the availability of the Circonus adapter for the Istio service mesh. This blog post will go over the development of this adapter, and show you how to get up and running with it quickly. We know you’ll be excited about this, because Kubernetes and Istio give you the ability to scale to the level that Circonus was engineered to perform at, above other telemetry solutions.

If you don’t know what a service mesh is, you aren’t alone, but odds are you have been using them for years. The routing infrastructure of the Internet is a service mesh; it facilitates TCP retransmission, access control, dynamic routing, traffic shaping, etc. The monolithic applications that have dominated the web are giving way to applications composed of microservices. Istio provides control plane functionality for container based distributed applications via a sidecar proxy. It provides the service operator with a rich set of functionality to control a Kubernetes orchestrated set of services, without requiring the services themselves to implement any control plane feature sets.

Istio’s Mixer provides an adapter model which allowed us to develop an adapter by creating handlers for interfacing Mixer with external infrastructure backends. Mixer also provides a set of templates, each of which expose different sets of metadata that can be provided to the adapter. In the case of a metrics adapter such as the Circonus adapter, this metadata includes metrics like request duration, request count, request payload size, and response payload size. To activate the Circonus adapter in an Istio-enabled Kubernetes cluster, simply use the istioctl command to inject the Circonus operator configuration into the K8s cluster, and the metrics will start flowing.

Here’s an architectural overview of how Mixer interacts with these external backends:

Istio also contains metrics adapters for StatsD and Prometheus. However, a few things differentiate the Circonus adapter from those other adapters. First, the Circonus adapter allows us to collect the request durations as a histogram, instead of just recording fixed percentiles. This allows us to calculate any quantile over arbitrary time windows, and perform statistical analyses on the histogram which is collected. Second, data can be retained essentially forever. Third, the telemetry data is retained in a durable environment, outside the blast radius of any of the ephemeral assets managed by Kubernetes.

Let’s take a look at the guts of how data gets from Istio into Circonus. Istio’s adapter framework exposes a number of methods which are available to adapter developers. The HandleMetric() method is called for a set of metric instances generated from each request that Istio handles. In our operator configuration, we can specify the metrics that we want to act on, and their types:

spec:
  # HTTPTrap url, replace this with your account submission url
  submission_url: "https://trap.noit.circonus.net/module/httptrap/myuuid/mysecret"
  submission_interval: "10s"
  metrics:
  - name: requestcount.metric.istio-system
    type: COUNTER
  - name: requestduration.metric.istio-system
    type: DISTRIBUTION
  - name: requestsize.metric.istio-system
    type: GAUGE
  - name: responsesize.metric.istio-system
    type: GAUGE

Here we configure the Circonus handler with a submission URL for an HTTPTrap check and an interval at which to send metrics. In this example, we specify four metrics to gather, and their types. Notice that we collect the requestduration metric as a DISTRIBUTION type, which will be processed as a histogram in Circonus. This retains fidelity over time, as opposed to averaging that metric, or calculating a percentile before recording the value (both of those techniques lose the value of the signal).

The HandleMetric() method is then called for each request that Istio handles, with the metric instances we have specified. Let’s take a look at the code:

// HandleMetric submits metrics to Circonus via circonus-gometrics
func (h *handler) HandleMetric(ctx context.Context, insts []*metric.Instance) error {

    for _, inst := range insts {

        metricName := inst.Name
        metricType := h.metrics[metricName]

        switch metricType {

        case config.GAUGE:
            value, _ := inst.Value.(int64)
            h.cm.Gauge(metricName, value)

        case config.COUNTER:
            h.cm.Increment(metricName)

        case config.DISTRIBUTION:
            value, _ := inst.Value.(time.Duration)
            h.cm.Timing(metricName, float64(value))
        }

    }
    return nil
}

Here we can see that HandleMetric() is called with a Mixer context, and a set of metric instances. We iterate over each instance, determine its type, and call the appropriate circonus-gometrics method. The metric handler contains a circonus-gometrics object which makes submitting the actual metric trivial to implement in this framework. Setting up the handler is a bit more complex, but still not rocket science:

// Build constructs a circonus-gometrics instance and sets up the handler
func (b *builder) Build(ctx context.Context, env adapter.Env) (adapter.Handler, error) {

    bridge := &logToEnvLogger{env: env}

    cmc := &cgm.Config{
        CheckManager: checkmgr.Config{
            Check: checkmgr.CheckConfig{
                SubmissionURL: b.adpCfg.SubmissionUrl,
            },
        },
        Log:      log.New(bridge, "", 0),
        Debug:    true, // enable [DEBUG] level logging for env.Logger
        Interval: "0s", // flush via ScheduleDaemon based ticker
    }

    cm, err := cgm.NewCirconusMetrics(cmc)
    if err != nil {
        err = env.Logger().Errorf("Could not create NewCirconusMetrics: %v", err)
        return nil, err
    }

    // create a context with cancel based on the istio context
    adapterContext, adapterCancel := context.WithCancel(ctx)

    env.ScheduleDaemon(
        func() {

            ticker := time.NewTicker(b.adpCfg.SubmissionInterval)

            for {
                select {
                case <-ticker.C:
                    cm.Flush()
                case <-adapterContext.Done():
                    ticker.Stop()
                    cm.Flush()
                    return
                }
            }
        })

    // map each configured metric instance name to its type
    metrics := make(map[string]config.Params_MetricInfo_Type)
    ac := b.adpCfg
    for _, adpMetric := range ac.Metrics {
        metrics[adpMetric.Name] = adpMetric.Type
    }

    return &handler{cm: cm, env: env, metrics: metrics, cancel: adapterCancel}, nil
}

Mixer provides a builder type which we defined the Build method on. Again, a Mixer context is passed, along with an environment object representing Mixer’s configuration. We create a new circonus-gometrics object, and deliberately disable automatic metrics flushing. We do this because Mixer requires us to wrap all goroutines in their panic handler with the env.ScheduleDaemon() method. You’ll notice that we’ve created our own adapterContext via context.WithCancel. This allows us to shut down the metrics flushing goroutine by calling h.cancel() in the Close() method handler provided by Mixer. We also want to send any log events from CGM (circonus-gometrics) to Mixer’s log. Mixer provides an env.Logger() interface which is based on glog, but CGM uses the standard Golang logger. How did we resolve this impedance mismatch? With a logger bridge, any logging statements that CGM generates are passed to Mixer.

// logToEnvLogger converts CGM log package writes to env.Logger()
func (b logToEnvLogger) Write(msg []byte) (int, error) {
    if bytes.HasPrefix(msg, []byte("[ERROR]")) {
        b.env.Logger().Errorf(string(msg))
    } else if bytes.HasPrefix(msg, []byte("[WARN]")) {
        b.env.Logger().Warningf(string(msg))
    } else if bytes.HasPrefix(msg, []byte("[DEBUG]")) {
        b.env.Logger().Infof(string(msg))
    } else {
        b.env.Logger().Infof(string(msg))
    }
    return len(msg), nil
}

For the full adapter codebase, see the Istio github repo here.

Enough with the theory though, let’s see what this thing looks like in action. I set up a Google Kubernetes Engine deployment, loaded a development version of Istio with the Circonus adapter, and deployed the sample BookInfo application that is provided with Istio. The image below is a heatmap of the distribution of request durations from requests made to the application. You’ll notice the histogram overlay for the time slice highlighted. I added an overlay showing the median, 90th, and 95th percentile response times; it’s possible to generate these at arbitrary quantiles and time windows because we store the data natively as log linear histograms. Notice that the median and 90th percentile are relatively fixed, while the 95th percentile tends to fluctuate over a range of a few hundred milliseconds. This type of deep observability can be used to quantify the performance of Istio itself over versions as it continues its rapid growth. Or, more likely, it will be used to identify issues within the application deployed. If your 95th percentile isn’t meeting your internal Service Level Objectives (SLO), that’s a good sign you have some work to do. After all, if 1 in 20 users is having a sub-par experience on your application, don’t you want to know about it?

That looks like fun, so let’s lay out how to get this stuff up and running. First thing we’ll need is a Kubernetes cluster. Google Kubernetes Engine provides an easy way to get a cluster up quickly.

There are a few other ways documented in the Istio docs if you don’t want to use GKE, but these are the notes I used to get up and running. I used the gcloud command line utility as follows after deploying the cluster in the web UI.

# set your zones and region
$ gcloud config set compute/zone us-west1-a
$ gcloud config set compute/region us-west1

# create the cluster
$ gcloud alpha container cluster create istio-testing --num-nodes=4

# get the credentials and put them in kubeconfig
$ gcloud container clusters get-credentials istio-testing --zone us-west1-a --project istio-circonus

# grant cluster admin permissions
$ kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value core/account)

Poof, you have a Kubernetes cluster. Let’s install Istio – refer to the Istio docs

# grab Istio and setup your PATH
$ curl -L https://git.io/getLatestIstio | sh -
$ cd istio-0.2.12
$ export PATH=$PWD/bin:$PATH

# now install Istio
$ kubectl apply -f install/kubernetes/istio.yaml

# wait for the services to come up
$ kubectl get svc -n istio-system

Now set up the sample BookInfo application

# Assuming you are using manual sidecar injection, use `kube-inject`
$ kubectl apply -f <(istioctl kube-inject -f samples/bookinfo/kube/bookinfo.yaml)

# wait for the services to come up
$ kubectl get services

If you are on GKE, you’ll need to set up the gateway and firewall rules

# get the worker address
$ kubectl get ingress -o wide

# get the gateway url
$ export GATEWAY_URL=<workerNodeAddress>:$(kubectl get svc istio-ingress -n istio-system -o jsonpath='{.spec.ports[0].nodePort}')

# add the firewall rule
$ gcloud compute firewall-rules create allow-book --allow tcp:$(kubectl get svc istio-ingress -n istio-system -o jsonpath='{.spec.ports[0].nodePort}')

# hit the url - for some reason GATEWAY_URL is on an ephemeral port, use port 80 instead
$ wget http://<workerNodeAddress>/<productpage>

The sample application should be up and running. If you are using Istio 0.3 or earlier, you’ll need to install the docker image we built with the Circonus adapter embedded.

Load the Circonus resource definition (you only need to do this with Istio 0.3 or earlier). Save this content as circonus_crd.yaml:

kind: CustomResourceDefinition
apiVersion: apiextensions.k8s.io/v1beta1
metadata:
  name: circonuses.config.istio.io
  labels:
    package: circonus
    istio: mixer-adapter
spec:
  group: config.istio.io
  names:
    kind: circonus
    plural: circonuses
    singular: circonus
  scope: Namespaced
  version: v1alpha2

Now apply it:

$ kubectl apply -f circonus_crd.yaml

Edit the Istio deployment to pull in the Docker image with the Circonus adapter build (again, not needed if you’re using Istio v0.4 or greater)

$ kubectl -n istio-system edit deployment istio-mixer

Change the image for the Mixer binary to use the istio-circonus image:

image: gcr.io/istio-circonus/mixer_debug
imagePullPolicy: IfNotPresent
name: mixer

Ok, we’re almost there. Grab a copy of the operator configuration, and insert your HTTPTrap submission URL into it. You’ll need a Circonus account to get that; just sign up for a free account if you don’t have one and create an HTTPTrap check.

Now apply your operator configuration:

$ istioctl create  -f circonus_config.yaml

Make a few requests to the application, and you should see the metrics flowing into your Circonus dashboard! If you run into any problems, feel free to contact us at the Circonus labs slack, or reach out to me directly on Twitter at @phredmoyer.

This was a fun integration; Istio is definitely on the leading edge of Kubernetes, but it has matured significantly over the past few months and should be considered ready to use to deploy new microservices. I’d like to extend thanks to some folks who helped out on this effort. Matt Maier is the maintainer of Circonus gometrics and was invaluable on integrating CGM within the Istio handler framework. Zack Butcher and Douglas Reid are hackers on the Istio project, and a few months ago gave an encouraging ‘send the PRs!’ nudge when I talked to them at a local meetup about Istio. Martin Taillefer gave great feedback and advice during the final stages of the Circonus adapter development. Shriram Rajagopalan gave a hand with the CircleCI testing to get things over the finish line. Finally, a huge thanks to the team at Circonus for sponsoring this work, and the Istio community for their welcoming culture that makes this type of innovation possible.


Hosts, Metrics, and Pricing, Oh My!

As the number and types of monitoring vendors have risen over the past several years, so have the pricing models. With the rise of cloud based ephemeral systems, often running alongside statically provisioned infrastructure, understanding the monitoring needs of one’s systems can be a challenging task for the most seasoned engineering and operations practitioners. In this post, we’ll take a look at a few of the current trends in the industry and why metrics-based pricing is the future of commercial monitoring purchasing.

For the first decade of the 21st century, rack mounted servers dominated the footprint of web infrastructure. Operators used the number of hosts as the primary guide for infrastructure provisioning and capacity planning. Popular open source monitoring systems of the era, such as Nagios and Zabbix, reflected this paradigm; the UI was oriented around host based management. Commercial monitoring systems of the time followed this pattern; pricing was a combination of the number of hosts and resources, such as CPUs. Figuring out how much a commercial solution would cost was a relatively straightforward calculation: take the number of hosts/resources, and multiply it by the cost of each.

Post 2010, two major disruptions to this host based pricing model surfaced in the industry. The first was Amazon’s Elastic Compute Cloud (EC2), which had been growing in usage since 2006. Ephemeral cloud systems, such as AWS, GCP, and Azure, are now the preferred infrastructure choice. Services now span many individual hosts (or containers), so what may have been appropriate for deployment to one host 10 years ago is now a composition of dozens or hundreds. In these situations, host based pricing makes cost forecasting for infrastructure monitoring solutions much more complicated. One need only be familiar with AWS auto-scaling or K8s cluster redeployment to shudder at the implications for host based monitoring system costs.

The second major disruption post 2010 was Etsy’s statsd, which introduced easy to implement application-based metrics, and in large part gave rise to the rapid ascent of monitoring culture over the last several years. Now one can instrument an application and collect thousands (if not millions) of telemetry sources from a distributed application and monitor its health in real time. The implication this has for statically-based infrastructure is that now a single host can source orders of magnitude more metrics than just host-based system metrics. Host-based vendors have responded to this by including only a small number of metrics per host; this represents an additional revenue opportunity for them, and an additional cost-based headache for operators.


As a result of these disruptions, metrics-based pricing has emerged as a solution which gives systems operators, engineers, and cost- and capacity-conscious executives a way to address the current monitoring needs of their applications. The question is no longer “how many hosts do we have,” but “how many things (metrics) do we need to monitor.” As the answer to the “how many” metrics question is also ephemeral, it is important that modern monitoring solutions also answer this question in terms of how many concurrent metrics are being collected. This is an infrastructure invariant approach that scales from bare metal to containers to serverless applications.

Storage is cheap; concurrency is not. Does your monitoring solution let you analyze your legacy metrics without charging you to keep them stored, or do you pay an ever increasing cost for all of the metrics you’ve ever collected?

At Circonus, we believe that an active-only metrics-based pricing model is the fairest way to deliver value to our customers, while giving them the power to scale up or down as needed over time. You can archive metrics while still retaining the ability to perform historical analyses with them. Our pricing model gives you the flexibility to follow your application’s monitoring needs, so that you can adapt to the ever-changing trends in application infrastructure.

As the world turns, so do our understanding of, and expectations for, systems. At the end of the day, pricing models need to make sense for buyers, and not all buyers are at the same point on the technology adoption curve. Generally, active-only metrics pricing is the humane approach, but our model is user-friendly and adaptable. Don’t be surprised if we introduce some more options to accommodate customers that just can’t make this model fit the way they run their technical infrastructure.

–Fred Moyer is a Developer Evangelist at Circonus

Twitter: @phredmoyer
