System Monitoring with the USE Dashboard

The USE method was developed by Brendan Gregg to study performance problems in a systematic way.1 It provides a simple, top-down approach to identify bottlenecks quickly and reliably. The gathering of performance metrics is usually done using a myriad of different system tools (sar, iostat, vmstat) and tracers (dtrace, perf, ebpf). Circonus has now added the most relevant USE metrics to its monitoring agent, and conveniently presents them in the form of a USE Dashboard (Figure 1, live demo), which allows a full USE analysis at a single glance.

Figure 1: The USE System Dashboard. Click to view a live demo.

Outline

  1. System Monitoring
  2. The USE Method
  3. Performance Metrics
    3.1 CPU / Scheduler Monitoring
    3.2 Memory Management
    3.3 Network Monitoring
    3.4 Disks and File Systems
  4. The USE Dashboard

1. System Monitoring

System monitoring is concerned with monitoring basic system resources: CPU, Memory, Network and Disks. These computational resources are not consumed directly by the application. Instead, the Operating System abstracts and manages these resources and provides a consistent abstracted API to the application (“process environment”, system calls, etc). Figure 2 illustrates the high level architecture. A more detailed version can be found in Gregg.2, 3

Figure 2: High-level System Overview

 

One critical objective of system monitoring is to check how the available resources are utilized. Typical questions are: Is my CPU fully utilized? Is my application running out of memory? Do we have enough disk capacity left?

While a fully utilized resource is an indication of a performance bottleneck, it might not be a problem at all. A fully utilized CPU means only that we are making good use of the system. It starts causing problems only when incoming requests start queuing up or producing errors, and hence the performance of the application is impacted. But queuing does not only occur in the application layer. Modern software stacks use queuing in all system components to improve performance and distribute load. The degree to which a resource has extra work that it can not service is called saturation,3, p42 and is another important indicator for performance bottlenecks.

2. The USE Method

The USE method, by Gregg, is an excellent way to identify performance problems quickly. It uses a top-down approach to summarize the system resources, which ensures that every resource is covered. Other approaches suffer from a “street light syndrome,” in that the focus lies on parts of the system where metrics are readily available. In other cases, random changes are applied in the hope that the problems go away.

The USE method can be summarized as follows:

For each resource, check utilization, saturation, and errors.

The USE analysis is started by creating an exhaustive list of the resources that are consumed by the application. The four resource types mentioned above are the most important ones, but there are more resources, like the I/O bus, memory bus, and network controllers, that should be included in a thorough analysis. For each resource, errors should be investigated first, since they impact performance and might not be noticed immediately when the failure is recoverable. Then, utilization and saturation are checked.

For more details about the USE method and its application to system performance analysis, the reader is referred to the excellent book by Gregg.3

3. Performance Metrics

It’s not immediately clear how utilization, saturation, and errors can be quantified for different system resources. Fortunately, Gregg has compiled a sensible list of indicators2 that are available on Linux systems. We have taken this list as our starting point to define a set of USE metrics for monitoring systems with Circonus. In this section, we will go over each of them and explain their significance.

3.1. CPU Monitoring

Utilization Metrics:

  1. cpu`idle — time spent in the idle thread and not waiting on I/O (yellow)
  2. cpu`wait_io — time spent in the idle thread while waiting on IO (dark yellow)
  3. cpu`user — time spent executing user code (blue)
  4. cpu`system + cpu`intr — time spent executing OS-kernel code (dark blue)

These metrics should give you a rough idea of what the CPU was doing during the last reporting period (one minute). Blue colors represent time the system spent doing work; yellow colors represent time the system spent waiting.

Like all other monitoring tools we are aware of, we derive these values from `/proc/stat`.4 However, they should be taken with the necessary caution:5, 6 these counters do not usually add up to 100%, which is already a bad sign. Also, accounting takes place in units of full CPU time slices (jiffies) at the time of the clock interrupt, so the many voluntary context switches that happen within a single time slice are missed. Tickless kernels and varying clock speeds (Intel Turbo Boost/Turbo Step) further blur the meaning of these metrics. For the subtleties of accounting wait_io on multiprocessor systems, see A. Veithen.11

Nevertheless, these metrics are used everywhere to measure CPU utilization, and they have proven to be a valuable first source of information. We hope to replace them with more precise metrics in the future.

There are some differences from Gregg:2 There, `vmstat “us” + “sy” + “st”` is the suggested utilization metric. We account steal time (“st”) as idle.
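To make this concrete, here is a minimal sketch of how such percentages can be derived from two samples of `/proc/stat` (this is not the agent’s implementation; the one-second sampling interval and the grouping of fields are illustrative choices, and steal time is folded into idle as described above):

    import time

    def read_cpu_ticks():
        # aggregate "cpu" line of /proc/stat:
        # user nice system idle iowait irq softirq steal ...
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:]
        user, nice, system, idle, iowait, irq, softirq, steal = map(int, fields[:8])
        return {"user": user + nice,
                "system": system + irq + softirq,
                "idle": idle + steal,   # steal accounted as idle, as described above
                "wait_io": iowait}

    before = read_cpu_ticks()
    time.sleep(1)
    after = read_cpu_ticks()
    delta = {k: after[k] - before[k] for k in after}
    total = sum(delta.values()) or 1
    for name, ticks in delta.items():
        print(f"{name:8s} {100.0 * ticks / total:5.1f}%")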

Saturation Metrics:

  1. cpu`procs_runnable (normalized, corrected) — runnable processes per CPU (purple)
  2. loadavg`1 (normalized) — one-minute load average per CPU (dark purple)
  3. Saturation threshold guide at value one. (black)

The CPU keeps runnable processes waiting on a queue. The length of this queue at any given point in time is a good metric for CPU saturation. Unfortunately, this number is not directly exposed by the kernel. Via `/proc/stat` we get the total number of runnable processes (`procs_running`), which includes queued processes as well as currently running processes. We report this number as the `procs_runnable` metric. While this is a straightforward measure, it suffers from the low sampling interval (only once a minute) and from an observation bias: the Circonus agent NAD currently lists 3 running processes on an idle system. We account for these problems by (a) subtracting the processes run by the monitoring agent (3) and (b) dividing by the number of CPU cores on the system. In this way, a value of one, which is marked with the guide, represents a system that has one runnable process per core.

The load average is a smoothed-out version of the procs_runnable metric maintained by the kernel. It is typically sampled every 5 seconds and aggregated using an exponential smoothing algorithm.7 Recent kernel versions spend a lot of effort on maintaining a meaningful load average across systems with a high number of CPUs and tickless kernels. This metric is divided by the number of CPU cores as well.

While 1min, 5min, and 15min load averages are maintained by the kernel, we only show the 1min average, since the others don’t provide any added value when plotted over time.

Both collected metrics are similar in their interpretation. If the value of either of these is larger than one (the guide) you have processes queuing for CPU time on the machine.
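The sketch below illustrates the normalization for both saturation metrics; the agent-process count of 3 mirrors the NAD example above and is an assumption rather than a universal constant:

    import os

    AGENT_PROCESSES = 3   # processes attributed to the monitoring agent (assumption)

    def procs_runnable_normalized():
        # procs_running in /proc/stat: running plus queued processes at sampling time
        with open("/proc/stat") as f:
            procs_running = next(int(line.split()[1])
                                 for line in f if line.startswith("procs_running"))
        return max(procs_running - AGENT_PROCESSES, 0) / (os.cpu_count() or 1)

    def loadavg1_normalized():
        # one-minute load average, divided by the number of cores
        return os.getloadavg()[0] / (os.cpu_count() or 1)

    print(procs_runnable_normalized(), loadavg1_normalized())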

Differences from Gregg:2 In Gregg’s method, only the procs_runnable statistic is considered. We think load averages are valuable as well.

Error Metrics:
CPU error metrics are hard to come by. If CPU performance counters are available (often not the case on virtualized hardware) perf(1) can be used to read them out. At the moment, Circonus does not provide CPU error metrics.

3.2. Memory Management

This section is concerned with the memory capacity resource. The bandwidth of the memory interconnect is another resource that can be worth analyzing, but it is much harder to get.

Utilization Metrics:

  1. vm`meminfo`Buffers — file system buffers (darker yellow)
  2. vm`meminfo`Cached — file-system caches (dark yellow)
  3. vm`meminfo`MemFree — free memory (yellow)
  4. vm`meminfo`Used (computed) — memory used by the application (blue)

The OS uses memory that is not utilized by the application for caching file system content. These memory pages can be reclaimed by the application as needed, and are usually not a problem for system performance.
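The `Used` value is computed rather than read directly. A minimal sketch of one common way to derive it from `/proc/meminfo` (the agent’s exact formula may differ):

    def meminfo():
        values = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                values[key] = int(rest.split()[0])   # values are reported in kB
        return values

    m = meminfo()
    used_kb = m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
    print(f"Used: {used_kb} kB, Free: {m['MemFree']} kB, "
          f"Buffers: {m['Buffers']} kB, Cached: {m['Cached']} kB")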

Saturation Metrics:

  1. vm`info`pg_scan — the number of memory pages scanned per second (purple)
  2. vm`swap`used — usage of swap devices (purple, zero in the example)

When the free memory is close to exhausted, the system begins freeing memory from buffers and caches, or begins moving pages to a swap partition on disk (if present). The page scanner is responsible for identifying suitable memory pages to free. Hence, scanning activity is an indicator for memory saturation. A growing amount of swap space is also an indicator for saturated memory.
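Both indicators can be sampled from procfs. A rough sketch follows; the pgscan counters are cumulative, so the per-second metric is the difference between two samples, and summing all pgscan_* counters into a single figure is a simplification:

    def read_counters(path):
        counters = {}
        with open(path) as f:
            for line in f:
                parts = line.replace(":", "").split()
                counters[parts[0]] = int(parts[1])
        return counters

    vmstat = read_counters("/proc/vmstat")      # page scanner activity (cumulative)
    meminfo = read_counters("/proc/meminfo")    # swap usage, in kB

    pages_scanned = sum(v for k, v in vmstat.items() if k.startswith("pgscan"))
    swap_used_kb = meminfo["SwapTotal"] - meminfo["SwapFree"]
    print(pages_scanned, swap_used_kb)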

When the system has neither memory nor swap space left, it must free memory by force. Linux does this by killing applications that consume too much memory. When this happens, we have an OOM (“out of memory”) event, which is logged to dmesg.

Differences from Gregg:2 We are missing metrics for swapping (i.e. anonymous paging) and OOM events.

Errors:
Physical memory failures are logged to dmesg. Failed malloc(3) can be detected using SystemTap. We don’t have any metrics for either of them at the moment.

3.3. Network Monitoring

Utilization Metrics:

  1. if`<interface>`in_bytes — inbound network throughput in bytes/sec
  2. if`<interface>`out_bytes — outbound network throughput in bytes/sec

The network utilization can be measured as throughput divided by the bandwidth (maximal throughput) of each network interface. A full-duplex interface is fully utilized if either inbound or outbound throughput exhausts the available bandwidth. For half-duplex interfaces, the sum of inbound and outbound throughput is the relevant metric to consider.

For graphing throughput we use a logarithmic scale, so that a few kb/sec remain visibly distinct from the x-axis, and set the y-limit to the available bandwidth. The available bandwidth is often not exposed by virtual hardware; in this case, we don’t set a y-limit.
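For illustration, the sketch below derives throughput from `/proc/net/dev` deltas and, where a link speed is exposed under `/sys/class/net/<interface>/speed`, a utilization ratio. The interface name and the one-second interval are placeholders; as noted above, virtual interfaces often report no speed, in which case no utilization is returned:

    import time

    def interface_bytes(iface):
        # /proc/net/dev: receive bytes is the 1st, transmit bytes the 9th counter
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])
        raise ValueError(f"interface {iface} not found")

    def throughput_and_utilization(iface, interval=1.0):
        rx0, tx0 = interface_bytes(iface)
        time.sleep(interval)
        rx1, tx1 = interface_bytes(iface)
        in_bps = (rx1 - rx0) * 8 / interval
        out_bps = (tx1 - tx0) * 8 / interval
        try:
            with open(f"/sys/class/net/{iface}/speed") as f:
                speed_mbit = int(f.read())
        except (OSError, ValueError):
            speed_mbit = -1
        if speed_mbit <= 0:
            return in_bps, out_bps, None   # bandwidth unknown, no utilization
        # full duplex: utilization is the larger of the two directions
        return in_bps, out_bps, max(in_bps, out_bps) / (speed_mbit * 1_000_000)

    print(throughput_and_utilization("eth0"))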

Saturation Metrics:

  1. tcp`segments_retransmitted (purple)
  2. if`<interface>`{in,out}_drop — dropped packets due to full buffers (purple, zero)
  3. if`<interface>`{in,out}_fifo_overrun — dropped packets due to full buffers (purple, zero)

Network saturation is hard to measure. Ideally, we would like to know how many packets are queued in the send/receive buffers, but these statistics do not seem to be exposed via /proc. Instead, we have to settle for indirect indicators that are available, such as TCP-level retransmits as well as drop and overrun counts.
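For example, the cumulative TCP retransmit counter can be read from `/proc/net/snmp`; the saturation indicator is its rate, i.e. the delta between two samples. A rough sketch:

    def tcp_retrans_segs():
        # /proc/net/snmp: a "Tcp:" header line followed by a "Tcp:" value line
        with open("/proc/net/snmp") as f:
            tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
        header, values = tcp_lines[0], tcp_lines[1]
        return int(values[header.index("RetransSegs")])

    print(tcp_retrans_segs())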

Error Metrics:

  1. if`<interface>`{in,out}_errors — the number of errors that occurred, e.g. invalid packets received

All metrics covered in this section should be taken with a great deal of caution. The Linux networking stack is a large, undocumented codebase with a lot of known bugs in its accounting mechanisms; see Damato.8

3.4. Disks and File Systems

Utilization Metrics:

  1. diskstats`<device>`io_ms — the percentage of time the device has been utilized during the last reporting interval

The disk utilization is measured per device and not per file system. We simply record the percentage of time the device was busy during the last reporting period. This metric is read from `/proc/diskstats`.
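A minimal sketch of this computation; Field 10 of `/proc/diskstats` counts the milliseconds the device spent doing I/O, and the device name and one-second interval are placeholders:

    import time

    def io_ms(device):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    return int(fields[12])   # Field 10, counted after major/minor/name
        raise ValueError(f"device {device} not found")

    def util_percent(device, interval=1.0):
        before = io_ms(device)
        time.sleep(interval)
        after = io_ms(device)
        return 100.0 * (after - before) / (interval * 1000.0)

    print(util_percent("sda"))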

Saturation Metrics:

  1. diskstats`<device>`io_ms_weighted — average number of inflight I/O operations

This metric is exposed under the name “avgqu-sz” by `sar -d` and `iostat -x`, and measures the number of requests that are currently queued or being processed by the disk. The metric is derived from the weighted_io_ms counter in `/proc/diskstats` and involves a little bit of mathematics.9

Error Metrics:
According to Gregg,2 disk errors can be found in /sys/devices/<>/ioerr_cnt, via smartctl, or by “tracing of the IO subsystem”. However, in the (virtualized) systems we used for testing, these metrics were not exposed. Hence, at the moment, we don’t include any disk error metrics.

4. The USE Dashboard

The USE Dashboard, shown in Figure 1 above, combines all metrics discussed above in a single dashboard for each host.

Each row corresponds to a resource type, and each column contains an analysis dimension: Utilization, Saturation, and Errors. To perform a USE Performance analysis as outlined in Gregg,1, 2 you would traverse the dashboard line by line, and check:

  1. Are errors present? => Graphs in Column 3 are non-zero.
  2. Is saturation present? => Graphs in Column 2 show large values over an extended time period.
  3. Is the resource fully utilized? => Graphs in Column 1 are maxed out.

The graphs are organized in such a way that all these checks can be done at a single glance.

Discussion

The USE Dashboard allows a rapid performance analysis of a single host. Instead of ssh-ing into a host and collecting statistics from a number of system tools (vmstat, iostat, sar), we get all relevant numbers together with their historical context in a single dashboard. We found this visualization valuable to judge the utilization of our own infrastructure, and as a first go-to source for analyzing performance problems.

However, the work on this dashboard is far from complete. First, we are lacking error metrics for the CPU, Memory, and Disk resources. We have included text notes in the USE dashboard on how to get them manually, to make those blind spots “known unknowns”. Another area that needs substantial work is the quality of the measurements: as explained above, even basic metrics like CPU utilization have large margins of error and conceptual weaknesses in the measurement methodology. Also, we did not cover all resources that a system provides to the application. E.g., we don’t have metrics about the file system, network controllers, or the I/O and memory interconnects, and even basic metrics about the physical CPU and memory resources are missing (instructions per second, memory ECC events). Figure 3 below positions the covered resources in the high-level system overview presented in the introduction.

Figure 3: Metrics displayed in the USE dashboard per resource

Try it yourself!

To try the USE Dashboard for yourself, log into your Circonus account and click “Checks” > “+ New Host” to provision a new host with a single command. The USE Dashboard will be created automatically; a link is provided at the command line.

No Circonus account, yet? Get one for free here!

References

  1. Gregg – Thinking Methodically about Performance, CACM 2013
  2. Gregg – Use Method: Linux Performance Checklist
  3. Gregg – Systems Performance, Prentice Hall 2014
  4. https://www.kernel.org/doc/Documentation/filesystems/proc.txt
  5. https://github.com/torvalds/linux/blob/master/Documentation/cpu-load.txt#L20-L22
  6. Gregg – Broken Linux Performance Tools, Scale14x (2016)
  7. https://en.wikipedia.org/wiki/Load_(computing)
  8. Damato – All of your network monitoring is probably wrong
  9. Hartmann – Monitoring Queue Sizes
  10. https://github.com/iovisor/bcc
  11. Andreas Veithen – The precise meaning of IO wait on Linux (blog)


Monitoring Queue Sizes

At Circonus we are collecting a number of metrics about block devices. An interesting one is the average length of the request queue, which is exposed by iostat[1]:

aqu-sz
  The average queue length of the requests that were issued to the device.
  Note: In previous versions, this field was known as avgqu-sz.

To collect “aqu-sz” as a metric we don’t want to fire up iostat on each collection interval. So we needed to understand how this value is derived from statistics exposed by the operating system. It turns out that there is an interesting bit of mathematics involved in deriving this value. Here is how it works.

It was already noted by Baron Schwartz in 2010[2] that the calculations behind iostat are interesting. This post can be seen as a formalized elaboration.

Step 1: Checking the Implementation

The first step in understanding the iostat calculation of aqu-sz is to check the implementation (source):

/* aqu-sz */
    cprintf_f(NO_UNIT, 1, 6, 2, S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0);

Here “S_VALUE” is defined (source) as:

#define S_VALUE(m,n,p)		(((double) ((n) - (m))) / (p) * HZ)

and “rq_ticks” is Field 11 from /proc/diskstats (source).

Digging around in the source code a little more, you will be able to verify for yourself that the following
formula is used to calculate aqu-sz:

$$ aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} $$

where $t_0, t_1$ are the measurement timestamps in ms, and $F_{11}(t)$ is the value of Field 11 at time $t$.

So, the average queue size can be calculated as a slope or discrete derivative of a mystical field from /proc/diskstats. Interesting!
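The same calculation is easy to reproduce without firing up iostat. A minimal sketch, where the device name and the five-second sampling interval are placeholders and Field 11 is the 14th column of `/proc/diskstats`:

    import time

    def weighted_io_ms(device):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    return int(fields[13])   # Field 11, counted after major/minor/name
        raise ValueError(f"device {device} not found")

    t0, f0 = time.time(), weighted_io_ms("sda")
    time.sleep(5)
    t1, f1 = time.time(), weighted_io_ms("sda")
    aqu_sz = (f1 - f0) / ((t1 - t0) * 1000.0)
    print(f"aqu-sz over the last {t1 - t0:.0f}s: {aqu_sz:.2f}")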

Step 2: Checking the kernel documentation

Let’s see how that works. The kernel documentation[3] says the following:

Field 11 -- weighted # of milliseconds spent doing I/Os
  This field is incremented at each I/O start, I/O completion, I/O
  merge, or read of these stats by the number of I/Os in progress
  (field 9) times the number of milliseconds spent doing I/O since the
  last update of this field.  This can provide an easy measure of both
  I/O completion time and the backlog that may be accumulating.

So:

$$
F_{11}(t_1) = F_{11}(t_0) + (\text{I/Os in progress at time $t_1$}) \times (\text{ms spent doing I/O in $t_0,t_1$})
$$

Where $t_0$ is the time of the last update, and $t_1$ is the current time.

So if we want to calculate the value of $F_{11}$ at a time $t$, we have to recursively apply this rule until the beginning of time (boot):

$$
F_{11}(t) = \sum_{t_i} (\text{I/Os in progress at time $t_i$}) \times (\text{ms spent doing I/O in $t_{i-1},t_{i}$})
$$

where $t_i$ runs through the times of every I/O event in the system. For the mathematical reader, this sum has a very familiar form: it sums function values times small time intervals. This is precisely how integration over time works. So we are inclined to compare it with:

$$
F_{11}(T) \overset{?}{\leftrightarrow} \int_0^T (\text{number of I/Os in progress at time t}) \text{dt}
$$

This is not quite right, since we should only sum over “I/O” time, and it’s not clear that the sum can be replaced by an integral, but it points us in the right direction. Assuming this were the case, we would get the following expression for `avgqz`:

$$
avgqz \overset{?}{\leftrightarrow}  \frac{\int_{t_0}^{t_1} (\text{I/Os in progress at time t}) \text{dt}}{t_1 - t_0}.
$$

This is a very common way to express a mean value of a function. In this case, we average the number of I/Os in progress, and not the queue length, but that kind of imprecise wording is commonly found in man pages.

So all this would make sense, if only Field 11 accounted for the total time rather than the time “spent doing I/O”, and if we could replace the sum by an integral!

Step 3: A mathematical model

In order to make the above idea precise, let’s introduce some notation:

\begin{align*}
ios(t) &= \text{(number of I/O operations in progress at time t)} \\
busy(t) &= \text{(one if the device is processing IO at time t, zero otherwise)}
\end{align*}

Here, again $t$ denotes system uptime in ms. To warm up, let’s first look at the integral:

$$
Busy(T) = \int_0^T busy(t) \text{dt} = \int_{t : busy(t) = 1} 1 \text{dt}
$$

This integral measures the total time the system was busy processing IO since boot. We can get the utilization of the system during a time interval $t_0 < t_1$ as:

$$
\frac{Busy(t_1) - Busy(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} busy(t) \text{dt}}{t_1-t_0} = \frac{\text{time spent doing IO}}{\text{total time}} = util(t_0, t_1)
$$

This is another very important statistic, which is reported by `iostat -x` under the “%util” column.

To get a mathematical description of Field 11, let’s try the following:

$$ F(T) := \int_0^T ios(t) busy(t) \text{dt} $$

In this way we have:

$$ F(t_1) - F(t_0) = \int_{t_0}^{t_1} ios(t) busy(t) \text{dt} $$

or, equivalently:

$$ F(t_1) = F(t_0) + \int_{t_0}^{t_1} ios(t) busy(t) \text{dt} $$

Now, if $ios(t)$ is constant during the time interval $t_0,t_1$, then:

$$ F(t_1) = F(t_0) + ios(t_1) \times \int_{t_0}^{t_1} busy(t) \text{dt} $$

Looking closely at this, we see that this is precisely the recursion formula for Field 11 from /proc/diskstats:

$$ F(t_1) = F(t_0) + (\text{IO ops in progress at time $t_1$}) \times (\text{number of ms spent doing I/O in $t_0,t_1$}) $$

And if $t_0,t_1$ are adjacent I/O events (start, complete, merge), the assumption that $ios(t)$ is constant in between is justified. Hence we see that $F_{11} = F$.

Step 4: This can’t be true!

Now that we have a firm grip on $F_{11}$, we can start examining the claimed formula for the average queue size:

$$ aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} ios(t) busy(t) \text{dt}}{t_1 - t_0} $$

Is this indeed a sensible measure for the average queue size?

It does not seem to be the case. Take, for example, a time interval where $ios(t) = 10$ but $busy(t) = 0$: the average queue size should be 10, but the integral evaluates to 0. Hence the above expression is zero, which is clearly not sensible.

Step 5: But is it really?

But is this really a valid example? If the two functions busy() and ios() were truly independent, then this condition could certainly occur. In the analogous case of CPU scheduling, cases like these, where there are runnable threads but the CPU is busy doing something else (interrupt handlers, hypervisor), can indeed happen.

But is this really the case for block IO scheduling? Another look at the documentation[3] reveals the following:

    Field  9 -- # of I/Os currently in progress
        The only field that should go to zero. Incremented as requests are
        given to appropriate struct request_queue and decremented as they finish.
    Field 10 -- # of milliseconds spent doing I/Os
        This field increases so long as field 9 is nonzero.

So, Field 9 is our function $ios(t)$, and Field 10 is actually our function $Busy(t)$! And they indeed have a relation:

> "[Busy(t)] increases as long as [ios(t)] is nonzero"

In other words $busy(t) = 1$ if $ios(t) > 0$ and $0$ otherwise!

Revisiting our definition of $F$, we find the following (explanations follow)

\begin{align*}
F(T) &:= \int_{t \in [0,T]} ios(t) busy(t) \text{dt} \\
&= \int_{t \in [0,T], busy(t) = 1} ios(t) busy(t) \text{dt} + \int_{t \in [0,T], busy(t) = 0} ios(t) busy(t) \text{dt} \\
&= \int_{t \in [0,T], busy(t) = 1} ios(t) \text{dt} + 0 \\
&= \int_{t \in [0,T], ios(t) > 0} ios(t) \text{dt} + \int_{t \in [0,T], ios(t) = 0} ios(t) \text{dt} \\ &= \int_{t \in [0,T]} ios(t) \text{dt}
\end{align*}

Let’s break that down. In the first step, we divide the integral into two parts, where $busy = 1$ and $busy = 0$; the part where $busy = 0$ evaluates to 0. In the next step we note that $busy = 0$ is equivalent to $ios = 0$, so we can replace the condition. Next we complete the integral by introducing another 0-summand: integrating over the region where $ios(t) = 0$, which evaluates to zero as well. Finally, we put things back together.

We see that for the calculation of Field 11, it does not matter if we include $busy(t)$ in the integrand. We end up with the same number whether we sum over “the number of milliseconds *spent doing I/O* since the last update of this field” or just “the number of milliseconds since the last update of this field”:

\begin{align*}
avgqz(t_0, t_1)
&= \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} \\
& \overset{!}{=} \frac{\int_{t_0}^{t_1} ios(t) \text{dt}}{t_1 - t_0} \\
&= \text{Integral-average over the number of IOs in progress within $t_0,t_1$.}
\end{align*}
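As a quick numerical sanity check of this identity, here is a tiny simulation over a made-up list of inter-event intervals. Dropping the busy() factor does not change the sum, and the result is the integral average of ios(t):

    # (duration in ms, I/Os in progress) between consecutive I/O events; made-up data
    intervals = [(10, 0), (5, 2), (20, 3), (15, 0), (8, 1), (42, 0)]

    f11   = sum(dt * ios * (1 if ios > 0 else 0) for dt, ios in intervals)
    plain = sum(dt * ios for dt, ios in intervals)
    total = sum(dt for dt, ios in intervals)

    assert f11 == plain                         # the busy() factor changes nothing
    print("average queue size:", f11 / total)   # integral average of ios(t)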

Conclusion

We have seen that the aqu-sz reported by iostat does indeed have the interpretation of an integral average over the number of IOs in progress. We find it interesting to see that calculus can be applied to various state-accounting metrics and clarify their relationships. Moreover, the ability to express integral averages as discrete derivatives of a counter ($F_{11}$) is another remarkable takeaway.

References

  1. man iostat
  2. Schwartz – How iostat computes metrics (2010)
  3. I/O statistics fields

 

Heinrich Hartmann is the Chief Data Scientist at Circonus. To learn more about data science for effective operations, sign up for in-person training with Heinrich at Velocity, Tuesday, 17 October & Wednesday, 18 October, 2017



Hosts, Metrics, and Pricing, Oh My!

As the number and types of monitoring vendors have risen over the past several years, so have the pricing models. With the rise of cloud based ephemeral systems, often running alongside statically provisioned infrastructure, understanding the monitoring needs of one’s systems can be a challenging task for the most seasoned engineering and operations practitioners. In this post, we’ll take a look at a few of the current trends in the industry and why metrics-based pricing is the future of commercial monitoring purchasing.

For the first decade of the 21st century, rack mounted servers dominated the footprint of web infrastructure. Operators used number of hosts as the primary guide for infrastructure provisioning and capacity planning. Popular open source monitoring systems of the era, such as Nagios and Zabbix, reflected this paradigm; the UI was oriented around host based management. Commercial based monitoring systems of the time followed this pattern; pricing was a combination of number of hosts and resources, such as CPUs. Figuring out how much a commercial solution would cost was a relatively straightforward calculation; take the number of hosts/resources, and multiply it by the cost of each.

Post 2010, two major disruptions to this host based pricing model surfaced in the industry. The first was Amazon’s Elastic Compute Cloud (EC2), which had been growing in usage since 2006. Ephemeral-based cloud systems, such as AWS, GCP, and Azure, are now the preferred infrastructure choice. Services utilize individual hosts (or containers), so what may have been appropriate for deployment to one host 10 years ago is now a composition of dozens or hundreds. In these situations, host based pricing makes cost forecasting for infrastructure monitoring solutions much more complicated. One need only be familiar with AWS auto-scaling or K8s cluster redeployment to shudder at the implications for host based monitoring system costs.

The second major disruption post 2010 was Etsy’s statsd, which introduced easy to implement application-based metrics, and in large part gave rise to the rapid ascent of monitoring culture over the last several years. Now one can instrument an application and collect thousands (if not millions) of telemetry sources from a distributed application and monitor its health in real time. The implication this has for statically-based infrastructure is that now a single host can source orders of magnitude more metrics than just host-based system metrics. Host-based vendors have responded to this by including only a small number of metrics per host; this represents an additional revenue opportunity for them, and an additional cost-based headache for operators.


As a result of these disruptions, metrics-based pricing has emerged as a solution which gives systems operators, engineers, and cost- and capacity-conscious executives a way to address the current monitoring needs of their applications. The question is no longer “how many hosts do we have,” but “how many things (metrics) do we need to monitor.” As the answer to the “how many” metrics question is also ephemeral, it is important that modern monitoring solutions also answer this question in terms of how many concurrent metrics are being collected. This is an infrastructure invariant approach that scales from bare metal to containers to serverless applications.

Storage is cheap; concurrency is not. Does your monitoring solution let you analyze your legacy metrics without charging you to keep them stored, or do you pay an ever increasing cost for all of the metrics you’ve ever collected?

At Circonus, we believe that an active-only metrics-based pricing model is the fairest way to deliver value to our customers, while giving them the power to scale up or down as needed over time. You can archive metrics while still retaining the ability to perform historical analyses with them. Our pricing model gives you the flexibility to follow your application’s monitoring needs, so that you can adapt to the ever-changing trends in application infrastructure.

As the world turns, so does our understanding of and expectations from systems. At the end of the day, pricing models need to make sense for buyers, and not all buyers are at the same point on the technology adoption curve. Generally, active-only metrics pricing is the humane approach, but our model is user-friendly and adaptable. Don’t be surprised if we introduce some more options to accommodate customers that just can’t make this model fit the way they run technical infrastructure.

–Fred Moyer is a Developer Evangelist at Circonus

Twitter: @phredmoyer


 

Monitoring as Code

Circonus has always been API-driven, and this has always been one of our product’s core strengths. Via our API, Circonus provides the ability to create anything that you can in the UI and more. With so many of our customers moving to API-driven platforms like AWS, DigitalOcean, Google Compute Engine (GCE), Joyent, Microsoft Azure, and even private clouds, we have seen the emergence of a variety of tools (like Terraform) that allows an abstraction of these resources. Now, with Circonus built into Terraform, it is possible to declaratively codify your application’s monitoring, alerting, and escalation, as well as the resources it runs on.

Terraform is a tool from HashiCorp for building, changing, and versioning infrastructure, which can be used to manage a wide variety of popular and custom service providers. Now, Terraform 0.9 includes an integration for managing Circonus.

These are a few key features of the Circonus Provider in Terraform:

  • Monitoring as Code – Alongside Infrastructure as Code. Monitoring (i.e. what to monitor, how to visualize, and when to alert) is described using the same high-level configuration syntax used to describe infrastructure. This allows a blueprint of your datacenter, as well as your business rules, to be versioned and treated as you would any other code. Additionally, monitoring can be shared and reused.
  • Execution Plans – Terraform has a “planning” step where it generates an execution plan of what will be monitored, visualized, and alerted on.
  • Resource Graphs – Terraform builds a graph of all your resources, and now can include how these resources are monitored, and parallelizes the creation and modification of any non-dependent resources.
  • Change Automation – Complex changesets can be applied to your infrastructure and metrics, visualizations, or alerts, which can all be created, deactivated, or deleted with minimal human interaction.

This last piece, Change Automation, is one of the most powerful features of the Circonus Provider in Terraform. Allocations and services can come and go (within a few seconds or a few days), and the monitoring of each resource dynamically updates accordingly.

While our larger customers were already running in programmatically defined and CMDB-driven worlds, our cloud computing customers didn’t share our API-centric view of the world. The lifecycle management of our metrics was either implicit, creating a pile of quickly outdated metrics, or ill-specified, in that they couldn’t make sense of the voluminous data. What was missing was an easy way to implement this level of monitoring across the incredibly complex SOA our customers were using.

Now when organizations ship and deploy an application to the cloud, they can also specify a definition of health (i.e. what a healthy system looks like). This runs side-by-side with the code that specifies the infrastructure supporting the application. For instance, if your application runs in an AWS Auto-Scaling Group and consumes RDS resources, it’s possible to unify the metrics, visualization, and alerting across these different systems using Circonus. Application owners now have a unified deployment framework that can measure itself against internal or external SLAs. With the Circonus Provider in Terraform 0.9, companies running on either public clouds or in private data centers can now programmatically manage their monitoring infrastructure.

As an API-centric service provider, Circonus has always worked with configuration management software. Now, in the era of mutable infrastructure, Terraform extends this API-centric view to provide ubiquitous coverage and consistency across all of the systems that Circonus can monitor. Terraform enables application owners to create a higher-level abstraction of the application, datacenter, and associated services, and present this information back to the rest of the organization in a consistent way. With the Circonus Provider, any Terraform-provisioned resource that can be monitored can be referenced such that there are no blind spots or exceptions. As an API-driven company, we’ve unfortunately seen blind-spots develop, but with Terraform these blind spots are systematically addressed, providing a consistent look, feel, and escalation workflow for application teams and the rest of the organization.

It has been and continues to be an amazing journey to ingest the firehose of data from ephemeral infrastructure, and this is our latest step toward servicing cloud-centric workloads. As industry veterans who remember when microservice architectures were simply called “SOA,” it is impressive to watch the rate at which new metrics are produced and the dramatic lifespan reduction for network endpoints. At Circonus, our first integration with some of the underlying technologies that enable a modern SOA came at HashiConf 2016. At that time we had nascent integrations with Consul, Nomad, and Vault, but in the intervening months we have added more and more to the product to increase the value customers can get from each of these industry accepted products:

  • Consul is the gold standard for service-discovery, and we have recently added a native Consul check-type that makes cluster management of services a snap.
  • Nomad is a performant, robust, and datacenter-aware scheduler with native Vault integration.
  • Vault can be used to secure, store, and control access to secrets in a SOA.

Each of these products utilizes our circonus-gometrics library. When enabled, Circonus-Gometrics automatically creates numerous checks and automatically enables metrics for all the available telemetry (automatically creating either histogram, text, or numeric metrics, given the telemetry stream). Users can now monitor these tools from a single instance, and have a unified lifecycle management framework for both infrastructure and application monitoring. In particular, how do you address the emergent DevOps pattern of separating the infrastructure management from the running of applications? Enter Terraform. With help from HashiCorp, we began an R&D experiment to investigate the next step and see what was the feasibility of unifying these two axes of organizational responsibility. Here are some of the things that we’ve done over the last several months:

  • Added per metric activation and (as importantly) deactivation, while still maintaining the history of the metric.
  • Simplified the ability to view 100’s of clients, or 1000’s of allocations as a whole (via histograms), or to monitor and visualize a single client, server, or allocation.
  • Automatically show outliers within a group of metrics (i.e. identify metrics which don’t look like the others).
  • Reduced the friction associated with deploying and monitoring applications in an “application owner”-centric view of the world.
  • These features and many more, the fruit of expert insights, are what we’ve built into the product, and more will be rolled out in the coming months.

    Example of a Circonus Cluster definition:

    variable "consul_tags" {
      type = "list"
      default = [ "app:consul", "source:consul" ]
    }
    
    resource "circonus_metric_cluster" "catalog-service-query-tags" {
      name        = "Aggregate Consul Catalog Queries for Service Tags"
      description = "Aggregate catalog queries for Consul service tags on all consul servers"
    
      query {
        definition = "consul`consul`catalog`service`query-tag`*"
        type       = "average"
      }
    
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    Then merge these into a histogram:

    resource "circonus_check" "query-tags" {
      name   = "Consul Catalog Query Tags (Merged Histogram)"
      period = "60s"
      collector {
        id = "/broker/1490"
      }
      caql {
        query = <<EOF
    search:metric:histogram("consul`consul`catalog`service`query-tag (active:1)") | histogram:merge()
    EOF
      }
      metric {
        name = "output[1]"
        tags = ["${var.consul_tags}", "subsystem:catalog"]
        type = "histogram"
        unit = "nanoseconds"
      }
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    Then add the 99th Percentile:

    resource "circonus_check" "query-tag-99" {
      name   = "Consul Query Tag 99th Percentile"
      period = "60s"
      collector {
        id = "/broker/1490"
      }
      caql {
        query = <<EOF
    search:metric:histogram("consul`consul`http`GET`v1`kv`_ (active:1)") | histogram:merge() | histogram:percentile(99)
    EOF
      }
    
      metric {
        name = "output[1]"
        tags = ["${var.consul_tags}", "subsystem:catalog"]
        type = "histogram"
        unit = "nanoseconds"
      }
    
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    And add a Graph:

    resource "circonus_graph" "query-tag" {
      name        = "Consul Query Tag Overview"
      description = "The per second histogram of all Consul Query tags metrics (with 99th %tile)"
      line_style  = "stepped"
    
      metric {
        check       = "${circonus_check.query-tags.check_id}"
        metric_name = "output[1]"
        metric_type = "histogram"
        axis        = "left"
        color       = "#33aa33"
        name        = "Query Latency"
      }
      metric {
        check       = "${circonus_check.query-tag-99.check_id}"
        metric_name = "output[1]"
        metric_type = "histogram"
        axis        = "left"
        color       = "#caac00"
        name        = "TP99 Query Latency"
      }
    
      tags = ["${var.consul_tags}", "owner:team1"]
    }
    

    And you get this result:

    Finally, we want to be alerted if the 99th Percentile goes above 8000ms. So, we’ll create the contact (along with SMS, we can use Slack, OpsGenie, PagerDuty, VictorOps, or email):

    resource "circonus_contact_group" "consul-owners-escalation" {
      name = "Consul Owners Notification"
      sms {
        user  = "${var.alert_sms_user_name}"
      }
      email {
        address = "consul-team@example.org"
      }
      tags = [ "${var.consul_tags}", "owner:team1" ]
    }
    

    And then define the rule:

    resource "circonus_rule_set" "99th-threshhold" {
      check       = "${circonus_check.query-tag-99.check_id}"
      metric_name = "output[1]"
      notes = <<EOF
    Query latency is high, take corrective action.
    EOF
      link = "https://www.example.com/wiki/consul-latency-playbook"
      if {
        value {
          max_value = "8000" # ms
        }
        then {
          notify = [
            "${circonus_contact_group.consul-owners-escalation.id}",
          ]
          severity = 1
        }
      }
      tags = ["${var.consul_tags}", "owner:team1"]
    }
    

    With a little copy and paste, we can do exactly the same for all the other metrics in the system.

    Note that the original metric was automatically created when consul was deployed, and you can do the same thing with any number of other numeric data points, or do the same with native histogram data (merge all the histograms into a combined histogram and apply analytics across all your consul nodes).

    We also have the beginnings of a sample set of implementations here, which builds on the sample Consul, Nomad, & Vault telemetry integration here.

     


    Postmortem: 2017-04-11 Firewall Outage

    The Event

    At approximately 05:40AM GMT on 4/11/2017, we experienced a network outage in our main datacenter in Chicago, IL.
    The outage lasted until approximately 10:55AM GMT on the same day. The Circonus SaaS service, as well as any PUSH-based checks that use the public trap.noit.circonus.net broker, was affected by this outage. Any HTTPTrap checks using the public trap.noit.circonus.net broker would have been unable to send data to Circonus during this time period. As a result, any alerts based on this PUSH data would also not have been working. Meanwhile, enterprise brokers may have experienced a delay in processing data, but no data would have been lost for users of enterprise brokers, as we use a store-and-forward mechanism on the brokers.

    The Explanation

    We use a pair of firewall devices in an active/passive configuration with automatic failover should one of the devices become unresponsive. The firewall device in question went down, and automatic failover did not trigger for an unknown reason (we are still investigating). When we realized the problem, we killed off the bad firewall device, causing the secondary to promote itself to master and service to be restored.

    What We’re Doing About It

    Going forward, we will use more robust checking mechanisms on these firewall devices so that we are alerted more quickly should we encounter a similar situation. Using an enterprise broker can insulate you from outages like this one, or from any future issues that arise in the network path between your infrastructure and Circonus.

    Documenting with Types

    I’ve said this before: elegant code is pedagogical. That is, elegant code is designed to teach its readers about the concepts and relationships in the problem domain that the code addresses, with as little noise as possible. I think data types are a fundamental tool for teaching readers about the domain that code addresses, but heavy use of types tends to introduce noise into the code base. Still, types are often misunderstood, and, as such, tend to be under-utilized, or otherwise misused.

    Types are means of describing to the compiler what operations are valid against data. That is my working definition; different programming languages will have their own concept of data type and their own means of defining them. However, the type system in every language will have these two responsibilities: to allow the programmer to define what operations are valid against typed data, and to indicate failure when the source code contains a prohibited operation. (Of course, type definitions in most languages do more than this, such as also defining memory organization for typed data for example. These details are critical for translating source code into an executable form, but are not critical to the type-checker.)

    The strength of a programming language’s type system is related to how much it helps or hinders your efforts to discourage client error. It’s my general thesis that the most important kind of help is when your types correspond to distinct domain concepts, simultaneously teaching your users about these concepts and their possible interactions, and discouraging the creation of nonsensical constructs. But this doesn’t usually come for free. I won’t go further into this here, but some hindrances are:

    • Error messages that don’t help users understand their mistakes.
    • Excessively verbose type signatures.
    • Difficulty of representing correctness constraints.
    • More abstraction than your users will want to process.

    Your types should pay for themselves, by some combination of keeping these costs low, preventing catastrophic failure, or being an effective teacher of your domain concepts.[1]

    How does one document code with types? I have a few guidelines, but this blog will run long if I go into all of them here, so I’ll start with one:

    Primitive types are rarely domain types

    Consider integers. A domain concept may be stored in an integer, but this integer will usually have some unit bound to it, like “10 seconds” or “148 bytes”. It almost certainly does not make domain sense to treat a value you got back from `time(2)` like a value you got back from `ftell(3)`. So these could be different types, and the type could be used to prevent misuse. Depending on language, I might even do so, but consider the options:

    In C, you can use `typedef` to create a type alias, as POSIX does. This serves to document that different integers may be used differently, but does not actually prevent misuse:

    #include <stdint.h>
    typedef uint64_t my_offset_t;
    typedef int64_t my_time_t;
    
    void ex1() {
      my_offset_t a = 0;
      /* variable misuse, but not an error in C's type system. */
      my_time_t b = a;
    }
    

    You could use a `struct` to create a new type, but these are awkward to work with:

    #include <stdint.h>
    #include <stdio.h>
    typedef struct { uint64_t val; } my_offset_t;
    typedef struct { int64_t val; } my_time_t;
    
    void my_func() {
      my_offset_t offset_a = { .val=0 };
      my_offset_t offset_b = { .val=1 };
      my_time_t time_c = { .val=2 };
    
      /*
       * variable misuse is a compile-time error:
       *   time_c = offset_a;
       * does not compile.
       */
    
      /*
       * cannot directly use integer operations:
       *   if (offset_b > offset_a) { printf("offset_b > offset_a\n"); }
       * does not compile, but can use:
       */
    
      if (offset_b.val > offset_a.val) { printf("offset_b.val > offset_a.val\n"); }
    
      /*
       * but the technique of reaching directly into the structure to use
       * integer operations also allows:
       */
    
      if (time_c.val > offset_a.val) { printf("BAD\n"); }
    
      /*
       * which is a domain type error, but not a language type error
       * (though it may generate a warning for a signed / unsigned comparison).
       * One could define a suite of functions against the new types, such as:
       *   int64_t compare_offsets(restrict my_offset_t *a, restrict my_offset_t *b)
       *   {
       *     return (int64_t) a->val - (int64_t) b->val;
       *   }
       * and then one could use the more type-safe code:
       *   if (compare_offsets(&offset_a, &offset_b) > 0) {
       *     printf("GOOD\n");
       *   }
       * but, in no particular order: 
       * this isn't idiomatic, so it's more confusing to new maintainers; 
       * even once you're used to the new functions, it's not as readable as idiomatic code;
       *  depending on optimization and inlining, it's plausibly less efficient than
       * idiomatic code;
       * and it is awkward and annoying to define new functions 
       * to replace the built-in integer operations we'd like to use.
       */
    }
    

    As far as I can tell, C does not provide ergonomic options for using the type system to prevent integer type confusion. That said, the likelihood of user error in this example (misusing a time as a file size, or vice versa) is pretty low, so I would probably make the same choice that POSIX did in this circumstance, and just use type aliases to document that the types are different and give up on actually preventing misuse.

    On the other hand, we at Circonus maintain a time-series database that must deal with time at multiple resolutions represented as 64-bit integers. Aggregate data storage uses units of seconds-since-unix-epoch, while high-resolution data storage uses units of milliseconds-since-unix-epoch. In this case, the likelihood of user error working with these different views of time is very high (we have a number of places where we need to convert between these views, and have even needed to change some code from using time-in-seconds to using time-in-milliseconds). Furthermore, making mistakes would probably result in presenting the wrong data to the user (not something you want in a database), or possibly worse.

    If we were strictly using C, I would probably want to follow the approach Joel Spolsky outlined here, and use a form of Hungarian notation to represent the different views of time. As it happens, we’re using C++ in this part of the code base, so we can use the type system to enforce correctness. We have an internal proscription against using the STL (to keep our deployed code relatively traceable with, say, dtrace), so `std::chrono` is out. But we can define our own types for working with these different views of time. We start by creating our own strong_typedef facility (no, we don’t use BOOST, either):

    #define ALWAYS_INLINE __attribute__((always_inline))
    
    // bare-bones newtype facility, intended to wrap primitive types (like
    // `int` or `char *`), imposes no run-time overhead.
    template <typename oldtype, typename uniquify>
      class primitive_newtype
    {
    public:
      typedef oldtype oldtype_t;
      typedef primitive_newtype<oldtype, uniquify> self_t;
    
      primitive_newtype(oldtype val) : m_val(val) {}
      ALWAYS_INLINE oldtype_t to_oldtype() { return m_val; }
    private:
      oldtype m_val;
    };
    

    With this facility, we can define domain types that are incompatible, but which share a representation and which should impose no overhead over using the primitive types:

    class _uniquify_s_t;
    typedef primitive_newtype<int64_t, _uniquify_s_t> seconds_t;
    class _uniquify_ms_t;
    typedef primitive_newtype<int64_t, _uniquify_ms_t> milliseconds_t;
    

    Or even better, since time types all share similar operations, we can define the types together with their operations, and also split up the types for “time point” from “time duration”, while enforcing a constraint that you can’t add two time points together:

    template <typename uniquify>
      class my_time_t
    {
    private:
      class _uniquify_point_t;
      class _uniquify_diff_t;
    public:
      typedef primitive_newtype<int64_t, _uniquify_point_t> point_t;
      typedef primitive_newtype<int64_t, _uniquify_diff_t> diff_t;
      static ALWAYS_INLINE point_t add(point_t a, diff_t b)
      {
        return point_t(a.to_oldtype() + b.to_oldtype());
      }
      static ALWAYS_INLINE diff_t diff(point_t a, point_t b)
      {
        return diff_t(a.to_oldtype() - b.to_oldtype());
      }
      static ALWAYS_INLINE point_t diff(point_t a, diff_t b)
      {
        return point_t(a.to_oldtype() - b.to_oldtype());
      }
      static ALWAYS_INLINE diff_t diff(diff_t a, diff_t b)
      {
        return diff_t(a.to_oldtype() - b.to_oldtype());
      }
      static ALWAYS_INLINE diff_t add(diff_t a, diff_t b)
      {
        return diff_t(a.to_oldtype() + b.to_oldtype());
      }
    };
    
    class _millisecond_uniquify_t;
    typedef my_time_t<_millisecond_uniquify_t> my_millisecond_t;
    class _second_uniquify_t;
    typedef my_time_t<_second_uniquify_t> my_second_t;
    

    This is just the primitive basis of our time-management types, and is implemented a little differently than what we actually have in our code base (to help the example fit in a blog post, and because I write the blog for a different audience than the one for which I write production code).

    With these new types, we can perform basic operations with time in seconds or milliseconds units, while preventing incorrect mixing of types. For example, an attempt to take a difference between a time-point based in seconds and a time-point based in milliseconds, will result in a compilation error. Using these facilities made translating one of our HTTP endpoints from operating against seconds to operating against milliseconds into an entirely mechanical process of converting one code location to use the new types, starting a build, getting a compilation error from a seconds / milliseconds mismatch, changing that location to use the new types, and repeating. This process was much less likely to result in errors than it would have been had we been using bare `int64_t`’s everywhere, relying on a code audit to try and ensure that everything that worked with the old units was correctly updated to use the new.

    These types are more annoying to work with than bare integers, but using them helped avoid introducing a very likely and very significant system problem under maintenance, by providing the strongest possible reinforcement of the fact that we deal with time in two resolutions. In this case, the types paid for themselves.

    (Thanks to Riley Berton for reviewing this post.)

    References:

    [1]: C++ is an interesting case of weighing costs and benefits. While the benefits of using advanced C++ type facilities can be very high (bugs in C++ code can be catastrophic, and many of C++’s advanced facilities impose no runtime overhead), the maintenance costs can also be extremely high, especially when using advanced type facilities. I’ve seen thousands of characters of errors output due to a missing `const` in a type signature. This can be, *ahem*, intimidating.

     


    Post-Mortem 2017.1.12.1

    TL;DR: Some users received spurious false alerts for approximately 30 minutes, starting at 2017-01-12 22:10 UTC. It is our assessment that no expected alerts were missed. There was no data loss.

    Overview

    Due to a software bug in the ingestion pipeline specific to fault detection, a small subset (less than 2.5%) of checks were not analyzed by the online fault detection system for 31 minutes, starting at 2017-01-12 22:10 UTC.

    The problem was triaged. Broker provisioning and deprovisioning services were taken offline at 22:40 UTC, at which time all fault detection services returned to normal.

    Broker provisioning and deprovisioning services were brought back online at 2017-01-13 00:11 UTC. All broker provisioning and deprovisioning requests issued during that period were queued and processed successfully upon service resumption.

    Gratuitous Detail

    Within the Circonus architecture, we have an aggregation layer at the edge of our service that communicates with our store-and-forward telemetry brokers (which in-turn accept/acquire data from agents). This component is called “stratcond.” On January 5th, we launched new code that allows more flexible configuration orchestration and, despite having both unit tests and end-to-end tests, an error was introduced. Normal operations continued successfully until January 12th, when a user issued a command requiring reconfiguration of this system. That command managed to exercise the code path containing this specific error and stratcond crashed. As with all resilient systems, the stratcond was restarted immediately, and it suffered approximately 1.5 seconds of “disconnection” from downstream brokers.

    The system is designed to tolerate failures, as failures are pretty much the only guaranteed thing in distributed systems. These can happen at the most unexpected times and many of our algorithms for handling failure are designed to cope with the randomness (or perceived randomness) of distributed failure.

    The command that caused the crash was queued and reattempted precisely 60 seconds later, and again 60 seconds after that, and again after that. A recurrent and very non-random failure. There are many checks that customers have scheduled to run every 60 seconds. When a check is scheduled to run on a broker, it is scheduled to run at a random offset within the first 60 seconds of that broker’s boot time. So, of the 60-second-period checks, 2.5% of the checks would have been scheduled to run during the 1.5 second real-time-stream outage that transpired as a part of this failure. The particular issue here is that because the crash recurred almost exactly every 60 seconds, the same 1.5 seconds of each minute was vulnerable to exclusion. Therefore the same 2.5% of checks were affected each minute, making them “disappear” to the fault detection system.

    The same general pipeline that powers graphs and analysis is also used for long-term storage, but due to open-ended temporal requirements, that system was unaffected. All checks run in those “outage” windows had their measurements successfully sent upstream and stored (during the outages, since there were no outages for the storage stream).

    Operational response led to diagnosis of the cause of the crash, avoidance, and restoration of normal fault detection operation within 31 minutes. Crash analysis and all-hands engineer triage led to a bug fix, test, packaging, and deployment within 2 hours and 11 minutes.

    Actions

    There are two actions to be taken, and both will require research and implementation.

    The first is to implement better instability detection to further enhance the fault detection system’s existing capabilities for flagging instability. The short, regular timing of the disconnections in this case did not trigger the fault detection system’s instability mode, and thus it did not react as it should have.

    The second is to better exploit “at least once delivery” in the fault pipeline. To make sure we get the job done that we promise to get done, we ensure our systems can process the same data more than once; often, a metric is actually delivered to the fault detection system four times. We can further extend this “duplication tolerance” to the stratcond-broker feed and replay some window of past traffic upstream. In online systems, old data is worthless; in all systems, “old” is subjective. By relaxing our definition of “old” a bit more and leveraging the fact that no upstream protections will be required, we should easily be able to make this tiny section of our pipeline even more resilient to failure.

    To close, we live in the real world. Failure is the only option. We embrace the failures that we see on a daily basis and do our best to ensure that the failures we see do not impact the service we deliver to you in any way. Yesterday, we learned that we can do better. We will.

    Systems Monitoring is Ripe for a Revolution

    Before we explore systems, let’s talk users. After all, most of our businesses wouldn’t exist without lots of users; users with decreasing brand loyalty who value unintrusive, convenient, and quick experiences. We’ve intuited that if a user has a better experience on a competitor’s site, they will stop being your customer and start being theirs. Significant research into exactly how much impact substandard web performance has started around 2010, progressed to consensus, and has turned into a tome of easily consumable knowledge. What allowed for this? RUM.

    Real User Monitoring

    The term RUM wasn’t in constant usage until just after 2010, but the concept and practice grew slowly out of, and transformed, the previous synthetic web monitoring industry. Both Keynote and Gomez (the pallbearers of synthetic web monitoring) successfully survived that transition and became leaders in RUM as well. Of course, the industry has many more and varied competitors now.

    Synthetic monitoring is the process of performing some action and taking measurements around aspects of that performance. A simple example would be asking, “how fast does my homepage load?” The old logic was that an automated system would perform a load of your homepage and measure how long various aspects took: initial page load, image rendering, above-the-fold completeness, etc. One problem is that real users are spread around the world, so to simulate them “better,” one would need to place these automated “agents” around the world so that a synthetic load could indeed come from Paris, or Copenhagen, or Detroit. The fundamental problem remained that the measurements being taken represented exactly zero real users of your website… while users of your website were actively loading your home page. RUM was born when people decided to simply observe what’s actually happening. Now, synthetic monitoring isn’t completely useless, but RUM largely displaced most of its obvious value.

    What took RUM so long? The short answer was the size of the problem relative to the capabilities of the technology. Before 2000, the idea of tracking the performance of every user action was seen as a “Big Data Problem” before we had even coined the term Big Data. Once the industry better understood how to cope with data volumes like this, RUM solutions became commonplace.

    Now it seems fairly obvious that monitoring real users is fundamental to understanding the behavior of a website and its users… so why not with systems?

    Systems are Stuck

    Systems, like websites, have “real users,” those users just happen to be other systems most of the time. It is common practice today to synthetically perform some operation against a system and measure facets of that performance. It is uncommon today to passively observe all operations against the system and extract the same measurements. Systems are stuck in synthetic monitoring land.

    Now, to be fair, certain technologies have been around for a while that allow the observation of in-flight systems; the caveat is that “systems” here really means custom applications running on systems.

    The APM industry took a thin horizontal slice of this problem, added sampling, and sold a solution (capturing much market capitalization in the process). To sum up their “solution”: you have an exceptional view into part of your system some of the time. Imagine selling that story in the web analytics industry today: “now see real users… only on your search pages, only 1% of the time.”

    Why don’t we have a magically delicious RUM store for systems? For the same reason it took so long to get RUM: the technology available today doesn’t afford us the opportunity to crunch that much data. Users work in human time (seconds and minutes) at human scale (tens of millions); systems work in computer time (nanoseconds) at cloud scale (tens of thousands of machines and components). It’s literally a million times harder to think about Real Systems Monitoring (RSM) than it is to think about Real User Monitoring (RUM).

    The Birth of Real Systems Monitoring

    The technology has not improved a million-fold over the last 10 years, so we can’t solve this RSM problem as comprehensively. But it has improved significantly, so we’re ready for a departure from synthetic systems monitoring into a brave new world. Circonus and many of its industry peers have been nipping at the heels of this problem and we are entering the age of tangible gains. Here’s what’s coming (and 5-10 years from now will be ubiquitous table stakes):

    • 100% sampling of microsecond-or-larger latencies in systems operation (i.e., you see everything).
    • Software and services exposing measurement streams from real activity.
    • Histograms as the primary data type in most measurement systems.
    • Significantly more sophisticated math to help humans reason about large datasets.
    • Measurement collection at computer scale (billions of measurements per second).
    • Ultimately, a merge of RUM and RSM… after all, we only have systems because we have users.

    Exciting Times

    At Circonus, we’ve been building the architectures required to tackle these problems: the scale, the histograms, and the math. We see the cost efficiencies increasing, resulting in positive (and often huge) returns on investment. We see software and service providers avidly adding instrumentation that exposes real measurements to interested observers. We’re at an inflection point and the world of systems monitoring is about to take an evolutionary leap forward. These are exciting times.

     

     

    COSI:Postgres

    A few months ago we announced the availability of Circonus One Step Install (COSI) to introduce a very fast way to get data collected for systems, with the most obvious set of metrics enabled. This makes monitoring new systems as easy as copying and pasting a command into a shell on the machine to be monitored, or adding that command to an automation script via Puppet, Chef, Ansible, or whatever you use.

    Today we are announcing the general availability of “plugins” for COSI, starting with one-step monitoring for Postgres databases.

    COSI:postgres builds on the existing COSI workflow outlined in the COSI Tutorial and demonstrated in the video below:

    After completing a basic COSI install like the one shown above, you can now run the postgres plugin to monitor the box for important postgres telemetry. Below is a video showing the process from the initial login on an EC2 instance to setting up full monitoring of the box and postgres via the COSI:postgres plugin. I have included the installation of the postgres database for completeness, but if you have an existing postgres database which you intend to monitor, you can skip to 1:07 in the video.

    Video Summary:

    1. Install Postgres. 0:00
      	$ ssh machine
      	$ sudo bash
      	# apt-get install postgresql
      
    2. Create a database. 0:20
      	# su - postgres
      	$ psql
      	postgres=# create database foo;
      	CREATE DATABASE
      	postgres=# \q
      
    3. Create a table. 0:40
      	$ psql foo
      	foo=# create table bar (baz text);
      	CREATE TABLE
      
    4. Add some data. 0:54
      	foo=# insert into bar (baz) values ('some text');
      	INSERT 0 1
      	foo=# insert into bar (baz) values ('some more text');
      	INSERT 0 1
      	foo=# insert into bar (baz) values ('even more text');
      	INSERT 0 1
      	foo=# \q
      
    5. Monitor this machine with COSI. 1:07
      	# curl -sSL https://onestep.circonus.com/install | bash \
      	-s -- \
      	--key  \
      	--app  \
      
    6. Install protocol_observer. 1:32
      	# apt-get install golang
      	# mkdir go
      	# export GOPATH=~/go
      	# go get github.com/circonus-labs/wirelatency
      	# cd go/src/github.com/circonus-labs/wirelatency/protocol_observer
      	# go build
      
    7. Ensure that protocol_observer is in the PATH. 2:10
      	# cp protocol_observer /usr/bin
      

      NOTE: If you place protocol_observer in /opt/circonus/bin, the postgres plugin will find it automatically because that is the default search path.
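
      In that case, step 7 becomes the following (assuming /opt/circonus/bin already exists on the box), and the sudoers entry in step 8 should then reference that path instead:

      	# cp protocol_observer /opt/circonus/bin/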

    8. protocol_observer requires root privilege to execute, so give ‘nobody’ sudo. 2:24
      	# cd /etc/sudoers.d
      	# echo "nobody ALL=(ALL) NOPASSWD: /usr/bin/protocol_observer" \
      	> 91-protocol_observer
      
    9. Create an account for the COSI plugin. 2:46
      	# su - postgres
      	$ psql foo
      	foo=# create user cosi with password '';
      	CREATE ROLE
      	foo=# grant all on database foo to cosi;
      	GRANT
      	foo=# \q
      
    10. Modify pg_hba.conf to allow local logins. 3:19
      	# nano /etc/postgresql/9.5/main/pg_hba.conf
      	…
      	# /etc/init.d/postgresql restart
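
      NOTE: The exact pg_hba.conf edit is shown in the video. A typical entry allowing the cosi user to log in locally with a password might look like the following (illustrative only; adjust it to your own authentication policy, then restart postgres as above):

      	local   foo   cosi   md5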
      
    11. Finally, run the COSI:Postgres plugin install. 3:44
      # /opt/circonus/cosi/bin/cosi-plugin-postgres --enable \
      --database foo --user cosi --pass 	
      

    Now you are finished installing the plugin, and you are ready to enjoy your new dashboard and the new functionality it offers.

    New Functionality

    The postgres plugin for COSI comes with some advanced functionality:

    • Optional support for tracking latency of every single query that hits the database
    • Cache vs. file system interaction
    • A live view of current transactions in flight as well as a historic graph of transaction counts
    • Displays what postgres background writers are busy doing
    • Forecasts your overall database size in the future!

    Let’s break these new features down:

    Optional support for tracking latency of every single query that hits the database

    In order to support latency tracking, COSI:postgres requires the circonus-labs/wirelatency tool to be installed on the box. The `protocol_observer` executable must be in the system PATH, and the user that executes the node-agent *must* have sudo permission for the `protocol_observer` executable (covered at 1:32 in the video above). This is because tracking the latency of queries relies on capturing (pcap) the network traffic on the postgres port and reconstructing the postgres protocol in order to track when queries come in and when they are responded to. There is a wealth of options for `protocol_observer`, and you can read more about it on the github page.

    What you are seeing in these dashboard graphs for query latency is a heatmap containing the latency of every query that hit this postgres server, along with overlays of the overall count of queries (per second) and quartile banding of the latencies. This gives a good overview of how much time queries against your postgres instances are taking. If you want to get more advanced, you can apply CAQL queries to this data to extract really useful information.

    Cache vs. file system interaction

    Problems under postgres are often related to an inadequate cache size or too many cache misses, which have to go to the actual disk for data. Generally, we want to keep the cache hit percentage as close to 100% as possible. The dashboard widget, “cache hit percentage,” and the graph, “…cache vs. file system,” will help illuminate any cache miss issues and the poor performance your database may be experiencing as a result.
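
    The plugin collects these numbers for you, but you can sanity-check the same ratio by hand from postgres’ standard statistics views; a minimal psql example (not the plugin’s exact query) looks like this:

      	foo=# select datname,
      	             round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as cache_hit_pct
      	      from pg_stat_database where datname = 'foo';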

    A live view of current transactions in flight as well as a historic graph of transaction counts

    The dashboard widget, “txns,” and the graph, “… txns,” show a live view and a historic view (respectively) of transactions running against your database instance. Spikes in these indicate hotspots of activity. Here, “txns” means all database interactions (both reads and writes).
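
    Under the hood, postgres exposes transactions as cumulative commit and rollback counters; the rate of change of a query like the one below (illustrative, not the plugin’s exact query) is what a transaction-count graph is built from:

      	foo=# select xact_commit, xact_rollback from pg_stat_database where datname = 'foo';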

    Displays what postgres background writers are busy doing

    Postgres has several background writer processes that manage flushing data to disk. A lag in checkpointing can make database recovery after a crash a much longer process. This graph will expose problems in the background writer processes. For more on checkpoint statistics, refer to this helpful blog post: “Measuring PostgreSQL Checkpoint Statistics.”
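
    The raw counters behind this graph live in postgres’ pg_stat_bgwriter view; if you want to cross-check the dashboard by hand, an illustrative query is:

      	foo=# select checkpoints_timed, checkpoints_req,
      	             buffers_checkpoint, buffers_clean, buffers_backend
      	      from pg_stat_bgwriter;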

    Forecasts your overall database size in the future!

    The bottom graph on the dashboard exposes the database size as reported by Postgres, along with its forecasted future size. This is calculated using advanced, CAQL-based resource forecasting.

    And more…

    In addition to the features above, this new COSI:postgres plugin exposes active and maximum connection counts, transaction details (how many SELECTs, INSERTs, UPDATEs, and DELETEs), database scans (how many index reads, sequential scans, and tuple reads the database is doing), and database lock information.
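
    Again, the plugin gathers all of this automatically, but the same information is visible by hand through standard postgres views; a few illustrative psql one-liners:

      	foo=# select count(*) as connections from pg_stat_activity;
      	foo=# show max_connections;
      	foo=# select mode, count(*) from pg_locks group by mode;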

    If you are running Postgres instances in your infrastructure and want quick and useful insights into the performance of those systems, the new COSI:postgres plugin is an easy way to automate collection of the most useful metrics for a Postgres installation in the Circonus system.

     

     

    No, We “Fixed the Glitch”

    If you haven’t seen the movie Office Space, you should do so at your earliest convenience. As with the new TV comedy, “Silicon Valley,” Mike Judge hits far too close to home for the movie to be comfortable… its hilarity, on the other hand, is indisputable. So much of our lives is wrapped up in “making the machine work” that comedic exposure of our industry’s deep malfunctions is, perhaps, the only thing that keeps me sane.

    Not a day goes by that I don’t see some scene or line from “Office Space” percolate up from either the industry or Circonus itself. Just after 21:30 UTC on October 3rd came another one of these events, but the situation that brought it up is interesting enough to share.

    In “Office Space,” there is an employee named Milton, whom management believes they have fired, but who has been working and getting paid for years. Classic communication breakdown. However, due to the over-the-top passive aggressive behavior in the organization, management doesn’t want a confrontation to correct the situation. Instead of informing Milton, they simply decide to stop paying him and let the situation work itself out… They “fixed the glitch.” If you do this, you’re an asshole. Spoiler alert: Milton burns the building down.

    The interesting thing about software is that it is full of bugs. So full of bugs, that we tend to fix things we didn’t even know were broken. While it’s no less frustrating to have a “glitch” fixed on you, it’s a bit more understandable when it happens unintentionally. We’re fixing glitches before they are identified as glitches. This most commonly occurs in undocumented behavior that is assumed to be stable by some consumer of a system. It happens during a new feature introduction, or some other unrelated bug fixing, or a reimplementation of the system exhibiting the undocumented behavior, and then boom… some unsuspecting consumer has their world turned upside down. I’m sure we’ve done this at Circonus.

    On October 3rd, a few customers had their Amazon Cloudwatch checks stop returning data. After much fretting and testing, we could find nothing wrong with Amazon’s API. Sure, it was a bit slow and gave stale information, but this is something we’ve accommodated from the beginning. Amazon’s Cloudwatch service is literally a metrics tire fire. But this was different… the answers just stopped happening.

    Circonus’ collection system is three-tier (unlike many of our competitors that use two-tier systems). First, there’s the thing that’s got the info: the agent. In this case, the agent is the Cloudwatch API itself. Then, there’s the thing that stores and analyzes the data: Circonus SaaS. And finally there’s this middle tier that talks to the agents, then stores and forwards the data back to Circonus SaaS. We call this the broker. Brokers are basically babelfish; they speak every protocol (e.g. they can interrogate the Cloudwatch API), and they are spread throughout the world. By spreading them out, we can place brokers closer to the agents so that network disruptions don’t affect the collection of data, and so that we get a more resilient observation fabric. This explains why I can assert that “we didn’t change anything,” even with as many as fifty code launches per day. The particular broker in question, the one talking to the Cloudwatch API, hadn’t been upgraded in weeks. Additionally, we audit changes to the configuration of the broker, and the configurations related to Cloudwatch interrogations hadn’t been modified either.

    So, with no changes to the system or code asking Cloudwatch for data and no changes to the questions we are asking Cloudwatch, how could the answers just stop? Our first thought was that Amazon must have changed something, but that’s a pretty unsatisfying speculation without more evidence.

    The way Cloudwatch works is that you ask for a metric and then limit the ask by fixing certain dimensions on the data. For example, if I wanted to look at a specific Elastic Load Balancer (ELB) servicing one of my properties and ascertain the number of healthy hosts backing it, then I’d work backwards. First, I’d ask for the number of healthy hosts, the “HealthyHostCount”, and then I’d limit that to the namespace “AWS/ELB” and specify a set of dimensions. Some of the available dimensions are “Service”, “Namespace”, and “LoadBalancerName”. Now, our Cloudwatch integration is very flexible, and users can specify whatever dimensions they please, knowing that it is possible that they might work themselves out of an answer (by setting dimensions that are not possible).
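
    To make that concrete, here is roughly what such an interrogation looks like when posed directly to Cloudwatch with the AWS CLI (an illustration of the API, not the broker’s actual implementation; the time window is arbitrary):

      	$ aws cloudwatch get-metric-statistics \
      	    --namespace AWS/ELB \
      	    --metric-name HealthyHostCount \
      	    --dimensions Name=LoadBalancerName,Value=website-prod13 \
      	    --start-time 2016-10-03T21:00:00Z --end-time 2016-10-03T22:00:00Z \
      	    --period 60 --statistics Average

    The check described below additionally pinned Service and Namespace as dimensions, and that extra constraint is the part that eventually stopped matching.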

    The particular Cloudwatch interrogation said that the dimensions should match the following: Service="ELB", Namespace="AWS", and LoadBalancerName="website-prod13". And behold: data. The broker was set to collect this data starting at 12:00 UTC on October 1st and to check it every minute.

    As we can see from this graph, while it worked at first, there appears to be an outage. “It just stopped working.” Or did it? Around 21:30 on October 3rd, things went off the rails.

    This graph tells a very different story than things “just stopping.” For anyone that runs very large clusters of machines where they do staged rollouts, this might look familiar. It looks a lot like a probability of 1 shifting to a probability of 0 over about two hours. Remember, there are no changes in what we are asking or how we are asking it… just different answers. In this case, the expected answer is 2, but we received no answer at all.

    The part I regret most about this story is how long it took for the problem to be completely resolved. It turns out that removing the Service="ELB" and Namespace="AWS" dimensions, leaving only LoadBalancerName="website-prod13", resulted in Amazon Cloudwatch correctly returning the expected answer again. The sudden recovery on October 7th wasn’t magic; the customer changed the configuration in Circonus to eliminate those two dimensions from the query.

    Our confidence is pretty high that nothing changed on our end. My confidence is also pretty high that in a code launch on October 3rd, Amazon “fixed a glitch.”