Hosts, Metrics, and Pricing, Oh My!

As the number and variety of monitoring vendors have grown over the past several years, so have their pricing models. With the rise of cloud-based ephemeral systems, often running alongside statically provisioned infrastructure, understanding the monitoring needs of one’s systems can be a challenge for even the most seasoned engineering and operations practitioners. In this post, we’ll take a look at a few of the current trends in the industry and why metrics-based pricing is the future of commercial monitoring purchasing.

For the first decade of the 21st century, rack-mounted servers dominated the footprint of web infrastructure. Operators used the number of hosts as the primary guide for infrastructure provisioning and capacity planning. Popular open source monitoring systems of the era, such as Nagios and Zabbix, reflected this paradigm; their UIs were oriented around host-based management. Commercial monitoring systems of the time followed this pattern; pricing was a combination of the number of hosts and resources, such as CPUs. Figuring out how much a commercial solution would cost was a relatively straightforward calculation: take the number of hosts or resources and multiply it by the cost of each.

Post-2010, two major disruptions to this host-based pricing model surfaced in the industry. The first was Amazon’s Elastic Compute Cloud (EC2), which had been growing in usage since 2006. Ephemeral cloud systems, such as AWS, GCP, and Azure, are now the preferred infrastructure choice. Services now span many individual hosts (or containers), so what may have been deployed to a single host 10 years ago is now a composition of dozens or hundreds. In these situations, host-based pricing makes cost forecasting for infrastructure monitoring solutions much more complicated. One need only be familiar with AWS auto-scaling or K8s cluster redeployment to shudder at the implications for host-based monitoring system costs.

The second major disruption post-2010 was Etsy’s statsd, which introduced easy-to-implement application-based metrics and, in large part, gave rise to the rapid ascent of monitoring culture over the last several years. Now one can instrument an application, collect telemetry from thousands (if not millions) of sources across a distributed application, and monitor its health in real time. The implication for statically provisioned infrastructure is that a single host can now source orders of magnitude more metrics than just host-based system metrics. Host-based vendors have responded by including only a small number of metrics per host in their base pricing; this represents an additional revenue opportunity for them, and an additional cost headache for operators.

As a result of these disruptions, metrics-based pricing has emerged as a solution which gives systems operators, engineers, and cost- and capacity-conscious executives a way to address the current monitoring needs of their applications. The question is no longer “how many hosts do we have,” but “how many things (metrics) do we need to monitor.” Because the answer to that question is itself ephemeral, it is important that modern monitoring solutions express it in terms of how many concurrent metrics are being collected. This is an infrastructure-invariant approach that scales from bare metal to containers to serverless applications.

Storage is cheap; concurrency is not. Does your monitoring solution let you analyze your legacy metrics without charging you to keep them stored, or do you pay an ever-increasing cost for all of the metrics you’ve ever collected?

At Circonus, we believe that an active-only metrics-based pricing model is the fairest way to deliver value to our customers, while giving them the power to scale up or down as needed over time. You can archive metrics while still retaining the ability to perform historical analyses with them. Our pricing model gives you the flexibility to follow your application’s monitoring needs, so that you can adapt to the ever-changing trends in application infrastructure.

As the world turns, so do our understanding of and expectations for systems. At the end of the day, pricing models need to make sense for buyers, and not all buyers are at the same point on the technology adoption curve. Generally, active-only metrics pricing is the humane approach, but our model is user-friendly and adaptable. Don’t be surprised if we introduce more options to accommodate customers who just can’t make this model fit the way they run their technical infrastructure.

–Fred Moyer is a Developer Evangelist at Circonus

Twitter: @phredmoyer

Monitoring as Code

Circonus has always been API-driven, and this remains one of our product’s core strengths. Via our API, Circonus provides the ability to create anything that you can create in the UI, and more. With so many of our customers moving to API-driven platforms like AWS, DigitalOcean, Google Compute Engine (GCE), Joyent, Microsoft Azure, and even private clouds, we have seen the emergence of a variety of tools (like Terraform) that allow an abstraction of these resources. Now, with Circonus built into Terraform, it is possible to declaratively codify your application’s monitoring, alerting, and escalation, as well as the resources it runs on.

Terraform is a tool from HashiCorp for building, changing, and versioning infrastructure, which can be used to manage a wide variety of popular and custom service providers. Now, Terraform 0.9 includes an integration for managing Circonus.

These are a few key features of the Circonus Provider in Terraform:

  • Monitoring as Code – Alongside Infrastructure as Code. Monitoring (i.e. what to monitor, how to visualize, and when to alert) is described using the same high-level configuration syntax used to describe infrastructure. This allows a blueprint of your datacenter, as well as your business rules, to be versioned and treated as you would any other code. Additionally, monitoring can be shared and reused.
  • Execution Plans – Terraform has a “planning” step where it generates an execution plan of what will be monitored, visualized, and alerted on.
  • Resource Graphs – Terraform builds a graph of all your resources, and now can include how these resources are monitored, and parallelizes the creation and modification of any non-dependent resources.
  • Change Automation – Complex changesets can be applied to your infrastructure and metrics, visualizations, or alerts, which can all be created, deactivated, or deleted with minimal human interaction.

This last piece, Change Automation, is one of the most powerful features of the Circonus Provider in Terraform. Allocations and services can come and go (within a few seconds or a few days), and the monitoring of each resource dynamically updates accordingly.

While our larger customers were already running in programmatically defined, CMDB-driven worlds, our cloud computing customers didn’t share our API-centric view of the world. The lifecycle management of their metrics was either implicit, creating a pile of quickly outdated metrics, or ill-specified, in that they couldn’t make sense of the voluminous data. What was missing was an easy way to implement this level of monitoring across the incredibly complex SOAs our customers were running.

Now when organizations ship and deploy an application to the cloud, they can also specify a definition of health (i.e. what a healthy system looks like). This runs side-by-side with the code that specifies the infrastructure supporting the application. For instance, if your application runs in an AWS Auto-Scaling Group and consumes RDS resources, it’s possible to unify the metrics, visualization, and alerting across these different systems using Circonus. Application owners now have a unified deployment framework that can measure itself against internal or external SLAs. With the Circonus Provider in Terraform 0.9, companies running on either public clouds or in private data centers can now programmatically manage their monitoring infrastructure.

As an API-centric service provider, Circonus has always worked with configuration management software. Now, in the era of mutable infrastructure, Terraform extends this API-centric view to provide ubiquitous coverage and consistency across all of the systems that Circonus can monitor. Terraform enables application owners to create a higher-level abstraction of the application, datacenter, and associated services, and present this information back to the rest of the organization in a consistent way. With the Circonus Provider, any Terraform-provisioned resource that can be monitored can be referenced such that there are no blind spots or exceptions. Even as an API-driven company, we’ve unfortunately seen blind spots develop; with Terraform, these blind spots are systematically addressed, providing a consistent look, feel, and escalation workflow for application teams and the rest of the organization.

It has been and continues to be an amazing journey to ingest the firehose of data from ephemeral infrastructure, and this is our latest step toward servicing cloud-centric workloads. As industry veterans who remember when microservice architectures were simply called “SOA,” it is impressive to watch the rate at which new metrics are produced and the dramatic lifespan reduction for network endpoints. At Circonus, our first integration with some of the underlying technologies that enable a modern SOA came at HashiConf 2016. At that time we had nascent integrations with Consul, Nomad, and Vault, but in the intervening months we have added more and more to the product to increase the value customers can get from each of these industry-accepted products:

  • Consul is the gold standard for service-discovery, and we have recently added a native Consul check-type that makes cluster management of services a snap.
  • Nomad is a performant, robust, and datacenter-aware scheduler with native Vault integration.
  • Vault can be used to secure, store, and control access to secrets in a SOA.

Each of these products utilizes our circonus-gometrics library. When enabled, circonus-gometrics automatically creates the necessary checks and enables metrics for all of the available telemetry, creating histogram, text, or numeric metrics as appropriate for each telemetry stream. Users can now monitor these tools from a single instance and have a unified lifecycle management framework for both infrastructure and application monitoring. In particular, how do you address the emergent DevOps pattern of separating infrastructure management from the running of applications? Enter Terraform. With help from HashiCorp, we began an R&D experiment to investigate the next step and assess the feasibility of unifying these two axes of organizational responsibility. Here are some of the things that we’ve done over the last several months:

  • Added per-metric activation and (as importantly) deactivation, while still maintaining the history of the metric.
  • Simplified the ability to view hundreds of clients or thousands of allocations as a whole (via histograms), or to monitor and visualize a single client, server, or allocation.
  • Automatically show outliers within a group of metrics (i.e. identify metrics which don’t look like the others).
  • Reduced the friction associated with deploying and monitoring applications in an “application owner”-centric view of the world.

These features, and many more, are the fruit of expert insights that we’ve built into the product, and more will be rolled out in the coming months.

    Example of a Circonus Cluster definition:

    variable "consul_tags" {
      type = "list"
      default = [ "app:consul", "source:consul" ]
    }
    
    resource "circonus_metric_cluster" "catalog-service-query-tags" {
      name        = "Aggregate Consul Catalog Queries for Service Tags"
      description = "Aggregate catalog queries for Consul service tags on all consul servers"
    
      query {
        definition = "consul`consul`catalog`service`query-tag`*"
        type       = "average"
      }
    
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    Then merge these into a histogram:

    resource "circonus_check" "query-tags" {
      name   = "Consul Catalog Query Tags (Merged Histogram)"
      period = "60s"
      collector {
        id = "/broker/1490"
      }
      caql {
        query = <<EOF
    search:metric:histogram("consul`consul`catalog`service`query-tag (active:1)") | histogram:merge()
    EOF
      }
      metric {
        name = "output[1]"
        tags = ["${var.consul_tags}", "subsystem:catalog"]
        type = "histogram"
        unit = "nanoseconds"
      }
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    Then add the 99th Percentile:

    resource "circonus_check" "query-tag-99" {
      name   = "Consul Query Tag 99th Percentile"
      period = "60s"
      collector {
        id = "/broker/1490"
      }
      caql {
        query = <<EOF
    search:metric:histogram("consul`consul`http`GET`v1`kv`_ (active:1)") | histogram:merge() | histogram:percentile(99)
    EOF
      }
    
      metric {
        name = "output[1]"
        tags = ["${var.consul_tags}", "subsystem:catalog"]
        type = "histogram"
        unit = "nanoseconds"
      }
    
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    And add a Graph:

    resource "circonus_graph" "query-tag" {
      name        = "Consul Query Tag Overview"
      description = "The per second histogram of all Consul Query tags metrics (with 99th %tile)"
      line_style  = "stepped"
    
      metric {
        check       = "${circonus_check.query-tags.check_id}"
        metric_name = "output[1]"
        metric_type = "histogram"
        axis        = "left"
        color       = "#33aa33"
        name        = "Query Latency"
      }
      metric {
        check       = "${circonus_check.query-tag-99.check_id}"
        metric_name = "output[1]"
        metric_type = "histogram"
        axis        = "left"
        color       = "#caac00"
        name        = "TP99 Query Latency"
      }
    
      tags = ["${var.consul_tags}", "owner:team1"]
    }
    

    And you get this result:

    Finally, we want to be alerted if the 99th Percentile goes above 8000ms. So, we’ll create the contact (along with SMS, we can use Slack, OpsGenie, PagerDuty, VictorOps, or email):

    resource "circonus_contact_group" "consul-owners-escalation" {
      name = "Consul Owners Notification"
      sms {
        user  = "${var.alert_sms_user_name}"
      }
      email {
        address = "consul-team@example.org"
      }
      tags = [ "${var.consul_tags}", "owner:team1" ]
    }
    

    And then define the rule:

    resource "circonus_rule_set" "99th-threshhold" {
      check       = "${circonus_check.query-tag-99.check_id}"
      metric_name = "output[1]"
      notes = <<EOF
    Query latency is high, take corrective action.
    EOF
      link = "https://www.example.com/wiki/consul-latency-playbook"
      if {
        value {
          max_value = "8000" # ms
        }
        then {
          notify = [
            "${circonus_contact_group.consul-owners-escalation.id}",
          ]
          severity = 1
        }
      }
      tags = ["${var.consul_tags}", "owner:team1"]
    }
    

    With a little copy and paste, we can do exactly the same for all the other metrics in the system.

    Note that the original metric was automatically created when Consul was deployed. You can do the same thing with any number of other numeric data points, or with native histogram data (merge all the histograms into a combined histogram and apply analytics across all of your Consul nodes).

    We also have the beginnings of a sample set of implementations here, which builds on the sample Consul, Nomad, & Vault telemetry integration here.

    Learn More about Terraform

    Postmortem: 2017-04-11 Firewall Outage

    The Event

    At approximately 05:40AM GMT on 4/11/2017, we experienced a network outage in our main datacenter in Chicago, IL. The outage lasted until approximately 10:55AM GMT on the same day. The Circonus SaaS service, as well as any PUSH-based checks that use the public trap.noit.circonus.net broker, were affected by this outage. Any HTTPTrap checks using the public trap.noit.circonus.net broker would have been unable to send data to Circonus during this time period, so any alerts based on this PUSH data would also not have been working. Enterprise brokers may have experienced a delay in processing data, but no data would have been lost for users of enterprise brokers, as we use a store-and-forward mechanism on the brokers.

    The Explanation

    We use a pair of firewall devices in an active/passive configuration, with automatic failover should one of the devices become unresponsive. The active firewall device went down, and automatic failover did not trigger, for reasons we are still investigating. When we identified the problem, we killed off the bad firewall device, causing the secondary to promote itself to master, and service was restored.

    What We’re Doing About It

    Going forward, we will use more robust checking mechanisms on these firewall devices so that we are alerted more quickly should we encounter a similar situation. Using an enterprise broker can also insulate you from outages like this one, or from any future issues in the network path between your infrastructure and Circonus.

    Documenting with Types

    I’ve said this before: elegant code is pedagogical. That is, elegant code is designed to teach its readers about the concepts and relationships in the problem domain that the code addresses, with as little noise as possible. I think data types are a fundamental tool for teaching readers about the domain that code addresses, but heavy use of types tends to introduce noise into the code base. Still, types are often misunderstood, and, as such, tend to be under-utilized, or otherwise misused.

    Types are means of describing to the compiler what operations are valid against data. That is my working definition; different programming languages will have their own concept of data type and their own means of defining them. However, the type system in every language will have these two responsibilities: to allow the programmer to define what operations are valid against typed data, and to indicate failure when the source code contains a prohibited operation. (Of course, type definitions in most languages do more than this, such as defining the memory organization of typed data. These details are critical for translating source code into an executable form, but are not critical to the type-checker.)
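
    As a minimal, contrived illustration (the `retry_count` type here is invented solely for this example, not taken from any real code base), consider a type that only permits the operations we choose to define; the compiler rejects everything else:

    #include <stdint.h>

    // A retry counter that can only be read or incremented; every other
    // integer operation is deliberately left undefined for it.
    class retry_count {
    public:
      explicit retry_count(uint32_t n) : m_n(n) {}
      retry_count next() const { return retry_count(m_n + 1); }
      uint32_t value() const { return m_n; }
    private:
      uint32_t m_n;
    };

    void example() {
      retry_count r(0);
      r = r.next();   // valid: an operation the type defines
      /*
       * prohibited: no operator* exists for retry_count, so a line like
       *   r = r * 2;
       * is rejected by the type-checker at compile time.
       */
      (void)r.value();
    }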

    The strength of a programming language’s type system is related to how much it helps or hinders your efforts to discourage client error. It’s my general thesis that the most important kind of help is when your types correspond to distinct domain concepts, simultaneously teaching your users about these concepts and their possible interactions, and discouraging the creation of nonsensical constructs. But this doesn’t usually come for free. I won’t go further into this here, but some hindrances are:

    • Error messages that don’t help users understand their mistakes.
    • Excessively verbose type signatures.
    • Difficulty of representing correctness constraints.
    • More abstraction than your users will want to process.

    Your types should pay for themselves, by some combination of keeping these costs low, preventing catastrophic failure, or being an effective teacher of your domain concepts.[1]

    How does one document code with types? I have a few guidelines, but this blog will run long if I go into all of them here, so I’ll start with one:

    Primitive types are rarely domain types

    Consider integers. A domain concept may be stored in an integer, but this integer will usually have some unit bound to it, like “10 seconds” or “148 bytes”. It almost certainly does not make domain sense to treat a value you got back from `time(2)` like a value you got back from `ftell(3)`. So these could be different types, and the type could be used to prevent misuse. Depending on language, I might even do so, but consider the options:

    In C, you can use `typedef` to create a type alias, as POSIX does. This serves to document that different integers may be used differently, but does not actually prevent misuse:

    #include <stdint.h>
    typedef uint64_t my_offset_t;
    typedef int64_t my_time_t;
    
    void ex1() {
      my_offset_t a = 0;
      /* variable misuse, but not an error in C's type system. */
      my_time_t b = a;
    }
    

    You could use a `struct` to create a new type, but these are awkward to work with:

    #include <stdint.h>
    #include <stdio.h>
    typedef struct { uint64_t val; } my_offset_t;
    typedef struct { int64_t val; } my_time_t;
    
    void my_func() {
      my_offset_t offset_a = { .val=0 };
      my_offset_t offset_b = { .val=1 };
      my_time_t time_c = { .val=2 };
    
      /*
       * variable misuse is a compile-time error:
       *   time_c = offset_a;
       * does not compile.
       */
    
      /*
       * cannot directly use integer operations:
       *   if (offset_b > offset_a) { printf("offset_b > offset_a\n"); }
       * does not compile, but can use:
       */
    
      if (offset_b.val > offset_a.val) { printf("offset_b.val > offset_a.val\n"); }
    
      /*
       * but the technique of reaching directly into the structure to use
       * integer operations also allows:
       */
    
      if (time_c.val > offset_a.val) { printf("BAD\n"); }
    
      /*
       * which is a domain type error, but not a language type error
       * (though it may generate a warning for a signed / unsigned comparison).
       * One could define a suite of functions against the new types, such as:
       *   int64_t compare_offsets(my_offset_t * restrict a, my_offset_t * restrict b)
       *   {
       *     return (int64_t) a->val - (int64_t) b->val;
       *   }
       * and then one could use the more type-safe code:
       *   if (compare_offsets(&offset_a, &offset_b) > 0) {
       *     printf("GOOD\n");
       *   }
       * but, in no particular order:
       *   - this isn't idiomatic, so it's more confusing to new maintainers;
       *   - even once you're used to the new functions, it's not as readable
       *     as idiomatic code;
       *   - depending on optimization and inlining, it's plausibly less
       *     efficient than idiomatic code;
       *   - and it is awkward and annoying to define new functions to
       *     replace the built-in integer operations we'd like to use.
       */
    }
    

    As far as I can tell, C does not provide ergonomic options for using the type system to prevent integer type confusion. That said, the likelihood of user error in this example (misusing a time as a file size, or vice versa) is pretty low, so I would probably make the same choice that POSIX did in this circumstance, and just use type aliases to document that the types are different and give up on actually preventing misuse.

    On the other hand, we at Circonus maintain a time-series database that must deal with time at multiple resolutions represented as 64-bit integers. Aggregate data storage uses units of seconds-since-unix-epoch, while high-resolution data storage uses units of milliseconds-since-unix-epoch. In this case, the likelihood of user error working with these different views of time is very high (we have a number of places where we need to convert between these views, and have even needed to change some code from using time-in-seconds to using time-in-milliseconds). Furthermore, making mistakes would probably result in presenting the wrong data to the user (not something you want in a database), or possibly worse.

    If we were strictly using C, I would probably want to follow the approach Joel Spolsky outlined here, and use a form of Hungarian notation to represent the different views of time. As it happens, we’re using C++ in this part of the code base, so we can use the type system to enforce correctness. We have an internal proscription against using the STL (to keep our deployed code relatively traceable with, say, dtrace), so `std::chrono` is out. But we can define our own types for working with these different views of time. We start by creating our own strong_typedef facility (no, we don’t use BOOST, either):

    #define ALWAYS_INLINE __attribute__((always_inline))
    
    // bare-bones newtype facility, intended to wrap primitive types (like
    // `int` or `char *`), imposes no run-time overhead.
    template <typename oldtype, typename uniquify>
      class primitive_newtype
    {
    public:
      typedef oldtype oldtype_t;
      typedef primitive_newtype<oldtype, uniquify> self_t;
    
      primitive_newtype(oldtype val) : m_val(val) {}
      ALWAYS_INLINE oldtype_t to_oldtype() { return m_val; }
    private:
      oldtype m_val;
    };
    

    With this facility, we can define domain types that are incompatible, but which share a representation and which should impose no overhead over using the primitive types:

    class _uniquify_s_t;
    typedef primitive_newtype<int64_t, _uniquify_s_t> seconds_t;
    class _uniquify_ms_t;
    typedef primitive_newtype<int64_t, _uniquify_ms_t> milliseconds_t;
    

    Or even better, since time types all share similar operations, we can define the types together with their operations, and also split up the types for “time point” from “time duration”, while enforcing a constraint that you can’t add two time points together:

    template <typename uniquify>
      class my_time_t
    {
    private:
      class _uniquify_point_t;
      class _uniquify_diff_t;
    public:
      typedef primitive_newtype<int64_t, _uniquify_point_t> point_t;
      typedef primitive_newtype<int64_t, _uniquify_diff_t> diff_t;
      static ALWAYS_INLINE point_t add(point_t a, diff_t b)
      {
        return point_t(a.to_oldtype() + b.to_oldtype());
      }
      static ALWAYS_INLINE diff_t diff(point_t a, point_t b)
      {
        return diff_t(a.to_oldtype() - b.to_oldtype());
      }
      static ALWAYS_INLINE point_t diff(point_t a, diff_t b)
      {
        return point_t(a.to_oldtype() - b.to_oldtype());
      }
      static ALWAYS_INLINE diff_t diff(diff_t a, diff_t b)
      {
        return diff_t(a.to_oldtype() - b.to_oldtype());
      }
      static ALWAYS_INLINE diff_t add(diff_t a, diff_t b)
      {
        return diff_t(a.to_oldtype() + b.to_oldtype());
      }
    };
    
    class _millisecond_uniquify_t;
    typedef my_time_t<_millisecond_uniquify_t> my_millisecond_t;
    class _second_uniquify_t;
    typedef my_time_t<_second_uniquify_t> my_second_t;
    

    This is just the primitive basis of our time-management types, and is implemented a little differently than what we actually have in our code base (to help the example fit in a blog post, and because I write the blog for a different audience than the one for which I write production code).

    With these new types, we can perform basic operations with time in seconds or milliseconds units, while preventing incorrect mixing of types. For example, an attempt to take a difference between a time-point based in seconds and a time-point based in milliseconds will result in a compilation error. Using these facilities made translating one of our HTTP endpoints from operating against seconds to operating against milliseconds an entirely mechanical process: convert one code location to use the new types, start a build, get a compilation error from a seconds / milliseconds mismatch, change that location to use the new types, and repeat. This process was much less likely to result in errors than if we had been using bare `int64_t`‘s everywhere and relying on a code audit to ensure that everything that worked with the old units was correctly updated to use the new ones.
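
    Here is a small, contrived usage sketch (the `example_usage` function is invented solely for illustration) built on the types defined above; the commented-out call fails to compile if enabled:

    void example_usage() {
      my_second_t::point_t start_s(1000);
      my_second_t::point_t end_s(1060);
      my_millisecond_t::point_t now_ms(1060000);

      // Valid: both operands are second-based points, so we get a
      // second-based duration back.
      my_second_t::diff_t elapsed_s = my_second_t::diff(end_s, start_s);

      // Valid: advance a millisecond-based point by a millisecond-based
      // duration.
      my_millisecond_t::diff_t step_ms(500);
      now_ms = my_millisecond_t::add(now_ms, step_ms);

      /*
       * Invalid: a millisecond-based point cannot be passed where a
       * second-based point is expected, so the following is a compile-time
       * error rather than a silent unit bug:
       *   my_second_t::diff_t bogus = my_second_t::diff(now_ms, start_s);
       */
      (void)elapsed_s;
    }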

    These types are more annoying to work with than bare integers, but using them helped avoid introducing a very likely and very significant system problem under maintenance, by providing the strongest possible reinforcement of the fact that we deal with time in two resolutions. In this case, the types paid for themselves.

    (Thanks to Riley Berton for reviewing this post.)

    References:

    [1]: C++ is an interesting case of weighing costs and benefits. While the benefits of using advanced C++ type facilities can be very high (bugs in C++ code can be catastrophic, and many of C++’s advanced facilities impose no runtime overhead), the maintenance costs can also be extremely high, especially when using advanced type facilities. I’ve seen thousands of characters of errors output due to a missing `const` in a type signature. This can be, *ahem*, intimidating.

    Post-Mortem 2017.1.12.1

    TL;DR: Some users received spurious false alerts for approximately 30 minutes, starting at 2017-01-12 22:10 UTC. It is our assessment that no expected alerts were missed. There was no data loss.

    Overview

    Due to a software bug in the ingestion pipeline specific to fault detection, a small subset (less than 2.5%) of checks were not analyzed by the online fault detection system for 31 minutes, starting at 2017-01-12 22:10 UTC.

    The problem was triaged. Broker provisioning and deprovisioning services were taken offline at 22:40 UTC, at which time all fault detection services returned to normal.

    Broker provisioning and deprovisioning services were brought back online at 2017-01-13 00:11 UTC. All broker provisioning and deprovisioning requests issued during that period were queued and processed successfully upon service resumption.

    Gratuitous Detail

    Within the Circonus architecture, we have an aggregation layer at the edge of our service that communicates with our store-and-forward telemetry brokers (which in turn accept/acquire data from agents). This component is called “stratcond.” On January 5th, we launched new code that allows more flexible configuration orchestration and, despite both unit tests and end-to-end tests, an error slipped through. Normal operations continued successfully until January 12th, when a user issued a command requiring reconfiguration of this system. That command managed to exercise the code path containing this specific error, and stratcond crashed. As with all resilient systems, the stratcond was restarted immediately, and it suffered approximately 1.5 seconds of “disconnection” from downstream brokers.

    The system is designed to tolerate failures, as failures are pretty much the only guaranteed thing in distributed systems. These can happen at the most unexpected times and many of our algorithms for handling failure are designed to cope with the randomness (or perceived randomness) of distributed failure.

    The command that caused the crash was queued and reattempted precisely 60 seconds later, and again 60 seconds after that, and again after that: a recurrent and very non-random failure. Many customers have checks scheduled to run every 60 seconds. When a check is scheduled to run on a broker, it is scheduled at a random offset within the first 60 seconds of that broker’s boot time. So, of the 60-second-period checks, 2.5% (1.5 seconds out of every 60) would have been scheduled to run during the 1.5-second real-time-stream outage that transpired as a part of this failure. The particular issue here is that because the crash recurred almost exactly every 60 seconds, the same 1.5 seconds of each minute was vulnerable to exclusion. Therefore the same 2.5% of checks were affected each minute, making them “disappear” to the fault detection system.

    The same general pipeline that powers graphs and analysis is also used for long-term storage, but due to its open-ended temporal requirements, that system was unaffected. All checks run during those “outage” windows had their measurements successfully sent upstream and stored, since the storage stream itself experienced no outages.

    Operational response led to diagnosis of the cause of the crash, avoidance, and restoration of normal fault detection operation within 31 minutes. Crash analysis and all-hands engineering triage led to a bug fix, test, packaging, and deployment within 2 hours and 11 minutes.

    Actions

    There are two actions to be taken, and both will require research and implementation.

    The first is to implement better instability detection to further enhance the already sophisticated capabilities of flagging instability in the fault detection system. The short, reliable timing of the disconnections in this case did not trigger the fault detection system’s instability mode and thus it did not react as it should have.

    The second is to better exploit “at least once delivery” in the fault pipeline. To make sure that we get the job done that we promise to get done, we make sure our systems can process the same data more than once; often, a metric is actually delivered to the fault detection system four times. We can further extend this “duplication tolerance” to the stratcond-broker feed and replay some window of past traffic upstream. In online systems, old data is worthless. In all systems, “old” is subjective. By relaxing our definition of “old” a bit more, and leveraging the fact that the upstream already tolerates duplicates and thus requires no new protections, we should easily be able to make this tiny section of our pipeline even more resilient to failure.

    To close, we live in the real world; failure is the only option. We embrace the failures that we see on a daily basis and do our best to ensure that they do not impact the service we deliver to you in any way. Yesterday, we learned that we can do better. We will.

    Systems Monitoring is Ripe for a Revolution

    Before we explore systems, let’s talk users. After all, most of our businesses wouldn’t exist without lots of users; users who have decreasing brand loyalty and who value unintrusive, convenient, and quick experiences. We’ve intuited that if a user has a better experience on a competitor’s site, they will stop being your customer and start being theirs. Significant research into exactly how much impact substandard web performance has started around 2010, progressed to consensus, and has turned into a tome of easily consumable knowledge. What allowed for this? RUM.

    Real User Monitoring

    The term RUM wasn’t in constant usage until just after 2010, but the concept and practice grew slowly and ultimately transformed the synthetic web monitoring industry that preceded it. Both Keynote and Gomez (the pallbearers of synthetic web monitoring) successfully survived that transition and became leaders in RUM as well. Of course, the industry has many more and varied competitors now.

    Synthetic monitoring is the process of performing some action and measuring aspects of that performance. A simple example would be asking, “how fast does my homepage load?” The old logic was that an automated system would load your homepage and measure how long various aspects took: initial page load, image rendering, above-the-fold completeness, etc. One problem is that real users are spread around the world, so to simulate them “better,” one would need to place these automated “agents” around the world so that a synthetic load could indeed come from Paris, or Copenhagen, or Detroit. The fundamental problem remained that the measurements being taken represented exactly zero real users of your website… while real users were actively loading your home page. RUM was born when people decided to simply observe what’s actually happening. Synthetic monitoring isn’t completely useless, but RUM largely displaced most of its obvious value.

    What took RUM so long? The short answer was the size of the problem relative to the capabilities of the technology. Before 2000, tracking the performance of every user action was seen as a “Big Data Problem” before we had even coined the term Big Data. Once the industry better understood how to cope with data volumes like this, RUM solutions became commonplace.

    Now it seems fairly obvious to anyone that monitoring real users is fundamental in attempting to understand the behavior of their website and its users… so why not with systems?

    Systems are Stuck

    Systems, like websites, have “real users,” those users just happen to be other systems most of the time. It is common practice today to synthetically perform some operation against a system and measure facets of that performance. It is uncommon today to passively observe all operations against the system and extract the same measurements. Systems are stuck in synthetic monitoring land.

    Now, to be fair, certain technologies have been around for a while that allow the observation of in-flight systems; with the caveat that “systems” here usually means the custom applications running on those systems.

    The APM industry took a thin horizontal slice of this problem, added sampling, and sold a solution (gaining much market capitalization along the way). To sum up their “solution”: you have an exceptional view into part of your system, some of the time. Imagine selling that story in the web analytics industry today: “now see real users… only on your search pages, only 1% of the time.”

    Why don’t we have a magically delicious RUM store for systems? For the same reason it took so long to get RUM: the technology available today doesn’t afford us the opportunity to crunch that much data. Users work in human time (seconds and minutes) at human scale (tens of millions); systems work at computing time (nanoseconds) at cloud scale (tens of thousands of machines and components). Roughly speaking, a billion-fold finer time scale against a thousand-fold fewer entities nets out to a factor of a million: it’s literally a million times harder to think about Real Systems Monitoring (RSM) than it is to think about Real User Monitoring (RUM).

    The Birth of Real Systems Monitoring

    The technology has not improved a million-fold over the last 10 years, so we can’t solve this RSM problem as comprehensively. But it has improved significantly, so we’re ready for a departure from synthetic systems monitoring into a brave new world. Circonus and many of its industry peers have been nipping at the heels of this problem and we are entering the age of tangible gains. Here’s what’s coming (and 5-10 years from now will be ubiquitous table stakes):

    • 100% sampling of microsecond or larger latencies in systems operation (i.e., you see everything).
    • Software and services will expose measurement streams from real activity.
    • Histograms as the primary data type in most measurement systems.
    • Significantly more sophisticated math to cope with human reasoning about large datasets.
    • Measurement collection at computer scale (billions of measurements per second).
    • Ultimately, a merge of RUM and RSM… after all, we only have systems because we have users.

    Exciting Times

    At Circonus, we’ve been building the architectures required to tackle these problems: the scale, the histograms, and the math. We see the cost efficiencies increasing, resulting in positive (and often huge) returns on investment. We see software and service providers avidly adding instrumentation that exposes real measurements to interested observers. We’re at an inflection point and the world of systems monitoring is about to take an evolutionary leap forward. These are exciting times.

    COSI:Postgres

    A few months ago we announced the availability of Circonus One Step Install (COSI), a very fast way to get data collected for systems with the most obvious set of metrics enabled. This makes monitoring new systems as easy as copying and pasting a command into a shell on the machine to be monitored, or adding that command to an automation script via Puppet, Chef, Ansible, or whatever you use.

    Today we are announcing the general availability of “plugins” for COSI, starting with one-step monitoring for Postgres databases.

    COSI:postgres builds on the existing COSI workflow outlined in the COSI Tutorial and demonstrated in the video below:

    After completing a basic COSI install like the one shown above, you can now run the postgres plugin to monitor the box for important postgres telemetry. Below is a video showing the process from the initial login on an EC2 instance to setting up full monitoring of the box and postgres via the COSI:postgres plugin. I have included the installation of the postgres database for completeness, but if you have an existing postgres database which you intend to monitor, you can skip to 1:07 in the video.

    Video Summary:

    1. Install Postgres. 0:00
      	$ ssh machine
      	$ sudo bash
      	# apt-get install postgresql
      
    2. Create a database. 0:20
      	# su - postgres
      	$ psql
      	postgres=# create database foo;
      	CREATE DATABASE
      	postgres=# \q
      
    3. Create a table. 0:40
      	$ psql foo
      	foo=# create table bar (baz text);
      	CREATE TABLE
      
    4. Add some data. 0:54
      	foo=# insert into bar (baz) values ('some text');
      	INSERT 0 1
      	foo=# insert into bar (baz) values ('some more text');
      	INSERT 0 1
      	foo=# insert into bar (baz) values ('even more text');
      	INSERT 0 1
      	foo=# \q
      
    5. Monitor this machine with COSI. 1:07
      	# curl -sSL https://onestep.circonus.com/install | bash \
      	-s -- \
      	--key <api key> \
      	--app <app name>
      
    6. Install protocol_observer. 1:32
      	# apt-get install golang
      	# mkdir go
      	# export GOPATH=~/go
      	# go get github.com/circonus-labs/wirelatency
      	# cd go/src/github.com/circonus-labs/wirelatency/protocol_observer
      	# go build
      
    7. Ensure that protocol_observer is in the PATH. 2:10
      	# cp protocol_observer /usr/bin
      

      NOTE: If you place protocol_observer in /opt/circonus/bin, the postgres plugin will find it automatically because that is the default search path.

    8. protocol_observer requires root privilege to execute, so give ‘nobody’ sudo. 2:24
      	# cd /etc/sudoers.d
      	# echo "nobody ALL=(ALL) NOPASSWD: /usr/bin/protocol_observer" \
      > 91-protocol_observer
      
    9. Create an account for the COSI plugin. 2:46
      	# su - postgres
      	$ psql foo
      	foo=# create user cosi with password '<password>';
      	CREATE ROLE
      	foo=# grant all on database foo to cosi;
      	GRANT
      	foo=# \q
      
    10. Modify pg_hba.conf to allow local logins. 3:19
      	# nano /etc/postgresql/9.5/main/pg_hba.conf
      	…
      	# /etc/init.d/postgresql restart
      
    11. Finally, run the COSI:Postgres plugin install. 3:44
      # /opt/circonus/cosi/bin/cosi-plugin-postgres --enable \
      --database foo --user cosi --pass <password>
      

    Now you are finished installing the plugin, and you are ready to enjoy your new dashboard and the new functionality it offers.

    New Functionality

    The postgres plugin for COSI comes with some advanced functionality:

    • Optional support for tracking latency of every single query that hits the database
    • Cache vs. file system interaction
    • A live view of current transactions in flight as well as a historic graph of transaction counts
    • Displays what postgres background writers are busy doing
    • Forecasts your overall database size in the future!

    Let’s break these new features down:

    Optional support for tracking latency of every single query that hits the database

    In order to support latency tracking, COSI:postgres requires the circonus-labs/wirelatency tool to be installed on the box. The `protocol_observer` executable must be in the system PATH, and the user that executes the node-agent *must* have sudo permission for the `protocol_observer` executable (covered at 1:32 in the video above). This is because tracking the latency of queries relies on pcapping the network traffic for the postgres port and reconstructing the postgres protocol in order to track when queries come in and when they are responded to. There is a wealth of options for `protocol_observer`, and you can read more about it on the github page.

    What you are seeing in these dashboard graphs for query latency is a heatmap containing the latency of every query that hit this postgres server, along with overlays of the overall count of queries (per second) and quartile banding of the latencies. This gives a good overview of how much time the queries against your postgres instances are taking. If you want to get more advanced, you can apply CAQL queries to this data to extract really useful info.

    Cache vs. file system interaction

    Problems under postgres are often related to inadequate cache size or too many cache misses which have to hit the actual disk for data. Generally, we want to keep the cache hit percentage as close to 100% as possible. The dashboard widget, “cache hit percentage,” and the graph, “…cache vs. file system,” will help illuminate any cache miss issues and poor performance your database may be experiencing.

    A live view of current transactions in flight as well as a historic graph of transaction counts

    The dashboard widget, “txns,” and the graph, “… txns,” show a live view and a historic view (respectively) of transactions running against your database instance. Spikes in these indicate hotspots of activity. Here, “txns” means all database interactions (both reads and writes).

    Displays what postgres background writers are busy doing

    Postgres has several background writer processes which manage commits to disk. A lag in checkpointing can make database recovery after a crash a much longer process. This graph will expose problems in the background writer processes. For more on checkpoint statistics, refer to this helpful blog post: “Measuring PostgreSQL Checkpoint Statistics”.

    Forecasts your overall database size in the future!

    The bottom graph on the dashboard exposes database size as reported by Postgres, along with its forecasted future size. This is calculated using advanced CAQL-based resource forecasting.

    And more…

    In addition to the above features, this new COSI:postgres plugin exposes active and max connection counts, transaction details (how many SELECTs, INSERTs, UPDATEs, and DELETEs), database scans (how many index reads, sequential scans, and tuple reads the database is doing), and database lock information.

    If you are running Postgres instances in your infrastructure and want quick and useful insights into the performance of those systems, the new COSI:postgres plugin is an easy way to automate collection of the most useful metrics for a Postgres installation in the Circonus system.

    No, We “Fixed the Glitch”

    If you haven’t seen the movie Office Space, you should do so at your earliest convenience. As with the new TV comedy “Silicon Valley,” Mike Judge hits far too close to home for the movie to be comfortable… its hilarity, on the other hand, is indisputable. So much of our lives are wrapped up in “making the machine work” that comedic exposures of our industry’s deep malfunctions are, perhaps, the only things that keep me sane.

    Not a day goes by that I don’t see some scene or line from “Office Space” percolate from either the industry or Circonus itself. Just after 21:30 UTC on October 3rd brought another one of these moments, and the situation behind it is interesting enough to share.

    In “Office Space,” there is an employee named Milton, whom management believes they have fired, but who has been working and getting paid for years. Classic communication breakdown. However, due to the over-the-top passive aggressive behavior in the organization, management doesn’t want a confrontation to correct the situation. Instead of informing Milton, they simply decide to stop paying him and let the situation work itself out… They “fixed the glitch.” If you do this, you’re an asshole. Spoiler alert: Milton burns the building down.

    The interesting thing about software is that it is full of bugs. So full of bugs that we tend to fix things we didn’t even know were broken. While it’s no less frustrating to have a “glitch” fixed on you, it’s a bit more understandable when it happens unintentionally. We’re fixing glitches before they are identified as glitches. This most commonly occurs with undocumented behavior that some consumer of a system assumes to be stable. It happens during a new feature introduction, some other unrelated bug fixing, or a reimplementation of the system exhibiting the undocumented behavior, and then boom… some unsuspecting consumer has their world turned upside down. I’m sure we’ve done this at Circonus.

    On October 3rd, a few customers had their Amazon Cloudwatch checks stop returning data. After much fretting and testing, we could find nothing wrong with Amazon’s API. Sure, it was a bit slow and gave stale information, but this is something we’ve accommodated from the beginning. Amazon’s Cloudwatch service is literally a metrics tire fire. But this was different… the answers just stopped happening.

    Circonus’ collection system is three-tier (unlike many of our competitors that use two-tier systems). First, there’s the thing that’s got the info: the agent. In this case, the agent is the Cloudwatch API itself. Then, there’s the thing that stores and analyzes the data: Circonus SaaS. And finally there’s this middle tier that talks to the agents, then stores and forwards the data back to Circonus SaaS. We call this the broker. Brokers are basically babelfish; they speak every protocol (e.g. they can interrogate the Cloudwatch API), and they are spread throughout the world. By spreading them out, we can place brokers closer to the agents so that network disruptions don’t affect the collection of data, and so that we get a more resilient observation fabric. This explains why I can assert that “we didn’t change anything,” even with as many as fifty code launches per day. The particular broker in question, the one talking to the Cloudwatch API, hadn’t been upgraded in weeks. Additionally, we audit changes to the configuration of the broker, and the configurations related to Cloudwatch interrogations hadn’t been modified either.

    So, with no changes to the system or code asking Cloudwatch for data and no changes to the questions we are asking Cloudwatch, how could the answers just stop? Our first thought was that Amazon must have changed something, but that’s a pretty unsatisfying speculation without more evidence.

    The way Cloudwatch works is that you ask for a metric and then limit the ask by fixing certain dimensions on the data. For example, if I wanted to look at a specific Elastic Load Balancer (ELB) servicing one of my properties and ascertain the number of healthy hosts backing it, then I’d work backwards. First, I’d ask for the number of healthy hosts, the “HealthyHostCount”, and then I’d limit that to the namespace “AWS/ELB” and specify a set of dimensions. Some of the available dimensions are “Service”, “Namespace”, and “LoadBalancerName”. Now, our Cloudwatch integration is very flexible, and users can specify whatever dimensions they please, knowing that it is possible that they might work themselves out of an answer (by setting dimensions that are not possible).

    The particular Cloudwatch interrogation said that the dimensions should match the following: Service=”ELB”, Namespace=”AWS”, and LoadBalancerName=”website-prod13.” And behold: data. The broker was set to collect this data at 12:00 UTC on October 1st and to check it every minute.

    There appears to be an outage.

    As we can see from this graph, while it worked at first, there appears to be an outage. “It just stopped working.” Or did it? Around 21:30 on October 3rd, things went off the rails.

    Things went off the rails here.

    This graph tells a very different story than things “just stopping.” For anyone who runs very large clusters of machines with staged rollouts, this might look familiar. It looks a lot like a probability of 1 shifting to a probability of 0 over about two hours. Remember, there are no changes in what we are asking or how we are asking it… just different answers. In this case, the expected answer is 2, but we received no answer at all.

    The part I regret most about this story is how long it took for the problem to be completely resolved. It turns out that removing the Service=”ELB” and Namespace=”AWS” dimensions, leaving only LoadBalancerName=”website-prod13”, resulted in Amazon Cloudwatch correctly returning the expected answer again. The sudden recovery on October 7th wasn’t magic; the customer changed the configuration in Circonus to eliminate those two dimensions from the query.

    Our confidence is pretty high that nothing changed on our end. My confidence is also pretty high that in a code launch on October 3rd, Amazon “fixed a glitch.”

    The Circonus API and Raspberry Pi

    Building a Raspberry Pi Circonus AlertBot

    To a Site Reliability Engineer, one of the most important things is making sure that you get alerts as soon as events happen. Circonus provides many contact options, such as SMS, email, Slack, PagerDuty, or VictorOps. For me, if I’m not actively watching for alerts, I rely on my phone to do something to get my attention. On occasion, I may bump the mute switch on my iPhone, or the battery might run out, etc. It’s rare, but it happens.

    Folks that know me also know that I get a kick out of the Raspberry Pi. They run Linux, they’re cheap, and they provide all those nice GPIO pins. So it was a good practical exercise to have a Raspberry Pi do some work for me. I decided to build a backup alerting tool for those times when I don’t see the alerts on my phone.

    First I attached a pair of LEDs to the GPIO (one red, one green), along with appropriate resistors, one to GPIO pin 7, and one to pin 11. Then I attached a super obnoxious buzzer with a resistor to GPIO pin 24, and a momentary switch (button) to GPIO pin 26 along with a pullup resistor.

    Then I wrote up a script in Perl that will poll the Circonus API for any sev1 alerts. The green light means no alerts, and the red light means something broke. And since I’m usually sleeping when it goes off, the buzzer accompanies the red light. This is as annoying as intended, and therefore the momentary switch will signal to the script to keep quiet.

    I have a few improvements planned for this setup. My next task is to implement a cheap OLED display to provide some information about an alert. I’ve also thought about replacing the LEDs with a relay that could switch on something more robust, like a lamp or a motor, which might be useful in an office setting.

    I’ll admit the script leaves room for a lot of improvement, and pull requests are welcome! Either way, pairing the Raspberry Pi with the Circonus API has been fun and productive. I imagine there are plenty of other applications that could make use of the two! Join the Circonus Labs slack community and you can share and discuss your ideas with our engineers and other Circonus users.

    Crash-Dump Necromancy

    Recently, one of our on-premises customers experienced a system crash with one of their metric-storage nodes. Circonus offers the option to operate a complete, private copy of our product on a customer’s own premises. This may be because of a compliance requirement or for reasons of scale that make the public SaaS version unsuitable. In this case, the node was temporarily incapacitated, and recovered on its own without intervention. This is the story of how we diagnosed the issue to determine why the node had this panic, which then informed our customer’s decisions about hardware replacement.

    Our metrics store runs on OmniOS, a distribution of illumos, itself a fork of the now-defunct OpenSolaris project. The Solaris pedigree is important, because it means we have a robust set of post-mortem debugging capabilities, beginning with a kernel crash dump.

    Whenever an OmniOS system encounters an unexpected condition, the kernel logs a panic message and dumps the contents of kernel memory to the local dump or swap device. On the next boot, a service called savecore notices that a dump is present and captures that data as a compressed vmdump.X file in /var/crash/<hostname>, where X is just an ordinal number, starting at 0, to distinguish multiple crash dumps. Our customer was able to send that file to us for analysis.

    Analysis of such a crash dump is performed with MDB, the Modular Debugger. MDB is something of a Swiss Army knife for debugging on illumos-based systems. It can operate on various types of “targets,” including kernel crash dumps and application core files, as well as a running system. In this exercise, we’ll primarily be using MDB interactively, but it can also function quite handily in a shell pipeline, as we’ll see later.

    The customer knew from the panic message that the system had panicked due to the ZFS deadman timer:

    panic[cpu0]/thread=ffffff00f40c5c40:
    genunix: [ID 918906 kern.notice] I/O to pool 'data' appears to be hung.
    

    ZFS is an advanced storage subsystem that combines the traditional capabilities of a filesystem, volume manager, and RAID controller into a unified storage stack. It features a pooled storage model, provable data integrity with end-to-end checksumming, and dead-simple administration. It’s a cornerstone of how Circonus protects metric data: we use node-local redundancy with a mirroring discipline, and we also write the same data to multiple nodes in the metric storage cluster. The end-to-end checksumming ensures that when we read back metric data, we can be confident it’s the same data we originally wrote.
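
    As a purely hypothetical example of that mirroring discipline, a pool laid out as two 2-way mirrors could be created like this (device names are illustrative):

    # zpool create data mirror c1t4d0 c1t5d0 mirror c1t6d0 c1t7d0

    Every block written to such a pool lands on both disks of one of the mirrors, on top of the copies written to other nodes in the cluster.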

    The ZFS deadman timer does what it says on the tin: if the timer expires, the system panics and restarts. It is a last resort for when ZFS has sent one or more I/O commands to the underlying storage and has not heard back within the allotted time (this is tunable; the default is 1000 seconds, roughly 16.7 minutes). Panicking the system provides the administrator with a snapshot of the system state at the moment the deadman timer expired. As we will see, this is a rich source of information that helps us drill down on the culprit(s).
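
    For the curious, the timeout is governed by a kernel tunable; on current illumos releases it is typically zfs_deadman_synctime_ms (in milliseconds), though the name may vary by release. Its value can be read from MDB, either against a live kernel with mdb -k or against a crash dump like this one, and it can be pinned in /etc/system with a line such as set zfs:zfs_deadman_synctime_ms = 1000000:

    > zfs_deadman_synctime_ms/E
    zfs_deadman_synctime_ms:        1000000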

    In order to analyze the customer’s crash dump, we must first decompress the crash dump file produced by savecore on the customer’s system:

    # savecore -f vmdump.0 .
    

    The uncompressed dump is written to ./vmcore.0, along with a copy of the corresponding kernel binary as ./unix.0. The kernel binary provides the symbol mappings (kernel functions, global variables, and so on) needed to interpret the core image. To launch an interactive session using this crash dump, we just invoke mdb with the name of the core file.

    # mdb vmcore.0
    Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs mr_sas sd ip hook neti sockfs arp usba lofs idm random ptm ipc nfs cpc ufs logindmux ]
    >
    

    We are now at the prompt (>) where we can run MDB dcmds (debugger commands), which are routines that access various properties of the target. If we didn’t have the hint from the kernel log about the reason for the crash, we could see what we’re dealing with by using the ::status dcmd:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from [hostname redacted]
    operating system: 5.11 omnios-d0a4f2e (i86pc)
    panic message: I/O to pool 'data' appears to be hung.
    dump content: kernel pages only
    

    The deadman timer is tracked in the spa_sync function, so the first order of business is to locate the kernel thread that was in this function. To do that, we use the ::stacks dcmd, which prints kernel thread stacks. This dcmd is provided by the illumos kernel, so it is only available when debugging a kernel crash dump or interacting with a live system in kernel mode. Additionally, we can narrow down the list of stacks to our target function with the “-c” option:

    > ::stacks -c spa_sync
    THREAD           STATE    SOBJ                COUNT
    ffffff00f4ce5c40 SLEEP    CV                      2
                     swtch+0x141
                     cv_wait+0x70
                     zio_wait+0x5b
                     dsl_pool_sync+0xb3
                     spa_sync+0x27e
                     txg_sync_thread+0x227
                     thread_start+8
    

    Interesting: a count of 2. ::stacks summarizes unique kernel thread stacks that share common features, such as thread state, type of synchronization object (condition variable, mutex, etc.), and program counter. To find out more about these stacks, we can use another kernel-provided dcmd, ::findstack. This prints more information about a thread’s stack, including function arguments if the “-v” option is used. We can also take advantage of MDB’s handy pipelining capability, similar to a shell pipeline, in order to avoid a lot of tedious copy/pasting:

    > ::stacks -c spa_sync | ::findstack -v
    stack pointer for thread ffffff00f4ce5c40: ffffff00f4ce5960
    [ ffffff00f4ce5960 _resume_from_idle+0xf4() ]
      ffffff00f4ce5990 swtch+0x141()
      ffffff00f4ce59d0 cv_wait+0x70(ffffff30ef69ff40, ffffff30ef69ff38)
      ffffff00f4ce5a10 zio_wait+0x5b(ffffff30ef69fc18)
      ffffff00f4ce5aa0 dsl_pool_sync+0xb3(ffffff21f098a6c0, 168a1db)
      ffffff00f4ce5b70 spa_sync+0x27e(ffffff21f0998000, 168a1db)
      ffffff00f4ce5c20 txg_sync_thread+0x227(ffffff21f098a6c0)
      ffffff00f4ce5c30 thread_start+8()
    stack pointer for thread ffffff00f457dc40: ffffff00f457d960
    [ ffffff00f457d960 _resume_from_idle+0xf4() ]
      ffffff00f457d990 swtch+0x141()
      ffffff00f457d9d0 cv_wait+0x70(ffffff30106b6dc8, ffffff30106b6dc0)
      ffffff00f457da10 zio_wait+0x5b(ffffff30106b6aa0)
      ffffff00f457daa0 dsl_pool_sync+0xb3(ffffff21f6494c80, 168a1df)
      ffffff00f457db70 spa_sync+0x27e(ffffff21f0f93000, 168a1df)
      ffffff00f457dc20 txg_sync_thread+0x227(ffffff21f6494c80)
      ffffff00f457dc30 thread_start+8()
    

    Unlike a shell pipeline, what gets piped is not the literal text that we see in the terminal, but the value of an expression, which MDB calls “dot”. MDB’s syntax is based on computing the value of something, usually a memory address in the target, with dot representing the current address. Here, ::stacks sets dot to each thread address in succession and sends those over the pipe to ::findstack, which then operates on each one in turn. Without the pipeline, we would need to manually copy and paste each address and prefix it to the next dcmd, e.g. <addr>::findstack -v.

    With both stacks printed along with their function arguments, we can see that the first argument to spa_sync differs between the two threads. From the source, we know that this argument is a pointer to a spa_t struct, which has a member, spa_name, holding the name of the pool being synced. We can get that value by printing at the address of the struct, providing the struct type and the member name we’re interested in:

    > ffffff21f0998000::print -t spa_t spa_name
    char [256] spa_name = [ "rpool" ]
    
    > ffffff21f0f93000::print -t spa_t spa_name
    char [256] spa_name = [ "data" ]
    

    There is one spa_sync thread for each pool in this system. We know from the panic message that it was the “data” pool that was hung, so we’ll focus on the “dc40” stack from here on:

    stack pointer for thread ffffff00f457dc40: ffffff00f457d960
    [ ffffff00f457d960 _resume_from_idle+0xf4() ]
      ffffff00f457d990 swtch+0x141()
      ffffff00f457d9d0 cv_wait+0x70(ffffff30106b6dc8, ffffff30106b6dc0)
      ffffff00f457da10 zio_wait+0x5b(ffffff30106b6aa0)
      ffffff00f457daa0 dsl_pool_sync+0xb3(ffffff21f6494c80, 168a1df)
      ffffff00f457db70 spa_sync+0x27e(ffffff21f0f93000, 168a1df)
      ffffff00f457dc20 txg_sync_thread+0x227(ffffff21f6494c80)
      ffffff00f457dc30 thread_start+8()
    

    Moving up the stack, dsl_pool_sync issued some I/O, in the form of a zio_t, to the underlying storage, then called zio_wait to be notified when that I/O completes. Each ZIO can have child ZIOs for related data, and those can have their own children, and so on. So rather than looking up a single struct member, we could potentially be recursing down a non-trivial tree. It’s therefore a good thing that, in addition to dcmds, MDB has “walkers,” which know the specifics of a particular data structure and can automatically “walk” through it for us, and that kernel modules can supply their own helpers. The ZFS kernel module helpfully provides a dcmd for summarizing a ZIO and its relatives, called (perhaps obviously) ::zio.

    > ::help zio
    
    NAME
      zio - zio_t summary
    
    SYNOPSIS
      addr ::zio [-cpr]
            -c display children
            -p display parents
            -r display recursively
    

    Starting from the address of the ZIO that we’re waiting on, we use both the -c option to show child ZIOs and the -r option to recurse down through any descendants. For a busy pool, this can produce a lot of output. We can use another MDB feature, the “!” shell escape, to send the output to a file that we can work with outside of MDB.

    > ffffff30106b6aa0::zio -c -r !cat > zio_data
    > ::quit
    
    $ wc -l zio_data 
    4798 zio_data
    

    Now we can look at this file in a text editor in order to more easily explore the output.

    I/O in ZFS flows through the ZIO pipeline, where various stages can not only perform the actual I/O, but also transform the data in different ways, such as compression or deduplication. Since we’re dealing with a hung pool, we’re interested in actual I/O. Therefore we want to look for ZIOs that are in the VDEV_IO_* stages.

    ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
    ffffff30106b6aa0        NULL  CHECKSUM_VERIFY  ffffff00f457dc40 -
     ffffff30106afc58       WRITE ISSUE_ASYNC      -                -
      ffffff2e4e9248e0      WRITE ISSUE_ASYNC      -                -
       ffffff309de9ac18     WRITE ISSUE_ASYNC      -                -
        ffffff22749a8578    WRITE ISSUE_ASYNC      -                -
         ffffff318ec07750   WRITE ISSUE_ASYNC      -                -
          ffffff23025a5e58  WRITE ISSUE_ASYNC      -                -
    ...
           ffffff31ba43ee50 WRITE VDEV_IO_START    -                1007685ms
    

    Starting at the top of our output file, we see the ZIO address that we were waiting on, followed by its descendants. This pool was primarily doing writes, as expected, given that it’s running the Circonus metrics store. If we proceed down the list, we eventually come to some actual VDEV_IO operations. In ZFS, a “vdev” (virtual device) is the building block for a pool. There are many types of vdevs (some representing higher-level objects such as mirror or raidz sets), but for our purposes, we’re talking about “leaf vdevs”: the actual devices that store data, i.e., disks. The last column, elapsed time, is interesting too: this ZIO has been sitting in the VDEV_IO_START stage for a whopping 1007 seconds. Something is definitely wrong here, and that lines up neatly with the 1000-second ZFS deadman timer.

    We would expect the ZIOs doing vdev I/O to carry some information about which vdev they are operating on. Does the zio_t structure mention vdevs?

    > ffffff31ba43ee50::print -t zio_t !grep vdev
        zio_done_func_t *io_done = vdev_mirror_child_done
        vdev_t *io_vd = 0xffffff21f4db0d00
    

    Yes it does, and there’s a pointer, *io_vd, to a vdev_t. If we print that with MDB, we find that it has a wealth of information about the vdev in question. In particular, there is a device path:

    > ffffff21f4db0d00::print -t vdev_t
    vdev_t {
    ...
        char *vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"
    

    Now we know that our waited-upon ZIO was at least writing to /dev/dsk/c1t7d0s0 and maybe to some others as well (remember, there were nearly 4800 lines of output for all the active ZIOs for the “data” zpool).

    As an aside, we can actually collapse the above two invocations into one by using the structure pointer operator:

    > ffffff31ba43ee50::print -t zio_t io_vd->vdev_path
    char *io_vd->vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"
    

    It’s time to break out our text-processing tools and crunch this data. We’d need to get vdev_t pointers out of all the ZIOs that are “interesting,” and print the vdev_path from each one of those to show what device the ZIO was issued to. I like to use awk for this:

    $ awk '$1 ~ /^ff/ && $5 != "-"' zio_data | head -5
           ffffff31ba43ee50 WRITE VDEV_IO_START    -                1007685ms
           ffffff31855c0c70 WRITE VDEV_IO_START    -                1007685ms
           ffffff2d9fe99af0 WRITE VDEV_IO_START    -                1007685ms
           ffffff22746e7060 WRITE VDEV_IO_START    -                1007686ms
           ffffff2d9a82c1b0 WRITE VDEV_IO_START    -                1007685ms
    

    “Interesting” ZIOs in this case would be ones where the elapsed time has a value, i.e., the last column is not “-”. But this isn’t helpful on its own – we need to drill down into each zio_t, find the vdev_t, and extract the vdev_path from it. To do this, we build up a list of MDB dcmds and use the shell to invoke mdb with this list.

    $ awk '$1 ~ /^ff/ && $5 != "-" \
      {
        printf("%s::print -t zio_t io_vd->vdev_path\n", $1)
      }' zio_data | \
      mdb vmcore.0 | \
      sort | uniq
    char *io_vd->vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"
    char *io_vd->vdev_path = 0xffffff21f5b196f8 "/dev/dsk/c1t6d0s0"
    char *io_vd->vdev_path = 0xffffff21f5b19710 "/dev/dsk/c1t5d0s0"
    char *io_vd->vdev_path = 0xffffff21f5b19728 "/dev/dsk/c1t4d0s0"
    

    This takes the ZIO address (column 1) and prints the *io_vd->vdev_path value. Awk generates one dcmd line for every ZIO address, which we shell-pipe into MDB, then sort and uniq the output to produce a concise list of the disks with outstanding I/O at the time of the panic. In case you were wondering: the vmcore.0 image is ~25G in size, and this completes in under 0.5 seconds on a modest machine.

    If a single disk were implicated, we might suspect that disk of being the culprit. However, since multiple disks are shown, the culprit is more likely either the disk controller or possibly some other core system problem, such as a PCIe bus error, a bad backplane port, or a cabling issue.

    Even though this problem wasn’t caused by our software, the debugging facilities available in OmniOS enabled us to help the customer diagnose the issue with their system, ultimately leading to a replacement of the disk controller.