Monitoring as Code

Circonus has always been API-driven, and this has always been one of our product’s core strengths. Via our API, Circonus provides the ability to create anything that you can in the UI and more. With so many of our customers moving to API-driven platforms like AWS, DigitalOcean, Google Compute Engine (GCE), Joyent, Microsoft Azure, and even private clouds, we have seen the emergence of a variety of tools (like Terraform) that allows an abstraction of these resources. Now, with Circonus built into Terraform, it is possible to declaratively codify your application’s monitoring, alerting, and escalation, as well as the resources it runs on.

Terraform is a tool from HashiCorp for building, changing, and versioning infrastructure, which can be used to manage a wide variety of popular and custom service providers. Now, Terraform 0.9 includes an integration for managing Circonus.

These are a few key features of the Circonus Provider in Terraform:

  • Monitoring as Code – Alongside Infrastructure as Code. Monitoring (i.e. what to monitor, how to visualize, and when to alert) is described using the same high-level configuration syntax used to describe infrastructure. This allows a blueprint of your datacenter, as well as your business rules, to be versioned and treated as you would any other code. Additionally, monitoring can be shared and reused.
  • Execution Plans – Terraform has a “planning” step where it generates an execution plan of what will be monitored, visualized, and alerted on.
  • Resource Graphs – Terraform builds a graph of all your resources, and now can include how these resources are monitored, and parallelizes the creation and modification of any non-dependent resources.
  • Change Automation – Complex changesets can be applied to your infrastructure and metrics, visualizations, or alerts, which can all be created, deactivated, or deleted with minimal human interaction.

This last piece, Change Automation, is one of the most powerful features of the Circonus Provider in Terraform. Allocations and services can come and go (within a few seconds or a few days), and the monitoring of each resource dynamically updates accordingly.

While our larger customers were already running in programmatically defined and CMDB-driven worlds, our cloud computing customers didn’t share our API-centric view of the world. The lifecycle management of our metrics was either implicit, creating a pile of quickly outdated metrics, or ill-specified, in that they couldn’t make sense of the voluminous data. What was missing was an easy way to implement this level of monitoring across the incredibly complex SOA our customers were using.

Now when organizations ship and deploy an application to the cloud, they can also specify a definition of health (i.e. what a healthy system looks like). This runs side-by-side with the code that specifies the infrastructure supporting the application. For instance, if your application runs in an AWS Auto-Scaling Group and consumes RDS resources, it’s possible to unify the metrics, visualization, and alerting across these different systems using Circonus. Application owners now have a unified deployment framework that can measure itself against internal or external SLAs. With the Circonus Provider in Terraform 0.9, companies running on either public clouds or in private data centers can now programmatically manage their monitoring infrastructure.

As an API-centric service provider, Circonus has always worked with configuration management software. Now, in the era of mutable infrastructure, Terraform extends this API-centric view to provide ubiquitous coverage and consistency across all of the systems that Circonus can monitor. Terraform enables application owners to create a higher-level abstraction of the application, datacenter, and associated services, and present this information back to the rest of the organization in a consistent way. With the Circonus Provider, any Terraform-provisioned resource that can be monitored can be referenced such that there are no blind spots or exceptions. As an API-driven company, we’ve unfortunately seen blind-spots develop, but with Terraform these blind spots are systematically addressed, providing a consistent look, feel, and escalation workflow for application teams and the rest of the organization.

It has been and continues to be an amazing journey to ingest the firehose of data from ephemeral infrastructure, and this is our latest step toward servicing cloud-centric workloads. As industry veterans who remember when microservice architectures were were simply called “SOA,” it is impressive to watch the rate at which new metrics are produced and the dramatic lifespan reduction for network endpoints. At Circonus, our first integration with some of the underlying technologies that enable a modern SOA came at HashiConf 2016. At that time we had nascent integrations with Consul, Nomad, and Vault, but in the intervening months we have added more and more to the product to increase the value customers can get from each of these industry accepted products:

  • Consul is the gold standard for service-discovery, and we have recently added a native Consul check-type that makes cluster management of services a snap.
  • Nomad is a performant, robust, and datacenter-aware scheduler with native Vault integration.
  • Vault can be used to secure, store, and control access to secrets in a SOA.

Each of these products utilizes our circonus-gometrics library. When enabled, Circonus-Gometrics automatically creates numerous checks and automatically enables metrics for all the available telemetry (automatically creating either histogram, text, or numeric metrics, given the telemetry stream). Users can now monitor these tools from a single instance, and have a unified lifecycle management framework for both infrastructure and application monitoring. In particular, how do you address the emergent DevOps pattern of separating the infrastructure management from the running of applications? Enter Terraform. With help from HashiCorp, we began an R&D experiment to investigate the next step and see what was the feasibility of unifying these two axes of organizational responsibility. Here are some of the things that we’ve done over the last several months:

  • Added per metric activation and (as importantly) deactivation, while still maintaining the history of the metric.
  • Simplified the ability to view 100’s of clients, or 1000’s of allocations as a whole (via histograms), or to monitor and visualize a single client, server, or allocation.
  • Automatically show outliers within a group of metrics (i.e. identify metrics which don’t look like the others).
  • Reduced the friction associated with deploying and monitoring applications in an “application owner”-centric view of the world.
  • These features and many more, the fruit of expert insights, are what we’ve built into the product, and more will be rolled out in the coming months.

    Example of a Circonus Cluster definition:

    variable "consul_tags" {
      type = "list"
      default = [ "app:consul", "source:consul" ]
    }
    
    resource "circonus_metric_cluster" "catalog-service-query-tags" {
      name        = "Aggregate Consul Catalog Queries for Service Tags"
      description = "Aggregate catalog queries for Consul service tags on all consul servers"
    
      query {
        definition = "consul`consul`catalog`service`query-tag`*"
        type       = "average"
      }
    
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    Then merge these into a histogram:

    resource "circonus_check" "query-tags" {
      name   = "Consul Catalog Query Tags (Merged Histogram)"
      period = "60s"
      collector {
        id = "/broker/1490"
      }
      caql {
        query = <<EOF
    search:metric:histogram("consul`consul`catalog`service`query-tag (active:1)") | histogram:merge()
    EOF
      }
      metric {
        name = "output[1]"
        tags = ["${var.consul_tags}", "subsystem:catalog"]
        type = "histogram"
        unit = "nanoseconds"
      }
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    Then add the 99th Percentile:

    resource "circonus_check" "query-tag-99" {
      name   = "Consul Query Tag 99th Percentile"
      period = "60s"
      collector {
        id = "/broker/1490"
      }
      caql {
        query = <<EOF
    search:metric:histogram("consul`consul`http`GET`v1`kv`_ (active:1)") | histogram:merge() | histogram:percentile(99)
    EOF
      }
    
      metric {
        name = "output[1]"
        tags = ["${var.consul_tags}", "subsystem:catalog"]
        type = "histogram"
        unit = "nanoseconds"
      }
    
      tags = ["${var.consul_tags}", "subsystem:catalog"]
    }
    

    And add a Graph:

    resource "circonus_graph" "query-tag" {
      name        = "Consul Query Tag Overview"
      description = "The per second histogram of all Consul Query tags metrics (with 99th %tile)"
      line_style  = "stepped"
    
      metric {
        check       = "${circonus_check.query-tags.check_id}"
        metric_name = "output[1]"
        metric_type = "histogram"
        axis        = "left"
        color       = "#33aa33"
        name        = "Query Latency"
      }
      metric {
        check       = "${circonus_check.query-tag-99.check_id}"
        metric_name = "output[1]"
        metric_type = "histogram"
        axis        = "left"
        color       = "#caac00"
        name        = "TP99 Query Latency"
      }
    
      tags = ["${var.consul_tags}", "owner:team1"]
    }
    

    And you get this result:

    Finally, we want to be alerted if the 99th Percentile goes above 8000ms. So, we’ll create the contact (along with SMS, we can use Slack, OpsGenie, PagerDuty, VictorOps, or email):

    resource "circonus_contact_group" "consul-owners-escalation" {
      name = "Consul Owners Notification"
      sms {
        user  = "${var.alert_sms_user_name}"
      }
      email {
        address = "consul-team@example.org"
      }
      tags = [ "${var.consul_tags}", "owner:team1" ]
    }
    

    And then define the rule:

    resource "circonus_rule_set" "99th-threshhold" {
      check       = "${circonus_check.query-tag-99.check_id}"
      metric_name = "output[1]"
      notes = <<EOF
    Query latency is high, take corrective action.
    EOF
      link = "https://www.example.com/wiki/consul-latency-playbook"
      if {
        value {
          max_value = "8000" # ms
        }
        then {
          notify = [
            "${circonus_contact_group.consul-owners-escalation.id}",
          ]
          severity = 1
        }
      }
      tags = ["${var.consul_tags}", "owner:team1"]
    }
    

    With a little copy and paste, we can do exactly the same for all the other metrics in the system.

    Note that the original metric was automatically created when consul was deployed, and you can do the same thing with any number of other numeric data points, or do the same with native histogram data (merge all the histograms into a combined histogram and apply analytics across all your consul nodes).

    We also have the beginnings of a sample set of implementations here, which builds on the sample Consul, Nomad, & Vault telemetry integration here.

     

    Learn More about Terraform

    Circonus Instrumentation Packs

    In our Circonus Labs public github repo, we have started a project called Circonus Instrumentation Packs, or CIP. This is a series of libraries to make it even easier to submit telemetry data from your application.

    Currently there are CIP directories for gojava,  and node.js. Each separate language directory has useful resources to help instrument applications written in that language.

    Some languages have a strong leaning toward frameworks, while others are about patterns, and still others are about tooling. These packs are intended to “meld in” with the common way of doing things in each language, so that developer comfort is high and integration time and effort are minimal.

    Each of these examples utilize the HTTP Trap check, which you can create within Circonus. Simply create a new JSON push (HTTPTrap) check in Circonus using the HTTPTRAP broker, and then the CheckID, UUID and secret will be available on the check details page.

    HTTPTrap uuid-secret
    CHECKID / UUID / Secret Example

    This can be done via the user interface or via the API. The “target” for the check does not need to be an actual hostname or IP address; the name of your service might be a good substitute.

    We suggest that you use a different trap for different node.js apps, as well as for production, staging, and testing.

    Below is a bit more detail on each of the currently available CIPs:

    Java

    Java has a very popular instrumentation library called “metrics,” originally written by Coda Hale and later adopted by Dropwizard. Metrics has some great ideas that we support whole-heartedly; in particular, the use of histograms for more insightful reporting. Unfortunately, the way these measurements are captured and reported makes calculating service level agreements and other such analytics impossible. Furthermore, the implementations of the underlying histograms (Reservoirs in metrics-terminology) are opaque to the reporting tools. The Circonus metrics support in this CIP is designed to layer (non-disruptively) on top of the Dropwizard metrics packages.

    Go

    This library supports named counters, gauges, and histograms. It also provides convenience wrappers for registering latency instrumented functions with Go’s built-in http server.

    Initializing only requires you set the AuthToken (which you generate in your API Tokens page) and CheckId, and then “Start” the metrics reporter.

    You’ll need two github repos:

    Here is the sample code (also found in the circonus-gometrics readme):

    [slide]package main
    import (
     "fmt"
     "net/http"
     metrics "github.com/circonus-gometrics"
    )
    func main() {
    // Get your Auth token at https://login.circonus.com/user/tokens
     metrics.WithAuthToken("cee5d8ec-aac7-cf9d-bfc4-990e7ceeb774")
    // Get your Checkid on the check details page
     metrics.WithCheckId(163063)
     metrics.Start()
    http.HandleFunc("/", metrics.TrackHTTPLatency("/", func(w http.ResponseWriter, r *http.Request) {
     fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
     }))
     http.ListenAndServe(":8080", http.DefaultServeMux)
    }

    After you start the app (go run the_file_name.go), load http://localhost:8080 in your broswer, or curl http://localhost:8080. You’ll need to approve access to the API Token (if it is the first time you have used it), and then you can create a graph (make sure you are collecting histogram data) and you’ll see something like this:

    go-httptrap-histogram-example

    Node.js

    This instrumentation pack is designed to allow node.js applications to easily report telemetry data to Circonus using the UUID and Secret (instead of an API Token and CheckID). It has special support for providing sample-free (100% sampling) collection of service latencies for submission, visualization, and alerting to Circonus.

    Here is a basic example to measure latency:

    First, some setup – making the app;

    % mkdir restify-circonus-example
    % cd restify-circonus-example
    % npm init .

    (This defaults to npm init . works fine.) Then:

    % npm install --save restify
    % npm install --save probability-distributions
    % npm install --save circonus-cip

    Next, edit index.js and include:

    var restify = require('restify'),
     PD = require("probability-distributions"),
     circonus_cip = require('circonus-cip')
    var circonus_uuid = '33e894e6-5b94-4569-b91b-14bda9c650b1'
    var circonus_secret = 'ssssssssh_its_oh_so_quiet'
    var server = restify.createServer()
    server.on('after', circonus_cip.restify(circonus_uuid, circonus_secret))
    server.get('/', function (req, res, next) {
     setTimeout(function() {
     res.writeHead(200, { 'Content-Type': 'text/plain' });
     //res.write("Hello to a new world of understanding.\n");
     res.end("Hello to a new world of understanding.\n");
     next();
     }, PD.rgamma(1, 3, 2) * 200);
    })
    
    server.listen(8888)

    Now just start up the app:

    node index.js

    Then go to your browser and load localhost:8888, or at the prompt curl http:localhost:8888.

    You’ll then go and create the graph in your account. Make sure to enable collection of the metric – “… httptrap: restify `GET `/ `latency…” as a histogram, and you’ll end up with a graph like this:

    The Restify Histogram graph

    Let us know what you think, and more examples and languages will follow.
    Community participation is encouraged, and feedback of any kind is more than welcome:
    If you want a demo, or have a specific question, we’re happy to work with you.

    Contact Me

    To learn more about Circonus, click below:

    The New Grid View – Instant Graphs on the Metrics Page

    We just added a new feature to the UI which displays a graph for every metric in your account.

    While the previous view (now called List View) did show a graph for each metric, these graphs were hidden by default. The new Grid View now shows a full page a graphs, one for each metric. You can easily switch between Grid and List views as needed.

    These screenshots show, from left to right, the old list view, the new layout options menu, and the new grid view.

    The grid-style layout provides you with an easy way to view a graph for each metric in the list. It lets you click-to-select as many metrics as you want and easily create a graph out of them.

    You can also:

    • Choose from 3 layouts with different graph sizes.
    • Define how the titles are displayed.
    • Hover over a graph to see the metric value.
    • Play any number of graphs to get real-time data.

    We hope this feature is as useful to you as it has been to us. More information is available in our user documentation and below is a short video showing off some of these features:

     

     

    ACM – Testing a Distributed System

    I want to sing the praises of one of our lead engineers, Phil Maddox, for authoring a very interesting paper, Testing a Distributed System, which was published in Communications of the ACM, Vol. 58 No. 9.

    A brief excerpt follows:

    082015_CACMpg55_Testing-a-Distributed.large

    “Distributed systems can be especially difficult to program for a variety of reasons. They can be difficult to design, difficult to manage, and, above all, difficult to test. Testing a normal system can be trying even under the best of circumstances, and no matter how diligent the tester is, bugs can still get through. Now take all of the standard issues and multiply them by multiple processes written in multiple languages running on multiple boxes that could potentially all be on different operating systems, and there is potential for a real disaster.

    Individual component testing, usually done via automated test suites, certainly helps by verifying that each component is working correctly. Component testing, however, usually does not fully test all of the bits of a distributed system. Testers need to be able to verify that data at one end of a distributed system makes its way to all of the other parts of the system and, perhaps more importantly, is visible to the various components of the distributed system in a manner that meets the consistency requirements of the system as a whole.”

    AWS Cloudwatch Support

    This month we pushed native Cloudwatch support – any metric that you have in Cloudwatch can now be added to graphs and dashboards, and alerts can be created for them.

    • Auto Scaling (AWS/AutoScaling)
    • AWS Billing (AWS/Billing)
    • Amazon DynamoDB (AWS/DynamoDB)
    • Amazon ElastiCache (AWS/ElastiCache)
    • Amazon Elastic Block Store (AWS/EBS)
    • Amazon Elastic Compute Cloud (AWS/EC2)
    • Elastic Load Balancing (AWS/ELB)
    • Amazon Elastic MapReduce (AWS/ElasticMapReduce)
    • AWS OpsWorks (AWS/OpsWorks)
    • Amazon Redshift (AWS/Redshift)
    • Amazon Relational Database Service (AWS/RDS)
    • Amazon Route 53 (AWS/Route53)
    • Amazon Simple Notification Service (AWS/SNS)
    • Amazon Simple Queue Service (AWS/SQS)
    • AWS Storage Gateway (AWS/StorageGateway)

    Overview

    This check monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real-time. You can use the CloudWatch Check to collect and track metrics, which are the variables you want to measure for your resources and applications.

    From the CloudWatch Check, you can set alerts within Circonus to send notifications, allowing you to make changes to the resources within AWS.  For example, you can monitor the CPU usage and disk reads and writes of your Amazon Elastic Compute Cloud (Amazon EC2) instances and then use this data to determine whether you should launch additional instances to handle increased load. You can also use this data to stop under-utilized instances to save money. With the CloudWatch Check, you gain system-wide visibility into resource utilization, application performance, and operational health.

    Circonus takes the AWS Region, API Key, and API Secret, then polls the endpoint (AWS) for a list of all available Namespaces, Metrics, and Dimensions that are specific to the user (AWS Region, API Key, and API Secret combination). Only those returned are displayed in the fields. The names that are displayed under each Dimension type (for example: Volume for EBS) are all instances running this Dimension type and have detailed monitoring enabled.

    For information on the master list of Namespace, Metric, and Dimension names available and additional information on Cloudwatch in general, see AWS’s Cloudwatch documentation.