Monitoring as Code:

Terraform integration with Circonus

Terraform is a tool from HashiCorp for building, changing, and versioning infrastructure, which can be used to manage a wide variety of popular and custom service providers. This product enables application owners to create a higher-level abstraction of the application, datacenter, and associated services, and present this information back to the rest of the organization in a consistent way.

Now, Terraform 0.9 includes an integration for managing a Circonus account.

With the Circonus integration, any Terraform-provisioned resource that can be monitored can be referenced such that there are no blind spots or exceptions. Additionally, with Circonus it is possible to declaratively codify your application’s monitoring, alerting, and escalation, as well as the resources it runs on. This integration enables companies running on either public clouds or in private data centers to programmatically manage their monitoring infrastructure.
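
As a minimal sketch of what enabling the integration might look like, the provider block below passes a Circonus API token to Terraform. The variable name `circonus_api_key` is an assumption made for illustration; the token itself would come from your own Circonus account and is best kept out of version control.

```hcl
# Hypothetical variable name; supply the value via a tfvars file or
# the TF_VAR_circonus_api_key environment variable.
variable "circonus_api_key" {
  type        = "string"
  description = "API token for the Circonus account"
}

# Configure the Circonus provider with that API token.
provider "circonus" {
  key = "${var.circonus_api_key}"
}
```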

These are a few key features of the Circonus Provider in Terraform:

    • Monitoring as Code

      – Alongside Infrastructure as Code. Monitoring (i.e. what to monitor, how to visualize, and when to alert) is described using the same high-level configuration syntax used to describe infrastructure. This allows a blueprint of your datacenter, as well as your business rules, to be versioned and treated as you would any other code. Additionally, monitoring can be shared and reused.

    • Execution Plans

      – Terraform has a “planning” step where it generates an execution plan of what will be monitored, visualized, and alerted on.

    • Resource Graphs

      – Terraform builds a graph of all your resources, which can now include how those resources are monitored, and parallelizes the creation and modification of any non-dependent resources.

    • Change Automation

      – Complex changesets can be applied to your infrastructure, and metrics, visualizations, or alerts can all be created, deactivated, or deleted with minimal human interaction.

Change Automation

Change Automation is perhaps the most powerful feature of the Circonus Provider in Terraform. Allocations and services can come and go (within a few seconds or a few days), and the monitoring of each resource dynamically updates accordingly.

At Circonus, our first integration with some of the underlying technologies that enable a modern service-oriented architecture (SOA) came at HashiConf 2016. At that time we had nascent integrations with Consul, Nomad, and Vault, but in the intervening months we have added more and more to the product to increase the value customers can get from each of these industry-accepted products:

  • Consul is the gold standard for service discovery, and we have recently added a native Consul check-type that makes cluster management of services a snap.

  • Nomad is a performant, robust, and datacenter-aware scheduler with native Vault integration.

  • Vault can be used to secure, store, and control access to secrets in a SOA.

Each of these products utilizes our circonus-gometrics library. When enabled, circonus-gometrics automatically creates numerous checks and enables metrics for all the available telemetry (creating histogram, text, or numeric metrics as appropriate for each telemetry stream). Users can now monitor these tools from a single instance, and have a unified lifecycle management framework for both infrastructure and application monitoring.

In particular, how do you address the emergent DevOps pattern of separating the infrastructure management from the running of applications? Enter Terraform. With help from HashiCorp, we began an R&D experiment to investigate the next step and assess the feasibility of unifying these two axes of organizational responsibility. Here are some of the things that we’ve done over the last several months:

  • Added per metric activation and (as importantly) deactivation, while still maintaining the history of the metric.
  • Simplified the ability to view hundreds of clients or thousands of allocations as a whole (via histograms), or to monitor and visualize a single client, server, or allocation.
  • Automatically show outliers within a group of metrics (i.e. identify metrics which don’t look like the others).
  • Reduced the friction associated with deploying and monitoring applications in an “application owner”-centric view of the world.

These features and many more, the fruit of expert insights, are what we’ve built into the product, and more will be rolled out in the coming months.

By copying and pasting the examples below, we can do exactly the same for all the other metrics in the system.

Note that the original metric was automatically created when Consul was deployed, and you can do the same thing with any number of other numeric data points, or do the same with native histogram data (merge all the histograms into a combined histogram and apply analytics across all your Consul nodes).

We also have the beginnings of a sample set of implementations here, which builds on the sample Consul, Nomad, & Vault telemetry integration here.

Example of a Circonus Cluster definition

variable "consul_tags" {
  type = "list"
  default = [ "app:consul", "source:consul" ]
}
resource "circonus_metric_cluster" "catalog-service-query-tags" {
  name        = "Aggregate Consul Catalog Queries for Service Tags"
  description = "Aggregate catalog queries for Consul service tags on all consul servers"
  query {
    definition = "consul`consul`catalog`service`query-tag`*"
    type       = "average"
  }
  tags = ["${var.consul_tags}", "subsystem:catalog"]
}

Then merge these into a histogram

resource "circonus_check" "query-tags" {
  name   = "Consul Catalog Query Tags (Merged Histogram)"
  period = "60s"
  collector {
    id = "/broker/1490"
  }
  caql {
    query = <<EOF
search:metric:histogram("consul`consul`catalog`service`query-tag (active:1)") | histogram:merge()
EOF
  }
  metric {
    name = "output[1]"
    tags = ["${var.consul_tags}", "subsystem:catalog"]
    type = "histogram"
    unit = "nanoseconds"
  }
  tags = ["${var.consul_tags}", "subsystem:catalog"]
}

Then add the 99th Percentile:

resource "circonus_check" "query-tag-99" {
  name   = "Consul Query Tag 99th Percentile"
  period = "60s"
  collector {
    id = "/broker/1490"
  }
  caql {
    query = <<EOF
search:metric:histogram("consul`consul`http`GET`v1`kv`_ (active:1)") | histogram:merge() | histogram:percentile(99)
EOF
  }
  metric {
    name = "output[1]"
    tags = ["${var.consul_tags}", "subsystem:catalog"]
    type = "histogram"
    unit = "nanoseconds"
  }
  tags = ["${var.consul_tags}", "subsystem:catalog"]
}

And add a Graph:

resource "circonus_graph" "query-tag" {
  name        = "Consul Query Tag Overview"
  description = "The per second histogram of all Consul Query tags metrics (with 99th %tile)"
  line_style  = "stepped"
  metric {
    check       = "${circonus_check.query-tags.check_id}"
    metric_name = "output[1]"
    metric_type = "histogram"
    axis        = "left"
    color       = "#33aa33"
    name        = "Query Latency"
  }
  metric {
    check       = "${circonus_check.query-tag-99.check_id}"
    metric_name = "output[1]"
    metric_type = "histogram"
    axis        = "left"
    color       = "#caac00"
    name        = "TP99 Query Latency"
  }
  tags = ["${var.consul_tags}", "owner:team1"]
}

And you get this result:

Finally, we want to be alerted if the 99th Percentile goes above 8000ms.

So, we’ll create the contact group (in addition to SMS, we can use Slack, OpsGenie, PagerDuty, VictorOps, or email):

resource "circonus_contact_group" "consul-owners-escalation" {
  name = "Consul Owners Notification"
  sms {
    user  = "${var.alert_sms_user_name}"
  }
  email {
    address = "consul-team@example.org"
  }
  tags = ["${var.consul_tags}", "owner:team1"]
}
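
For teams that prefer chat notifications, a contact group could target Slack instead. The following is only a sketch: the resource name, channel, and team ID are placeholders invented for illustration, and the attribute names inside the `slack` block are recalled from the provider documentation and should be verified against it before use.

```hcl
# Sketch only: a second contact group that notifies a Slack channel.
# The channel and team values are placeholders, and the attribute names
# inside the slack block should be checked against the provider docs.
resource "circonus_contact_group" "consul-owners-slack" {
  name = "Consul Owners Slack Notification"
  slack {
    channel = "#consul-alerts"
    team    = "T00000000"
  }
  tags = ["${var.consul_tags}", "owner:team1"]
}
```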

And then define the rule:

resource "circonus_rule_set" "99th-threshhold" {
  check       = "${circonus_check.query-tag-99.check_id}"
  metric_name = "output[1]"
  notes = <<EOF
Query latency is high, take corrective action.
EOF
  link = "https://www.example.com/wiki/consul-latency-playbook"
  if {
    value {
      max_value = "8000" # ms
    }
    then {
      notify = [
        "${circonus_contact_group.consul-owners-escalation.id}",
      ] 
      severity = 1
    }
  }
  tags = ["${var.consul_tags}", "owner:team1"]
}
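
The contact group above reads `var.alert_sms_user_name`, which none of the snippets declare. A minimal declaration might look like the following; the description text is an assumption, and the value would be supplied when running a plan or apply.

```hcl
# Declares the input referenced by the contact group's sms block.
# No default is given, so Terraform asks for a value unless one is
# provided via -var, a tfvars file, or TF_VAR_alert_sms_user_name.
variable "alert_sms_user_name" {
  type        = "string"
  description = "Circonus user name that should receive SMS alerts"
}
```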