Using CAQL to Identify Hosts with Top CPU Usage

A common task when monitoring infrastructure is identifying the top resource consumers. Although the following techniques apply to many different resource metrics, we will look specifically at identifying which of our hosts or services are consuming the most CPU.

To begin, we first need to create a new graph and add a CAQL (Circonus Analytics Query Language) Datapoint as explained in the Getting Started guide. Make sure you’re in edit mode and go to the “Options” menu which is under the “tuner” icon in the upper right corner of the screen. There, change the “Graph Fill Type” to “Line”.

Method 1: Simple find() with Mouse Selection

In the simplest case, when we are only monitoring a few dozen hosts, we can accomplish our task using a simple CAQL query to find and output all “cpu`idle” metrics in our account, like so:

find:counter("cpu`idle")

We enter that statement into the CAQL datapoint, and the graph now renders each of our cpu`idle metrics as a line series. If we flip the graph to view mode, we can hold down the “s” key and hover over individual lines to highlight and select particular metrics to focus on. If we highlight the lines with the lowest CPU idle percentages, we’re effectively highlighting the ones with the highest CPU utilization.

Method 2: Improved Labels and Values

We can immediately see two shortcomings with the first method:

The legend labels only include the metric name, which is not helpful.
We are interested in the CPU utilization, not the cpu`idle percentage.

Let’s flip the graph back to edit mode and update the CAQL statement to address these shortcomings. Here is our new, improved CAQL statement:

find:counter("cpu`idle")
| each:neg() | each:add(100)
| label("%tv{__check_target} cpu utilization")

The functions | each:neg() | each:add(100) apply the transformation x --> -x + 100 to each stream, which converts the idle percentages into utilization percentages by subtracting them from 100. For example, a host that is 85% idle is 15% utilized.
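To make the arithmetic concrete, here is a minimal Python sketch of the same per-stream transformation. It is not CAQL and does not talk to Circonus; the host names and sample values are made up for illustration, and to_utilization is a hypothetical helper that only mirrors what each:neg() followed by adding 100 does to every sample.

```python
# Hypothetical idle percentages per host (illustrative values, not real data).
idle_by_host = {
    "web1": [90.0, 85.0, 40.0],
    "db1": [20.0, 25.0, 30.0],
}


def to_utilization(idle_samples):
    """Apply x -> -x + 100 to every sample, like each:neg() then adding 100."""
    return [-x + 100 for x in idle_samples]


# Transform each stream independently, the way the each:* functions do.
utilization_by_host = {
    host: to_utilization(samples) for host, samples in idle_by_host.items()
}

for host, samples in utilization_by_host.items():
    print(host, samples)
```

Each stream is transformed on its own, so the number of streams and their order are unchanged; only the values flip from idle to utilization.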

The function | label("%tv{__check_target} cpu utilization") adds a label to each stream, consisting of the check target (host name) followed by the string “cpu utilization”.

This is much better. The stream labels are actually useful, and we’re looking at actual CPU utilization instead of the idle percentages.

Method 3: Using top() to Filter Results

The second method gives us what we want, but it becomes impractical when too many results are returned. If we’re monitoring hundreds of hosts, we cannot manually display and visually sort through all of them at once. In order to narrow down the results to the most relevant metrics we can use the top() function, as follows:

find:counter("cpu`idle")
| each:neg() | each:add(100) // convert to utilization
| top(5, method="max")         // filter top results
| label("%tv{__check_target} cpu utilization")

We added the method="max" parameter because we are interested in the highest CPU spikes. If we are instead interested in the highest average CPU utilization, we could have left this parameter out or set it to "mean".
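The difference between ranking by "max" and ranking by "mean" can be sketched in plain Python. This is only a model of the idea, under the assumption that each stream is summarized to a single number and the k largest are kept; the top function below is a hypothetical stand-in, not the CAQL implementation, and the data is invented.

```python
from statistics import mean

# Hypothetical utilization percentages per host (illustrative values).
utilization_by_host = {
    "web1": [10.0, 15.0, 95.0],   # mostly quiet, one short spike
    "db1": [60.0, 65.0, 70.0],    # steadily busy
    "cache1": [5.0, 5.0, 5.0],    # idle
}


def top(streams, k, method="mean"):
    """Return the k stream names with the largest summary value."""
    summarize = {"max": max, "mean": mean}[method]
    ranked = sorted(streams, key=lambda name: summarize(streams[name]), reverse=True)
    return ranked[:k]


top_by_max = top(utilization_by_host, 1, method="max")
top_by_mean = top(utilization_by_host, 1)  # method defaults to "mean"

print("by max: ", top_by_max)
print("by mean:", top_by_mean)
```

With this data, ranking by "max" surfaces the host with the short spike, while ranking by "mean" surfaces the host that is busy throughout, which is the trade-off described above.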

After adding the top() function to the CAQL statement, the graph shows only the five streams that top() selected. Now we can easily pick out the top CPU consumers in our account without being distracted by all the other data.

Method 4: Group By Service Tag

In some cases, particularly when we’re monitoring hundreds of hosts, we might not be interested in individual hosts, but instead want to identify services provided by groups of hosts that have a high CPU utilization. In this case, we want to aggregate the metrics using tags to identify what service or component they belong to. We will use the group_by:* functions to aggregate the data by tag, like so:

find:counter("cpu`idle")
| group_by:mean("component")
| label("%tv{component} / %n")

This updated CAQL statement averages the cpu`idle percentage over the “component” tag. Metrics that have the same value for the “component” tag are grouped together and averaged. If a metric does not have the “component” tag, it is put into the category “uncategorized”.
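The grouping behavior described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the CAQL implementation: streams are represented as (tags, samples) pairs with invented values, group_by_mean is a hypothetical helper, and all streams are assumed to share the same sample timestamps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cpu`idle streams as (tags, samples) pairs (illustrative values).
streams = [
    ({"component": "web"}, [90.0, 80.0]),
    ({"component": "web"}, [70.0, 60.0]),
    ({"component": "db"}, [30.0, 20.0]),
    ({}, [50.0, 50.0]),  # no "component" tag at all
]


def group_by_mean(streams, tag):
    """Group streams by a tag value and average them sample by sample."""
    groups = defaultdict(list)
    for tags, samples in streams:
        # Streams without the tag fall into the "uncategorized" group.
        groups[tags.get(tag, "uncategorized")].append(samples)
    return {
        key: [mean(column) for column in zip(*members)]
        for key, members in groups.items()
    }


idle_by_component = group_by_mean(streams, "component")

for component, samples in idle_by_component.items():
    print(component, samples)
```

The output has one stream per distinct tag value rather than one per host, which is what keeps the graph readable when hundreds of hosts are involved.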

Hint: It’s possible to group by multiple dimensions by supplying multiple parameters to the group_by function, e.g. group_by:mean("component", "datacenter")

For the label, we chose to select the value of the “component” tag by using %tv{component} (“tv” stands for “tag value”), and then append the metric name itself by using %n.

Hint: If you want to show all tags which vary across the output results, use the label pattern %t-{*}.

Once we flip the graph to view mode, we see one averaged cpu`idle line for each value of the “component” tag.

The Power of CAQL

As you can see, CAQL can be easily tailored and extended to fit the use-case at hand. This flexibility extends far beyond checking CPU utilization. From searching and manipulating data streams to limiting and aggregating the results, CAQL lets you efficiently harness your data, optimizing both performance and business intelligence.