Less Toil, More Coil – Telemetry Analysis with Python

This is a frequent request we hear from many customers:

“How can I analyze my data with Python?”

The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of advanced data analytics capabilities. Therefore, seamless integration with this environment is important for our customers who want to make use of those tools.

Circonus has for a long time provided Python bindings for its API. With these bindings, you can configure the account, create graphs and dashboards, etc. However, fetching data and getting it into the right format involves multiple steps and was not easy to get right.

We are now pleased to announce that this has changed. We have just added new capabilities to our Python bindings that allow you to fetch and analyze data more effectively. Here is how to use them.

Quick Tour

Connecting to the API

You need an API token to connect to the API. You can create one using the UI under Integrations > API Tokens. In the following we assume the variable api_token holds a valid API token for your account.

from circonusapi import circonusdata
circ = circonusdata.CirconusData(api_token)

Searching for Metrics

The first thing we can do is search for some metrics:

>>> M = circ.search('(metric:duration)', limit=10)

The returned object extends the list class and can be manipulated like any list object.
We override the __str__ method, so that printing the list gives a table representation of the fetched metrics:

>>> print(M)

check_id   type       metric_name
--------------------------------------------------
195902     numeric    duration
218003     numeric    duration
154743     numeric    duration
217833     numeric    duration
217834     numeric    duration
218002     numeric    duration
222857     numeric    duration
222854     numeric    duration
222862     numeric    duration
222860     numeric    duration

Metric lists provide a .fetch() method that can be used to fetch data. Fetches are performed serially, one metric at a time, so the retrieval can take some time. We will later see how to parallelize fetches with CAQL.

R = M.fetch(
    start=datetime(2018,1,1), # start at Midnight UTC 2018-01-01
    period=60,                # return 60 second (=1min) aggregates
    count=180,                # return 180 samples
    kind="value"              # return (mean-)value aggregate
)

The resulting object is a dict that maps metric names to the fetched data. This is designed in such a way that it can be passed directly to a pandas DataFrame constructor.

import pandas as pd
df = pd.DataFrame(R)

# [OPTIONAL] Make the DataFrame aware of the time column
df['time']=pd.to_datetime(df['time'],unit='s')
df.set_index('time', inplace=True)

df.head()
time			154743/duration	195902/duration	217833/duration	217834/duration	218002/duration	218003/duration	222854/duration	222857/duration	222860/duration	222862/duration
2018-01-01 00:00:00	1		4		1		1		1		1		12		11		12		1
2018-01-01 00:01:00	1		2		1		1		2		1		11		12		12		1
2018-01-01 00:02:00	1		2		1		1		1		1		12		12		11		1
2018-01-01 00:03:00	1		2		1		1		1		1		12		11		12		1
2018-01-01 00:04:00	1		2		1		1		1		1		12		11		11		1

Data Analysis with pandas

Pandas makes common data analysis methods very easy to perform. We start by computing some summary statistics:

df.describe()
154743/duration	195902/duration	217833/duration	217834/duration	218002/duration	218003/duration	222854/duration	222857/duration	222860/duration	222862/duration
count	180.000000	180.000000	180.0		180.000000	180.000000	180.000000	180.000000	180.000000	180.00000	180.000000
mean	1.316667	2.150000	1.0		1.150000	1.044444	1.177778	11.677778	11.783333	11.80000	1.022222
std	1.642573	0.583526	0.0		1.130951	0.232120	0.897890	0.535401	0.799965	0.89941 	0.181722
min	1.000000	1.000000	1.0		1.000000	1.000000	1.000000	11.000000	11.000000	11.00000	1.000000
25%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	11.000000	11.000000	11.00000	1.000000
50%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	12.000000	12.000000	12.00000	1.000000
75%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	12.000000	12.000000	12.00000	1.000000
max	15.000000	4.000000	1.0		12.000000	3.000000	9.000000	13.000000	17.000000	16.00000	3.000000

Here is a plot of the dataset over time:

from matplotlib import pyplot as plt
ax = df.plot(style=".",figsize=(20,5),legend=False, ylim=(0,20), linewidth=0.2)

We can also summarize the individual distributions as box plots:

ax = df.plot(figsize=(20,5),legend=False, ylim=(0,20), kind="box")
ax.figure.autofmt_xdate(rotation=-20,ha="left")

Working with Histogram Data

Histogram data can be fetched using the kind="histogram" parameter to fetch(). Numeric metrics will be converted to histograms. Histograms are represented as libcircllhist objects, which have very efficient methods for the most common histogram operations (mean, quantiles).

MH = circ.search("api`GET`/getState", limit=1)
print(MH)
check_id   type       metric_name
--------------------------------------------------
160764     histogram  api`GET`/getState                                 

Let’s fetch the 1h latency distributions of this API for the timespan of one day:

RH = MH.fetch(datetime(2018,1,1), 60*60, 24, kind="histogram")
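
Each entry in the returned dict is a list of libcircllhist objects, one per hour. As a quick, minimal sketch, here is how to inspect a single histogram with the mean() and quantile() methods (the same methods are used in the pandas example further below):

H0 = RH['160764/api`GET`/getState'][0]   # first hourly histogram

print("mean latency:  ", H0.mean())          # mean of the distribution
print("median latency:", H0.quantile(0.5))   # 50th percentile
print("p99 latency:   ", H0.quantile(0.99))  # 99th percentile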

We can plot the resulting histograms with a little helper function:

fig = plt.figure(figsize=(20, 5))
for H in RH['160764/api`GET`/getState']:
    circllhist_plot(H, alpha=0.2)
ax = fig.get_axes()
ax[0].set_xlim(0,100)

The output is a figure with the 24 hourly latency histograms overlaid (the last line echoes the return value of set_xlim, (0, 100)).
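
The circllhist_plot helper itself is part of the notebook linked at the end of this post. For illustration, here is a minimal sketch of a similar helper; unlike the real one, it takes the histogram buckets as plain (bin midpoint, bin width, count) triples rather than reading them from the libcircllhist object:

from matplotlib import pyplot as plt

def circllhist_plot_sketch(buckets, alpha=1.0, ax=None):
    # buckets: iterable of (bin midpoint, bin width, sample count) triples
    ax = ax or plt.gca()
    for mid, width, count in buckets:
        ax.bar(mid, count, width=width, align="center", alpha=alpha, edgecolor="none")
    ax.set_xlabel("latency (ms)")
    ax.set_ylabel("samples")
    return ax

# Example with made-up buckets: most samples near 1ms, a small tail near 100ms
circllhist_plot_sketch([(1.0, 0.1, 10), (10.0, 1.0, 5), (100.0, 10.0, 1)])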

Again, we can directly import the data into a pandas data frame, and perform some calculations on the data:

dfh = pd.DataFrame(RH)

# [OPTIONAL] Make the DataFrame aware of the time column
dfh['time']=pd.to_datetime(dfh['time'],unit='s')
dfh.set_index('time', inplace=True)
dfh['p99'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.99))
dfh['p90'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.9))
dfh['p95'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.95))
dfh['p50'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.5))
dfh['mean'] = dfh.iloc[:,0].map(lambda h: h.mean())
dfh.head()
time             	160764/api`GET`/getState				p99		p90		p95		p50		mean
2018-01-01 00:00:00	{"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ...	112.835714	112.835714	112.835714	11.992790	15.387013
2018-01-01 01:00:00	{"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ...	114.961628	114.961628	114.961628	16.567822	19.542284
2018-01-01 02:00:00	{"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,...	118.124324	118.124324	118.124324	20.556859	24.012226
2018-01-01 03:00:00	{"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,...	427.122222	427.122222	427.122222	20.827982	37.040173
2018-01-01 04:00:00	{"+40e-002": 6, "+50e-002": 26, "+59e-002": 15...	496.077778	496.077778	496.077778	23.247373	40.965517

The CAQL API

Circonus comes with a wide range of data analysis capabilities that are integrated into the Circonus Analytics Query Language, CAQL.

CAQL provides highly efficient data fetching operations that allow you to process multiple metrics at the same time. Also, by performing the computation close to the data, you save time and bandwidth.

To get started, we search for duration metrics, like we did before, using CAQL:

A = circ.caql('search:metric("duration")', datetime(2018,1,1), 60, 5000)
dfc = pd.DataFrame(A)
dfc.head()
		output[0]	output[10]	output[11]	output[12]	output[13]	output[14]	output[15]	output[16]	output[17]	output[18]	...	output[21]	output[2]	output[3]	output[4]	output[5]	output[6]	output[7]	output[8]	output[9]	time
0		4		12		1		1		2		1		1		1		11		1		...		1		1		1		1		1		1		11		12		1	1514764800
1		2		12		1		1		1		1		1		1		11		1		...		1		1		1		1		1		2		12		11		1	1514764860
2		2		11		1		1		2		1		1		1		12		1		...		1		1		1		1		1		1		12		12		1	1514764920
3		2		12		1		1		2		1		1		1		12		1		...		1		1		1		1		1		1		11		12		1	1514764980
4		2		11		1		1		2		1		1		1		11		1		...		1		1		1		1		1		1		11		12		1	1514765040
5 rows × 23 columns

This API call fetched 1000 samples from 22 metrics, and completed in just over 1 second. The equivalent circ.search().fetch() statement would have taken around one minute to complete.

One drawback of CAQL fetching is that the metric names are not preserved in the output (the columns are simply labeled output[0], output[1], and so on). We are working on resolving this shortcoming.

To showcase some of the analytics features, we’ll now use CAQL to compute a rolling mean over the second largest duration metric in the above cluster, and plot the transformed data using pandas:

B = circ.caql("""

search:metric("duration") | stats:trim(1) | stats:max() | rolling:mean(10M)

""", datetime(2018,1,1), 60, 1000)
df = pd.DataFrame(B)
df['time']=pd.to_datetime(df['time'],unit='s')
df.set_index('time', inplace=True)
df.plot(figsize=(20,5), lw=.5,ylim=(0,50))

You can also fetch histogram data with circ.caql():

AH = circ.caql('search:metric:histogram("api`GET`/getState")', datetime(2018,1,1), 60*60, 24)
dfch = pd.DataFrame(AH)
dfch.head()
        output[0]						time
0	{"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ...	1514764800
1	{"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ...	1514768400
2	{"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,...	1514772000
3	{"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,...	1514775600
4	{"+40e-002": 6, "+50e-002": 26, "+59e-002": 15...	1514779200

We can perform a wide variety of data transformation tasks directly inside Circonus using CAQL expressions. This speeds up the computation even further. Another advantage is that we can leverage CAQL queries for live graphing and alerting in the Circonus UI.

In this example, we compute how many requests were serviced above certain latency thresholds:

B = circ.caql('''

search:metric:histogram("api`GET`/getState") | histogram:count_above(0,10,50,100,500,1000)

''', datetime(2018,1,1), 60*5, 24*20)
dfc2 = pd.DataFrame(B)
dfc2['time']=pd.to_datetime(dfc2['time'],unit='s')
dfc2.set_index('time', inplace=True)
dfc2.plot(figsize=(20,5), colormap="gist_heat",legend=False, lw=.5)

Conclusion

Getting Circonus data into Python has never been easier. We hope this blog post helps you get started with the new data fetching capabilities. A Jupyter notebook version of this blog post, containing the complete source code, is available here. If you run into any problems or have some suggestions, feel free to open an issue on GitHub, or get in touch on our Slack channel.

Linux System Monitoring with eBPF

The Linux kernel is a ubiquitous component of modern IT systems. It provides the critical services of hardware abstraction and time-sharing to applications. The classical metrics for monitoring Linux are among the most well-known metrics in monitoring: CPU utilization, memory usage, disk utilization, and network throughput. For a while now, Circonus installations have organized the key system metrics in the form of a USE Dashboard, as a high-level overview of the system resources.

While those metrics are clearly useful and important, there is a lot left to be wished for. Even the most basic metrics, like CPU utilization, have serious flaws (cpu-load.txt) that limit their significance. Also, there are a lot of questions for which there are simply no metrics exposed (such as disk errors and failed mallocs).

eBPF is a game-changing technology that became available in recent kernel versions (v4.1 and later). It allows subscribing to a large variety of in-kernel events (kprobes, function call tracing) and aggregating them with minimal overhead. This unlocks a wide range of meaningful, precise measurements that can help narrow the observability gap. A great and ever-growing collection of system tracing tools is provided by the bcc toolkit from iovisor.
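
To illustrate the mechanism (this is not the Circonus plugin itself), here is a minimal bcc sketch that aggregates block-I/O latencies into an in-kernel log2 histogram, in the spirit of bcc's biolatency tool. Note that the kprobe attach points (blk_account_io_start/blk_account_io_done) are kernel-version dependent:

from bcc import BPF
from time import sleep

prog = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *);   // request -> start timestamp
BPF_HISTOGRAM(dist);                 // log2 latency histogram (usecs)

int trace_start(struct pt_regs *ctx, struct request *req) {
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

int trace_done(struct pt_regs *ctx, struct request *req) {
    u64 *tsp = start.lookup(&req);
    if (tsp != 0) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));  // ns -> us
        start.delete(&req);
    }
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="blk_account_io_start", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_done")

while True:
    sleep(10)
    b["dist"].print_log2_hist("usecs")
    b["dist"].clear()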

The Circonus Monitoring Agent comes with a plugin that collects eBPF metrics using the bcc toolkit (see source code & instructions here). At the time of this writing, the plugin is supported on the Ubuntu 16.04 platform. In the following examples we will demonstrate how this information can be used.

Block-I/O Latencies

The block-I/O layer of the operating system is the interface that block devices, like disks, offer to the file system. Since it is an API, it's natural to apply the RED methodology (adapted from the SRE Book, see e.g. Tom Wilkie 2018) and monitor rate, errors, and duration. One famous example of how this information can be used is to identify environmental influences on I/O performance, as seen in Brendan Gregg – Shouting in the Datacenter (YouTube 2008). The example duration measurements can be seen in the figure below.

150M I/O events that were recorded on three disks over the period of a week.

This diagram shows a total of 150M I/O events that were recorded on three disks over the period of a week. The visualization as a stand-alone histogram allows us to qualitatively compare the access latency profiles very easily. In this case, we see that the traffic pattern is imbalanced (one disk serving less than half of the load of the others), but the latency modes are otherwise very similar, indicating good (or at least consistent) health of the disks.

The next figure shows how these requests are distributed over time.

Disk array visualized as a heatmap.

This is the latency duration profile of the disk array visualized as a heatmap. The line graphs show the p10, p50, and p90 percentiles calculated over 20-minute spans. One can see how the workload changes over time. Most of the requests were issued between Sept 10th and Sept 11th 12:00, with a median performance of around 3ms.

File System Latency

From the application perspective, the file system latencies are much more relevant than block I/O latencies. The following graphic shows the latency of the read(2) and write(2) syscalls executed over the period of a few days.

The median latency of this dataset is around 5u-sec for read and 14u-sec for write accesses. This is an order of magnitude faster than block I/O latencies and indicates that buffering and caching of file system accesses is indeed speeding things up.

Caveat: In UNIX systems, “everything is a file.” Hence the same syscalls are used to write data to all kinds of devices (sockets, pipes) and not only disks. The above metrics do not differentiate between those devices.

CPU Scheduling Latency

Everyone knows that systems become less responsive when they are overloaded. If there are more runnable processes than CPUs in the system, processes begin to queue and additional scheduling latency is introduced. The load average reported by top(1) gives you a rough idea of how many processes were queued for execution over the last few minutes on average (the reality is quite a bit more subtle). If this metric is higher than the number of CPUs, you will get "some" scheduling latency.
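
As a quick back-of-the-envelope check, you can compare the load average against the CPU count yourself; a minimal sketch (Linux only, reading /proc/loadavg):

import os

with open("/proc/loadavg") as f:
    load1 = float(f.read().split()[0])   # 1-minute load average

ncpu = os.cpu_count()
print("load1=%.2f ncpu=%d" % (load1, ncpu))
if load1 > ncpu:
    print("more runnable processes than CPUs -> expect some scheduling latency")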

But how much scheduling latency did your application actually experience?

With eBPF, you can just measure the latency of every scheduling event. The diagram below shows the latency of 17.4B scheduling events collected over a 4 week period.

The median scheduling latency (30u-sec) was very reasonable. Clearly visible are several modes, which I suppose can be attributed to processes waiting behind none, one, or two other processes in the queue. The tail of the distribution shows the collateral damage caused by periods of extreme load during the collection period. The longest scheduling delay was a severe hang of 45 seconds!

Next steps

If you want to try this out on your system, you can get a free Circonus account in a matter of minutes. Installing the Circonus agent on an Ubuntu 16.04 machine can be done with a single command. Then enable the eBPF plugin on your host by following the instructions here.

It's an ongoing effort to extend the capabilities of the eBPF plugin. Apart from the metrics shown above, the plugin also exposes rate and duration metrics for all 392 Linux system calls. There are a lot more interesting tools in iovisor/bcc waiting to be ported.

Happy Monitoring!

Circonus Update: New UI Rollout

The Circonus team is excited to announce the release of our newest update, which includes sweeping changes to the Circonus Monitoring Platform UI.

This update is part of an ongoing effort to optimize the Circonus UI for performance and usability on mobile devices, such as phones and tablets, as well as on media PCs, desktops, and laptops. Almost every single change to our familiar UI directly supports that optimization. You'll find that just about every page is now responsive. In the future we'll be continuing these efforts, and will be tackling the dashboards and check creation workflows next.

We’re also grouping some of our features to provide a more streamlined experience, by making improvements to how data is displayed, and changing how controls are displayed to be consistent throughout all of the different pages and views.

Look for these changes to go live this Thursday, April 26th!

What’s New?

The biggest change is that the displays are consistent for ALL of the different searchable items in Circonus. Hosts, metrics, checks, rule sets, graphs, worksheets, everything!

Every item has a List View that includes the search functionality, and a Details View with more information for each item in the list. All of these Details View pages will have the same familiar design, and the List View for each item will also have the same functionality. Users will still be able to select their preferred layout for Metrics and Hosts lists, whether they want their List View in a grid or an actual list.

Another significant change is that the List View and the Details View are separate. Circonus will no longer display a dropdown accordion with a Details View inside of the List View. Instead, you can use the List View to search for data and simply click the View button to visit a page specifically designed for viewing those details.

These views group many familiar Circonus features together in a dropdown menu that appears under the View button, as you can see in this Alert History list.

Many frequently used Circonus features are grouped under one of our two (and only two) “burger”-style menus, which are for mobile navigation and tag filters. Controls for other features have been replaced with intuitive icons in the appropriate places that will clearly indicate what users can do from different views.

Menu items are context dependent, and display options relevant to the current List View or Details View.

All of the Token API Management pages have been consolidated to a single API Token page, under the Integrations menu.

Account administration has also been consolidated and streamlined, as you can see from this view of the Team page:

FAQ

There are a lot of changes in this update, so to assist users with this transition, we’ve prepared answers for a few questions we anticipated that you might ask.

How do I view the check UUID?

The view for the API object, which includes the check UUID, is available on the checks list page by clicking the down arrow next to the View button. You can also visit the Details View page for the Check to get all pertinent Check info, including the UUID.

How do I view details for two things at once now that the List View and Details View are separate?

We recommend opening a separate Details View page for each item you want to view, by right-clicking on the View button in the List View and opening multiple new tabs.

What’s Next?

Our team is dedicated to continuously improving Circonus, and has recently prepared our roadmap for the next year, so we can confidently say there are many more exciting new features and performance enhancements on the horizon.

Our next big UI update will enhance our dashboards and the check creation workflows. These features will receive the same responsive improvements you will see in the rest of the UI, along with usability improvements.

Circonus On The Raspberry Pi

There are a lot of interesting monitoring tasks that can be facilitated with a Raspberry Pi (e.g. here, there). Circonus does not officially support "Raspbian Linux on armv6/v7" as a deployment target, but given the steady interest in this topic, we took the time to pave the way and write down some instructions in this post.

Let’s dive right in:

# Install dependencies
sudo apt-get update
sudo apt-get install git make

# Only node v4 or node v6 is supported. Install the official ARM package
wget https://nodejs.org/dist/latest-v6.x/node-v6.12.3-linux-armv6l.tar.xz
sudo mkdir -p /opt/node
sudo tar -vxf node-v6.12.3-linux-armv6l.tar.xz -C /opt/node --strip-components=1
sudo ln -s /opt/node/bin/node /usr/bin/node
sudo ln -s /opt/node/bin/npm /usr/bin/npm
sudo ln -s /opt/node/bin/node /opt/circonus/bin/node

# Install nad from sources
git clone https://github.com/circonus-labs/nad.git
cd nad
git checkout v2.5.1
sudo make install install-linux-init

# Fix nad disk plugins which depend on /proc/mdstat
sudo modprobe md

This should install and enable the monitoring agent nad on the RPI. Check that it is up and running using:

# Is it up? -> You should see a status report mentioning "active (running)"
systemctl status nad
# Does it serve metrics? -> You should see a large JSON blob, containing metrics
curl localhost:2609
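
If you prefer to check from code, here is a minimal sketch using only the Python standard library; it assumes nad is listening on the default port 2609 used above:

import json
import urllib.request

with urllib.request.urlopen("http://localhost:2609/") as resp:
    metrics = json.load(resp)

print("metric groups:", len(metrics))
print("examples:", list(metrics)[:5])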

Now copy and paste the one-step-install command from the Integrations > Checks > [New Host] page. In my case this is:

curl -sSL https://onestep.circonus.com/install | bash \
   -s -- \
   --key xxxxxxxxx-50ce-4118-9145-xxxxxxxxxx \
   --app circonus:osi:xxxxxx...

The credentials will be different for you. The installer should find the running agent and register checks, graphs, and the USE Dashboard for you. We needed to make some tweaks to the COSI installer itself to get this working as it is now. Special thanks go to Matt Maier from Circonus for making this happen so quickly.

These instructions were tested on Raspbian stretch lite (Version: November 2017 / Release date: 2017-11-29 / Kernel version: 4.9) on a RPI2 Model B.

Happy Monitoring!

This post was also published on Heinrich Hartmann’s blog.

Our Values

Values Create Value

In the tech industry, you read more blog posts on product features than you do on core values. At Circonus, we see them as inextricably linked – values create value – which is exactly what positions us to deliver results for our customers. Values lead you to real solutions, not just resolutions.

Resolutions and reflections are at the top of our minds in the new year. 2017 was a dynamic year in the monitoring and observability space.

This momentum in our space, and the pioneering role that Circonus plays, make me proud to work in technology. But I recall too many tech-related headlines in 2017 that screamed a lack of basic human or corporate values. This week, as we all vow to exercise more (my favorite) and eat less peanut butter ice cream (also my favorite), it’s worth reflecting that resolutions and corporate gamesmanship are like fad diets, but values are a way of life.

As leaders in a rapidly-evolving sector, Circonus believes our values should guide the path we forge. So we’ve decided to share our values publicly here. Leadership in technology depends on a set of principles that serve as a touchstone at times when it might be easier in the short term to take actions that sacrifice that which our customers, partners, and colleagues have come to expect from us. Without further ado, I present the values that Theo has laid out for us here at Circonus.

Respect

Be excellent to each other; everyone; equally. Never participate directly or indirectly in the violation of human rights. Consider others in your actions. Always presume competence and good intention. During disagreements, fall back to shared principles and values and work toward a solution from there. Communicate honestly, clearly and consistently, and with respect.

What this means for our work: Ideas are not people. We will criticize ideas, tear them down, and seek to ensure they can withstand operating at scale. We don’t do the same to people though; to operate at the highest level, shared discourse demands respect and trust between individuals.

Trust

Trust is the basic fabric of good working relationships with our colleagues, with our customers, with our industry, and with our competitors. Trust is reinforced through honesty, openness, and being transparent by default. It is okay to share tactical mistakes and our shortcomings both internally and externally. Never break the law or expose the mistakes or shortcomings of our customers.

What this means for our work: We build trust through open communication with our customers when things don’t go as expected.

Integrity

Do not break the law. Do not game the system. If we feel the system is broken, we must act to change the system. Winning isn’t winning if we’ve cheated. It is impossible to win alone.

What this means for our work: When we build on the work of others, we cite prior art and give credit where it is due.

Care

We must care as much or more about our customers than they do for themselves. Customer data: keep it secret, keep it safe, keep it intact and accurate. We also recognize that people are not machines and that human contact and personal care should not be sacrificed for the sake of efficiency. Never miss an opportunity to connect with a customer at the human level.

What this means for our work: We have built our systems to implement data safety as a first class feature.

Value

Leave a room cleaner than when you entered. Leave a customer with more value than they've invested in us. Leave your colleagues', customers', and competitors' lives more enriched and happier after every interaction. Appreciate and acknowledge the contributions of others.

What this means for our work: We aren’t satisfied with being average. We look to implement the best in class technical solution, even when it means waiting a little bit longer for the market to realize it.

Kindness

Always treat customer organizations as the assembly of humans they are. Treat everyone with kindness; it is the one thing you can always afford to do.

What this means for our work: Kindness at a minimum means helping our customers – even more than they asked, whenever we can. Kindness costs nothing, yet returns so much.

Frugality

Avoid waste. Consider the world. Conserve and protect what we consume – from the environment to people’s time.

What this means for our work: We engineer our systems to be frugal for clock cycles and block reads, as well as network traffic.

Growth

Learn something new every day and encourage responsible risk taking. Experience results in good decision making; experience comes from making poor decisions. Support those around you to help them constructively learn from their mistakes. This is how we build a team with experience. Learn from our failures and celebrate our successes.

What this means for our work: We are always looking to the leading edge of innovation; the value of success is often higher than the cost of failure.

Excellence

Set high standards for our own excellence. Expect more from ourselves than we do from our customers, our colleagues, and others we come in contact with.

What this means for our work: We seek to push the envelope on ideas and practices. As we said earlier, being average isn’t good enough; we expect ourselves to strive for the 99th percentile.

We hope you will share in these values going into 2018. If these words resonated with you, and you love to build high quality systems and software, come work with us; we are hiring!

UI Redesign FAQ

Today, the Circonus team is releasing the first round of our new User Interface changes.

Since we started developing Circonus 7 years ago, we’ve worked with many customers in different spaces with different needs. Our team talked with many engineers about how they monitor their systems, what their workflow looks like, and the configuration of their ideal platform. You’ve spoken and we listened. Now, we’ve put those years of feedback to use, so that you get the benefit of that collective knowledge, with a new, improved interface for managing your monitoring.

The interface now features responsive improvements for different screen sizes (which provide better support for mobile devices) and a revised navigation menu. More changes will follow, as we innovate to adapt our interface to intuitively suit our users' needs.

Frequently Asked Questions

Change, even change for the better, can be frustrating. This FAQ will help explain the interface changes. We’ll be adding more items to this FAQ as we get more feedback.

Where did the checks go?

The checks have been moved under the "Integrations" menu.

Where did the graphs go?

Graphs are now under the “Analytics” menu.

Where are the rulesets?

Rulesets can be found under the “Alerts” menu.

Send us your feedback

Tell us about your own Circonus experience, and let us know what you think about our new User Interface:

The Circonus Istio Mixer Adapter

Here at Circonus, we have a long heritage of open source software involvement. So when we saw that Istio provided a well designed interface to syndicate service telemetry via adapters, we knew that a Circonus adapter would be a natural fit. Istio has been designed to provide a highly performant, highly scalable application control plane, and Circonus has been designed with performance and scalability as core principles.

Today we are happy to announce the availability of the Circonus adapter for the Istio service mesh. This blog post will go over the development of this adapter, and show you how to get up and running with it quickly. We know you’ll be excited about this, because Kubernetes and Istio give you the ability to scale to the level that Circonus was engineered to perform at, above other telemetry solutions.

If you don’t know what a service mesh is, you aren’t alone, but odds are you have been using them for years. The routing infrastructure of the Internet is a service mesh; it facilitates tcp retransmission, access control, dynamic routing, traffic shaping, etc. The monolithic applications that have dominated the web are giving way to applications composed of microservices. Istio provides control plane functionality for container based distributed applications via a sidecar proxy. It provides the service operator with a rich set of functionality to control a Kubernetes orchestrated set of services, without requiring the services themselves to implement any control plane feature sets.

Istio’s Mixer provides an adapter model which allowed us to develop an adapter by creating handlers for interfacing Mixer with external infrastructure backends. Mixer also provides a set of templates, each of which expose different sets of metadata that can be provided to the adapter. In the case of a metrics adapter such as the Circonus adapter, this metadata includes metrics like request duration, request count, request payload size, and response payload size. To activate the Circonus adapter in an Istio-enabled Kubernetes cluster, simply use the istioctl command to inject the Circonus operator configuration into the K8s cluster, and the metrics will start flowing.

Here’s an architectural overview of how Mixer interacts with these external backends:

Istio also contains metrics adapters for StatsD and Prometheus. However, a few things differentiate the Circonus adapter from those other adapters. First, the Circonus adapter allows us to collect the request durations as a histogram, instead of just recording fixed percentiles. This allows us to calculate any quantile over arbitrary time windows, and perform statistical analyses on the histogram which is collected. Second, data can be retained essentially forever. Third, the telemetry data is retained in a durable environment, outside the blast radius of any of the ephemeral assets managed by Kubernetes.

Let’s take a look at the guts of how data gets from Istio into Circonus. Istio’s adapter framework exposes a number of methods which are available to adapter developers. The HandleMetric() method is called for a set of metric instances generated from each request that Istio handles. In our operator configuration, we can specify the metrics that we want to act on, and their types:

spec:
  # HTTPTrap url, replace this with your account submission url
  submission_url: "https://trap.noit.circonus.net/module/httptrap/myuuid/mysecret"
  submission_interval: "10s"
  metrics:
  - name: requestcount.metric.istio-system
    type: COUNTER
  - name: requestduration.metric.istio-system
    type: DISTRIBUTION
  - name: requestsize.metric.istio-system
    type: GAUGE
  - name: responsesize.metric.istio-system
    type: GAUGE

Here we configure the Circonus handler with a submission URL for an HTTPTrap check, an interval to send metrics at. In this example, we specify four metrics to gather, and their types. Notice that we collect the requestduration metric as a DISTRIBUTION type, which will be processed as a histogram in Circonus. This retains fidelity over time, as opposed to averaging that metric, or calculating a percentile before recording the value (both of those techniques lose the value of the signal).

The HandleMetric() method is called on each request for the metrics we have specified. Let's take a look at the code:

// HandleMetric submits metrics to Circonus via circonus-gometrics
func (h *handler) HandleMetric(ctx context.Context, insts []*metric.Instance) error {

    for _, inst := range insts {

        metricName := inst.Name
        metricType := h.metrics[metricName]

        switch metricType {

        case config.GAUGE:
            value, _ := inst.Value.(int64)
            h.cm.Gauge(metricName, value)

        case config.COUNTER:
            h.cm.Increment(metricName)

        case config.DISTRIBUTION:
            value, _ := inst.Value.(time.Duration)
            h.cm.Timing(metricName, float64(value))
        }

    }
    return nil
}

Here we can see that HandleMetric() is called with a Mixer context, and a set of metric instances. We iterate over each instance, determine its type, and call the appropriate circonus-gometrics method. The metric handler contains a circonus-gometrics object which makes submitting the actual metric trivial to implement in this framework. Setting up the handler is a bit more complex, but still not rocket science:

// Build constructs a circonus-gometrics instance and sets up the handler
func (b *builder) Build(ctx context.Context, env adapter.Env) (adapter.Handler, error) {

    bridge := &logToEnvLogger{env: env}

    cmc := &cgm.Config{
        CheckManager: checkmgr.Config{
            Check: checkmgr.CheckConfig{
                SubmissionURL: b.adpCfg.SubmissionUrl,
            },
        },
        Log:      log.New(bridge, "", 0),
        Debug:    true, // enable [DEBUG] level logging for env.Logger
        Interval: "0s", // flush via ScheduleDaemon based ticker
    }

    cm, err := cgm.NewCirconusMetrics(cmc)
    if err != nil {
        err = env.Logger().Errorf("Could not create NewCirconusMetrics: %v", err)
        return nil, err
    }

    // create a context with cancel based on the istio context
    adapterContext, adapterCancel := context.WithCancel(ctx)

    env.ScheduleDaemon(
        func() {

            ticker := time.NewTicker(b.adpCfg.SubmissionInterval)

            for {
                select {
                case <-ticker.C:
                    cm.Flush()
                case <-adapterContext.Done():
                    ticker.Stop()
                    cm.Flush()
                    return
                }
            }
        })

    metrics := make(map[string]config.Params_MetricInfo_Type)
    ac := b.adpCfg
    for _, adpMetric := range ac.Metrics {
        metrics[adpMetric.Name] = adpMetric.Type
    }
    return &handler{cm: cm, env: env, metrics: metrics, cancel: adapterCancel}, nil
}

Mixer provides a builder type which we defined the Build method on. Again, a Mixer context is passed, along with an environment object representing Mixer’s configuration. We create a new circonus-gometrics object, and deliberately disable automatic metrics flushing. We do this because Mixer requires us to wrap all goroutines in their panic handler with the env.ScheduleDaemon() method. You’ll notice that we’ve created our own adapterContext via context.WithCancel. This allows us to shut down the metrics flushing goroutine by calling h.cancel() in the Close() method handler provided by Mixer. We also want to send any log events from CGM (circonus-gometrics) to Mixer’s log. Mixer provides an env.Logger() interface which is based on glog, but CGM uses the standard Golang logger. How did we resolve this impedance mismatch? With a logger bridge, any logging statements that CGM generates are passed to Mixer.

// logToEnvLogger converts CGM log package writes to env.Logger()
func (b logToEnvLogger) Write(msg []byte) (int, error) {
    if bytes.HasPrefix(msg, []byte("[ERROR]")) {
        b.env.Logger().Errorf(string(msg))
    } else if bytes.HasPrefix(msg, []byte("[WARN]")) {
        b.env.Logger().Warningf(string(msg))
    } else if bytes.HasPrefix(msg, []byte("[DEBUG]")) {
        b.env.Logger().Infof(string(msg))
    } else {
        b.env.Logger().Infof(string(msg))
    }
    return len(msg), nil
}

For the full adapter codebase, see the Istio github repo here.

Enough with the theory though, let's see what this thing looks like in action. I set up a Google Kubernetes Engine deployment, loaded a development version of Istio with the Circonus adapter, and deployed the sample BookInfo application that is provided with Istio. The image below is a heatmap of the distribution of request durations from requests made to the application. You'll notice the histogram overlay for the time slice highlighted. I added an overlay showing the median, 90th, and 95th percentile response times; it's possible to generate these at arbitrary quantiles and time windows because we store the data natively as log-linear histograms. Notice that the median and 90th percentile are relatively fixed, while the 95th percentile tends to fluctuate over a range of a few hundred milliseconds. This type of deep observability can be used to quantify the performance of Istio itself across versions as it continues its rapid growth. Or, more likely, it will be used to identify issues within the deployed application. If your 95th percentile isn't meeting your internal Service Level Objectives (SLOs), that's a good sign you have some work to do. After all, if 1 in 20 users is having a sub-par experience on your application, don't you want to know about it?

That looks like fun, so let’s lay out how to get this stuff up and running. First thing we’ll need is a Kubernetes cluster. Google Kubernetes Engine provides an easy way to get a cluster up quickly.

There are a few other ways documented in the Istio docs if you don't want to use GKE, but these are the notes I used to get up and running. After deploying the cluster in the web UI, I used the gcloud command line utility as follows.

# set your zones and region
$ gcloud config set compute/zone us-west1-a
$ gcloud config set compute/region us-west1

# create the cluster
$ gcloud alpha container clusters create istio-testing --num-nodes=4

# get the credentials and put them in kubeconfig
$ gcloud container clusters get-credentials istio-testing --zone us-west1-a --project istio-circonus

# grant cluster admin permissions
$ kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value core/account)

Poof, you have a Kubernetes cluster. Let's install Istio – refer to the Istio docs.

# grab Istio and setup your PATH
$ curl -L https://git.io/getLatestIstio | sh -
$ cd istio-0.2.12
$ export PATH=$PWD/bin:$PATH

# now install Istio
$ kubectl apply -f install/kubernetes/istio.yaml

# wait for the services to come up
$ kubectl get svc -n istio-system

Now set up the sample BookInfo application

# Assuming you are using manual sidecar injection, use `kube-inject`
$ kubectl apply -f <(istioctl kube-inject -f samples/bookinfo/kube/bookinfo.yaml)

# wait for the services to come up
$ kubectl get services

If you are on GKE, you'll need to set up the gateway and firewall rules

# get the worker address
$ kubectl get ingress -o wide

# get the gateway url
$ export GATEWAY_URL=<workerNodeAddress>:$(kubectl get svc istio-ingress -n istio-system -o jsonpath='{.spec.ports[0].nodePort}')

# add the firewall rule
$ gcloud compute firewall-rules create allow-book --allow tcp:$(kubectl get svc istio-ingress -n istio-system -o jsonpath='{.spec.ports[0].nodePort}')

# hit the url - for some reason GATEWAY_URL is on an ephemeral port, use port 80 instead
$ wget http://<workerNodeAddress>/productpage

The sample application should be up and running. If you are using Istio 0.3 or less, you'll need to install the docker image we built with the Circonus adapter embedded.

Load the Circonus resource definition (you only need to do this with Istio 0.3 or less). Save this content as circonus_crd.yaml:

kind: CustomResourceDefinition
apiVersion: apiextensions.k8s.io/v1beta1
metadata:
  name: circonuses.config.istio.io
  labels:
    package: circonus
    istio: mixer-adapter
spec:
  group: config.istio.io
  names:
    kind: circonus
    plural: circonuses
    singular: circonus
  scope: Namespaced
  version: v1alpha2

Now apply it:

$ kubectl apply -f circonus_crd.yaml

Edit the Istio deployment to pull in the Docker image with the Circonus adapter build (again, not needed if you’re using Istio v0.4 or greater)

$ kubectl -n istio-system edit deployment istio-mixer

Change the image for the Mixer binary to use the istio-circonus image:

image: gcr.io/istio-circonus/mixer_debug
imagePullPolicy: IfNotPresent
name: mixer

OK, we're almost there. Grab a copy of the operator configuration, and insert your HTTPTrap submission URL into it. You'll need a Circonus account to get that; just sign up for a free account if you don't have one and create an HTTPTrap check.

Now apply your operator configuration:

$ istioctl create  -f circonus_config.yaml

Make a few requests to the application, and you should see the metrics flowing into your Circonus dashboard! If you run into any problems, feel free to contact us at the Circonus labs slack, or reach out to me directly on Twitter at @phredmoyer.

This was a fun integration; Istio is definitely on the leading edge of Kubernetes, but it has matured significantly over the past few months and should be considered ready to use to deploy new microservices. I’d like to extend thanks to some folks who helped out on this effort. Matt Maier is the maintainer of Circonus gometrics and was invaluable on integrating CGM within the Istio handler framework. Zack Butcher and Douglas Reid are hackers on the Istio project, and a few months ago gave an encouraging ‘send the PRs!’ nudge when I talked to them at a local meetup about Istio. Martin Taillefer gave great feedback and advice during the final stages of the Circonus adapter development. Shriram Rajagopalan gave a hand with the CircleCI testing to get things over the finish line. Finally, a huge thanks to the team at Circonus for sponsoring this work, and the Istio community for their welcoming culture that makes this type of innovation possible.

Thank you, ZFS

If you’ve had a technical conversation with anyone at Circonus, there are very likely two technologies that came up: ZFS and DTrace. While we love DTrace, ZFS has literally changed our world and made some personal “whoopsies” out of what could have been otherwise catastrophic, business-ending mistakes. ZFS has been a technology that has changed the way we interact with production computing systems and business problems.

ZFS has a million features that made it over a decade ahead of its time, but that’s not so important today. Today, we’re just thankful.

We want to thank the creators of ZFS and the community that placed such abusive production demands on it before us. We want to thank the community of people that saw the value of ZFS and ported it to… everything: FreeBSD, Linux, Mac OS X, and even Windows. It was a dreadful mistake that many vendors didn’t prop those communities up and assist them in making ZFS ubiquitous and default on their systems; you’ve done a disservice to your customers.

For 5 years, we didn’t bring “snowth” (IRONdb’s internal name) to market as a standalone product for a variety of reasons. One of the primary technical reasons was that our reliance on ZFS and Linux’s lack of adoption of ZFS made our deployable market artificially small. Last year, we decided that ZFS on Linux was “stable enough” to support our customers, and the last gating factor for IRONdb as a product was eliminated. I’m thankful for all of the ridiculously hard work the ZoL (ZFS on Linux) team has put into making ZFS as good on Linux as it is today.

At Circonus, under the torture harness of IRONdb, we've pushed on ZFS in ways that are hard to contemplate. Having dialogue with other ZFS developers and users, we know that we push on this filesystem in ways that are simply diabolical. If you look in the sizing section of the documentation of the excellent and capable InfluxDB, IRONdb sits squarely in their "infeasible" performance category; we attribute our ability to shatter these notions to the fact that we built our tech atop ZFS.

Yes, we use all the magical features therein: compression, device management (growing pools), online disk replacement, scrubbing, checksumming, snapshots, and many more, but today we want to show appreciation for the most important ZFS characteristic of all… stability. You’re just there for us. You’ve lost less data than any other filesystem we’ve used. You’ve crashed and locked up less than any of your counterparts. You’ve saved us years of filesystem checks, by completely eliminating them. You’ve saved us.

Thank you.

Some Like It Flat

JSON rules the world, much to our collective chagrin. I’ve mentioned before the atrocious shortcomings of JSON as a format and I feel deeply saddened that the format has taken the world by storm. However, it is here and we must cope… but not always.

Like every other API-centric platform in the world, we support data in and out in the ubiquitous JavaScript Object Notation format. I’ll admit, for a (more or less) machine parseable notation, it is remarkably comprehensible to humans (this is, of course, one of the leading drivers for its ubiquity). While I won’t dive into the deficits of JSON on the data-type interpretation side (which is insidious for systems that mainly communicate numerical data), I will talk about speed.

JSON is slow. JSONB was created because JSON was slow. JSONB is still slow. Object (data) serialization has always been of interest to protocols. When one system must communicate data to another, both systems must agree on a format for transmission. While JSON is naturally debuggable, it does not foster agreement (specifically on numeric values) and it is truly abysmal on the performance side. This is why so many protocol serializations exist today.

From the Java world comes Thrift, its successor Avro, and MessagePack. From Python we have pickle, which somehow has escaped the Python world to inflict harm upon others. From the C and C++ world we have Cap’n Proto, Flatbuffers, and perhaps the most popular, Google Protobuf (the heart of the widely adopted gRPC protocol). Now, these serialization libraries might have come from one language world, but they’d be useless without bindings in basically every other language… which they do generally boast, with the exception of pickle.

It should be noted that not all of these serialization libraries stop at serialization. Some bring protocol specification (RPC definition and calling convention) into scope. Notwithstanding that this can be useful, the fact that they are conflated within a single implementation is a tragedy.

For IRONdb, we needed a faster alternative to JSON because we were sacrificing an absurd amount of performance to the JSON god. This is a common design in databases; either a binary protocol is used from the beginning or a fast binary protocol is added for performance reasons. I will say that starting with JSON and later adopting a binary encoding and protocol has some rather obvious and profound advantages.

By delaying the adoption of a binary protocol, we had large system deployments with petabytes of data in them under demanding, real-world use cases. This made it a breeze to evaluate both the suitability and performance gains. Additionally, we were able to understand how much performance we squandered on the protocol side vs. the encoding side. Our protocol has been HTTP and our payload encoding has been JSON. It turns out that in our use cases, the overhead for dealing with HTTP was nominal in our workloads, but the overhead for serializing and deserializing JSON was comically absurd (and yes, this is even with non-allocating, SAX-style JSON parsers).

So, Google Protobuf right? Google can’t get it wrong. It’s popular; there are a great many tools around it. Google builds a lot of good stuff, but there were three undesirable issues with Google Protobuf: (1) IRONdb is in C and while the C++ support is good, the C support is atrocious, (2) it conflates protocol with encoding so it becomes burdensome to not adopt gRPC, and (3) it’s actually pretty slow.

So, if not Google's tech, then whose? Well, Google's of course. It is little known that Flatbuffers is actually a Google technology, designed specifically as an encoding and decoding format for gaming and other high-performance applications. To understand why it is so fast, consider this: you don't have to parse anything to access data elements of encoded objects. Additionally, it boasts the strong forward and backward compatibility guarantees that you get with other complex serialization systems like Protobuf.

The solution for us:

Same REST calls, same endpoints, same data, different encoding. Simply specify the data encoding you are sending with a 'Content-Type: x-circonus-<datatype>-flatbuffer' header, and the remote end can avoid copying or even parsing any memory; it just accesses it. The integration into our C code is very macro-oriented and simple to understand. Along with a roughly 1000x speedup in encoding and decoding data, we save about 75% volumetrically "over the wire."

Site Maintenance Oct 23rd, 2017 10:00 EDT

We will be performing site wide maintenance on Monday, October 23rd, 2017 at 10:00am EDT (14:00 UTC). This maintenance is expected to last 15 minutes. During this window, the UI and API will be unavailable as we fail over to a new primary datacenter. This maintenance also includes the promotion of a new primary DB and movement of alerting services.

Over the past few weeks, we have spun up data collection to this new DC, and have been serving graph, worksheet, and dashboard data from it. During the maintenance window, data collection will continue uninterrupted. There will be an alerting outage as we switch services to their new home. Alerts that would have fired in the window will fire when we come out of maintenance; alerts that would have cleared will clear when we come out of maintenance.

Please double check that our IPs listed in out.circonus.net are permitted through any firewalls, especially if you have rules to permit webhook notifications.

We expect no major issues with this move. If you have any questions, please contact our support at support@circonus.com for further clarification.