This was a frequent request we were hearing from many customers:
“How can I analyze my data with Python?”
The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of advanced data analytics capabilities. Therefore, seamless integration with this environment is important for our customers who want to make use of those tools.
Circonus has for a long time provided Python bindings for its API. With these bindings, you can configure the account, create graphs and dashboards, etc. However, fetching data and getting it into the right format involves multiple steps and was not easy to get right.
We are now pleased to announce that this has changed. We have just added new capabilities to our Python bindings that allow you to fetch and analyze data more effectively. Here is how to use it.
Quick Tour
Connecting to the API
You need an API token to connect to the API. You can create one using the UI under Integrations > API Tokens. In the following we assume the variable api_token holds a valid API token for your account.
from circonusapi import circonusdata circ = circonusdata.CirconusData(api_token)
Searching for Metrics
The first thing we can do is search for some metrics:
>>> M = circ.search('(metric:duration)', limit=10)
The returned object extends the list class, and can be manipulated like any list object.
We override the __str__ method, so that printing the list, gives a table representation of the fetched metrics:
>>> print(M) check_id type metric_name -------------------------------------------------- 195902 numeric duration 218003 numeric duration 154743 numeric duration 217833 numeric duration 217834 numeric duration 218002 numeric duration 222857 numeric duration 222854 numeric duration 222862 numeric duration 222860 numeric duration
Metric lists provide a .fetch() method that can be used to fetch data. Fetches are performed serially, one metric at a time, so the retrieval can take some time. We will later see how to parallelize fetches with CAQL.
R = M.fetch( start=datetime(2018,1,1), # start at Midnight UTC 2018-01-01 period=60, # return 60 second (=1min) aggregates count=180, # return 180 samples kind="value" # return (mean-)value aggregate )
The resulting object is a dict that maps metrics names to the fetched data. This is designed in such a way that it can be directly passed to a pandas DataFrame constructor.
import pandas as pd df = pd.DataFrame(R) # [OPTIONAL] Make the DataFrame aware of the time column df['time']=pd.to_datetime(df['time'],unit='s') df.set_index('time', inplace=True) df.head()
time 154743/duration 195902/duration 217833/duration 217834/duration 218002/duration 218003/duration 222854/duration 222857/duration 222860/duration 222862/duration 2018-01-01 00:00:00 1 4 1 1 1 1 12 11 12 1 2018-01-01 00:01:00 1 2 1 1 2 1 11 12 12 1 2018-01-01 00:02:00 1 2 1 1 1 1 12 12 11 1 2018-01-01 00:03:00 1 2 1 1 1 1 12 11 12 1 2018-01-01 00:04:00 1 2 1 1 1 1 12 11 11 1
Data Analysis with pandas
Pandas makes common data analysis methods very easy to perform. We start with computing some summary statistics:
df.describe()
154743/duration 195902/duration 217833/duration 217834/duration 218002/duration 218003/duration 222854/duration 222857/duration 222860/duration 222862/duration count 180.000000 180.000000 180.0 180.000000 180.000000 180.000000 180.000000 180.000000 180.00000 180.000000 mean 1.316667 2.150000 1.0 1.150000 1.044444 1.177778 11.677778 11.783333 11.80000 1.022222 std 1.642573 0.583526 0.0 1.130951 0.232120 0.897890 0.535401 0.799965 0.89941 0.181722 min 1.000000 1.000000 1.0 1.000000 1.000000 1.000000 11.000000 11.000000 11.00000 1.000000 25% 1.000000 2.000000 1.0 1.000000 1.000000 1.000000 11.000000 11.000000 11.00000 1.000000 50% 1.000000 2.000000 1.0 1.000000 1.000000 1.000000 12.000000 12.000000 12.00000 1.000000 75% 1.000000 2.000000 1.0 1.000000 1.000000 1.000000 12.000000 12.000000 12.00000 1.000000 max 15.000000 4.000000 1.0 12.000000 3.000000 9.000000 13.000000 17.000000 16.00000 3.000000
Here is a plot of the dataset over time:
from matplotlib import pyplot as plt ax = df.plot(style=".",figsize=(20,5),legend=False, ylim=(0,20), linewidth=0.2)
We can also summarize the individual distributions as box plots:
ax = df.plot(figsize=(20,5),legend=False, ylim=(0,20), kind="box") ax.figure.autofmt_xdate(rotation=-20,ha="left")
Working with Histogram Data
Histogram data can be fetched using the kind=”histogram” parameter to fetch. Numeric metrics will be converted to histograms. Histograms are represented as libcircllhist objects, which have very efficient methods for the most common histogram operations (mean, quantiles).
MH = circ.search("api`GET`/getState", limit=1) print(MH) check_id type metric_name -------------------------------------------------- 160764 histogram api`GET`/getState
Let’s fetch the 1h latency distributions of this API for the timespan of one day:
RH = MH.fetch(datetime(2018,1,1), 60*60, 24, kind="histogram")
We can plot the resulting histograms with a little helper function:
fig = plt.figure(figsize=(20, 5)) for H in RH['160764/api`GET`/getState']: circllhist_plot(H, alpha=0.2) ax = fig.get_axes() ax[0].set_xlim(0,100)
The output:
(0, 100)
Again, we can directly import the data into a pandas data frame, and perform some calculations on the data:
dfh = pd.DataFrame(RH) # [OPTIONAL] Make the DataFrame aware of the time column dfh['time']=pd.to_datetime(dfh['time'],unit='s') dfh.set_index('time', inplace=True) dfh['p99'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.99)) dfh['p90'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.99)) dfh['p95'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.99)) dfh['p50'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.5)) dfh['mean'] = dfh.iloc[:,0].map(lambda h: h.mean()) dfh.head()
time 160764/api`GET`/getState p99 p90 p95 p50 mean 2018-01-01 00:00:00 {"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ... 112.835714 112.835714 112.835714 11.992790 15.387013 2018-01-01 01:00:00 {"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ... 114.961628 114.961628 114.961628 16.567822 19.542284 2018-01-01 02:00:00 {"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,... 118.124324 118.124324 118.124324 20.556859 24.012226 2018-01-01 03:00:00 {"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,... 427.122222 427.122222 427.122222 20.827982 37.040173 2018-01-01 04:00:00 {"+40e-002": 6, "+50e-002": 26, "+59e-002": 15... 496.077778 496.077778 496.077778 23.247373 40.965517
The CAQL API
Circonus comes with a wide range of data analysis capabilities that are integrated into the Circonus Analytics Query Language, CAQL.
CAQL provides highly efficient data fetching operations that allow you to process multiple metrics at the same time. Also by performing the computation close to the data, you can save time and bandwidth.
To get started, we search for duration
metrics, like we did before, using CAQL:
A = circ.caql('search:metric("duration")', datetime(2018,1,1), 60, 5000) dfc = pd.DataFrame(A) dfc.head()
output[0] output[10] output[11] output[12] output[13] output[14] output[15] output[16] output[17] output[18] ... output[21] output[2] output[3] output[4] output[5] output[6] output[7] output[8] output[9] time 0 4 12 1 1 2 1 1 1 11 1 ... 1 1 1 1 1 1 11 12 1 1514764800 1 2 12 1 1 1 1 1 1 11 1 ... 1 1 1 1 1 2 12 11 1 1514764860 2 2 11 1 1 2 1 1 1 12 1 ... 1 1 1 1 1 1 12 12 1 1514764920 3 2 12 1 1 2 1 1 1 12 1 ... 1 1 1 1 1 1 11 12 1 1514764980 4 2 11 1 1 2 1 1 1 11 1 ... 1 1 1 1 1 1 11 12 1 1514765040 5 rows × 23 columns
This API call fetched 1000 samples from 22 metrics, and completed in just over 1 second. The equivalent circ.search().fetch()
statement would have taken around one minute to complete.
One drawback of the CAQL fetching is, that we use the metric names in the output. We are working on resolving this shortcoming.
To showcase some of the analytics features, we’ll now use CAQL to compute a rolling mean over the second largest duration metric in the above cluster, and plot the transformed data using pandas:
B = circ.caql(""" search:metric("duration") | stats:trim(1) | stats:max() | rolling:mean(10M) """, datetime(2018,1,1), 60, 1000) df = pd.DataFrame(B) df['time']=pd.to_datetime(df['time'],unit='s') df.set_index('time', inplace=True) df.plot(figsize=(20,5), lw=.5,ylim=(0,50))
You can also fetch histogram data with circ.caql()
:
AH = circ.caql('search:metric:histogram("api`GET`/getState")', datetime(2018,1,1), 60*60, 24) dfch = pd.DataFrame(AH) dfch.head()
output[0] time 0 {"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ... 1514764800 1 {"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ... 1514768400 2 {"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,... 1514772000 3 {"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,... 1514775600 4 {"+40e-002": 6, "+50e-002": 26, "+59e-002": 15... 1514779200
We can perform a wide variety of data transformation tasks directly inside Circonus using CAQL expressions. This speeds up the computation even further. Another advantage is that we can leverage CAQL queries for live graphing and alerting in the Circonus UI.
In this example, we compute how many requests were serviced above certain latency thresholds:
B = circ.caql(''' search:metric:histogram("api`GET`/getState") | histogram:count_above(0,10,50,100,500,1000) ''', datetime(2018,1,1), 60*5, 24*20) dfc2 = pd.DataFrame(B) dfc2['time']=pd.to_datetime(dfc2['time'],unit='s') dfc2.set_index('time', inplace=True) dfc2.plot(figsize=(20,5), colormap="gist_heat",legend=False, lw=.5)
Conclusion
Getting Circonus data into Python has never been easier. We hope that this blog post allows you to get started with the new data fetching capabilities. A Jupyter notebook version of this blog post containing the complete source code is available here. If you run into any problems or have some suggestions, feel free to open an issue on GitHub, or get in touch on our Slack channel.