Q&A: Best Practices for Storing and Analyzing Time-Series Data

The exponential growth of machine generated data in recent years has created the need for solutions purpose-built to handle extremely high-frequency telemetry data. This has driven increasingly more organizations to adopt time series databases and address the unique challenges around ingesting, analyzing, and storing massive amounts of time-series data.

Circonus Founder and CTO, Theo Schlossnagle, recently discussed these issues and more with Chris McCubbin, a data scientist and software developer at AWS, in a webinar held by the Association for Computing Machinery (ACM). Theo, who designed Circonus’ time-series database, IRONdb, addressed questions such as:

Should you take into account the inherent or future value of data when deciding if it’s worth the cost of storing it?

How do you make it feasible to analyze millions of data points stored over years?

How does Circonus enable you to effectively analyze time-series data in today’s all virtual, ever-changing environment?

The following is a snapshot of the hour long discussion.

CM: What types of analytics does Circonus support?

TS: There are different types of time series databases. There are structured event databases, and there are log databases, where you have a timestamp and a log line that’s unstructured. Ours is very specific in that we assume that you’re monitoring a thing, and that thing is going to have some multitude of timestamped values attached to it.

So in the case of resource management and cloud computing, for example, I have a piece of physical equipment that’s running a hypervisor, and it has a certain number of cycles that it spends in idle, in VM, and in system and user, I have containers inside of that that are spending CPU time, they’re using memory resources. I’m monitoring each of those things and taking a measurement out of them at some dictated frequency. So maybe I’m taking a measurement every one second or so.

On more of a high frequency model, say I have an API that’s driving my application. Every time I service an API request, I know the endpoint that’s being serviced. I know the version of code that’s running. I know that the response code was say a 200 or 500. And inside of that criteria, perhaps I have 200,000 requests per second that I’m serving, and they’re all taking a certain number of microseconds. For every single one of those requests, I have the timestamp and the latency that was observed on servicing that request.

Those are the types of measurements we take in. If we’re looking at the types of analytics that we do, everything has a time context. We have a temporal query that says I’m looking over the last day, I’m looking over the last hour, looking over the last year or year over year, that sort of thing. One of the challenges is you need to specify what you’re interested in, so there’s a search syntax that allows you to drill down and find the measurements, which we call streams of data, that you’re interested in.

For example, you may be looking for latency on an API service I’m running in AWS. Say you have 10,000 containers on AWS in U.S. East and 10,000 in U.S. West, and you want to pull all of the latencies that each of those containers are providing and then compare them, or compare the 99th quantile latency over five minute rolling periods. This is a specification of what you’re looking for and defining the streams you’re looking at. You can look at 10,000 or a hundred thousand streams, and then you can apply either vectorization across those, or you apply some sort of windowing across those, and then you apply statistical methods. And those statistical methods can be very simple, like show me the variance or the mean, compose a histogram of all the values and return the inverse quantile at 100 milliseconds, or tell me what percentage of the population is faster than a hundred milliseconds. Or, you can get more sophisticated like apply an anomaly detection algorithm that looks at a rolling seasonal average of these things or an exponentially weighted average, and then try to do some sort of predictive analysis to bracket expected behavior and tell me how likely the next data point falls in that.

There’s a huge range of incredibly primitive kinds of time-series-centric functions that you could apply and then you can cobble those together and build more and more on those to produce more complex and usually domain specific questions that you would ask about the data.

CM: We talked a little bit about queries and analysis and time representation. Can you talk about how those things interact…so the design itself of the database – did you design the storage and roll off or re-indexing and things like that to accommodate the different types of queries your users will want to do?

TS: There’s a lot of evolution in the database design. One of the recognitions that we had early on was that when we’re receiving data from systems, those systems are pretty much the definition of relentless. They don’t care if your database is down – they’re still going to produce the data, and a lot of the statistical methods don’t apply well if you’re missing data. So if you’re down for five minutes, then you have to ingest the previous five minutes of missed samples. You have to catch up and then play in real time.

So it becomes critical you don’t ever go down. You have to be able to read the data off the wire, and you have to be able ingest data. The data structures and IO patterns for optimizing that are very different than the ones that you would use to do instantaneous recall and analysis of that data. I think that most people can agree that the fastest way to write data to disk is to not care where you write it and to just write it. So we write to the end of a file or write a new block, then the next available block. If you’re using modern storage, it doesn’t matter so much that it’s sequential, but if you’re using large storage devices it helps to write large sequential blocks, have large block sizes on file systems, or not use file systems at all.

But when you write large chunks of data as it’s coming in, it’s not organized in a fashion that allows for incredibly fast recall. It’s not put into columns for doing vectorized crunching, it’s not correlated by stream or by time, or if in a multi-tenant environment it’s not even correlated by customer. So when using the database format on-disk structures that are optimized for uninterrupted ingest, we have to realize that those structures are not the same ones we want to use for doing immediate recall, or for doing long tail analytics over multi-year data sets. When we accepted this, we realized that we needed to do ETL between those things.

We have certain data formats that are used depending on data type for ingest, and then others that are used for doing interim recall over extended periods of recent time, and then we have long tail storage, which is stored in yet a different format that is optimized for read only recall (not strictly read only, but it is incredibly painful to write to and delete for example). It’s highly optimized for compacted storage size and vectorized crunching. The idea that you could have a general purpose data structure in there ignores the fact that economics are a real thing here. If I’m going to store a petabyte of data, storing that all in non-volatile Ram or NVME just isn’t economically tractable. So I have to be able to handle all of the different IO workloads that are against affordable storage types and map that into the requirements database.

CM: Should you take into account the inherent or future value of data when deciding if it’s worth the cost of storing it? Should the database itself have some awareness about the value of the data?

TS: I’m not sure if the database should have an awareness of the value of the data, but I would say that if you can quantify the value of your data as non-zero, it’s almost a no-brainer to store it, at least in some fashion. The real trick is that a lot of data has value that is later realized. And that value is only realized if you have historic data. So there’s a lot of data where it’s very difficult to know that this data will have value, but when you come to realize that it has value, then having a year’s worth of that data or seven years worth of that data is immensely valuable. So anybody in the SRE realm knows that every time you have an incident and you go review what you’re doing, you become smarter as an organization and individually. You’re making hypotheses and you’re executing scientific reasoning around the data that you have to say, “hey, I saw this pattern” or “this correlation, I wonder if this is causal?” And if you can go back and replay the incident from three months ago and you actually have that data subjected to your new line of reasoning, then you can compound that value. That organizational growth compounds, but it doesn’t work if you weren’t collecting the data before. So I know it’s a meme to say, monitor all the things, and it is sort of unrealistic to actually restore every piece of data. But the closer you can get to that, the more future value you can generate out of effectively thin air.

We use a lot of the techniques around compressing time series data. One of the techniques that we hinge a lot of our value around is the idea of taking a million samples a second and after some period of time, say a month, and storing all million of those samples compressed where will no longer be able to figure out where inside the second they happened. Instead of storing a million samples in a given second with nanosecond offsets, we store a single histogram for that one second and store all million values in that histogram. We introduce time error and value error, both bounded very tightly. But it turns out that most of the statistical questions that you would ask doing analysis are achievable via that data structure. Now you’ve taken probably a half of megabyte of data and compressed it down to less than one kilobyte of data. Those techniques preserve the ability to statistically query the data. There are some fascinating approaches that you can use to know whether this data is valuable, but the cost of storing that data can be so cheap that I don’t really need to ask that question.

CM: In today’s environment, basically nothing is “real” anymore. An IP address that represents one machine one second, represents a different machine the next second. And inside that machine, there are virtual machines, and inside virtual machines are virtual environments. So everything’s virtual at every single level. So how can Circonus enable us to do good analysis on sort of longitudinal time series data when you don’t even know what this IP address is. If you’re making a decision to store everything based on IP, you’re obviously losing a lot.

TS: There are some fundamentals of monitoring that are incredibly important. When you are monitoring something, it is critical that you monitor the simplest thing. For example, your power company does not monitor the watts per second that you use – they monitor the total number of kilowatts consumed. The reason they do this is because if they miss a measurement, they still have the average that you used over the longer period of time, and it’s still correct. So it’s important that you monitor things correctly and that you actually take reasonable values that are good inputs to mathematical processing. Given that, one of the things you have to consider is that you need data that’s mergeable.

If I track API latency, for example, that’s one of my service level objectives is to have a certain number of my requests below a certain latency threshold, realizing that I might change that threshold over time and still need to be able to do analysis. I need to make sure that the way I record that on a single container that may come and go is the same as I recorded on all the other containers on all of the other services, and then I can then merge all of that together to get a holistic whole worldview. But by tracking it on each individual container or serverless process or what have you, I can drill down into those and see individual containers.

If I have a Kubernetes cluster that’s running 20 containers and pods, I may deploy a new piece of code that spins up 20 new ones and turns off the 20 old ones, so all of the names of my telemetry streams will change because they will have different container IDs – but I can still query those because I don’t actually care about the container ID. I want to find all of the things running this application from this time to that time, ignoring the container ID, merge them together, and then I see the overall trend. I can, for example, see a bump, and to determine if that bump correlates with anything specific, I break it down by container ID.

One pathology of failure that’s not uncommon is that you don’t have a degradation of service that’s homogenous across the set of machines. You’ll have one server that’s malfunctioning, one node in a cluster of systems that you believe is homogenous is actually acting very heterogeneous. It has packet loss that’s unexplained and a noisy neighbor that’s causing it to have latency outliers that others don’t. If you don’t track things on that sort of granular stream level, there’s no way after the fact to actually prove that point out. So you can say, I have these weird latency outliers. I have outliers on my app. It turns out no, I don’t. I have outliers on specific containers in a cluster at certain times. Now I have a very different hypothesis around what’s happening. So having that data both on an individual granular stream level and having it rolled up into a comprehensive view leads to the right question and answer analysis that engineers need to do.