Why Open Source Histograms Are The Future of Telemetry Monitoring

Latency measurements have become an important part of IT infrastructure and application monitoring. The latencies of a wide variety of events, such as requests, function calls, garbage collection, disk I/O, system calls, and CPU scheduling, are of great interest to engineers operating and developing IT systems. But managing and analyzing latency data poses several technical challenges. The volume emitted by a single data source can easily become very large; data has to be collected and aggregated from a large number of different sources; and the data has to be stored over long time periods to allow historical comparisons and long-term service quality estimates against service-level objectives (SLOs). Addressing these challenges requires a compression scheme that drastically reduces the size of the data to be stored and transmitted. The most accurate, cost-effective technology for this compression is the histogram.

A histogram is a data structure that models the distribution of a set of samples, for example, the age of every human on earth. Instead of storing each sample as its own record, a histogram groups samples into “buckets” or “bins” and stores only a count per bin, which allows for significant data compression and thus superior economics. This compression allows for extraordinary metric transmission and ingestion rates, high-frequency, real-time analytics, and economical long-term storage. Histograms are also particularly useful in handling the breadth and depth of metric data produced by container technologies such as Kubernetes.
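As a minimal sketch of the bucketing idea, the Python snippet below groups a handful of made-up ages into ten-year bins and keeps only one count per bin. The function name `build_histogram`, the fixed bin width, and the sample values are assumptions chosen for illustration; production histogram libraries use more sophisticated bin layouts.

```python
from collections import Counter


def build_histogram(samples, bin_width):
    """Group samples into fixed-width bins and return {bin_lower_bound: count}."""
    counts = Counter()
    for value in samples:
        # Integer division finds the lower edge of the bin holding this value.
        lower = (value // bin_width) * bin_width
        counts[lower] += 1
    return dict(counts)


# Illustrative sample data: 15 individual records.
ages = [3, 7, 12, 18, 25, 25, 31, 38, 42, 47, 55, 61, 68, 74, 89]

histogram = build_histogram(ages, bin_width=10)

# Only one (bin, count) pair per occupied bin is stored, however many samples arrive.
for lower in sorted(histogram):
    print(f"{lower:>2}-{lower + 9}: {histogram[lower]}")
```

The storage cost depends on the number of occupied bins rather than the number of samples, which is where the compression comes from.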

At Circonus, we’re passionate about histograms and how valuable they are for engineers and software developers, which is why we donated our histogram technology, OpenHistograms, to the open source community. The problem is that the monitoring industry has no single standard for histograms, and therefore all too frequently, users are leveraging them incorrectly, which has costly consequences. In this article, I’ll share why histograms are needed now more than ever, and therefore why the monitoring industry needs to embrace an open source, single-standard histogram technology.

Histograms are needed now more than ever

Histograms are more essential to the monitoring industry now than ever before. Why? When the internet was small and users were not accessing services at high rates, you could more easily store and analyze each individual request and set standards around serving all requests accurately and fast enough. Today, there are many, many more user interactions being generated, collected, and analyzed. But even more game-changing is that organizations now have multiple layers of systems, services, and applications communicating with each other and generating an overwhelming volume of data, significantly more than users alone could produce. For example, if you’re running a database on a system and you expect your disks to perform operations at a certain speed, this activity alone could generate a million data points a second, which adds up to more than 86 billion a day.

Now, ensuring that all requests are served fast enough becomes an impractical objective, from both a capability and an economic standpoint. It’s just not worth being perfect. So engineers now need to analyze the behavior of their systems and determine, quantitatively, what is good enough. If you’re serving web pages or an API endpoint, how many errors are you allowed to have? How fast do you need to serve requests? The problem with asking how fast most requests need to be is that the question has two variables: how fast (a latency, measured in milliseconds) and how many (a proportion, expressed as a percentile).

This is a really hard statistics problem to solve. And on top of this, organizations have significantly more data to store. If recording every single transaction is exorbitantly expensive and doing the math of analyzing latencies on every single transaction is also expensive, then engineers need some sort of model that allows them to inexpensively store all of those measurements and answer that question of how many, how fast. The histogram is a perfect model for all of that.

Histograms can collect, compress, and store ALL data points (billions!) and allow engineers to accurately analyze what percentage of their traffic is slower or faster than a certain speed, at low cost and with negligible overhead. Critically, they allow engineers to change both of those variables on the fly, after data ingestion. So instead of saying, “I need 99% of requests to be served faster than one second,” you can start to ask, “what does it look like when I have 98% of requests served faster than 5,500 milliseconds?” Without histograms, you have to phrase your questions precisely before you start collecting, and engineers cannot do this with specificity and accuracy beforehand. Histograms allow you to store data at practically any volume and answer more complex statistical questions after the fact, which is what’s needed in today’s service-centric, rapid-release-cycle environment.
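To illustrate how both variables can be changed after ingestion, here is a small Python sketch that answers “what fraction was faster than X?” and “how fast was the Nth percentile?” from the same stored bin counts. The bin edges, the counts, and the function names `fraction_faster_than` and `latency_at_percentile` are hypothetical, and answers are only as precise as the bin boundaries.

```python
def fraction_faster_than(hist, threshold_ms):
    """Fraction of samples that fall in bins lying entirely below threshold_ms."""
    total = sum(hist.values())
    below = sum(count for (lo, hi), count in hist.items() if hi <= threshold_ms)
    return below / total


def latency_at_percentile(hist, percent):
    """Upper edge of the bin in which the cumulative count reaches `percent`."""
    total = sum(hist.values())
    needed = total * percent / 100
    running = 0
    for (lo, hi), count in sorted(hist.items()):
        running += count
        if running >= needed:
            return hi
    raise ValueError("histogram is empty")


# Hypothetical stored histogram: {(lower_ms, upper_ms): request_count}.
latency_hist = {
    (0, 100): 600,
    (100, 200): 250,
    (200, 500): 100,
    (500, 1000): 40,
    (1000, 2000): 10,
}

# Fix "how fast" and ask "how many": share of requests under one second.
share_under_1s = fraction_faster_than(latency_hist, 1000)

# Fix "how many" and ask "how fast": latency bound for 98% of requests.
p98_bound_ms = latency_at_percentile(latency_hist, 98)

print(share_under_1s, p98_bound_ms)
```

Neither question had to be chosen before the data was collected; both are computed from the same counts.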

Histograms must be open source

At Circonus, we’re open source advocates and believe most technology should be open source because it assures users that they can be stakeholders in it. But the most important reason we’re passionate about our histogram technology being open source is that users absolutely must have an industry standard for histograms, meaning organizations can leverage a single histogram technology across their monitoring stacks.

If you’re collecting your telemetry using different histograms from different vendors within your monitoring and observability stack (say, telemetry from your cloud provider and telemetry from your APM provider), you cannot merge the data between histograms because they use different bin boundaries or different techniques. Unfortunately, all too often users do merge this data, introducing significant error that carries into the subsequent analysis of the data. This ends up hurting the operator and the end user.
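The effect of mismatched binning can be shown with a short Python sketch. It assumes two hypothetical vendors: one bins in steps of 10 and the other in steps of 25. The sample values, the function names, and the midpoint re-binning strategy are illustrative assumptions, not a description of how any particular vendor merges data.

```python
from collections import Counter


def bin_samples(samples, width):
    """Histogram keyed by the lower edge of each fixed-width bin."""
    return Counter((s // width) * width for s in samples)


def rebin_by_midpoint(hist, src_width, dst_width):
    """Force a histogram onto a different bin width by guessing that every
    sample sits at the midpoint of its source bin (a lossy assumption)."""
    out = Counter()
    for lower, count in hist.items():
        midpoint = lower + src_width // 2
        out[(midpoint // dst_width) * dst_width] += count
    return out


vendor_a_samples = [12, 14, 31, 33, 58]
vendor_b_samples = [3, 24, 26, 49, 51]

# Ground truth: all raw samples binned together at width 10.
truth = bin_samples(vendor_a_samples + vendor_b_samples, 10)

# Same binning on both sides: merging is just adding counts, with no error.
exact = bin_samples(vendor_a_samples, 10) + bin_samples(vendor_b_samples, 10)

# Different binning: vendor B's width-25 bins must be guessed onto width-10 bins.
vendor_b_coarse = bin_samples(vendor_b_samples, 25)
lossy = bin_samples(vendor_a_samples, 10) + rebin_by_midpoint(vendor_b_coarse, 25, 10)

print("truth:", sorted(truth.items()))
print("exact:", sorted(exact.items()))
print("lossy:", sorted(lossy.items()))
```

The lossy merge still has the right total count, so nothing looks broken, but samples land in the wrong bins and any percentile computed from it inherits that error.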

The industry must focus on a single histogram model implementation because it increases compatibility between services and directly benefits the end user. Circonus’ implementation of histograms, Circllhist, has been in the industry since 2011. It has been independently tested and evaluated multiple times over the years and consistently deemed superior to other approaches in terms of balancing performance, accuracy, correctness, and usability. With the goal of fostering and facilitating the interchangeability and mergeability of data between vendor platforms for all users, we recently released our histogram technology under the Apache 2.0 license to the open source community as OpenHistograms.

Circonus’ OpenHistograms are vendor-neutral log-linear histograms for the compression, mergeability, and analysis of telemetry data. Two key differentiating factors for OpenHistograms are that they use base 10, which eases usability, and that they do not require floating point arithmetic, so you can run them on embedded systems that don’t have floating point units.
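To give a feel for what base-10, log-linear binning without floating point can look like, here is a minimal Python sketch that uses only integer arithmetic. It is not the OpenHistograms API or its exact bin layout: the names `log_linear_bin` and `bin_bounds`, the restriction to positive integers, and the choice of keeping two significant decimal digits are assumptions made for illustration.

```python
def log_linear_bin(value):
    """Map a positive integer to (mantissa, exponent) using integer math only.

    The mantissa keeps at most two significant decimal digits, so the bin
    covers [mantissa * 10**exponent, (mantissa + 1) * 10**exponent).
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive integers")
    exponent = 0
    while value >= 100:
        value //= 10  # drop the least significant decimal digit
        exponent += 1
    return value, exponent


def bin_bounds(mantissa, exponent):
    """Lower (inclusive) and upper (exclusive) edges of a bin."""
    scale = 10 ** exponent
    return mantissa * scale, (mantissa + 1) * scale


# Hypothetical latencies in microseconds.
for sample in (5, 99, 100, 1234, 987654):
    key = log_linear_bin(sample)
    print(sample, key, bin_bounds(*key))
```

In this layout the bin edges are round decimal numbers that are easy to read, bins are linear within each power of ten, and for values of 10 or more the bin width never exceeds 10% of the value.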

OpenHistograms allow users to seamlessly exchange telemetry between vendor platforms without introducing error. Organizations that are faced with the challenge of digesting and analyzing massive amounts of distribution data can now rely on a consistent, interchangeable, and stable representation of that data, a significant capability for the monitoring industry now and in the future.

Time for a single standard

The volume of data IT organizations are responsible for collecting and analyzing is growing substantially year over year, and as a result, users are increasingly leveraging histogram technology as a way to measure service quality. But a vast majority are merging telemetry data from different vendor histograms, and the output, though not obviously so, is wrong. Organizations are inaccurately concluding they are hitting or not hitting SLOs and basing key operational decisions on this data, which can cost them thousands of dollars a year. Every engineer and app developer should feel confident that they can create just one histogram, give it to someone, and know that the recipient can use it accurately. By embracing vendor-neutral, industry-standard histogram technology, users have one source of truth and can rest assured their analysis is accurate.