The Scaling Limitations of Graphite and Solutions to Overcome Them

Graphite is a free and open-source software (FOSS) tool that monitors and graphs numeric time-series data. It began as an internal project at Orbitz in 2006 and eventually grew to be the company's foundational monitoring tool. In 2008, Orbitz released Graphite under the open source Apache 2.0 license.

Graphite made it possible to know more than simply whether applications were up and running. For the first time, developers could instrument their applications so that the business could understand the performance of its infrastructure and how it was affecting the user experience.

Graphite has three main components:

  • Carbon – a Twisted daemon that listens for time-series data
  • Whisper – a database library for storing time-series data
  • Graphite webapp – a Django webapp that renders graphs on demand
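
To make the Carbon piece concrete, here is a minimal sketch of how an application might hand a data point to Carbon using its plaintext line format (metric path, value, and Unix timestamp separated by spaces). The metric name, the host, and the port shown are assumptions for illustration, and the actual send is commented out because it needs a running Carbon listener.

```python
import socket


def format_metric(path: str, value: float, timestamp: int) -> bytes:
    """Encode one data point as a Carbon plaintext line: '<path> <value> <timestamp>\n'."""
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metric(line: bytes, host: str = "localhost", port: int = 2003) -> None:
    """Send one encoded line over TCP. Host and port are assumptions; adjust to your setup."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line)


# Hypothetical metric name, used only for illustration.
line = format_metric("app.checkout.latency_ms", 42.5, 1700000000)
# send_metric(line)  # uncomment when a Carbon daemon is reachable
```

Each distinct metric path sent this way becomes its own series, which matters for the storage discussion later on.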

One of the reasons Graphite became so popular is its modular architecture, which allowed Graphite operators to scale certain components to handle their respective use cases.

However, Graphite typically cannot scale to meet the demands of modern enterprises. The proliferation of microservices has led to explosive growth in telemetry volume. Whisper, a storage format that once worked well, now severely underperforms at that scale. Graphite read queries can be expensive, and its high availability model has some significant drawbacks.

This forces teams to make compromises in order to keep their existing deployments functioning. These compromises include:

  • Reducing the amount of data that they retain
  • Collecting less telemetry than they would have otherwise
  • Accepting unreasonable latency
  • Sacrificing data safety due to a lack of redundancy in the data being stored

Below, I outline some of the challenges operators face when running Graphite at scale.

Key Limitations of Graphite

Inability to scale Whisper

Whisper is essentially a flat-file database that represents each unique series as a fixed-size file, the size of which is determined by the resolution and retention configured for that series. Each incoming data point is written into a slot in that file, and the oldest data is overwritten once the retention window is full.
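
As a rough illustration of how resolution and retention drive file size, the sketch below estimates the size of one Whisper file. The byte counts (a 16-byte file header, 12 bytes of header per archive, and 12 bytes per stored point) reflect my understanding of the Whisper file layout and should be treated as assumptions, as should the example retention schema.

```python
METADATA_BYTES = 16      # file header (assumed layout)
ARCHIVE_INFO_BYTES = 12  # one header entry per archive (assumed layout)
POINT_BYTES = 12         # 4-byte timestamp + 8-byte double value (assumed layout)


def whisper_file_size(archives: list[tuple[int, int]]) -> int:
    """Return the size in bytes for a list of (seconds_per_point, retention_seconds)."""
    total_points = sum(retention // step for step, retention in archives)
    return METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * total_points


HOUR, DAY, YEAR = 3600, 86400, 365 * 86400
# Hypothetical schema: 10s points for 6 hours, 1m for 7 days, 10m for 5 years.
schema = [(10, 6 * HOUR), (60, 7 * DAY), (600, 5 * YEAR)]

size = whisper_file_size(schema)
print(size)                      # bytes per series
print(size * 100_000 / 1024**3)  # GiB for 100,000 series
```

Under these assumptions a single series costs about 3.3 MB on disk whether or not it ever receives data, so 100,000 series come to roughly 307 GiB.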

Querying or reading this data requires finding all of the files that match a particular query, opening each one, parsing each file to get to the requested time period within that file, and copying the data out.

This process must occur for every single one of the matching series. And, given that there can be thousands, tens of thousands, or even hundreds of thousands of series, this process quickly becomes expensive from a disk perspective.
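
To see how quickly the file count grows, here is a small sketch that counts how many series, and therefore how many files, a single wildcard query would touch. The host inventory and metric names are hypothetical, and the segment-by-segment matching is a simplification of Graphite's real pattern handling.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, series: str) -> bool:
    """Match a dotted series name against a dotted glob, one path segment at a time."""
    p_parts, s_parts = pattern.split("."), series.split(".")
    return len(p_parts) == len(s_parts) and all(
        fnmatchcase(s, p) for p, s in zip(p_parts, s_parts)
    )


# Hypothetical inventory: 500 hosts, each reporting 4 metrics (2,000 series).
all_series = [
    f"servers.web{n:03d}.{metric}"
    for n in range(500)
    for metric in ("cpu.user", "cpu.system", "mem.used", "disk.free")
]

# One dashboard query for CPU across every host.
to_open = [s for s in all_series if matches("servers.*.cpu.*", s)]
print(len(to_open))  # files this one query would have to open, seek, and read
```

With this made-up inventory, one query fans out to 1,000 file reads, and a dashboard with a dozen such panels multiplies that again on every refresh.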

Addressing this limitation often involves re-architecting to provide more IOPS to account for the respective read and write operations. And this re-architecting often results in the loss of historic data, metrics gaps, and downtime. Worst of all, re-architecting in this way becomes a never-ending process as the infrastructure footprint of an organization continues to grow over time.

Little to No Data Safety

Offering little in the way of data redundancy, Graphite cannot guarantee the safety of your data against accidental loss or corruption.

Even keeping two copies of all of your time series data in two separate Whisper backends does not provide an adequate workaround. In such cases, there is no concept of state between replica backends—so in the event of an infrastructure, software, or network issue, the two backends often become out of sync.

Once that happens, there’s no way to re-sync them, which means you can receive different responses to the same query depending on which backend is responding. This typically results in organizations avoiding an active/active architecture that sends queries to both systems, leaving the second backend as a passive backup. While it is a best practice to always test your backup systems, in practice, if they are not in active use, there is a good chance that the system will not work as expected when you cut over to it.
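
The divergence problem can be sketched in a few lines. The example below compares the responses two independent backends might return for the same series and lists the timestamps where they disagree. The data and the function name are hypothetical; this only illustrates the symptom, not any Graphite tooling.

```python
def diverging_points(a: dict[int, float], b: dict[int, float]) -> list[int]:
    """Return sorted timestamps where the two replicas disagree or one is missing a point."""
    return sorted(t for t in a.keys() | b.keys() if a.get(t) != b.get(t))


# Hypothetical responses for the same series from two independent backends.
# Backend B missed the write at t=1020 during a network blip.
backend_a = {1000: 1.0, 1010: 2.0, 1020: 3.0, 1030: 4.0}
backend_b = {1000: 1.0, 1010: 2.0, 1030: 4.0}

gaps = diverging_points(backend_a, backend_b)
print(gaps)  # timestamps where a graph would differ depending on the backend
```

Detecting the gap is the easy part. Because neither backend tracks what the other holds, deciding which copy is right and repairing the other is left to the operator.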

Additionally, you cannot take down Graphite infrastructure for maintenance without losing data. So if, for example, you take down Whisper, you will still have data streaming in, creating a growing backlog and missed metrics.

Inefficient Long-Term Data Retention

Three to six months of retention just doesn’t cut it these days, but that’s how most operators run Graphite. They limit retention to prevent users from running expensive long-range queries that simply won’t ever return, because Whisper is not good at responding to those types of queries.

The same rigidity applies to rollups, which cannot easily be changed once they are configured during the initial Graphite setup. Whatever assumptions were made at that point cannot be revisited as the organization grows, which prevents any sort of query performance tuning or storage optimization.
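
A short sketch shows why an early rollup choice is hard to undo. It downsamples the same raw points with two different aggregation functions. The sample values are made up, and the `rollup` helper is a simplified stand-in for what a rollup does, not Whisper's implementation.

```python
from statistics import mean


def rollup(points: list[float], factor: int, aggregate) -> list[float]:
    """Collapse every `factor` consecutive points into one using `aggregate`."""
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]


# Hypothetical latency samples at 10-second resolution (12 points = 2 minutes),
# with a single spike in the first minute.
raw = [10.0, 10.0, 10.0, 250.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]

by_avg = rollup(raw, 6, mean)  # what an average-based rollup keeps
by_max = rollup(raw, 6, max)   # what a max-based rollup would have kept
print(by_avg, by_max)
```

Here the average-based rollup stores 50.0 for the first minute while a max-based one would have stored 250.0. Once the raw points age out, the spike cannot be recovered from the averaged data, so a decision made at setup time shapes every long-range query afterward.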

Modern expectations around data retention are multi-year. Organizations want to be able to run twelve-month, year-over-year analyses to really understand the impact their metrics are having on their customers and their business.

Spurious Aggregations

Spurious aggregations are most commonly seen when Graphite deployments are shared by many teams within an enterprise. Such deployments often end up pre-aggregating and storing a large number of metrics that no one ever asked for or needed.

Even worse, these spurious aggregations often represent the majority of metrics currently in Whisper. If 90% of your metrics are things that people don’t look at, you are likely confusing users. You’re also cluttering your Whisper storage and slowing down queries as you do so.

How to Scale Graphite

If you and your team are currently living these Graphite challenges, Circonus has two Graphite modernization solutions to consider:

  1. Utilize a “drop-in” replacement of Whisper with our compatible time series database, Circonus IRONdb. This allows you to breathe new life into your existing deployment, so you are better able to:
    1. Efficiently manage the long-term storage of your telemetry
    2. Scale ingestion and querying through vertical and horizontal scaling
    3. Protect your data through a replicated datastore
  2. Migrate your Graphite deployment to the full Circonus SaaS Monitoring Platform. This allows you to keep what you love about your current Graphite, and get rid of what you don’t. You can:
    1. Replace legacy ingestion components with modern, performant, highly-scalable components
    2. Empower downstream teams to use histograms to enable SLOs, error budgeting, and other modern SRE practices
    3. More efficiently manage the storage of metrics you may not need

You can learn more about these solutions here:
https://www.circonus.com/platform/graphite-modernization/