The Case for a Broader, Deeper, Coordinated Approach to Monitoring Your Business

Effective monitoring is essential for running a successful web business. It improves business performance by accelerating response to issues and opportunities that arise from web application operations, helping your company get and keep customers, boost revenue and build brand reputation.

Regardless of their role, everyone responsible for the success of the business needs the ability to assess its status at any given point. Adopting a holistic approach to monitoring that integrates business and technology goals and metrics provides executives, analysts, and engineers with a clear picture of how the entire business is operating. It also provides invaluable data on trends and component interactions to guide planning, troubleshooting, and strategy optimization. While system engineers don’t need to understand the details of marketing, they should be aware of their company’s marketing objectives and how the web applications they support contribute to, and are affected by, those objectives. Likewise, the CEO doesn’t need to know how the web applications work in the background, but should be able to connect key operating metrics, such as email bounce rates for an e-commerce marketing business, to their impact on costs, revenue, and market perception.

While almost all web businesses perform some level of monitoring, most companies would benefit from adopting a broader, more sophisticated, and proactive monitoring strategy. Use the approach recommended in this paper to determine the business objectives, measures, and thresholds that define the success of your web application and will drive your monitoring strategy. Create a dashboard that combines this business and technical information to produce a visually impactful, holistic view of your web business performance. Review existing web applications to ensure monitoring is sufficient and used effectively. If your current sources of monitoring data are insufficient, research, acquire, learn, and deploy the right set of monitoring tools to support your new guidelines. When developing new web applications, incorporate the design and construction of business and functionality monitors within the scope of the project. This focuses efforts on the most important success measures and maximizes the benefit of monitoring once the application is deployed.

To learn more about the case for taking a broader, deeper, and more coordinated approach to monitoring, read Monitoring the Big Picture: A Modern Approach for Web Application Monitoring. It gives technical and business managers a greater understanding of the role and importance of monitoring in managing their web businesses, and it illustrates how a holistic, multi-disciplinary monitoring program can solve complex issues that cross business and technical boundaries and drive real improvements in business performance.

Distributed Systems Testing or Why My Hair Is Falling Out

Here at Circonus, our infrastructure is highly distributed. Many of the functions of Circonus are distinct systems that communicate with each other. In addition, we use a distributed data store for storing and retrieving data for graphs. Of course, testing a system is necessary to ensure that everything keeps working smoothly, but testing distributed systems like ours can be extraordinarily difficult.

A number of issues come up when testing distributed systems: accounting for asynchronous delivery among nodes in a distributed data store, assuring that all data has been replicated and stored properly across the data cluster, making sure that the entire system works end-to-end, dealing with and recovering from individual component or data storage node failure, and more. These issues can be extremely difficult to test, as anyone who has worked on a distributed system can attest. I have been working on distributed systems for years and have developed a few strategies for dealing with these and other issues.

In the July issue of ACM Queue, I discuss how we deal with these and other issues when testing our systems. To learn about our approach to testing in more detail, check out Testing A Distributed System.

Also, if tackling complex infrastructures is your dream job, come work with us!

Introducing Leo – A tool to automatically set up & configure NAD!

Leo is an extension to the NAD (Node Agent Daemon) client that automatically creates a check, a set of graphs, and a worksheet for a particular host. Leo’s goal is to make configuration of a host as simple as possible.

After running a simple command line script, you will be able to log into your Circonus account and view graphs for CPU Usage, Disk, Network, and Memory utilization, as well as a worksheet for your host. Leo will prompt you for information such as the IP or hostname, your Circonus auth token, a broker ID, and the location of a config file. It then passes that information to NAD, which uses it to create the check, graphs, and worksheet.

If you decide to try Leo, we would love to hear your feedback. You can send any comments, including what you liked, what you thought could be improved, and any questions, to hello@circonus.com.

To get started with Leo, visit our YouTube channel and watch our tutorials for installing and configuring Leo:

Installation Tutorial

Summary:

  1. This program is installed on an Amazon Web Services EC2 instance running CentOS.
  2. You must have Node.js and NAD already installed.

More info on NAD is available on GitHub.

There are two ways to install Leo: wget and git clone.

wget

# wget https://github.com/circonus-labs/leo/archive/master.zip
# unzip master.zip
# cd leo-master
# npm install

git clone

# git clone https://github.com/circonus-labs/leo.git
# cd leo
# npm install

There is a third way to install Leo that is not covered in the video: Leo is published as an npm module named circonus-leo, so it can be installed directly with npm. This method is not recommended, however, because it places Leo under the node_modules directory, which adds an extra step to the path you use when running it.

npm

# npm install circonus-leo

Configuration Tutorial

Summary:

The instructions below assume that you installed Leo using git clone. If you used wget or npm, the instructions are the same, but you access Leo through the leo-master directory for wget, and go through node_modules/leo for npm.

To get an auth/API token:

  1. On your Circonus account, go to “API Tokens” under the “User” section of the Main Menu and click “New API Token +”
  2. In your terminal, run:
    # leo/bin/circonus-setup -k ["YOUR API TOKEN"] -t ["YOUR IP ADDRESS OR HOSTNAME"]
  3. Hit enter. You will receive an error saying “App: nad still pending approval”
  4. Then go back to your Circonus account and refresh your “API Tokens” page. You should now have an option to allow NAD access.
  5. Click the “Allow Access” option.

To find your broker ID:

  1. Go to your Dashboard and click the “manage brokers” option above the map that displays all of the brokers.
  2. Click the menu symbol (the little hamburger) to the left of the broker you want to use.
  3. Click “view API object”. The number that comes after "_cid": "/broker/" is your broker ID. For example, if the API object reads "_cid": "/broker/3", the broker ID would be 3.

You can either configure Leo with one command line request that contains all of your information or run Leo and let it prompt you for your info.

Command Line Request

This example uses the default settings, which include a JSON check, 4 graphs (CPU Usage, Disks, Network, and Memory), and a worksheet containing that check and those 4 graphs:

 # leo/bin/circonus-setup -k ["YOUR API TOKEN"] -t ["YOUR IP ADDRESS OR HOSTNAME"] -b ["BROKER ID"] --alldefault

Once you hit enter, it will prompt you to save your settings to a config file. Either enter the name of the file to which you want your information saved, or just hit enter to skip this step. It should then tell you that 1 check, 4 graphs, and 1 worksheet have been created.

Letting Leo Prompt You

# leo/bin/circonus-setup 

After running this command, Leo will prompt you for your auth token, target, broker ID, whether you want to use the default settings (if you say no, you can choose the type of check you want to create and the metrics you want), and a config file to which it will save your information. It will then create a check, graphs, and a worksheet based on the information you provided.

Let’s look at Leo in action through the creation of a JSON check and its accompanying graphs and worksheet:

Checks & Metrics

Leo can create either a JSON check or a PostgreSQL check containing up to 155 different metrics.

[Screenshot: checks]

Worksheets

One worksheet will be created for each configured check.

[Screenshots: worksheet-item, checks]

Graphs

If you choose to create a JSON check, Leo and NAD will create a graph for CPU, Disk space, Memory, and Network utilization. For a PostgreSQL check, Leo and NAD will create a PostgreSQL Connections graph.

[Screenshots: graph-cpu, graph-disks, graph-network, graph-memory]

Metrics from Custom Apps…Easy!

Many developers are looking for a platform to which they can send arbitrary data from their custom applications to collect and visualize those metrics, as well as alert on specific thresholds. With Circonus’s ability to accept and parse raw JSON, it’s easy to send metrics from custom applications into the system. More information on JSON parsing can be found here or in the User Docs, but the steps below will get you up and running quickly.

1.) The first step for sending JSON data to Circonus is to create an HTTPTrap check. Under the Checks page, click on “New Check +” in the upper right corner, then expand the JSON option and choose “Push (HTTPTrap)”. Select the HTTPTrap broker from the list, then set up the host and secret. Click on “Test Check”, then “Finish”, even though there are no metrics selected.

2.) Now that the check is created, find that check and go into the details. There will be a “Data Submission URL” listed, which is the URL to which you will PUT the data from your application. Once the data is being submitted at regular intervals (either as frequently as you have it, or every 30 seconds if it is a sample), you can go back into the check to enable the metrics. Alternatively, you can use the Check Bundle API to manage the metrics and enable any metrics that are present but disabled. You can also enable histogram collection using the same methods.
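
For illustration, here is a minimal sketch of what that PUT might look like from a custom application, using only Python’s standard library. The submission URL and metric names below are placeholders, not values from your account; substitute the Data Submission URL shown in your HTTPTrap check’s details.

import json
import urllib.request

# Placeholder; use the "Data Submission URL" from your HTTPTrap check's details page.
SUBMISSION_URL = "https://trap.noit.circonus.net/module/httptrap/CHECK_UUID/SECRET"

# Example metrics from a hypothetical application; numeric values can be graphed directly.
payload = {
    "requests_per_second": 142,
    "queue_depth": 7,
    "cache_hit_ratio": 0.93,
}

req = urllib.request.Request(
    SUBMISSION_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())

Run this on whatever schedule suits your data (as frequently as you have it, or every 30 seconds for samples), then enable the metrics as described above.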

Once data is being collected, you can then start graphing, alerting, streaming to dashboards, and performing analytics on your data. Additionally, if you add other check types to your Circonus account, you can compare these custom metrics to other data to get the full picture of what is really going on at any given time.

Video: Architecture of a Distributed Analytics/Storage Engine for Massive Time-Series Data

The numerical analysis of time-series data isn’t new. The scale of today’s problems is. With millions of concurrent data streams, some of which run at 1MM samples per second, storing the data and making it continuously available for analysis is a daunting challenge.

At Circonus, we designed such a solution. At Applicative 2015, our CEO, Theo Schlossnagle, discussed the approach and the technical details of how the system was constructed. Check out his talk…

The Problem with Math: Why Your Monitoring Solution is Wrong

Math is perfect – the most perfect thing in the world. The problem with math is that it is perpetrated by us imperfect humans. Circonus has long aimed to bring deep numerical analysis to business telemetry data. That may sound like a lot of mumbo-jumbo, but it really means we want better answers to better questions about all that data your business creates. Like it or not, to do that, we need to do maths and to do them right.

That’s not how it works

I was watching a series of Esurance commercials that have recently aired wherein people execute Internet-sounding tasks in nonsensical ways, like posting photographs to your [Facebook] wall by pinning things to your living room wall while your friends sit on the couch to observe, and I was reminded of something I see too often: bad math. Math to which I’m inclined to rebut: “That’s not how it works; that’s not how any of this works!”

A great example of such bad math is the inaccurate calculation of quantiles. This may sound super “mathy,” but quantiles have wide applications and there is a strong chance that you need them for things such as enforcing service level agreements and calculating billing and payments.

What is a quantile? First we’ll use the syntax q(N, v) to represent a quantile, where N is a set of samples and v is some number between 0 and 1, inclusive. Herein, we’ll assume some set N and just write a quantile as q(v). Remember that 0% is 0 and 100% is actually 1.0 in probability, so the quantile simply asks: what sample in my set is such that a fraction v of the samples are less than it and (1-v) of the samples are greater than it? This may sound complicated. Most descriptions of math concepts are a bit opaque at first, but a few examples (and the short sketch after the list below) can be quite illustrative.

  • What is q(0)? What sample is such that 0 (or none) in the set are smaller and the rest are larger? Well, that’s easy: the smallest number in the set. q(0) is another way of writing the minimum.
  • What is q(1)? What sample is such that 1 (100% or all) in the set are smaller and none are larger? Also quite simple: the largest number in the set. q(1) is another way of writing the maximum.
  • What is q(0.5)? What sample is such that 50% (or half) in the set are smaller and the other half are larger? This is the middle value, or in statistics, the median. q(0.5) is another way of writing the median.
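
To make that concrete, here is a tiny Python sketch on a made-up sample set. It uses a deliberately naive quantile (libraries such as numpy.quantile interpolate more carefully), but it is enough to show that q(0), q(0.5), and q(1) recover the minimum, median, and maximum:

samples = [12, 5, 7, 30, 9, 14, 21]

def q(N, v):
    # Naive quantile: take the sample at position round(v * (n - 1)) in the sorted set.
    s = sorted(N)
    return s[round(v * (len(s) - 1))]

print(q(samples, 0))    # 5  -> the minimum
print(q(samples, 1))    # 30 -> the maximum
print(q(samples, 0.5))  # 12 -> the median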

SLA calculations – the good, the bad and the ugly

Quite often when articulating Internet bandwidth billing scenarios, one will measure the traffic over each 5 minute period throughout an entire month and calculate q(0.95) over those samples. This is called 95th percentile billing. In service level agreements, one can stipulate that the latency for a particular service must be faster than a specific amount (some percentage of the time), or that some percentage of all interactions with said service must be at a specific latency or faster. Why are those methods different? In the first, your set of samples is calculated over discrete buckets of time, whereas in the second your samples are simply the latencies of each service request. As a note to those writing SLAs: the first is dreadfully difficult to articulate and thus nigh-impossible to calculate consistently or meaningfully; the second just makes sense. While discrete time buckets might make sense for availability-based SLAs, they make little-to-no sense for latency-based SLAs. As “slow is the new down” is adopted across modern businesses, simple availability-based SLAs are rapidly becoming irrelevant.

So, let’s get more concrete: I have an API service with a simply stated SLA: 99.9% of my accesses should be serviced in 100 or fewer milliseconds. It might be a simple statement, but it is missing a critical piece of information: over what time period is this enforced? Not specifying the time unit of SLA enforcement is most often the first mistake people make. A simple example will best serve to illustrate why.

Assume I have an average of 5 requests per second to this API service. “Things go wrong” and I have twenty requests that are served at 200ms (significantly above our 100ms requirement) during the day: ten slow requests at 9:02am and the other ten offenders at 12:18pm. Some back-of-the-napkin math says that as long as less than 1 request out of 1000 is slow, I’m fine (99.9% are still fast enough). As I far exceed 20,000 total requests during the day, I’ve not violated my SLA… However, if I enforce my SLA on five minute intervals, I have 1500 requests occurring between 9:00 and 9:05 and 10 slow responses. 10 out of 1500 is… well… there goes my SLA. Same blatant failure from 12:15 to 12:20. So, based on a five-minute-enforcement method I have 10 minutes of SLA violation during the day vs. no SLA violation whatsoever using a one-day-enforcement method. But wait… it gets worse. Why? Bad math.
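
A quick check of that back-of-the-napkin math, using only the figures from the example above:

# 5 requests/second, 20 slow requests in the whole day, 10 of them in one 5-minute window.
req_per_sec = 5

day_total = req_per_sec * 86_400       # 432,000 requests in a 24-hour day
slow_day  = 20
print(slow_day / day_total)            # ~0.0000463 -> well under the 0.1% budget

window_total = req_per_sec * 300       # 1,500 requests in a 5-minute window
slow_window  = 10
print(slow_window / window_total)      # ~0.0067 -> 0.67%, far over the 0.1% budget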

Many systems out there calculate quantiles over short periods of time (like 1 minute or 5 minutes). Instead of storing the latency for every request, the system retains 5 minutes of these measurements, calculates q(0.999), and then discards the samples and stores the resulting quantile for later use. At the end of the day, you have q(0.999) for each 5 minute period throughout the day (288 of them). So given 288 quantiles throughout the day, how do you calculate the quantile for the whole day? Despite the fictitious answer some tools provide, the answer is you don’t. Math… it doesn’t work that way.

There are only two magical quantiles that allow this type of reduction: q(0) and q(1). The minimum of a set of minimums is indeed the global minimum; the same is true for maximums. Do you know what the q(0.999) of a set of q(0.999)s is? Or what the average of a set of q(0.999)s is? Hint: it’s not the answer you’re looking for. Basically, if you have a set of 288 quantiles representing each of the day’s five minute intervals and you want the quantile for the whole day, you are simply out of luck. Because math.
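
Here is a small sketch of that loss of information, using entirely synthetic latencies (the distributions and interval sizes are made up; only the disagreement matters). It computes q(0.999) for each of 288 five-minute intervals and compares the “quantile of the quantiles” and the “average of the quantiles” against the true q(0.999) of all the raw samples:

import numpy as np

rng = np.random.default_rng(42)

# 288 five-minute intervals with made-up, varying load and latency characteristics (ms).
intervals = [
    rng.exponential(scale=rng.uniform(20.0, 120.0), size=rng.integers(200, 3000))
    for _ in range(288)
]

per_interval_q = [np.quantile(x, 0.999) for x in intervals]
all_samples = np.concatenate(intervals)

print("true q(0.999) over all raw samples:", np.quantile(all_samples, 0.999))
print("q(0.999) of the 288 q(0.999)s:     ", np.quantile(per_interval_q, 0.999))
print("average of the 288 q(0.999)s:      ", np.mean(per_interval_q))
# The rolled-up values generally disagree with the real one; once the raw samples
# are discarded, the true daily quantile cannot be recovered.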

In fact, the situation is quite dire. If I calculated the quantile of the aforementioned 9:00 to 9:05 time interval where ten samples of 1500 are 200ms, the q(0.999) is 200ms despite the other 1490 samples being faster than 100ms. Devastatingly, if the other 1490 samples were 150ms (such that every single request was over the prescribed 100ms limit), the q(0.999) would still be 200ms. Because I’ve tossed the original samples, I have no idea if all of my samples violated the SLA or just 0.1% of them. In the worst case scenario, all of them were “too slow” and now I have 3000 requests that were too slow. While 20 requests aren’t enough to break the day, 3000 most certainly are, and my whole day is actually in violation. Because the system I’m using for collecting quantile information is doing math wrong, the only reasonable answer to “did we violate the SLA today?” is “I don’t know, maybe.” It doesn’t need to be like this – Circonus calculates these quantiles correctly.

An aside on SLAs: While some of this might be a bit laborious to follow, the takeaway is to be very careful how you articulate your SLAs or you will often have no idea if you are meeting them. I recommend calculating quantiles on a day-to-day basis (and that all days are measured in UTC so the abomination that is daylight saving time never foils you). So to restate the example SLA above: 99.9% or more of the requests occurring on a 24-hour calendar day (UTC) shall be serviced in 100ms or less time. If you prefer to keep your increments of failure lower, you can opt for an hour-by-hour SLA instead of a day-by-day one. I do not recommend stating an SLA in anything less than one-hour spans.

Bad math in monitoring – don’t let it happen to you

Quantiles are an example where the current methods that most tools use are simply wrong and there is no trick or method that can help. However, something that I’ve learned working on Circonus is how often other tools screw up even the most basic math. I’ll list a few examples in the form of tips, without the indignity of attributing them to specific products; a short sketch after the list works through two of them. (If you run any of these products, you might recognize the example… or at some point in the future you will either wake up in a cold sweat or simply let loose a stream of astounding expletives.)

  • The average of a set of minimums is not the global minimum. It is, instead, nothing useful.
  • The simple average of a set of averages of varying sample sizes isn’t the average of all the samples combined. It is, instead, nothing useful.
  • The average of the standard deviations of separate sets of samples is not the standard deviation of the combined set of samples. It is, instead, nothing useful.
  • The q(v) of a set of q(v) calculated over sample sets is not the q(v) of the combined sample set. While creative, it is nothing useful.
  • The average of a set of q(v) calculated over sample sets is, you guessed it, nothing useful.
  • The average of a bunch of rates (which are nothing more than a change in value divided by a change in time: dv/dt) with varying dts is not the damn average rate (and a particularly horrible travesty).
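
Here is that sketch, with entirely made-up numbers, for the average-of-averages and average-of-rates cases:

# Average of per-group averages with different sample sizes vs. the real combined average.
group_a = [10.0] * 1000   # 1,000 samples with value 10
group_b = [100.0] * 10    # 10 samples with value 100

avg_of_avgs = (sum(group_a) / len(group_a) + sum(group_b) / len(group_b)) / 2
true_avg    = (sum(group_a) + sum(group_b)) / (len(group_a) + len(group_b))
print(avg_of_avgs, true_avg)   # 55.0 vs ~10.89

# Average of rates with different time spans vs. the real overall rate:
# 100 units moved in 10 seconds, then 100 units moved in 1,000 seconds.
avg_of_rates = (100 / 10 + 100 / 1000) / 2   # 5.05 units/sec
true_rate    = (100 + 100) / (10 + 1000)     # ~0.198 units/sec
print(avg_of_rates, true_rate)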

At Circonus we learned early on that math is both critically important to our mission and quite challenging for many of our customers to understand. This makes it imperative that we not screw it up. Many new adopters contact support asking for an explanation as to why the numbers they see in our tool don’t match their existing tools. We have to explain that we thought they deserved the right answer. As for their existing tools: that’s not how it works; that’s not how any of this works.

Wrangling Elephants in the Cloud

Yonah Russ is a hands-on Technology Executive, System Architect, and Performance Engineer. He is founder of DonateMyFee. You can read more articles by Yonah on LinkedIn where you will also find the original version of this post.


You know the elephant in the room, the one no one wants to talk about. Well it turns out there was a whole herd of them hiding in my cloud. There’s a herd of them hiding in your cloud too. I’m sure of it. Here is my story and how I learned to wrangle the elephants in the cloud.

Like many of you, my boss walked into my office about three years ago and said, “We need to move everything to the cloud.” At the time, I wasn’t convinced that moving to the cloud had technical merit. The business, on the other hand, had decided that, for whatever reason, it was absolutely necessary.

As I began planning the move, selecting a cloud provider, and picking tools with which to manage the deployment, I knew that I wasn’t going to be able to provide the same quality of service in a cloud as I had in our server farm. There were too many unknowns.

The cloud providers don’t like to give too many details on their setups, nor do they like to provide many meaningful SLAs. I have very little idea what hardware I’m running on. I have almost no idea how it’s connected. How many disks am I running on? What RAID configuration? How many IOPS can I count on? Is a disk failing? Is it being replaced? What will happen if the power supply blows? Do I have redundant network connections?

Whatever it was that made the business decide to move, it trumped all these unknowns. In the beginning, I focused on getting what we had from one place to the other, following whichever tried and true best practices were still relevant.

Since then, I’ve come up with these guiding principles for working around the unknowns in the cloud.

Beginners:

  • Develop in the cloud
  • Develop for failure
  • Automate deployment to the cloud
  • Distribute deployments across regions

Advanced:

  • Monitor everything
  • Use multiple providers
  • Mix and match private cloud

Wrangling elephants for beginners:

Develop in the cloud.

Developers invariably want to work locally. It’s more comfortable. It’s faster. It’s why you bought them a crazy expensive MacBook Pro. It is also nothing like production and nothing developed that way ever really works the same in real life.

If you want to run with the IOPS limitations of standard Amazon EBS, or you want to rely on Amazon ELBs to distribute traffic under sudden load, you need to have those limitations in development as well. I’ve seen developers cry when their MongoDB was deployed to EBS, and I’ve seen ELBs drop 40% of a huge media campaign.

Develop for failure.

Cloud providers will fail. It is cheaper for them to fail and, in the worst case, credit your account for some machine hours, than it is for them to buy high quality hardware and set up highly available networks. In many cases, the failure is not even a complete and total failure (that would be too easy). Instead, it could just be some incredibly high response times which your application may not know how to deal with.

You need to develop your application with these possibilities in mind. Chaos Monkey by Netflix is a classic, if not over-achieving, example.

Automate deployment to the cloud.

I’m not even talking about more complicated, possibly over complicated, auto-scaling solutions. I’m talking about when it’s 3am and your customers are switching over to your competitors. Your cloud provider just lost a rack of machines including half of your service. You need to redeploy those machines ASAP, possibly to a completely different data center.

If you’ve automated your deployments and there aren’t any other hiccups, it will hopefully take less than 30 minutes to get back up. If not, well, it will take what it takes. There are many other advantages to automating your deployments but this is the one that will let you sleep at night.

Distribute deployments across regions.

A pet peeve of mine is the mess that Amazon has made with their “availability zones.” While the concept is an easy-to-implement solution (from Amazon’s point of view) to the logistical problems involved in running a cloud service, it is a constantly overlooked source of unreliability for beginners choosing Amazon AWS. Even running a multi-availability zone deployment in Amazon only marginally increases reliability, whereas deploying to multiple regions can be much more beneficial with a similar amount of complexity.

Whether you use Amazon or another provider, it is best to build your service from the ground up to run in multiple regions, even if only in an active/passive capacity. Aside from the standard benefits of a distributed deployment (mitigation of DDOS attacks and uplink provider issues, lower latency to customers, disaster recovery, etc.), running in multiple regions will protect you against regional problems caused by hardware failure, regional maintenance, or human error.

Advanced elephant wrangling:

The four principles before this are really about being prepared for the worst. If you’re prepared for the worst, then you’ve managed 80% of the problem. You may be wasting resources or you may be susceptible to provider level failures, but your services should be up all of the time.

Monitor Everything.

It is very hard to get reliable information about system resource usage in a cloud. It really isn’t in the cloud provider’s interest to give you that information. After all, they are making money by overbooking resources on their hardware. No, you shouldn’t rely on Amazon to monitor your Amazon performance, at least not entirely.

Even when they give you system metrics, it might not be the information you need to solve your problem. I highly recommend reading the book Systems Performance: Enterprise and the Cloud by Brendan Gregg.

Some clouds are better than others at providing system metrics. If you can choose them, great! Otherwise, you need to start finding other strategies for monitoring your systems. It could be to monitor your services higher up in the stack by adding more metric points to your code. It could be to audit your request logs. It could be to install an APM agent.

Aside from monitoring your services, you need to monitor your providers. Make sure they are doing their jobs. Trust me, sometimes they aren’t.

I highly recommend monitoring your services from multiple points of view so you can corroborate the data from multiple observers. This happens to fit in well with the next principle.

Use multiple providers.

There is no way around it. Using one provider for any third party service is putting all your eggs in one basket. You should use multiple providers for everything in your critical path, especially the following four:

  • DNS
  • Cloud
  • CDN
  • Monitoring

Regarding DNS, there are some great providers out there. CloudFlare is a great option for the budget conscious. Route53 is not free but not expensive. DNSMadeEasy is a little bit pricier but will give you some more advanced DNS features. Some of the nastiest downtimes in the past year were due to DNS providers.

Regarding Cloud, using multiple providers requires very good automation and configuration management. If you can find multiple providers which run the same underlying platform (for example, Joyent licenses out their cloud platform to various other public cloud vendors), then you can save some work. In any case, using multiple cloud providers can save you from some downtime, bad cloud maintenance, or worse.

CDNs also have their ups and downs. The Internet is a fluid space and one CDN may be faster one day and slower the next. A good Multi-CDN solution will save you from the bad days, and make every day a little better at the same time.

Monitoring is great, but who’s monitoring the monitor? It’s a classic problem. Instead of trying to make sure every monitoring solution you use is perfect, use multiple providers from multiple points of view (application performance, system monitoring, synthetic polling).

These perspectives all overlap to some degree, backing each other up. If multiple providers start alerting, you know there is a real actionable problem, and from how they alert, you can sometimes home in on the root cause much more quickly.

If your APM solution starts crying about CPU utilization but your system monitoring solution is silent, you know that you may have a problem that needs to be verified. Is the APM system misreading the situation or has your system monitoring agent failed to warn you of a serious issue?

Mix and match private cloud

Even with all of the above steps to mitigate the risks of working in environments not completely in your control, really important business should remain in-house. You can keep the paradigm of software-defined infrastructure by building a private cloud.

Joyent licenses their cloud platform out to companies for building private clouds with enterprise support. This makes mixing and matching between public and private very easy. In addition, they have open sourced the entire cloud platform, so if you want to install it without support, you are free to do so.

Summary

When a herd of elephants is stampeding, there is no hope of stopping them in their tracks. The best you can hope for is to point them in the right direction. Similarly, in the cloud, we will never get back the depth of visibility and control that we have with private deployments. What’s important is to learn how to steer the herd so we are prepared for the occasional stampede while still delivering high quality systems.

Underneath Clean Data: Avoiding Rot

When many people talk about clean data, they are referring to data that was collected in a controlled and rigorous process where bad inputs are avoided. Dirty data has samples outside of the intended collection set or values for certain fields that may be mixed up (e.g. consider “First Name: Theo Schlossnagle” and “Last Name: Male” …oops). These problems pose huge challenges for data scientists and statisticians, but it can get a whole lot worse. What if your clean data were rotten?

Rotten data

All (or almost all) of this data is stored on disks today… in files on disks (yes, even if it is in a database)… in files that are part of a filesystem on disks. There is also a saying, “It’s turtles all the way down,” that refers to the poor implementation of foundational technology. Case in point: did you know that you’re likely to have a bit error (i.e. one bit read back opposite of how it was stored) every time you write between 200TB and 2PB of data? This probability of storing bad data is called the Bit Error Rate (BER). Did you know that most filesystems assume a BER of zero, when it never has been and never will be zero? That means that on every filesystem you’ve used (unless you’ve been blessed to run on one of the few filesystems that accounts for this) you’ve had a chance of reading back data that you legitimately never wrote there!
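
To put rough numbers on that, here is the arithmetic for two assumed BERs that bracket the commonly published range for spinning disks; your drive’s datasheet value may differ, so treat these as illustrative orders of magnitude only.

# Expected data written per bit error at two assumed BERs.
for bits_per_error in (1e15, 1e16):
    tb_per_error = bits_per_error / 8 / 1e12   # bits -> bytes -> terabytes
    print(f"1 error per {bits_per_error:.0e} bits -> ~{tb_per_error:,.0f} TB written per expected bit error")
# ~125 TB and ~1,250 TB respectively, i.e. the same order of magnitude as the
# "200TB to 2PB" figure quoted above.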

Now, you may be thinking that one bit error in 2PB is nothing to worry about. This BER is published by drive manufacturers, and while they are not outright lying, they omit a very real truth. You don’t store data on drives without connecting them to a system via cables to a Host Bus Adapter (HBA): two more pieces of hardware that we’ll simply call turtles. Most HBAs use a memory type called Error-Correcting Code (ECC) that is designed to compensate for single bit errors in memory, but cabling is often imperfect and the effective BER of the attached drives is bumped ever so slightly higher. Also take into account that physical media is an imperfect storage medium; it is possible to write something correctly and have it altered over time due to environmental conditions and (to a lesser extent) use; this effect is called bit rot or data rot. All of this illustrates that the BER listed on your hard drive specification is optimistic. Combine all this with the inconvenient truth that writing out 2PB of data is quite common in today’s data systems, and you wind up with even your cleanest data soiled. As an anecdote, at one point we detected more than one bit error per month in a relatively small cluster (< 100TB).

You’ll notice that I said we detected these issues; this is because we use the ZFS filesystem underneath our customers’ data. ZFS checksums all data written so that it can be verified when it is retrieved. The authors of ZFS knew that on large data systems these issues would be real and must be handled, and for that they have my deep gratitude. There is one issue here that escapes most people who have the foresight to run an advanced filesystem, and it is hidden within this very paragraph.

In order for a checksumming filesystem (like ZFS) to detect bit errors, it must read the bad data. On large systems, some data is hot (meaning it is read often), but a significant amount of data is cold: written and then ignored for extended periods of time (months or years). When data engineers design systems, they account for the data access patterns of the applications that run on top of their systems: How many writes and reads? How much hot and cold? Are the reads and writes sequential or random? The answers to these questions help specify the configuration of the underlying storage systems so that they have enough space, enough bandwidth, and low enough latency to satisfy the expected usage. But, if we add into this the chance that our precious data is rotting and that we must detect an error before we can choose to repair it, then we are left with a situation where we must read all our cold data. We must read all our cold data. We must read all our cold data. Said three times, it will induce cold sweats in most storage engineers; it wasn’t part of the originally specified workload, and if you didn’t account for it in your system design, you’re squarely misspecified.

Scrubbing out the rot

In the ZFS world, the action of reading all of your data to verify its integrity and correct for data rot is aptly named “scrubbing.” For the storage engineers out there, I thought this would be an interesting exploration into what scrubbing actually does to your I/O latency. At Circonus we actually care about our customers’ data and scrub it regularly. I’ll show you what this looks like and then very briefly describe what we do to make sure that users aren’t affected.

On our telemetry storage nodes, we measure and record the latency of every disk I/O operation against every physical disk in the server using the io NAD plugin (which leverages DTrace on Illumos and ktap on Linux). All of these measurements are sent up to Circonus as a histogram, and from there we can analyze the distribution of latencies.

Scrubbing Data #1

In this first graph, we’re looking at a time-series histogram focused on the period of time immediately before an apparently radical change in behavior.

Scrubbing Data #2

Moving our mouse one time unit to the right (just before 4am), we can see an entirely different workload present. One might initially think that in the new workload we have much better performance as many samples are now present in the lower latency side of the distribution (the left side of the heads-up graph). However, in the legend you’ll notice that the first graph is focused on approximately 900 thousand samples whereas the second graph is focused on approximately 3.2 million samples. So, while we have more low-latency samples, we also have many more samples as well.

Scrubbing Data #3

Of further interest is that, almost immediately at 4am, the workload changes again and we see a new distribution emerge in the signal. This distribution stays fairly consistent for about 7 hours with a few short interruptions, changes yet again just before Jan 5 at 12pm, and seemingly recovers to the original workload just after 4pm (16:00). This is the havoc a scrub can play, but we’ll see with some cursory analysis that the effects aren’t actually over at 4pm.

Scrubbing Data #4

The next thing we do is add an analytical overlay to our latency graph. This overlay represents an approximation of two times the number of modes in the distribution (the number of humps in the histogram) as time goes on. This measurement is an interesting characteristic of workload and can be used to detect changes in workload. As we can see, we veered radically from our workload just before 4am and returned to our original workload (or at least something with the same modality) just after midnight the following day.

Scrubbing Data #5

Lastly, we can see the effects on the upper end of the latency distribution spectrum by looking at some quantiles. In the above graph we reset the maximum y-value to 1M (the units here are in microseconds, so this is a 1s maximum). The overlays here are q(0.5), q(0.99), q(0.995), and q(0.999). We can see our service times growing into a range that would cause customer dissatisfaction.

While I won’t go into detail about how we solve this issue, the approach is fairly simple. All data in our telemetry store is replicated on multiple nodes. The system understands node latency and can prefer reads from nodes with lower latency.

Understanding how our systems behave while we keep our customers’ data from rotting away allows us to always serve the cleanest data as fast as possible.

Our Monitoring Tools are Lying to Us

I posted the article below to LinkedIn a few weeks back. Since it was relatively popular and relevant to the Circonus community we decided to repost to our Blog. You can find the original here.

I came across this vendor blog post today extolling the virtues of monitoring application performance using Percentiles versus Averages. Hard to believe in late 2012 there was still convincing to be done on this concept.

But in the age of Agile Computing and DevOps at scale, fixed percentiles over arbitrary, pre-determined time windows no longer cut the mustard for measuring application performance. Did they ever? Probably not, but they’re easy to calculate and cheap to store using 20th century “Small Data” technologies.

What if the proper threshold for supporting your service SLA for one KPI is measured at the 85th percentile over 1 min and for another KPI is measured at the 95th percentile over an hour? What if those thresholds change as your business changes and your business is changing rapidly? Are your tools as agile as your business?

What if consistently delighting your customers requires you to monitor a percentile of a particular metric at 1 min, 5 min, 1 hour, and 1 day intervals? Even if your tools imply they can do that, they probably can’t in reality. They weren’t designed to do that.

Let’s say you are monitoring “response time” and that over the course of 1 min you typically have thousands of response time measurements. Existing tools will calculate the chosen percentile of those thousands of measurements and store the result in a database every 5 min. After 60 min they have 12 values, one for each 5 min window. Want to “calculate” the 95th percentile over an hour? More than likely what your tools will actually calculate is the average of those 12 values. But in reality there were 12 x thousands of response times measured over that hour, not 12. What’s the actual 95th percentile? Your tools probably can’t answer that question because they don’t have the data.

If you are like almost all of your IT peers, your monitoring tools begin to summarize performance data before it becomes even an hour old. Automatically summarizing performance data is one of the most “valuable” features of RRDtool, which I would bet is the single most common repository for IT performance data today. The perfect Small Data solution.

The point is, as soon as our tools begin summarizing performance data, we lose the ability to accurately analyze that data. Our tools begin to lie to us.

One-Minute-Resolution Data Storage is Here!

Have you noticed your graphs looking better? Zoom in. Notice more detail? That’s what our switch from 5-minute to 1-minute-resolution data storage has done for our SaaS customers. Couple that with our 7 years of data retention and you’ve got an unrivaled monitoring and analytics tool. Dear customers, your computations are becoming more accurate as we speak!

“Keeping 5 times the data, combined with visualizing data in 1 second resolution in real-time, gives us unprecedented ability to forecast future trends and ask questions like ‘what happened last year during that big event’,” adds Circonus CEO, Theo Schlossnagle.

The switch from 5-minute to 1-minute-resolution data storage means more data points.

[Graphs: One Minute Data Storage, Then and Now]

Companies around the world depend on Circonus to provide unparalleled insight into all aspects of their infrastructure. Their websites typically have large swings in traffic due to a variety of factors, such as product launches, news events, or holiday shopping and other annual events. Our switch to 1-minute-resolution data storage is great news for our SaaS customers.

“Our web infrastructure is essential to our business,” says Kevin Way, Director of Engineering Operations at Monetate. “Circonus gives us the insight we need to provide our customers with the reliability they deserve. The ability to use detailed data from previous events to predict a future event is incredibly valuable! This is a major differentiator for Circonus.”

“At Wanelo we invested heavily into metrics and visibility into our infrastructure, and Circonus is a huge part of our strategy. Being able to visualize our data in 1-minute resolution gives us unprecedented ability to diagnose and remediate issues across all aspects of our infrastructure,” says Konstantin Gredeskoul, CTO at Wanelo.com.

Along with the adoption of DevOps, companies are increasingly dependent on highly dynamic cloud environments, provisioning and deprovisioning infrastructure as needed. This affects decisions made at every level of an organization, from the CEO, to the product team, to IT Operations. A detailed history of usage and traffic patterns – such as from AWS, Azure, Google Cloud, Heroku, Rackspace, or one’s own private cloud infrastructure – gives an organization immeasurable insight into network performance monitoring, as well as the costs associated with providing customers with a world-class experience.