Distributed Systems Testing, or Why My Hair Is Falling Out

Here at Circonus, our infrastructure is highly distributed. Many of the functions of Circonus are distinct systems that communicate with each other. In addition, we use a distributed data store to store and retrieve data for graphs. Testing is, of course, necessary to ensure that everything keeps working smoothly, but testing distributed systems like ours can be extraordinarily difficult.

A number of issues come up when testing distributed systems: accounting for asynchronous delivery among nodes in a distributed data store, ensuring that all data has been replicated and stored properly across the data cluster, making sure that the entire system works end-to-end, dealing with and recovering from the failure of an individual component or data storage node, and more. These scenarios can be extremely difficult to test, as anyone who has worked on a distributed system can attest. I have been working on distributed systems for years and have developed a few strategies for dealing with them.

In the July issue of ACM Queue, I discuss how we deal with these challenges when testing our systems. To learn about our approach in more detail, check out "Testing A Distributed System."

Also, if tackling complex infrastructures is your dream job, come work with us!