Here at Circonus, our infrastructure is highly distributed. Many of Circonus's functions run as distinct systems that communicate with each other, and we use a distributed data store to store and retrieve the data behind our graphs. Testing is necessary to keep everything working smoothly, but testing distributed systems like ours can be extraordinarily difficult.
Several issues come up when testing distributed systems: accounting for asynchronous delivery among nodes in a distributed data store, ensuring that all data has been replicated and stored properly across the data cluster, making sure that the entire system works end-to-end, handling and recovering from the failure of an individual component or data storage node, and more. As anyone who has worked on a distributed system can attest, these scenarios are extremely difficult to test. I have been working on distributed systems for years and have developed a few strategies for dealing with them.
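As a rough illustration of the asynchronous-delivery problem, here is a minimal sketch of how a test might wait for a write to reach every node instead of asserting immediately. It is not taken from the article or from our codebase: `FakeNode` and `wait_until_replicated` are hypothetical names, and the in-memory node stands in for whatever client a real data store would provide.

```python
import time


class FakeNode:
    """Hypothetical in-memory stand-in for one storage node (an assumption)."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        # Returns None when the key has not arrived on this node yet.
        return self.data.get(key)


def wait_until_replicated(nodes, key, expected, timeout=5.0, interval=0.05):
    """Poll all nodes until each returns `expected` for `key`.

    Returns True once every node agrees, or False if `timeout` seconds
    pass while at least one node is still lagging.
    """
    deadline = time.monotonic() + timeout
    while True:
        lagging = [n for n in nodes if n.read(key) != expected]
        if not lagging:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would write through one node, call `wait_until_replicated(nodes, key, value)`, and fail only if it returns False. Polling against a deadline keeps the test from being flaky when replication is merely slow, while still catching data that never arrives.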
In the July issue of ACM Queue, I describe how we handle these problems when testing our own systems. For the full details of our approach, check out Testing A Distributed System.
Also, if tackling complex infrastructures is your dream job, come work with us!