Some Like It Flat

JSON rules the world, much to our collective chagrin. I’ve mentioned before the atrocious shortcomings of JSON as a format and I feel deeply saddened that the format has taken the world by storm. However, it is here and we must cope… but not always.

Like every other API-centric platform in the world, we support data in and out in the ubiquitous JavaScript Object Notation format. I’ll admit, for a (more or less) machine parseable notation, it is remarkably comprehensible to humans (this is, of course, one of the leading drivers for its ubiquity). While I won’t dive into the deficits of JSON on the data-type interpretation side (which is insidious for systems that mainly communicate numerical data), I will talk about speed.

JSON is slow. JSONB was created because JSON was slow. JSONB is still slow. Object (data) serialization has always been of interest to protocols. When one system must communicate data to another, both systems must agree on a format for transmission. While JSON is naturally debuggable, it does not foster agreement (specifically on numeric values) and it is truly abysmal on the performance side. This is why so many protocol serializations exist today.

From the Java world comes Thrift, its successor Avro, and MessagePack. From Python we have pickle, which somehow has escaped the Python world to inflict harm upon others. From the C and C++ world we have Cap’n Proto, Flatbuffers, and perhaps the most popular, Google Protobuf (the heart of the widely adopted gRPC protocol). Now, these serialization libraries might have come from one language world, but they’d be useless without bindings in basically every other language… which they do generally boast, with the exception of pickle.

It should be noted that not all of these serialization libraries stop at serialization. Some bring protocol specification (RPC definition and calling convention) into scope. Notwithstanding that this can be useful, the fact that they are conflated within a single implementation is a tragedy.

For IRONdb, we needed a faster alternative to JSON because we were sacrificing an absurd amount of performance to the JSON god. This is a common design in databases; either a binary protocol is used from the beginning or a fast binary protocol is added for performance reasons. I will say that starting with JSON and later adopting a binary encoding and protocol has some rather obvious and profound advantages.

By delaying the adoption of a binary protocol, we had large system deployments with petabytes of data in them under demanding, real-world use cases. This made it a breeze to evaluate both the suitability and performance gains. Additionally, we were able to understand how much performance we squandered on the protocol side vs. the encoding side. Our protocol has been HTTP and our payload encoding has been JSON. It turns out that in our use cases, the overhead for dealing with HTTP was nominal in our workloads, but the overhead for serializing and deserializing JSON was comically absurd (and yes, this is even with non-allocating, SAX-style JSON parsers).

So, Google Protobuf right? Google can’t get it wrong. It’s popular; there are a great many tools around it. Google builds a lot of good stuff, but there were three undesirable issues with Google Protobuf: (1) IRONdb is in C and while the C++ support is good, the C support is atrocious, (2) it conflates protocol with encoding so it becomes burdensome to not adopt gRPC, and (3) it’s actually pretty slow.

So, if not Google’s tech, then whose? Well, Google’s of course. It is little known that Flatbuffers are actually Google Flatbuffers designed specifically as an encoding and decoding format for gaming and other high performance applications. To understand why it is so fast, consider this: you don’t have to parse anything to access data elements of encoded objects. Additionally, it boasts the strong forward and backward compatibility guarantees that you get with other complex serialization systems like Protobuf.

The solution for us:

Same REST calls, same endpoints, same data, different encoding. Simply specify the data encoding you are sending with a ‘Content-Type: x-circonus-<datatype>-flatbuffer’ header and the remote end can even avoid copying or even parsing any memory; it just accesses it. The integration into our C code is very macro-oriented and simple to understand. Along with roughly 1000x speedup in encoding and decoding data, we save about 75% volumetrically “over the wire.”