Understanding service latencies via TCP analysis

Latency is the root of all that is evil on the Internet… or so the saying goes. CPUs get faster, storage gets cheaper, IOPS are more plentiful, yet we feeble engineers have done close to nothing on improving the speed of light. While there is certainly a lower-bound to latencies rooted in the physical world, many of the latencies of today’s services (delivered both globally and within the datacenter) have latencies that are affected by things that are under our control. From a low level to a high level latencies suffer from service capacity issues, routing issues, systems and network tuning issues, bad code, bad deployments, human errors and even usage errors. You can’t improve what you can’t measure, so let’s get to measuring some latencies.

Most Internet services are pretty simple; a client calls you up and asks a question and you provide an answer. Due to this incredibly common pattern and the nuances of HTTP, it turns out to be trivial to measure service response times with nothing more than a packet sniffer (e.g. tcpdump).

“What is this nuance of HTTP?” you might ask. It is that HTTP requires the client to speak first. While all of the rest of the article can be used for more complicated services where that is not true, it oft times requires a deeper understanding of the protocol at hand and more adroit tcpdump manipulation.

“Client speaks first” makes it very easy to blindly assess the service latency on the initial HTTP request on a TCP connection. Returning to the phone call metaphor, I can simply watch the phones and measure the time between the first ring and the first word spoken by the receiver; I don’t care much about what happens in between. Granted, there is a wealth of information in the packets themselves: client (IP), agent (User-Agent), endpoint (requested URI), etc. Yet, to measure overall service latency, I simply need the first TCP SYN packet in and the first non-zero payload packet out on a given TCP/IP connection.

What are we looking at?

At this point we should stop and understand a bit better what we are actually measuring. Wouldn’t it be easier to just ask the service we’re running to measure its service times and report them? The answer to this varies from architecture to architecture, but if the answer is “no,” you should have birds and bees conversation with the software engineers or operations folk responsible for running the service. It should be absolutely trivial to track service times within the architecture.

Here, we’re asking something a tad more complex. Our measurements on the network side include an element of latency between the client and the server. What we are trying estimate here is the elapsed time between the client’s attempt to retrieve content the the first byte of that content arriving.

If we measure from SYN receipt to data write, we miss two vital times from the user’s perspective. We didn’t get the SYN until sometime after they sent it, and they won’t see our data until sometime after we’ve sent it. Because each of these is a single packet, we just need to know the time it takes to roundtrip a packet. Where can we get that information?

; ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=249 time=39.022 ms
64 bytes from icmp_seq=1 ttl=249 time=40.013 ms
64 bytes from icmp_seq=2 ttl=249 time=34.300 ms
64 bytes from icmp_seq=3 ttl=249 time=40.373 ms
64 bytes from icmp_seq=4 ttl=249 time=33.129 ms

It isn’t like we can ping the remote machine. Instead we need to calculate this passively. TCP has this really inconvenient 3-way handshake that starts up a session that goes something like:

  1. Client: “I’d like to ask you to tell me a joke.”
  2. Server: “Okay, ask away”
  3. Client: “Okay, I’d like you to tell me a joke about TCP.”
  4. Server: “These two packets walked into a bar…”

From the TCP nitty-gritty, if we measure the time from the first SYN to the subsequent ACK package (before any data has transited), we have a rough estimation of the roundtrip time between the client and server. If we measure between observing (1) and sending (4), the part that we’re missing is the time between (1) being sent by the client and arriving at the server and (4) being sent and arriving at the client. That is approximately one roundtrip.

So, long story summarized. If we take the time between observing 1 and 4 and add the time between observing 2 and 3, we should have a rough approximation of the total time the client witnessed between making the request and receiving the first byte of the response. We’ll take a further short cut and note that the latency between is 1 and 2 is nominal on (at least our) systems, so we only track the timings of (1), (3) and (4) and calculate (4) – (1) + (3) – (1).

Actually doing this.

We’ll use tcpdump on our server to try to find what we’re looking for: the SYN from the client that will start the session, the ACK from the client that completes the three-way TCP handshake (in response to the server’s SYN-ACK) and the first non-zero-length data payload from the server to the client.

For our example, let’s look at web traffic on port 80. First we need to know how much data is in the packet. The IP header has the total packet length (in a 2-byte short starting a the second byte, which in BPF syntax is ‘ip[2:2]’. However, we’re looking for the payload length, so we need to remove from that the length of the IP header: ‘(ip[0]&0xf)*4)’ and the length of the TCP header: ‘((tcp[12]&0xf0)/4). We expect the one of the SYN or ACK bits set in the TCP flag and (because these are from the client) a destination port of 80:

'((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80)'

The data payload from the server to the clients simply non-zero payload length with a source port of 80:

'(((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80)'

To test this, I’ll further restrict this to the host (my laptop) and run tcpdump. I don’t want to resolve IP addresses to names and I would like to keep the output terse with high-precision timestamps: (-n -ttq). Additionally, I need at least 54 bytes of the payload to get the TCP headers (this is left as an exercise for the reader). Putting this all together in a tcpdump command (and limiting the scope to my client at, we can see something like this:

/opt/omni/sbin/tcpdump -n -ttq ' ((((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) == 0) && ((tcp[tcpflags] & (tcp-syn|tcp-ack)) != 0) and dst port 80) or (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0 and src port 80) and host'
listening on us340, link-type EN10MB (Ethernet), capture size 384 bytes
1370816555.510014 IP > tcp 0
1370816555.542407 IP > tcp 0
1370816555.585086 IP > tcp 458
1370816555.617389 IP > tcp 0
1370816555.617398 IP > tcp 0
1370816555.617622 IP > tcp 0

The first three lines are exactly what we’re looking for. The subsequent lines would be (in this case) the shutting down of this TCP connection. These packets are harder to avoid using BPF syntax and we’ll figure out a way to ignore them as needed.

Doing something useful.

So, it is clear that on any high-volume service, these packets are going to be coming in quickly and we’ll need to process these latency calculations programmatically.

This is a bit more complicated, so I whipped together a small node.js program that runs tcpdump, consumes its output, calculated the estimated time-to-first-byte latencies and securely pushes them up to a specific Circonus httptrap check.

The only two arguments you need to supply are the UUID and secret from your httptrap in Circonus. Once running, the script will submit all latencies to one metric named “aggregate`latency” and another named “<hostname>`latency” (where <hostname> is the hostname of the machine on which it is running). This allows you to point a handful of machines at the same httptrap check and collect both their individual latencies and a recalculated aggregate set of latencies.

The Result:

In the above graph, we can see three distinct nodes and the estimated latency (in seconds) on delivering the first byte of data to the client. Every single client TCP connection is accounted for. There is no sampling or missing data, we see a comprehensive and complete picture of our service latencies.

This requires only 4 metrics in Circonus.