 # Monitoring Queue Sizes

At Circonus we are collecting a number of metric about block devices. An interesting one is the average length of the request queue, which is exposed by iostat:

aqu-sz
The average queue length of the requests that were issued to the device.
Note: In previous versions, this field was known as avgqu-sz.$ To collect aqu-sz as a metric we don’t want to fire up iostat on each collection interval. So we needed to understand how this value is derived from statistics exposed by the operating system. It turns out that there is an interesting bit of mathematics involved in deriving this value. Here is how it works. It was already noted by Baron Schwarz in 2010, that the calculation behind iostat are interesting. This post can be seen as a formalized elaboration. ## Step 1: Checking the Implementation The first step in understanding the iostat calculation of aqu-sz is to check the implementation (source): /* aqu-sz */ cprintf_f(NO_UNIT, 1, 6, 2, S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0);  Here S_VALUEis defined (source) as:  #define S_VALUE(m,n,p) (((double) ((n) - (m))) / (p) * HZ);  and rq_ticks is Field 11 from /proc/diskstats (source). Digging around in the source code a little more, you will be able to verify for yourself that the following formula is used to calculate aqu-sz: $$aqusz = \frac{F_{11}(t_1) – F_{11}(t_0)}{t_1 – t_0}$$ where $t_0, t_1$ are the measurement timestamps in ms, and F_{11}(t) is the value of Field 11 of at time t. So, the average queue size can be calculated as a slope or discrete derivative of a mystical field from /proc/diskstats. Interesting! ## Step 2: Checking the kernel documentation Let’s see how that works. The kernel documentation says the following: Field 11 -- weighted # of milliseconds spent doing I/Os This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field. This can provide an easy measure of both I/O completion time and the backlog that may be accumulating.$


So:

$$F_{11}(t_1) = F_{11}(t_0) + (\text{I/Os in progress at time t_1}) \times (\text{ms spent doing I/O in t_0,t_1})$$

Where $t_0$ is the time of the last update, and $t_1$ is the current time.

So if we want to calculate the value of $F_{11}$ at a time $t$, we have to recursively apply this rule until the beginning of time (boot):

$$F_{11}(t) = \sum_{t_i} (\text{I/Os in progress at time t’}) \times (\text{ms spent doing I/O in t_{i-1},t_{i-1}})$$

where $t_i$ runs through the times of every I/O event in the system. For the mathematical reader, this sum has a very familiar form. It’s summing a function values times a little time interval. This is precisely how integration over time works. So we are inclined to compare it with:

$$F_{11}(T) \overset{?}{\leftrightarrow} \int_0^T (\text{number of I/Os in progress at time t}) \text{dt}$$

This is not be quite right, since we should only sum over I/O time, and it’s not clear that the sum can be replaced by an integral, but it points us at the right direction. Assuming this were the case, we would get the following expression for avgqz:

$$avgqz \overset{?}{\leftrightarrow} \frac{\int_{t_0}^{t_1} (\text{I/Os in progress at time t}) \text{dt}}{t_1 – t_0}.$$

This is a very common way to express a mean value of a function. In this case, we average the number of I/Os in progress, and not the queue length, but that kind of imprecise wording is commonly found in man pages.

So all this would make sense, if only Field 11 accounted the total time and not the time “spent doing I/O” and we could replace the sum by an integral!

## Step 3: A mathematical model

In order to make the above idea precise, let’s introduce some notation:

\begin{align*} ios(t) &= \text{(number of I/O operations in progress at time t)} \\busy(t) &= \text{(one if the device is processing IO at time t, zero otherwise)} \end{align*}

Here, again $t$ denotes system uptime in ms. To warm up, let’s first look at the integral:

$$Busy(T) = \int_0^T busy(t) \text{dt} = \int_{t : busy(t) = 1} 1 \text{dt}$$

This integral measures the total time the system was busy processing IO since boot. We can get the utilization of the system during a time interval $t_0 and t_1$ as:

$$\frac{Busy(t_1) – Busy(t_0)}{t_1 – t_0} = \frac{\int_{t_0}^{t_1} busy(t) \text{dt}}{t_1-t_0} = \frac{\text{time spent doing IO}}{\text{total time}} = util(t_0, t_1)$$

This is another very important statistics that is reported by iostat -x under the %util column.

To get a mathematical description of Field 11, let’s try the following:

$$F(T) := \int_0^T ios(t) busy(t) \text{dt}$$

In this way we have:

$$F(t_1) – F(t_0) = \int_{t_0}^{t_1} ios(t) busy(t) \text{dt}$$

or, equivalently:

$$F(t_1) = F(t_0) + \int_{t_0}^{t_1} ios(t) busy(t) \text{dt}$$

Now, if $ios(t)$ is constant during the time interval $t_1,t_0$, then:

$$F(t_1) = F(t_0) + ios(t_1) \times \int_{t_0}^{t_1} busy(t) \text{dt}$$

Looking closely at this, we see that this is precisely the recursion formula for Field 11 from /proc/diskstat

$$F(t_1) = F(t_0) + (\text{IO ops in progress at time t_1}) \times (\text{numer of ms spent doing I/O in t_0,t_1})$$

And if $t_1,t_0$ are adjacent I/O events (start, complete, merge) the assumption that ios(t) is constant in between is justified. Hence we see, that $F_{11} = F$.

## Step 4: This can’t be true!

Now that we have a firm grip of $F_{11}$, we can start examining the claimed formula for the average queue size

$$aqusz = \frac{F_{11}(t_1) – F_{11}(t_0)}{t_1 – t_0} = \frac{\int_{t_0}^{t_1} ios(t) busy(t) \text{dt}}{t_1 – t_0}$$

Is this indeed a sensible measure for the average queue size?

It does not seem to be the case. Take for example a time interval, where there are $ios(t) = 10$, but $busy(t) = 0$, then: The average queue size should be 10, but the integral evaluates to 0. Hence the above expression is zero, which is clearly not sensible.

## Step 5: But is it really?

But is this really a valid example? If the two functions busy() and ios() were really independent, then this condition could certainly occur. In the analogue case of CPU scheduling, cases like these, where there are runnable threads but the CPU is busy doing something else (interrupt handlers, hypervisor) can indeed happen.

But is this really the case for block IO scheduling? Another look at the documentation reveals the following:

    Field  9 -- # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate struct request_queue and decremented as they finish.
Field 10 -- # of milliseconds spent doing I/Os
This field increases so long as field 9 is nonzero.$ So, Field 9 is our function $ios(t)$ and Field 10 is actually our function $Busy(t)$! And they indeed have a relation:  > "[Busy(t)] increases as long as [ios(t)] is nonzero"$


In other words $busy(t) = 1$ if $ios(t) > 0$ and $0$ otherwise!

Revisiting our definition of $F$, we find the following (explanations follow)

\begin{align*} F(T) &:= \int_{t \in [0,T]} ios(t) busy(t) \text{dt} \\ &= \int_{t \in [0,T], busy(t) = 1} ios(t) busy(t) \text{dt} + \int_{t \in [0,T], busy(t) = 0} ios(t) busy(t) \text{dt} \\ &= \int_{t \in [0,T], busy(t) = 1} ios(t) \text{dt} + 0 \\ &= \int_{t \in [0,T], ios(t) > 0} ios(t) \text{dt} + \int_{t \in [0,T], ios(t) = 0} ios(t) \text{dt} \\ &= \int_{t \in [0,T]} ios(t) \text{dt} \end{align*}

Let’s break that down. In the first step, we can divide the integral into two parts, where $busy = 0/1$, the part where $busy = 0$ evaluates to 0. In the next step we note that $busy = 0$ is equivalent to $ios = 0$, so we can replace the condition. Next we complete the integral by introducing another 0-summand: Integrating over $ios(t) = 0$ which evaluates to zero as well. Finally we put things back together.

We see that for the calculation of Field 11, it does not matter if we include $busy(t)$ in the integrand. We end up with the same number whether we sum over “the number of milliseconds *spent doing I/O* since the last update of this field” or just “the number of milliseconds since the last update of this field”:

\begin{align*} avgqz(t_0, t_1) &= \frac{F_{11}(t_1) – F_{11}(t_0)}{t_1 – t_0} \\ & \overset{!}{=} \frac{\int_{t_0}^{t_1} ios(t) \text{dt}}{t_1 – t_0} \\ &= \text{Integral-average over the number of IOs in progess within t_0,t_1.} \end{align*}

## Conclusion

We have seen that the avg-qz reported by iostat has indeed the interpretation of an integral average over the number of IOs in progress. We find it interesting to see calculus can be applied various state accounting metrics and clarify their relationships. Moreover, the ability to express integral averages as discrete derivatives of a counter ($F_{11}$) is another remarkable takeaway.