Monitoring Queue Sizes

At Circonus we are collecting a number of metric about block devices. An interesting one is the average length of the request queue, which is exposed by iostat[1]:

aqu-sz
  The average queue length of the requests that were issued to the device.
  Note: In previous versions, this field was known as avgqu-sz.

To collect “aqu-sz” as a metric we don’t want to fire up iostat on each collection interval. So we needed to understand how this value is derived from statistics exposed by the operating system. It turns out that there is an interesting bit of mathematics involved in deriving this value. Here is how it works.

It was already noted by Baron Schwarz in 2010[2], that the calculation behind iostat are interesting. This post can be seen as a formalized elaboration.

Step 1: Checking the Implementation

The first step in understanding the iostat calculation of aqu-sz is to check the implementation (source):

/* aqu-sz */
    cprintf_f(NO_UNIT, 1, 6, 2, S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0);

Here “S_VALUE” is defined (source) as:

#define S_VALUE(m,n,p)		(((double) ((n) - (m))) / (p) * HZ)

and “rq_ticks” is Field 11 from /proc/diskstats (source).

Digging around in the source code a little more, you will be able to verify for yourself that the following
formula is used to calculate aqu-sz:

    \[aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0}\]

where t_0, t_1 are the measurement timestamps in ms, and F_{11}(t) is the value of Field 11 of at time t.

So, the average queue size can be calculated as a slope or discrete derivative of a mystical field from /proc/diskstats. Interesting!

Step 2: Checking the kernel documentation

Let’s see how that works. The kernel documentation[3] says the following:

Field 11 -- weighted # of milliseconds spent doing I/Os
  This field is incremented at each I/O start, I/O completion, I/O
  merge, or read of these stats by the number of I/Os in progress
  (field 9) times the number of milliseconds spent doing I/O since the
  last update of this field.  This can provide an easy measure of both
  I/O completion time and the backlog that may be accumulating.

So:

    \[F_{11}(t_1) = F_{11}(t_0) + (\text{I/Os in progress at time $t_1$}) \times (\text{ms spent doing I/O in $t_0,t_1$})\]

Where t_0 is the time of the last update, and t_1 is the current time.

So if we want to calculate the value of F_{11} at a time t, we have to recursively apply this rule until the beginning of time (boot):

    \[F_{11}(t) = \sum_{t_i} (\text{I/Os in progress at time t'}) \times (\text{ms spent doing I/O in $t_{i-1},t_{i-1}$})\]

where t_i runs through the times of every I/O event in the system. For the mathematical reader, this sum has a very familiar form. It’s summing a function values times a little time interval. This is precisely how integration over time works. So we are inclined to compare it with:

    \[F_{11}(T) \overset{?}{\leftrightarrow} \int_0^T (\text{number of I/Os in progress at time t}) \text{dt}\]

This is not be quite right, since we should only sum over “I/O” time, and it’s not clear that the sum can be replaced by an integral, but it points us at the right direction. Assuming this were the case, we would get the following expression for `avgqz`:

    \[avgqz \overset{?}{\leftrightarrow}  \frac{\int_{t_0}^{t_1} (\text{I/Os in progress at time t}) \text{dt}}{t_1 - t_0}.\]

This is a very common way to express a mean value of a function. In this case, we average the number of I/Os in progress, and not the queue length, but that kind of imprecise wording is commonly found in man pages.

So all this would make sense, if only Field 11 accounted the total time and not the time “spent doing I/O” and we could replace the sum by an integral!

Step 3: A mathematical model

In order to make the above idea precise, let’s introduce some notation:

    \begin{align*} ios(t) &= \text{(number of I/O operations in progress at time t)} \\ busy(t) &= \text{(one if the device is processing IO at time t, zero otherwise)} \end{align*}

Here, again t denotes system uptime in ms. To warm up, let’s first look at the integral:

    \[Busy(T) = \int_0^T busy(t) \text{dt} = \int_{t : busy(t) = 1} 1 \text{dt}\]

This integral measures the total time the system was busy processing IO since boot. We can get the utilization of the system during a time interval t_0 < t_1 as:

    \[\frac{Busy(t_1) - Busy(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} busy(t) \text{dt}}{t_1-t_0} = \frac{\text{time spent doing IO}}{\text{total time}} = util(t_0, t_1)\]

This is another very important statistics that is reported by `iostat -x` under the “%util” column.

To get a mathematical description of Field 11, let’s try the following:

    \[F(T) := \int_0^T ios(t) busy(t) \text{dt}\]

In this way we have:

    \[F(t_1) - F(t_0) = \int_{t_0}^{t_1} ios(t) busy(t) \text{dt}\]

or, equivalently:

    \[F(t_1) = F(t_0) + \int_{t_0}^{t_1} ios(t) busy(t) \text{dt}\]

Now, if ios(t) is constant during the time interval t_1,t_0, then:

    \[F(t_1) = F(t_0) + ios(t_1) \times \int_{t_0}^{t_1} busy(t) \text{dt}\]

Looking closely at this, we see that this is precisely the recursion formula for Field 11 from /proc/diskstat

    \[F(t_1) = F(t_0) + (\text{IO ops in progress at time $t_1$}) \times (\text{numer of ms spent doing I/O in $t_0,t_1$})\]

And if t_1,t_0 are adjacent I/O events (start, complete, merge) the assumption that ios(t) is constant in between is justified. Hence we see, that F_{11} = F.

Step 4: This can’t be true!

Now that we have a firm grip of F_{11}, we can start examining the claimed formula for the average queue size

    \[aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} ios(t) busy(t) \text{dt}}{t_1 - t_0}\]

Is this indeed a sensible measure for the average queue size?

It does not seem to be the case. Take for example a time interval, where there are ios(t) = 10, but busy(t) = 0, then: The average queue size should be 10, but the integral evaluates to 0. Hence the above expression is zero, which is clearly not sensible.

Step 5: But is it really?

But is this really a valid example? If the two functions busy() and ios() were really independent, then this condition could certainly occur. In the analogue case of CPU scheduling, cases like these, where there are runnable threads but the CPU is busy doing something else (interrupt handlers, hypervisor) can indeed happen.

But is this really the case for block IO scheduling? Another look at the documentation[1] reveals the following:

    Field  9 -- # of I/Os currently in progress
        The only field that should go to zero. Incremented as requests are
        given to appropriate struct request_queue and decremented as they finish.
    Field 10 -- # of milliseconds spent doing I/Os
        This field increases so long as field 9 is nonzero.

So, Field 9 is our function ios(t) and Field 10 is actually our function Busy(t)! And they indeed have a relation:

> "[Busy(t)] increases as long as [ios(t)] is nonzero"

In other words busy(t) = 1 if ios(t) > 0 and 0 otherwise!

Revisiting our definition of F, we find the following (explanations follow)

    \begin{align*} F(T) &:= \int_{t \in [0,T]} ios(t) busy(t) \text{dt} \\ &= \int_{t \in [0,T], busy(t) = 1} ios(t) busy(t) \text{dt} + \int_{t \in [0,T], busy(t) = 0} ios(t) busy(t) \text{dt} \\ &= \int_{t \in [0,T], busy(t) = 1} ios(t) \text{dt} + 0 \\ &= \int_{t \in [0,T], ios(t) > 0} ios(t) \text{dt} + \int_{t \in [0,T], ios(t) = 0} ios(t) \text{dt} \\ &= \int_{t \in [0,T]} ios(t) \text{dt} \end{align*}

Let’s break that down. In the first step, we can divide the integral into two parts, where busy = 0/1, the part where busy = 0 evaluates to 0. In the next step we note that busy = 0 is equivalent to ios = 0, so we can replace the condition. Next we complete the integral by introducing another 0-summand: Integrating over ios(t) = 0 which evaluates to zero as well. Finally we put things back together.

We see that for the calculation of Field 11, it does not matter if we include busy(t) in the integrand. We end up with the same number whether we sum over “the number of milliseconds *spent doing I/O* since the last update of this field” or just “the number of milliseconds since the last update of this field”:

    \begin{align*} avgqz(t_0, t_1) &= \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} \\ & \overset{!}{=} \frac{\int_{t_0}^{t_1} ios(t) \text{dt}}{t_1 - t_0} \\ &= \text{Integral-average over the number of IOs in progess within $t_0,t_1$.} \end{align*}

Conclusion

We have seen that the avg-qz reported by iostat has indeed the interpretation of an integral average over the number of IOs in progress. We find it interesting to see calculus can be applied various state accounting metrics and clarify their relationships. Moreover, the ability to express integral averages as discrete derivatives of a counter (F_{11}) is another remarkable takeaway.

References

  1. man iostat
  2. Schwarz – How iostat computes metrics (2010)
  3. I/O statistics fields

 

Heinrich Hartmann is the Chief Data Scientist at Circonus. To learn more about data science for effective operations, sign up for in-person training with Heinrich at Velocity, Tuesday, 17 October & Wednesday, 18 October, 2017

Learn More