Crash-Dump Necromancy

Recently, one of our on-premises customers experienced a system crash on one of their metric-storage nodes. (Circonus offers the option to operate a complete, private copy of our product on a customer’s own premises, whether because of a compliance requirement or for reasons of scale that make the public SaaS version unsuitable.) In this case, the node was temporarily incapacitated and recovered on its own without intervention. This is the story of how we diagnosed why the node panicked, which in turn informed our customer’s decisions about hardware replacement.

Our metrics store runs on OmniOS, a distribution of illumos, itself a fork of the now-defunct OpenSolaris project. The Solaris pedigree is important, because it means we have a robust set of post-mortem debugging capabilities, beginning with a kernel crash dump.

Whenever the OmniOS kernel encounters a fatal condition from which it cannot safely continue, it logs a panic message and dumps the contents of kernel memory to the local dump or swap device. On the next boot, a service called savecore notices that a dump is present and captures that data as a compressed vmdump.X file in /var/crash/<hostname>, where X is just an ordinal number, starting at 0, to distinguish multiple crash dumps. Our customer was able to send that file to us for analysis.
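
Since X simply counts up, picking the most recent dump out of a crash directory is easy to script. The following is a minimal sketch, not something savecore provides: latest_vmdump is a hypothetical helper name, and it assumes only that the directory holds files named vmdump.X as described above.

```shell
# latest_vmdump DIR
# Print the path of the vmdump.X file in DIR with the highest ordinal X.
# Hypothetical helper; prints nothing if DIR holds no matching files.
latest_vmdump() {
    best=""
    best_n=-1
    for f in "$1"/vmdump.*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        n=${f##*.}                       # text after the last dot
        case $n in
            ''|*[!0-9]*) continue ;;     # skip non-numeric suffixes
        esac
        if [ "$n" -gt "$best_n" ]; then
            best=$f
            best_n=$n
        fi
    done
    if [ -n "$best" ]; then
        printf '%s\n' "$best"
    fi
}
```

The result could be handed straight to the decompression step, e.g. savecore -f "$(latest_vmdump /var/crash/<hostname>)" .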

Analysis of such a crash dump is performed with MDB, the Modular Debugger. MDB is something of a Swiss Army knife for debugging on illumos-based systems. It can operate on various types of “targets,” including kernel crash dumps and application core files, as well as a running system. In this exercise, we’ll primarily be using MDB interactively, but it can also function quite handily in a shell pipeline, as we’ll see later.

The customer knew from the panic message that the system had panicked due to the ZFS deadman timer:

panic[cpu0]/thread=ffffff00f40c5c40:
genunix: [ID 918906 kern.notice] I/O to pool 'data' appears to be hung.

ZFS is an advanced storage subsystem that combines the traditional capabilities of a filesystem, volume manager, and RAID controller into a unified storage stack. It features a pooled storage model, provable data integrity with end-to-end checksumming, and dead-simple administration. It’s a cornerstone of how Circonus protects metric data – we utilize node-local data redundancy with a mirroring discipline, as well as writing the same data to multiple nodes in the metric storage cluster. The end-to-end checksumming ensures that when we read back metric data, we can be confident it’s the same data we originally wrote.

The ZFS deadman timer does what it says on the tin – if the timer expires, the system panics and restarts. It is a last resort for when ZFS has sent one or more I/O commands to the underlying storage and has not heard back in the allotted time (this is tunable; the default is 1000 seconds, or just under 17 minutes). Panicking the system at this point provides the administrator with a snapshot of the system state at the moment the deadman timer expired. As we will see, this is a rich source of information that helps us drill down on the culprit(s).

In order to analyze the customer’s crash dump, we must first decompress the crash dump file produced by savecore on the customer’s system:

# savecore -f vmdump.0 .

The uncompressed dump is written to ./vmcore.0, along with a copy of the corresponding kernel binary as ./unix.0. The kernel binary supplies the symbol information that maps addresses in the core image to kernel functions, global variables, etc. To launch an interactive session using this crash dump, we just invoke mdb with the name of the core file.

# mdb vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs mr_sas sd ip hook neti sockfs arp usba lofs idm random ptm ipc nfs cpc ufs logindmux ]
>

We are now at the prompt (>) where we can run MDB dcmds (debugger commands), which are routines that access various properties of the target. If we didn’t have the hint from the kernel log about the reason for the crash, we could see what we’re dealing with by using the ::status dcmd:

> ::status
debugging crash dump vmcore.0 (64-bit) from [hostname redacted]
operating system: 5.11 omnios-d0a4f2e (i86pc)
panic message: I/O to pool 'data' appears to be hung.
dump content: kernel pages only

The deadman timer is tracked in the spa_sync function, so the first order of business is to locate the kernel thread that was in this function. To do that, we use the ::stacks dcmd, which prints kernel thread stacks. This dcmd is provided by the illumos kernel, so it is only available when debugging a kernel crash dump or interacting with a live system in kernel mode. Additionally, we can narrow down the list of stacks to our target function with the ‘-c’ option:

> ::stacks -c spa_sync
THREAD           STATE    SOBJ                COUNT
ffffff00f4ce5c40 SLEEP    CV                      2
                 swtch+0x141
                 cv_wait+0x70
                 zio_wait+0x5b
                 dsl_pool_sync+0xb3
                 spa_sync+0x27e
                 txg_sync_thread+0x227
                 thread_start+8

Interesting – a count of 2. ::stacks summarizes unique kernel thread stacks that share common features, such as thread state, type of synchronization object (condition variable, mutex, etc.), and program counter. To find out more about these stacks, we can use another kernel-provided dcmd, ::findstack. This prints more information about a thread’s stack, including function arguments if the “-v” option is used. We can also take advantage of MDB’s handy pipelining capability, similar to a shell pipeline, in order to avoid a lot of tedious copy/pasting:

> ::stacks -c spa_sync | ::findstack -v
stack pointer for thread ffffff00f4ce5c40: ffffff00f4ce5960
[ ffffff00f4ce5960 _resume_from_idle+0xf4() ]
  ffffff00f4ce5990 swtch+0x141()
  ffffff00f4ce59d0 cv_wait+0x70(ffffff30ef69ff40, ffffff30ef69ff38)
  ffffff00f4ce5a10 zio_wait+0x5b(ffffff30ef69fc18)
  ffffff00f4ce5aa0 dsl_pool_sync+0xb3(ffffff21f098a6c0, 168a1db)
  ffffff00f4ce5b70 spa_sync+0x27e(ffffff21f0998000, 168a1db)
  ffffff00f4ce5c20 txg_sync_thread+0x227(ffffff21f098a6c0)
  ffffff00f4ce5c30 thread_start+8()
stack pointer for thread ffffff00f457dc40: ffffff00f457d960
[ ffffff00f457d960 _resume_from_idle+0xf4() ]
  ffffff00f457d990 swtch+0x141()
  ffffff00f457d9d0 cv_wait+0x70(ffffff30106b6dc8, ffffff30106b6dc0)
  ffffff00f457da10 zio_wait+0x5b(ffffff30106b6aa0)
  ffffff00f457daa0 dsl_pool_sync+0xb3(ffffff21f6494c80, 168a1df)
  ffffff00f457db70 spa_sync+0x27e(ffffff21f0f93000, 168a1df)
  ffffff00f457dc20 txg_sync_thread+0x227(ffffff21f6494c80)
  ffffff00f457dc30 thread_start+8()

Unlike the shell pipeline, the output that gets piped is not the literal text that we see in the terminal, but the value of an expression, which MDB calls “dot”. MDB’s syntax is based on computing the value of something, usually a memory address in the target, with dot representing the current address. Here, ::stacks is setting dot to the address of each matching thread in succession and transmitting those over the pipe, to be used as input by ::findstack. Then ::findstack operates on each one in turn. Without the pipeline, we would need to manually copy and paste each of the thread addresses and prefix it to the next dcmd, e.g. <addr>::findstack -v.

With both stacks shown with function arguments, we can see that the first argument to spa_sync differs between the two threads. From the source, we know that the first argument to spa_sync is a pointer to a spa_t struct, and that has a member, spa_name. This tells us the name of the pool that we’re syncing to. We can get that value by printing at the address of the struct and providing the struct type and member name we’re interested in:

> ffffff21f0998000::print -t spa_t spa_name
char [256] spa_name = [ "rpool" ]

> ffffff21f0f93000::print -t spa_t spa_name
char [256] spa_name = [ "data" ]

Each pool on this system has its own sync thread, which explains the count of 2. We know from the panic message that it was the “data” pool that was hung, so we’ll focus on the stack of the thread whose address ends in “dc40” from here on:

stack pointer for thread ffffff00f457dc40: ffffff00f457d960
[ ffffff00f457d960 _resume_from_idle+0xf4() ]
  ffffff00f457d990 swtch+0x141()
  ffffff00f457d9d0 cv_wait+0x70(ffffff30106b6dc8, ffffff30106b6dc0)
  ffffff00f457da10 zio_wait+0x5b(ffffff30106b6aa0)
  ffffff00f457daa0 dsl_pool_sync+0xb3(ffffff21f6494c80, 168a1df)
  ffffff00f457db70 spa_sync+0x27e(ffffff21f0f93000, 168a1df)
  ffffff00f457dc20 txg_sync_thread+0x227(ffffff21f6494c80)
  ffffff00f457dc30 thread_start+8()
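
As an aside, on a system with more than a couple of pools the spa_t lookups above could be generated rather than typed. The sketch below assumes the ::findstack -v output has been saved to a file (stacks.txt is a made-up name) and that the first argument of each spa_sync frame is the spa_t pointer, as it is in this dump; spa_print_cmds is a hypothetical helper, not an MDB feature.

```shell
# spa_print_cmds [FILE...]
# Read "::findstack -v" output and emit one "::print" dcmd per spa_sync
# frame, using the frame's first argument as the spa_t address.
# Hypothetical helper; reads standard input when no FILE is given.
spa_print_cmds() {
    awk '
        # Frames look like:
        #   ffffff00f4ce5b70 spa_sync+0x27e(ffffff21f0998000, 168a1db)
        match($0, /spa_sync\+0x[0-9a-f]+\([0-9a-f]+/) {
            addr = substr($0, RSTART, RLENGTH)
            sub(/^.*\(/, "", addr)       # keep only the first argument
            printf("%s::print -t spa_t spa_name\n", addr)
        }
    ' "$@"
}
```

The generated commands can then be fed to the debugger in one go, for example spa_print_cmds stacks.txt | mdb vmcore.0, which should print one spa_name line per pool.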

Moving up the stack, dsl_pool_sync issued some I/O, in the form of a zio_t, to the underlying storage. This resulted in calling zio_wait for notification that the I/O operations are complete. Each ZIO can have child ZIOs for related data, and those can have their own children, etc. So rather than looking up a single struct member, we potentially could be recursing down a non-trivial list. It’s therefore a good thing that, in addition to ordinary dcmds, MDB has the notion of a “walker,” which knows the specifics of a particular data structure and can automatically “walk” through it for us. The ZFS module helpfully provides this kind of traversal for ZIOs through a dcmd called (perhaps obviously) ::zio.

> ::help zio

NAME
  zio - zio_t summary

SYNOPSIS
  addr ::zio [-cpr]
        -c display children
        -p display parents
        -r display recursively

Starting from the address of the ZIO that we’re waiting on, we use both the -c option to show child ZIOs and the -r option to recurse down through any descendants. For a busy pool, this can produce a lot of output. We can use another MDB feature, calling out to the shell with “!”, to send the output to a file that we can work with outside of MDB.

> ffffff30106b6aa0::zio -c -r !cat > zio_data
> ::quit

$ wc -l zio_data 
4798 zio_data

Now we can look at this file in a text editor in order to more easily explore the output.

I/O in ZFS flows through the ZIO pipeline, where various stages can not only perform the actual I/O, but also transform the data in different ways, such as compression or deduplication. Since we’re dealing with a hung pool, we’re interested in actual I/O. Therefore we want to look for ZIOs that are in the VDEV_IO_* stages.

ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
ffffff30106b6aa0        NULL  CHECKSUM_VERIFY  ffffff00f457dc40 -
 ffffff30106afc58       WRITE ISSUE_ASYNC      -                -
  ffffff2e4e9248e0      WRITE ISSUE_ASYNC      -                -
   ffffff309de9ac18     WRITE ISSUE_ASYNC      -                -
    ffffff22749a8578    WRITE ISSUE_ASYNC      -                -
     ffffff318ec07750   WRITE ISSUE_ASYNC      -                -
      ffffff23025a5e58  WRITE ISSUE_ASYNC      -                -
...
       ffffff31ba43ee50 WRITE VDEV_IO_START    -                1007685ms

Starting at the top of our output file, we see the ZIO address that we were waiting on, followed by its descendants. This pool was primarily doing writes, as expected, given that it’s running the Circonus metrics store. If we proceed down the list, we eventually come to some actual VDEV_IO operations. In ZFS, a “vdev” (virtual device) is the building block for a pool. There are many types of vdevs (some representing higher-level objects such as mirror or raidz sets), but for our purposes, we’re talking about “leaf vdevs”. These are the actual devices that store data, i.e., disks. The last column, time elapsed, is interesting too. It shows that this ZIO has been in the VDEV_IO_START stage for a whopping 1007 seconds. Something’s definitely wrong here, and that lines up pretty well with the 1000-second ZFS deadman timer.
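
The comparison with the deadman timer can be made mechanically as well. This is a rough sketch that assumes the five-column layout shown above, that elapsed times always carry an “ms” suffix (true of this sample, unverified in general), and the default 1000-second timeout; overdue_zios and DEADMAN_MS are names invented for the example.

```shell
# Default deadman timeout from the text: 1000 seconds, in milliseconds.
# Adjust this if the tunable was changed on the system being examined.
DEADMAN_MS=1000000

# overdue_zios [FILE...]
# List ZIOs in a VDEV_IO_* stage whose elapsed time meets or exceeds
# the deadman timeout. Hypothetical helper; reads saved ::zio output.
overdue_zios() {
    awk -v limit="$DEADMAN_MS" '
        $1 ~ /^ff/ && $3 ~ /^VDEV_IO_/ && $5 ~ /^[0-9]+ms$/ {
            ms = $5 + 0                  # numeric prefix; "ms" is dropped
            if (ms >= limit)
                printf("%s %s %.1fs\n", $1, $3, ms / 1000)
        }
    ' "$@"
}
```

Running overdue_zios zio_data against the file saved earlier would list each stalled leaf I/O along with its elapsed time in seconds.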

We would expect that the ZIOs doing vdev I/O would have some information about which vdev they are operating on. Recalling the zio_t structure, does it mention vdevs?

> ffffff31ba43ee50::print -t zio_t !grep vdev
    zio_done_func_t *io_done = vdev_mirror_child_done
    vdev_t *io_vd = 0xffffff21f4db0d00

Yes it does, and there’s a pointer, *io_vd, to a vdev_t. If we print that with MDB, we find that it has a wealth of information about the vdev in question. In particular, there is a device path:

> ffffff21f4db0d00::print -t vdev_t
vdev_t {
...
    char *vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"

Now we know that our waited-upon ZIO was at least writing to /dev/dsk/c1t7d0s0 and maybe to some others as well (remember, there were nearly 4800 lines of output for all the active ZIOs for the “data” zpool).

As an aside, we can actually collapse the above two invocations into one by using the structure pointer operator:

> ffffff31ba43ee50::print -t zio_t io_vd->vdev_path
char *io_vd->vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"

It’s time to break out our text-processing tools and crunch this data. We need to get vdev_t pointers out of all the ZIOs that are “interesting,” and print the vdev_path from each one of those to show what device the ZIO was issued to. I like to use awk for this:

$ awk '$1 ~ /^ff/ && $5 != "-"' zio_data | head -5
       ffffff31ba43ee50 WRITE VDEV_IO_START    -                1007685ms
       ffffff31855c0c70 WRITE VDEV_IO_START    -                1007685ms
       ffffff2d9fe99af0 WRITE VDEV_IO_START    -                1007685ms
       ffffff22746e7060 WRITE VDEV_IO_START    -                1007686ms
       ffffff2d9a82c1b0 WRITE VDEV_IO_START    -                1007685ms

“Interesting” ZIOs in this case would be ones where the elapsed time has a value, i.e., the last column is not “-”. But this isn’t helpful on its own – we need to drill down into each zio_t, find the vdev_t, and extract the vdev_path from it. To do this, we build up a list of MDB dcmds and use the shell to invoke mdb with this list.

$ awk '$1 ~ /^ff/ && $5 != "-" \
  {
    printf("%s::print -t zio_t io_vd->vdev_path\n", $1)
  }' zio_data | \
  mdb vmcore.0 | \
  sort | uniq
char *io_vd->vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"
char *io_vd->vdev_path = 0xffffff21f5b196f8 "/dev/dsk/c1t6d0s0"
char *io_vd->vdev_path = 0xffffff21f5b19710 "/dev/dsk/c1t5d0s0"
char *io_vd->vdev_path = 0xffffff21f5b19728 "/dev/dsk/c1t4d0s0"

This takes the ZIO address (column 1) and prints the *io_vd->vdev_path value. Awk generates a command line for every ZIO address, which we shell-pipe into MDB. Then we sort and uniq the output to produce a concise list of the disks with outstanding I/O at the time of the panic. Also, in case you might be wondering, the vmcore.0 image is ~25G in size and this completes in under 0.5 seconds on a modest machine.
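
A small variation keeps track of how many stalled ZIOs were aimed at each device, which plain sort | uniq discards. Using uniq -c would give the counts too; this sketch additionally strips each line down to the path. It assumes every line of mdb output carries the device path in double quotes, as above, and vdev_path_counts is a hypothetical helper name.

```shell
# vdev_path_counts [FILE...]
# Read "::print ... io_vd->vdev_path" output and print "<count> <path>"
# for each distinct device path, ordered by path.
# Hypothetical helper; reads standard input when no FILE is given.
vdev_path_counts() {
    awk -F'"' '
        NF >= 3 { n[$2]++ }              # $2 is the text between the quotes
        END { for (p in n) printf("%d %s\n", n[p], p) }
    ' "$@" | sort -k2
}
```

It would take the place of sort | uniq at the end of the pipeline, i.e. ... | mdb vmcore.0 | vdev_path_counts.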

If a single disk were implicated, we might suspect that disk of being the culprit. However, since multiple disks are shown, the culprit is more likely either the disk controller or possibly some other core system problem, such as a PCIe bus error, a bad backplane port, or a cabling issue.
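
That reasoning can be roughed out in a few lines. Device names of the form cXtYdZsN encode the controller, target, disk, and slice numbers, so grouping the paths by their cX prefix shows how many affected disks share a controller. This is only a sketch: disks_per_controller is a hypothetical helper, and because the controller number is a logical instance assigned by the OS, a shared prefix is a hint about where to look rather than proof.

```shell
# disks_per_controller [FILE...]
# Read "::print ... io_vd->vdev_path" output and print
# "<controller> <number of distinct disks>" for each cX prefix.
# Hypothetical helper; reads standard input when no FILE is given.
disks_per_controller() {
    awk -F'"' '
        NF >= 3 {
            dev = $2
            sub(/^.*\//, "", dev)        # /dev/dsk/c1t7d0s0 -> c1t7d0s0
            if (match(dev, /^c[0-9]+/) && !(dev in seen)) {
                seen[dev] = 1            # count each disk only once
                n[substr(dev, RSTART, RLENGTH)]++
            }
        }
        END { for (c in n) printf("%s %d\n", c, n[c]) }
    ' "$@" | sort
}
```

For the four paths above this reports a single controller, c1, with four disks behind it, which is consistent with suspecting something the disks have in common rather than any one of them.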

Even though this problem wasn’t caused by our software, the fact that OmniOS has these debugging facilities available enabled us to help the customer diagnose the issue with their system, leading to a replacement of the disk controller.