LuaJIT and the Illumos VM

I’m about to dive into some esoteric stuff, so be warned. We use LuaJIT extensively throughout most of our product suite here at Circonus. It is a core supported technology in the libmtev platform on which we build most of our high-performance server apps. Lua (and LuaJIT) integrates fantastically with C code, and it makes many complex tasks seem simple if you can get the abstraction right between procedural Lua code and a non-blocking event framework like the one within libmtev. Lua is all fine and well, but for performance reasons we use LuaJIT, and therein the bodies are buried.

What’s so special about LuaJIT? An object in LuaJIT is always 64 bits. It can represent a double, or if the leading 13 bits are 1s (a pattern no ordinary 64-bit double uses, since it can only decode as a NaN) then it can represent other things:

**                  ---MSW---.---LSW---
** primitive types |  itype  |         |
** lightuserdata   |  itype  |  void * |  (32 bit platforms)
** lightuserdata   |ffff|    void *    |  (64 bit platforms, 47 bit pointers)
** GC objects      |  itype  |  GCRef  |
** int (LJ_DUALNUM)|  itype  |   int   |
** number           -------double------
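To make the 13-bit claim concrete, here is a minimal sketch (not LuaJIT's actual code; the names `TOP13_MASK`, `top13_set`, `bits_to_double` and `double_to_bits` are made up for illustration). It shows that any pattern with the top 13 bits set decodes as a NaN, while ordinary doubles, and even negative infinity, never have all 13 set.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: mask covering the 13 most significant bits
 * (sign bit, 11 exponent bits, top mantissa bit). */
#define TOP13_MASK 0xFFF8000000000000ULL

/* Returns 1 if the 13 leading bits of the 64-bit slot are all 1s. */
int top13_set(uint64_t bits) {
    return (bits & TOP13_MASK) == TOP13_MASK;
}

/* Reinterpret a 64-bit pattern as a double (memcpy avoids aliasing issues). */
double bits_to_double(uint64_t bits) {
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Reinterpret a double as its raw 64-bit pattern. */
uint64_t double_to_bits(double d) {
    uint64_t bits;
    assert(sizeof d == sizeof bits);
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Feeding `bits_to_double(0xFFFF000000001234ULL)` to `isnan()` reports a NaN, while `top13_set(double_to_bits(3.14))` is false. The low bits of such a NaN are free to carry a type tag and a payload, which is the space the table above carves up.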

Ouch, that’s right. We’ve got pointer compaction going on. On 64 bit systems, we can only represent 47 bits of our pointer. That is 128TB of addressable space, but we’ll explore the problems it causes.
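The arithmetic is easy to check. A rough sketch, with hypothetical helper names (`fits_in_47_bits`, `addressable_tib`) that are not part of LuaJIT:

```c
#include <stdint.h>

/* Returns 1 if the pointer value can be represented in 47 bits,
 * i.e. none of the upper 17 bits are set. */
int fits_in_47_bits(const void *p) {
    return ((uint64_t)(uintptr_t)p >> 47) == 0;
}

/* 2^47 bytes expressed in TiB (2^40 bytes each). */
uint64_t addressable_tib(void) {
    return (1ULL << 47) >> 40;
}
```

`addressable_tib()` comes out to 128, and any address at or above 0x800000000000 fails the check. A check like this is one way to catch a pointer that would not survive the trip through a 47-bit lightuserdata slot.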

LuaJIT is a garbage collected language and as such wants to manage its own memory. It mmaps large chunks of memory and then allocates and manages objects within that segment. These objects actually must sit in the lower 4GB of memory (addressable with 32 bits) because LuaJIT leverages the other 32 bits for type information and other notes (e.g. garbage collection annotations). Linux and several other UNIX-like operating systems support a MAP_32BIT flag to mmap that instructs the kernel to return a memory mapped segment under the 4GB boundary.

Here we see a diagram of how this usually works with LuaJIT on platforms that support MAP_32BIT. When asked for a MAP_32BIT mapping, Linux (and some other operating systems) starts near the 4GB boundary and works backwards. The heap (or brk()/sbrk() calls), where malloc and friends live, typically starts near the beginning of memory space and works upwards. 4GB is enough space that many apps never have problems, but if you are a heavy lifting app, at some point you could legitimately attempt to grow your heap into the memory mapped regions, and that would result in a failed allocation attempt. You’re effectively out of memory! If your allocator uses mmap behind the scenes, you won’t have this problem.
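For the curious, requesting such a mapping looks roughly like the sketch below. This is an illustration, not LuaJIT's allocator; `map_low` and `below_4gb` are hypothetical names, and the code simply returns NULL on platforms whose headers do not define MAP_32BIT.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Ask the kernel for an anonymous read/write mapping in the low part of
 * the address space. Returns NULL if MAP_32BIT is unavailable at compile
 * time or if the mapping fails at run time. */
void *map_low(size_t len) {
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)len;
    return NULL;
#endif
}

/* Returns 1 if the address lies under the 4GB boundary. */
int below_4gb(const void *p) {
    return (uint64_t)(uintptr_t)p < (1ULL << 32);
}
```

A caller would do something like `void *p = map_low(1 << 20);` and treat NULL as "no low memory available", which is exactly the out-of-memory condition described above.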

Enter Illumos:

With Illumos, a new set of problems emerges. On old versions of Illumos, MAP_32BIT wasn’t supported, and this usually caused issues around 300MB or so of heap usage. That’s an “Itty-bitty living space,” not useful for most applications; certainly not ours. Additionally, the stack doesn’t grow down from the 47-bit boundary; it grows down from the 64-bit boundary.

On systems that don’t support MAP_32BIT, LuaJIT will make a best effort by leveraging mmap()’s “hinting” system to hint at memory locations it would like down in the lower 32bit space. This works and can get off the starting blocks, but we’ve compounded our issues, as you can see in this diagram. Because we’re not growing strictly down from the 4GB boundary, our heap has even less room to grow.
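The hinting approach can be sketched as follows. This is a simplified illustration of the idea, not LuaJIT's real probing logic; the function names, the starting hint and the step size are all assumptions picked for the example. Without MAP_FIXED the kernel is free to ignore the hint, so the result has to be checked and discarded if it landed too high.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Try one hint. Keep the mapping only if it ends at or below 4GB. */
void *map_with_hint(uintptr_t hint, size_t len) {
    void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if ((uint64_t)(uintptr_t)p + len > (1ULL << 32)) {
        munmap(p, len);   /* kernel ignored the hint; give it back */
        return NULL;
    }
    return p;
}

/* Walk a handful of hints across the low 4GB (hypothetical spacing). */
void *probe_low(size_t len) {
    uintptr_t hint;
    for (hint = 0x10000000u; hint < 0xF0000000u; hint += 0x10000000u) {
        void *p = map_with_hint(hint, len);
        if (p != NULL)
            return p;
    }
    return NULL;
}
```

Because each successful probe lands wherever a hint happened to be free, the low region ends up with mappings scattered through it rather than packed against the 4GB boundary, which is why the heap has less contiguous room to grow.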

Lack of MAP_32BIT support was fixed back in August of 2013, but you know people are still running old systems. On more recent versions of Illumos, it looks more like Linux.

The interaction between LuaJIT and our stratospheric stack pointers remains an unaddressed issue. If we push the address of something on our stack into Lua as lightuserdata, we get a crash, because that address does not fit in 47 bits. Surprisingly, in all our years, we’ve run across only one dependency that has triggered the issue: LPEG, for which we have a patch.

We’re still left with a space issue: our apps still push the envelope, and 4GB of heap is a limitation we thought it would be ridiculous to accept. So we have a stupid trick to fix the issue.

On Illumos (and thus OmniOS and SmartOS), the “brk” starting point is immediately after the BSS segment. If we can simply inform the loader that our binary had a variable declared in BSS somewhere else, we could make our heap start growing from somewhere other than near zero.
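To see where the break sits relative to the end of BSS, a small helper along these lines would do. This is a hypothetical stand-in, not the mem.c compiled in the commands that follow (that file is not shown here). It assumes a toolchain whose linker defines the conventional `_end` symbol, as the Illumos and GNU linkers do, and the function names are invented for the example.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Linker-defined symbol marking the end of the BSS segment. */
extern char _end;

/* Address just past BSS, as recorded by the linker. */
uintptr_t bss_end(void) {
    return (uintptr_t)&_end;
}

/* Current program break; sbrk(0) queries it without moving it. */
uintptr_t current_break(void) {
    return (uintptr_t)sbrk(0);
}

/* Print both addresses and whether the break is at or above 4GB. */
void report_break(void) {
    uint64_t brk_now = (uint64_t)current_break();
    printf("_end  = 0x%016llx\n", (unsigned long long)bss_end());
    printf("break = 0x%016llx\n", (unsigned long long)brk_now);
    printf("break is %s 4GB\n",
           brk_now >= (1ULL << 32) ? "at or above" : "below");
}
```

Calling `report_break()` from `main` in a normally linked 64-bit binary should show the break below 4GB, shortly after `_end`. The exact addresses depend on the platform and on address space randomization, so no output is shown here.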

# gcc -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    1 .bss           _end
      [88]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    0 .bss           _end

Kudos to Rich Lowe for pointers on how to accomplish this feat. Using the Illumos linker, we can actually specify a map using -M map64bit, where map64bit is a file containing:

mason_dixon = LOAD ?E V0x100000000 L0x8;

# gcc -Wl,-M -Wl,map64bit -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    1 ABS            _end
      [88]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    0 ABS            _end

This has the effect of placing an 8-byte variable at the 4GB virtual address (0x100000000), and the VM space looks like the above diagram. Much more room!

That leaves roughly 4GB of open space for native LuaJIT memory management, and since we run lots of VMs in a single process, this gets exercised. The heap grows up from 4GB, leaving the application plenty of room to grow. There is little risk of extending the heap past 128TB, as we don’t own any machines with that much memory. So everything on the heap and in the VM is cleanly accessible to LuaJIT.

So when we say we have some full stack engineers here at Circonus — we know which way the stack grows on x86_64.