01 April 2020

A Lockless Stack

Back in October 2019 I started a new gig working from home for NVIDIA. And I freaking love it. One of the things I've been championing on my team has been thread safety and I've been creating several thread-safe data structures. One of the ones that I've done recently is a Lockless Stack. Stacks are also known as LIFO (last in, first out) lists.

Here's an implementation of a simple, non-lockless, stack class. This is a very common academic problem and interview question. Marvel at the simplicity! The pop() function returns the most recently item push()ed. Granted there is no error checking here, and the T type is required to have a next pointer, but you get the idea.
Windows has had an interlocked SList for a while. It is essentially a LIFO list and is therefore a stack data structure. However, I also need to support Linux on x86-64 and AArch64 platforms. The Windows SList made a good starting point for researching other implementations of a lockless stack.

First, let's take a look at the Windows structures for SList
The first thing that is apparent is that the 32-bit and 64-bit versions look quite different. My implementation only needs to be 64-bit, so I'll focus on that. Notice that it is aligned to 16 bytes, a seemingly odd requirement.

Let's check out the disassembly of InitializeSListHead:
This is very simple. It checks to make sure that the SList is actually aligned and raises a status if it isn't. Then it sets the entire structure to zero bytes. Let's take a look at InterlockedPushEntrySList:
Alright, this is interesting, and it explains why 16-byte alignment is required: the cmpxchg16b will atomically compare-and-swap 16 bytes (128 bits) but requires that they be aligned to 16 bytes. Another interesting thing to note is this use of the Sequence field. Without this field incrementing during each Push operation, multiple threads that are Pushing and Popping the same value could succeed the cmpxchg16b instruction and each think that they atomimcally updated the Stack, but that wouldn't necessarily be true.

Finally, let's take a look at the last function we're interested in: InterlockedPopSList:
This function is the most complicated yet, and does the most work. Once again we see the cmpxchg16b instruction again. However, there's something really interesting about this. See that Fault tag? Nothing ever jumps to it. Furthermore, the operation there--dereferencing NextEntry to read Next--is NOT safe in a multi-threaded environment. Another thread could have popped NextEntry and deleted it causing this thread to trigger an access violation when dereferencing NextEntry. However, this access violation just signals that another thread has already popped the item--something that would be abundantly clear when the cmpxchg16b fails and the process is repeated. This is an example of a race condition: something that can happen sometimes as a result of two threads "racing" each other. In this case, an access violation is unacceptable. But it would be rare.

Interestingly, although nothing is present here in the assembly (except for that tell-tale Fault label), there is a way that a crash there could be avoided: Windows Structured Exception Handling (SEH). In 32-bit mode, SEH was kind of expensive: to enter a __try block the compiler would generate some prologue and epilogue code that would set up an exception handler. But this article talks about the differences with SEH for 64-bit. No longer is any code generated for a __try block; instead a record is written elsewhere in the binary that tells the exception handler exactly how to unwind the stack and what function to call to handle the exception (the __except block). This is amazing as it means that there is ZERO COST to a __try block unless an exception is thrown. Although I could find UNWIND_INFO for this function, it doesn't have an exception handler; but Windows could be doing something else undocumented to achieve the same result. I believe that's what is happening here: if an access violation exception occurs at the Fault label, Windows jumps back to the Resume label and retries the operation with a fresh load of ListHead.

Now that we've looked at the Windows 64-bit implementation of SList, we have learned the following things:
  1. It uses a 128-bit doubleworld-compare-and-exchange instruction.
  2. It appears to use Structured Exception Handling to avoid a rare race condition.
  3. It uses a Sequence field incremented on the Push operation to ensure synchronization.

However, as I mentioned above, my requirements were different than Windows: I am 64-bit only, and in addition to Windows, I need to support Linux on both x86-64 and AArch64 (v8.1) architectures. I am also trying to make this a header-only C++ template class. Another interesting point is that in my testing, cmpxchg16b (128-bit) is quite a bit slower than cmpxchg8b (64-bit)--about 10% slower in the uncontended case, and about 40% slower in the contended case (though this is going to be very machine-dependent). Adding more complication is that the AArch64 architecture I need to support doesn't have a doubleword-compare-and-exchange equivalent instruction. This led me wonder: can I compress all of the necessary information into 64-bits? The Windows implementation has a Depth member (the size of the stack), but I don't need that field. I need the Sequence field though, for the reasons mentioned above. Windows dedicates a full 16 bits to the Sequence, but we probably don't need that many.

How many bits does a 64-bit address really need? If we require that all addresses are 8-byte (64-bit) aligned--a reasonable requirement--then the least-significant 3 bits will always be zero. Furthermore, Wikipedia's 64-bit computing page indicates that virtual addresses for x86-64 and ARM64 are limited to 48 bits (leaving 16 bits at the top), and the x86-64 architecture requires that bits 48-63 (15 bits) match bit 47. It's conceivable, however, that future processors will use more address bits, and future operating systems will eventually support them. For now let's assume that we can get by stealing 7 of those 15 bits and leaving 1 bit to indicate the value of bit 47. I don't expect this code to be used for kernel code (currently kernels are the only thing using upper canonical addresses with bit 47 set), but I don't want to prevent that either. Stealing 7 of the most-significant bits and 3 of the least-significant bits from a 64-bit address gives us 10 bits (allowing a count between 0 and 1023) to work with for our Sequence field. Considering that we're using this as a check field to serialize Push operations, this should be sufficient. Encoding the data in this manner takes 3 instructions on x86-64 (2 on ARM64), and decoding takes 6 instructions on x86-64 (4 on ARM64).
The other issue to solve is the lack of SEH on Linux. As shown in the Pop operation above, we need some sort of safety on referencing the element that is being popped. We have to read the next element before doing the cmpxchg16b, but either a failure to read the next element or a failure of the cmpxchg16b will signal that we need to retry. Linux does have signal handling, but installing a signal handler and having each Pop operation issue a sigsetjmp() is quite costly--far more costly than 64-bit Windows' __try statement. Recall that __try on Windows is essentially free; the only cost happens if an exception is thrown. This is not the case on Linux. In order to achieve the same level of "free", we need to install a signal handler that understands that SIGSEGV on one specific instruction--the dereference--should trigger a restart of the Pop operation instead of a crash. This can be done--much more easily in assembly, but requires some pretty significant gymnastics in order to be header-only and cross-platform.
Et voilĂ ! The problems we faced with exception handling are solved and we've managed to squeeze all of the necessary info to maintain a 64-bit lockless stack into 64-bits!

Here's a link to the full source: Lockless Stack