06 July 2024

Fixing a Memory Corruption Bug in Python

For Omniverse we use Python for scripting. About a year and a half ago, we upgraded from Python 3.7 to Python 3.10. Shortly thereafter, we started seeing occasional memory corruption issues crop up, but it took a while to actually link the cause to Python.

A lot of the crashes that we were seeing involved very deep and complex Python stack traces in seemingly innocuous operations: the garbage collector, releasing a reference, sometimes allocating new memory within Python. Since my team in the Omniverse group handles the plugin system, we manage the plugin that embeds Python into Omniverse, which meant I was assigned all of these crashes for triage.

We do have some internal Python experts, and my team wasn't the tip of the spear on the 3.7 -> 3.10 upgrade (though we were involved). The nature of memory corruption means that crashes can happen in very random ways, since any call to acquire or free memory (which happens a LOT) may trigger a crash. So we had 20+ different stack traces that were considered to be part of the same memory corruption pathology. Curiously, this was only happening on Windows, and another clue was that quite often we would see an error in the log, sometimes well before the crash occurred:

RuntimeError: <_overlapped.Overlapped object at 0xXXXXXXXXXXXXXXXX> still has pending operation at deallocation, the process may crash.

"Helpful!" I thought to myself sarcastically. Well, the process was in fact crashing, so this may be an indication as to what happening.

However, this crash was fairly low priority. It wasn't happening often, but it was annoying. I made a task to track all of the different automatically generated crash stack traces, and periodically tried to find someone who could look into it. Since it was Windows-only and rare, it never percolated up to the top of anyone's priority queue.

Months passed and I continued to be irritated when another new stack trace would pop up. Everyone, including me, was busier with higher priority tasks. Since no one else could be bothered to fix it, I eventually decided to take a crack at it.


"Overlapped" I/O?

Typically when you're writing a program and you want to make a network request, read a file, or do any other sort of input/output, the easiest thing to do is to stop everything and wait, because you need the results before continuing. But in the world of 60+ FPS, stopping and waiting for even a few milliseconds can feel sluggish. We instead want to do things asynchronously: have them happen in the background while the program does other things. One way of doing this is through parallelization, that is, using a separate thread or CPU to do the work, but that can get messy. Also, Windows is old and computers used to have just one CPU, so a long time ago Windows started providing asynchronicity through concurrency, with the operating system handling the interleaving of all the requests. "Overlapped" I/O is how Windows started doing that: multiple I/O operations can be in flight at a time, hence they overlap each other.

The problem is, it's pretty complicated. And even more so in an environment like Python, where the implementation requires both C code and Python code.


Reproducing the problem

The first step was to try to reduce the problem to a minimum so that it was easier to debug. I also knew that we started having this problem when we moved from Python 3.7 to 3.10, so I looked at the changes to see what might be different. I noticed this in the "what's new in Python 3.8" article:
On Windows, the default event loop is now ProactorEventLoop. (Contributed by Victor Stinner in bpo-34687.)

It turns out that the Windows and Linux asyncio event loops are implemented quite differently, which makes sense because asynchronous I/O works quite differently on Windows and Linux. As of Python 3.8, asyncio uses overlapped I/O by default on Windows, through the proactor event loop.
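
(As an aside: applications that want the old behavior on 3.8+ can still opt back into the selector loop explicitly, though on Windows the selector loop doesn't support subprocesses and some socket types. This is a standard asyncio API, not something specific to Omniverse.)

    import asyncio
    import sys

    if sys.platform == "win32":
        # Restore the pre-3.8 default event loop on Windows.
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())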

Omniverse was also doing something different from most Python applications: constantly stopping and re-starting the event loop. See, in most Python applications the program hands control to asyncio until the program finishes; asyncio then dispatches everything that the application does. Since Omniverse uses embedded Python, we still have a graphical application that needs to run, so on each frame of the graphical application we would just run the asyncio event loop for one "tick" and then stop it.
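
A minimal version of that idiom looks like this (pump_once is my name for it, not an Omniverse API):

    def pump_once(loop):
        # Schedule a stop before running: run_forever() processes the
        # callbacks that are currently ready, hits the stop, and returns.
        loop.call_soon(loop.stop)
        loop.run_forever()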

I wrote a simple app that emulated that behavior, but I couldn't get it to crash, so that alone wasn't enough. We had a sample case that could typically produce the Overlapped error message within 5 minutes (which is an eternity when you're waiting for it), so I looked at what it was doing. After much debugging I found another clue: other threads were calling call_soon_threadsafe on the event loop. I added that to my test program. BINGO. This program, reconstructed below from memory with my own helper names, had an 80% chance of printing the error message within 60 seconds:
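
    import asyncio
    import threading

    def wakeup():
        # A dummy callback. Scheduling it from another thread is just a
        # pretext to make asyncio poke the event loop's wakeup socket.
        pass

    def pinger(loop, stop):
        # Other-thread traffic: this is exactly what call_soon_threadsafe
        # is for, and each call sends a wakeup signal to the loop.
        while not stop.is_set():
            try:
                loop.call_soon_threadsafe(wakeup)
            except RuntimeError:
                break  # the loop was closed while we were pinging

    def main():
        loop = asyncio.new_event_loop()  # a ProactorEventLoop on Windows
        asyncio.set_event_loop(loop)
        stop = threading.Event()
        threading.Thread(target=pinger, args=(loop, stop), daemon=True).start()
        try:
            # The Omniverse stop/run idiom: one event loop "tick" per frame.
            while True:
                loop.call_soon(loop.stop)
                loop.run_forever()
        finally:
            stop.set()
            loop.close()

    if __name__ == "__main__":
        main()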

This program starts a thread that pings away, scheduling a dummy function to run on the next loop iteration. Its real purpose is to send a wakeup signal to the event loop. The main thread just constantly does our stop/run idiom like Omniverse does, so it doesn't have to hand control to the event loop permanently.


But does it corrupt memory?

The error message comes from the implementation of _overlapped, which is written in C and acts as a bridge between the Python code for the asyncio event loop and the Windows API calls that manage overlapped I/O. Let's look at what happens before that message appears.

When the run_forever call is cleaning up, it tries to cancel the "_self_reading_future" (quoted approximately below from CPython 3.10's Lib/asyncio/windows_events.py):
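
    class ProactorEventLoop(proactor_events.BaseProactorEventLoop):

        def run_forever(self):
            try:
                assert self._self_reading_future is None
                self.call_soon(self._loop_self_reading)
                super().run_forever()
            finally:
                if self._self_reading_future is not None:
                    ov = self._self_reading_future._ov
                    self._self_reading_future.cancel()
                    # self_reading_future was just cancelled so if it
                    # hasn't been finished yet, it never will be
                    if ov is not None:
                        self._proactor._unregister(ov)
                    self._self_reading_future = None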

Which in turn calls into the C code for _overlapped and invokes a Windows API function called CancelIoEx. The Python side of that chain is _OverlappedFuture (same file, trimmed here); the ov.cancel() call is what lands in Modules/overlapped.c:
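
    class _OverlappedFuture(futures.Future):

        def _cancel_overlapped(self):
            if self._ov is None:
                return
            try:
                # Implemented in C: Overlapped.cancel() issues CancelIoEx
                # on the still-pending operation.
                self._ov.cancel()
            except OSError as exc:
                self._loop.call_exception_handler({
                    'message': 'Cancelling an overlapped future failed',
                    'exception': exc,
                    'future': self,
                })
            self._ov = None

        def cancel(self, msg=None):
            self._cancel_overlapped()
            return super().cancel(msg=msg)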


Then the run_forever function unregisters the 'overlapped' object (ov) and sets the future to None, effectively destroying the object.

This is where the danger arises! The overlapped structure (managed by the overlapped Python object) was handed to the Windows API when an I/O operation was requested (in this case, for the socketpair used to notify the event loop from other threads) and cannot be deallocated until Windows reports that it's done with it!

The deallocation function attempts to check this, and reports the error message if Windows isn't done with it. But it then goes on to deallocate the structure anyway, which can lead to memory corruption!


While not ideal, a better alternative might be to report the error and leak the OVERLAPPED structure so that a crash won't occur. But the best option is to see if we can fix the problem.

With a reproducible case and an understanding of what was going on, I filed a Python issue.


Can we fix it?

I put the initial diagnosis into the Python issue: CancelIoEx is being properly called and returns TRUE, meaning that Windows accepted the cancellation request (NOT that the I/O was cancelled yet!), but the deallocation function sees that the overlapped I/O hasn't yet completed. It tries to call CancelIoEx again (which always fails with ERROR_NOT_FOUND since the previous cancel request succeeded), and then calls GetOverlappedResult (with wait=FALSE, because the second CancelIoEx attempt failed), which will also fail with ERROR_IO_PENDING. As a test I tried changing it to wait, but this just caused a hang.

What gives? Originally I thought this was a bug in Windows' handling of overlapped I/O, but then I tried something else. Not only is this system using overlapped I/O, it's using another complex Windows-ism: I/O Completion Ports. This system groups multiple I/O requests together so that, instead of having to check myriad different requests, you check one "completion port" to find out when any of your requests finishes. What if the cancelled I/O request was stuck in the I/O completion port? Instead of waiting on GetOverlappedResult, maybe I needed to read the completed request from GetQueuedCompletionStatus.

BINGO.

Turns out that was the case: it really depends on where within the labyrinthine Windows I/O ecosystem the I/O request ended up before the cancel request happened.

So now we could make an "official" fix recommendation. Large projects like Python generally want fix recommendations from outsiders like me to be as simple and targeted as possible, which meant that I would have better luck if I didn't try to completely change how their overlapped I/O worked. Since all we needed to do was make sure that GetQueuedCompletionStatus was called on our _overlapped object again, that gave us a goal. GetQueuedCompletionStatus was already called every time the event loop ran, and it already cleaned up finished _overlapped objects once they completed. All we needed to do was allow cleanup after cancelling only if the I/O was complete; otherwise the object would get cleaned up automatically during the next event loop iteration.
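
For reference, this is roughly the reaping loop in IocpProactor._poll (heavily abridged from the same file) that runs on every event loop iteration; completed packets are popped off the completion port and their overlapped objects are retired from the proactor's cache:

    def _poll(self, timeout=None):
        ...
        while True:
            # Pull one completion packet off the I/O completion port.
            status = _overlapped.GetQueuedCompletionStatus(self._iocp, ms)
            if status is None:
                break
            ms = 0
            err, transferred, key, address = status
            f, ov, obj, callback = self._cache.pop(address)
            ...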

So that just meant changing the line (quoted approximately from the merged fix)
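
    if ov is not None:
        self._proactor._unregister(ov)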

to
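
    # ov.pending is true while Windows still owns the OVERLAPPED (the
    # cancelled I/O hasn't been reaped from the completion port yet), so
    # in that case we leave cleanup to the next _poll().
    if ov is not None and not ov.pending:
        self._proactor._unregister(ov)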

Done and dusted!

Of course, for the official fix I added a unit test and a bunch of comments since Overlapped I/O and I/O Completion Ports are frustratingly touchy.

As a bonus, Guido van Rossum (creator of Python) commented positively on the fix, and eventually we got the fix merged back as far as Python 3.8--the oldest supported version (for security fixes), but also the first version where the ProactorEventLoop became the default.

05 June 2024

Basic Callback Hygiene

There's a tenet that I've been pushing on for a few years now from my little corner of NVIDIA, and that is the concept of Basic Callback Hygiene.

For the uninitiated: a callback, simply put, is a function that I can pass to another function, which will later call it to "call me back." A lot of subscription systems in programming work like this: something is going to happen in the future, and I want to be notified of it, so I call a subscription function and give it a function that it can call to tell me when that happens.
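
In Python terms, the whole idea fits in a few lines (a toy example with invented names):

    class Watcher:
        def __init__(self):
            self._callbacks = []

        def subscribe(self, callback):
            # "Call me back later": just remember the function for now.
            self._callbacks.append(callback)

        def dispatch(self, event):
            # The thing happened; notify every subscriber.
            for callback in self._callbacks:
                callback(event)

    watcher = Watcher()
    watcher.subscribe(lambda event: print("got", event))
    watcher.dispatch("file-changed")  # prints: got file-changed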

This simple concept gets a lot more complicated when you ask what sorts of things you can do within that callback function. What happens when I don't care about the subscription anymore? What if the module containing the function unloads? What thread is the callback going to happen on? And so on, and so forth.

I've experienced problems with all of these complications over the course of my career, and given the amount of asynchronous and modular code that I've been writing for Omniverse, I finally distilled down a simple set of rules:

  1. Callbacks can ask to be unregistered from within the callback.
  2. Strong recommendation not to hold locks (e.g. a mutex) while calling callbacks.
  3. When a callback asks to be unregistered, there must be a strong guarantee that the callback is not currently being called, nor will ever be called in the future.
That's it. A subscription system that follows these basic principles will avoid 99% of the issues that we typically run into with callbacks.

Let's examine each of these. To put this in practical terms, let's say that we're designing an event system. We can subscribe to an event, which will register our callback, and we can dispatch an event, which will call back all of the subscribers from the context of the dispatching thread (so basically it loops over all subscribers and calls their callbacks). Our imaginary system is thread-safe, so we can subscribe or dispatch from any thread. We can also unsubscribe to unregister our subscription.


Unregistration from within a callback

This is probably the most useful one. Let's consider the case where I want to unsubscribe from within the callback but cannot. My callback is called and I can handle the event, but I cannot unsubscribe. This particular callback only wants to receive the first event after registration and no more events (an example of this might be a one-shot function that I want to defer until the next frame).

Since I cannot unsubscribe from within the callback, I must arrange for a function to conditionally run later, outside of the event dispatch, to unsubscribe my function, and in the meantime I need to ignore any further events that I might receive.

This makes something that should be simple much more troublesome, and every place where I want to do this sort of thing now requires more thought, effort, time, testing, etc. Changing the system to allow unsubscription from within the callback means that the complexity of handling that situation lives in a single place, my event system, rather than in each and every callback that needs that behavior.
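
With that in place, the one-shot case becomes trivial (a sketch against the imaginary event system above; subscribe returning an id that unsubscribe accepts is my assumption):

    def run_next_dispatch(events, action):
        def on_event(event):
            # One-shot: unregister first, then do the work. Rule 1 makes
            # unsubscribing from inside the callback legal.
            events.unsubscribe(sub_id)
            action(event)
        sub_id = events.subscribe(on_event)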


Don't Hold Locks

Holding a lock while calling a callback is dangerous because the callback can execute arbitrary code. The callback may call back into our event system, and we don't want the complication of recursive locks (which are arguably evil) or of contributing to lock cycles, which can cause deadlocks.

Writing an event system to be thread-safe and yet not hold any locks while calling callbacks is challenging, to say the least. Generally this requires, at a minimum, something akin to tracking the number of threads that are in the middle of doing callbacks. We typically do it by adding a thread identifier to a list while holding the lock, unlocking to call the callback, then re-acquiring the lock and removing the thread identifier from the list. This is more complicated, to be sure, and the reason why will become evident in the next section. There are other ways to do it, but you'll need some tracking to maintain state while the mutex is unlocked for the callback.

This also means that how you walk the list of subscriptions must be stateful in a way that can resume after the mutex is unlocked and re-acquired. If these subscriptions are stored in C++'s std::vector for instance, there exists a possibility that unlocking the mutex will allow another thread to subscribe, which may resize and therefore reallocate the vector, which invalidates all iterators; or another thread may unsubscribe which will remove an entry, which can affect the position of the current thread walking over the vector. (Solutions to these, respectively, might involve using indexes instead of iterators, and leaving null pointers in the array to mark empty spaces).
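
Here's that dispatch loop as a Python sketch (threading.Lock standing in for the mutex, with one shared list of notifying thread ids as described above; the waiting-flag handshake is omitted here and added in the full sketch at the end of the post):

    import threading

    def dispatch(self, event):
        me = threading.get_ident()
        self._lock.acquire()
        index = 0
        while index < len(self._subscriptions):   # indexes, not iterators
            callback = self._subscriptions[index]
            index += 1
            if callback is None:
                continue                          # slot blanked by unsubscribe
            self._notifying.append(me)            # remember we're mid-callback
            self._lock.release()
            try:
                callback(event)                   # arbitrary code; lock NOT held
            finally:
                self._lock.acquire()
                self._notifying.remove(me)
        self._lock.release()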


Strong Unregistration Guarantees

This is by far the hardest issue to solve correctly. Here's a real-world scenario: right before a module unloads (from the main thread, say), it tells our event system to unsubscribe a previously subscribed callback. Without a strong guarantee, what happens if another thread is currently calling our subscribed callback? As noted above, we don't want to hold locks while calling callbacks, so waiting on a lock isn't going to wait for our callback to finish. What has happened in actual code is that the unsubscribe happens quickly, the module proceeds to unload, and then a crash occurs because another thread was still executing a function that lived in the module. Or, after unloading, objects were destroyed since we would no longer get events, but with another thread still in our callback function, a crash occurs because deleted memory was accessed.

Or the problem that I'm fixing right now: memory corruption occurs because a dependent structure was freed, but a race with the callback means that we were still changing things inside the structure.

It quickly becomes apparent that any request to unregister must wait for any in-progress calls to the callback to finish. Also, once we return from the unregister function, it stands to reason that we cannot ever call the callback again; to do so would violate the caller's unregistration request.

This then requires a strong ordering guarantee:
  1. Unregister function is called.
    1. WAIT for any calls to the callback in question to finish*.
    2. PREVENT any future calls to the callback.
  2. Return from the unregister function. At this point the caller can guarantee that our event system is free of the callback and it is safe to destroy structures.
"Ah," the astute and observant amongst you might notice, "what if the unregistration is from within the callback?!" This was our first point, above. If we WAIT for any calls to the callback to finish and the unregistration call is coming from the callback, we would wait forever. So we amend the line with the asterisk (*) above instead to say: "WAIT for any calls to the callback in question in other threads to finish."

So our subscription process is pretty simple:
  1. Take the lock
  2. Append our subscription information to the std::vector (or similar)
  3. Unlock the lock
The notification process is a bit more complicated as it has to synchronize with the unsubscribe process:
  1. Take the lock
  2. Iterate over the std::vector (or similar) by index, skipping null entries. For each:
    1. Add the current thread identifier to a list of threads that are notifying.
    2. Unlock the lock
    3. Call the callback function
    4. Re-acquire the lock. At this point, the vector may have been reallocated and grown, and subsequent entries may have been unregistered. Any other thread could have done anything.
    5. Remove our current thread identifier from the list of threads that are notifying.
    6. Is a waiting flag set on the subscription? If so, notify the waiting thread (e.g. using a condition variable).
  3. Unlock the lock
And now, the pièce de résistance: the unsubscribe process:
  1. Take the lock
  2. Find the subscription to remove. If not found, unlock and return; otherwise:
  3. Set the entry in the std::vector to null, but it is not destroyed yet!
  4. Is the subscription currently being notified by a thread other than the current thread? If so, set a waiting flag on the subscription that is used in 2.6 above.
  5. Unlock the lock.
  6. If the waiting flag was set, wait on the condition variable for notification from 2.6 above. Otherwise it is safe to destroy the subscription and return.
  7. Re-acquire the lock.
  8. If the list of notifying threads contains threads besides the current thread, go to step 5.
  9. Unlock the lock. It is now safe to destroy the subscription.
This allows our unsubscribe to wait until no thread is notifying our subscription, or until the only such thread is the current one (the case of unsubscribing from within the callback during notification). A compact Python sketch of all three processes follows.
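
Putting the three processes together, here is a sketch (threading's Lock and Condition stand in for the C++ mutex and condition variable; illustrative, not production code):

    import threading

    class _Subscription:
        def __init__(self, callback):
            self.callback = callback
            self.notifying = []    # ids of threads currently in our callback
            self.waiting = False   # an unsubscriber is blocked on us

    class EventSystem:
        def __init__(self):
            self._lock = threading.Lock()
            self._cond = threading.Condition(self._lock)
            self._subs = []        # _Subscription entries; None marks a hole

        def subscribe(self, callback):
            sub = _Subscription(callback)
            with self._lock:                        # 1. take the lock
                self._subs.append(sub)              # 2. append subscription
            return sub                              # 3. unlock (via 'with')

        def dispatch(self, event):
            me = threading.get_ident()
            self._lock.acquire()                    # 1. take the lock
            for i in range(len(self._subs)):        # 2. iterate by index
                sub = self._subs[i]
                if sub is None:
                    continue                        #    skip null entries
                sub.notifying.append(me)            # 2.1 record our thread
                self._lock.release()                # 2.2 unlock
                try:
                    sub.callback(event)             # 2.3 call the callback
                finally:
                    self._lock.acquire()            # 2.4 re-acquire
                    sub.notifying.remove(me)        # 2.5 un-record our thread
                    if sub.waiting:                 # 2.6 wake the unsubscriber
                        self._cond.notify_all()
            self._lock.release()                    # 3. unlock

        def unsubscribe(self, sub):
            me = threading.get_ident()
            with self._cond:                        # 1. take the lock
                if sub not in self._subs:           # 2. find it, or bail
                    return
                self._subs[self._subs.index(sub)] = None  # 3. blank, don't erase
                # 4.-8. wait until no OTHER thread is notifying this
                # subscription; wait() atomically unlocks, sleeps, re-locks.
                while any(t != me for t in sub.notifying):
                    sub.waiting = True
                    self._cond.wait()
            # 9. lock released; it is now safe to destroy the subscription

Note how unsubscribe's wait() loop collapses steps 5 through 8: the condition variable's wait() does the unlock/wait/re-lock dance for us.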

Of course, there is a lot more nuance to this, especially when it comes to handling recursion (dispatching within a callback) or even just the complexity of working with multiple threads, but this should help you understand the concept of Basic Callback Hygiene.