The Joshua Tree: Fixing a Memory Corruption bug in Python

For Omniverse we use Python for scripting. About a year and a half ago, we upgraded from Python 3.7 to Python 3.10. Shortly thereafter, we started seeing occasional memory corruption issues crop up, but it took a while to actually link the cause to Python.

A lot of the crashes that we were seeing involved very deep and complex Python stack traces in seemingly innocuous things: the garbage collector, releasing a reference, sometimes allocating new memory within Python. Since my team in the Omniverse group handles the plugin system, we manage the plugin that embeds Python into Omniverse, which means I was assigned all of these crashes for triage.

We do have some internal Python experts, and my team wasn't the tip of the spear on the 3.7 -> 3.10 upgrade (though we were involved). The nature of memory corruption means that crashes can happen in very random ways, since any call to acquire or free memory (which happens a LOT) may trigger a crash. So we had 20+ different stack traces that were considered to be part of the same memory corruption pathology. Curiously, this was only happening on Windows, and another clue was that quite often we would see an error in the log sometimes well before the crash occurred:

RuntimeError: <_overlapped.Overlapped object at 0xXXXXXXXXXXXXXXXX> still has pending operation at deallocation, the process may crash.

"Helpful!" I thought to myself sarcastically. Well, the process was in fact crashing, so this may be an indication as to what happening.

However, this crash was fairly low priority. It wasn't happening often per se, but it was annoying. I made a task to track all of the different automatically generated crash stack traces, and periodically tried to find someone who could look into it. Since it was Windows-only and rare, it never percolated up to the top of anyone's priority queue.

Months passed and I continued to be irritated when another new stack trace would pop up. Everyone, including me, was busier with higher priority tasks. Since no one else could be bothered to fix it, I eventually decided to take a crack at it.

"Overlapped" I/O?

Typically when you're writing a program and you want to do a network request, or reading a file, or any other sort of input/output the easiest thing to do is to stop everything and wait because you need the results before continuing. But in the world of 60+ FPS, stopping and waiting for even a few milliseconds can feel sluggish. We instead want to do things a-synchronously--have them happen in a background process and the program can do other things while waiting. One way of doing this is through parallelization, that is, using a separate thread or CPU to do work, but this can also be messy. Also, Windows is old and computers used to just have one CPU, so a long time ago Windows started doing a-synchronicity through concurrency and the Operating System would handle interleaving all of the requests. "Overlapped" I/O is how Windows started doing that: multiple I/O operations could be done at a time, hence they overlap each other.

The problem is, it's pretty complicated. And even more so in an environment like Python where implementation requires both C code and Python code.

Reproducing the problem

The first step was to try to reduce the problem to a minimum so that it was easier to debug. I also knew that we started having this problem when we moved from Python 3.7 to 3.10, so I looked at the changes to see what might be different. I noticed this in the "what's new in Python 3.8" article:

On Windows, the default event loop is now ProactorEventLoop. (Contributed by Victor Stinner in bpo-34687.)

It turns out that Windows and Linux asyncio event loops are implemented quite differently, which makes sense because Asynchronous IO works quite differently on Windows and Linux. With Python 3.8, asyncio now used Overlapped IO by default through the Proactor event loop.

Omniverse was also doing something different from most Python applications: constantly stopping and re-starting the event loop. See, in most Python applications the program hands control to asyncio until the program finishes; asyncio then dispatches everything that the application does. Since Omniverse uses embedded Python, we still have a graphical application that needs to run, so each frame of the graphical application we would just run the asyncio event loop for one "tick" and then stop it.

I wrote a simple app that emulated that behavior, but I couldn't get it to crash, so that wasn't enough to do it. We had a sample case that could produce the Overlapped error message typically within 5 minutes (which is an eternity when you're waiting for it), so I looked at what it was doing. After much debugging I found another clue: other threads were calling call_soon_threadsafe on the event loop. I added that to my test program. BINGO. This program had an 80% chance to print the error message within 60 seconds:

This program starts a thread that pings away scheduling a dummy function to run on the next loop iteration. It's real purpose is to send a wakeup signal to the event loop. The main thread just constantly does our stop/run idiom like Omniverse does so it doesn't have to hand control to the event loop permanently.

But does it corrupt memory?

The error message comes from the implementation of _overlapped, which is done in C and acts as a bridge between Python code for the asyncio event loop and the Windows API call to manage overlapped I/O. Let's look at what happens before that.

When the run_forever call is cleaning up it tries to cancel the "_self_reading_future":

Which in turn calls into the C code for _overlapped and invokes a Windows API function called CancelIoEx:

Then the run_forever function unregisters the 'overlapped' object (ov) and sets the future to None, effectively destroying the object.

This is where the danger arises! The overlapped structure (managed by the overlapped Python object) has been passed to the Windows API when an IO operation was requested (in this case it's for the socketpair used to notify the event loop for other threads) and cannot be deallocated until Windows reports that it's done with it!

The deallocation function attempts to check this, and reports the error message if Windows isn't done with it. But then goes on to deallocate the structure anyways, which can lead to memory corruption!

While not ideal, a better alternative might be to report the error and leak the OVERLAPPED structure so that a crash won't occur. But the best alternative is to see if we can fix the problem.

With a reproducible case and an understanding of what was going on, I filed a Python issue.

Can we fix it?

I put the initial diagnosis into the Python issue: CancelIoEx is being properly called, returns TRUE meaning that Windows requested the cancellation (NOT that it was cancelled yet!), but the deallocation function sees that the overlapped I/O hasn't yet completed. It tries to call CancelIoEx again (this always fails with ERROR_NOT_FOUND since the previous cancel request succeeded), and then calls GetOverlappedResult (note that wait=false because the second CancelIoEx attempt fails), which will also fail with ERROR_IO_PENDING. As a test I tried changing it to wait, but this just caused a hang.

What gives? Originally I though this was a bug in Windows and handling Overlapped I/O, but then I tried something else. Not only is this system using Overlapped I/O, it's using another complex Windows-ism: I/O Completion Ports. This system is a way to group multiple I/O requests together, so that instead of having to check myriad different requests, you check one "completion port" to find out when any of your requests finish. What if the cancelled I/O request is stuck in the I/O completion port? Instead of waiting on GetOverlappedResult, maybe I need to read the request from GetQueuedCompletionStatus.

BINGO.

Turns out that was the case: it really depends on where within the labyrinthine Windows I/O ecosystem the I/O request ended up before the cancel request happened.

So now we could make an "official" fix recommendation. Generally large projects like Python desire fix recommendations from outsiders like me to be as simple and elegant as possible. Which meant that I would have better luck without completely trying to change how their Overlapped I/O worked. Since all we needed to do was make sure that GetQueuedCompletionStatus was called on our _overlapped object again, that gave us a goal. GetQueuedCompletionStatus was actually called every time the event loop ran, and it already cleaned up finished _overlapped objects once they completed. All we need to do is only allow cleanup after canceling if the I/O was complete, otherwise it would get cleaned up automatically during the next event loop iteration.

So that just meant changing the line

Done and dusted!

Of course, for the official fix I added a unit test and a bunch of comments since Overlapped I/O and I/O Completion Ports are frustratingly touchy.

As a bonus, Guido van Rossum (creator of Python) commented positively on the fix, and eventually we got the fix merged back as far as Python 3.8--the oldest supported version (for security fixes), but also the first version where the ProactorEventLoop became the default.

The Joshua Tree

06 July 2024

Fixing a Memory Corruption bug in Python

"Overlapped" I/O?

Reproducing the problem

But does it corrupt memory?

Can we fix it?

No comments: