21 November 2024

Game Hacking: Part Deux


I'm still playing through the Assassin's Creed franchise. But now I've finished Rogue and have moved on to Unity, which--although the two were released on the same day--are (console) generations apart. This also means that the games were created with different Ubisoft game engines: AC:Rogue was the last game on AnvilNext and AC:Unity was the first game on AnvilNext 2.0.

Different engines ostensibly have different issues, and older engines can have issues on newer hardware. My previous post describes my issues with AC:Rogue, where the game would stall for seconds at a time with too many CPUs. And I run an AMD Threadripper PRO 5975WX with 32 cores/64 hyperthreads--difficult hardware to come by 10-12 years ago when these engines were created.

Turns out that AC:Unity also has problems with too many CPUs--it crashes to desktop immediately after launch!

Has this already been fixed?

The first step was to see if anyone else had run into this problem and fixed it. Turns out I was in luck: a Reddit user running the same CPU had the same issue. Unfortunately, the workaround was pretty sucky--disable CPU cores in the BIOS. I definitely don't want to have to reboot before and after playing the game.

Fortunately now we know what the problem is--too many CPUs. The fix in the previous blog post was similar, but since this is the second game I've had to fix in the same way, let's see if we can make a fix that is more generic.

Let's make a generic tool! But how?

Enter Microsoft Detours--an amazing tool for hacking (and even some legitimate uses). Basically the plan is to create a .DLL file that we can inject into the process, one that will intercept (or hook) some system calls and lie--present an alternative set of facts to the game. This .DLL can be used with nearly any game, so we could go back and use it with Rogue to fix it without hacking the assembly language.

In the previous post, we pointed out two system calls (GetSystemInfo and GetNativeSystemInfo) that tell us the number of CPUs, but this is just the tip of the iceberg. It turns out that AC:Unity uses many more system calls than this. And for our generic tool to support all games, we should at least figure out what AC:Unity is using and support those calls.

But where do we draw the line? In the past 20+ years we have gone from single CPU to multiple SMP CPUs, to NUMA. Machines are available these days that have over 64 CPUs. AMD is preparing to launch servers with 256 cores next year (order now!). Now it's unlikely people will run games on these machines, but that was probably the thinking back when 32+ CPU machines were unheard-of. The point is that modern Operating Systems have evolved and added system calls to support NUMA, CPU Groups and CPU Sets. While it's conceivable that games could call some or all of these, let's assume that they won't. But maybe we can set up some logging for these functions to see if they're ever called.

The bigger issue is probably going to be affinity. Simply put, affinity is a list of CPUs that a process (or thread) can run on. Typically this list is given as a bit mask, since each bit can effectively represent a CPU. Since (historically) the number of CPUs was small, this was fine for a 32-bit value, but now we have over 32 CPUs! No longer can a single 32-bit value represent an affinity mask. Older 32-bit applications typically used a 32-bit value; newer 64-bit applications generally use 64 bits (though bugs may linger from porting older code).
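The bit-mask representation is simple enough to sketch. Here's a small helper (my own formulation, not from any game code) that builds a mask for the lowest N CPUs--and shows why a single 64-bit value tops out at 64:

```cpp
#include <cstdint>

// An affinity mask with the lowest `cpus` bits set. A single 64-bit
// value can describe at most 64 CPUs--exactly why machines with more
// CPUs need the newer Processor Group APIs.
uint64_t affinityMask(unsigned cpus) {
    return cpus >= 64 ? ~0ull : (1ull << cpus) - 1;
}
```

For 8 CPUs this yields 0xFF; for 32, 0xFFFFFFFF; anything 64 or above saturates to all bits set.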

Intercepting Functions

The Detours wiki has a pretty simple minimal example for instrumenting the Sleep function. Let's alter it for GetSystemInfo, one of the functions that we'll need to intercept. (Source Listing)

This listing will compile into a .DLL that, when loaded by an application, will override calls to GetSystemInfo and allow us to alter the responses.
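Stripped of the Win32 and Detours specifics, the shape of the hook can be sketched portably. This is a conceptual stand-in only--Detours actually patches the target function's machine code, and the struct here is a fake, not the real SYSTEM_INFO:

```cpp
#include <cstdint>

// Stand-in for the Win32 SYSTEM_INFO structure (illustrative only).
struct FakeSystemInfo { uint32_t dwNumberOfProcessors; };

// Pretend this is the real API reporting the host's 64 CPUs.
static void RealGetSystemInfo(FakeSystemInfo* info) {
    info->dwNumberOfProcessors = 64;
}

// Detours hands us a "trampoline" pointer that still reaches the
// original function; modeled here as a plain function pointer.
static void (*TrueGetSystemInfo)(FakeSystemInfo*) = RealGetSystemInfo;

static const uint32_t kEmulatedCpus = 8;

// Callers land here instead: call the original, then shrink the count.
static void HookedGetSystemInfo(FakeSystemInfo* info) {
    TrueGetSystemInfo(info);
    if (info->dwNumberOfProcessors > kEmulatedCpus)
        info->dwNumberOfProcessors = kEmulatedCpus;
}
```

The real version in the listing does the same dance with the actual Win32 types and DetourAttach.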

Once DetourTransactionCommit() is called in our InstallDetours() function, our hook is in place:

Now whenever GetSystemInfo is called, our version is called instead. We just happen to call the original and then modify the number of processors:

The next step is that we need to get the game to load our new DLL. Unfortunately, this is easier said than done. Games these days typically don't want hackers messing with them and they go to great lengths to try to prevent that (just as I did when working on my games). AC:Rogue didn't have any protections, but AC:Unity has a multiplayer element to it, and multiplayer games typically make attempts to curb hacking/cheating. Now I'm just trying to get the game to run but it doesn't know that.

Detours has (at least) two tools that can be used. The WithDll tool can be used to launch an application and inject a DLL before it fully loads. Unfortunately this method didn't work with AC:Unity because of how it's launched. I purchased the game through Steam, so Steam launches AC:Unity, which sees that it wasn't launched from Ubisoft Connect, so it launches Ubisoft Connect and exits, which does a few things (cloud sync for instance) and then re-launches AC:Unity. Since I cannot control how the game is launched from Ubisoft Connect, this didn't pan out. Interestingly, though, the app did launch with the injected DLL.

The other tool is SetDll. This tool can modify an executable file to always load a new DLL. This sounds like exactly what we need! Unfortunately, this didn't work either since AC:Unity has hack-prevention checks that noticed that the executable was modified and refused to run.

Frustrated, I started looking into other ways to inject the DLL. SetDll works on DLLs too, so if I can catch it early enough in the loading process, I could use a different DLL as an injection vector. I fired up Process Monitor and fiddled with the filters to see what the AC:Unity process was doing when it started. Turns out it ships with a DLL that is loaded very early, and that DLL doesn't complain about modifications. Eureka! I have my injection vector.

I added some simple logging (using OutputDebugString that I could view with the DebugView utility) to my simple example to make sure that my function was being called (since AC:Unity also detects debuggers and terminates the process if one is found). Good news--it was being called! But of course, the game was still crashing. So while this was enough to essentially fix AC:Rogue, the newer engine does different things and will need more work.

Affinity Functions

As I mentioned above, the functions that control affinity are probably in use. The Microsoft article about Processor Groups has a good list of functions that control affinity and scheduling. My process was to first add logging for these functions to see if they were being called. The logging would include the arguments that AC:Unity was passing to the functions as well as what the functions returned. Turns out that several of them were being called. Even though GetSystemInfo (and GetNativeSystemInfo) were intercepted and showing a lower number of CPUs, the affinity functions were happily reporting affinity across a larger number of CPUs.

The next step was to change the behavior of these affinity functions to only return bits set for the processors that we were emulating.
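The policy is simple to state: whatever mask the OS reports, intersect it with the emulated-CPU mask. A sketch of that filtering (my own formulation; the never-empty fallback is an assumption, since handing a thread an empty affinity mask would be an error):

```cpp
#include <cstdint>

// Intersect a host-reported affinity mask with the emulated set so the
// game never sees bits above the emulated CPU count.
uint64_t filterAffinity(uint64_t hostMask, unsigned emulatedCpus) {
    uint64_t allowed =
        emulatedCpus >= 64 ? ~0ull : (1ull << emulatedCpus) - 1;
    uint64_t result = hostMask & allowed;
    // Assumption: if the intersection is empty, fall back to CPU 0
    // rather than returning an invalid empty mask.
    return result ? result : 1ull;
}
```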

An interesting function that it turns out AC:Unity was also calling is SetThreadIdealProcessor (and the extended version). This function gives the Windows scheduler a hint about which CPU to schedule the thread on, but it's not as strong as affinity. I also intercepted this function to only allow the processors in our emulated set.
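Folding an ideal-processor hint back into the emulated set is a one-liner. Wrapping with a modulus is my choice here (clamping to the highest emulated CPU would work too); this is a sketch, not the mod's actual code:

```cpp
#include <cstdint>

// Map an ideal-processor hint for a 64-CPU host back into the
// emulated set by wrapping around.
uint32_t clampIdealProcessor(uint32_t ideal, uint32_t emulatedCpus) {
    return ideal % emulatedCpus;
}
```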

Logical Processor Info

In my work hacking AC:Rogue I discovered that it was optionally looking for and calling a function called GetLogicalProcessorInformation (which also has an extended version). CPUs these days are very complicated: they have multiple layers of caching (some of which are shared with other cores), can share hyperthreads with other cores, not to mention the NUMA relationships. These functions are the means to access this data on Windows, and as you might expect, are quite complicated. The extended version is even more complicated allowing different means of querying data and returning variable-sized information.

It turns out that AC:Unity was also calling these functions, so they were more functions that I had to wrangle.

For the original function, all of the data is presented at the same time, so the function returns the same data every time it is called (at least, it is highly unlikely that it would change while the game is running). Therefore, it is easily cached. The first time my intercepted function is called, it queries all of the host data, chops it down to only include our emulated processors, and saves it off so that every subsequent call just returns the same data.
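The query-once pattern can be sketched with a function-local static (illustrative names and a counter added for demonstration; the real code does the expensive Win32 query and filtering where fetchAndFilter() stands in):

```cpp
#include <vector>

// Counts how many times the expensive work ran (for illustration).
static int fetchCount = 0;

// Stand-in for "query the host's processor info and chop it down to
// the 8 emulated CPUs".
static std::vector<int> fetchAndFilter() {
    ++fetchCount;
    return {0, 1, 2, 3, 4, 5, 6, 7};
}

const std::vector<int>& cachedProcessorInfo() {
    // Function-local statics are initialized exactly once (and
    // thread-safely since C++11), so the query runs on first call only.
    static const std::vector<int> cache = fetchAndFilter();
    return cache;
}
```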

Unfortunately, the extended function allows specifying a Relationship Type to query, and the data is drastically different based on that query type. My solution here was to cache the chopped-down host data and remember the Relationship Type along with it. So if a different Relationship Type was queried, I would throw away the cache and rebuild it. This, however, was much more complicated because of the variable-sized data. I essentially had to filter the data in place and then copy the filtered data into the cache. For this it's probably better to look at the code directly: (intercept function, cache builder)
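The invalidation logic itself is small; it's the variable-sized records that make the real thing messy. A hypothetical sketch of just the keyed cache (all names are mine, and the "build" step is a trivial stand-in for the filtering work):

```cpp
#include <vector>

// Hypothetical relationship types (the real ones are Win32 enums).
enum class Relationship { Cache, Core, Numa };

struct InfoCache {
    Relationship type{};
    bool valid = false;
    std::vector<int> data;
    int rebuilds = 0;  // counts expensive rebuilds, for illustration
};

const std::vector<int>& query(InfoCache& c, Relationship t) {
    // Rebuild only when the cache is cold or a *different* relationship
    // type is requested; repeat queries for the same type are free.
    if (!c.valid || c.type != t) {
        // Stand-in for "query host data and filter to emulated CPUs".
        c.data = std::vector<int>(static_cast<int>(t) + 1, 0);
        c.type = t;
        c.valid = true;
        ++c.rebuilds;
    }
    return c.data;
}
```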

Does it run now?

YES! 🎉

In the end, I ended up hooking 12 functions (but three--the Process Group functions--weren't called by AC:Unity and therefore only log).


Finally, I can play the game for a bit.

But... oh no. There's another problem.

The Video Problem

I have a Samsung Odyssey G9 49" monitor. Its aspect ratio is 32:9 and I'm playing a game originally designed to run on consoles that output 16:9. However, the PC port of the game itself beautifully figures out the correct aspect and does pretty well with the super ultrawide display, as evidenced in the screenshot at the top of this post.

The problem is the full-motion videos that it uses for cutscenes. They are originally designed for 16:9 and are stretched out to 32:9--double the width they should be. It's almost breaking my brain trying to watch them and figure out what's going on.

A very wide Mr. Templar tied to a tree

Can I fix this too?

Bink Video

Bink is a popular video encoder and player software package for games. Developed by RAD Game Tools (now owned by Epic and rechristened as Epic Game Tools), Bink was originally released way back in 1999.

Bink is pretty easy to integrate and very fast. I integrated it into EverQuest 2 back in 2006, so I'm somewhat familiar with the API.

Turns out that Ubisoft uses it in AC:Unity, though a much newer version than the one I used. Fortunately a lot of the API seems to have remained consistent. And all I want to do is prevent the videos from being stretched.

Research

The first step is to figure out how AC:Unity is using Bink. Fortunately I already have the means to do this, just like I did with the CPU Windows system functions--hook the API calls and see how they're being called. Bink is loaded as a DLL, so theoretically I can hook it too.

And the initial tests are promising. I used Microsoft's dumpbin tool to list the exports from the Bink DLL. Pretty much any application is guaranteed to call BinkOpen, so that's the first function that I hook. And testing is quite quick since the first thing AC:Unity shows is the Ubisoft logo video.

Unfortunately, this is where things hit a snag. I've managed to hook a few functions that give me useful information (I can see BinkRegisterFrameBuffers called for instance), but none of the functions with names that sound promising are ever called (BinkCopyToBuffer, BinkCopyToBufferRect, BinkBufferOpen, BinkBufferSetResolution, BinkBufferSetScale). I ended up hooking 20+ of the most interesting sounding functions only to see a tiny few of them being called. How is it accessing the data without calling BinkCopyToBuffer?!?

It's time to get dirty. I can't attach a debugger because of the anti-hack measures (and it will take way too long to reverse engineer those), but I need to see it in a debugger when it's trying to access the data. So I'll make the game hang when it tries to access the framebuffer!

It turns out this isn't too complicated: I register an unhandled exception filter that just sleeps infinitely when a crash happens, and then I call VirtualProtect to tell Windows to make the framebuffer data no-access--any attempt to read or write it will cause a crash. However, Bink still has to be able to write decoded data into the framebuffer; I only want to catch where it's read back out. So I re-enable access for the duration of the BinkDoFrame call.

As soon as the game hangs (without displaying the first video frame) I use Task Manager to write a full dump. But the dump is confusing. I have no symbols, no function names, nothing to really give me any clues as to what's happening except the assembly code. And it looks like AC:Unity is reading the data directly out of the framebuffers.

That's exactly what it's doing. AC:Unity is calling BinkRegisterFrameBuffers which means it is allocating its own framebuffers and undoubtedly retaining the pointers. BinkDoFrame will decompress the video stream into the framebuffer, and then AC:Unity is reading the data (probably with a shader) to display on the screen.

The video fix

Looking at the allocated framebuffers, they're 1920 x 1080 (16:9 ratio) and their shaders are probably the stretching culprit. I'm not a GPU/graphics guy (ironic given where I work) and I have no idea how to break into the game's shader pipeline. It can be done because there are other mods out there that do, but it will take a long time to learn how to do this. And I want to get back to playing the game, so how do I do it quickly? Is there something I can do to the framebuffers?

It turns out I can! I need AC:Unity to allocate 32:9 framebuffers instead of 16:9, and the first two values in the structure returned from BinkOpen look like width and height. So the first thing I try is just scaling that width to 32:9. Since buffers historically were powers of two, they typically have a pitch passed to the decoder that says how many bytes represent a line so that it can skip to the start of the next line (extra bytes would just be off the screen). This helps us out here since the pitch tells Bink to skip about half of the line and just start decoding on the next line. It works!
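The arithmetic is simple: keep the height, widen to 32:9, and keep the pitch in step with the new width so the decoder skips the unused portion of each line. A sketch with my own names (the real structure layout is Bink's, and the Y plane is assumed to be one byte per pixel):

```cpp
// Widen reported 16:9 video dimensions to the monitor's 32:9 shape.
struct BinkDims { int width; int height; int pitch; };

BinkDims widenTo32by9(int width, int height, int bytesPerPixel) {
    (void)width;  // the decoded video itself stays its original width
    int newWidth = height * 32 / 9;  // 1080 * 32 / 9 = 3840
    return { newWidth, height, newWidth * bytesPerPixel };
}
```

For a 1920x1080 video this yields a 3840-wide buffer: the decoder fills the left 1920 pixels of each line and the pitch carries it to the start of the next.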

Better?

Well, kinda. Why is half of the screen green? Hey, but at least the aspect ratio is correct. And it's not centered, but we can fix that too (probably).

Alright, so the green screen. Encoding video data as RGB is very inefficient. Remember, Bink is a library developed in 1999 and first used on the PlayStation. It uses the YCrCb color space for its video information to use less space. AC:Unity likely has a pixel shader that computes the RGB values from the YCrCb planes in the framebuffers. The buffers start out zeroed, and since the right half of the screen is never written by the decoder, those bytes stay zero. Converting all-zero YCrCb values yields RGB color #008700, a decidedly green hue.
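You can check that #008700 claim with the standard BT.601-style conversion (the coefficients below are the textbook ones, not pulled from the game's shader). Chroma is centered at 128, so all-zero Cr/Cb means maximally negative chroma--and the green channel is the one that goes *up* when both chroma terms go down:

```cpp
#include <algorithm>

// Clamp a converted channel value into the displayable 0..255 range.
static int clamp8(double v) {
    return std::clamp(static_cast<int>(v + 0.5), 0, 255);
}

// BT.601-style YCrCb -> RGB conversion (standard coefficients).
void ycrcbToRgb(int y, int cr, int cb, int& r, int& g, int& b) {
    r = clamp8(y + 1.402 * (cr - 128));
    g = clamp8(y - 0.714136 * (cr - 128) - 0.344136 * (cb - 128));
    b = clamp8(y + 1.772 * (cb - 128));
}
```

Feeding it (0, 0, 0) produces (0, 135, 0)--that is, #008700--while neutral chroma (Y=0, Cr=128, Cb=128) produces true black.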

Looking up the value for black in YCrCb and initializing the buffers with it gets us closer, but not quite: that link lists the minimum Y value as 16, which renders as dark gray. Trying a Y value of 0 instead gives us black.

So the last issue to tackle is the centering. AC:Unity has given Bink the addresses for the framebuffers by calling BinkRegisterFrameBuffers. Bink decodes data into the first line at the start of the buffer, advances the line pointer by pitch bytes and decodes the next line, so on and so forth. If we can tell it to use a different starting address for the buffer that represents a centered frame, it will move the whole image.

This is pretty easily done: we calculate the scaled width of our screen, subtract the width of the video, and divide by two to get the framebuffer line offset. We have to do this for each plane separately because the Cr and Cb planes are smaller than the Y/A planes. The tricky thing here though is that Bink requires the framebuffer address to be 16-byte aligned, otherwise a hang occurs in the decoder (speaking from experience).
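That centering-plus-alignment calculation looks like this (my own sketch; widths are in bytes, and each plane gets its own call since the chroma planes are smaller):

```cpp
#include <cstdint>

// Byte offset into each (wider) framebuffer line that centers the
// video. Bink hangs if the decode start address isn't 16-byte aligned,
// so round the offset down to a multiple of 16.
uint32_t centerOffset(uint32_t bufferWidth, uint32_t videoWidth) {
    uint32_t off = (bufferWidth - videoWidth) / 2;
    return off & ~15u;  // clear the low 4 bits: round down to 16
}
```

For a 3840-wide buffer and a 1920-wide video this gives 960, which happens to be 16-byte aligned already; less convenient widths get rounded down.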

Voila!

Not so wide now, Mr. Templar!

Source is here. Uploaded to NexusMods and now back to playing the game!


09 October 2024

Game Hacking


I've been playing through the Assassin's Creed franchise in anticipation of Shadows due out later this year, or well, now pushed to early next year. I really enjoyed Assassin's Creed IV: Black Flag, but ran into an issue that some other people had too: Kenway's Fleet wouldn't load. Fortunately, someone created a patch that fixed it, and additionally made initial connection to Ubisoft much faster.

Fast forward a bit and I'm playing Assassin's Creed: Rogue, which is a great game in its own right, but borrows a lot from Black Flag. However, I started running into an issue: five (or more) second stalls where the game would just freeze (audio would keep playing) and then resume as if nothing happened. I could go a minute or so without these freezes, but sometimes it would be only a second or two after the previous freeze. It was nigh unplayable.

Searching around, other people didn't seem to report the same problems that I did. But my PC is pretty unique: it's an AMD Threadripper with 64 logical cores and 128 GB of RAM. That's a lot of hardware for a game released 10 years ago. Looking in Task Manager, I see that the game has about 280 threads. That's a lot! I wondered if the game is creating a number of worker threads per logical CPU, but perhaps isn't optimized for such a high number of CPUs. Designing software that runs well fully using 4 CPUs is a lot different from designing software that fully uses 64 CPUs.

Alright, so my working theory: the game has a contention problem that shows up with tons of threads.

How to test that theory? I fire up one of my favorite profilers: Superluminal. Unfortunately, my ability to seriously debug performance issues is going to be hampered by the fact that I don't have any debugging information (or source code) for the game in question. But Superluminal can tell me what the OS is doing, and what requests the game is making of the OS.

So I fire up Superluminal and tell it to start watching "ACC.exe", the AC:Rogue executable. Sure enough, within a few seconds I get a 5 second stall. After it recovers, I head back to Superluminal to stop the tracing and start looking at it. The stall is clear as day and I have selected it here:


From this, it's pretty apparent that the main thread is running hard during the stall, and from the function list in the bottom-left corner, I see it's trying to lock and unlock "critical sections" which are Windows mutex-like objects. Note that before and after the stall the main thread spends a decent amount of time synchronizing (the red part of the red/green CPU activity graph). Let's scroll through the other threads and see if we see anything else interesting.


Aha! Thread 86036 has a CPU chart that is all blue. This is Superluminal's visual language for "preemption"--that is, the thread was running but the Windows scheduler decided that running something else was more important. Also interesting to note that the other worker threads above are completely blocked on synchronization. They're just waiting during this stall.

Looking through some of the other threads there's not really much that is interesting. A few threads are sleeping and waking up periodically. But the main thread is running hard, this thread 86036 is preempted, and pretty much everything else stops while this is happening. It's possible that Windows (for some unknown reason) chooses to stop running thread 86036 and instead runs the main thread.

Superluminal gets information about every process running on the system at that time, and the only thing using any CPU is ACC.exe, the game we're profiling.

For additional information, I also captured system state using Event Tracing for Windows by utilizing the UIforETW tool by Bruce Dawson. Looking at the capture I was able to find a different stall in the same thread:


Check out the highlighted line: the thread was ready to run for 4,747,081.6 microseconds (4.7 seconds!) before it ran, and the Readying Thread of -1 indicates that it was in fact preemption. Not immediately apparent from the screenshot (but from further digging through the UI) is that both the main thread and thread 86036 have the same Ideal CPU (0) and Priority (15).

Unfortunately it's still not super apparent exactly what is causing this. A possibility is affinity--which CPUs a thread is allowed to run on--but it's not clear how to check this. Most threads have 0 as their Ideal CPU, so I'm not sure that this value is actually used by Windows 11. But the Priority values are interesting. Most of the threads in the process have Priority values of 13 and 15, which would correspond to a "High" priority class and "normal" and "time critical" priorities respectively. If both the main thread and thread 86036 share a "time critical" priority, they could affect each other running.

But my initial thought was that 280+ threads is a lot, especially if it's running some or all of them as time-critical priority. Let's see if we can get the game to create fewer threads. Unfortunately, this is easier said than done.

Most compute-intensive programs trying to max-out a machine (like games) will typically create at least one thread per CPU, so maybe we can lie to the game about how many CPUs there are. There are actually several API calls that it could be making to determine this, but the most common one would be GetSystemInfo (side note: modern C++ Runtime implementations use GetNativeSystemInfo, but this function does not appear to be used by ACC.exe). Fortunately the game executable doesn't appear to be cryptographically signed, so we should be able to patch it.

The next step is to disassemble the game so that we can find the places where it might be trying to find out the number of CPUs. For this I used IDA Freeware. While the disassembly is quick, the analysis happens in the background and takes a little while. Once analysis was done, I looked at the Imports and found the GetSystemInfo function. There were several cross-references to places that were calling it (I took the liberty of renaming the functions based on a cursory glance of what they were doing):


Let's take a look at the second entry, the one that I renamed to getSystemInfo2. I was really impressed with IDA here: it understands Windows API structures and was able to figure out how members of the structures were being used:

Note that dwNumberOfProcessors is read into the edx register here and stored in various locations. Instead of reading the value from the structure, I want to lie to the game: I want it to see a hard-coded value of 8 CPUs. So now I need to change the bytes (8B 54 24 50), the machine code for "mov edx, [rsp+68h+SystemInfo.dwNumberOfProcessors]", to... something else. Something that just gives me the value of 8 in edx. Unfortunately the free version of IDA doesn't do assembling, so I can't write new code here and expect IDA to build me a new executable.

Instead, we're going to do it the hard way: I need to figure out how to get the value of 8 into edx but in 4 bytes or less. Since IDA won't help me out here, I'll use an online assembler to try some things, and then a hex editor to change the actual bytes.

Well, let's try the obvious instruction: "mov edx, 8"

Unfortunately this results in 5 bytes (ba 08 00 00 00), which is difficult to stuff into 4 bytes. What's interesting is that the full immediate 32-bit value representing "8" is present in the machine code, which makes some sense since we're writing to a 32-bit register.

This gave me an idea. What is the machine code to move 8 into the lowest byte of edx (the dl register)? Turns out this is only 2 bytes (b2 08)! The problem is that the other 24 bits of our edx register are garbage. Maybe we can clear the register with an xor (exclusive-or) instruction? If you xor a register against itself the value is always 0. Let's try that: xor edx, edx--this produced machine code (31 d2).
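Putting the two instructions together gives exactly the 4 bytes we need to overwrite the original load (byte values as produced by the assembler above):

```cpp
#include <cstddef>
#include <cstdint>

// The 4-byte replacement for "mov edx, [rsp+...]" (8B 54 24 50):
//   31 D2   xor edx, edx   ; zero the whole 32-bit register
//   B2 08   mov dl, 8      ; write our lie into the low byte
const uint8_t patch[] = { 0x31, 0xD2, 0xB2, 0x08 };

// Same length as the original instruction, so nothing shifts.
const uint8_t original[] = { 0x8B, 0x54, 0x24, 0x50 };
```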

Perfect, so we have our solution:

Then I'll just use my trusty hex editor plugin for VSCode to search for the preceding bytes and write over the old bytes:

We just need to apply this to each location in our cross-reference list above.

Working through the list, we run into a problem with the function that I've called "getMemoryAndCpuInfo". The dwNumberOfProcessors value is written to a different register: r11d. This is an extended register for 64-bit processors and needs 5 machine code bytes to read the value into it.

Unfortunately, our trick isn't working here. Assembling "xor r11d,r11d; mov r11b, 8" takes 6 bytes.

Let's try a different approach: push and pop the value, which uses the stack. We really don't want to make any stack changes as moving the stack pointer will cause issues with all of the following instructions that read/write from the stack. But if we immediately reset the stack pointer by popping the value, it should be fine. So let's try that:

Perfect! The push/pop is only 4 instruction bytes, so we had to add a "nop" or no-operation instruction (only one byte) to bring us up to our total of 5.
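For reference, the bytes of that replacement look like this. The encodings here are from my own assembler run (worth double-checking in a disassembler), not copied from the post's screenshot:

```cpp
#include <cstddef>
#include <cstdint>

// 5-byte replacement that leaves 8 in r11 without disturbing the
// stack on exit:
//   6A 08   push 8      ; push the immediate
//   41 5B   pop r11     ; pop it into r11; rsp is back where it started
//   90      nop         ; 1-byte pad to cover all 5 original bytes
const uint8_t r11Patch[] = { 0x6A, 0x08, 0x41, 0x5B, 0x90 };
```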

Our next wrinkle happens in the function that I've dubbed getProcessorTopology. While this function is also calling GetSystemInfo and using dwNumberOfProcessors (easily patched by our previous methods), it's also doing something further:

It's checking to see if Windows is new enough to have a function GetLogicalProcessorInformation and calling it if it does. We don't really want the game to call this function since it ruins our lie about the number of processors. Fortunately, the game's code is smart enough to handle the case where the function doesn't exist. So to prevent it from calling this function, we'll just change the name to all X's. It will attempt to find a function called XXXXXX..., which obviously doesn't exist. IDA can also confirm for us that this is the only location where that string is used, so we can safely change the name.

Unfortunately we're not out of the woods yet. There is one more function, and it's a doozy:

This function is doing a signed integer divide by the number of processors. None of our tricks thus far will work for this. Let's take a deeper look at what this instruction is doing. It takes a 64-bit value with the most significant 32 bits in edx and the least significant 32 bits in eax and divides by a 32-bit value (in this case, our dwNumberOfProcessors)--eax will receive the quotient and edx will contain the remainder. It does this in only 4 bytes!

However, it doesn't appear to be using the quotient at all, just the remainder. So this is actually doing a modulus by the number of processors. So what we need to do is actually a modulus by our lie of 8 processors. Can we do it in 4 bytes!?

Well, we have a few things going for us. The original function does a cdq (convert doubleword to quadword) instruction to set up for the divide; it also needs eax and edx for the divide but only uses edx from the result. And since we've chosen a magic power-of-two value of 8, we can achieve a modulus by doing an AND operation with (8 - 1), or 7.
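The identity behind the trick: for a power-of-two divisor, the remainder is just the low bits, so x % 8 == x & 7 for any non-negative x. That's what lets an AND (plus a register move) stand in for the whole cdq + idiv sequence:

```cpp
#include <cstdint>

// Remainder by a power of two is a bit mask: keep only the low 3 bits.
uint32_t mod8(uint32_t x) { return x & 7; }
```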

Within reason, we can also move some instructions around as long as the side-effects are the same.

All we really need to do is this, which is only 5 bytes:

And we've already established that we don't need the cdq instruction since that was part of the divide. So cdq + idiv gives us 5 bytes! All we need to do is move the "mov r8, rbx" one byte earlier, and then we have the 5 contiguous bytes that we need!

That's all the locations calling GetSystemInfo patched up. The question is, does it run?

I fired up the game, and it purrs along like a charm! No more stalls. Seems like the theory of my computer having too many CPUs for an older game trying to take advantage of them was accurate.

I've uploaded the patched executable to Nexus Mods but as of this writing there haven't been any downloads. I guess not many people are trying to play a 10-year-old game with 64+ CPUs! At least I'm enjoying it again 😉



06 July 2024

Fixing a Memory Corruption bug in Python

For Omniverse we use Python for scripting. About a year and a half ago, we upgraded from Python 3.7 to Python 3.10. Shortly thereafter, we started seeing occasional memory corruption issues crop up, but it took a while to actually link the cause to Python.

A lot of the crashes that we were seeing involved very deep and complex Python stack traces in seemingly innocuous things: the garbage collector, releasing a reference, sometimes allocating new memory within Python. Since my team in the Omniverse group handles the plugin system, we manage the plugin that embeds Python into Omniverse, which means I was assigned all of these crashes for triage.

We do have some internal Python experts, and my team wasn't the tip of the spear on the 3.7 -> 3.10 upgrade (though we were involved). The nature of memory corruption means that crashes can happen in very random ways, since any call to acquire or free memory (which happens a LOT) may trigger a crash. So we had 20+ different stack traces that were considered to be part of the same memory corruption pathology. Curiously, this was only happening on Windows, and another clue was that quite often we would see an error in the log sometimes well before the crash occurred:

RuntimeError: <_overlapped.Overlapped object at 0xXXXXXXXXXXXXXXXX> still has pending operation at deallocation, the process may crash.

"Helpful!" I thought to myself sarcastically. Well, the process was in fact crashing, so this may be an indication as to what happening.

However, this crash was fairly low priority. It wasn't happening often per se, but it was annoying. I made a task to track all of the different automatically generated crash stack traces, and periodically tried to find someone who could look into it. Since it was Windows-only and rare, it never percolated up to the top of anyone's priority queue.

Months passed and I continued to be irritated when another new stack trace would pop up. Everyone, including me, was busier with higher priority tasks. Since no one else could be bothered to fix it, I eventually decided to take a crack at it.


"Overlapped" I/O?

Typically when you're writing a program and you want to do a network request, read a file, or perform any other sort of input/output, the easiest thing to do is to stop everything and wait, because you need the results before continuing. But in the world of 60+ FPS, stopping and waiting for even a few milliseconds can feel sluggish. We instead want to do things asynchronously--have them happen in the background while the program does other things. One way of doing this is through parallelization, that is, using a separate thread or CPU to do the work, but this can also be messy. Also, Windows is old and computers used to have just one CPU, so a long time ago Windows started doing asynchronicity through concurrency, with the Operating System handling the interleaving of all the requests. "Overlapped" I/O is how Windows started doing that: multiple I/O operations can be in flight at a time, hence they overlap each other.

The problem is, it's pretty complicated. And even more so in an environment like Python where implementation requires both C code and Python code.


Reproducing the problem

The first step was to try to reduce the problem to a minimum so that it was easier to debug. I also knew that we started having this problem when we moved from Python 3.7 to 3.10, so I looked at the changes to see what might be different. I noticed this in the "what's new in Python 3.8" article:
On Windows, the default event loop is now ProactorEventLoop. (Contributed by Victor Stinner in bpo-34687.)

It turns out that the Windows and Linux asyncio event loops are implemented quite differently, which makes sense, because asynchronous I/O works quite differently on Windows and Linux. Starting with Python 3.8, asyncio uses Overlapped I/O by default on Windows, through the Proactor event loop.

Omniverse was also doing something different from most Python applications: constantly stopping and restarting the event loop. In most Python applications, the program hands control to asyncio until it finishes; asyncio then dispatches everything that the application does. Since Omniverse embeds Python inside a graphical application that needs to keep running, on each frame of the graphical application we would run the asyncio event loop for one "tick" and then stop it.

I wrote a simple app that emulated that behavior, but I couldn't get it to crash, so that alone wasn't enough. We had a sample case that could produce the Overlapped error message, typically within 5 minutes (an eternity when you're waiting for it), so I looked at what it was doing. After much debugging I found another clue: other threads were calling call_soon_threadsafe on the event loop. I added that to my test program. BINGO. The resulting program had an 80% chance of printing the error message within 60 seconds:
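
The original program isn't reproduced here, but a minimal sketch of it, reconstructed from the description below (the name run_repro and the structure are my own), might look like:

```python
import asyncio
import threading
import time

def run_repro(duration=60.0):
    # On Windows with Python 3.8+, new_event_loop() returns a
    # ProactorEventLoop, which is where the Overlapped error shows up.
    loop = asyncio.new_event_loop()
    stop = threading.Event()

    def dummy():
        pass

    def pinger():
        # Constantly schedule a no-op; the real effect is the wakeup
        # signal sent to the event loop each time.
        while not stop.is_set():
            try:
                loop.call_soon_threadsafe(dummy)
            except RuntimeError:
                pass  # the loop may already be closed during shutdown

    thread = threading.Thread(target=pinger)
    thread.start()

    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            # The stop/run idiom: run the loop for one "tick" per frame
            # instead of handing it control permanently.
            loop.call_soon(loop.stop)
            loop.run_forever()
    finally:
        stop.set()
        thread.join()
        loop.close()
    return True
```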

This program starts a thread that pings away, scheduling a dummy function to run on the next loop iteration. Its real purpose is to send a wakeup signal to the event loop. The main thread constantly does our stop/run idiom, as Omniverse does, so that it never hands control to the event loop permanently.


But does it corrupt memory?

The error message comes from the implementation of _overlapped, which is done in C and acts as a bridge between Python code for the asyncio event loop and the Windows API call to manage overlapped I/O. Let's look at what happens before that.

When the run_forever call is cleaning up, it tries to cancel the "_self_reading_future":

This in turn calls into the C code for _overlapped and invokes a Windows API function called CancelIoEx:


Then the run_forever function unregisters the 'overlapped' object (ov) and sets the future to None, effectively destroying the object.

This is where the danger arises! The overlapped structure (managed by the overlapped Python object) was passed to the Windows API when an I/O operation was requested (in this case, for the socketpair used to notify the event loop from other threads) and cannot be deallocated until Windows reports that it's done with it!

The deallocation function attempts to check this and reports the error message if Windows isn't done with it. But it then goes on to deallocate the structure anyway, which can lead to memory corruption!


While not ideal, a better alternative might be to report the error and leak the OVERLAPPED structure so that a crash won't occur. But the best alternative is to see if we can fix the problem.

With a reproducible case and an understanding of what was going on, I filed a Python issue.


Can we fix it?

I put the initial diagnosis into the Python issue: CancelIoEx is properly called and returns TRUE, meaning that Windows accepted the cancellation request (NOT that the I/O was cancelled yet!), but the deallocation function then sees that the overlapped I/O hasn't yet completed. It tries to call CancelIoEx again (which always fails with ERROR_NOT_FOUND since the previous cancel request succeeded), and then calls GetOverlappedResult (with wait=false, because the second CancelIoEx attempt failed), which also fails with ERROR_IO_PENDING. As a test I tried changing it to wait, but that just caused a hang.

What gives? Originally I thought this was a bug in Windows' handling of Overlapped I/O, but then I tried something else. Not only is this system using Overlapped I/O, it's using another complex Windows-ism: I/O Completion Ports. Completion ports group multiple I/O requests together, so that instead of having to check myriad individual requests, you check one "completion port" to find out when any of your requests finishes. What if the cancelled I/O request was stuck in the I/O completion port? Instead of waiting on GetOverlappedResult, maybe I needed to read the request from GetQueuedCompletionStatus.

BINGO.

Turns out that was the case: it really depends on where within the labyrinthine Windows I/O ecosystem the I/O request ended up before the cancel request happened.

So now we could make an "official" fix recommendation. Large projects like Python generally prefer fixes from outsiders like me to be as simple and elegant as possible, which meant I would have better luck if I didn't try to completely change how their Overlapped I/O worked. Since all we needed was to make sure that GetQueuedCompletionStatus was called on our _overlapped object again, that gave us a goal. GetQueuedCompletionStatus was already called every time the event loop ran, and it already cleaned up finished _overlapped objects once they completed. All we needed to do was allow cleanup after cancelling only if the I/O was already complete; otherwise it would get cleaned up automatically during the next event loop iteration.

So that just meant changing the line



to



Done and dusted!

Of course, for the official fix I added a unit test and a bunch of comments since Overlapped I/O and I/O Completion Ports are frustratingly touchy.

As a bonus, Guido van Rossum (creator of Python) commented positively on the fix, and eventually we got the fix merged back as far as Python 3.8--the oldest supported version (for security fixes), but also the first version where the ProactorEventLoop became the default.

05 June 2024

Basic Callback Hygiene

There's a tenet that I've been pushing on for a few years now from my little corner of NVIDIA, and that is the concept of Basic Callback Hygiene.

For the uninitiated, a callback, simply put, is a function that I pass to another function, which it will later call to "call me back." A lot of subscription services in programming work like this: something is going to happen in the future; I want to be notified of it, so I call a subscription function and give it a function that it can call when that thing happens.

This simple concept gets a lot more complicated when you ask what sorts of things you can do within that callback function. What happens when I don't care about the subscription anymore? What if the module containing the function unloads? What thread is the callback going to happen on? And so on, and so forth.

I've experienced problems with all of these complications over the course of my career, and given the amount of asynchronous and modular code that I've been writing for Omniverse, I finally distilled down a simple set of rules:

  1. Callbacks can ask to be unregistered from within the callback.
  2. Strong recommendation to not hold locks (i.e. a mutex) while calling callbacks.
  3. When a callback asks to be unregistered, there must be a strong guarantee that the callback is not currently being called, nor will ever be called in the future.
That's it. A subscription system that follows these basic principles will avoid 99% of the issues that we typically run into with callbacks.

Let's examine each of these. To put this in practical terms, let's say that we're designing an event system. We can subscribe to an event which will register our callback and we can dispatch an event which will callback all of the subscribers from the context of the dispatching thread (so basically it loops over all subscribers and calls their callback). Our imaginary system is thread-safe, so we can subscribe or dispatch from any thread. We can also unsubscribe to unregister our subscription.


Unregistration from within a callback

This is probably the most useful one. Let's consider the case where I want to unsubscribe from within the callback but cannot. My callback is called and I can handle the event, but I cannot unsubscribe. This particular callback only wants to receive the first event after registration and no more (an example of this might be a one-shot function that I want to defer until the next frame).

Since I cannot unsubscribe from within the callback, I must arrange for some function to run later, outside of the event dispatch, to unsubscribe me, and I need to ignore any further events that I might receive in the meantime.
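A sketch of that workaround (everything here, including the OneShot name and the subscription handle with an unsubscribe() method, is hypothetical, just to show the extra machinery each callback would need):

```python
class OneShot:
    # Workaround when a system forbids unsubscribing from inside the
    # callback: remember that we fired, ignore later events, and
    # unsubscribe from somewhere outside the callback.
    def __init__(self, action):
        self.action = action
        self.fired = False
        self.subscription = None  # filled in by whoever subscribes us

    def on_event(self, event):
        if self.fired:
            return  # extra events may arrive before we get unsubscribed
        self.fired = True
        self.action(event)

    def cleanup(self):
        # Must be called later, outside any dispatch (e.g. next frame).
        if self.fired and self.subscription is not None:
            self.subscription.unsubscribe()
            self.subscription = None
```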

This makes something that should be simple much more troublesome, and every place that I want to do this sort of action now requires more thought, effort, time, testing, etc. Changing the system to allow un-subscription from within the callback means that the complexity of handling that situation is in a singular place: my event system, rather than each and every callback that needs that behavior.


Don't Hold Locks

Holding a lock while calling a callback is dangerous because the callback can execute arbitrary code. The callback may call back into our event system and we don't want the complication of having recursive locks (which are arguably evil) or contributing to lock cycles which can cause deadlocks.

Writing an event system to be thread-safe and yet not hold any locks while calling callbacks is challenging to say the least. Generally this requires something akin to tracking the number of threads that are in the middle of doing callbacks, at a minimum. We typically do it by adding a thread identifier to a list while holding a lock, and then unlocking to call the callback, then re-acquiring the lock and removing the thread identifier from the list. This is more complicated to be sure, and the reason why will be evident in the next section. There are other ways to do it, but you'll need some tracking to maintain state while the mutex is unlocked for the callback.

This also means that how you walk the list of subscriptions must be stateful in a way that can resume after the mutex is unlocked and re-acquired. If these subscriptions are stored in C++'s std::vector for instance, there exists a possibility that unlocking the mutex will allow another thread to subscribe, which may resize and therefore reallocate the vector, which invalidates all iterators; or another thread may unsubscribe which will remove an entry, which can affect the position of the current thread walking over the vector. (Solutions to these, respectively, might involve using indexes instead of iterators, and leaving null pointers in the array to mark empty spaces).


Strong Unregistration Guarantees

This is by far the hardest issue to solve correctly. Here's a real-world scenario: right before a module unloads (in the main thread, say), it tells our Event system to unsubscribe a previously-subscribed callback. Without a strong guarantee, what will happen if another thread is currently calling our subscribed callback? As noted above, we don't want to be holding locks while calling callbacks, so waiting on a lock isn't going to wait for our callback to finish. What has happened in actual code is that the unsubscribe happens quickly, the module proceeds to unload, and then a crash occurs because another thread was still executing a function that was in the module. Or after unloading, objects were destructed since we would no longer get events, but with another thread in our callback function, a crash occurs because deleted memory was accessed.

Or the problem that I'm fixing right now: memory corruption occurs because a dependent structure was freed, but a race with the callback means that we were still changing things inside the structure.

It quickly becomes apparent that any request to unregister must wait for any other calls to finish. Also, once we return from the unregister function, it stands to reason that we cannot ever call the callback again. To do so would be violating the caller's unregistration request.

This then requires a strong ordering guarantee:
  1. Unregister function is called.
    1. WAIT for any calls to the callback in question to finish*.
    2. PREVENT any future calls to the callback.
  2. Return from the unregister function. At this point the caller can guarantee that our event system is free of the callback and it is safe to destroy structures.
"Ah," the astute and observant amongst you might object, "what if the unregistration comes from within the callback?!" That was our first point, above. If we WAIT for any calls to the callback to finish and the unregistration call is coming from the callback itself, we would wait forever. So we amend the line with the asterisk (*) above to say: "WAIT for any calls to the callback in question in other threads to finish."

So our subscription process is pretty simple:
  1. Take the lock
  2. Append our subscription information to the std::vector (or similar)
  3. Unlock the lock
The notification process is a bit more complicated as it has to synchronize with the unsubscribe process:
  1. Take the lock
  2. Iterate over the std::vector (or similar) by index, skipping null entries. For each:
    1. Add the current thread identifier to a list of threads that are notifying.
    2. Unlock the lock
    3. Call the callback function
    4. Re-acquire the lock. At this point, the vector may have been reallocated and grown, and subsequent entries may have been unregistered. Any other thread could have done anything.
    5. Remove our current thread identifier from the list of threads that are notifying.
    6. Is a waiting flag set on the subscription? If so, notify the waiting thread (i.e. using a condition variable).
  3. Unlock the lock
And now, the pièce de résistance: the unsubscribe process:
  1. Take the lock
  2. Find the subscription to remove. If not found, unlock and return; otherwise:
  3. Set the entry in the std::vector to null, but it is not destroyed yet!
  4. Is the subscription currently being notified by a thread other than the current thread? If so, set a waiting flag on the subscription that is used in 2.6 above.
  5. Unlock the lock.
  6. If the waiting flag was set, wait on the condition variable for notification from 2.6 above. Otherwise it is safe to destroy the subscription and return.
  7. Re-acquire the lock.
  8. If the list of notifying threads contains threads besides the current thread, go to step 5.
  9. Unlock the lock. It is now safe to destroy the subscription.
This allows our unsubscribe to wait until no threads are notifying our subscription (or until the only such thread is the current one, unsubscribing from within its own callback).
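
The three processes above can be sketched in Python (standing in for the C++ described above; a Python list stands in for std::vector, and threading.Condition folds the unlock/wait/re-lock loop of unsubscribe steps 5 through 8 into wait()). This is a sketch, not production code:

```python
import threading

class _Subscription:
    def __init__(self, callback):
        self.callback = callback
        self.notifying = []    # ids of threads currently inside the callback
        self.waiting = False   # an unsubscriber is waiting on the condition

class EventSystem:
    def __init__(self):
        self._cond = threading.Condition()  # owns the lock
        self._subs = []                     # entries: _Subscription or None

    def subscribe(self, callback):
        sub = _Subscription(callback)
        with self._cond:                    # take the lock
            self._subs.append(sub)
        return sub                          # lock released

    def dispatch(self, event):
        me = threading.get_ident()
        with self._cond:
            i = 0
            # Iterate by index: other threads may append (growing the list)
            # or unsubscribe (leaving tombstones) while we're unlocked.
            while i < len(self._subs):
                sub = self._subs[i]
                i += 1
                if sub is None:
                    continue                # tombstone; skip
                sub.notifying.append(me)
                self._cond.release()        # never hold the lock for a callback
                try:
                    sub.callback(event)
                finally:
                    self._cond.acquire()
                    sub.notifying.remove(me)
                    if sub.waiting:
                        self._cond.notify_all()

    def unsubscribe(self, sub):
        me = threading.get_ident()
        with self._cond:
            try:
                idx = self._subs.index(sub)
            except ValueError:
                return                      # not subscribed (anymore)
            self._subs[idx] = None          # tombstone; not destroyed yet
            # WAIT for callbacks running in *other* threads to finish, so
            # the caller may safely tear down its structures on return.
            while any(t != me for t in sub.notifying):
                sub.waiting = True
                self._cond.wait()
```

A real implementation would also recycle the tombstoned entries rather than letting the list grow forever.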

Of course, there is a lot more nuance to this, especially when it comes to handling recursion (dispatching within a callback) or even just the complexity of working with multiple threads, but this should help you understand the concept of Basic Callback Hygiene.