
21 November 2024

Game Hacking: Part Deux


I'm still playing through the Assassin's Creed franchise. I've now finished Rogue and moved on to Unity; although the two games were released on the same day, they are (console) generations apart. They were also built on different Ubisoft game engines: AC:Rogue was the last game on AnvilNext and AC:Unity was the first on AnvilNext 2.0.

Different engines ostensibly have different issues, and older engines can have issues on newer hardware. My previous post describes my issues with AC:Rogue, where the game would stall for seconds at a time with too many CPUs. And I run an AMD Threadripper PRO 5975WX with 32 cores/64 hyperthreads--hardware that was difficult to come by 10-12 years ago when these engines were created.

Turns out that AC:Unity also has problems with too many CPUs--it crashes to desktop immediately after launch!

Has this already been fixed?

The first step was to see if anyone else had run into this problem and whether they'd fixed it. Turns out I was in luck: a Reddit user running the same CPU had the same issue. Unfortunately, the workaround was pretty sucky--disable CPU cores in the BIOS. I definitely don't want to have to reboot before and after playing the game.

Fortunately now we know what the problem is--too many CPUs. The fix in the previous blog post was similar, but since this is the second game I've had to fix in the same way, let's see if we can make a fix that is more generic.

Let's make a generic tool! But how?

Enter Microsoft Detours--an amazing tool for hacking (and even some legitimate uses). Basically, the plan is to create a .DLL file that we can inject into the process to intercept (or hook) some system calls and lie--presenting an alternative set of facts to the game. This .DLL can be used with nearly any game, so we could go back and use it with Rogue to fix it without hacking the assembly language.

In the previous post, we pointed out two system calls (GetSystemInfo and GetNativeSystemInfo) that tell us the number of CPUs, but this is just the tip of the iceberg. It turns out that AC:Unity uses many more system calls than this. And for our generic tool to support all games, we should at least figure out what AC:Unity is using and support those calls.

But where do we draw the line? In the past 20+ years we have gone from single CPU to multiple SMP CPUs, to NUMA. Machines are available these days that have over 64 CPUs. AMD is preparing to launch servers with 256 cores next year (order now!). Now it's unlikely people will run games on these machines, but that was probably the thinking back when 32+ CPU machines were unheard-of. The point is that modern Operating Systems have evolved and added system calls to support NUMA, CPU Groups and CPU Sets. While it's conceivable that games could call some or all of these, let's assume that they won't. But maybe we can set up some logging for these functions to see if they're ever called.

The bigger issue is probably going to be affinity. Simply put, affinity is the set of CPUs that a process (or thread) is allowed to run on. Typically this set is given as a bit mask, since each bit can effectively represent a CPU. Since the number of CPUs was historically small, a 32-bit value was fine, but now we have more than 32 CPUs--a single 32-bit value can no longer represent an affinity mask. Older 32-bit applications typically used a 32-bit value, and newer 64-bit applications generally use 64 bits (but bugs may remain from porting older code, etc.).
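To make that concrete, here's a tiny Python sketch (illustration only--the real masks come from the Windows affinity APIs) of how a bit mask represents CPUs and why 32 bits run out of room:

```python
def affinity_mask(cpus):
    """Build an affinity bit mask from an iterable of CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # one bit per CPU
    return mask

# Eight CPUs (0-7) fit comfortably in 32 bits...
low = affinity_mask(range(8))     # 0xff
# ...but a 64-CPU machine needs every bit of a 64-bit value.
full = affinity_mask(range(64))   # 64 bits set
```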

Intercepting Functions

The Detours wiki has a pretty simple minimal example for instrumenting the Sleep function. Let's alter it for GetSystemInfo, one of the functions that we'll need to intercept. (Source Listing)

This listing will compile into a .DLL that, when loaded by an application, will override calls to GetSystemInfo and allow us to alter the responses.

Once DetourTransactionCommit() is called in our InstallDetours() function our hook is in place:

Now whenever GetSystemInfo is called, our version is called instead. We just happen to call the original and then modify the number of processors:

The next step is that we need to get the game to load our new DLL. Unfortunately, this is easier said than done. Games these days typically don't want hackers messing with them and they go to great lengths to try to prevent that (just as I did when working on my games). AC:Rogue didn't have any protections, but AC:Unity has a multiplayer element to it, and multiplayer games typically make attempts to curb hacking/cheating. Now I'm just trying to get the game to run but it doesn't know that.

Detours has (at least) two tools that can be used. The WithDll tool can launch an application and inject a DLL before it fully loads. Unfortunately, this method didn't work with AC:Unity because of how it's launched. I purchased the game through Steam, so Steam launches AC:Unity, which sees that it wasn't launched from Ubisoft Connect, so it launches Ubisoft Connect and exits; Ubisoft Connect does a few things (cloud sync, for instance) and then re-launches AC:Unity. Since I cannot control how the game is launched from Ubisoft Connect, this didn't pan out. Interestingly, though, the app did launch with the injected DLL.

The other tool is SetDll. This tool can modify an executable file to always load a new DLL. This sounds like exactly what we need! Unfortunately, this didn't work either since AC:Unity has hack-prevention checks that noticed that the executable was modified and refused to run.

Frustrated, I started looking into other ways to inject the DLL. SetDll works on DLLs too, so if I can catch it early enough in the loading process, I could use a different DLL as an injection vector. I fired up Process Monitor and fiddled with the filters to see what the AC:Unity process was doing when it started. Turns out it ships with a DLL that is loaded very early, and that DLL doesn't complain about modifications. Eureka! I have my injection vector.

I added some simple logging (using OutputDebugString that I could view with the DebugView utility) to my simple example to make sure that my function was being called (since AC:Unity also detects debuggers and terminates the process if one is found). Good news--it was being called! But of course, the game was still crashing. So while this was enough to essentially fix AC:Rogue, the newer engine does different things and will need more work.

Affinity Functions

As I mentioned above, the functions that control affinity are probably in use. The Microsoft article about processor groups has a good list of functions that control affinity and scheduling. My first step was to add logging for these functions to see if they were being called--including which arguments AC:Unity passed to them and what they returned. It turns out that several of them were being called. Even though GetSystemInfo (and GetNativeSystemInfo) were intercepted and showing a lower number of CPUs, the affinity functions were happily reporting affinity across a larger number of CPUs.

The next step was to change the behavior of these affinity functions to only return bits set for the processors that we were emulating.
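At its core this is just a bit-wise AND against the emulated mask. A Python sketch (hypothetical names; the real code wraps the Windows affinity APIs in C++):

```python
EMULATED_CPUS = 8
EMULATED_MASK = (1 << EMULATED_CPUS) - 1      # bits 0-7 set

def filter_affinity(host_mask):
    """Clamp a reported affinity mask to the CPUs we are emulating."""
    filtered = host_mask & EMULATED_MASK
    # Never hand back an empty mask--fall back to the full emulated set.
    return filtered if filtered else EMULATED_MASK
```

So a 64-CPU host mask of all ones gets chopped down to 0xFF, and a mask that only covers CPUs outside the emulated set falls back to the whole emulated set rather than returning nothing.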

An interesting function that it turns out AC:Unity was also calling is SetThreadIdealProcessor (and the extended version). This function gives the Windows scheduler a hint about which CPU to schedule the thread on, but it's not as strong as affinity. I also intercepted this function to only allow the processors in our emulated set.

Logical Processor Info

In my work hacking AC:Rogue I discovered that it was optionally looking for and calling a function called GetLogicalProcessorInformation (which also has an extended version). CPUs these days are very complicated: they have multiple layers of caching (some of which are shared with other cores), can share hyperthreads with other cores, not to mention the NUMA relationships. These functions are the means to access this data on Windows, and as you might expect, are quite complicated. The extended version is even more complicated allowing different means of querying data and returning variable-sized information.

It turns out that AC:Unity was also calling these functions, so they were more calls that I had to wrangle.

For the original function, all of the data is presented at once, so the function returns the same data every time it is called (at least, it is highly unlikely to change while the game is running). Therefore, it is easily cached. The first time my intercepted function is called, it queries all of the host data, chops it down to only include our emulated processors, and saves it off; every subsequent call just returns the cached data.
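The caching pattern itself is simple. Here's a Python sketch with made-up data (the real code does this in C++ over the structures returned by GetLogicalProcessorInformation):

```python
_cache = None

def query_host():
    # Stand-in for the real GetLogicalProcessorInformation call: one
    # made-up record per host CPU.
    return [{"cpu": n, "cache_kb": 512} for n in range(64)]

def get_logical_processor_info(emulated_cpus=8):
    """Query the host once, chop to the emulated CPUs, reuse the result."""
    global _cache
    if _cache is None:
        _cache = [e for e in query_host() if e["cpu"] < emulated_cpus]
    return _cache
```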

Unfortunately, the extended function allows specifying a Relationship Type to query, and the data is drastically different based on that query type. My solution here was to cache the chopped-down host data and remember the Relationship Type along with it. So if a different Relationship Type was queried I would throw away the cache and rebuild it. This however was much more complicated because of the variable-sized data. I essentially had to filter the data in place and then copy the filtered data into the cache. For this it's probably better to look at the code directly: (intercept function, cache builder)

Does it run now?

YES! 🎉

In the end, I hooked 12 functions (though three--the processor-group functions--weren't called by AC:Unity and therefore only log).


Finally, I can play the game for a bit.

But... oh no. There's another problem.

The Video Problem

I have a Samsung Odyssey G9 49" monitor. Its aspect ratio is 32:9, and I'm playing a game originally designed to run on consoles that output 16:9. However, the PC port itself beautifully figures out the correct aspect ratio and does pretty well with the super-ultrawide display, as evidenced in the screenshot at the top of this post.

The problem is the full-motion videos that it uses for cutscenes. They are originally designed for 16:9 and are stretched out to 32:9--double the width they should be. It's almost breaking my brain trying to watch them and figure out what's going on.

A very wide Mr. Templar tied to a tree

Can I fix this too?

Bink Video

Bink is a popular video encoder and player software package for games. Developed by RAD Game Tools (now owned by Epic and rechristened as Epic Game Tools), Bink was originally released way back in 1999.

Bink is pretty easy to integrate and very fast. I integrated it into EverQuest 2 back in 2006, so I'm somewhat familiar with the API.

Turns out that Ubisoft uses it in AC:Unity, though a much newer version than the one I used. Fortunately a lot of the API seems to have remained consistent. And all I want to do is prevent the videos from being stretched.

Research

The first step is to figure out how AC:Unity is using Bink. Fortunately I already have the means to do this, just like I did with the CPU Windows system functions--hook the API calls and see how they're being called. Bink is loaded as a DLL, so theoretically I can hook it too.

And the initial tests are promising. I used Microsoft's dumpbin tool to list the exports from the Bink DLL. Pretty much any application is guaranteed to call BinkOpen, so that's the first function that I hook. And testing is quite quick since the first thing AC:Unity shows is the Ubisoft logo video.

Unfortunately, this is where things hit a snag. I've managed to hook a few functions that give me useful information (I can see BinkRegisterFrameBuffers called for instance), but none of the functions with names that sound promising are ever called (BinkCopyToBuffer, BinkCopyToBufferRect, BinkBufferOpen, BinkBufferSetResolution, BinkBufferSetScale). I ended up hooking 20+ of the most interesting sounding functions only to see a tiny few of them being called. How is it accessing the data without calling BinkCopyToBuffer?!?

It's time to get dirty. I can't attach a debugger because of the anti-hack measures (and it will take way too long to reverse engineer those), but I need to see it in a debugger when it's trying to access the data. So I'll make the game hang when it tries to access the framebuffer!

It turns out this isn't too complicated: I register an unhandled exception filter that just sleeps infinitely when a crash happens, and then I call VirtualProtect to tell Windows to make the framebuffer data no-access--any attempt to read or write to it will cause a crash. However, I have to allow Bink to write the data in to the framebuffer, I just want to see where it's read back out, so I have to allow access to the BinkDoFrame function.

As soon as the game hangs (without displaying the first video frame) I use Task Manager to write a full dump. But the dump is confusing. I have no symbols, no function names, nothing to really give me any clues as to what's happening except the assembly code. And it looks like AC:Unity is reading the data directly out of the framebuffers.

That's exactly what it's doing. AC:Unity is calling BinkRegisterFrameBuffers which means it is allocating its own framebuffers and undoubtedly retaining the pointers. BinkDoFrame will decompress the video stream into the framebuffer, and then AC:Unity is reading the data (probably with a shader) to display on the screen.

The video fix

Looking at the allocated framebuffers, they're 1920 x 1080 (16:9 ratio), and their shaders are probably the stretching culprit. I'm not a GPU/graphics guy (ironic given where I work) and I have no idea how to break into the game's shader pipeline. It can be done--there are other mods out there that do it--but it would take a long time to learn. And I want to get back to playing the game, so how do I do it quickly? Is there something I can do to the framebuffers?

It turns out I can! I need AC:Unity to allocate 32:9 framebuffers instead of 16:9, and the first two values in the structure returned from BinkOpen look like width and height. So the first thing I try is scaling that width to 32:9. Since buffers historically were powers of two, a pitch is typically passed to the decoder that says how many bytes represent a line, so that it can skip to the start of the next line (extra bytes would just be off the screen). This helps us out here: the pitch tells Bink to skip about half of each line and start decoding on the next one. It works!
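The arithmetic works out like this (Python, using the numbers from my setup; the width/height interpretation of Bink's structure fields is my assumption):

```python
video_w, video_h = 1920, 1080          # the 16:9 video as encoded
target_aspect = 32 / 9                 # the monitor

# Widen the framebuffer to the target aspect at the same height.
buffer_w = round(video_h * target_aspect)    # 3840
pitch = buffer_w                             # bytes per line of an 8-bit plane

# The decoder writes video_w pixels, then jumps `pitch` bytes to the next
# line--so the right half of every line is simply never written.
unwritten = buffer_w - video_w               # 1920 untouched pixels per line
```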

Better?

Well, kinda. The aspect ratio is correct now, but why is half of the screen green? And it's not centered either--but we can fix that too (probably).

Alright, so the green screen. Encoding video data as RGB is very inefficient. Remember, Bink is a library developed in 1999 and first used on the PlayStation. It uses the YCrCb color space for its video information to save space. AC:Unity likely has a pixel shader that computes RGB values from the YCrCb planes that make up the framebuffers. The buffers start out zeroed, and since the right half of the screen is never written by the decoder, those bytes remain zero. Converting zero values for YCrCb yields RGB color #008700, a quite greenish hue.
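You can check the arithmetic with the common BT.601 limited-range conversion (a sketch only--the game's shader may use slightly different coefficients, which is why the exact shade of green varies a little):

```python
def ycbcr_to_rgb(y, cb, cr):
    """BT.601 limited-range YCbCr -> RGB, clamped to 0-255."""
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.813 * (cr - 128) - 0.392 * (cb - 128)
    b = 1.164 * (y - 16) + 2.017 * (cb - 128)
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(r), clamp(g), clamp(b)

# All-zero bytes decode to a strongly green hue (red and blue clamp to 0).
zero_pixel = ycbcr_to_rgb(0, 0, 0)
# Proper limited-range black is Y=16 with neutral chroma (Cb=Cr=128).
black_pixel = ycbcr_to_rgb(16, 128, 128)
```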

Looking up the value for black in YCrCb and initializing the buffers to those values gets us closer--but not quite black. That reference lists the minimum Y value as 16, which produces a dark gray; trying a Y value of 0 gives us true black.

So the last issue to tackle is the centering. AC:Unity has given Bink the addresses for the framebuffers by calling BinkRegisterFrameBuffers. Bink decodes data into the first line at the start of the buffer, advances the line pointer by pitch bytes and decodes the next line, so on and so forth. If we can tell it to use a different starting address for the buffer that represents a centered frame, it will move the whole image.

This is pretty easily done: calculate the scaled width of our screen, subtract the width of the video, and divide by two to get the per-line offset into the framebuffer. We have to do this for each plane separately, because the Cr and Cb planes are smaller than the Y/A planes. The tricky part is that Bink requires the framebuffer address to be 16-byte aligned, otherwise a hang occurs in the decoder (speaking from experience).
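A Python sketch of the offset calculation (per plane, with the 16-byte alignment Bink requires; numbers from my setup):

```python
def centered_offset(buffer_w, video_w, align=16):
    """Byte offset into a line that centers video_w pixels in buffer_w,
    rounded down to Bink's required alignment."""
    offset = (buffer_w - video_w) // 2
    return offset & ~(align - 1)          # round down to a 16-byte boundary

# Full-resolution Y/A planes...
y_offset = centered_offset(3840, 1920)            # 960
# ...and the half-width Cr/Cb planes.
chroma_offset = centered_offset(3840 // 2, 1920 // 2)  # 480
```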

Voila!

Not so wide now, Mr. Templar!

Source is here. Uploaded to NexusMods and now back to playing the game!


09 October 2024

Game Hacking


I've been playing through the Assassin's Creed franchise in anticipation of Shadows, due out later this year--or, well, now pushed to early next year. I really enjoyed Assassin's Creed IV: Black Flag, but ran into an issue that some other people had too: Kenway's Fleet wouldn't load. Fortunately, someone created a patch that fixed it, and additionally made the initial connection to Ubisoft much faster.

Fast forward a bit and I'm playing Assassin's Creed: Rogue, which is a great game in its own right, but borrows a lot from Black Flag. However, I started running into an issue: five (or more) second stalls where the game would just freeze (audio would keep playing) and then resume as if nothing happened. I could go a minute or so without these freezes, but sometimes it would be only a second or two after the previous freeze. It was nigh unplayable.

Searching around, other people didn't seem to report the same problems that I did. But my PC is pretty unique: it's an AMD Threadripper with 64 logical cores and 128 GB of RAM. That's a lot of hardware for a game released 10 years ago. Looking in Task Manager, I see that the game has about 280 threads. That's a lot! I wondered if the game creates a number of worker threads per logical CPU but perhaps isn't optimized for such a high CPU count. Designing software that runs well fully using 4 CPUs is a lot different from designing software that fully uses 64.

Alright, so my working theory: the game has a contention problem that shows up with tons of threads.

How to test that theory? I fire up one of my favorite profilers: Superluminal. Unfortunately, my ability to seriously debug performance issues is going to be hampered by the fact that I don't have any debugging information (or source code) for the game in question. But Superluminal can tell me what the OS is doing, and what requests the game is making of the OS.

So I fire up Superluminal and tell it to start watching "ACC.exe", the AC:Rogue executable. Sure enough, within a few seconds I get a 5-second stall. After it recovers, I head back to Superluminal to stop the tracing and start looking at it. The stall is clear as day, and I have selected it here:


From this, it's pretty apparent that the main thread is running hard during the stall, and from the function list in the bottom-left corner, I see it's trying to lock and unlock "critical sections", which are Windows mutex-like objects. Note that before and after the stall the main thread spends a decent amount of time synchronizing (the red part of the red/green CPU activity graph). Let's scroll through the other threads and see if anything else is interesting.


Aha! Thread 86036 has a CPU chart that is all blue. This is Superluminal's visual language for "preemption"--that is, the thread wanted to run, but the Windows scheduler decided that running something else was more important. It's also interesting that the other worker threads above are completely blocked on synchronization--they're just waiting during this stall.

Looking through some of the other threads, there's not really much of interest. A few threads are sleeping and waking up periodically. But the main thread is running hard, thread 86036 is preempted, and pretty much everything else stops while this is happening. It's possible that Windows (for some unknown reason) chooses to stop running thread 86036 and runs the main thread instead.

Superluminal gets information about every process running on the system at that time, and the only thing using any CPU is ACC.exe, the game we're profiling.

For additional information, I also captured system state using Event Tracing for Windows by utilizing the UIforETW tool by Bruce Dawson. Looking at the capture I was able to find a different stall in the same thread:


Check out the highlighted line: the thread was ready to run for 4,747,081.6 microseconds (4.7 seconds!) before it ran, and the Readying Thread of -1 indicates that it was in fact preemption. Not immediately apparent from the screenshot (but found by further digging through the UI) is that both the main thread and thread 86036 have the same Ideal CPU (0) and Priority (15).

Unfortunately it's still not super apparent exactly what is causing this. A possibility is affinity--which CPUs a thread is allowed to run on--but it's not clear how to check this. Most threads have 0 as their Ideal CPU, so I'm not sure that this value is actually used by Windows 11. But the Priority values are interesting. Most of the threads in the process have Priority values of 13 and 15, which would correspond to a "High" priority class and "normal" and "time critical" priorities respectively. If both the main thread and thread 86036 share a "time critical" priority, they could affect each other running.

But my initial thought was that 280+ threads is a lot, especially if it's running some or all of them as time-critical priority. Let's see if we can get the game to create fewer threads. Unfortunately, this is easier said than done.

Most compute-intensive programs trying to max-out a machine (like games) will typically create at least one thread per CPU, so maybe we can lie to the game about how many CPUs there are. There are actually several API calls that it could be making to determine this, but the most common one would be GetSystemInfo (side note: modern C++ Runtime implementations use GetNativeSystemInfo, but this function does not appear to be used by ACC.exe). Fortunately the game executable doesn't appear to be cryptographically signed, so we should be able to patch it.

The next step is to disassemble the game so that we can find the places where it might be trying to find out the number of CPUs. For this I used IDA Freeware. While the disassembly is quick, the analysis happens in the background and takes a little while. Once analysis was done, I looked at the Imports and found the GetSystemInfo function. There were several cross-references to places that were calling it (I took the liberty of renaming the functions based on a cursory glance of what they were doing):


Let's take a look at the second entry, the one that I renamed to getSystemInfo2. I was really impressed with IDA here: it understands Windows API structures and was able to figure out how members of the structures were being used:

Note that dwNumberOfProcessors is read into the edx register here and stored in various locations. Instead of reading the value from the structure, I want to lie to the game: I want it to see a hard-coded value of 8 CPUs. So now I need to change the bytes (8B 54 24 50) which is the machine code for "mov edx, [rsp+68h+SystemInfo.dwNumberOfProcessors]" to... something else. Something that just gives me the value of 8 in edx. Unfortunately the free version of IDA doesn't do assembling, so I can't write new code here and expect IDA to build me a new executable.

Instead, we're going to do it the hard way: I need to figure out how to get the value of 8 into edx but in 4 bytes or less. Since IDA won't help me out here, I'll use an online assembler to try some things, and then a hex editor to change the actual bytes.

Well, let's try the obvious instruction: "mov edx, 8"

Unfortunately this results in 5 bytes (ba 08 00 00 00), which is difficult to stuff into 4 bytes. What's interesting is that the full immediate 32-bit value representing "8" is present in the machine code, which makes some sense since we're writing to a 32-bit register.

This gave me an idea. What is the machine code to move 8 into the lowest byte of edx (the dl register)? Turns out this is only 2 bytes (b2 08)! The problem is that the other 24 bits of our edx register are garbage. Maybe we can clear the register with an xor (exclusive-or) instruction? If you xor a register against itself the value is always 0. Let's try that: xor edx, edx--this produced machine code (31 d2).

Perfect, so we have our solution:
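In byte terms (a quick Python sanity check--these are standard x86 encodings):

```python
# The original 4-byte load of dwNumberOfProcessors...
original = bytes.fromhex("8B542450")   # mov edx, [rsp+68h+SystemInfo.dwNumberOfProcessors]
# ...replaced by zeroing edx and setting its low byte to 8.
patch  = bytes.fromhex("31D2")         # xor edx, edx   (edx = 0)
patch += bytes.fromhex("B208")         # mov dl, 8      (low byte = 8)
assert len(patch) == len(original) == 4    # a drop-in overwrite, nothing shifts
```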

Then I'll just use my trusty hex editor plugin for VSCode to search for the preceding bytes and write over the old bytes:

We just need to apply this to each location in our cross-reference list above.

Working through the list, we run into a problem with the function that I've called "getMemoryAndCpuInfo". The dwNumberOfProcessors value is written to a different register: r11d. This is an extended register for 64-bit processors, and the instruction needs 5 machine-code bytes to read the value into it.

Unfortunately, our trick isn't working here. Assembling "xor r11d,r11d; mov r11b, 8" takes 6 bytes.

Let's try a different approach: push and pop the value, which uses the stack. We really don't want to make any stack changes, as moving the stack pointer will cause issues with all of the following instructions that read/write the stack. But if we immediately reset the stack pointer by popping the value, it should be fine. So let's try that:

Perfect! The push/pop is only 4 instruction bytes, so we had to add a "nop" or no-operation instruction (only one byte) to bring us up to our total of 5.
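Byte for byte (again standard x86-64 encodings; the exact stack displacement in the original instruction is illustrative):

```python
original = bytes.fromhex("448B5C2450")  # mov r11d, [rsp+50h] -- 5 bytes (displacement illustrative)
patch  = bytes.fromhex("6A08")          # push 8   (2 bytes)
patch += bytes.fromhex("415B")          # pop r11  (2 bytes)
patch += bytes.fromhex("90")            # nop      (1 byte of padding)
assert len(patch) == len(original) == 5
```

Note that push 8 sign-extends the immediate to a full 64-bit value, so the pop leaves r11 holding exactly 8.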

Our next wrinkle happens in the function that I've dubbed getProcessorTopology. While this function is also calling GetSystemInfo and using dwNumberOfProcessors (easily patched by our previous methods), it's also doing something further:

It's checking to see if Windows is new enough to have a function GetLogicalProcessorInformation and calling it if it does. We don't really want the game to call this function since it ruins our lie about the number of processors. Fortunately, this function is smart enough to handle the situation where that function doesn't exist. So to prevent it from calling this function, we'll just change the name to all X's. It will attempt to find a function called XXXXXX... which obviously doesn't exist. IDA can also confirm for us that this is the only location where that string is used, so we can safely change the name.

Unfortunately we're not out of the woods yet. There is one more function, and it's a doozy:

This function is doing a signed integer divide by the number of processors. None of our tricks thus far will work for this. Let's take a deeper look at what this instruction is doing. It takes a 64-bit value with the most significant 32 bits in edx and the least significant 32 bits in eax and divides by a 32-bit value (in this case, our dwNumberOfProcessors)--eax will receive the quotient and edx will contain the remainder. It does this in only 4 bytes!

However, it doesn't appear to be using the quotient at all, just the remainder. So this is actually doing a modulus by the number of processors. So what we need to do is actually a modulus by our lie of 8 processors. Can we do it in 4 bytes!?

Well, we have a few things going for us. The original function does a cdq (convert doubleword to quadword) instruction to set up for the divide, it also needs eax and edx for the divide but only edx with the result. And since we've chosen a magic power-of-two value of 8, we can achieve a modulus by doing an AND operation with (8 - 1) or 7.

Within reason, we can also move some instructions around as long as the side-effects are the same.

All we really need to do is this, which is only 5 bytes:

And we've already established that we don't need the cdq instruction, since that was part of the divide. So cdq + idiv gives us 5 bytes! All we need to do is move the "mov r8, rbx" one byte earlier, and then we have the 5 contiguous bytes that we need!
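The power-of-two trick is easy to verify in Python (for the non-negative values a CPU-index computation produces; for negative numbers the idiv remainder and a plain AND would differ):

```python
NUM_CPUS = 8                      # our lie; must be a power of two for this trick
assert NUM_CPUS & (NUM_CPUS - 1) == 0   # confirm power of two

# For non-negative values, AND with (n - 1) equals modulus by n.
for value in range(10_000):
    assert value % NUM_CPUS == value & (NUM_CPUS - 1)
```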

That's all the locations calling GetSystemInfo patched up. The question is, does it run?

I fired up the game, and it purrs along like a charm! No more stalls. Seems like the theory of my computer having too many CPUs for an older game trying to take advantage of them was accurate.

I've uploaded the patched executable to Nexus Mods but as of this writing there haven't been any downloads. I guess not many people are trying to play a 10-year-old game with 64+ CPUs! At least I'm enjoying it again 😉



06 July 2024

Fixing a Memory Corruption bug in Python

For Omniverse we use Python for scripting. About a year and a half ago, we upgraded from Python 3.7 to Python 3.10. Shortly thereafter, we started seeing occasional memory corruption issues crop up, but it took a while to actually link the cause to Python.

A lot of the crashes that we were seeing involved very deep and complex Python stack traces in seemingly innocuous things: the garbage collector, releasing a reference, sometimes allocating new memory within Python. Since my team in the Omniverse group handles the plugin system, we manage the plugin that embeds Python into Omniverse, which means I was assigned all of these crashes for triage.

We do have some internal Python experts, and my team wasn't the tip of the spear on the 3.7 -> 3.10 upgrade (though we were involved). The nature of memory corruption means that crashes can happen in very random ways, since any call to acquire or free memory (which happens a LOT) may trigger a crash. So we had 20+ different stack traces that were considered part of the same memory-corruption pathology. Curiously, this was only happening on Windows. Another clue was that quite often we would see an error in the log, sometimes well before the crash occurred:

RuntimeError: <_overlapped.Overlapped object at 0xXXXXXXXXXXXXXXXX> still has pending operation at deallocation, the process may crash.

"Helpful!" I thought to myself sarcastically. Well, the process was in fact crashing, so this may be an indication as to what was happening.

However, this crash was fairly low priority. It wasn't happening often per se, but it was annoying. I made a task to track all of the different automatically generated crash stack traces, and periodically tried to find someone who could look into it. Since it was Windows-only and rare, it never percolated up to the top of anyone's priority queue.

Months passed and I continued to be irritated when another new stack trace would pop up. Everyone, including me, was busier with higher priority tasks. Since no one else could be bothered to fix it, I eventually decided to take a crack at it.


"Overlapped" I/O?

Typically when you're writing a program and want to do a network request, read a file, or do any other sort of input/output, the easiest thing is to stop everything and wait, because you need the results before continuing. But in the world of 60+ FPS, stopping and waiting for even a few milliseconds can feel sluggish. We instead want to do things asynchronously--have them happen in the background while the program does other things. One way of doing this is parallelization, that is, using a separate thread or CPU to do the work, but this can get messy. Also, Windows is old and computers used to have just one CPU, so a long time ago Windows started handling asynchronicity through concurrency, with the operating system interleaving all of the requests. "Overlapped" I/O is how Windows did that: multiple I/O operations can be in flight at a time, hence they overlap each other.

The problem is, it's pretty complicated. And even more so in an environment like Python where implementation requires both C code and Python code.


Reproducing the problem

The first step was to try to reduce the problem to a minimum so that it was easier to debug. I also knew that we started having this problem when we moved from Python 3.7 to 3.10, so I looked at the changes to see what might be different. I noticed this in the "what's new in Python 3.8" article:
On Windows, the default event loop is now ProactorEventLoop. (Contributed by Victor Stinner in bpo-34687.)

It turns out that the Windows and Linux asyncio event loops are implemented quite differently, which makes sense because asynchronous I/O works quite differently on Windows and Linux. Starting with Python 3.8, asyncio uses overlapped I/O by default on Windows, through the Proactor event loop.
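As an aside, that default is easy to override. A Windows program that wants the old `select()`-based behavior back can install the selector policy explicitly (a sketch; the `if` makes it a no-op on other platforms):

```python
import asyncio
import sys

# Since Python 3.8, Windows defaults to ProactorEventLoop, which is built
# on I/O completion ports and overlapped I/O. Programs that need the old
# selector-based loop can still opt out like this:
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

loop = asyncio.new_event_loop()
loop.close()
```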

Omniverse was also doing something different from most Python applications: constantly stopping and restarting the event loop. See, in most Python applications the program hands control to asyncio until the program finishes; asyncio then dispatches everything the application does. Since Omniverse uses embedded Python, we still have a graphical application that needs to run, so on each frame of the graphical application we would run the asyncio event loop for one "tick" and then stop it.

I wrote a simple app that emulated that behavior, but I couldn't get it to crash, so the stop/run idiom alone wasn't enough. We had a sample case that could produce the Overlapped error message, typically within 5 minutes (an eternity when you're waiting for it), so I looked at what it was doing. After much debugging I found another clue: other threads were calling call_soon_threadsafe on the event loop. I added that to my test program. BINGO. The resulting program had an 80% chance of printing the error message within 60 seconds:
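The listing below is a reconstruction of that test program (the original isn't shown here; names, timings, and structure are mine). On Windows with the Proactor loop on an affected Python version, this is the shape of program that triggered the "pending operation at deallocation" message; elsewhere it simply exercises the same idiom:

```python
import asyncio
import threading
import time

def main(duration=60.0):
    loop = asyncio.new_event_loop()
    stop = threading.Event()

    def waker():
        # Ping away scheduling a no-op on the loop from another thread,
        # which forces a wakeup through the loop's self-reading machinery.
        while not stop.is_set():
            try:
                loop.call_soon_threadsafe(lambda: None)
            except RuntimeError:
                pass  # the loop may be shutting down

    thread = threading.Thread(target=waker)
    thread.start()

    ticks = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            # The Omniverse idiom: run the event loop for one "tick",
            # then stop it and return control to the (graphical) app.
            loop.call_soon(loop.stop)
            loop.run_forever()
            ticks += 1
    finally:
        stop.set()
        thread.join()
        loop.close()
    return ticks

if __name__ == "__main__":
    print(f"completed {main()} ticks")
```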

This program starts a thread that pings away, scheduling a dummy function to run on the next loop iteration. Its real purpose is to send a wakeup signal to the event loop. The main thread constantly does our stop/run idiom, like Omniverse, so that it never hands control to the event loop permanently.


But does it corrupt memory?

The error message comes from the implementation of _overlapped, which is written in C and acts as a bridge between the Python code for the asyncio event loop and the Windows API calls that manage overlapped I/O. Let's look at what happens before that message is printed.

When the run_forever call is cleaning up, it tries to cancel the "_self_reading_future", which in turn calls into the C code for _overlapped and invokes a Windows API function called CancelIoEx.

Then the run_forever function unregisters the 'overlapped' object (ov) and sets the future to None, effectively destroying the object.

This is where the danger arises! The OVERLAPPED structure (managed by the _overlapped Python object) was handed to the Windows API when the I/O operation was requested (in this case, a read on the socketpair used to wake the event loop from other threads) and must not be deallocated until Windows reports that it's done with it!

The deallocation function attempts to check this, and reports the error message if Windows isn't done with it--but then it goes on to deallocate the structure anyway, which can lead to memory corruption!


A better (though still not ideal) alternative would be to report the error and leak the OVERLAPPED structure so that a crash can't occur. But the best option is to see if we can fix the underlying problem.

With a reproducible case and an understanding of what was going on, I filed a Python issue.


Can we fix it?

I put the initial diagnosis into the Python issue: CancelIoEx is properly called and returns TRUE, meaning that Windows accepted the cancellation request (NOT that the I/O is cancelled yet!), but the deallocation function sees that the overlapped I/O hasn't completed. It tries to call CancelIoEx again (which always fails with ERROR_NOT_FOUND, since the previous cancel request succeeded), and then calls GetOverlappedResult (with wait=false, because the second CancelIoEx attempt failed), which also fails with ERROR_IO_PENDING. As a test I tried changing it to wait, but that just caused a hang.

What gives? Originally I thought this was a bug in Windows' handling of overlapped I/O, but then I tried something else. Not only is this system using overlapped I/O, it's using another complex Windows-ism: I/O completion ports. Completion ports are a way to group multiple I/O requests together, so that instead of checking myriad individual requests, you check one "completion port" to find out when any of your requests finishes. What if the cancelled I/O request was stuck in the completion port? Instead of waiting on GetOverlappedResult, maybe I needed to pull the request out with GetQueuedCompletionStatus.

BINGO.

Turns out that was the case: it really depends on where within the labyrinthine Windows I/O ecosystem the I/O request ended up before the cancel request happened.

So now we could make an "official" fix recommendation. Large projects like Python generally want fixes from outsiders like me to be as simple and elegant as possible, which meant I would have better luck not trying to overhaul how their overlapped I/O worked. All we needed was to make sure that GetQueuedCompletionStatus was called on our _overlapped object again, and that gave us a goal: GetQueuedCompletionStatus was already called every time the event loop ran, and it already cleaned up finished _overlapped objects once they completed. So we only allow cleanup after cancelling if the I/O has actually completed; otherwise the object gets cleaned up automatically during the next event loop iteration.

So the fix came down to changing a single condition in that cleanup path.
Done and dusted!
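Paraphrasing from memory, the guard works like the toy model below. The names (`_unregister`, the `pending` attribute) follow CPython's asyncio internals, but treat this as a sketch of the idea rather than the actual patch:

```python
# Toy model of the cleanup-path fix. In real asyncio, ov is an
# _overlapped.Overlapped whose .pending flag is true while Windows still
# owns the OVERLAPPED buffer; _unregister queues it for deallocation.

class FakeOverlapped:
    def __init__(self, pending):
        self.pending = pending

class FakeProactor:
    def __init__(self):
        self.unregistered = []
    def _unregister(self, ov):
        self.unregistered.append(ov)

def cleanup(proactor, ov):
    # Before the fix:  if ov is not None:
    # After the fix:   if ov is not None and not ov.pending:
    # A still-pending overlapped is left registered so the next
    # GetQueuedCompletionStatus pass can reap it safely.
    if ov is not None and not ov.pending:
        proactor._unregister(ov)

p = FakeProactor()
cleanup(p, FakeOverlapped(pending=True))   # left for the next loop iteration
cleanup(p, FakeOverlapped(pending=False))  # safe to release now
```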

Of course, for the official fix I added a unit test and a bunch of comments since Overlapped I/O and I/O Completion Ports are frustratingly touchy.

As a bonus, Guido van Rossum (creator of Python) commented positively on the fix, and eventually we got the fix merged back as far as Python 3.8--the oldest supported version (for security fixes), but also the first version where the ProactorEventLoop became the default.

06 February 2008

Haxx0ring 4tw

When I heard that Hoglund and McGraw were writing a book about Exploiting Online Games, I just had to have a copy. It's the best book purchase I've made this year, and I think it's a great book for my crowd, the MMO game developers.

I've written on this topic before, and I feel very strongly that the game development industry doesn't give enough thought to the black hats who will be on your system.

That being said, not all hackers are malicious. When I first started hacking Ultima Online in 2002, it was mostly out of curiosity. I drank up as much information about UO as I could find. My hacks were never malicious in nature; I never tried to crash the server or cheat other players or anything. I was just a programmer with a love of the game and a penchant for the low level workings of machines. And I was getting bored trying to tame a bajillion animals to get my skill up.

I discovered for myself most of the techniques that Hoglund and McGraw talk about in their book. My first tool was a keyboard/mouse macro program. Then I got curious about what kind of stuff was going over the network, so I set out trying to get at the data. As expected, it was encrypted. Instead of trying to crack the encryption, I looked for something easier: how do I make the client hand me the decrypted info? I eventually produced a DLL that I could inject into the client, where it would rewrite some machine code and forward the messages to me. Voila! From there the tools got really interesting. In no time I had a packet logger, apps that pointed me to hidden objects on the ground, and a simple macro program that made the aforementioned taming much easier.

After reading through the book I dug out my old code and sat down with a blanket by the fire for a good read. The code wasn't pretty but the things it was doing were pretty amazing.

But now I work on the other side of the fence. I have to think about why people want to hack and how it affects everything. It all comes down to money--that and an impatient, fast-paced culture. Simple economics teaches us about supply and demand: if there were no demand for gold, the gold farmers would move on to something else. You can detect, ban, and threaten farmers until you're blue in the face; that's where the money is, so that's what they'll be doing. Plus, there will always be more of them than there are of you. These are points you learn in the industry, but Hoglund and McGraw spell them out explicitly.

There's some stuff you can do to mitigate the circumstances: don't rely on the client for anything, and watch for irregularities on the server (economy logs, inhuman playing time, repetitive behaviors, etc.). But I think there's another point that not many people discuss: you can make a difference in farming through your game design. If you design your game so that there is no player trading, you've effectively cut off the farming. But not being able to sell anything to other players has a problem: it isn't fun. The trick is striking a balance, and we haven't seen it struck yet.

If you are interested in how MMOs work and think about hacking software, this is the book for you. If you are a programmer in this industry, this book is definitely for you. Go buy it!

24 November 2007

A Decade of Ultima Online

Recently, Ultima Online celebrated its Tenth Anniversary. UO will always hold a special place in my heart because it was my first MMO, both as a player and as a developer.

As part of the celebration, IGN has an article and speedman (my dev name on UO) got a mention.

I can't explain why, but I always look at my memories in UO through rose-colored glasses. I remember playing back in 1997 and getting PK'd all the time as I mined ore (which was always gone due to the now-defunct resource bank system). I took some time off for a few years but eventually found my way back. A lot had changed in my time away, but I found a good group of friends to help me through the transition.

UO was a sandbox in many ways. Origin Systems provided the world but the players were the content. There were no predefined quests, no raid encounters and no leveling. There were only a handful of skills and spells and a wide-open world. When I wanted to gain a higher rank in the guild (which wasn't defined in the game mechanics), my mentor sent me on a quest (again, not defined in the game mechanics) which I had to document. It was the most unique experience I've ever had in an MMO and I passed it down to other guildmates when I became their mentor. As a developer, I also had the opportunity to chronicle it for the players. The story of my guild quest is now available as two in-game books written by Autenil.

UO was also my break into the game industry. I was a bit of a hacker at the time and wrote a few utilities for UO. I never distributed them to anyone but my closest friends; it was mostly a learning experience. The crown of my collection was a DLL that would attach itself to the UO client and re-write part of the software to pass network messages to my applications after the client had decrypted them. All of my tools were centered around this hook DLL. I had message loggers, information sniffers, and a few useful tools. Yes, I even had tools that would make it easier to gain skills by automating some tedious tasks (my animal taming tool would keep retrying until something was tamed, rename it so that you wouldn't tame it again, and release it).

Anyways, the company I worked for at the time sent me on a business trip to Austin, TX--which just happened to be home to Origin Systems, the very headquarters of UO. It turned out they were looking for a server programmer, and I had passed the tech test with flying colors. Since I was going to be in Austin anyway, I arranged a meeting with the UO dev team. Over three hours at the Cheesecake Factory they grilled me about programming and MMOs (some of my old friends might remember that night). They were having bandwidth problems with their latest expansion, and I (thanks to my tools) knew what the main problem was. I told them that, if hired, I would fix the bandwidth problem within two weeks. They hired me and I fixed the problem within the first few days. Within two weeks the fix was live and players were rejoicing.

The development environment that UO required was challenging. The servers would only build and run on Linux and we used GDB for our debugging. All programmers and designers had and maintained their own Linux boxes. Getting a server up and running required a spider web of filesystem links and a delicate matrix of configuration files and scripts (of several different formats). Still, I loved the proprietary script language known as Wombat and the badly-written C++ code.

I don't work on or really even play UO anymore, and I don't always (ever?) agree with what Electronic Arts/Mythic does with the game, but I still think about the game (and the friends I made working on it and playing it) often.

Congratulations, UO. Here's to another decade. At least.

24 March 2007

Deconstructing the XBox security system

This is a great video that talks about the original XBox, its security shortcomings, and how hackers eventually got it to run Linux. Over an hour, but very entertaining for geeks like myself.

10 March 2007

Hackalicious

The more I beta-test/play/develop MMOs, the more I realize that developers still cut corners when it comes to security. If you're working on an MMO, especially before launch, it's very important to take a look at your game through the eyes of a hacker.

I've worked on hardening some games to the hacker types (and let's admit it: I was one myself once), so here's my short primer on MMOs and hacking:
  • Don't ship your game with the .PDB file! Obvious? Maybe, but not to some.
  • Turn off RTTI. Why are you using dynamic_cast anyways? It supports crazy things like casting sideways across multiple inheritance, and it's slow to boot. Hackers love RTTI, though, because it causes the executable to include the mangled names of pretty much everything. If you need a dynamic typesafe casting solution, search the web or get a book--or use virtual functions, ugly as that can be.
  • Take a look at your client in a hex editor, especially strings. You might be surprised at what you find. Those assert lines can actually contain a ton of useful information, like source file names and line numbers, not to mention C++ code itself. Or you might just find that you're including the class names of every message type in the game. Yes, it's been done.
  • Beware large areas of [un]initialized memory. Hackers love it when your program already contains allocated space that they are free to write their code into. UO had a bunch of these before I cleaned them up, the largest of which was a 640x480 graphics buffer that was never fully used. Use the heap and allocate lazily. If possible, free it as soon as you're done with it.
  • Change message IDs frequently. If you can do it right (and unpredictably), this can work well to foil those who like to read the network traffic.
  • Log stuff to the server, but don't take action on it right away. You don't want to do anything that lets a hacker know that you're onto them. Check with your legal department to find out what you can report, but it generally shouldn't be bad to do something like CRC the client in memory and send back a pass/fail response to the server. Or report if they're using a debugger, though this can be challenging. Unfortunately, uploading their modifications for review is anathema.
  • Watch your encryption keys. Yes, EQ2 actually held the encryption key in plaintext in a static buffer after negotiating it. It's fixed now, but hackers rejoiced!
  • Verify user data. UO has an immediate request/response targeting system. The server sends a request to the client with an ID which the client returns unmolested to the server. But what happens when a hacker figures out how to make use of changing that ID? You get the insurance bug.
  • Beware of unreleased/undocumented content. Again with UO, there were quite a few "gumps" that were in half-implemented features that still got pushed to live servers. A hacker figured out how to make use of one of these and ended up teleporting half the players on the server to a single X,Y location. Use source control branches or some configuration system to make sure this untested stuff can't be used. Period.
  • Don't make assumptions. You've heard the clichés so I'll spare you. Just because a normal user can't see something (like a UI window or network message format) doesn't mean they can't exploit it.
  • Movement. A notoriously hard area for MMOs, especially 3D ones; it deserves more seriousness, at least in the way of logging. Servers can at least sanity-check movement without having to run simulations of every client.
  • Buffer overruns/remote code execution. Yes, it can happen to you. Watch how you're using the string functions, especially the varargs ones (sprintf, sscanf, etc) and wide functions. The discussion on this one topic alone could be (and has been) a lecture in itself, so I'll just say: be aware of it.


You can never fully stop people from hacking on your game, but you can make it difficult (and you should). Then there's the other side of the coin: banning the people you catch hacking. Usually this is financially advantageous: if a hacker is turning people off of your game, it's probably not just one person they're driving away.

Unfortunately, maintaining hack resistance and detection can be a full-time job, especially if your game is well established and there are many hack programs out for it. In such cases, it might be better to look at using something like PunkBuster.

This is definitely an area where MMO developers should collaborate, since none of us want the hackers :)
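To end on something concrete: the server-side sanity checking mentioned above doesn't need a full physics simulation. Even a crude speed cap catches the worst teleport and speed hacks. A sketch (the cap, units, and function names here are invented for illustration):

```python
import math

MAX_SPEED = 8.0  # world units per second -- a made-up cap for illustration

def plausible_move(prev, curr, dt):
    """Server-side sanity check: no legal client can move faster than the
    fastest mount/spell allows, so flag (and log) anything quicker."""
    if dt <= 0:
        return False
    return math.dist(prev, curr) / dt <= MAX_SPEED

# Per the logging advice above: record the violation, don't act on it yet.
assert plausible_move((0.0, 0.0), (4.0, 3.0), 1.0)        # 5 units/s: fine
assert not plausible_move((0.0, 0.0), (80.0, 60.0), 1.0)  # 100 units/s: flag it
```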