I've been playing through the Assassin's Creed franchise in anticipation of Shadows due out later this year, or well, now pushed to early next year. I really enjoyed Assassin's Creed IV: Black Flag, but ran into an issue that some other people had too: Kenway's Fleet wouldn't load. Fortunately, someone created a patch that fixed it, and additionally made initial connection to Ubisoft much faster.
Fast forward a bit and I'm playing Assassin's Creed: Rogue, which is a great game in its own right, but borrows a lot from Black Flag. However, I started running into an issue: five (or more) second stalls where the game would just freeze (audio would keep playing) and then resume as if nothing happened. I could go a minute or so without these freezes, but sometimes it would be only a second or two after the previous freeze. It was nigh unplayable.
Searching around, I couldn't find other people reporting the same problem. But my PC is pretty unusual: it's an AMD Threadripper with 64 logical cores and 128 GB of RAM. That's a lot of hardware for a game released 10 years ago. Looking in Task Manager, I see that the game has about 280 threads. That's a lot! I wondered if the game creates a number of worker threads per logical CPU but isn't optimized for such a high CPU count. Designing software that runs well while fully using 4 CPUs is very different from designing software that fully uses 64.
Alright, so my working theory: the game has a contention problem that shows up with tons of threads.
How to test that theory? I fire up one of my favorite profilers: Superluminal. Unfortunately, my ability to seriously debug performance issues is going to be hampered by the fact that I don't have any debugging information (or source code) for the game in question. But Superluminal can tell me what the OS is doing, and what requests the game is making of the OS.
So I fire up Superluminal and tell it to start watching "ACC.exe", the AC:Rogue executable. Sure enough, within a few seconds I get a 5-second stall. After it recovers, I head back to Superluminal to stop the trace and start looking at it. The stall is clear as day and I have selected it here:
From this, it's pretty apparent that the main thread is running hard during the stall, and from the function list in the bottom-left corner, I see it's trying to lock and unlock "critical sections" which are Windows mutex-like objects. Note that before and after the stall the main thread spends a decent amount of time synchronizing (the red part of the red/green CPU activity graph). Let's scroll through the other threads and see if we see anything else interesting.
Aha! Thread 86036 has a CPU chart that is all blue. This is Superluminal's visual language for "preemption": the thread was running, but the Windows scheduler decided that running something else was more important. It's also interesting that the other worker threads above are completely blocked on synchronization. They're just waiting during this stall.
Looking through some of the other threads there's not really much that is interesting. A few threads are sleeping and waking up periodically. But the main thread is running hard, this thread 86036 is preempted, and pretty much everything else stops while this is happening. It's possible that Windows (for some unknown reason) chooses to stop running this thread 86036 and instead runs the main thread.
Superluminal gets information about every process running on the system at that time, and the only thing using any CPU is ACC.exe, the game we're profiling.
For additional information, I also captured system state using Event Tracing for Windows by utilizing the UIforETW tool by Bruce Dawson. Looking at the capture I was able to find a different stall in the same thread:
Check out the highlighted line: the thread was ready to run for 4,747,081.6 microseconds (4.7 seconds!) before it ran, and the Readying Thread of -1 indicates that it was in fact preemption. Not immediately apparent from the screenshot (but visible after further digging through the UI) is that both the main thread and thread 86036 have the same Ideal CPU (0) and Priority (15).
Unfortunately it's still not super apparent exactly what is causing this. A possibility is affinity--which CPUs a thread is allowed to run on--but it's not clear how to check this. Most threads have 0 as their Ideal CPU, so I'm not sure that this value is actually used by Windows 11. But the Priority values are interesting. Most of the threads in the process have Priority values of 13 and 15, which would correspond to a "High" priority class and "normal" and "time critical" priorities respectively. If both the main thread and thread 86036 share a "time critical" priority, they could interfere with each other's scheduling.
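For reference, here is a small Python sketch of the base priority levels that Microsoft's "Scheduling Priorities" documentation lists for threads in a HIGH_PRIORITY_CLASS process. The helper name `names_for_level` is made up for illustration. One thing it makes visible: within this class, level 15 is shared by both "highest" and "time critical", so the number alone doesn't say which of the two the game requested.

```python
# Base priority levels for threads in a HIGH_PRIORITY_CLASS process,
# per Microsoft's "Scheduling Priorities" documentation.
HIGH_PRIORITY_CLASS_LEVELS = {
    "THREAD_PRIORITY_IDLE": 1,
    "THREAD_PRIORITY_LOWEST": 11,
    "THREAD_PRIORITY_BELOW_NORMAL": 12,
    "THREAD_PRIORITY_NORMAL": 13,
    "THREAD_PRIORITY_ABOVE_NORMAL": 14,
    "THREAD_PRIORITY_HIGHEST": 15,
    "THREAD_PRIORITY_TIME_CRITICAL": 15,
}


def names_for_level(level):
    """Return every thread priority name that maps to the given base level."""
    return sorted(
        name for name, value in HIGH_PRIORITY_CLASS_LEVELS.items() if value == level
    )


if __name__ == "__main__":
    for level in (13, 15):
        print(level, names_for_level(level))
```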
But my initial thought was that 280+ threads is a lot, especially if it's running some or all of them as time-critical priority. Let's see if we can get the game to create fewer threads. Unfortunately, this is easier said than done.
Most compute-intensive programs trying to max out a machine (like games) will typically create at least one thread per CPU, so maybe we can lie to the game about how many CPUs there are. There are actually several API calls that it could be making to determine this, but the most common one would be GetSystemInfo (side note: modern C++ Runtime implementations use GetNativeSystemInfo, but this function does not appear to be used by ACC.exe). Fortunately the game executable doesn't appear to be cryptographically signed, so we should be able to patch it.
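To make the target concrete, here is a hedged Python sketch of the SYSTEM_INFO structure that GetSystemInfo fills in, declared with ctypes. The field layout follows the documented Windows structure; the helper `reported_cpu_count` is my own name and only calls the API when running on Windows. On a 64-bit build, dwNumberOfProcessors lands 32 bytes (20h) into the structure.

```python
import ctypes
import sys


class SYSTEM_INFO(ctypes.Structure):
    """Mirror of the Windows SYSTEM_INFO layout (union flattened to two WORDs)."""
    _fields_ = [
        ("wProcessorArchitecture", ctypes.c_uint16),
        ("wReserved", ctypes.c_uint16),
        ("dwPageSize", ctypes.c_uint32),
        ("lpMinimumApplicationAddress", ctypes.c_void_p),
        ("lpMaximumApplicationAddress", ctypes.c_void_p),
        ("dwActiveProcessorMask", ctypes.c_size_t),  # DWORD_PTR
        ("dwNumberOfProcessors", ctypes.c_uint32),
        ("dwProcessorType", ctypes.c_uint32),
        ("dwAllocationGranularity", ctypes.c_uint32),
        ("wProcessorLevel", ctypes.c_uint16),
        ("wProcessorRevision", ctypes.c_uint16),
    ]


def reported_cpu_count():
    """Return dwNumberOfProcessors from GetSystemInfo, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    info = SYSTEM_INFO()
    ctypes.windll.kernel32.GetSystemInfo(ctypes.byref(info))
    return info.dwNumberOfProcessors


# Byte offset of the field a caller reads after GetSystemInfo returns.
CPU_FIELD_OFFSET = SYSTEM_INFO.dwNumberOfProcessors.offset

if __name__ == "__main__":
    print("offset of dwNumberOfProcessors:", CPU_FIELD_OFFSET)
    print("reported CPUs:", reported_cpu_count())
```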
The next step is to disassemble the game so that we can find the places where it might be trying to find out the number of CPUs. For this I used IDA Freeware. While the disassembly is quick, the analysis happens in the background and takes a little while. Once analysis was done, I looked at the Imports and found the GetSystemInfo function. There were several cross-references to places that were calling it (I took the liberty of renaming the functions based on a cursory glance of what they were doing):
Note that dwNumberOfProcessors is read into the edx register here and stored in various locations. Instead of reading the value from the structure, I want to lie to the game: I want it to see a hard-coded value of 8 CPUs. So now I need to change the bytes (8B 54 24 50) which is the machine code for "mov edx, [rsp+68h+SystemInfo.dwNumberOfProcessors]" to... something else. Something that just gives me the value of 8 in edx. Unfortunately the free version of IDA doesn't do assembling, so I can't write new code here and expect IDA to build me a new executable.
Instead, we're going to do it the hard way: I need to figure out how to get the value of 8 into edx but in 4 bytes or less. Since IDA won't help me out here, I'll use an online assembler to try some things, and then a hex editor to change the actual bytes.
Well, let's try the obvious instruction: "mov edx, 8"
Unfortunately this results in 5 bytes (ba 08 00 00 00), which is difficult to stuff into 4 bytes. What's interesting is that the full immediate 32-bit value representing "8" is present in the machine code, which makes some sense since we're writing to a 32-bit register.
This gave me an idea. What is the machine code to move 8 into the lowest byte of edx (the dl register)? Turns out this is only 2 bytes (b2 08)! The problem is that the other 24 bits of our edx register are garbage. Maybe we can clear the register with an xor (exclusive-or) instruction? If you xor a register against itself the value is always 0. Let's try that: xor edx, edx--this produced machine code (31 d2).
Perfect, so we have our solution:
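As a quick sanity check on the byte counts, here is a small Python sketch that compares the candidate encodings quoted above against the 4-byte slot. The dictionary and helper names are mine; the byte values are the ones from the online assembler. One caveat that this sketch doesn't model: unlike mov, xor also modifies the flags register, so this swap assumes nothing after it depends on the earlier flags.

```python
# Machine code of the instruction being replaced:
#   mov edx, [rsp+68h+SystemInfo.dwNumberOfProcessors]
ORIGINAL = bytes.fromhex("8b 54 24 50")

# Candidate replacements and their encodings.
CANDIDATES = {
    "mov edx, 8": bytes.fromhex("ba 08 00 00 00"),
    "xor edx, edx; mov dl, 8": bytes.fromhex("31 d2") + bytes.fromhex("b2 08"),
}


def fits(replacement, original=ORIGINAL):
    """True when the replacement is no longer than the instruction it overwrites."""
    return len(replacement) <= len(original)


if __name__ == "__main__":
    for text, code in CANDIDATES.items():
        verdict = "fits" if fits(code) else "too long"
        print(f"{text:28} {code.hex(' '):16} {len(code)} bytes, {verdict}")
```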
Then I'll just use my trusty hex editor plugin for VSCode to search for the preceding bytes and write over the old bytes:
We just need to apply this to each location in our cross-reference list above.
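I did this by hand in a hex editor, but the same search-and-overwrite step could be scripted. The following is a minimal Python sketch under my own assumptions: `patch_bytes` is a hypothetical helper, and `prefix` stands for whatever preceding bytes you pick to make the match unique. It refuses to patch if the replacement changes the length or if the pattern isn't found exactly once.

```python
def patch_bytes(image: bytes, prefix: bytes, old: bytes, new: bytes) -> bytes:
    """Overwrite `old` with `new` where `old` directly follows `prefix`.

    The replacement must be the same length so nothing else in the file moves.
    """
    if len(new) != len(old):
        raise ValueError("replacement must be exactly as long as the original")
    needle = prefix + old
    start = image.find(needle)
    if start < 0:
        raise ValueError("pattern not found")
    if image.find(needle, start + 1) >= 0:
        raise ValueError("pattern is ambiguous; use a longer prefix")
    offset = start + len(prefix)
    return image[:offset] + new + image[offset + len(old):]


if __name__ == "__main__":
    # Synthetic stand-in for a slice of the executable (not real game bytes).
    demo = bytes.fromhex("90 90 cc cc 8b 54 24 50 c3")
    patched = patch_bytes(
        demo,
        prefix=bytes.fromhex("cc cc"),
        old=bytes.fromhex("8b 54 24 50"),
        new=bytes.fromhex("31 d2 b2 08"),
    )
    print(patched.hex(" "))
```

For a real file, the idea would be to read the executable with `pathlib.Path.read_bytes`, call the helper once per call site, and write the result to a copy rather than over the original.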
Working through the list, we run into a problem with the function that I've called "getMemoryAndCpuInfo". The dwNumberOfProcessors value is written to a different register: r11d. This is an extended register for 64-bit processors and needs 5 machine code bytes to read the value into it.
Unfortunately, our trick isn't working here. Assembling "xor r11d,r11d; mov r11b, 8" takes 6 bytes.
Let's try a different approach: push and pop the value, which uses the stack. We really don't want to make any stack changes as moving the stack pointer will cause issues with all of the following instructions that read/write from the stack. But if we immediately reset the stack pointer by popping the value, it should be fine. So let's try that:
Perfect! The push/pop is only 4 instruction bytes, so we had to add a "nop" or no-operation instruction (only one byte) to bring us up to our total of 5.
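Here is a hedged Python sketch of the two things that matter for this variant: the byte budget and the stack pointer. The encodings are the standard x86-64 ones for these instructions; `run_push_pop` is a toy model I wrote for illustration, not an emulator. It shows that the push and pop cancel out, leaving the stack pointer where it started with 8 in the register.

```python
SLOT_SIZE = 5  # bytes occupied by the original load into r11d

R11_CANDIDATES = {
    # REX prefixes make every r11 instruction one byte longer.
    "xor r11d, r11d; mov r11b, 8": bytes.fromhex("45 31 db 41 b3 08"),
    # push imm8 (2 bytes) + pop r11 (2 bytes) + nop padding (1 byte).
    "push 8; pop r11; nop": bytes.fromhex("6a 08 41 5b 90"),
}


def run_push_pop(rsp, value):
    """Toy model of `push value; pop r11`: returns (final rsp, final r11)."""
    stack = {}
    rsp -= 8               # push: move the stack pointer down one 64-bit slot
    stack[rsp] = value     # ...and store the immediate there
    r11 = stack[rsp]       # pop: load the slot into r11
    rsp += 8               # ...and move the stack pointer back up
    return rsp, r11


if __name__ == "__main__":
    for text, code in R11_CANDIDATES.items():
        print(f"{text:30} {len(code)} bytes, fits={len(code) <= SLOT_SIZE}")
    print(run_push_pop(0x1000, 8))
```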
Our next wrinkle happens in the function that I've dubbed getProcessorTopology. While this function is also calling GetSystemInfo and using dwNumberOfProcessors (easily patched by our previous methods), it's also doing something further:
It's checking to see if Windows is new enough to have a function GetLogicalProcessorInformation and calling it if it does. We don't really want the game to call this function since it ruins our lie about the number of processors. Fortunately, this function is smart enough to handle the situation where that function doesn't exist. So to prevent it from calling this function, we'll just change the name to all X's. It will attempt to find a function called XXXXXX... which obviously doesn't exist. IDA can also confirm for us that this is the only location where that string is used, so we can safely change the name.
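The rename can be sketched the same way. This Python snippet is a minimal illustration, assuming the name is stored in the executable as a NUL-terminated ASCII string; `blank_out_name` is a hypothetical helper. Matching on the terminator keeps it from touching a longer name that merely starts with the same text, and the overwrite keeps the length and the terminator intact.

```python
def blank_out_name(image: bytes, name: bytes, filler: bytes = b"X") -> bytes:
    """Overwrite a NUL-terminated name with filler characters of equal length."""
    needle = name + b"\x00"
    offset = image.find(needle)
    if offset < 0:
        raise ValueError("name not found")
    if image.find(needle, offset + 1) >= 0:
        raise ValueError("name appears more than once")
    # Replace only the characters; the trailing NUL stays where it was.
    return image[:offset] + filler * len(name) + image[offset + len(name):]


if __name__ == "__main__":
    # Synthetic string table for demonstration (not real game bytes).
    demo = b"GetSystemInfo\x00GetLogicalProcessorInformation\x00kernel32\x00"
    print(blank_out_name(demo, b"GetLogicalProcessorInformation"))
```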
Unfortunately we're not out of the woods yet. There is one more function, and it's a doozy:
This function is doing a signed integer divide by the number of processors. None of our tricks thus far will work for this. Let's take a deeper look at what this instruction is doing. It takes a 64-bit value with the most significant 32 bits in edx and the least significant 32 bits in eax and divides by a 32-bit value (in this case, our dwNumberOfProcessors)--eax will receive the quotient and edx will contain the remainder. It does this in only 4 bytes!
However, it doesn't appear to be using the quotient at all, just the remainder. So this is actually doing a modulus by the number of processors. So what we need to do is actually a modulus by our lie of 8 processors. Can we do it in 4 bytes!?
Well, we have a few things going for us. The original function does a cdq (convert doubleword to quadword) instruction to set up for the divide. The divide needs both eax and edx as inputs, but afterwards only the result in edx is used. And since we've chosen a magic power-of-two value of 8, we can achieve a modulus by doing an AND operation with (8 - 1) or 7.
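The AND trick is easy to check in Python. In this sketch, `idiv_remainder` models the remainder the way x86 idiv produces it (truncating toward zero, so the remainder takes the sign of the dividend), and `and_mask` models a 32-bit AND with 7. Both names are mine. The two agree for every non-negative dividend; they differ for negative ones, so the patch implicitly assumes the value being reduced is never negative.

```python
def idiv_remainder(dividend, divisor):
    """Remainder as x86 idiv computes it: quotient truncated toward zero."""
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return dividend - quotient * divisor


def and_mask(value, cpus=8):
    """Reduce a 32-bit value modulo a power-of-two CPU count with a bitwise AND."""
    return (value & 0xFFFFFFFF) & (cpus - 1)


if __name__ == "__main__":
    for x in (0, 7, 8, 13, 64, 1000, -3):
        print(x, idiv_remainder(x, 8), and_mask(x))
```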
Within reason, we can also move some instructions around as long as the side-effects are the same.
All we really need to do is this, which is only 5 bytes:
And we've already established that we don't need the cdq instruction since that was part of the divide. So cdq + idiv gives us 5 bytes! All we need to do is move the "mov r8, rbx" one byte earlier, and then we have the 5 bytes that we need together!
That's all the locations calling GetSystemInfo patched up. The question is, does it run?
I fired up the game, and it purrs along like a charm! No more stalls. Seems like the theory of my computer having too many CPUs for an older game trying to take advantage of them was accurate.
I've uploaded the patched executable to Nexus Mods but as of this writing there haven't been any downloads. I guess not many people are trying to play a 10-year-old game with 64+ CPUs! At least I'm enjoying it again 😉
2 comments:
Excellent writeup. I enjoyed the explanations, especially of how you found instructions to do patching. I wonder if any of the built-in Windows compatibility settings could be used to do this – lying about number of CPUs seems like something that they could handle.
Great detective work, my friend!