06 July 2024

Fixing a Memory Corruption Bug in Python

For Omniverse we use Python for scripting. About a year and a half ago, we upgraded from Python 3.7 to Python 3.10. Shortly thereafter, we started seeing occasional memory corruption issues crop up, but it took a while to actually link the cause to Python.

A lot of the crashes that we were seeing involved very deep and complex Python stack traces in seemingly innocuous things: the garbage collector, releasing a reference, sometimes allocating new memory within Python. Since my team in the Omniverse group handles the plugin system, we manage the plugin that embeds Python into Omniverse, which means I was assigned all of these crashes for triage.

We do have some internal Python experts, and my team wasn't the tip of the spear on the 3.7 -> 3.10 upgrade (though we were involved). The nature of memory corruption means that crashes can happen in very random ways, since any call to acquire or free memory (which happens a LOT) may trigger a crash. So we had 20+ different stack traces that were considered to be part of the same memory corruption pathology. Curiously, this was only happening on Windows, and another clue was that we would often see an error in the log, sometimes well before the crash occurred:

RuntimeError: <_overlapped.Overlapped object at 0xXXXXXXXXXXXXXXXX> still has pending operation at deallocation, the process may crash.

"Helpful!" I thought to myself sarcastically. Well, the process was in fact crashing, so this may be an indication as to what happening.

However, this crash was fairly low priority. It wasn't happening often per se, but it was annoying. I made a task to track all of the different automatically generated crash stack traces, and periodically tried to find someone who could look into it. Since it was Windows-only and rare, it never percolated up to the top of anyone's priority queue.

Months passed and I continued to be irritated when another new stack trace would pop up. Everyone, including me, was busier with higher priority tasks. Since no one else could be bothered to fix it, I eventually decided to take a crack at it.


"Overlapped" I/O?

Typically, when you're writing a program and you want to make a network request, read a file, or do any other sort of input/output, the easiest thing to do is to stop everything and wait, because you need the results before continuing. But in the world of 60+ FPS, stopping and waiting for even a few milliseconds can feel sluggish. We instead want to do things asynchronously--have them happen in the background while the program does other things. One way of doing this is through parallelization, that is, using a separate thread or CPU to do the work, but that can also be messy. Also, Windows is old and computers used to have just one CPU, so a long time ago Windows started doing asynchrony through concurrency, with the operating system handling the interleaving of all the requests. "Overlapped" I/O is how Windows started doing that: multiple I/O operations can be in flight at a time, hence they overlap each other.
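Outside of Python, the raw Win32 idiom looks something like this (a minimal sketch; the file name and buffer size are arbitrary): start a read with an OVERLAPPED structure, get told it's pending, go do other work, and collect the result later.

#include <windows.h>
#include <cstdio>

int main() {
    // Open a file for overlapped (asynchronous) I/O.
    HANDLE file = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

    char buffer[4096];
    if (!ReadFile(file, buffer, sizeof(buffer), nullptr, &ov) &&
        GetLastError() == ERROR_IO_PENDING) {
        // The read is now in flight; the program is free to do other work here.
        // Crucially, 'ov' must stay alive until Windows says it is done with it.
        DWORD bytesRead = 0;
        GetOverlappedResult(file, &ov, &bytesRead, TRUE);  // TRUE = wait for completion
        std::printf("read %lu bytes\n", bytesRead);
    }

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}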

The problem is, it's pretty complicated. And it's even more so in an environment like Python, where the implementation requires both C code and Python code.


Reproducing the problem

The first step was to try to reduce the problem to a minimum so that it was easier to debug. I also knew that we started having this problem when we moved from Python 3.7 to 3.10, so I looked at the changes to see what might be different. I noticed this in the "what's new in Python 3.8" article:
On Windows, the default event loop is now ProactorEventLoop. (Contributed by Victor Stinner in bpo-34687.)

It turns out that Windows and Linux asyncio event loops are implemented quite differently, which makes sense because asynchronous I/O itself works quite differently on Windows and Linux. Starting with Python 3.8, asyncio uses Overlapped I/O by default through the Proactor event loop.

Omniverse was also doing something different from most Python applications: constantly stopping and re-starting the event loop. See, in most Python applications the program hands control to asyncio until the program finishes; asyncio then dispatches everything that the application does. Since Omniverse uses embedded Python, we still have a graphical application that needs to run, so each frame of the graphical application we would just run the asyncio event loop for one "tick" and then stop it.

I wrote a simple app that emulated that behavior, but I couldn't get it to crash, so that alone wasn't enough to trigger it. We had a sample case that could produce the Overlapped error message typically within 5 minutes (which is an eternity when you're waiting for it), so I looked at what it was doing. After much debugging I found another clue: other threads were calling call_soon_threadsafe on the event loop. I added that to my test program. BINGO. The resulting program had an 80% chance of printing the error message within 60 seconds.

It starts a thread that pings away, scheduling a dummy function to run on the next loop iteration. Its real purpose is to send a wakeup signal to the event loop. The main thread just constantly does our stop/run idiom like Omniverse does, so it doesn't have to hand control to the event loop permanently.


But does it corrupt memory?

The error message comes from the implementation of _overlapped, which is done in C and acts as a bridge between Python code for the asyncio event loop and the Windows API call to manage overlapped I/O. Let's look at what happens before that.

When the run_forever call is cleaning up, it tries to cancel the "_self_reading_future". That cancellation calls into the C code for _overlapped and invokes a Windows API function called CancelIoEx. Then the run_forever function unregisters the 'overlapped' object (ov) and sets the future to None, effectively destroying the object.

This is where the danger arises! The overlapped structure (managed by the overlapped Python object) was passed to the Windows API when an I/O operation was requested (in this case it's for the socketpair used to notify the event loop from other threads) and cannot be deallocated until Windows reports that it's done with it!

The deallocation function attempts to check this, and reports the error message if Windows isn't done with it. But then it goes on to deallocate the structure anyway, which can lead to memory corruption!


While not ideal, a better alternative would be to report the error and leak the OVERLAPPED structure so that a crash won't occur. But the best option is to see if we can fix the underlying problem.

With a reproducible case and an understanding of what was going on, I filed a Python issue.


Can we fix it?

I put the initial diagnosis into the Python issue: CancelIoEx is properly called and returns TRUE, meaning that Windows accepted the cancellation request (NOT that the I/O was cancelled yet!), but the deallocation function sees that the overlapped I/O hasn't yet completed. It tries to call CancelIoEx again (this always fails with ERROR_NOT_FOUND since the previous cancel request succeeded), and then calls GetOverlappedResult (note that wait=false because the second CancelIoEx attempt fails), which will also fail with ERROR_IO_PENDING. As a test I tried changing it to wait, but this just caused a hang.

What gives? Originally I thought this was a bug in Windows' handling of Overlapped I/O, but then I tried something else. Not only is this system using Overlapped I/O, it's using another complex Windows-ism: I/O Completion Ports. This system is a way to group multiple I/O requests together, so that instead of having to check myriad different requests, you check one "completion port" to find out when any of your requests finish. What if the cancelled I/O request is stuck in the I/O completion port? Instead of waiting on GetOverlappedResult, maybe I need to read the request from GetQueuedCompletionStatus.

BINGO.

Turns out that was the case: it really depends on where within the labyrinthine Windows I/O ecosystem the I/O request ended up before the cancel request happened.
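In Win32 terms, the pattern that finally worked looks roughly like this (a sketch of the idea, not the actual CPython code): request cancellation, then drain the completion port until the cancelled packet comes back, and only then treat the OVERLAPPED as safe to free.

#include <windows.h>

// Cancel an in-flight overlapped operation on 'handle' and wait for its
// completion packet to arrive on the I/O completion port 'iocp'. Only after
// that packet has been dequeued is it safe to deallocate *ov.
bool CancelAndReap(HANDLE handle, HANDLE iocp, OVERLAPPED* ov) {
    CancelIoEx(handle, ov);  // TRUE only means "cancellation was requested"

    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* completed = nullptr;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &completed, 1000);
        if (!ok && completed == nullptr)
            return false;   // timed out (or real error) with nothing dequeued
        if (completed == ov)
            return true;    // our packet came back (completed or cancelled); ov is now free
        // Otherwise this packet belongs to some other operation; a real event
        // loop would process it normally here.
    }
}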

So now we could make an "official" fix recommendation. Large projects like Python generally want fix recommendations from outsiders like me to be as simple and elegant as possible, which meant that I would have better luck if I didn't try to completely change how their Overlapped I/O worked. Since all we needed to do was make sure that GetQueuedCompletionStatus was called on our _overlapped object again, that gave us a goal. GetQueuedCompletionStatus was already called every time the event loop ran, and it already cleaned up _overlapped objects once they completed. All we needed to do was allow cleanup after cancelling only if the I/O was already complete; otherwise, the object would get cleaned up automatically during the next event loop iteration.

So the fix came down to changing a single line in that cleanup path: only unregister (and therefore deallocate) the overlapped object if its I/O had actually completed; otherwise, leave it for the next event loop iteration to clean up.

Done and dusted!

Of course, for the official fix I added a unit test and a bunch of comments since Overlapped I/O and I/O Completion Ports are frustratingly touchy.

As a bonus, Guido van Rossum (creator of Python) commented positively on the fix, and eventually we got the fix merged back as far as Python 3.8--the oldest supported version (for security fixes), but also the first version where the ProactorEventLoop became the default.

05 June 2024

Basic Callback Hygiene

There's a tenet that I've been pushing on for a few years now from my little corner of NVIDIA, and that is the concept of Basic Callback Hygiene.

For the uninitiated, a callback, simply put, is a function that I can pass to another function so that it can later "call me back." A lot of subscription services in programming work like this: something is going to happen in the future. I want to be notified of it, so I'll call a subscription function and give it a function that it can call to tell me when that happens.

This simple concept gets a lot more complicated when you ask what sorts of things you can do within that callback function. What happens when I don't care about the subscription anymore? What if the module containing the function unloads? What thread is the callback going to happen on? And so on, and so forth.

I've experienced problems with all of these complications over the course of my career, and given the amount of asynchronous and modular code that I've been writing for Omniverse, I finally distilled down a simple set of rules:

  1. Callbacks can ask to be unregistered from within the callback.
  2. Strong recommendation to not hold locks (e.g. a mutex) while calling callbacks.
  3. When a callback asks to be unregistered, there must be a strong guarantee that the callback is not currently being called, nor will ever be called in the future.
That's it. A subscription system that follows these basic principles will avoid 99% of the issues that we typically run into with callbacks.

Let's examine each of these. To put this in practical terms, let's say that we're designing an event system. We can subscribe to an event, which registers our callback, and we can dispatch an event, which calls back all of the subscribers from the context of the dispatching thread (so basically it loops over all subscribers and calls their callbacks). Our imaginary system is thread-safe, so we can subscribe or dispatch from any thread. We can also unsubscribe to unregister our subscription.


Unregistration from within a callback

This is probably the most useful one. Let's consider the case where I cannot unsubscribe from within the callback but want to. My callback is called and I can handle the event, but I cannot unsubscribe. This particular callback only wants to receive the first event after registration and no more (an example of this might be a one-shot function that I want to defer until the next frame).

Since I cannot unsubscribe from within the callback, I have to arrange for something to run later, outside of the event, to unsubscribe for me, and I need to ignore any further events that I might receive in the meantime.

This makes something that should be simple much more troublesome, and every place that I want to do this sort of action now requires more thought, effort, time, testing, etc. Changing the system to allow un-subscription from within the callback means that the complexity of handling that situation is in a singular place: my event system, rather than each and every callback that needs that behavior.


Don't Hold Locks

Holding a lock while calling a callback is dangerous because the callback can execute arbitrary code. The callback may call back into our event system and we don't want the complication of having recursive locks (which are arguably evil) or contributing to lock cycles which can cause deadlocks.

Writing an event system to be thread-safe and yet not hold any locks while calling callbacks is challenging to say the least. Generally this requires something akin to tracking the number of threads that are in the middle of doing callbacks, at a minimum. We typically do it by adding a thread identifier to a list while holding a lock, and then unlocking to call the callback, then re-acquiring the lock and removing the thread identifier from the list. This is more complicated to be sure, and the reason why will be evident in the next section. There are other ways to do it, but you'll need some tracking to maintain state while the mutex is unlocked for the callback.

This also means that how you walk the list of subscriptions must be stateful in a way that can resume after the mutex is unlocked and re-acquired. If these subscriptions are stored in C++'s std::vector for instance, there exists a possibility that unlocking the mutex will allow another thread to subscribe, which may resize and therefore reallocate the vector, which invalidates all iterators; or another thread may unsubscribe which will remove an entry, which can affect the position of the current thread walking over the vector. (Solutions to these, respectively, might involve using indexes instead of iterators, and leaving null pointers in the array to mark empty spaces).


Strong Unregistration Guarantees

This is by far the hardest issue to solve correctly. Here's a real-world scenario: right before a module unloads (in the main thread, say), it tells our Event system to unsubscribe a previously-subscribed callback. Without a strong guarantee, what will happen if another thread is currently calling our subscribed callback? As noted above, we don't want to be holding locks while calling callbacks, so waiting on a lock isn't going to wait for our callback to finish. What has happened in actual code is that the unsubscribe happens quickly, the module proceeds to unload, and then a crash occurs because another thread was still executing a function that was in the module. Or after unloading, objects were destructed since we would no longer get events, but with another thread in our callback function, a crash occurs because deleted memory was accessed.

Or the problem that I'm fixing right now: memory corruption occurs because a dependent structure was freed, but a race with the callback means that we were still changing things inside the structure.

It quickly becomes apparent that any request to unregister must wait for any other calls to finish. Also, once we return from the unregister function, it stands to reason that we cannot ever call the callback again. To do so would be violating the caller's unregistration request.

This then requires a strong ordering guarantee:
  1. Unregister function is called.
    1. WAIT for any calls to the callback in question to finish*.
    2. PREVENT any future calls to the callback.
  2. Return from the unregister function. At this point the caller can guarantee that our event system is free of the callback and it is safe to destroy structures.
"Ah," the astute and observant amongst you might notice, "what if the unregistration is from within the callback?!" This was our first point, above. If we WAIT for any calls to the callback to finish and the unregistration call is coming from the callback, we would wait forever. So we amend the line with the asterisk (*) above instead to say: "WAIT for any calls to the callback in question in other threads to finish."

So our subscription process is pretty simple:
  1. Take the lock
  2. Append our subscription information to the std::vector (or similar)
  3. Unlock the lock
The notification process is a bit more complicated as it has to synchronize with the unsubscribe process:
  1. Take the lock
  2. Iterate over the std::vector (or similar) by index, skipping null entries. For each:
    1. Add the current thread identifier to a list of threads that are notifying.
    2. Unlock the lock
    3. Call the callback function
    4. Re-acquire the lock. At this point, the vector may have been reallocated and grown, and subsequent entries may have been unregistered. Any other thread could have done anything.
    5. Remove our current thread identifier from the list of threads that are notifying.
    6. Is a waiting flag set on the subscription? If so, notify the waiting thread (e.g. using a condition variable).
  3. Unlock the lock
And now, the pièce de résistance: the unsubscribe process:
  1. Take the lock
  2. Find the subscription to remove. If not found, unlock and return; otherwise:
  3. Set the entry in the std::vector to null, but it is not destroyed yet!
  4. Is the subscription currently being notified by a thread other than the current thread? If so, set a waiting flag on the subscription that is used in 2.6 above.
  5. Unlock the lock.
  6. If the waiting flag was set, wait on the condition variable for notification from 2.6 above. Otherwise it is safe to destroy the subscription and return.
  7. Re-acquire the lock.
  8. If the list of notifying threads contains threads besides the current thread go to step 5.
  9. Unlock the lock. It is now safe to destroy the subscription.
This allows our unsubscribe to wait until no other threads are notifying our subscription (the only thread allowed to still be inside the callback is the current one, which is unsubscribing from within its own notification).
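Here is a condensed sketch of the whole scheme in C++. It's illustrative only (names are invented, and recursion and error handling are left out), but it shows the notifying-thread list, the waiting flag, and the unlock-around-the-callback dance described above.

#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Event { int code = 0; };

class EventSystem {
public:
    using Callback = std::function<void(const Event&)>;

    struct Subscription {
        Callback callback;
        std::vector<std::thread::id> notifying;  // threads currently inside the callback
        bool waiting = false;                    // an unsubscriber is waiting on this entry
    };

    // Subscribe: take the lock, append, unlock.
    std::shared_ptr<Subscription> subscribe(Callback cb) {
        auto sub = std::make_shared<Subscription>();
        sub->callback = std::move(cb);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_subs.push_back(sub);
        return sub;
    }

    // Dispatch: iterate by index; the vector may grow or have entries nulled
    // out while we're unlocked, so never hold iterators across the callback.
    void dispatch(const Event& e) {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (std::size_t i = 0; i < m_subs.size(); ++i) {
            std::shared_ptr<Subscription> sub = m_subs[i];
            if (!sub)
                continue;                               // unsubscribed entry
            sub->notifying.push_back(std::this_thread::get_id());
            lock.unlock();
            sub->callback(e);                           // never called with the lock held
            lock.lock();
            auto& n = sub->notifying;
            n.erase(std::find(n.begin(), n.end(), std::this_thread::get_id()));
            if (sub->waiting)
                m_cv.notify_all();                      // wake a waiting unsubscriber
        }
    }

    // Unsubscribe: when this returns, the callback is not running in any other
    // thread and will never be called again.
    void unsubscribe(const std::shared_ptr<Subscription>& sub) {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto it = std::find(m_subs.begin(), m_subs.end(), sub);
        if (it == m_subs.end())
            return;
        *it = nullptr;   // prevent any future calls; don't destroy anything yet
        auto busyElsewhere = [&] {
            // Ignore the current thread so unsubscribing from within the
            // callback (rule 1) doesn't wait on itself.
            for (std::thread::id id : sub->notifying)
                if (id != std::this_thread::get_id())
                    return true;
            return false;
        };
        while (busyElsewhere()) {
            sub->waiting = true;
            m_cv.wait(lock);
        }
        // Now the caller can safely tear down whatever the callback touched.
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::vector<std::shared_ptr<Subscription>> m_subs;
};

With something like this, a one-shot subscriber can call unsubscribe on itself from inside its own callback, and a module about to unload can call it from anywhere and then safely destroy its objects.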

Of course, there is a lot more nuance to this, especially when it comes to handling recursion (dispatching within a callback) or even just the complexity of working with multiple threads, but this should help you understand the concept of Basic Callback Hygiene.

04 July 2017

Resolute Timers

This weekend I got linked into a tweetstorm regarding a tool called Set Timer Resolution by one Lucas Hale. People are claiming that using this tool results in better hit accuracy, faster responsiveness and higher frame-rates for King of the Kill and other Daybreak titles.
Seen here, in all its majesty
Other people have said that this helps, and at one point it looks like the download link was removed from the site. At the time of this writing, Download 3K is apparently a valid mirror (MD5: 4b3bccdb3bcbd48162aa77270d910276). I cannot recommend using any specific third-party applications, including this one. Your mileage may vary and incorrect use of software may cause issues.

This specific very simple app (only 32k in size!) does not affect the game in any way, shape or form. In fact, it was originally authored back in 2007, way before King of the Kill. Instead, it tells Windows to check its timers more often. That's it.

Imagine this. You need to do something in 30 seconds, but you only have a clock with a minute hand. You glance at the clock and it says 4:59 pm. Once it changes to 5:00 pm, has 30 seconds elapsed? Not necessarily! What if you glanced at the clock a mere second before it changed? To ensure that a full 30 seconds has elapsed you would actually have to wait until 5:01 pm to guarantee that at least 30 seconds has passed, but up to 1 minute 59 seconds could have passed!

This is the nature of resolution: how often the clock's value actually changes when you check it. Now, computers do things a LOT faster than even once a second. Computers can do things in the nanosecond range (1/1,000,000,000 of a second) or even faster! When I started up the Windows 10 machine that I'm writing this post with, the resolution was 15.625 milliseconds (~0.015 sec). That's WAY slower than 0.000000001 sec! In fact, at that resolution the clock only updates 64 times per second, which can be slower than some frame rates that people get when playing King of the Kill.

When we do Windows programming, when we set a timer or tell a thread to sleep, we specify values in milliseconds (1/1,000 sec), but if Windows is only checking the clock every 15.625 milliseconds, a 1 millisecond timer can end up waiting 15.625 milliseconds, which is more than a whole frame in some cases. Obviously we want Windows to check the clock much faster than every 15.625 milliseconds.
This is about how often I check my phone. And I always forget to check the time.

Yes but does it WORK?

At first I thought this tool might have some confirmation bias behind it, but after digging into it, I'm going to say that it's plausible that it has a positive effect on gameplay. Windows has but one internal timer, and it's shared by everything running on the system. When Daybreak titles start up, we tell Windows that we want 1 millisecond resolution on the system timer. But, to be good software citizens, we tell Windows to set the timer resolution back to what it was when the game is ending. Seems reasonable, right?

Now imagine that everything does that. Say you start up Some App(tm) that sets the timer resolution from 15 ms to 1 ms, then you start up King of the Kill that also tries to set it to 1 ms. However, then you shut down Some App. Thinking that it's a good software citizen also, it sets the timer back to 15 ms. But you're still playing King of the Kill! Now you might see some different behaviors, like getting micro-stutters, missing hits that should have landed, etc. The game is doing what it's supposed to, but something happened that it didn't expect: the system timer got set back to low resolution. The Set Timer Resolution tool doesn't appear to continually update the Current Resolution display, but I believe it will try periodically to make sure the system timer is at the selected resolution. EDIT: Lucas Hale (the author of SetTimerResolution) commented below to let me know that this assumption is invalid. It appears that Windows will use the highest resolution requested by any running application. So if the game client requests 1ms resolution, and SetTimerResolution requests 0.5ms resolution, it will take the latter. This is good as it makes the devs' lives easier!

How does Set Timer Resolution work?

This section is not for the technically faint-at-heart. I'm going to wax programmatically on you. First of all, when the game starts up, we call a documented function called timeBeginPeriod with the minimum value reported by the timeGetDevCaps function (generally 1 [millisecond]). This would probably work fine in many cases as long as our game is the only thing running on the machine. But that is never the case. Little programs that do behind-the-scenes things can start and end and do all sorts of stuff. Browsers can be running with multiple tabs open. Streamer software. Video recording software. Etc. If any of those things can affect the system timer while the game is running, then bad things happen.
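For reference, the startup and shutdown calls look roughly like this (a sketch; real code checks every return value):

#include <windows.h>
#include <mmsystem.h>            // timeGetDevCaps / timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")

static UINT g_timerPeriod = 0;

void RaiseTimerResolution() {
    TIMECAPS caps = {};
    if (timeGetDevCaps(&caps, sizeof(TIMECAPS)) == MMSYSERR_NOERROR) {
        g_timerPeriod = caps.wPeriodMin;     // usually 1 (millisecond)
        timeBeginPeriod(g_timerPeriod);      // ask Windows to tick the timer faster
    }
}

void RestoreTimerResolution() {
    if (g_timerPeriod != 0)
        timeEndPeriod(g_timerPeriod);        // be a good citizen at shutdown
}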

It looks like Set Timer Resolution goes even deeper than the multimedia functions (like timeBeginPeriod) that our titles are calling. It goes straight to the kernel, the heart of every operating system. It looks like it's calling some undocumented user-mode kernel functions: NtQueryTimerResolution and NtSetTimerResolution. These are likely called deeper down from the multimedia functions that our titles use.

So where do we go from here?

I'd like to make Set Timer Resolution completely unnecessary. Since the game is already trying to set the timer resolution at startup, it seems like we could be doing a better job of making sure it stays set. I'll evaluate this against our current priorities and talk with the team about getting this in an upcoming hotfix.
Boom.

20 February 2017

Which patch? Dispatch.

Author's note: I started writing this post a year ago, hence the references are a bit dated, but the content is still relevant.

At the behest of the good people of reddit, I figured I would talk a little bit about a recent issue that cropped up in Planetside 2: a runaway thread in our threading library that went mostly unnoticed and caused reduced performance.

Threads Are Hard

Writing good multi-threaded code isn't easy. I'd push back on the linked article's author: multi-threaded programming actually is difficult. Sure, some of it boils down to programmers not following best practices, but there are several other facets that you don't encounter in single-threaded programming:
  • Synchronization
    • Reads and Writes to data must be synchronized
    • Compiler optimizations may subtly affect program operation
    • Lock-less programming considerations
    • Possibility of dead-locking or live-locking
  • Non-determinism. This basically means that the program doesn't run the same way every time. This has several repercussions:
    • Difficult-to-reproduce issues
    • Testing/error logging can mask issues or create false positives (similar to the Observer effect)
    • Statistical testing (problems occurring with a low enough frequency that you don't see them until they reach the Live environment)
Woody shares my expression.
However, hardware is increasingly delivering its power through larger and larger numbers of cores. To take advantage of that power, you need threads.

Concurrency

Threads are a solution to the problem of trying to make a computer seem to do multiple things at the same time. In the olden days, threads didn't exist, but systems could start other processes by forking the process. This would create a copy of the process that could do different things without affecting the parent process. By being a copy, changes that one process made to its memory and variables wouldn't affect the other process. The two processes would still be able to communicate somewhat via IPC, but this is generally much slower than, say, setting a variable within a process.

Like my threads?
But sometimes you want one process to be able to do multiple things at the same time. That's where threads come into play. Since threads allow multiple things to happen "simultaneously" within the same process, you have concurrency.

I put "simultaneously" in quotes, because multi-processor systems of yore were generally limited to server-class hardware. It was uncommon to find a user's home machine with more than one processor. This means that the system could really only do one thing at a time, but it looked like it was doing things simultaneously because it was switching threads (i.e. things that it was doing) very, very quickly. These days, everything is multi-processor. My house thermostat is probably multi-CPU (ok, not really). The focus in hardware shifted from doing one thing very fast (higher clock speed aka GHz on CPUs) to doing many things pretty fast at the same time. The previous generation consoles (Playstation 3 and Xbox 360) had three to four CPUs whereas today's console generation (PlayStation 4 and Xbox One) have eight. My work computer has 12 "logical" cores.

Early threading involved creating threads for very specific tasks. For example, EverQuest II largely runs in a single thread, but creates specific threads for loading files, talking to the streaming asset server, updating particle effects, etc. Most of the time those specific threads are doing nothing; they're just sleeping. As the number of processors in a system grew, it became less practical to have dedicated, specific threads, especially when the number of processors differs from system to system.

Synchronization

Let's take a break for a second to talk about a related topic. Synchronization is a big, huge, gargantuan topic wherein lie most of the problems with multi-threading. I'm only going to say a few words about Synchronization.
Yep, like that.
Nearly everything in the computer system can be considered a resource that must be shared: files, memory, CPU time, DVD drives, graphics, etc. What must be shared must be synchronized so that separate threads don't counter-productively stomp on each other. Process memory is probably the most often shared and problematic resource because everything interacts with it. Something like a file is fairly easy to synchronize because nearly every access of it requires a function call. Memory, on the other hand... For instance, here is an example of a function that just increments a number. What would the value be if you had two threads on a multi-processor machine each calling this function 1000 times?
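Here's a stand-in for that function (my own sketch, not the original snippet), plus a two-thread harness:

#include <cstdio>
#include <thread>

int g_counter = 0;

void IncrementCounter() {
    ++g_counter;   // looks like one step, but is really load, add, store
}

int main() {
    std::thread a([] { for (int i = 0; i < 1000; ++i) IncrementCounter(); });
    std::thread b([] { for (int i = 0; i < 1000; ++i) IncrementCounter(); });
    a.join();
    b.join();
    std::printf("%d\n", g_counter);   // often prints less than 2000
    return 0;
}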





Hint: it's usually not 2000. Surprise! Incrementing a number is not an atomic operation, it's actually a read-modify-write operation. This is one of those things that makes multi-threaded programming so hard! To protect the section of code that increments the number, you have to use some sort of synchronization primitive, like an atomic intrinsic, mutex, spin-lock, or the like.

Tasking

When you have more processors available, it makes more sense to break problems down into logical tasks or units of work. Instead of having a dedicated thread to load files, now you just have a "load file" task. You have a "collect garbage" task. You have an "animate entity" task. Task, task, task.

Khaaaaaaaannn!!!

This concept of tasks helps you to fill all available processors with work to do. Theoretically, if you can keep the task backlog full, CPUs will always have work to do. If you're using 100% of the available processors, you're doing the maximum amount of work that the system can do.

Dispatch

In 2009, Apple launched Mac OS X 10.6 with a programming API (libdispatch) marketed with the name Grand Central Dispatch (GCD). It is, among other things, a generic task execution system. You have a function or a task that you want to run in the main thread or a separate thread at specific priority levels, immediately or at a scheduled time in the future. You just take that function or task and throw it on a "dispatch queue" and let it run. Simple. Powerful. Efficient. I quickly fell in love with what GCD could offer when I used it for my iPhone game, Bust a Mine.

Fast forward to mid-2014. I became Technical Director of Planetside 2. The team was working on porting Planetside 2 to the PlayStation 4. Performance profiling was showing that the CPUs on the PS4 were slower than the average CPUs on our PC players' machines, and Planetside 2 was still largely single-threaded. We started looking at threading technologies like Intel's Threading Building Blocks, OpenMP, and even the C++11 thread support library. However, given my experience with libdispatch and the approach of looking at the problem as tasks rather than dedicated threads, we decided to look around for something similar. We found xdispatch, a port of libdispatch to Windows and Linux (libdispatch was originally written for Mac OS X, which is based on BSD). However, it had some issues: namely it didn't support the PS4 (few things did) and was based on a much older version of libdispatch. We began adapting it to the PS4 and it gave us a solid framework to start multi-threading Planetside 2.

Adaptive Tuning

We developed a threading sub-team on the PS2-on-PS4 project that had one primary requirement (increase performance) and two main points of attack: 1) tune xdispatch to do what we needed and 2) adapt existing threads and operations into tasks that could be done concurrently. Ideally these changes would carry over to the PC version of the game as well.

Both of these facets were challenging. On the tuning front we discovered that part of the reason that GCD works so well on Mac OS X is because the kernel--the core of the operating system itself--is controlling the dispatch scheduling. We didn't have that ability on the PS4, nor could we get information about how busy the system was! We went through several iterations on how to deal with this, but eventually settled on working around it by setting up some guidelines--the standard dispatch queues would only be used for CPU-intensive work (calculation, animation, decompression, etc.) and we would limit them to the number of CPUs available.

The second facet was a much longer pole that would continue throughout the project. Converting existing dedicated threads and PS2's old job-queue system to Dispatch was easy and quickly done, but that didn't net nearly the performance that we would need to see to be viable on PS4. We would have to go much deeper. This would involve taking core aspects of PS2's main loop and breaking them into tasks--entity processing, physics, animation, rendering, etc. This is difficult to do with C++ because non-thread-safe side-effects are nearly impossible to find; we would have to identify everything non-thread-safe that was happening in the single-threaded code before converting it to tasks.

OMG Bugs

That was the song, right?

The reddit post that originally spawned this blog post gives some insight into a problem that still existed a year after the PS4 version originally launched. Namely, PC players identified that a thread would take a whole CPU, but they found that if they killed it, performance got better and nothing bad really happened (or at least nothing immediately visible). This was found to be a bug in timers in xdispatch that, to my knowledge, still exists. You can read the above link for more technical information, but it had to do with bad assumptions in the PC port of software originally written for BSD. Shockingly, the problem also existed in the PS4 build even though it shouldn't have. It looks like the timer implementation in xdispatch (and the libdispatch version that it was based on) was functional but not very efficient, so we wrote a new timer outside of xdispatch and used that instead.

Still later, we finally got a handle on one of our long-standing (post-PS4-launch) crash bugs. This was a crash bug that we had never seen internally (see my above point about bugs becoming a statistical problem). It looked like a memory corruption problem, which just made all of the programmers reading this shudder in horror. Memory corruption is terrible. It is an evil problem and few good tools exist to locate it assuming you can figure out how to make it happen. But find it we did, and it was also an issue with converting systems to multi-threading. In this case, an 'animation complete' flag was in a data structure that was getting freed before the task performing the operation finished and set the flag. Since the memory was freed before the flag was set, sometimes the memory had been reused for other things, hence the corruption. This was a problem not with Dispatch itself, but with how a previously single-threaded operation had been converted to a task.

Most recently, we began hearing reports of 'lock-ups' and 'hangs' from PS4 players. This coincided with an update to PS4 SDK 3.500 (from 2.000) for Planetside 2, which, among other things, gave us an additional half of a CPU to use for game logic (with 2.000 the system would reserve two whole CPUs for itself, whereas after 3.000 it only used one-and-a-half). Because of this, we ramped up Dispatch (now a completely retooled version no longer based at all on xdispatch but based on a more current version of libdispatch) to take advantage of that half-a-CPU. Eventually we determined this to be the cause of the lockup, but for unexpected reasons.

The game was not experiencing a dead-lock, but a form of live-lock; all of the CPUs were running threads, they just weren't making any progress. This was because of a degenerate case between the design of libdispatch and the PS4 scheduler. Basically, the internals of libdispatch (and our Dispatch library that was based on it) are lock-less--they are doing atomic operations rather than locking mutexes. Some of these atomic operations are loops waiting for conditions to be met, or multiple compare-and-exchange operations in a loop; they try to do an operation based on old information and retry with updated information if it fails. But the PS4 scheduler will not run any thread of a lower priority if a higher priority thread is runnable. We could end up in a state where a lower-priority thread would be preempted after it changed conditions for a higher-priority thread. This would cause a sort of priority inversion that would never resolve. Most operating systems will at least give some time to lower priority threads to prevent starvation, but the PS4 does not. Indeed the default scheduler for the PS4 is the FIFO scheduler, but even the round-robin scheduler will not run lower-priority threads.

Our solution to this involved applying a progressive algorithm that would eventually put threads to sleep in extreme cases in order to allow the live-lock to resolve. Generally this might look like a slight momentary dip in frame rate or may not even be noticed at all.
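The shape of those lock-free retry loops, greatly simplified (a sketch of the pattern, not the actual Dispatch internals):

#include <atomic>

struct WorkItem {
    WorkItem* next = nullptr;
    // ... payload ...
};

std::atomic<WorkItem*> g_head{nullptr};

// Lock-free push: read the current head, link to it, and try to publish the
// new head with a single compare-and-exchange. If another thread won the race,
// 'expected' is refreshed and we simply try again.
void Push(WorkItem* item) {
    WorkItem* expected = g_head.load(std::memory_order_relaxed);
    do {
        item->next = expected;
    } while (!g_head.compare_exchange_weak(expected, item,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
}

The dangerous variants are the loops that spin waiting for a condition that some other thread must set; on a strict-priority scheduler, that other thread may simply never get to run again.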

Looking Forward

Make sure your blinker is on.

We're always looking for ways to increase performance across all platforms. As other Daybreak games ramp up we're finding new ways to eke out increasing frame-rates and sharing that knowledge among the teams. Our internal Dispatch implementation is moving into other Daybreak titles and future projects, and it all started here, on Planetside 2. Efforts have been made to keep these types of changes in parity between the different games.

25 November 2015

Don't Repeat Yourself

Something that I've been working on with the Planetside 2 team lately is a more strategic direction in how we write maintainable code for the future. Part of my direction for this has been adherence to the D.R.Y. principle--Don't Repeat Yourself. Namely, instead of distributing authority out through the expression of Interfaces (and writing a lot of repetitive code), we are architecting unambiguously authoritative systems that can handle all functionality in a common way.

One of our most recent realizations of this principle is in our new UI data-source layer.

Previously, Planetside 2 (and previous Forgelight Engine ancestors) had a concept of a DataSource interface. This was a simple interface that allowed sub-systems (such as Skills or Items) to express data in a string-based row/column format to the UI layer (Planetside 2 uses Scaleform GFx, a Flash-based interpreter/renderer). The Actionscript 3 UI code could find a DataSource by name and query it in the ways expressed by the interface (it could also register as a listener to determine if anything changed).
class IUIDataSourceTable {
public:
    virtual int GetData(int row, int column, String& output) = 0;
    virtual int GetRowCount() = 0;
    virtual int GetColumnCount() = 0;
    virtual String GetColumnName(int column) = 0;
    virtual void AddListener(IUIListener*) = 0;
    virtual void RemoveListener(IUIListener*) = 0;
    virtual void NotifyChanges() = 0;
};

Seems great so far, right?

Because this is an interface, each system is responsible for surfacing internal data to the UI through the GetData call. But each system can implement it completely differently. As long as 'output' receives the value of a given cell, where that data comes from isn't important to the UI layer. Also, detecting when that data changed (and notifying listeners) became the responsibility of each system.

Okay... So?

Well, now let's say that you want to do more complex operations with that data, such as sorting, filtering or joining with another IUIDataSourceTable. Sure we could add additional functions to the IUIDataSourceTable, but as our codebase grows, we're faced with a quandary: how do shrinking development teams maintain ever-growing codebases??

Let's look at it a different way. Say I wanted to add a Filter function to my IUIDataSourceTable. Planetside 2 has about 130 implementations of this interface (give or take) in very disparate systems such as Character Select, Skills, Items, Marketplace, Social, Experience, etc. If I want to add another function to my interface, I'm looking at actually writing 130 functions and doing a lot of research into those systems to find out how they actually surface the data.

This is the problem with distributed authority: it multiplies the cost of a simple change. Past work-arounds for just the two things that we always want to do--filtering and sorting--required either a C++ engineer to add to a specific DataSource implementation, or a UI engineer to gather everything and sort it in Actionscript 3. This is not a very productive or sustainable model and it's ridiculously time intensive (both on the CPU and in development time).

Ok, now let's adopt a don't-repeat-yourself strategy.

Instead of declaring an Interface that each system must adhere to, let's design a system that is unambiguously authoritative in terms of UI data. We want one system that we can tune and add functionality to. And since we're talking about UI datasources, can you think of anything that is designed to store row/column data and has sorting and filtering capabilities? Maybe... like... a database?! Turns out there is such a beast that we can use as the core of our new authoritative UI data system: sqlite3! This gives us a light-weight, in-memory SQL database that we can use to store our data and slice it any way we need! And better yet, browsers (and even mobile devices) have been using it for years!
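To give a flavor of the idea (a sketch with made-up table names, not uiDB itself): an in-memory database plus sqlite's update hook already gives you "store rows, query them with SQL, and get told when they change."

#include <cstdio>
#include <sqlite3.h>

// Called by sqlite whenever a row is inserted, updated, or deleted. In the
// real system this is where a change event would be queued for the Flash UI.
static void OnRowChanged(void*, int op, const char* dbName, const char* table, sqlite3_int64 rowid) {
    std::printf("change op=%d on %s.%s row %lld\n", op, dbName, table, static_cast<long long>(rowid));
}

int main() {
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db);                    // light-weight, in-memory database
    sqlite3_update_hook(db, OnRowChanged, nullptr);   // notified on every data change

    sqlite3_exec(db, "CREATE TABLE items(id INTEGER PRIMARY KEY, name TEXT, qty INT)",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "INSERT INTO items(name, qty) VALUES('Medkit', 3)",
                 nullptr, nullptr, nullptr);

    // The UI side can slice the data however it likes with plain SQL.
    sqlite3_exec(db, "SELECT name, qty FROM items ORDER BY name",
                 [](void*, int, char** argv, char**) -> int {
                     std::printf("%s x%s\n", argv[0], argv[1]);
                     return 0;
                 },
                 nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}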

I got an engineer excited about this prospect and put her on the project, and she turned out an awesome system called uiDB that exposed game data to the Flash-based UI through SQL and tables. This has some really cool features:

  • Adding new functionality is trivial. We just added some very broad functionality in less than a day because we changed one authoritative system instead of 130 interface implementations.
  • sqlite has a callback mechanism, so we can use Actionscript 3's event mechanism to notify the UI when data changes--automatically. C++ just updates the game data in the uiDB system as it changes, and the authoritative uiDB system notifies any listeners. No extra code required.
  • The AS3 code can create a view to slice, sort, filter, or join data however it wants and can still get notified when any of that data changes. Without having to write any additional C++ code.
  • The AS3 code can also perform any SQL query on the in-memory UI database whenever it wants. Need data one time? No problem. Need to display data and refresh when it changes? Equally trivial.
  • We wrote a DataProvider implementation for uiDB tables and views. This is the standard DataProvider that Flash widgets use, so our UI engineers don't have to write any code to handle change events or anything--the DataProvider can do all of that and the UI just updates automatically. Talk about code reduction! 
Another surprise out of changing from the interface model to the authoritative model is that our UI data performance increased, even though we were now using a SQL-interpreting database! This is for a few different reasons:
  • In the Interface model, each system retrieved its data through different means. Some could be woefully inefficient.
  • sqlite3 is a well-tested, highly optimized, tightly-packed, small C library. It has a development team that cares only about its performance and functionality in a variety of settings. As such, we are gaining the wealth of knowledge and experience of another team focusing intently on one piece of technology, allowing us to focus on making a game. (This could branch into another discussion about utilizing as much third-party technology as makes sense).
  • By consolidating authority in one system instead of spreading it out, we had one target to focus on for performance testing and optimization.
We are continuing to provide additional functionality through this system to our UI engineers in order to make their lives easier (and we're doing this across the board for all disciplines). This is looking forward to the future of Daybreak Games and the Forgelight Engine as much as the future of Planetside 2.

All in all, work smarter, save time across the board, and don't repeat yourself!

30 June 2014

By the Library of Thor!

I recently uploaded to github a C++ template library called Thor, which is something that I've been playing around with. It's really just a way for me to write some crazy template code--a creative outlet if you will. This blog post is meant to show off some of the things that I'm proud of in Thor. The name Thor came in part because of inspiration from Andrei Alexandrescu's template library Loki and his book Modern C++ Design (an excellent advanced C++ book).

Some things in my Thor library are just me implementing things in the way I would if I had full control, and some are a slightly different take on popular concepts. Some things are already outmoded based on offerings in C++11 (especially in terms of atomic support), but are included for posterity.

I apologize that the code samples are images (Blogger currently doesn't do well with template syntax), but generally a link is provided to the code on github.

Containers

The C++ standard library includes several standard, well-known and well-documented containers; however, these containers often don't allow for much customization. For instance, std::vector (up until the C++11 standard) had no obvious means to free memory (C++11 adds the shrink_to_fit() function). Although not technically correct, I'll refer to the C++ standard library as STL below for brevity's sake.

thor::basic_string

Thor's string class works very similarly to std::string and std::wstring with a few changes. First, the strings are atomically reference counted, so the underlying string data is shared when possible. Secondly, a template parameter can be specified to pre-reserve (as part of the basic_string instance, not a separate heap allocation) space for the string. The default of zero is a fully dynamic string, but if you need a quick string on the stack and don't want to allocate anything from the heap, it is easily done with the addition of a size template parameter:

Furthermore, in addition to the STL constructors for basic_string, additional options are available for printf-style formatting and literal strings (that don't require copying or allocation):

The printf-style formatting is also supported for append, insert, replace and assign variants.

Thor's string also supports conversion between UTF-8 and "wide" (wchar_t). If wchar_t is 32-bits, then the conversion will be between UTF-8 and UTF-32, however the more likely case is that wchar_t is 16-bits, in which case the conversion will be between UTF-8 and UTF-16, complete with support for UTF-16 surrogate pairs.

thor::list

In general node-based STL containers (such as std::list) will allocate a 'dead node' or a terminator node from the heap as soon as they're constructed (even default constructed). Usually this terminator node will include space for the 'T' stored type, although that space will never be used and therefore wastes space. I frown upon the fact that wasted space is allocated from the slower heap for such a simple operation as default constructing a list.

Consider Thor's list class. Instead of heap-allocating a terminator node, the head/tail pointers for the list class use a list_node_base, and the address of that list_node_base is used as the terminator (returned by the end() iterator).
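In spirit, the trick looks something like this (my own sketch of the concept, not Thor's actual code):

// Sketch of the self-referential terminator idea.
struct list_node_base {
    list_node_base* next;
    list_node_base* prev;
};

template <typename T>
struct list_node : list_node_base {
    T value;   // only real nodes carry a T; the terminator never does
};

template <typename T>
class list {
public:
    // An empty list's links point at the embedded base node itself, so there
    // is no heap allocation just for default-constructing a list.
    list() { m_root.next = m_root.prev = &m_root; }

    // end() is simply the address of the embedded base node.
    list_node_base*       end_node()       { return &m_root; }
    const list_node_base* end_node() const { return &m_root; }

private:
    list_node_base m_root;   // head/tail links live inside the list object
};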

This has the advantage of not requiring a heap allocation for the terminator node while still allowing simple reverse iteration with no need for null-pointer checking at head/tail. However, there are two disadvantages. The first is that any accidental write through the end() iterator (which is completely invalid anyway) may cause memory corruption. This can be prevented in checked/debug builds with asserts when trying to dereference the end() iterator. The second is that swap semantics require slightly more finesse (although these are more complicated in Thor for another reason explained below). In std::list the swap() function need only exchange pointers and counts between two lists--since the terminator node is allocated on the heap, it is not embedded in either list object and automatically goes along with the swap. However, since Thor's list essentially has a local terminator node as part of the instance, care must be taken to fix up the terminators during swap:

Notice that the swap function handles a concept of shareable. This is because of another enhancement of Thor's containers. Essentially, space for a certain number of items can be pre-allocated along with the container, similarly to the string class mentioned above. For instance, thor::list<T,5> specifies that no heap allocations must occur for the first five list entries. Fortunately, thor::list<T,5> inherits from thor::list<T> so that it can be used anywhere a list is required. However, it also means that swap() must now handle pre-allocated nodes. It is less efficient to swap() a list with pre-allocated nodes, as this requires the nodes to be converted to heap-allocated nodes.

There are many other extensions to all of the containers, yet they all try to remain true to the C++ standard library specification.

thor::hash_map (et al)

The thor::hashtable class (which forms the basis of hash_map, hash_set and the multi- versions of each) has a noteworthy feature: a policy class that controls how the stored values are organized into buckets. The options are a power-of-two system that is very fast by using bitwise masks instead of divide/mod instructions (although this method can be terrible for pointers as keys as they typically are multiples of 4 or 8 and the bucket strategy loses efficiency leading to collisions), or a more traditional (and slower yet less collision-prone) prime-number strategy.

The C++ standard library hash containers are undefined order containers. This means that the order in which you insert items is not necessarily the same order in which you would iterate over the items. However, thor's hash containers are a fusion of a list and a hashtable, which allows for a defined iteration order as well as amortized O(1) lookup time. This does require a slight change to how iteration starts: the begin() function defaults to 'list' iteration mode but can be changed to 'hash' iteration mode, whereas find() operations default to 'hash' iteration mode.

Embedded Containers

For std::list (or thor::list for that matter), the 'T' type that you're storing in the container is contained within a node that tracks other information for the container. If you have a list of pointers to objects on the heap, then the objects themselves are allocated and the list container must also allocate a small node to store each pointer. This is wasteful. The embedded containers can alleviate this by having the node tracking information as part of the object that is being stored. For instance, the 'T' type stored in a thor::embedded_list must contain an embedded_list_link member and the member is given as a template parameter to thor::embedded_list.

Atomic integer/pointer

The thor::atomic_integer system is similar to C++11's atomic wrapper, so while not completely necessary it was a fun exercise to write. The real meat of how it works is based in the platform-specific interlocked system (the Windows version is interlocked_win). This uses template specialization to handle one-, two-, four- and eight-byte integers. By using intrinsics, these atomic_integers actually compile down to very few inline instructions for most operations. Consider how declaring interlocked works on a platform with a 32-bit integer. First, some magic happens based on how the non-specialized interlocked class is declared. There is a second template parameter that is required, but defaults to sizeof(T):

The template parameter T_SIZE defaults to 4 for int, which causes a specialization to be selected:

This specialization uses the correct functions/intrinsics for the numeric type that we're wrapping in atomic_integer. This allows atomic_integer to do the correct operation based on the size of the integer parameter.
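The pattern, roughly (my sketch using MSVC intrinsics, not Thor's actual interlocked_win code):

#include <cstddef>
#include <intrin.h>   // MSVC interlocked intrinsics

// Only specializations exist; there is no generic body.
template <typename T, std::size_t T_SIZE = sizeof(T)>
struct interlocked;

// Chosen for any 4-byte integer type.
template <typename T>
struct interlocked<T, 4> {
    static T increment(volatile T* value) {
        return (T)_InterlockedIncrement(reinterpret_cast<volatile long*>(value));
    }
    static T compare_exchange(volatile T* value, T exchange, T comparand) {
        return (T)_InterlockedCompareExchange(reinterpret_cast<volatile long*>(value),
                                              (long)exchange, (long)comparand);
    }
};

// Chosen for any 8-byte integer type.
template <typename T>
struct interlocked<T, 8> {
    static T increment(volatile T* value) {
        return (T)_InterlockedIncrement64(reinterpret_cast<volatile __int64*>(value));
    }
};

// atomic_integer<int> routes through interlocked<int, 4>, so each operation
// compiles down to a single lock-prefixed instruction.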

Memory Alignment

Another place where Thor uses template specialization is with memory alignment. All of the containers do memory allocation through Thor's memory functions. There is a simple class called the align_selector:

This class's only job is to determine if an object can use system-guaranteed alignment or not (most objects can). If the system alignment will suffice, align_selector::alignment will be zero. The align_alloc class uses the align_selector as a default template parameter to select the proper specialization for the alignment:

The system default alignment just allocates raw bytes:

But the default version that handles non-default alignments will over-allocate and offset:
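Put together, the pieces might look something like this (a sketch of the approach, not Thor's actual code):

#include <cstddef>
#include <cstdint>

// Picks 0 ("the system allocator's alignment is already good enough") for most
// types, and the real alignment only for over-aligned types.
template <typename T>
struct align_selector {
    static const std::size_t alignment =
        alignof(T) <= alignof(std::max_align_t) ? 0 : alignof(T);
};

// General case: over-allocate, round up to the requested alignment, and stash
// the original pointer just before the block so it can be recovered on free.
template <typename T, std::size_t Align = align_selector<T>::alignment>
struct align_alloc {
    static void* alloc(std::size_t bytes) {
        unsigned char* raw = new unsigned char[bytes + Align + sizeof(void*)];
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        p = (p + Align - 1) & ~static_cast<std::uintptr_t>(Align - 1);
        reinterpret_cast<void**>(p)[-1] = raw;
        return reinterpret_cast<void*>(p);
    }
    static void free(void* p) {
        if (p) delete[] static_cast<unsigned char*>(reinterpret_cast<void**>(p)[-1]);
    }
};

// Specialization chosen when the default alignment suffices: just raw bytes.
template <typename T>
struct align_alloc<T, 0> {
    static void* alloc(std::size_t bytes) { return new unsigned char[bytes]; }
    static void  free(void* p)            { delete[] static_cast<unsigned char*>(p); }
};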

Arguably, this version could use _aligned_malloc or memalign or similar, but by using new[] an application may still override new and use their own memory manager if so desired.

There is a lot more to the Thor library. Take a look at it on github and let me know what you think.

15 May 2014

Busting Mines

Behold: Bust A Mine. My first iOS development project.

Not a bad run!

Inspiration

When I set out to (once again) try my hand at iOS development, I didn't know what the future would hold. It was around the time that a former game that I worked on, Clone Wars Adventures, had announced that it would be ending on 31 March 2014. One of my favorite mini-games in CWA was called Mine Buster, a simple game that involved creating a chain reaction for points. It was also a mini-game that would soon cease to exist. And a good mini-game that could use a Spiritual Successor.

Beginnings

I started working on my own rendition in early March while professionally employed as the Director of Technology at Molten Games. Within two weeks of late-night spare time I had a working prototype that functioned similarly to its inspiration. I had heard good things about using Cocos2D 3.0 and it seemed to fit the bill. My prototype was still very rough and could only run on the iOS Simulator that comes with Xcode 5. However, I was happy enough with the progress that I decided to spring for the iOS Developer Program so that I could actually run and debug on my iPhone and iPad. There was a little learning curve to overcome as running code on the device requires creating App Identifiers, Certificates for code signing and Provisioning Profiles for signed code to be allowed to run on the device. Being able to run and debug on my devices was instantly gratifying, but also revealed another issue: the need to deal with different screen sizes (iPad vs. iPhone) and aspect ratios.

Artist Wanted

It was time to start finalizing the art--getting real assets at the necessary sizes and building out the art style. For this, I'd actually need an artist. Fortunately, Molten's UI artist was on-board with the project after seeing the prototype running on my phone. He had free rein with the art style and created a look inspired by the likes of Geometry Wars yet unique in its own right. We worked together to integrate the art into the game and iterate on the style. Unfortunately, Molten Games collapsed around this time, towards the end of March. Priorities shifted to finding a new day job since this project was a journey of Experience and less of Business.

To Test or not to Test

Working with an artist brought to light another need that I'd have to figure out: distributing test builds of my app to other people. For this I began using Test Flight. This provided a great way to bring testers on board and get feedback. The only issue was minor: when a new device was registered, I would have to add the device's unique identifier to the provisioning profile and upload another build of the app. I started acquiring testers with a broad spectrum of devices. I also acquired a few devices of my own: a non-Retina iPad and an older iPhone. I was also blessed with winning an iPad Mini in a contest! I made sure to also test on the newer 64-bit devices. Through this I actually discovered some 64-bit compatibility issues with Cocos2D 3.0 (in beta at the time) and worked with the developers to solve them.

Sprechen Sie Deutsch?

I can understand exactly one of these.
Having visited Moscow and Shanghai to assist with Planetside 2 localization (and having worked on games for the past decade or so), I'm constantly reminded of the worldwide gaming community. The English-speaking community might be easier for me to market to, but it's a small part of the world. Therefore, I wanted to make sure that my app could be distributed and popular in other locales. I had a very small number of strings (less than 100), so professional localization would be fairly easy. I originally started working with Rev.com but found them difficult and obtuse. I ended up going with e2f and found them slightly more expensive, but very easy to work with. They appreciated all of my contextual notes and turned around six languages (French, Italian, German, Spanish, Russian and Simplified Chinese) in less than 24 hours. In hindsight, I should have included a few more items in the translations, such as future (generic) update notes and anything that I would potentially need in the next few months.

Finishing Touches

This game, since it was an exploration of experience, was going to be free and advertisement-supported. However, given the ease of doing in-app purchases, I would offer an in-app purchase to remove the ads. I also wanted GameKit integration to essentially have free Leaderboards. Fortunately many kind developers have written simple APIs around these pieces of functionality and have shared them for free. Integrating the likes of ABGameKitHelper and MKStoreKit was very easily done. My testers also suggested graphical features such as using UIMotionEffect to simulate a parallax display on iOS 7+.

A Different Beat

Having some manner of background music was important to me. However, I'm not a composer. I don't really know many composers. So I started looking for free music. Interestingly, though, "free" isn't always free. Some companies let you use their music for free for certain purposes, and an app is not generally one of those free uses. I originally wanted to use some songs from Freeplay Music, but after seeing their licensing requirements and reading about their litigation processes, I decided against it. I ended up using music by Chris Zabriskie who offers his music for free with a Creative Commons Attribution license. I feel that his music meshes well with the peaceful, soothing nature of Bust A Mine's laid-back gameplay.

Pre-Launch Checklist

I've heard of using a test market to pre-launch, so I figured I'd give that a try. I have some friends with family in the Philippines and it seems to be a fairly popular test market. Launching in the Philippines also revealed an interesting fact: iAd isn't available in all parts of the world. Therefore, my advertisement strategy would have to change for a worldwide launch.

3... 2... 1... Liftoff!

Bust A Mine officially launched in a limited capacity on 9 May 2014, about 9 weeks after development started. Considering that it was my first foray into iOS development and Objective-C programming in general, I'm calling that a success. Not a business success, but an experiential success. For the full worldwide launch, I'm waiting for Apple to approve the next version that will integrate Google AdMob as a backup to iAd.

To Infinity and Beyond

The background reverberates with explosive force
I really enjoy iOS development. The Objective-C language allows for easy checking to see if features are available on the device it's running on. Furthermore, the Xcode integration with Apple back-end services is very well done and easy to use. Want GameKit integration? Flip a switch and you've got it.

Several people have asked about an Android port as well. Interestingly, no one has asked about a Windows phone version. In any case, I'm considering a port to Cocos2D-X, a C++ library that supports all major mobile platforms.

I've got plenty of ideas for Bust A Mine for the future. And a few of my game designer friends have offered some feedback and ideas of their own. Plus, I have more ideas for future games as well. Mobile development has become a fun little hobby for me.

Download Bust A Mine and give it a try. Leave a review or rating if you feel so inclined. And let me know what you think in the comments!