04 July 2017

Resolute Timers

This weekend I got linked into a tweetstorm regarding a tool called Set Timer Resolution by one Lucas Hale. People are claiming that using this tool results in better hit accuracy, faster responsiveness and higher frame-rates for King of the Kill and other Daybreak titles.
Seen here, in all its majesty
Other people have said that this helps, and at one point it appears the download link was removed from the site. At the time of this writing, Download 3K is apparently a valid mirror (MD5: 4b3bccdb3bcbd48162aa77270d910276). I cannot recommend using any specific third-party applications, including this one. Your mileage may vary and incorrect use of software may cause issues.

This specific very simple app (only 32k in size!) does not affect the game in any way, shape, or form. In fact, it was originally authored back in 2007, way before King of the Kill. Instead, it tells Windows to check its timer more often. That's it.

Imagine this. You need to do something in 30 seconds, but you only have a clock with a minute hand. You glance at the clock and it says 4:59 pm. Once it changes to 5:00 pm, has 30 seconds elapsed? Not necessarily! What if you glanced at the clock a mere second before it changed? To ensure that a full 30 seconds has elapsed you would actually have to wait until 5:01 pm to guarantee that at least 30 seconds has passed, but up to 1 minute 59 seconds could have passed!

This is the nature of resolution: how often you can check the clock and have it tell you a different value. Now, computers do things a LOT faster than even once a second. Computers can do things in the nanosecond range (1/1,000,000,000 of a second) or even faster! When I started up the Windows 10 machine that I'm writing this post with, the resolution was 15.625 milliseconds (~0.015 sec). That's WAY slower than 0.000000001 sec! In fact, at that resolution the clock only updates 64 times per second, which can be slower than some frame rates that people get when playing King of the Kill.

When we do Windows programming and set a timer or tell a thread to sleep, we specify values in milliseconds (1/1,000 sec). But if Windows is only checking the clock every 15.625 milliseconds, a 1 millisecond timer can end up waiting 15.625 milliseconds, which is more than a whole frame in some cases. Obviously we want Windows to check the clock much more often than every 15.625 milliseconds.
This is about how often I check my phone. And I always forget to check the time.
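
To make the effect concrete, here is a minimal sketch (not game code; the numbers and names are mine) that measures how long Sleep(1) actually takes before and after requesting 1 ms resolution with timeBeginPeriod:

```cpp
// Minimal sketch: measure how long Sleep(1) really takes at the default
// timer resolution versus after requesting 1 ms with timeBeginPeriod.
// Illustration only, not code from any Daybreak title.
#include <windows.h>
#include <mmsystem.h>
#include <chrono>
#include <cstdio>

#pragma comment(lib, "winmm.lib") // timeBeginPeriod/timeEndPeriod live in winmm

static double AverageSleepMs(int iterations)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        Sleep(1); // ask for 1 ms; the actual wait depends on timer resolution
    auto end = clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}

int main()
{
    printf("default resolution : %.2f ms per Sleep(1)\n", AverageSleepMs(100));

    timeBeginPeriod(1); // request 1 ms system timer resolution
    printf("1 ms resolution    : %.2f ms per Sleep(1)\n", AverageSleepMs(100));
    timeEndPeriod(1);   // be a good citizen and restore it
}
```

On a machine sitting at the default 15.625 ms resolution, the first number typically lands somewhere near 15 ms per call and the second near 1-2 ms.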

Yes but does it WORK?

At first I thought this tool might have some confirmation bias behind it, but after digging into it, I'm going to say that it's plausible that it has a positive effect on gameplay. Windows has but one internal timer, and it's shared by everything running on the system. When Daybreak titles start up, we tell Windows that we want 1 millisecond resolution on the system timer. But, to be good software citizens, we tell Windows to set the timer resolution back to what it was when the game exits. Seems reasonable, right?

Now imagine that everything does that. Say you start up Some App(tm) that sets the timer resolution from 15 ms to 1 ms, then you start up King of the Kill, which also tries to set it to 1 ms. Then you shut down Some App. Being a good software citizen too, it sets the timer back to 15 ms. But you're still playing King of the Kill! Now you might see some different behaviors, like micro-stutters, hits that should have landed being missed, etc. The game is doing what it's supposed to, but something happened that it didn't expect: the system timer got set back to low resolution. The Set Timer Resolution tool doesn't appear to continually update the Current Resolution display, but I believe it will try periodically to make sure the system timer is at the selected resolution. EDIT: Lucas Hale (the author of SetTimerResolution) commented below to let me know that this assumption is invalid. It appears that Windows will take the maximum resolution requested by any running application. So if the game client requests 1 ms resolution and SetTimerResolution requests 0.5 ms resolution, it will take the latter. This is good, as it makes the devs' lives easier!

How does Set Timer Resolution work?

This section is not for the technically faint-at-heart. I'm going to wax programmatically on you. First of all, when the game starts up, we call a documented function called timeBeginPeriod with the minimum value reported by the timeGetDevCaps function (generally 1 [millisecond]). This would probably work fine in many cases as long as our game is the only thing running on the machine. But that is never the case. Little programs that do behind-the-scenes things can start and end and do all sorts of stuff. Browsers can be running with multiple tabs open. Streamer software. Video recording software. Etc. If any of those things can affect the system timer while the game is running, then bad things happen.
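
The startup pattern described above looks roughly like this; it's a sketch of the documented API usage, not the actual game source:

```cpp
// Sketch of the documented startup pattern described above:
// ask the multimedia timer API for its minimum supported period
// and request that resolution for the lifetime of the game.
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

static UINT g_timerPeriod = 0;

void RaiseTimerResolutionAtStartup()
{
    TIMECAPS caps{};
    if (timeGetDevCaps(&caps, sizeof(caps)) == TIMERR_NOERROR)
    {
        g_timerPeriod = caps.wPeriodMin;   // generally 1 ms
        timeBeginPeriod(g_timerPeriod);    // request high resolution
    }
}

void RestoreTimerResolutionAtShutdown()
{
    if (g_timerPeriod != 0)
        timeEndPeriod(g_timerPeriod);      // "good citizen" restore
}
```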

It looks like Set Timer Resolution goes even deeper than the multimedia functions (like timeBeginPeriod) that our titles are calling. It goes straight to the kernel, the heart of every operating system. It looks like it's calling some undocumented functions that wrap kernel calls: NtQueryTimerResolution and NtSetTimerResolution. These are likely what the multimedia functions our titles use end up calling deeper down.
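
For the curious, here is a hedged sketch of how a tool could call those functions directly. Because they are undocumented, the signatures below are the commonly reported ones (values are in 100-nanosecond units) and could change in any Windows release:

```cpp
// Hedged sketch of talking to the kernel timer resolution directly.
// These functions are undocumented; the signatures are the commonly
// reported ones and are not guaranteed by Microsoft.
#include <windows.h>
#include <cstdio>

typedef LONG(NTAPI* NtQueryTimerResolutionFn)(PULONG Minimum, PULONG Maximum, PULONG Current);
typedef LONG(NTAPI* NtSetTimerResolutionFn)(ULONG Desired, BOOLEAN Set, PULONG Current);

int main()
{
    HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");
    auto query = (NtQueryTimerResolutionFn)GetProcAddress(ntdll, "NtQueryTimerResolution");
    auto set   = (NtSetTimerResolutionFn)GetProcAddress(ntdll, "NtSetTimerResolution");
    if (!query || !set)
        return 1;

    ULONG coarsest = 0, finest = 0, current = 0;
    query(&coarsest, &finest, &current);
    printf("coarsest %lu, finest %lu, current %lu (100-ns units)\n", coarsest, finest, current);

    set(finest, TRUE, &current);   // request the finest resolution the system supports
    printf("now %lu (100-ns units)\n", current);
}
```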

So where do we go from here?

I'd like to make Set Timer Resolution completely unnecessary. Since the game is already trying to set the timer resolution at startup, it seems like we could be doing a better job of making sure it stays set. I'll evaluate this against our current priorities and talk with the team about getting this in an upcoming hotfix.
Boom.

20 February 2017

Which patch? Dispatch.

**Author's note: I started writing this post a year ago, hence the references are a bit dated, but the content is still relevant.

At the behest of the good people of reddit, I figured I would talk a little bit about a recent issue that cropped up in Planetside 2: a runaway thread in our threading library that went mostly unnoticed and caused reduced performance.

Threads Are Hard

Writing good multi-threaded code isn't easy. I'd challenge the linked article's author on this: multi-threaded programming actually is difficult. Sure, some of it boils down to programmers not following best practices, but there are several other facets that you don't encounter in single-threaded programming:
  • Synchronization
    • Reads and Writes to data must be synchronized
    • Compiler optimizations may subtly affect program operation
    • Lock-less programming considerations
    • Possibility of dead-locking or live-locking
  • Non-determinism. This basically means that the program doesn't run the same way every time. This has several repercussions:
    • Difficult-to-reproduce issues
    • Testing/error logging can mask issues or create false positives (similar to the Observer effect)
    • Bugs become statistical (problems occur with a low enough frequency that you don't see them until they reach the Live environment)
Woody shares my expression.
However, hardware is increasingly delivering its power through larger numbers of cores rather than faster individual cores. To take advantage of that power, you need threads.

Concurrency

Threads are a solution to the problem of trying to make a computer seem to do multiple things at the same time. In the olden days, threads didn't exist, but systems could start other processes by forking the process. This would create a copy of the process that could do different things without affecting the parent process. Because it is a copy, changes that one process made to its memory and variables wouldn't affect the other process. The two processes would still be able to communicate via inter-process communication (IPC), but this is generally much slower than, say, setting a variable within a process.
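
For readers who haven't run into it, forking looks roughly like this on a POSIX system (a minimal sketch; Windows doesn't expose fork directly):

```cpp
// Minimal POSIX sketch: the child gets a copy of the parent's memory,
// so changing `value` in the child does not affect the parent.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int value = 1;
    pid_t pid = fork();            // copy the current process

    if (pid == 0)                  // child process
    {
        value = 42;                // only changes the child's copy
        printf("child sees %d\n", value);
        return 0;
    }

    waitpid(pid, nullptr, 0);      // parent waits for the child
    printf("parent still sees %d\n", value);   // prints 1
}
```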

Like my threads?
But sometimes you want one process to be able to do multiple things at the same time. That's where threads come into play. Since threads allow multiple things to happen "simultaneously" within the same process, you have concurrency.

I put "simultaneously" in quotes because multi-processor systems of yore were generally limited to server-class hardware. It was uncommon to find a user's home machine with more than one processor. This meant that the system could really only do one thing at a time, but it looked like it was doing things simultaneously because it was switching threads (i.e. the things it was doing) very, very quickly. These days, everything is multi-processor. My house thermostat is probably multi-CPU (ok, not really). The focus in hardware shifted from doing one thing very fast (higher clock speed aka GHz on CPUs) to doing many things pretty fast at the same time. The previous console generation (PlayStation 3 and Xbox 360) had three to four CPUs, whereas today's console generation (PlayStation 4 and Xbox One) has eight. My work computer has 12 "logical" cores.

Early threading involved creating threads for very specific tasks. For example, EverQuest II largely runs in a single thread, but creates specific threads for loading files, talking to the streaming asset server, updating particle effects, etc. Most of the time those specific threads are doing nothing; they're just sleeping. As the number of processors in a system grew, it became less practical to have dedicated, specific threads, especially when the number of processors differs from system to system.

Synchronization

Let's take a break for a second to talk about a related topic. Synchronization is a big, huge, gargantuan topic wherein lie most of the problems with multi-threading. I'm only going to say a few words about synchronization.
Yep, like that.
Nearly everything in the computer system can be considered a resource that must be shared: files, memory, CPU time, DVD drives, graphics, etc. What must be shared must be synchronized so that separate threads don't counter-productively stomp on each other. Process memory is probably the most often shared and problematic resource because everything interacts with it. Something like a file is fairly easy to synchronize because nearly every access of it requires a function call. Memory, on the other hand... For instance, here is an example of a function that just increments a number. What would the value be if you had two threads on a multi-processor machine calling this function 1000 times?
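
A minimal sketch of that function and the two-thread experiment (names are illustrative):

```cpp
// Two threads hammering a plain increment on a shared counter.
// The increment is a read-modify-write, so updates can be lost.
#include <thread>
#include <cstdio>

int g_counter = 0;

void IncrementCounter()
{
    ++g_counter;   // read, add one, write back -- not atomic
}

int main()
{
    auto work = [] { for (int i = 0; i < 1000; ++i) IncrementCounter(); };
    std::thread a(work), b(work);
    a.join();
    b.join();
    printf("%d\n", g_counter);   // usually less than 2000
}
```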

Hint: it's usually not 2000. Surprise! Incrementing a number is not an atomic operation; it's actually a read-modify-write operation. This is one of those things that makes multi-threaded programming so hard! To protect the section of code that increments the number, you have to use some sort of synchronization primitive, like an atomic intrinsic, a mutex, a spin-lock, or the like.
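
For completeness, here is the same sketch with the simplest of those fixes, std::atomic; a mutex or spin-lock around the increment would also work:

```cpp
// Same sketch with std::atomic: the increment becomes an atomic
// read-modify-write, so neither thread's updates are lost.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> g_counter{0};

int main()
{
    auto work = [] { for (int i = 0; i < 1000; ++i) ++g_counter; };
    std::thread a(work), b(work);
    a.join();
    b.join();
    printf("%d\n", g_counter.load());   // always 2000
}
```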

Tasking

When you have more processors available, it makes more sense to break problems down into logical tasks or units of work. Instead of having a dedicated thread to load files, now you just have a "load file" task. You have a "collect garbage" task. You have an "animate entity" task. Task, task, task.

Khaaaaaaaannn!!!

This concept of tasks helps you to fill all available processors with work to do. Theoretically, if you can keep the task backlog full, CPUs will always have work to do. If you're using 100% of the available processors, you're doing the maximum amount of work that the system can do.

Dispatch

In 2009, Apple launched Mac OS X 10.6 with a programming API (libdispatch) marketed with the name Grand Central Dispatch (GCD). It is, among other things, a generic task execution system. You have a function or a task that you want to run in the main thread or a separate thread at specific priority levels, immediately or at a scheduled time in the future. You just take that function or task and throw it on a "dispatch queue" and let it run. Simple. Powerful. Efficient. I quickly fell in love with what GCD could offer when I used it for my iPhone game, Bust a Mine.
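
To give a flavor of the API, here is a tiny generic sketch using the plain C entry points, which is how it can be called from C++ (this is not code from Bust a Mine):

```cpp
// Tiny GCD sketch using the function-pointer variants of the API,
// callable from plain C or C++ (no Objective-C blocks needed).
#include <dispatch/dispatch.h>
#include <cstdint>
#include <cstdio>

static void DoWork(void* context)
{
    printf("task %ld running on a worker thread\n", (long)(intptr_t)context);
}

int main()
{
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();

    // Throw a few tasks on the queue and let the system schedule them.
    for (intptr_t i = 0; i < 4; ++i)
        dispatch_group_async_f(group, queue, (void*)i, DoWork);

    // Wait for every task in the group to finish.
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    dispatch_release(group);
}
```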

Fast forward to mid-2014. I became Technical Director of Planetside 2. The team was working on porting Planetside 2 to the PlayStation 4. Performance profiling was showing that the CPUs on the PS4 were slower than the average CPUs on our PC players' machines, and Planetside 2 was still largely single-threaded. We started looking at threading technologies like Intel's Threading Building Blocks, OpenMP, and even the C++11 thread support library. However, given my experience with libdispatch and the approach of looking at the problem as tasks rather than dedicated threads, we decided to look around for something similar. We found xdispatch, a port of libdispatch to Windows and Linux (libdispatch was originally written for Mac OS X, which is based on BSD). However, it had some issues: namely, it didn't support the PS4 (few things did) and was based on a much older version of libdispatch. We began adapting it to the PS4, and it gave us a solid framework to start multi-threading Planetside 2.

Adaptive Tuning

We developed a threading sub-team on the PS2-on-PS4 project that had a single requirement (increase performance) and two primary points of attack: 1) tune xdispatch to do what we needed, and 2) adapt existing threads and operations into tasks that could be done concurrently. Ideally these changes would carry over to the PC version of the game as well.

Both of these facets were challenging. On the tuning front, we discovered that part of the reason GCD works so well on Mac OS X is that the kernel--the core of the operating system itself--controls the dispatch scheduling. We didn't have that ability on the PS4, nor could we get information about how busy the system was! We went through several iterations on how to deal with this, but eventually settled on working around it by setting up some guidelines--the standard dispatch queues would only be used for CPU-intensive work (calculation, animation, decompression, etc.) and we would limit them to the number of CPUs available.
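
Our internal setup isn't public, but the heart of that guideline (cap the CPU-bound workers at the number of hardware threads and feed them from a shared queue) can be sketched in portable C++ along these lines:

```cpp
// Sketch of the guideline only (the real Dispatch library is internal):
// spin up at most one CPU-bound worker per hardware thread and feed
// them from a shared task queue.
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class CpuBoundPool
{
public:
    CpuBoundPool()
    {
        unsigned count = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { Run(); });
    }

    ~CpuBoundPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        wake_.notify_all();
        for (auto& worker : workers_) worker.join();
    }

    void Post(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // run the CPU-intensive work outside the lock
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool done_ = false;
};
```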

The second facet was a much longer pole that would continue throughout the project. Converting existing dedicated threads and PS2's old job-queue system to Dispatch was easy and quickly done, but that didn't net nearly the performance that we would need to see to be viable on PS4. We would have to go much deeper. This would involve taking core aspects of PS2's main loop and breaking them into tasks--entity processing, physics, animation, rendering, etc. This is difficult to do with C++ because non-thread-safe side-effects are nearly impossible to find; we would have to identify everything non-thread-safe that was happening in the single-threaded code before converting it to tasks.

OMG Bugs

That was the song, right?

The reddit post that originally spawned this blog post offers some insight into a problem that still existed a year after the PS4 version originally launched. Namely, PC players identified that a thread would take up a whole CPU, but they found that if they killed it, performance got better and nothing bad really happened (or at least nothing immediately visible). This was found to be a bug in timers in xdispatch that, to my knowledge, still exists. You can read the above link for more technical information, but it had to do with bad assumptions in the PC port of software originally written for BSD. Shockingly, the problem also existed in the PS4 build even though it shouldn't have. It turned out the timer implementation in xdispatch (and the libdispatch version it was based on) was functional but not very efficient, so we wrote a new timer outside of xdispatch and used that instead.

Still later, we finally got a handle on one of our long-standing (post-PS4-launch) crash bugs. This was a crash bug that we had never seen internally (see my above point about bugs becoming a statistical problem). It looked like a memory corruption problem, which just made all of the programmers reading this shudder in horror. Memory corruption is terrible. It is an evil problem and few good tools exist to locate it assuming you can figure out how to make it happen. But find it we did, and it was also an issue with converting systems to multi-threading. In this case, an 'animation complete' flag was in a data structure that was getting freed before the task performing the operation finished and set the flag. Since the memory was freed before the flag was set, sometimes the memory had been reused for other things, hence the corruption. This was a problem not with Dispatch itself, but with how a previously single-threaded operation had been converted to a task.
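
The actual code isn't public, but the shape of the bug (and of one possible fix) was roughly this, with hypothetical names:

```cpp
// Hypothetical reconstruction of the bug's shape, not the actual game code.
#include <memory>
#include <thread>

struct AnimationState
{
    bool complete = false;
};

void Buggy()
{
    auto* state = new AnimationState;

    std::thread task([state] {
        // ... perform the animation work ...
        state->complete = true;      // may run after the delete below!
    });
    task.detach();

    delete state;                    // freed while the task may still be running
}

void Safer()
{
    auto state = std::make_shared<AnimationState>();

    std::thread task([state] {       // the task keeps the state alive
        state->complete = true;
    });
    task.detach();
}   // memory is released only after the last reference goes away
```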

Most recently, we began hearing reports of 'lock-ups' and 'hangs' from PS4 players. This coincided with an update to PS4 SDK 3.500 (from 2.000) for Planetside 2, which, among other things, gave us an additional half of a CPU to use for game logic (with 2.000 the system would reserve two whole CPUs for itself, whereas after 3.000 it only used one-and-a-half). Because of this, we ramped up Dispatch (now a completely retooled version no longer based at all on xdispatch but based on a more current version of libdispatch) to take advantage of that half-a-CPU. Eventually we determined this to be the cause of the lockup, but for unexpected reasons.

The game was not experiencing a dead-lock, but a form of live-lock; all of the CPUs were running threads, they just weren't making any progress. This was because of a degenerate case between the design of libdispatch and the PS4 scheduler. Basically, the internals of libdispatch (and our Dispatch library that was based on it) are lock-less--they are doing atomic operations rather than locking mutexes. Some of these atomic operations are loops waiting for conditions to be met, or multiple compare-and-exchange operations in a loop; they try to do an operation based on old information and retry with updated information if it fails. But the PS4 scheduler will not run any thread of a lower priority if a higher priority thread is runnable. We could end up in a state where a lower-priority thread would be preempted after it changed conditions for a higher-priority thread. This would cause a sort of priority inversion that would never resolve. Most operating systems will at least give some time to lower priority threads to prevent starvation, but the PS4 does not. Indeed, the default scheduler for the PS4 is the FIFO scheduler, but even the round-robin scheduler will not run lower-priority threads.

Our solution to this involved applying a progressive algorithm that would eventually put threads to sleep in extreme cases in order to allow the live-lock to resolve. Generally this might look like a slight momentary dip in frame rate or may not even be noticed at all.
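
The real fix lives inside our internal Dispatch library, but the general shape of a progressive back-off in a compare-and-exchange retry loop looks something like this sketch:

```cpp
// Sketch of the shape of the fix: a compare-and-exchange retry loop with
// a progressive back-off that first yields and eventually really sleeps,
// so a preempted lower-priority thread gets CPU time and the live-lock
// can resolve.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

static void BackOff(int attempt)
{
    if (attempt < 16)
        std::this_thread::yield();                   // cheap: give up the time slice
    else
        std::this_thread::sleep_for(                 // extreme case: really sleep
            std::chrono::microseconds(1 << std::min(attempt - 16, 10)));
}

// Atomically apply `transform` to a shared value, retrying on contention.
template <typename Transform>
int UpdateShared(std::atomic<int>& shared, Transform transform)
{
    int expected = shared.load(std::memory_order_relaxed);
    for (int attempt = 0; ; ++attempt)
    {
        // Try to install the new value based on what we last observed.
        if (shared.compare_exchange_weak(expected, transform(expected),
                                         std::memory_order_acq_rel))
            return expected;

        // Someone else changed it first; compare_exchange_weak reloaded
        // `expected`, so back off progressively and retry.
        BackOff(attempt);
    }
}
```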

Looking Forward

Make sure your blinker is on.

We're always looking for ways to increase performance across all platforms. As other Daybreak games ramp up, we're finding new ways to eke out higher frame-rates and sharing that knowledge among the teams. Our internal Dispatch implementation is moving into other Daybreak titles and future projects, and it all started here, on Planetside 2. Efforts have been made to keep these types of changes in parity between the different games.