27 September 2008

EQII Getting Multicore Support... Wait, what?!

An interesting note showed up in some recent EQII Test Patch Notes. Yes, it's true: we're working on "Multicore support" for EQII and plan to release the first iteration with Game Update 49. "Multicore support" is a fan-friendly way of saying that the client will use multiple threads of execution to take advantage of multiple CPUs.

I thought I'd take some time to explain what led up to this and offer some explanation of the difficulties of threading. EQII was released in late 2004, but development actually started about five years earlier. In 2004, multiple-CPU machines were generally very expensive and reserved for servers and high-end workstations; mainstream machines had a single CPU and relied on increasing clock speeds to keep getting faster. Late in EQII's development, Intel was starting to introduce hyper-threaded processors. Even though some operating systems would detect a hyper-threaded processor as two CPUs, it was really just one. The game targeted hardware that was advanced for its day (in some ways) and launched largely with the expectation that CPUs would simply keep getting faster and faster.

However, clock speed increases started decelerating. Now we find ourselves with machines that have smaller clock speed increases but more CPUs. Dual- and quad-core machines are becoming quite commonplace, and AMD has even introduced a triple-core chip.

When an MMO is under development, the core engine is designed toward the front of the project so that the tool chains and art pipelines can be created, allowing more people to be more productive. The engine is then refined over the course of the project, but single-threaded vs. multi-threaded (multicore) is something that should be decided early in the engine design. The reason is simple: multi-threaded programming can be significantly different from single-threaded programming and, unless you've done it every day for 20 years, is decidedly NOT easier than writing network code.

Let me explain how single- and multi-threaded programming are different with a simple example. Single-threaded programming is like driving down a single-lane highway: you're going fast and don't really have to stop for anything. Multi-threaded programming is like two (or more, but we'll use two for this example) highways, one running north-south and the other running east-west. At some point they intersect. What happens if there are no traffic lights? Disaster. It's the same with multi-threaded programming. You have to find all of the intersections (programmers call them critical sections) and put traffic lights (synchronization primitives) on them. Intersections on a highway are easy to see; unfortunately, they're not so easy to see in code. The other thing to consider is that you have to stop at an intersection and wait your turn. It's the same with threads, and that's one of the reasons why adding a second CPU doesn't give you 100% more performance.
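The traffic-light analogy maps directly onto code. Here's a minimal sketch in modern C++ (EQII itself predates std::thread, so take this as an illustration rather than actual engine code): two threads increment a shared counter, and the mutex is the traffic light that makes them take turns through the critical section.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Shared state touched by both "highways" (threads).
static long g_counter = 0;
static std::mutex g_counterMutex; // the "traffic light"

// Without the lock, both threads could read-modify-write g_counter at the
// same time and lose updates; with it, each thread waits for its turn.
void addMany(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(g_counterMutex); // stop at the light
        ++g_counter;                                      // the critical section
    }                                                     // lock released here
}

long runBothThreads(int iterations) {
    g_counter = 0;
    std::thread north(addMany, iterations); // north-south highway
    std::thread east(addMany, iterations);  // east-west highway
    north.join();
    east.join();
    return g_counter;
}
```

Note the cost hiding in there: every trip through the loop stops at the intersection, which is exactly why two CPUs don't give you double the performance.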

The first attempt at getting the additional cores to help out used a technology very similar to OpenMP. I took loops that iterate over lots of things (particles, vertex transformations, degenerate triangles in shadow meshes, etc.) and parallelized them. Here's how it works: say you have 1000 numbers and you just have to add one to each of them. A single thread would step through all 1000 numbers, adding one to each. OpenMP lets you divide those 1000 numbers evenly among all processors. Unfortunately, this didn't give us any meaningful performance gain, as whatever we gained was outweighed by the time it takes to synchronize and hand the data off to other threads.
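To make the loop-splitting idea concrete, here's a sketch of the 1000-numbers example using an actual OpenMP pragma (the engine used something similar to OpenMP, not OpenMP itself). Compiled with OpenMP support (e.g. -fopenmp), the iterations are divided evenly among the processors; without it, the pragma is ignored and the loop runs serially with the same result:

```cpp
#include <cassert>
#include <vector>

// Add one to each element. With OpenMP enabled, the loop's iterations are
// split evenly across the available processors; each thread handles its
// own chunk of the vector with no overlap, so no locking is needed inside
// the loop body.
void addOneToAll(std::vector<int>& numbers) {
    const int count = static_cast<int>(numbers.size());
#pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        numbers[i] += 1;
    }
}
```

For work this trivial, the fork/join and hand-off overhead around the loop can easily exceed the time saved, which is exactly the result described above.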

The second attempt was to take the specific system where we spend most of our CPU time (animation and vertex transformation) and divide it into large chunks that can easily be handed off to another thread. This turned out to work very well and netted a 10-15% frame rate gain in populous places. Another bonus was that it was mostly easy to integrate into the single-threaded engine. It works by trying to do animations before the main game thread needs them. The first time something is animated in the main thread, it gets added to a list of items that need animation. The next frame, the animation thread sees these items and can animate them before the main thread needs them. In some cases, the main thread needs one before the animation thread has gotten around to it. No worries! If something hasn't been animated yet, the main thread can take care of it. If the animation thread comes across something that the main thread is using, it just skips it and goes on to the next item. Useful as this system is, the drawbacks are that we only do animation in a separate thread, and one extra thread can really only make use of one extra processor (so quad-core doesn't have a great advantage over dual-core [yet], at least as far as EQII is concerned).
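Here's a hypothetical sketch of that claim-or-skip scheme (the names and the atomic-flag approach are my illustration, not EQII's actual code): whichever thread reaches an item first animates it, and the other thread simply moves on without ever blocking.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// One entry in the "needs animation" list.
struct AnimatedItem {
    std::atomic<bool> animated{false};
    int frames = 0; // stand-in for real animation state
};

// Claim an item and animate it. Returns false if another thread already
// claimed it; only one thread can ever flip the flag from false to true.
bool animateIfUnclaimed(AnimatedItem& item) {
    bool expected = false;
    if (!item.animated.compare_exchange_strong(expected, true))
        return false; // someone else got here first: skip it, don't wait
    ++item.frames;    // the actual animation work would happen here
    return true;
}

// Animation thread: walk the list, animating whatever hasn't been claimed.
void animationThreadPass(std::vector<AnimatedItem>& items) {
    for (AnimatedItem& item : items)
        animateIfUnclaimed(item);
}

// Main thread: needs this item *now*. If the animation thread hasn't
// gotten around to it yet, the main thread just does the work itself.
void mainThreadUse(AnimatedItem& item) {
    animateIfUnclaimed(item); // no worries either way
    // ... render using item ...
}
```

The appeal of this shape is that neither thread ever waits on the other; the worst case is simply that the main thread does the work itself, exactly as the single-threaded engine always did.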

So what about the future? There are plenty of systems that we could offload in a similar manner to other threads: shadows, particles, maybe even scenegraph traversal or collision! However, just creating threads for each subtask isn't really the best way to help out. What I'd rather do is create individual tasks for each item (Animate, ExecuteParticleSystem, ComputeShadow, etc.) that can be doled out to worker threads, similar to Intel's Threading Building Blocks. This would give us the best fit for each processor type and support even higher processor counts in the future! Now that the proof of concept has been shown to work and the fan response so far is very favorable, all we need is time.
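A task-based scheme along those lines might look something like this sketch (my own minimal illustration, not TBB and not EQII code): a queue of generic tasks and a fixed pool of worker threads that pull from it, so the same code scales from dual- to quad-core and beyond.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal task pool: submit() queues a unit of work (an Animate, a
// ComputeShadow, ...) and any idle worker picks it up. The destructor
// drains the queue and joins the workers.
class TaskPool {
public:
    explicit TaskPool(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~TaskPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return; // shutting down, nothing left
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so workers don't serialize
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool done_ = false;
};
```

A real engine would add per-frame synchronization points and task priorities, but the core shape (many small tasks, a fixed pool of workers) is what lets the work spread across however many cores are present.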

I hope this post has given you a little bit of an idea why multicore support is not simple, especially on a game engine that was not particularly designed with threading in mind.

13 September 2008


A few days ago I read an article about (and subsequently ordered) a kit to build Mitch Altman's TV-B-Gone. I haven't had the opportunity to build anything electronic in quite a while, so I jumped at the chance.

It came yesterday and I enlisted the help of my 6-year-old son to put it together today.

It's a fairly simple kit with an 8-pin microcontroller, 4 transistors, 2 capacitors, 5 resistors and 5 LEDs (oh, and a push button).

Assembly was a little slow as my son wanted to know what each part did and was worried about his personal safety considering that he equates all electricity with being shocked. I enjoyed explaining it to him and asking him to show me where the parts should go.

I whipped out the old soldering iron and tools and went to town. Putting together electronic kits reminded me of my childhood days assembling HeathKits and tinkering with my 200-in-1 electronics lab (and later, real breadboards and etching my own circuit boards). Ah, the good old days. I'm not exactly sure when I decided to switch from hardware to software, and while I love programming, it's always nice to smell the solder once in a while.

Here's a short video of the initial test of the product:

Needless to say, eating out should be a little more enjoyable now!

STL: A Love-and-Hate Relationship

I'm a big fan of the STL and specifically the containers. Usually. The interfaces generally make sense, code can be written around them fairly quickly, and they interoperate well. However, there are some problems that really start to become apparent on larger projects. Consider these search results on EQII's codebase:
  • "std::map" - 2460
  • "std::set" - 1574
  • "std::list" - 393

The EQII codebase has nearly 3 million lines of C/C++/C# alone (client, server, and tools; not counting server-side Lua scripts and data files) and uses STL containers liberally. If I had to do it all over again, I believe I'd forgo using STL containers for a few reasons:
  • Memory Usage - STL allocators typically hold on to memory forever and reuse it as necessary. However, EQII has a very good memory manager with a constant-time, free-able small-block allocator (written by yours truly), reporting, leak detection, etc. Writing custom allocators for the STL is a major pain (especially on an established codebase), and I'd really rather not hack STLport or the GNU implementation to use our memory manager.
  • Compile Times - The more templated code you have, the longer the compile and link times. I've talked previously about how we replaced std::vector with our own class to save compile times. Also, have you looked at STL implementation code? It's bloated and ugly, which brings me to my next point...
  • Debugging - Trying to dig through a container in a debugger watch window (or GDB) is a pain. Sometimes it's even a matter of obscure memory offsets, and you have to calculate addresses by hand to see the values in a map.
  • Additional Functionality - If you have a vector that you're not using anymore and you want to free its memory, clear() doesn't cut it. The code that does is neither trivial nor intuitive: std::vector< T >().swap( myVec ). And adding your own functions isn't easy, as the STL containers are deliberately non-virtual.
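To see why that last bullet matters, here's a small sketch of clear() versus the swap trick (assuming a typical STL implementation, where clear() keeps the allocation around for reuse):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// clear() empties the vector but, on common implementations, leaves its
// capacity (and therefore its memory) untouched.
std::size_t capacityAfterClear(std::size_t count) {
    std::vector<int> v(count, 42);
    v.clear();            // size becomes 0; capacity is usually unchanged
    return v.capacity();
}

// Swapping with an empty temporary hands the storage to the temporary,
// which frees it when it goes out of scope at the end of the statement.
std::size_t capacityAfterSwapTrick(std::size_t count) {
    std::vector<int> v(count, 42);
    std::vector<int>().swap(v); // the idiom from the bullet above
    return v.capacity();
}
```

Later C++ standards added shrink_to_fit(), but even that is only a non-binding request to the implementation, so the swap trick remains the reliable idiom.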

I'll probably look for a different container solution for my next project. Home-grown containers can emulate the STL containers and provide everything I listed above. They can even remain compatible with STL functors and algorithms.