26 March 2008

Infinite Loops Are Bad

When working on an MMO server, one thing that you're never going to get rid of is crashes. We generally don't like to admit it, but it's true. Crashes can be caused by many things including bad software, bad hardware, bad programmers, bad state, bad anything. Linux (and Windows) MMO server developers typically have crash recovery down to a science: dump a core file (or minidump under Windows), mail it off to the programmers and restart the process.

Worse than crashes (and hopefully less frequent) is a little problem known as Infinite Loops. If a server process locks up, how do you get useful information in order to fix it? We want to treat Infinite Loops like crashes and get a core file or minidump that shows where the lockup occurred and hopefully why.

The concept behind detecting an infinite loop is trivial: Start the main game loop as a separate thread and increment a value over time (say, every main loop iteration). The initial thread then looks for this value to be changed over time. If the value goes long enough without having changed, your process can be considered to have locked up. I'll leave it as an exercise for the reader.
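For readers who would rather not do the exercise, here is a rough sketch of the idea, assuming C++11 atomics are available. The names Heartbeat, beat() and watchForLockup() are made up for this example and are not from any real server code; the poll interval and stall threshold are placeholders you would tune yourself.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical names throughout; this is only an illustration.
struct Heartbeat
{
    std::atomic<unsigned long> ticks{ 0 };

    // Called by the game thread once per main loop iteration.
    void beat() { ticks.fetch_add( 1, std::memory_order_relaxed ); }
};

// Polls the heartbeat every pollInterval. Returns true once the counter has
// gone maxStalledPolls consecutive polls without changing (treated as a
// lockup), or false once keepRunning becomes false.
bool watchForLockup( const Heartbeat& hb,
                     const std::atomic<bool>& keepRunning,
                     std::chrono::milliseconds pollInterval,
                     unsigned maxStalledPolls )
{
    unsigned long lastSeen = hb.ticks.load( std::memory_order_relaxed );
    unsigned stalledPolls = 0;

    while ( keepRunning.load( std::memory_order_relaxed ) )
    {
        std::this_thread::sleep_for( pollInterval );

        const unsigned long now = hb.ticks.load( std::memory_order_relaxed );
        if ( now != lastSeen )
        {
            lastSeen = now;     // progress was made; reset the stall count
            stalledPolls = 0;
        }
        else if ( ++stalledPolls >= maxStalledPolls )
        {
            return true;        // no progress for too long: assume a lockup
        }
    }
    return false;
}
```

The game thread would call beat() at the top of each main loop iteration, and the initial thread would sit in watchForLockup(). When it returns true, that is the point at which you make the other thread drop its core or minidump.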

So how does our infinite loop detection thread get the other thread to drop useful information?

In Linux (and most other flavors of *nix) it's really easy. Just do a pthread_kill() to the locked-up game thread with a signal that drops a core (SIGABRT usually works nicely) and then _exit().

For Windows, there's a little more code to write and you have to have dbghelp.dll. Basically, just write a minidump file. The normal minidump file should have information on all threads, so even though you're writing it from the main thread doing the infinite loop detection, it will include all the necessary info about your locked-up thread.
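A minimal sketch of the Windows side might look like the following. The helper name writeLockupDump() and the file path are made up for illustration; the calls themselves (CreateFileA, MiniDumpWriteDump, CloseHandle) are the standard Win32 and dbghelp ones, and you would need to link against dbghelp.lib. The non-Windows branch is only a stub so the example compiles anywhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>   // link against dbghelp.lib

// Hypothetical helper: writes a minidump of the current process to `path`.
// Returns true on success.
bool writeLockupDump( const char* path )
{
    HANDLE file = CreateFileA( path, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
    if ( file == INVALID_HANDLE_VALUE )
        return false;

    // No exception record is passed: nothing crashed, we just want a
    // snapshot. MiniDumpNormal records enough to get a stack trace for
    // every thread, which is what shows where the game thread is stuck.
    BOOL ok = MiniDumpWriteDump( GetCurrentProcess(), GetCurrentProcessId(),
                                 file, MiniDumpNormal, NULL, NULL, NULL );
    CloseHandle( file );
    return ok != FALSE;
}
#else
// Stub so the sketch still compiles elsewhere; dbghelp is Windows-only.
bool writeLockupDump( const char* ) { return false; }
#endif
```

The detection thread would call something like writeLockupDump( "lockup.dmp" ) and then terminate the process so it can be restarted.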

Here's a very simple sample of a Linux program for making a thread drop a core:
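#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>

// Stand-in for a locked-up game thread: spins forever.
void* threadMain( void* )
{
    while ( 1 )
        ;
    return 0;
}

int main( int argc, char** argv )
{
    pthread_t id;
    // pthread_create() returns an error number instead of setting errno,
    // so report it with strerror() rather than perror().
    int err = pthread_create( &id, 0, threadMain, 0 );
    if ( 0 != err )
    {
        fprintf( stderr, "pthread_create failed: %s\n", strerror( err ) );
        return 1;
    }

    // Give the thread a moment to start "locking up".
    sleep( 1 );

    // SIGABRT's default action dumps core. Sending it with pthread_kill()
    // directs the signal at the spinning thread.
    pthread_kill( id, SIGABRT );
    _exit( 0xabcd );
}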


Why am I writing about this? Previously EQ2's infinite loop detection would essentially do this:
assert( 0 && "INFINITE LOOP DETECTED" );
Which would crash the process and cause it to be restarted, but doesn't give us any useful information about where or why it locked up. In a past life when working on UO we didn't even have any infinite loop detection, so a locked-up server would just blackhole people and prevent the shard from doing synchronized backups. On EQ2 we don't typically see a lot of infinite loops at this point, but it's recently been changed to give us more information. And that's not something that you're likely to see in the patch notes ;)

Edit 3/29/08 10:23 PM:
Response based on a comment by KC (see comment below). Unfortunately, Blogger doesn't support <pre> or <code> tags :P
Infinite Loops in EQ2 are usually caused by programming errors. Our script system (Lua) has an instruction limit that the EQ2 team added so designers can't cause infinite loops. An unending dialog cycle (if I'm interpreting you correctly) wouldn't be an infinite loop since it requires user interaction (the server is waiting on the client before serving up the next dialog). The simplest infinite loop would be something like this:
while(true){/*do nothing*/}

This just "does nothing" forever without giving the server the chance to do anything else. We actually have a dev-only command that does this for testing the infinite loop handler :)

The last time we actually had a legitimate infinite loop issue, the code looked something like this:
for ( unsigned i = 0; i != kiCount; /*increment in loop*/ )
{
    if ( shouldSkip( i ) )
    {
        // oops, forgot to increment i!
        continue;
    }
    // real logic
    ++i;
}
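One way to make this kind of mistake harder is to put the increment back in the for header so that continue can't bypass it. This is only a sketch, not the actual fix that went into EQ2: shouldSkip() below is a made-up stand-in, processAll() is a hypothetical wrapper, and it assumes the real logic doesn't need to control the increment itself.

```cpp
#include <vector>

// Hypothetical stand-in for the real predicate: skips odd indices.
static bool shouldSkip( unsigned i ) { return i % 2 == 1; }

// Returns the indices that were actually processed.
std::vector<unsigned> processAll( unsigned count )
{
    std::vector<unsigned> processed;
    for ( unsigned i = 0; i != count; ++i )   // increment lives in the header
    {
        if ( shouldSkip( i ) )
        {
            continue;   // safe: ++i still runs before the next test
        }
        processed.push_back( i );   // stands in for the "real logic"
    }
    return processed;
}
```

If the loop body really does need to decide when to advance (which is the usual reason for leaving the increment out of the header), then every continue needs its own ++i in front of it.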

4 comments:

Mikul said...

Very interesting, and nice to see that a critical issue, however small, has an investigation path.

It also reminded me of my classic error from when I used to code text-based MUDs. I wrote what I consider my most beautiful routine back in the early nineties. The basic idea was a room with a bridge in it: as you walked over it, you had a chance to fall into the burning ooze below. On the other side of the bridge, in the hut, was a very good ring for newbies (and everyone else, as it happened). To stop high-level players ganking it all the time, I added various triggers for weight carried and overconfidence (high level) to cause more problems for people who shouldn't be there.

Now the problem comes when you're reviewing 200 lines of your own code and just thinking it's all fine and dandy, and you're blind to having the following statement present:


if (1 > 5)


For the life of me, I couldn't see the error. I still couldn't see it when the routine destroyed two of my personal friends' player files, as it ran into an infinite loop and basically mangled their player files into a big pile of mess.

I still have a printout of the events as they unfolded. Since it was a text-based game, he just printed out the history of his text client, from the moments leading up to walking into the room to the escapades as the game tried to cope. There is a lot more fun to this story when you read that printout... but that's something for another time.

KC said...

Fascinating!

What sort of events in EQ2 would cause infinite loops? (traditionally I mean, since you indicated they don't frequently occur these days :))

Perhaps, a dialogue error that leads an NPC into an un-ending cycle of text that cannot be aborted?

I would assume that NPC callout speeches are exempt from the types of code that get counted to see if they loop or not, since this is the intended result of the callouts themselves, to never end :)

Joshua Kriegshauser said...

KC: I updated the blog post with the response to your comment since Blogger doesn't support <pre> or <code> tags :P

Mikul said...

I can think of a text cycle that could get stuck in an infinite loop, or at least appears that it could without knowing what starts it.

The Bayle statue in QH streams off a long-winded multiple-text event that could get stuck between two sections of text, as it appears not to depend on a "player", maybe only on location.