When working on an MMO server, one thing that you're never going to get rid of is crashes. We generally don't like to admit it, but it's true. Crashes can be caused by many things, including bad software, bad hardware, bad programmers, bad state, bad anything. Linux (and Windows) MMO server developers typically have crash recovery down to a science: dump a core file (or a minidump under Windows), mail it off to the programmers, and restart the process.
Worse than crashes (and hopefully less frequent) is a little problem known as the Infinite Loop. If a server process locks up, how do you get useful information in order to fix it? We want to treat Infinite Loops like crashes and get a core file or minidump that shows where the lockup occurred and, hopefully, why.
The concept behind detecting an infinite loop is trivial: start the main game loop as a separate thread and have it increment a value as it runs (say, once every main loop iteration). The initial thread then watches for this value to change over time. If the value goes long enough without changing, your process can be considered to have locked up. I'll leave the implementation as an exercise for the reader.
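For anyone who would rather skip the exercise, here is a minimal sketch of that heartbeat check. It is not the EQ2 implementation: the names `g_heartbeat`, `beat()`, and `watchForLockup()` are hypothetical, and the polling interval and timeout are parameters you would have to tune for your own server.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical names throughout; this is an illustration, not EQ2 code.
std::atomic<unsigned long> g_heartbeat{0};

// The game thread calls this once per main loop iteration.
inline void beat() { g_heartbeat.fetch_add(1, std::memory_order_relaxed); }

// Run by the initial (watchdog) thread. Samples the heartbeat every `poll`.
// Returns true once the counter has gone `timeout` without changing
// (presumed lockup), or false if `stop` is set first (clean shutdown).
bool watchForLockup(std::chrono::milliseconds poll,
                    std::chrono::milliseconds timeout,
                    const std::atomic<bool>& stop)
{
    assert(poll.count() > 0 && timeout >= poll);

    using clock = std::chrono::steady_clock;
    unsigned long last = g_heartbeat.load(std::memory_order_relaxed);
    clock::time_point lastChange = clock::now();

    while (!stop.load(std::memory_order_relaxed))
    {
        std::this_thread::sleep_for(poll);
        const unsigned long current = g_heartbeat.load(std::memory_order_relaxed);
        if (current != last)
        {
            // The game loop made progress; restart the timeout window.
            last = current;
            lastChange = clock::now();
        }
        else if (clock::now() - lastChange >= timeout)
        {
            return true;   // no progress for `timeout`: treat as locked up
        }
    }
    return false;
}
```

When `watchForLockup()` returns true, the watchdog thread is the one that goes on to ask the stuck thread for a core file or minidump. A timeout comfortably longer than your slowest legitimate main loop iteration avoids false alarms.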
So how does our infinite loop detection thread get the other thread to drop useful information?
In Linux (and most other flavors of *nix) it's really easy. Just do a pthread_kill() to the locked-up game thread with a signal that drops a core (SIGABRT usually works nicely) and then _exit().
For Windows, there's a little more code to write and you have to have dbghelp.dll. Basically, just write a minidump file. The normal minidump file should have information on all threads, so even though you're writing it from the main thread doing the infinite loop detection, it will include all the necessary info about your locked-up thread.
Here's a very simple sample of a Linux program for making a thread drop a core:
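#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

// Stand-in for a locked-up game thread: spins forever.
void* threadMain( void* )
{
    while ( 1 )
        ;
    return 0;
}

int main( int argc, char** argv )
{
    pthread_t id;
    // pthread functions return the error number instead of setting errno,
    // so report it with strerror() rather than perror().
    int rc = pthread_create( &id, 0, threadMain, 0 );
    if ( 0 != rc )
    {
        fprintf( stderr, "pthread_create failed: %s\n", strerror( rc ) );
        return 1;
    }
    sleep( 1 );                  // give the thread time to start spinning
    pthread_kill( id, SIGABRT ); // default action for SIGABRT is to dump core
    _exit( 0xabcd );
}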
Why am I writing about this? Previously, EQ2's infinite loop detection would essentially do this:
assert( 0 && "INFINITE LOOP DETECTED" );
That would crash the process and cause it to be restarted, but it didn't give us any useful information about where or why it locked up. In a past life, when working on UO, we didn't even have any infinite loop detection, so a locked-up server would just blackhole people and prevent the shard from doing synchronized backups. On EQ2 we don't typically see a lot of infinite loops at this point, but the detection has recently been changed to give us more information. And that's not something that you're likely to see in the patch notes ;)
Edit 3/29/08 10:23 PM:
Response based on a comment by KC (see comment below). Unfortunately, Blogger doesn't support <pre> or <code> tags :P
Infinite Loops in EQ2 are usually caused by programming errors. Our script system (Lua) has an instruction limit that the EQ2 team added so Designers can't cause infinite loops. An unending dialog cycle (if I'm interpreting you correctly) wouldn't be an infinite loop, since it requires user interaction (the server is waiting on the client before serving up the next dialog). The simplest infinite loop would be something like this:
while(true){/*do nothing*/}
This just "does nothing" forever without giving the server the chance to do anything else. We actually have a dev-only command that does this for testing the infinite loop handler :)
The last time we actually had a legitimate infinite loop issue, the code looked something like this:
for ( unsigned i = 0; i != kiCount; /*increment in loop*/ )
{
    if ( shouldSkip( i ) )
    {
        // oops, forgot to increment i!
        continue;
    }
    // real logic
    ++i;
}
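One way to rule out this particular bug is to put the increment back in the for header, so a continue can never bypass it. The sketch below is an illustration, not the actual EQ2 fix: the value of `kiCount`, the body of `shouldSkip()`, and the `processAll()` wrapper are all hypothetical stand-ins, and it assumes the "real logic" never needs to leave `i` where it is.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real definitions.
const unsigned kiCount = 8;
static bool shouldSkip( unsigned i ) { return ( i % 2 ) != 0; }  // skip odd indices

// Returns the indices that reached the "real logic".
std::vector<unsigned> processAll()
{
    std::vector<unsigned> processed;
    for ( unsigned i = 0; i != kiCount; ++i )  // increment lives in the header
    {
        if ( shouldSkip( i ) )
        {
            continue;  // safe: ++i still runs before the next loop test
        }
        processed.push_back( i );  // real logic
    }
    // Sanity check: each index is handled at most once.
    assert( processed.size() <= kiCount );
    return processed;
}
```

If the loop body really does have to control the increment itself (for example, because it sometimes removes the current element), every path that reaches continue needs its own ++i, which is exactly the step that was missed above.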