The Joshua Tree: June 2024

There's a tenet that I've been pushing on for a few years now from my little corner of NVIDIA, and that is the concept of Basic Callback Hygiene.

For the uninitiated, a callback, simply put, is a function that I pass to another function so that it can later "call me back." A lot of subscription services in programming work like this: something is going to happen in the future and I want to be notified of it, so I call a subscription function and give it a function that it can call to tell me when that happens.

This simple concept gets a lot more complicated when you ask what sorts of things you can do within that callback function. What happens when I don't care about the subscription anymore? What if the module containing the function unloads? What thread is the callback going to happen on? And so on, and so forth.

I've experienced problems with all of these complications over the course of my career, and given the amount of asynchronous and modular code that I've been writing for Omniverse, I finally distilled down a simple set of rules:

1. Callbacks can ask to be unregistered from within the callback.
2. It is strongly recommended not to hold locks (e.g. a mutex) while calling callbacks.
3. When a callback asks to be unregistered, there must be a strong guarantee that, once the request returns, the callback is not currently being called and will never be called again.

That's it. A subscription system that follows these basic principles will avoid 99% of the issues that we typically run into with callbacks.

Let's examine each of these. To put this in practical terms, let's say that we're designing an event system. We can subscribe to an event, which registers our callback, and we can dispatch an event, which calls back all of the subscribers from the context of the dispatching thread (so basically it loops over all subscribers and calls their callbacks). Our imaginary system is thread-safe, so we can subscribe or dispatch from any thread. We can also unsubscribe to unregister our subscription.
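To make the imaginary event system concrete, here is a deliberately naive baseline. Every name in it (`NaiveEvents`, `SubscriptionId`, `naiveDemo`) is made up for illustration, and it is single-threaded and ignores all three rules; it only pins down what subscribe, dispatch, and unsubscribe mean.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not a real API.
using Callback = std::function<void(int /*eventValue*/)>;
using SubscriptionId = std::size_t;

class NaiveEvents {
public:
    // Registers a callback; the returned id is later passed to unsubscribe().
    SubscriptionId subscribe(Callback cb) {
        m_subs.push_back(std::move(cb));
        return m_subs.size() - 1;
    }

    // Empties the slot instead of erasing it so that other ids stay valid.
    void unsubscribe(SubscriptionId id) {
        if (id < m_subs.size())
            m_subs[id] = nullptr;
    }

    // Calls every live subscriber from the context of the calling thread.
    // Not safe if a callback subscribes or unsubscribes while this runs.
    void dispatch(int value) {
        for (std::size_t i = 0; i < m_subs.size(); ++i)
            if (m_subs[i])
                m_subs[i](value);
    }

private:
    std::vector<Callback> m_subs;
};

// Usage: two subscribers, one of which is removed between dispatches.
int naiveDemo() {
    NaiveEvents events;
    int total = 0;
    SubscriptionId a = events.subscribe([&](int v) { total += v; });
    events.subscribe([&](int v) { total += 10 * v; });
    events.dispatch(1);      // both run: total = 1 + 10
    events.unsubscribe(a);
    events.dispatch(2);      // only the second runs: total = 11 + 20
    return total;
}
```

With these definitions `naiveDemo()` returns 31. Everything that follows is about what this version gets wrong once callbacks re-enter the system or other threads get involved.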

Unregistration from within a callback

This is probably the most useful one. Let's consider the case where I cannot unsubscribe from the callback but I want to. My callback is called and I can handle the event, but I cannot unsubscribe. This particular callback only wants to receive the first event after registration and no more events (an example of this might be a one-shot function that I want to defer until the next frame).

Since I cannot unsubscribe, I must arrange for some other function to run later and unsubscribe my callback from outside of the event, and I need to ignore any further events that I receive in the meantime.

This makes something that should be simple much more troublesome, and every place that I want to do this sort of action now requires more thought, effort, time, testing, etc. Changing the system to allow un-subscription from within the callback means that the complexity of handling that situation is in a singular place: my event system, rather than each and every callback that needs that behavior.
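As a sketch of what the one-shot case looks like when the event system does allow it, consider the single-threaded example below. The class and function names are hypothetical, and keeping each callable alive through a `std::shared_ptr` is my assumption about one reasonable way to let a callback survive its own unsubscription.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Callback = std::function<void(int)>;
using SubscriptionId = std::size_t;

// Single-threaded sketch (hypothetical names) that tolerates unsubscribe()
// being called from inside a callback.
class Events {
public:
    SubscriptionId subscribe(Callback cb) {
        m_subs.push_back(std::make_shared<Callback>(std::move(cb)));
        return m_subs.size() - 1;
    }

    void unsubscribe(SubscriptionId id) {
        if (id < m_subs.size())
            m_subs[id].reset();   // leave an empty slot; the walk skips it
    }

    void dispatch(int value) {
        // Walk by index: a callback may subscribe (growing the vector)
        // or unsubscribe (emptying a slot) while we are iterating.
        for (std::size_t i = 0; i < m_subs.size(); ++i) {
            // Hold a reference so that the callable outlives its own
            // unsubscription.
            std::shared_ptr<Callback> keepAlive = m_subs[i];
            if (keepAlive)
                (*keepAlive)(value);
        }
    }

private:
    std::vector<std::shared_ptr<Callback>> m_subs;
};

// A one-shot: handles the first event, then removes itself.
int oneShotDemo() {
    Events events;
    int calls = 0;
    SubscriptionId id = 0;
    id = events.subscribe([&](int) {
        ++calls;
        events.unsubscribe(id);   // unregister from within the callback
    });
    events.dispatch(1);
    events.dispatch(2);
    events.dispatch(3);
    return calls;
}
```

Here `oneShotDemo()` returns 1: the callback sees only the first of the three events. The callback itself needs no deferred cleanup function and no "ignore further events" flag; that complexity lives in `dispatch` once.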

Don't Hold Locks

Holding a lock while calling a callback is dangerous because the callback can execute arbitrary code. The callback may call back into our event system and we don't want the complication of having recursive locks (which are arguably evil) or contributing to lock cycles which can cause deadlocks.

Writing an event system to be thread-safe and yet not hold any locks while calling callbacks is challenging to say the least. Generally this requires something akin to tracking the number of threads that are in the middle of doing callbacks, at a minimum. We typically do it by adding a thread identifier to a list while holding a lock, and then unlocking to call the callback, then re-acquiring the lock and removing the thread identifier from the list. This is more complicated to be sure, and the reason why will be evident in the next section. There are other ways to do it, but you'll need some tracking to maintain state while the mutex is unlocked for the callback.

This also means that how you walk the list of subscriptions must be stateful in a way that can resume after the mutex is unlocked and re-acquired. If these subscriptions are stored in C++'s std::vector, for instance, unlocking the mutex may allow another thread to subscribe, which may grow and therefore reallocate the vector, invalidating all iterators; or another thread may unsubscribe, which removes an entry and can shift the position of the current thread walking over the vector. (Solutions to these, respectively, might involve using indexes instead of iterators and leaving null pointers in the array to mark empty slots.)
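A minimal sketch of such a resumable walk follows, assuming hypothetical names (`UnlockedEvents`, `walkDemo`) and a `std::shared_ptr` per subscription as my own choice for keeping a callable alive while the mutex is released. It applies the two suggestions above, indexes and null slots, but it does not yet provide the strong unregistration guarantee discussed in the next section.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

using Callback = std::function<void()>;

class UnlockedEvents {
public:
    std::size_t subscribe(Callback cb) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_subs.push_back(std::make_shared<Callback>(std::move(cb)));
        return m_subs.size() - 1;
    }

    void unsubscribe(std::size_t id) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (id < m_subs.size())
            m_subs[id].reset();   // null marks the slot empty; indexes stay put
    }

    void dispatch() {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Re-read size() on every pass: the vector may grow while unlocked.
        for (std::size_t i = 0; i < m_subs.size(); ++i) {
            std::shared_ptr<Callback> cb = m_subs[i];   // copy under the lock
            if (!cb)
                continue;         // unsubscribed slot
            lock.unlock();        // never call out while holding the mutex
            (*cb)();
            lock.lock();          // anything may have changed; i is still valid
        }
    }

private:
    std::mutex m_mutex;
    std::vector<std::shared_ptr<Callback>> m_subs;
};

int walkDemo() {
    UnlockedEvents events;
    int calls = 0;
    std::size_t second = 0;
    events.subscribe([&] {
        ++calls;
        events.unsubscribe(second);    // re-entry is fine: the mutex is not held
        for (int n = 0; n < 64; ++n)   // force the vector to reallocate mid-walk
            events.subscribe([&] { ++calls; });
    });
    second = events.subscribe([&] { calls += 1000; });  // removed before it runs
    events.dispatch();
    return calls;
}
```

`walkDemo()` returns 65: the first callback runs once, the second is skipped because its slot was nulled before the walk reached it, and the 64 subscribers added during the walk are each reached by index after the vector has grown.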

Strong Unregistration Guarantees

This is by far the hardest issue to solve correctly. Here's a real-world scenario: right before a module unloads (in the main thread, say), it tells our event system to unsubscribe a previously-subscribed callback. Without a strong guarantee, what will happen if another thread is currently calling our subscribed callback? As noted above, we don't want to be holding locks while calling callbacks, so waiting on a lock isn't going to wait for our callback to finish. What has happened in actual code is that the unsubscribe happens quickly, the module proceeds to unload, and then a crash occurs because another thread was still executing a function that was in the module. Or, after unsubscribing, objects were destroyed since we would no longer get events, but with another thread still in our callback function, a crash occurs because deleted memory was accessed.

Or the problem that I'm fixing right now: memory corruption occurs because a dependent structure was freed, but a race with the callback means that we were still changing things inside the structure.

It quickly becomes apparent that any request to unregister must wait for any other calls to finish. Also, once we return from the unregister function, it stands to reason that we cannot ever call the callback again. To do so would be violating the caller's unregistration request.

This then requires a strong ordering guarantee:

1. The unregister function is called.
2. WAIT for any calls to the callback in question to finish*.
3. PREVENT any future calls to the callback.
4. Return from the unregister function. At this point the caller is guaranteed that our event system is free of the callback, and it is safe to destroy structures.

"Ah," the astute and observant amongst you might notice, "what if the unregistration is from within the callback?!" This was our first point, above. If we WAIT for any calls to the callback to finish and the unregistration call is coming from the callback, we would wait forever. So we amend the line with the asterisk (*) above instead to say: "WAIT for any calls to the callback in question in other threads to finish."

So our subscription process is pretty simple:

1. Take the lock.
2. Append our subscription information to the std::vector (or similar).
3. Unlock the lock.

The notification process is a bit more complicated as it has to synchronize with the unsubscribe process:

1. Take the lock.
2. Iterate over the std::vector (or similar) by index, skipping null entries. For each:

2.1. Add the current thread identifier to a list of threads that are notifying.
2.2. Unlock the lock.
2.3. Call the callback function.
2.4. Re-acquire the lock. At this point, the vector may have been reallocated and grown, and subsequent entries may have been unregistered. Any other thread could have done anything.
2.5. Remove our current thread identifier from the list of threads that are notifying.
2.6. Is a waiting flag set on the subscription? If so, notify the waiting thread (e.g. using a condition variable).

3. Unlock the lock.

And now, the pièce de résistance: the unsubscribe process:

1. Take the lock.
2. Find the subscription to remove. If not found, unlock and return; otherwise:
3. Set the entry in the std::vector to null, but do not destroy the subscription yet!
4. Is the subscription currently being notified by a thread other than the current thread? If so, set a waiting flag on the subscription; this is the flag checked in the last sub-step of the notification loop above (2.6).
5. Unlock the lock.
6. If the waiting flag was set, wait on the condition variable for the notification from 2.6 above. Otherwise it is safe to destroy the subscription and return.
7. Re-acquire the lock.
8. If the list of notifying threads contains threads besides the current thread, go to step 5.
9. Unlock the lock. It is now safe to destroy the subscription.

This allows our unsubscribe to wait until no thread is still notifying our subscription, with one exception: the unsubscribing thread itself, when it is unsubscribing from within its own callback.
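The subscribe, notify, and unsubscribe processes above can be put together roughly as follows. This is a sketch under assumptions, not the actual implementation: the names (`GuardedEvents`, `Sub`, `guaranteeDemo`) are hypothetical, the list of notifying threads is kept per subscription, and `std::condition_variable::wait` with a predicate stands in for the explicit unlock, wait, re-acquire, and re-check loop, since it releases the mutex while blocked and re-tests the condition with the mutex held.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch; names and layout are assumptions.
class GuardedEvents {
public:
    using Callback = std::function<void()>;

    std::size_t subscribe(Callback cb) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_subs.push_back(std::make_shared<Sub>(std::move(cb)));
        return m_subs.size() - 1;
    }

    void dispatch() {
        const std::thread::id self = std::this_thread::get_id();
        std::unique_lock<std::mutex> lock(m_mutex);
        for (std::size_t i = 0; i < m_subs.size(); ++i) {
            std::shared_ptr<Sub> sub = m_subs[i];
            if (!sub)
                continue;                        // unsubscribed slot
            sub->notifying.push_back(self);      // record that we are inside
            lock.unlock();
            sub->callback();                     // no lock held here
            lock.lock();
            auto& n = sub->notifying;
            n.erase(std::find(n.begin(), n.end(), self));
            if (sub->waiting)
                m_cv.notify_all();               // wake a blocked unsubscribe()
        }
    }

    void unsubscribe(std::size_t id) {
        const std::thread::id self = std::this_thread::get_id();
        std::unique_lock<std::mutex> lock(m_mutex);
        if (id >= m_subs.size() || !m_subs[id])
            return;                              // not found
        std::shared_ptr<Sub> sub = m_subs[id];
        m_subs[id].reset();                      // PREVENT future calls
        sub->waiting = true;
        // WAIT until no thread other than the caller is inside the callback.
        m_cv.wait(lock, [&] {
            return std::none_of(sub->notifying.begin(), sub->notifying.end(),
                                [&](std::thread::id t) { return t != self; });
        });
    }

private:
    struct Sub {
        explicit Sub(Callback cb) : callback(std::move(cb)) {}
        Callback callback;
        std::vector<std::thread::id> notifying;  // threads inside the callback
        bool waiting = false;                    // an unsubscribe() is waiting
    };
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::vector<std::shared_ptr<Sub>> m_subs;
};

// Returns true if unsubscribe() blocked until the worker thread had left the
// callback and the callback was never called again afterwards.
bool guaranteeDemo() {
    GuardedEvents events;
    std::atomic<bool> entered{false};
    std::atomic<bool> finished{false};
    std::atomic<int> calls{0};
    std::size_t id = events.subscribe([&] {
        ++calls;
        entered = true;
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        finished = true;
    });
    std::thread worker([&] { events.dispatch(); });
    while (!entered)
        std::this_thread::yield();
    events.unsubscribe(id);          // blocks while the worker is inside
    const bool waited = finished;    // the callback had already completed
    worker.join();
    events.dispatch();               // must not call the callback again
    return waited && calls == 1;
}
```

Because the predicate ignores the calling thread, a callback that unsubscribes itself does not wait on its own call. As the post says, real code needs more care than this, for example around recursion and around which thread ends up destroying the callable.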

Of course, there is a lot more nuance to this, especially when it comes to handling recursion (dispatching within a callback) or even just the complexity of working with multiple threads, but this should help you understand the concept of Basic Callback Hygiene.

05 June 2024
