In this part I want to go over some simplifications I've made in the first part, mostly because some of the things I showed there are wrong and should never be used. Here I want to talk about them, and also about the solutions that should be used for the same problems instead.
Back then I said that there is no await() in asynchronous programming, but usually there is one. It's just that it should never be used, because it leads to very bad deadlocks. In particular, people tend to use it in combination with a serial executor as a way of locking, to run some blocking code without releasing the executor thread. If any of the called code wants to run on the same executor (essentially, doing a recursive lock), that code will wait in the executor's queue forever and thus deadlock. It's not really await()'s problem, the same would happen with any recursive lock, including the patterns that I'll show soon, but people are less aware of the issue with await() and proudly feel like they've "cheated the system".
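To make the deadlock concrete, here is a minimal sketch in the same pseudo-API as the examples further below (start_io(), handle_result() and the executor argument to chain() are my inventions for the illustration, not any specific library's API):

void outer(shared_ptr<Scheduler> serial_exec, shared_ptr<ContextBase> ctx)
{
    // this function itself runs on serial_exec, occupying its only thread
    auto fut = start_io()->chain(handle_result, ctx, serial_exec);
    // blocks the single thread of serial_exec; handle_result() sits in
    // the executor's queue forever, so await() never returns
    fut->await();
}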
And I've already mentioned that there is no good reason to use the serial executor at all, there are better patterns. These better patterns rely on something that I haven't mentioned yet: mutexes. They are available in asynchronous programming but need to be treated somewhat differently than in common programs: as spinlocks, protecting only a short and quick chunk of code. Sometimes they can even be replaced by lock-free code (finally a good use for it!).
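As a minimal sketch of that treatment (using the standard C++ primitives as stand-ins for whatever your asynchronous library provides, with a made-up Item type):

#include <atomic>
#include <deque>
#include <mutex>

struct Item { int payload; }; // hypothetical data type, just for the sketch

std::mutex m;
std::deque<Item> work_queue; // shared data protected by m

void enqueue(Item item)
{
    // the mutex gets held only for the quick push, spinlock-style;
    // never across a wait, a chain() or any blocking call
    std::lock_guard<std::mutex> lock(m);
    work_queue.push_back(item);
}

// and sometimes a plain atomic makes the lock unnecessary
std::atomic<int> pending_count{0};

void count_one()
{
    pending_count.fetch_add(1, std::memory_order_relaxed);
}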
Instead, futures should be used as the mechanism for the long waits. A future has a nice property that avoids race conditions: whether it gets completed after a function was chained to it or before, the chained function will run in either case once the future is completed. So, as I showed before, sometimes we don't even care about the value returned in a future, only about the fact of its completion, to let some other code run. But a thing to keep in mind is that if a future is already completed when we chain a function to it, the function will usually run immediately on chaining, although there is no guarantee of that. This leads to difficult-to-diagnose errors when some function assumes that it has exclusive access to some data while it assembles a chain of futures and functions, but in reality the first future in the chain sometimes completes on another CPU before the whole chain is assembled. The functions in the chain then start running on the other CPUs, and our function in question at some point ends up chaining another function that accesses the same data to an already completed future; that other function gets called immediately, accessing the same data concurrently.
This mostly happens with the serial executors, when both the current function and the chained one rely on the same serial executor for serialization (another reason to never use the serial executors). The executor gets specified in the chaining arguments, but since it's the same executor as the currently running one, the chaining logic decides that it's fine to call the function directly. But the same thing can also happen on any executor when using mutexes, in a slightly easier-to-diagnose pattern: one function assembles a chain under a mutex, one of the functions in the chain tries to lock the same mutex, the lock becomes recursive, and everything deadlocks.
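A hedged sketch of that deadlock, with the chain() pseudo-API from below and a plain std::mutex:

std::mutex m;

void assemble(shared_ptr<Future<int>> f, shared_ptr<ContextBase> ctx)
{
    m.lock();
    // if f has already completed, step1() gets inlined right here,
    // while m is still locked...
    f->chain(step1, ctx);
    m.unlock();
}

void step1(shared_ptr<FutureBase> input,
    shared_ptr<ContextBase> ctx,
    shared_ptr<PromiseBase> result)
{
    // ...and this lock becomes recursive and deadlocks
    std::lock_guard<std::mutex> lock(m);
    // ... touch the shared data ...
}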
Hence the rule: either do no chaining under a locked mutex, or if you really need to, make sure that the first future in the chain won't get completed until after you unlock the mutex. In the latter case you'd usually start by creating a promise, then build the chain on its future side, and finally, after unlocking the mutex, chain that first promise to some future that might have been completed.
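A minimal sketch of that pattern in the same pseudo-API (to_future() and chain_completion(), which completes a promise when a future completes, are my guesses at what a concrete library would call these operations):

m.lock();
// nothing chained to a fresh promise can run yet, so the chain can
// be safely assembled while the mutex is locked
auto pr = make_shared<Promise<int>>();
pr->to_future()->chain(step1, ctx)->chain(step2, ctx);
// ... touch the data protected by m ...
m.unlock();
// only now connect the chain to the future that might have already
// completed; if it has, the chain starts running right here, but
// no longer under our mutex
maybe_completed->chain_completion(pr);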
Another thing that I didn't mention is that the executors usually have a direct way to schedule a function to be executed on them. The trouble is that the signature of such a function is usually different from that of a function that gets chained to a future, because with direct scheduling there is no input future and no result promise to pass as arguments. So if you need a function usable both ways, you can't have it, because the signatures differ. In this situation, instead of the direct scheduling, you can use chaining on a future that is either pre-completed or gets completed right after chaining. However, plain chaining will cause the function to be called right there (this is known as "inlining", as opposed to scheduling on an executor, which is known as "trampolining"). So you'd have to use the kind of chaining that allows to explicitly disable the inlining. Or if this option is not available in your asynchronous library, then there is no choice other than to do an explicit scheduling.
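In the pseudo-types of the example below, the two signatures look like this:

// a directly scheduled function gets only its context
void scheduled_fn(shared_ptr<SchedBarrierCtx> ctx);

// a chained function also gets the input future and the promise
// for its result
void chained_fn(shared_ptr<FutureBase> input,
    shared_ptr<SchedBarrierCtx> ctx,
    shared_ptr<PromiseBase> result);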
Disabling the immediate inlined execution on chaining also resolves the other potential issues mentioned above (at the cost of the additional overhead of scheduling). And if it's not available, a chain can be made to run through an explicit scheduling with pseudo-code like this (pseudo, since it plays a bit loose with the types):
// it didn't occur to me before but the contexts do have to have
// a base class with a virtual destructor for their destruction
// to work correctly in shared_ptr
struct SchedBarrierCtx : public ContextBase {
    AsyncFunctionBase func; // the function to run after the barrier
    shared_ptr<FutureBase> input; // the input future for that function
    shared_ptr<ContextBase> ctx; // its original context
    shared_ptr<Scheduler> sched; // the executor to schedule it on
    shared_ptr<PromiseBase> output; // its result promise, filled in later
};

template <typename Tin, typename Tout, typename Tctx>
shared_ptr<Future<Tout>> sched_barrier(
    shared_ptr<Future<Tin>> input,
    AsyncFunction<Tin, Tout, Tctx> func,
    shared_ptr<Tctx> ctx,
    shared_ptr<Scheduler> sched)
{
    auto barrier_ctx = make_shared<SchedBarrierCtx>(
        SchedBarrierCtx{func, input, ctx, sched, /*output*/ nullptr});
    // no need to specify the executor for chain(),
    // because barrier_trampoline1() will do that anyway,
    // and it's cheaper to inline it on any current executor
    return input->chain(barrier_trampoline1, barrier_ctx);
}

void barrier_trampoline1(
    shared_ptr<FutureBase> input,
    shared_ptr<SchedBarrierCtx> ctx,
    shared_ptr<PromiseBase> result)
{
    // save the result promise, then hop onto the target executor
    ctx->output = result;
    ctx->sched->schedule(barrier_trampoline2, ctx);
}

void barrier_trampoline2(shared_ptr<SchedBarrierCtx> ctx)
{
    // now running on the target executor, call the real function
    ctx->func(ctx->input, ctx->ctx, ctx->output);
}
The arguments for the chained function get passed through scheduling by saving them in the barrier context.
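A hypothetical use then looks like any other chaining, just with the barrier inserted in the middle (the names here are made up for the illustration):

// force compute_step() to run on worker_sched, even if input_future
// has already completed by the time of this call
auto result_future = sched_barrier<Input, Output, WorkCtx>(
    input_future, compute_step, work_ctx, worker_sched);
result_future->chain(next_step, work_ctx);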
Note that, barrier or not, the scheduled function can still complete before chain() returns! It's not very probable, because it requires another CPU to pick up the scheduled work and complete it while the current CPU gets delayed by something else (perhaps an interrupt handler in the kernel), or the kernel scheduler to do something unusual, but it's possible. The only thing guaranteed here is that the chained function will run in another kernel thread, and so if that kernel thread blocks, the one that called the chaining can still continue.