Wednesday, February 5, 2025

Asynchronous programming 4 - a look under the carpet

 

In this part I want to go over some simplifications I've made in the first part, mostly around things that do exist but are wrong enough that they should never be used. Here I want to talk about them, and also about the solutions that should be used instead for the same problems.

Back then I said that there is no await() in asynchronous programming, but usually there is. It's just that it should never be used, because it leads to very bad deadlocks. In particular, people tend to use it in combination with a serial executor as a way of locking, to run some blocking code without releasing the executor thread. If any of the called code wants to run on the same executor (essentially, taking a recursive lock), that code will wait in the executor's queue forever and thus deadlock. It's not really await()'s problem; the same would happen with any recursive lock, including the patterns that I'll show soon, but people are less aware of the issue with await() and proudly feel like they've "cheated the system".
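To make it concrete, here is a minimal sketch of such a deadlock in the same pseudo-code style as before (SerialExecutor, inner() and the await() method are hypothetical stand-ins for whatever the library provides):

// runs on serial_exec, occupying its only thread
void outer(shared_ptr<SerialExecutor> serial_exec)
{
  // inner() chains its continuations to run on the same serial_exec
  shared_ptr<Future<int>> fut = inner(serial_exec);
  // await() blocks the only thread of serial_exec until fut completes,
  // but fut can complete only by running inner()'s continuations on
  // serial_exec, whose thread is blocked right here: a deadlock
  int v = fut->await(); // never returns
}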

And I've already mentioned that there is no good reason to use the serial executor at all; there are better patterns. These better patterns rely on something that I haven't mentioned yet: mutexes. They are available in asynchronous programming but need to be treated somewhat differently than in ordinary programs: they have to be treated as spinlocks, protecting only a short and quick chunk of code. Sometimes they can even be replaced by lock-free code (finally a good use for it!).
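For example, a sketch with a hypothetical shared cache, using std::mutex as a stand-in for whatever mutex the framework provides:

#include <mutex>
#include <string>
#include <unordered_map>

std::mutex cache_mutex;
std::unordered_map<std::string, int> cache;

// fine in asynchronous code: the critical section is short, quick, and never
// waits for anything, so the mutex behaves essentially like a spinlock
void cache_put(const std::string &key, int value)
{
  std::lock_guard<std::mutex> lock(cache_mutex);
  cache[key] = value;
}

// what would not be fine: holding cache_mutex across a wait for a future (or
// any other long operation), stalling every executor thread that touches the cache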

Instead, futures should be used as the mechanism for the long waits. A future has a nice property that avoids race conditions: whether it gets completed after a function got chained to it or before, the chained function will run either way once the future is completed. So as I showed before, sometimes we don't even care about the value returned in a future, only about the fact of its completion, to let some other code run. But a thing to keep in mind is that if a future is already completed before we chain a function to it, the function will usually run immediately during the chaining call, although there is no guarantee of that. This leads to difficult-to-diagnose errors when some function assumes that it has exclusive access to some data while it assembles a chain of futures and functions, but in reality the first future in the chain sometimes completes on another CPU before the whole chain is assembled. Then the functions in the chain start running on the other CPUs, and at some point our function in question chains another function that accesses the same data onto an already completed future, and that other function gets called immediately, accessing the same data while our function still thinks it has exclusive access.

This mostly happens with serial executors, when both the current function and the chained one rely on the same serial executor for serialization (another reason to never use serial executors). The executor gets specified in the chaining arguments, but since it's the same executor as the one currently running, the chaining logic decides that it's fine to call the function directly. But the problem can also happen on any executor while using mutexes, in a slightly easier-to-diagnose pattern: one function assembles a chain while holding a mutex, one of the functions in the chain tries to lock the same mutex, that becomes a recursive lock, and everything deadlocks.

Hence the rule: either do no chaining under a locked mutex, or if you really need to, make sure that the first future in the chain can't get completed until after you unlock the mutex. In the latter case you'd usually start by creating a promise, then build the chain on its future side, and finally, after unlocking the mutex, chain that first promise to the future that might have already been completed.
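A sketch of that second variant in the same pseudo-code style (WorkContext, work_step1() and work_step2() are hypothetical):

struct WorkContext {
  std::mutex mutex;
  // ... the data protected by the mutex ...
};

void start_work(shared_ptr<WorkContext> ctx, shared_ptr<Future<int>> maybe_done)
{
  auto gate = make_shared<Promise<int>>();
  {
    std::lock_guard<std::mutex> lock(ctx->mutex);
    // the gate is guaranteed to be incomplete, so nothing chained to it can
    // run yet, even if maybe_done has already completed on another CPU;
    // the tail of the chain would normally go to the caller's result promise
    gate->to_future()->chain(work_step1, ctx)->chain(work_step2, ctx);
    // ... update the data protected by ctx->mutex ...
  }
  // only now let the chain run: if maybe_done is already complete, any
  // inlined calls happen here, with the mutex already unlocked
  maybe_done->chain(gate);
}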

Another thing that I didn't mention is that the executors usually have a direct way to schedule a function to be executed on them. The trouble is that the signature of such a function is usually different from one that gets chained to a future, because with direct scheduling there are no input future and no result promise arguments. So if you need a function to be usable both ways, you can't have that, because the signatures differ. In this situation, instead of the direct scheduling, you can use chaining on a future that is either pre-completed or gets completed right after chaining. However, with plain chaining this will cause the function to be called right there (this is known as "inlining", as opposed to scheduling on an executor, which is known as "trampolining"). So you'd have to use the kind of chaining that allows you to explicitly disable the inlining. Or if this option is not available in your asynchronous library, then there is no choice other than doing the explicit scheduling.

Disabling the immediate inlined execution on chaining also resolves the other potential issues mentioned above (at the cost of the additional overhead of scheduling). Or if it's not available, a chain can be made to run through an explicit scheduling with pseudo-code like this (pseudo, since it plays a bit loose with the types):

// it didn't occur to me before, but the context base class
// does need a virtual destructor for the contexts to get
// destroyed correctly through shared_ptr
struct SchedBarrierCtx : public ContextBase {
  AsyncFunctionBase func;
  shared_ptr<FutureBase> input;
  shared_ptr<ContextBase> ctx;
  shared_ptr<Scheduler> sched;
  shared_ptr<PromiseBase> output;
};

template <typename Tin, typename Tout, typename Tctx>
shared_ptr<Future<Tout>> sched_barrier(
  shared_ptr<Future<Tin>> input,
  AsyncFunction<Tin, Tout, Tctx> func,
  shared_ptr<Tctx> ctx,
  shared_ptr<Scheduler> sched)
{
  auto barrier_ctx = make_shared<SchedBarrierCtx>();
  barrier_ctx->func = func;
  barrier_ctx->input = input;
  barrier_ctx->ctx = ctx;
  barrier_ctx->sched = sched;
  // barrier_ctx->output stays empty until barrier_trampoline1() fills it
  // no need to specify the executor for chain(),
  // because barrier_trampoline1() will do that anyway,
  // and it's cheaper to inline it on any current executor
  return input->chain(barrier_trampoline1, barrier_ctx);
}

void barrier_trampoline1(
  shared_ptr<FutureBase> input,
  shared_ptr<SchedBarrierCtx> ctx,
  shared_ptr<PromiseBase> result)
{
  ctx->output = result;
  ctx->sched->schedule(barrier_trampoline2, ctx);
}

void barrier_trampoline2(shared_ptr<SchedBarrierCtx> ctx)
{
  ctx->func(ctx->input, ctx->ctx, ctx->output);
}

The arguments for the chained function get passed through scheduling by saving them in the barrier context.

Note that barrier or not, the scheduled function can still complete before chain() returns! It's not very probable, because it requires another CPU to pick up the scheduled work and complete it while the current CPU gets delayed by something else (perhaps an interrupt handler in the kernel), or the kernel scheduler to do something unusual, but it's possible. The only thing guaranteed here is that the chained function will run on another kernel thread, so if that kernel thread blocks, the one that called the chaining can still continue.

Sunday, February 2, 2025

Asynchronous programming 3 - some assistance

I was saying it 20 years ago, then 15 years ago in the TPOPP book, and I'm still saying it now: asynchronous programming has to be assisted by a compiler, otherwise it's just a huge pain of doing manually the things that a compiler normally does. Fortunately, I think now we have an out-of-the-box solution: the C++ coroutines in C++20, as described for example here: https://en.cppreference.com/w/cpp/language/coroutines . I haven't quite tried to do an actual implementation with them, but it looks like the right thing. You define your Promise class (note that coroutines don't differentiate between the Future and Promise sides and call everything a promise), the compiler finds it from the coroutine's return type, and the co_await/co_return statements arrange the splitting of the sequential code into fragments. And you do the explicit parallelism on your own.
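Here is a minimal self-contained sketch of the mechanics (not tied to the futures from the previous parts): a coroutine return type with its nested promise_type, and a body that the compiler cuts into fragments at every co_await. Compile it as C++20.

#include <coroutine>
#include <cstdio>
#include <exception>

// the return type that the compiler uses to find the promise_type
struct StepTask {
  struct promise_type {
    StepTask get_return_object() {
      return {std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_never initial_suspend() noexcept { return {}; }  // start running eagerly
    std::suspend_always final_suspend() noexcept { return {}; }   // keep the frame until destroy()
    void return_void() {}
    void unhandled_exception() { std::terminate(); }
  };
  std::coroutine_handle<promise_type> handle;
};

StepTask steps() {
  std::printf("fragment 1\n");
  co_await std::suspend_always{};  // the compiler splits the function here
  std::printf("fragment 2\n");
  co_await std::suspend_always{};  // ...and here
  std::printf("fragment 3\n");
}

int main() {
  StepTask t = steps();  // runs fragment 1, then suspends
  t.handle.resume();     // runs fragment 2
  t.handle.resume();     // runs fragment 3, stops at final_suspend
  t.handle.destroy();    // frees the coroutine frame
  return 0;
}

In a real asynchronous library the co_await would of course suspend on a future and the executor would call resume() when the future completes, but the splitting of the sequential code works exactly like this.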

Another solution that I played with, doing a partial implementation, would work with plain C too: a preprocessor. It can be done in some smart way, as a whole pre-parser like cfront of yore, or a lot can be achieved even with the standard C preprocessor. The only trick is to generate the unique function names, and that can be done by using the macro __LINE__. Since the line number stays the same within a macro invocation, each invocation gets a unique number that can be used repeatedly within the macro body.
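One catch with the token pasting: the ## operator does not expand __LINE__ before pasting (you'd get a literal func__LINE__), so a two-level helper macro is needed. The macros below assume a helper along these lines:

// the extra level of indirection forces __LINE__ to be expanded before pasting;
// ASYNC_CONCAT(func, __LINE__) used on line 42 produces the identifier func42
#define ASYNC_CONCAT2(a, b) a##b
#define ASYNC_CONCAT(a, b) ASYNC_CONCAT2(a, b)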

The most difficult part is that we'll need to use the same call and return macros in both the "header" part of the function and the "continuation" parts, which means that all the functions have to have the same result type and return the value in the same way. So let's take the example from the last post and reformat it to fit into this approach. The original example from the previous installment was:

struct FuncContext {
  int a;
};

Future<int> func()
{
  auto ctx = make_shared<FuncContext>();
  shared_ptr<Future<int>> futa = get_a();
  return futa->chain(func2, ctx) // use default executor
    ->chain(func3, ctx);
}

void func2(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
  ctx->a = arg->value();
  get_b(ctx->a)->chain(result);
}

void func3(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
  int b = arg->value();
  result->return_value(ctx->a + b);
}

To get the same return type throughout we change the "header" part to return void and pass the returned future back via an argument. 

The other problem is the type of that return promise's value: carrying it through all the "continuation" parts is difficult, so we'd have to revert to the base promise type that doesn't care about the return value and cast it only when setting the value. This base type has to exist for the scheduler to juggle all these promises in its queues. Also, remember, the premise here is that coroutines are not available, which would often mean plain C, and there the promises can't be templatized in the first place.

The code becomes:

struct FuncContext {
  int a;
};

void func(shared_ptr<Future<int>>* result_future)
{
  auto ctx = make_shared<FuncContext>();
  auto result = make_shared<Promise<int>>();
  *result_future = result->to_future();
  shared_ptr<Future<int>> fut_cont;
  get_a(&fut_cont);
  fut_cont->chain(func2, ctx)->chain(result);
}

void func2(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<PromiseBase> result) {
  ctx->a = arg->value();
  shared_ptr<Future<int>> fut_cont;
  get_b(ctx->a, &fut_cont);
  fut_cont->chain(func3, ctx)->chain(result);
}

void func3(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<PromiseBase> result) {
  int b = arg->value();
  static_cast<Promise<int>*>(result.get())->return_value(ctx->a + b);
}

Then we want to make it look like this:

ASYNC_FUNC_0ARG(func, int, {
  int a; // this is the context
}) {
  ASYNC_CALL_0ARG(func, ctx->a, int, get_a);
  ASYNC_CALL_1ARG(func, int b, int, get_b, ctx->a);
  ASYNC_FUNC_RETURN(int, ctx->a + b);
} ASYNC_FUNC_END

Here for simplicity I've just used separate macros for definitions and calls of functions with different numbers of arguments. It's definitely possible to use macros with a variable number of arguments, it's just not something that I use often and I'm too lazy to look it up now. The invocation of ASYNC_FUNC_END is needed to balance out the curly braces. The name of the calling function is needed in the CALL macros to refer to the context type name; this unfortunately can't be avoided, and then incidentally it can also be used to generate the names of the continuation functions. Alternatively, we could define the function name as a macro before the function definition and undef it afterwards, then everything in between could just use that macro for the function name.

There is a bit of ugliness but still, it looks much shorter and simpler than before, doesn't it? Now all that's left is to define the macros that will translate one into the other, by copy-pasting from the long example (I haven't actually tried these macros right now, so they might contain small bugs, but they show the idea, and I did get a similar system working in the past):

#define ASYNC_FUNC_0ARG(fname, func_return_type, context_body) \
struct fname##Context context_body; \
void fname(shared_ptr<Future<func_return_type>>* result_future) \
{ \
  using return_type = func_return_type; \
  auto ctx = make_shared<fname##Context>(); \
  auto result = make_shared<Promise<return_type>>(); \
  *result_future = result->to_future();

#define ASYNC_FUNC_END }

#define ASYNC_CALL_0ARG(fname, assign_to, call_return_type, call) \
    shared_ptr<Future<call_return_type>> fut_cont; \
    call(&fut_cont); \
    void ASYNC_CONCAT(fname, __LINE__)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result); \
    fut_cont->chain(ASYNC_CONCAT(fname, __LINE__), ctx)->chain(result); \
  } \
} \
void ASYNC_CONCAT(fname, __LINE__)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result) { \
  assign_to = arg->value(); \
  {

#define ASYNC_CALL_1ARG(fname, assign_to, call_return_type, call, call_arg1) \
    shared_ptr<Future<call_return_type>> fut_cont; \
    call(call_arg1, &fut_cont); \
    void ASYNC_CONCAT(fname, __LINE__)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result); \
    fut_cont->chain(ASYNC_CONCAT(fname, __LINE__), ctx)->chain(result); \
  } \
} \
void ASYNC_CONCAT(fname, __LINE__)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result) { \
  assign_to = arg->value(); \
  {

#define ASYNC_FUNC_RETURN(return_type, expr) \
  static_cast<Promise<return_type>*>(result.get())->return_value(expr)

There are a couple more things to explain in the ASYNC_CALL macros. One is that they have to declare the continuation function before using it; this is something that I've glossed over before, because if you write these continuation functions manually, you'd collect all the declarations up front. But if they're generated on the fly, the declarations also have to come on the fly, and since they land at block scope, they can't be marked static (even though these functions are never called from outside the file, so ideally they would be; a smarter preprocessor that collects the declarations up front could keep them static). The second thing is that the current function gets closed with two curly braces, and the next one gets opened with two curly braces. This is because ASYNC_FUNC opens the function with one curly brace for the generated definitions, then another brace comes after the macro, and then we need to maintain the same brace depth throughout.

Note that the execution of the asynchronous functions here is strictly sequential, with no ifs nor loops. However, similar macros can be made for ifs and loops, and if I ever get around to transforming this text into a chapter for a newer version of my book on parallel programming, I'll do them too. They'd be ugly but still better than writing things manually. And a specialized preprocessor like cfront could reduce the ugliness of having to repeat the names that can't be remembered between the C preprocessor macros, and of having to explicitly specify the level of nesting for the ifs and loops.