Saturday, January 25, 2025

Asynchronous programming 2 - filling in the types

For the sake of a quick introduction, I've glossed over some things in part 1. Here I want to come back and show them.

Let's start with a small code snippet similar to what was shown in part 1:

func(context, arg)
{
  ...
}

Future fut1;
fut1.chain(func, context, executor);

Note that there are no types in this snippet; I've dropped them to avoid getting mired in them. Let's fill them in, keeping in mind that the answer might vary depending on the specific asynchronous library.

Let's start with the function argument arg. Note that it's not explicitly mentioned anywhere in chain(). That's because the argument comes from the future fut1: it's the value that gets stored in it. So if, say, the type of fut1 is actually Future<int>, the argument might actually be

int arg

but the more typical solution is to pass the whole input future as an argument:

Future<int> arg

Except that normally the futures wouldn't be copied but passed by reference. And considering how many places they get referred to from, the only reasonable way is to use either reference counting or garbage collection. Reference counting is more natural for C++ and C, so the type would become:

shared_ptr<Future<int>> arg

Next, what is the function's return value? Since this is an asynchronous function, its return value must be delivered through a Promise. Moreover, that Promise's Future side needs to be returned at the time of the chaining, so the chaining becomes (assuming that the returned value is of type double):

shared_ptr<Future<int>> fut1;
shared_ptr<Future<double>> fut2 = fut1->chain(func, context, executor);

But how will the function know where to return that value? It has to receive that result Promise as an argument too:

void func(context, shared_ptr<Future<int>> arg, shared_ptr<Promise<double>> result);

Since the result is returned through an argument, the normal function's return type becomes void. It's the responsibility of the function to make sure that the result promise will be completed, however this doesn't have to happen by the function's return time. Instead it can schedule some other code that will complete this promise later (for example, by chaining it from some other future that it creates). Which, yes, is another potential source of errors, when the promise completion gets forgotten in one of the branches of execution and the rest of the logic gets deadlocked waiting for it.

The way to debug this is to have the library keep track of the futures that have some dependency chained to them but haven't been completed, haven't been scheduled to run, and haven't been chained to something else. However this can also be a normal intermediate state of a future that is still being prepared, or of a future stored in some data structure to be found and completed later, so the program can't just abort every time it sees such a future. Instead it has to be a tool that can run and list all the suspicious futures whenever a deadlock is suspected. Or there can be a special flag that lets a future be temporarily exempted, and that gets cleared on exiting the constructing scope unless explicitly preserved. Then any non-compliant future without this flag can be an immediate reason for a program abort, but if the flag gets mismanaged, the deadlocks can still happen. As I've said many times before, asynchronous programming is fragile and hard to debug.
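For illustration, here is a hedged sketch of the "complete later" pattern in the same conventions; Cache, Key, Value and fetch_from_db() are names made up for this example, not part of any real library:

// A fragment that completes its result promise either immediately (on a
// cache hit) or later, by chaining it to the future of an asynchronous fetch.
void cached_lookup(shared_ptr<Cache> ctx, shared_ptr<Future<Key>> arg, shared_ptr<Promise<Value>> result)
{
  Key key = arg->value();
  if (ctx->contains(key)) {
    result->return_value(ctx->get(key)); // completed before returning
  } else {
    // completed later, when the fetch finishes; if this branch forgot to do
    // the chaining, everything downstream of result would deadlock
    fetch_from_db(key)->chain(result);
  }
}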

The executor would generally also be a shared_ptr. The final part is the context, which is normally also a shared_ptr to some object. What object depends on the function. Consider a classic function:

int func()
{
  int a = get_a();
  int b = get_b(a);
  return a+b;
}

If the functions get_a() and get_b() can block (and I've made get_b() dependent on a to make the execution sequential), in the asynchronous form this function gets split:

struct FuncContext {
  int a;
};

shared_ptr<Future<int>> func()
{
  auto ctx = make_shared<FuncContext>();
  shared_ptr<Future<int>> futa = get_a();
  return futa->chain(func2, ctx) // use default executor
    ->chain(func3, ctx);
}

void func2(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
  ctx->a = arg->value();
  get_b(ctx->a)->chain(result);
}

void func3(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
  int b = arg->value();
  result->return_value(ctx->a + b);
}

This highlights how the asynchronous code is typically written:

  • There are two kinds of asynchronous functions: the "head parts" of the actual meaningful high-level functions, like func(), and the split-out internal fragments of the meaningful functions, like func2() and func3(). They're usually written differently: the heads take their arguments just like the common functions and return a future with the result, while the fragments are tied to some future representing the result of another asynchronous function call, do the next step of computation until they call another asynchronous function, and then return the result of that function as their own result (at least in this pattern where the fragments are pre-chained in advance).
  • The context carries the values between the fragments, and is an analog of a stack frame in a normal function. It's possible to fine-tune each step's context but that's usually more trouble than it's worth, so other than for some obvious optimizations (such as b here not getting stored in the context because it's used in only one fragment), it's much easier and better to just carry the same context throughout all the fragments.

Note that all the dynamic objects are nicely auto-destroyed by reference counting after the function completes, and in the meantime are held alive in the scheduling queues and future chains. However the implication there is that a value stays alive as long as the future containing it stays alive, and if that future is kept for a long time, the value would also be kept.

Why would a future be kept for a long time? Because a future represents both a value and the fact of completion, and the fact of completion might be interesting for much longer than the value, as will be shown in the future installments. In this case it might be useful to chain a future with a value to a future without a value:

shared_ptr<Future<SomeType>> fut_a;
shared_ptr<Promise<void>> prom_b;
...
fut_a->chain(prom_b);

However normally the chaining expects that the types of values on both sides are the same. So this is a special case of converting to void that should be included in the library. If it isn't in the library, it can be implemented as:

template<typename T>
void convert_to_void_impl(shared_ptr<void> ctx, shared_ptr<Future<T>> input, shared_ptr<Promise<void>> result)
{
  result->return_value();
}

template<typename T>
shared_ptr<Future<void>> chain_to_void(shared_ptr<Future<T>> input) {
  return input->chain(convert_to_void_impl<T>, nullptr, input->getExecutor());
}

using an intermediate function to change the type. And if some particular library doesn't support void futures, you can always use an int future instead and never look at its value, just be satisfied that it has one.
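As a hedged usage sketch (start_read() and BigBuffer are made-up names), a caller that only cares about the fact of completion can keep the void future instead of the original one:

shared_ptr<Future<BigBuffer>> data_ready = start_read(); // some asynchronous call
shared_ptr<Future<void>> done = chain_to_void(data_ready);
// 'done' can be kept and chained to for a long time without also keeping
// the BigBuffer value alive the way keeping data_ready around would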

Tuesday, January 21, 2025

Asynchronous programming 1 - Futures and Promises and Executors

A little while ago I worked on an OS project that was written in an entirely asynchronous way, without explicit synchronization. Which is pretty much what I've described in my book in Section 6.4 "Queues as the sole synchronization mechanism", but in a new-fashioned way. So I want to write up the lessons from that experience before I completely forget them.

First, how does the basic programming go in this paradigm? The basic synchronization unit there is a "future" that is used to signal the completion of some operation that may involve a wait, and to return a value resulting from that operation. In pseudocode it looks like this:

Future f = start_some_operation(args);

In the plain programming with futures, you have to wait for the future to be completed before reading the value from it (for now we'll skip over the part of how the values of different types are passed through the futures - basically, in C++ it means that the future object would be a template with the value type as its parameter, and it's much more painful in plain C):

MyValue v = await(f);

However the fully asynchronous programming allows no waits. Instead you chain the future to call the next function:

f.chain(another_operation, context);

When the future becomes completed, it will use its result value to call:

another_operation(context, value)

How does a future get completed? There are two ways of doing this. First, there is something in the runtime environment that completes the futures when the wait is finished. In a fully asynchronous OS this can be the kernel itself, or in a POSIX environment this would be some glue code that translates the end of a system call in some background thread to a completion of a future. Second, this could be done from another chunk of code right in the same environment. Fundamentally, both ways are the same: it's always some other code completing the future, it just can either be a part of the same process, address space, and logical unit, or be somewhere outside of it.

The futures really have two separate APIs, one for the receiving side where the returned value gets tied to some other code, one for the sending side that puts the returned value into the future and marks it completed. For this reason, the sending API is sometimes given a separate name, a "promise". So it's a promise to return a value, which eventually gets fulfilled, and the value comes out on the other side as the completion of a future.
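To make the two sides concrete, here is a hedged sketch of such POSIX glue code in the same pseudocode style; run_in_background_thread() is made up for the illustration, and new_promise(), return_value() and to_future() are just the illustrative names used in the example near the end of this post, not any specific library's API:

// Wraps a blocking read() into an asynchronous operation: a background
// thread does the actual waiting and fulfills the promise when done.
async_read(fd, buf, len)
{
  p = new_promise<ssize_t>();
  run_in_background_thread([=]() {
    n = read(fd, buf, len); // the blocking system call
    p.return_value(n);      // the promise (sending) side completes it
  });
  return p.to_future();     // the future (receiving) side gets chained by the caller
}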

How does the function chained to the future execute? There are two ways: one is to call it immediately, another one is to schedule it for execution later with some kind of a scheduler. They have different trade-offs: the immediate execution can be faster but as the function gets called, it grows the current stack, and a long enough chain can overflow the available stack size. The scheduling is slower but limits the stack growth and can be used across the process boundaries, such as when the kernel code completes a userspace future. This has a very close relation to how things work in Triceps: a future is similar to a Triceps Label, except for the part that the Labels are connected once into a pipeline that is then used to send a continuous stream of data, while the futures are built into ephemeral pipelines that are used only once and then discarded, and new pipelines need to be built for more data. Unlike futures, Triceps always schedules the data that comes through a Label to avoid the stack depth troubles, and also provides certain guarantees of the scheduling order for the labels within a Unit.

However the futures model also has an analog of a Triceps Unit, called an "executor". It's a scheduler with its queue, and also some CPU threads to perform the execution, potentially in a multithreaded way. Looking back, the idea of running short thread snippets on multiple CPUs in https://babkin-cep.blogspot.com/2020/02/scheduling-policy-for-burst.html is exactly such a combination of asynchronous logic with a multithreaded executor.

There can be multiple executors in a process, and not all of them are multithreaded. The single-threaded executors have a special use: they are used as mutexes. Since the functions in the scheduling queue of a single-threaded executor run sequentially, this has the same effect as serializing them on a mutex. So you'd have a separate single-threaded executor for every place where you'd lock a mutex in a common program. Well, sort of, as long as you don't try to do any other asynchronous operation while holding the lock: as soon as you leave the current function, your lock gets released. So the serial executors are more like spinlocks without the performance advantages, and you need to build special patterns to make sleeplocks. All the caveats of locking still apply, and you can cause deadlocks just as in the regular programs (they manifest a little differently, as all futures being incomplete, but still lock up the execution). If you want, you can also create executors with specific scheduling order logic, such as a single-threaded executor with the same logic as a Triceps Unit.

The multithreaded executors are generally all equivalent (since they provide no guarantees about the order of execution), so you'd normally have only one, with as many threads as there are physical CPUs. With some exceptions: you may have snippets executing at different priorities by using different executors (again, one executor per priority is enough), or you may want to limit the load imposed by some code by limiting its parallelism, in which case making a separate executor with a smaller number of threads makes sense.

When creating more executors, don't forget that the underlying resources are not unlimited, and their threads will be fighting it out for the real CPUs at the OS level. And yes, the threads may get preempted at the OS level at the most inconvenient times. So this is the case where having both some OS support and coordination between the executors within the process to synchronize the preemption at the points where the threads switch over to handling the next future can really help.

With the executors, the chaining of the code to a future gets an extra argument, the executor to schedule this code on:

 f.chain(another_operation, context, executor);

And then the chained operation can be executed as a direct function call only if it should run on the same executor as the code that completes the future, and only if the executor's scheduling logic allows it.
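For illustration, a hedged sketch of the decision a library might make at completion time; the names and internal structure here are made up, not any particular library's internals:

// Runs each chained function either inline or through its target executor.
complete_future(f, current_executor)
{
  f.mark_completed();
  for (link in f.chained) {
    if (link.executor == current_executor && link.executor.allows_direct_calls()) {
      link.func(link.context, f.value());  // call directly, growing the current stack
    } else {
      link.executor.schedule(link.func, link.context, f.value()); // queue for later
    }
  }
}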

To give a small example, this pseudo-code with locking:

foo()
{
  before_lock();
  lock(mutexA);
  inside_lock();
  unlock(mutexA);
  after_lock();
}

becomes:

foo()
{
  before_lock();
  p = new_promise<int>();
  p.to_future().chain(inside_lock, NULL, singleExecutorA);
  p.return_value(0);
}

inside_lock(context, value)
{
  ...
  p = new_promise<int>();
  p.to_future().chain(after_lock, NULL, multiExecutor);
  p.return_value(0);
}

after_lock(context, value)
{
  ...
}

A lot more code. With the chaining now built into the functions themselves. And this example also glossed over the fact that if foo() returned a value, it must now return a promise of this value instead (so yes, the non-blocking and potentially blocking functions become different). This code also assumes that mutexA was held as a spinlock only for a very short period of time and no asynchronous operations were called from inside_lock(). Things get very confusing very fast, and organizing them in certain patterns does help. More on that in the next installments.