A little while ago I've worked on an OS project that was written in an entirely asynchronous way, without explicit synchronizaton. Which is pretty much what I've described in my book in Section 6.4 "Queues as the sole synchronization mechanism" but in a new-fashioned way. So I want to write up the lessons from that experience before I completely forgot them.
First, how is the basic programming goes in this paradigm. The basic synchronization unit there is a "future" that is used to signal the completion of some operation that may involve a wait, and return a value resulting form that operation. In pseudocode it looks like this:
Future f = start_some_operation(args);
In the plain programming with futures, you have to wait for the future to be completed before reading the value from it (for now we'll skip over the part of how the values of different types are passed through the futures - basically in C++ it means that the future object would be a template with the value type as its parameter, and much more painful in plain C):
MyValue v = await(f);
However the fully asynchronous programming allows no waits. Instead you chain the future to call the next function:
f.chain(another_operation, context);
When the future becomes completed, it will use its result value to call:
another_operation(context, value)
How does a future get completed? There are two varieties of doing this. First, there is something in the runtime environment that completes the futures when the wait is finished. In a fully asynchronous OS this can be the kernel itself, or in a POSIX environment this would be some glue code that translates the end of a system call in some background thread to a completion of a future. Second, this could be done from another chunk of code right in the same environment. Fundamentally, both ways are the same, it's always some other code completing the future, just it can be a part of the same process and address space and logical unit or be somewhere outside of it.
The futures really have two separate APIs, one for the receiving side where the returned value gets tied to some other code, one for the sending side that puts the returned value into the future and marks it completed. For this reason, the sending API is sometimes given a separate name, a "promise". So it's a promise to return a value, which eventually gets fulfilled, and the value comes out on the other side as the completion of a future.
How does the function chained to the future execute? There are two ways: one is to call it immediately, another one is to schedule it for execution later with some kind of a scheduler. They have different trade-offs: the immediate execution can be faster but as the function gets called, it grows the current stack, and a long enough chain can overflow the available stack size. The scheduling is slower but limits the stack growth and can be used across the process boundaries, such as when the kernel code completes a userspace future. This has a very close relation to how things work in Triceps: a future is similar to a Triceps Label, except for the part that the Labels are connected once into a pipeline that is then used to send a continuous stream of data, while the futures are built into ephemeral pipelines that are used only once and then discarded, and new pipelines need to be built for more data. Unlike futures, Triceps always schedules the data that comes through a Label to avoid the stack depth troubles, and also provides certain guarantees of the scheduling order for the labels within a Unit.
However the futures model also has an analogy of a Triceps Unit, called an "executor". It's a scheduler with its queue, and also some CPU threads to perform the execution, potentially in a multithreaded way. Looking back, the idea of running short thread snippets on multiple CPUs in https://babkin-cep.blogspot.com/2020/02/scheduling-policy-for-burst.html is exactly such a combination of asynchronous logic with a multithreaded executor.
There can be multiple executors in a process, and not all of them are multithreaded. The single-threaded executors have a special use: they are used as mutexes. Since the functions in the scheduling queue of a single-threaded executor run sequentially, this has the same effect as serializing them on a mutex. So you'd have a separate single-threaded executor for every place where you'd lock a mutex in a common program. Well, sort of, as long as you don't try to do any other asynchronous operation when holding the lock - then as soon as you leave the current function, your lock gets released, so the serial executors are more like spinlocks without performance advantages, and you need to build special patterns to make sleeplocks. All the caveats of locking still apply, and you can cause deadlocks just as in the regular programs (they manifest a little differently, as all futures being incomplete, but still lock up the execution). If you want, you can also create executors with specific scheduling order logic, such as a single-threaded executor with the same logic as a Triceps Unit.
The multithreaded executors are generally all equivalent (since they provide no guarantees about the order of execution), so you'd normally have only one, with as many threads as there are physical CPUs. With some exceptions. You may have snippets executing at different priorities by using different executors (again, one executor per priority is enough). Or you may want to limit the load imposed by some code by limiting its parallelism, then making a separate executor with a smaller number of threads makes sense.
When creating more executors, don't forget that the underlying resources are not unlimited, and their threads will be fighting it out for the real CPUs at the OS level. And yes, the threads may get preempted at the OS level at the most inconvenient times. So this is the case where having both some OS support and coordination between the executors within the process to synchronize the preemption at the points where the threads switch over to handling the next future can really help.
With the executors, the chaining of the code to a future gets an extra argument, the executor to schedule this code on:
f.chain(another_operation, context, executor);
And then the chained operation can be executed as a direct function call only if it should run on the same executor as the code that completes the future, and only if the executor's scheduling logic allows it.
To give a small example, this pseudo-code with locking:
foo()
{
before_lock();
lock(mutexA);
inside_lock();
unlock(mutexA);
after_lock();
}
becomes:
foo()
{
before_lock();
p = new_promise<int>();
p.to_future().chain(inside_lock, NULL, singleExecutorA);
p.return_value(0);
}
inside_lock(context, value)
{
...
p = new_promise<int>();
p.to_future().chain(after_lock, NULL, multiExecutor);
p.return_value(0);
}
after_lock(context, value)
{
...
}
A lot more code. With the chaining now built into the functions themselves. And this example also glossed over the fact that if foo() returned a value, it must now return a promise of this value instead (so yes, the non-blocking and potentially blocking functions become different). This code also assumes that mutexA was held as a spinlock only for a very short period of time and no asynchronous operations were called from inside_lock(). Things get very confusing very fast, and organizing them in certain patterns does help. More on that in the next installments.