Sergey Babkin on CEP and stuff: Asynchronous programming 10

What if a computation fails? Then the future still gets completed but has to return an error. At the very least, if the future can return only one value, the returned object should have a place for error indication. But a better asynchronous library would have a place for error indication right in the futures. BTW, to do it right, it shouldn't be just an error code, it should be a proper error object that allows nested errors, and ideally also error lists, like Error in Triceps.
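To make this more concrete, here is a minimal sketch in Python of a future that carries an error object with nesting. It is a toy single-threaded model, not any real library: the names `Error`, `wrap`, `render` and `Future` are made up for this illustration, and the Triceps `Error` class is only the inspiration, not what is shown.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Error:
    """Hypothetical error object: a message plus a list of nested errors."""
    message: str
    nested: List["Error"] = field(default_factory=list)

    def wrap(self, message: str) -> "Error":
        # Add a higher-level description, keeping this error as the cause.
        return Error(message, [self])

    def render(self, indent: int = 0) -> str:
        # Print the message, then every nested error indented under it.
        lines = [" " * indent + self.message]
        for child in self.nested:
            lines.append(child.render(indent + 2))
        return "\n".join(lines)


class Future:
    """Toy future that is completed with either a value or an Error."""

    def __init__(self) -> None:
        self.done = False
        self.value: Any = None
        self.error: Optional[Error] = None

    def complete(self, value: Any = None, error: Optional[Error] = None) -> None:
        if self.done:
            raise RuntimeError("future completed twice")
        self.done, self.value, self.error = True, value, error


fut = Future()
low_level = Error("read failed: disk offline")
fut.complete(error=low_level.wrap("cannot load config"))
print(fut.error.render())
```

Since `nested` is a list, the same object can also serve as an error list by putting several sibling errors under one parent.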

So suppose that our futures do have an error indication. How can these errors be handled?

Chaining between futures is easy: the error just chains through in the same way as the value. A nice consequence is that the asynchronous lock pattern and other patterns like that just work transparently, releasing the lock in any case, error or no error. However, an error object may hold references to objects that we might not want to be stuck in the state of an asynchronous lock. And we don't want the unrelated code that locks the mutex next to start with an error. An error is applicable even to a void future, so it would get stuck even in one of those. So there should be a separate kind of chaining to a void future that passes through only the completion but not the error. If your library doesn't have one, you can make one with a function that is chained to the first future and freshly completes the second future, ignoring the error.
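The two kinds of chaining can be sketched like this, again in a toy single-threaded Python model. The names `chain`, `chain_void` and `on_done` are assumptions for the illustration, and the error here is just a string to keep the example short.

```python
class Future:
    """Toy single-threaded future with a value slot and an error slot."""

    def __init__(self):
        self.done = False
        self.value = None
        self.error = None
        self._callbacks = []

    def complete(self, value=None, error=None):
        self.done, self.value, self.error = True, value, error
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)

    def on_done(self, cb):
        # Run cb(self) right away if already completed, otherwise on completion.
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)


def chain(src, dst):
    """Plain chaining: both the value and the error pass through."""
    src.on_done(lambda f: dst.complete(f.value, f.error))


def chain_void(src, dst):
    """Completion-only chaining: dst is completed afresh, the error is dropped."""
    src.on_done(lambda f: dst.complete())


work = Future()
result = Future()          # what the caller of the locked code sees
lock_released = Future()   # void future that wakes up the next lock holder
chain(work, result)
chain_void(work, lock_released)
work.complete(error="disk offline")
print(result.error, lock_released.done, lock_released.error)
```

The caller still gets the error through `result`, while the future that hands the lock to the next waiter is completed cleanly and holds no reference to the error.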

Chaining functions is more complicated. In the simplest case we can just let the function check for error and handle it (and that's a good reason to have the whole input future as an argument and not just the value from it). But that means doing a lot of the same boilerplate error propagation code in a lot of functions.

The other option is to have the chaining code propagate the error directly from the input future to the result promise, ignoring the function in this case, and basically cancelling the chain. This is very much like how the exceptions work in normal programming, just skipping over the rest of the function and returning an error, so this behavior should be the "normal" chaining, while a function that handles the error in its input future is more like a "catch" or "finally" statement. Note that if the function gets skipped in case of an error, it doesn't really need to see the whole input future, it could as well get just the value from it. With this option, if you've prepared a million-long chain for a loop (not a great idea, better to generate the iterations one by one), it will all get cancelled on the first error.
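A minimal sketch of this "normal" chaining in a toy Python model follows. The helper name `chain_fn` is hypothetical, the functions receive just the value, and the sketch does not deal with exceptions thrown by the functions themselves.

```python
class Future:
    """Toy single-threaded future with a value slot and an error slot."""

    def __init__(self):
        self.done = False
        self.value = None
        self.error = None
        self._callbacks = []

    def complete(self, value=None, error=None):
        self.done, self.value, self.error = True, value, error
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)

    def on_done(self, cb):
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)


def chain_fn(src, fn):
    """Return a future for fn(value); on error skip fn and pass the error on."""
    dst = Future()

    def step(f):
        if f.error is not None:
            dst.complete(error=f.error)   # fn is skipped, like code after a throw
        else:
            dst.complete(fn(f.value))

    src.on_done(step)
    return dst


calls = []

def parse(text):
    calls.append("parse")
    return int(text)

def double(n):
    calls.append("double")
    return n * 2


ok = Future()
ok_result = chain_fn(chain_fn(ok, parse), double)
ok.complete("21")

bad = Future()
bad_result = chain_fn(chain_fn(bad, parse), double)
bad.complete(error="read failed")

print(ok_result.value, bad_result.error, calls)
```

On the failed chain neither `parse` nor `double` runs, and the original error arrives at the end of the chain unchanged.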

The third option (and these options are not mutually exclusive, they all can be available to use as needed) is to chain a function specifically as an error handler, to be run on error only. Which is even more like a "catch" statement than the previous case. But there is a catch: this makes the chain branch, which means that eventually the normal and error paths have to join back together with an AllOf. Which is not only a pain to always add explicitly but it also implies that the error path somehow has to complete even if there is no error, so again there has to be a chain cancellation logic for the error handlers but working in the opposite way, ignoring the functions on success. That's probably not worth the trouble, so the handling of errors by a separate function makes sense mostly as a one-off where depending on the success of the input future either the normal function or the error handler function gets called, sending its result to the same promise, so the fork gets immediately joined back. This is just like doing an if-else in one function but has the benefit of allowing the composition, reusing the same error handling function with many normal processing functions, being truly like a "catch" statement. This pattern is particularly convenient for adding some higher-level details to the error object (as in building a stack trace in normal programming).
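The one-off fork that immediately joins back can be sketched as follows, in a toy Python model. The names `chain_fn_or_handler` and `with_context` are made up for the illustration, and the error is modeled as a plain list of strings with the highest-level description first. The sketch also assumes that the handler always returns a replacement error rather than recovering with a value.

```python
class Future:
    """Toy single-threaded future with a value slot and an error slot."""

    def __init__(self):
        self.done = False
        self.value = None
        self.error = None
        self._callbacks = []

    def complete(self, value=None, error=None):
        self.done, self.value, self.error = True, value, error
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)

    def on_done(self, cb):
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)


def chain_fn_or_handler(src, fn, on_error):
    """Run fn on success or on_error on failure; both feed the same promise."""
    dst = Future()

    def step(f):
        if f.error is None:
            dst.complete(fn(f.value))
        else:
            dst.complete(error=on_error(f.error))

    src.on_done(step)
    return dst


def with_context(context):
    """Build a reusable handler that prepends a higher-level description."""
    def handler(error):
        return [context] + error
    return handler


add_details = with_context("loading the user record")

failed_read = Future()
parsed = chain_fn_or_handler(failed_read, int, add_details)
failed_read.complete(error=["read failed: disk offline"])

good_read = Future()
shouted = chain_fn_or_handler(good_read, str.upper, add_details)
good_read.complete("alice")

print(parsed.error, shouted.value)
```

The same `add_details` handler is composed with two different processing functions, which is the benefit over writing the if-else inside each function.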

The next item is the error handling in AllOf, or in its generalization to map-reduce. How much do we care that some of the branches have failed? A typical case would probably be where we want all of them to succeed, so for that AllOf should collect all the errors from all the branches into one error object.
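For the typical case where all the branches must succeed, an AllOf that collects the errors might look like the sketch below. It is a toy single-threaded Python model, the name `all_of` is an assumption, and the combined error is just a list of the branch errors in branch order.

```python
class Future:
    """Toy single-threaded future with a value slot and an error slot."""

    def __init__(self):
        self.done = False
        self.value = None
        self.error = None
        self._callbacks = []

    def complete(self, value=None, error=None):
        self.done, self.value, self.error = True, value, error
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)

    def on_done(self, cb):
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)


def all_of(futures):
    """Complete after every branch has completed, collecting all the errors."""
    result = Future()
    state = {"left": len(futures)}

    def one_done(_f):
        state["left"] -= 1
        if state["left"] > 0:
            return
        errors = [f.error for f in futures if f.error is not None]
        if errors:
            result.complete(error=errors)          # one combined error object
        else:
            result.complete([f.value for f in futures])

    if not futures:
        result.complete([])
    for f in futures:
        f.on_done(one_done)
    return result


a, b, c = Future(), Future(), Future()
joined = all_of([a, b, c])
c.complete(error="c failed")
a.complete(error="a failed")
still_waiting = not joined.done      # b has not completed yet
b.complete("b value")
print(joined.error)
```

Note that this version waits for all the branches even after the first failure, so that no error gets lost.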

What about a semaphore? There, in a natural way, any returned error would cause the whole semaphore to be cancelled. If the semaphore represents a loop with limited parallelism, that's what we'd probably want anyway. Well, due to the parallelism, there might be hiccups where some more iterations get scheduled by the other futures completing normally at the same time as the one with the error. One possible race comes from the semaphore logic picking one future from the run queue, executing it, and then coming back when its result promise completes. The queue itself is unlocked after the head future got picked from it, so another completed future would pick the next head future from the queue before the first one completes the loop of error propagation. Another possible race comes from the part where it's OK to add more work to the semaphore while running on one of its own chains. Suppose that instead of pre-generating all the iteration head futures in advance we just put into the semaphore one future with a function that on running generates the head of one iteration (but doesn't start it yet!), then reattaches itself to the semaphore, and then lets that iteration run by chaining it to the input future. If this iteration generation function doesn't check for errors, the reattached copy can grab a successfully completed future and run an iteration. So it would help to also add explicit logic to the semaphore that just cancels all the outstanding incoming futures once one of them gets an error. The iteration generation function could also pay attention to the errors and stop generating once an error is seen.
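Here is a rough sketch of a semaphore with this explicit self-collapsing logic, as a toy single-threaded Python model. It is much simplified compared to the semaphore described above: the queued work is modeled as plain callables that start an iteration and return its future, there are no locks and so no races, and the name `LoopSemaphore` is made up for the illustration.

```python
from collections import deque


class Future:
    """Toy single-threaded future with a value slot and an error slot."""

    def __init__(self):
        self.done = False
        self.value = None
        self.error = None
        self._callbacks = []

    def complete(self, value=None, error=None):
        self.done, self.value, self.error = True, value, error
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)

    def on_done(self, cb):
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)


class LoopSemaphore:
    """Runs at most `limit` iterations at once and collapses on the first error."""

    def __init__(self, limit):
        self.limit = limit
        self.running = 0
        self.queue = deque()
        self.failed = None

    def add(self, start):
        # start: zero-argument callable that starts one iteration, returns its Future.
        if self.failed is not None:
            return                      # already collapsed, the new work is dropped
        self.queue.append(start)
        self._pump()

    def _pump(self):
        while self.failed is None and self.running < self.limit and self.queue:
            start = self.queue.popleft()
            self.running += 1
            start().on_done(self._iteration_done)

    def _iteration_done(self, f):
        self.running -= 1
        if f.error is not None and self.failed is None:
            self.failed = f.error
            self.queue.clear()          # cancel everything that has not started yet
        self._pump()


started, pending = [], {}

def make_iteration(i):
    def start():
        started.append(i)
        pending[i] = Future()
        return pending[i]
    return start


sem = LoopSemaphore(limit=2)
for i in range(4):
    sem.add(make_iteration(i))
pending[0].complete(error="boom")   # iteration 0 fails while iteration 1 still runs
pending[1].complete("ok")           # completes normally but must not start iteration 2
print(started, sem.failed)
```

The check of `failed` in `_pump` is what keeps a normally completing iteration from scheduling more work after the error. A semaphore used for general synchronization would simply leave this logic out.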

Of course, not every semaphore needs to be self-collapsing on error, some of them are used for general synchronization, and should ignore the errors.

The most complicated question of error handling is: can we stop the ongoing parallel iterations of a parallel loop when one iteration gets an error? This can be done by setting a flag in a common context, checking it in every function, and bailing out immediately with a special error if this flag is set. This is kind of a pain to do all over the place. So maybe this can be folded into the futures/promises themselves: create a cancellation object and attach it to the futures, so when completing a future with a cancellation object attached, it would check if the cancellation is true and replace the result with a cancellation error instead. Note that this would not happen everywhere but only on the futures where the cancellation object is attached. So when you call into a library, you can't attach the cancellation objects to the futures created inside the library. And you can't always quickly cancel the future that waits for the library call to return because the library might still be using some memory in your state (although this of course depends a lot on the library API, if all the memory is controlled by reference counters then we don't care, we can just let it run and ignore the result).
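Folding the flag check into the futures could look like the following sketch, in a toy Python model. The class name `Cancellation` and the `CANCELLED` error value are assumptions for the illustration; the check happens only at the completion of a future that has the cancellation object attached.

```python
CANCELLED = "cancelled"   # hypothetical special error value


class Cancellation:
    """A shared flag that can be attached to any number of futures."""

    def __init__(self):
        self.cancelled = False

    def cancel(self):
        self.cancelled = True


class Future:
    """Toy future that checks its attached cancellation when completed."""

    def __init__(self, cancellation=None):
        self.done = False
        self.value = None
        self.error = None
        self.cancellation = cancellation

    def complete(self, value=None, error=None):
        if self.cancellation is not None and self.cancellation.cancelled:
            # Replace whatever result was computed with the cancellation error.
            value, error = None, CANCELLED
        self.done, self.value, self.error = True, value, error


loop_cancel = Cancellation()
ours = Future(loop_cancel)       # a future in our own loop iteration
library_internal = Future()      # created inside a library, nothing attached

loop_cancel.cancel()             # some other iteration has failed
ours.complete("iteration result")
library_internal.complete("library result")
print(ours.error, library_internal.value)
```

The future without the attached object completes as usual, which is exactly the limitation with the library internals described above.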

Can we propagate the cancellation objects between futures, so that they would even go through the insides of a library? Generally, yes, we can do it on chaining, but that takes some care.

First, the propagation must stop once we reach the end of the whole logical operation, and also must stop when we go to the void futures for the patterns like the asynchronous mutex. And stop even for non-void futures in the patterns like the cache, where one caller asking to cancel the read shouldn't cancel the read for everyone.

Second, the functions that create intermediate promises from scratch must have a way to propagate the cancellations from their inputs to these newly created promises.

Third, the libraries need to be able to do their own cancellations too, so it's not a single cancellation object but a set of cancellation objects per future, with the overhead of checking them all on every step (and yes, also with the overhead of attaching them all to every step). Although if the sets are not modified often, maybe an optimized version can be devised where the overhead is taken at the set creation time and then the set consolidates the state of all the cancellations in it, making it necessary to attach only one set and check the state of only one set.

Fourth, what about the system calls to the OS, which on a microkernel OS would likely translate to calls in the other processes? The cancellation state cannot be read from other address spaces. Which basically means that as we cross the address space boundary, we need to create a matching cancellation object (and here treating the whole set of cancellation objects as one object helps too) on the other side, remember this match on our side, and then have a system call that would propagate the cancellation to the other side. Fairly complicated but I think doable. Of course, at some point this whole path will get down to the hardware, and there we won't be able to actually interrupt an ongoing operation, but we can arrange to ignore its result and return. And there are things that can't be ignored, for example an app might suddenly stop caring whether its buffer write has succeeded or not, but a filesystem can't ignore whether a metadata block write succeeded or not. However, the filesystem shouldn't keep the app waiting: if the app has lost interest, the filesystem can sort out its metadata writes in the background.

Fifth, between this filesystem write example and the cache example, a cancellation flag also needs to have a future connected to it that would get completed with a cancellation error when the cancellation is triggered. We can then chain from this future directly to the result future of the cache read or block write, "overtaking" the normal result to essentially do an "anyOf", with the first completion setting the result (including an error) into the future and any following attempts to set the result getting ignored. A catch is that when one path completes, the other will still hold a reference to the result future, potentially causing unpleasant memory leaks. And also the cancellation future would keep accumulating these chainings to it after each operation under it gets normally completed. Maybe the cancellation objects would be short-lived and this wouldn't be a real problem. Or maybe this will require thinking of a way to un-chain a path once it gets overtaken by the completion of another path.
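The "overtaking" with a cancellation future can be sketched like this, in a toy single-threaded Python model. The names `Cancellation`, `chain` and `pending_callbacks` are made up for the illustration, and the only special behavior of the future is that a second completion attempt is ignored.

```python
class Future:
    """Toy future where the first completion wins and later ones are ignored."""

    def __init__(self):
        self.done = False
        self.value = None
        self.error = None
        self._callbacks = []

    def complete(self, value=None, error=None):
        if self.done:
            return False             # overtaken by another path, ignore
        self.done, self.value, self.error = True, value, error
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(self)
        return True

    def on_done(self, cb):
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)

    def pending_callbacks(self):
        return len(self._callbacks)


class Cancellation:
    """A cancellation flag with a future that completes when it is triggered."""

    def __init__(self):
        self.future = Future()

    def cancel(self):
        self.future.complete(error="cancelled")


def chain(src, dst):
    src.on_done(lambda f: dst.complete(f.value, f.error))


# Case 1: the cancellation overtakes a slow read.
cancel1, slow_read, result1 = Cancellation(), Future(), Future()
chain(slow_read, result1)
chain(cancel1.future, result1)
cancel1.cancel()
slow_read.complete("data")           # arrives too late and gets ignored

# Case 2: the read completes normally, the cancellation never fires.
cancel2, fast_read, result2 = Cancellation(), Future(), Future()
chain(fast_read, result2)
chain(cancel2.future, result2)
fast_read.complete("data")

print(result1.error, result2.value, cancel2.future.pending_callbacks())
```

In the second case the cancellation future still holds one leftover chaining that refers to `result2`, which is the accumulation problem described above.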

The final thing to say is that the C++ coroutines don't seem smart enough to translate the error handling in promises to look like exception handling at high level. And this is a very important subject, so maybe the coroutines are not the answer yet.

Saturday, February 15, 2025

Asynchronous programming 10 - error handling
