Sergey Babkin on CEP and stuff

programming by example

2025-08-07T01:23:00.001-04:00

Everyone is excited one way or another about the "vibe programming" with AI. Will it replace the professional programmers? In short, I'd say, probably not. Will it improve productivity? I think, probably yes, but maybe not in the ways that people envision it now. Will it allow non-programmers to write programs without involving professionals? Definitely yes (and observing such an experience is what prompted me to write this post) but probably not in the industrial settings.

My personal experience of using the AI is not really successful. But that's probably because I'm not asking it to build the simple things. I ask it things when I get stumped, as a shortcut instead of digging through piles of code or searching through (often non-existent) documentation. And the reality is that when I get stumped, the AI gets stumped too. It offers solutions but they don't work, like referring to some non-existing libraries. But it's still useful, because this shows the at least semi-working examples and points to a direction for further digging, either in the code or in the documentation, and happens to work better than a simple search.

The really important part here is the generated examples. Note that the previous version of programming-without-understanding has been to search and copy-paste the examples from Stackoverflow and such. And the same works even in your own code: it's always harder to write some new thing from scratch, but once you have an example of something similar in your codebase, you can copy and modify it very quickly and easily. Over 25 years ago I've had an argument about a man page: I've been saying that we should add an example of usage of the function at the end of the man page, and a more senior engineer was telling me that no, the examples don't belong in the man pages, they should only contain the descriptions. The experience of the following 25 years definitely proved me right: it's much easier to understand the usage of a function from an example and copy-paste that example than by reading a lengthy description and trying to figure out all the fine points and all the ambiguities in it. When we write APIs, we write them with some specific usage in mind (with some exceptions like a bunch of Enterprise Java APIs, but that's the reason why these are horrible). And it's much better and easier for everyone to provide the examples of this usage. The "test-driven programming" is another example of the importance of examples: it tells us to start by writing the examples of usage, and only then the implementation. The name of "test-driven" is actually misleading, since the examples are not tests. The point of examples is to show the typical usage, the point of the tests is to show the corner cases in a much more detailed and elaborate way. So if you take that name at face value and think that the examples are the tests, or start by writing the actual tests, you're going to have a bad time. But if you think of it as "example-driven programming", it's great. 
And that's basically what the AI-generated code is: the examples on steroids, fine-tuned to your own case but not necessarily fully correct.

By the way, if you think that the programming ability comes to the AI naturally, you're mistaken. There are companies that specialize on producing the labeled datasets for AI training, and they provide the specialized datasets for programming too. Moreover, they do the fine-tuning, running sessions through corrections and modifications to analyze the errors and make the code work, with these sessions incorporated into the training data (and yes, the same kind of fine-tuning happens for the non-programming subjects too).

And, well, this is also how it works for the non-professionals: the AI supplies them with tailored examples that often work correctly for the simple problems. An important point is that the problems might be logically simple but require a wide knowledge of a huge library that would take a professional programmer a long time to learn the old-fashioned way. So that's also a great way for the professional programmers to start using and learn the important parts of an unfamiliar library. But also what is trivial for a professional, isn't trivial for someone who doesn't know much about programming. This definitely brings programming to the masses. And if in some professional settings these people (say, quantitative analysts) can hire professional programmers for help, there are lots of people who could use the programming casually but can't hire a specialist. The AI is a great resource for them. But does it threaten the professional labor market? Probably not, because these are mostly people who couldn't afford to hire a professional in the first place. Although it probably would affect the low end of the market, where people hire a barely competent programmer for specialized repeated tasks. However setting the problems for AI is also a skill that isn't equally mastered by everyone, so there will be a new market for people who specialize in managing the AI.

The problem that the non-professionals have with the AI-generated code is that even the small errors stump them. Accidentally lose or insert an extra backslash in a regular expression, and nothing works any more, and someone who doesn't know the regular expressions will never be able to tell why. Perhaps the AI could be trained to do the debugging by offering the things to try and paste back the printouts. Like, you know, when in the 1960s a programmer instructed a non-professional client over the phone. This AI-assisted debugging would also come useful for the professionals.

How about the more general predictions for the rest of us? Will it run us out of business? My guess is that it will change the relative value of the skills similarly to how the introduction of the compilers changed them. The wider use of AI will probably devalue the specialized knowledge of various libraries, making them easier to learn by example. It would probably also create better code analysis tools, able to expand on the fly all the "auto" declarations into actual types and to show where each variable came from. There probably will be new AI-oriented programming languages too. The human-oriented programming languages are built towards reducing the redundancy, making things easier to express in a concise way, manageable for the limited attention span of a human mind. The AI-oriented programming languages would probably have a lot more things spelled out explicitly as a way to catch more errors in the AI-generated code. Perhaps even the same languages will have the human and AI versions, with all the explicitly spelled-out redundant statements either hidden or expanded. And the expanded form would also be a great code analysis tool for the humans.

momentum stochastic descent

2025-06-01T22:47:00.008-04:00

Attending the NeurIPS conference last year gave me the idea that I should combine the momentum descent with stochastic descent (people already do that), and then try my changes on that. There is a bit of a problem with how to compute the gradients without incurring twice the overhead, but I've realized that it can be approximated in an easy way, by averaging the gradients from the steps of stochastic descent. Yes, they would be computed at different points in the vicinity, but perhaps close enough. So I've done a little bit of code refactoring to make it a little more structured, and then implemented the momentum on stochastic descent.

The good news is that the momentum descent works pretty well out of the box, producing about half the error in the same number of steps as the plain stochastic descent.

The bad news is that stopping the momentum in the dimensions where the gradient changes sign doesn't work. Well, it still works better than plain stochastic descent but not as good as plain FISTA + stochastic. How about not stopping but reducing the momentum, multiplying it by some coefficient less than 1? The result is about proportional: 0.1 is close to just stopping, 0.9 is close to plain FISTA, and 0.5 is about half-way. Which I think means that the braking definitely messes things up, and it's a wrong-shaped function for braking this way. There is a huge number of dimensions that experience the gradient sign flipping on my examples, like half of them, and evidently it's completely OK.
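To make the braking experiment above concrete, here is a minimal sketch of one descent step. It is not the code used for the experiments: it assumes plain ("heavy ball") momentum with a constant coefficient instead of the FISTA schedule, and the function name momentumStep and all of its parameters are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the experiments: one step of plain
// momentum descent with per-dimension braking.
//   x        - current point, updated in place
//   velocity - accumulated momentum, updated in place
//   grad     - gradient near x (e.g. averaged over recent stochastic steps)
//   prevGrad - gradient from the previous step, updated in place
//   rate     - step size
//   mu       - momentum coefficient (a constant here, an assumption made
//              only to keep the sketch short)
//   brake    - multiplier for the momentum of a dimension whose gradient
//              changed sign: 0 stops it, 1 leaves it untouched
void momentumStep(std::vector<double> &x, std::vector<double> &velocity,
                  const std::vector<double> &grad,
                  std::vector<double> &prevGrad,
                  double rate, double mu, double brake)
{
    assert(x.size() == velocity.size());
    assert(x.size() == grad.size());
    assert(x.size() == prevGrad.size());
    for (std::size_t i = 0; i < x.size(); i++) {
        if (grad[i] * prevGrad[i] < 0.) {
            // The gradient flipped its sign in this dimension: brake.
            velocity[i] *= brake;
        }
        velocity[i] = mu * velocity[i] - rate * grad[i];
        x[i] += velocity[i];
        prevGrad[i] = grad[i];
    }
}
```

With brake set to 0 this is the "stopping" variant, with 1 it degenerates into plain momentum, and the values in between give the proportional behavior described above.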

But maybe it can still be used to adjust the step size. We'll see. I'll need to analyze and understand more what is going on in there.

BCPL, B, and C

2025-04-19T15:22:00.001-04:00

Have you ever wondered, what were the earlier BCPL and B languages? As it turns out, Bell Labs still hosts Dennis Ritchie's historic web pages (http://cm.bell-labs.co/who/dmr/) that include the descriptions of BCPL (http://cm.bell-labs.co/who/dmr/bcpl.pdf) and B (http://cm.bell-labs.co/who/dmr/kbman.pdf), and also the source code of a couple of early versions of the C compiler (http://cm.bell-labs.co/who/dmr/primevalC.html).

So from what I can tell, B and C aren't exactly the breaking points in the language itself, instead the language had gradually evolved from BCPL (just as according to the stories there weren't any versions of Unix inside Bell Labs research, it was just snapshotted and packaged as versions for external consumption). BCPL is a very old language, from the times when the computers had widely differing character sets (and mostly in 6-bit encodings), so the language document defined the logical lexemes that got mapped to specific characters in each implementation. So you could see how they'd say "let's use curly braces for the block start and end", "let's skip the statements that are duplications to save memory" (BCPL had separate keywords for positive and negative conditions in the loops, and also the version of "if" statement with "else" was named "test"), "some more operators would be convenient", and get a Bell Labs dialect of BCPL that became B. And then B is almost the same as an early version of C.

It looks like the breaking points for "a new language" were in fact determined by the compiler back-end, the code generation. B went from direct assembly to "threaded code" (having nothing to do with our modern idea of threads), which was apparently a popular concept in the 1970s (and still most widely known from the Forth language). The idea there is that the program is built of a sequence of subroutine addresses, each of them essentially an instruction of a virtual machine. But these instructions don't need parsing, all you need to do is to jump to the address and you get to the machine-code subroutine that implements this virtual instruction. Except that at the end of the subroutine it doesn't do a return, instead it does the jump to the address in the next virtual machine instruction. On PDP-11 this jump was done as

jmp *(r3)+

with r3 containing the program counter of the virtual machine. That's where the "threaded" in the name comes from, r3 is the "thread" that sews together the code fragments. The advantage is that the generated code is more compact than the direct machine code (a big deal for the very small memories of the day), and also as the interpreters go, it's a very fast one, losing to the direct machine code only by a factor of 3-5 or so. The idea of threaded code can still be useful for the asynchronous programming, as described in https://babkin-cep.blogspot.com/2025/02/asynchronous-programming-9-composition.html. And then C went back from the threaded code to the direct machine code generation.
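The shape of such a program can be approximated in portable C++, with one important difference: standard C++ has no way to end a function with a jump to the next one, so in this sketch each subroutine returns to a tiny dispatch loop that does the fetch-and-advance instead. All the names here (Vm, Op, run and the toy instructions) are made up for the illustration.

```cpp
#include <cassert>
#include <vector>

struct Vm;
// Each virtual instruction is nothing but the address of a subroutine.
using Op = void (*)(Vm &);

struct Vm {
    const Op *pc = nullptr;  // plays the role of r3: points at the next address
    std::vector<int> stack;  // operand stack of this toy machine
    bool running = true;
};

void pushOne(Vm &vm) { vm.stack.push_back(1); }
void pushTwo(Vm &vm) { vm.stack.push_back(2); }

void add(Vm &vm)
{
    assert(vm.stack.size() >= 2);
    int b = vm.stack.back();
    vm.stack.pop_back();
    vm.stack.back() += b;
}

void halt(Vm &vm) { vm.running = false; }

// Runs a program that is just a sequence of subroutine addresses and
// returns the top of the stack (or 0 if the stack ended up empty).
int run(const Op *program)
{
    Vm vm;
    vm.pc = program;
    while (vm.running) {
        Op op = *vm.pc++;  // the "(r3)+" part: fetch the address, advance
        op(vm);            // the "jmp" part, done here as a call and return
    }
    return vm.stack.empty() ? 0 : vm.stack.back();
}
```

A program is then an array like {pushOne, pushTwo, add, halt}, with no opcodes to decode: the "instruction set" is whatever subroutines exist.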

Some interesting tidbits about BCPL and B:

* The lvalue and rvalue terms come all the way from the CPL (of which BCPL is a reduced version). There are no words "pointer" nor "reference" nor even "address" in the BCPL document, only "lvalue". It actually would be very difficult to understand without the prior knowledge of lvalue and rvalue terms, but my guess is that it builds on the CPL definition with a better explanation.

* BCPL is a word-based language where the addresses are of the words, and so there is no difference between integer and pointer arithmetic. So when implemented on PDP-11, a byte-addressed 16-bit machine, B still has lvalues as word addresses, shifted right by one bit from the machine addresses. So adding 1 to the address actually does add 1 to the number, and the opposite shift left by one bit is done only when dereferencing the pointer.

* The strings had both '\0' at the end and the 1-byte string length at the front.

* The arrays (naturally, of words, as this was the only supported type) used one more word before the array to store the address (lvalue) of the first word of the array, rather than using a constant for that. And that lvalue was also modifiable!

* The multi-character character constants that still exist in C originate in BCPL, where you could specify up to one word's worth of characters (which might be six).

* The "break" statement (but not "continue") existed in BCPL, disappeared in B, and then reappeared in C.

* BCPL had both symbolic constants and a preprocessor, both of which disappeared in B.
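The word-addressing item from the list above can be illustrated with a small sketch. It only models the arithmetic on a simulated byte-addressed memory, and the names WordAddr, ByteAddr, toWordAddr, toByteAddr and loadWord are made up for the illustration, they don't come from the B documentation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of B-style lvalues on a byte-addressed 16-bit machine:
// the lvalue is the machine (byte) address shifted right by one bit.
using WordAddr = std::uint16_t;
using ByteAddr = std::uint16_t;

constexpr WordAddr toWordAddr(ByteAddr b)
{
    return static_cast<WordAddr>(b >> 1);
}

constexpr ByteAddr toByteAddr(WordAddr w)
{
    return static_cast<ByteAddr>(w << 1);
}

// Dereferencing is the only place where the shift left happens.
// The bytes are combined in the little-endian order of PDP-11.
std::uint16_t loadWord(const std::uint8_t *mem, WordAddr w)
{
    assert(mem != nullptr);
    ByteAddr b = toByteAddr(w);
    return static_cast<std::uint16_t>(mem[b] | (mem[b + 1] << 8));
}
```

Adding 1 to a WordAddr is plain integer arithmetic, and the result still refers to the next whole word, which is why integer and pointer arithmetic can be the same thing in such a language.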

Asynchronous programming 11 - writing libraries

2025-02-15T18:27:00.001-05:00

Writing the libraries in the asynchronous way is a special challenge. There is so much context attached to the futures, and implicitly inherited between them (especially through inlining) that it tends to spill over in all the wrong places. The libraries have to take explicit steps to prevent this spill-over from their callers and to their callers. And of course the libraries at the deeper levels have to do this too.

The most frequent and absolutely worst issue is with inlining. Getting rid of the whole inlining debacle by using a delayed scheduling slot solves this problem. Otherwise you have to explicitly chain every future entering your library to your own trampolined future, and also mark every future exiting your library as trampolined.

The executors represent another issue, just as bad. You don't want your code to run on the caller's executor, and don't want caller's code to run on your executor. So not only trampoline the inputs but trampoline on your executor. But this doesn't solve the exit side, not letting the user code processing your result to run on your executor. You need not only to have the caller provide the result promises but also to make sure that they're trampolined on a specific executor. However, like for inlining, there is an easy solution: get rid of the explicit executors in the asynchronous subsystem altogether. There is no good reason to use the serial executors, they're all pain and no gain, and all the parallel executors are effectively equivalent, so having one global executor per process is sufficient. If you want to limit the degree of multithreading, use some form of semaphores.

Next, do the error handling and cancellations right, this usually requires some thinking through of your code.

Another item that can make sense is priorities. There would be priority inversions as usual, but not any worse than with synchronous programming. In fact, it could be resolved with priority inheritance in a fairly straightforward way: the cancellations propagate (and stop propagation) in the same way as the priority inheritance would, so all we need to do is to stick the priority propagation on top of the pre-existing cancellation mechanism (of course if it does pre-exist).

And as the last point, try to avoid using raw pointers as arguments, this will save a great many issues with the memory getting freed under them. Use the reference-counted shared pointers instead whenever possible.

Asynchronous programming 10 - error handling

2025-02-15T15:29:00.001-05:00

What if a computation fails? Then the future still gets completed but has to return an error. At the very least, if the future can return only one value, the returned object should have a place for error indication. But a better asynchronous library would have a place for error indication right in the futures. BTW, to do it right, it shouldn't be just an error code, it should be a proper error object that allows nested errors, and ideally also error lists, like Error in Triceps.

So suppose that our futures do have an error indication. How can these errors be handled?

Chaining between futures is easy: just the error chains through in the same way as the value. A nice consequence is that the asynchronous lock pattern and other patterns like that just work transparently, releasing the lock in any case, error or no error. However an error object may have references to the objects that we might not want to be stuck in the state of an asynchronous lock. And we don't want the unrelated code that locks the mutex next to start with an error. An error is applicable even to a void future, so it would get stuck even in one of those. So there should be a separate case of chaining to a void future that just passes through the completion but not the error. If your library doesn't have one, you can make one with a function that is chained to the first future and freshly completes the second future, ignoring the error.
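Here is a minimal sketch of the difference between the two kinds of chaining to a void future. It assumes a toy single-threaded future with a plain string for the error (empty meaning success), and the names VoidFuture, chain and chainCompletionOnly are hypothetical, not from any particular library.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Minimal single-threaded void future; every name here is hypothetical.
class VoidFuture {
public:
    using Callback = std::function<void(const std::string &error)>;

    // Completes the future; an empty error string means success.
    void complete(const std::string &error = "")
    {
        assert(!done_);
        done_ = true;
        error_ = error;
        auto cbs = std::move(callbacks_);
        callbacks_.clear();
        for (auto &cb : cbs) cb(error_);
    }

    // Runs cb on completion, or right away if already completed.
    void onDone(Callback cb)
    {
        if (done_) cb(error_); else callbacks_.push_back(std::move(cb));
    }

    bool done() const { return done_; }
    const std::string &error() const { return error_; }

private:
    bool done_ = false;
    std::string error_;
    std::vector<Callback> callbacks_;
};

// Ordinary chaining: the error travels along with the completion.
void chain(std::shared_ptr<VoidFuture> from, std::shared_ptr<VoidFuture> to)
{
    from->onDone([to](const std::string &error) { to->complete(error); });
}

// Completion-only chaining: the second future gets completed afresh, so
// whoever waits on it (say, the next owner of a lock) never sees the error.
void chainCompletionOnly(std::shared_ptr<VoidFuture> from,
                         std::shared_ptr<VoidFuture> to)
{
    from->onDone([to](const std::string &) { to->complete(); });
}
```

The future that releases the lock would be attached with chainCompletionOnly(), while the future that reports the outcome to the caller would still use the ordinary chain().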

Chaining functions is more complicated. In the simplest case we can just let the function check for error and handle it (and that's a good reason to have the whole input future as an argument and not just the value from it). But that means doing a lot of the same boilerplate error propagation code in a lot of functions.

The other option is to have the chaining code propagate the error directly from the input future to the result promise, ignoring the function in this case, and basically cancelling the chain. This is very much like how the exceptions work in normal programming, just skipping over the rest of function and returning an error, so this behavior should be the "normal" chaining, while a function that handles the error in its input future is more like a "catch" or "finally" statement. Note that if the function gets skipped in case of an error, it doesn't really need to see the whole input future, it could as well get just the value from it. With this option, if you've prepared a million-long chain for a loop (not a great idea, better generate each iteration one by one), it will all get cancelled on the first error.
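A sketch of this "normal" chaining could look as follows. It is built on a toy single-threaded future with a string for the error, which is far from the proper nested error object recommended above, and it calls the continuations inline, which a real library would replace with scheduling. The names Future and chain and their signatures are assumptions made for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A tiny single-threaded future; all names are hypothetical. The error is
// a plain string (empty means success) only to keep the sketch short.
template <typename T>
class Future {
public:
    using Callback = std::function<void(const Future<T> &)>;

    void complete(const T &value) { value_ = value; finish(); }
    void fail(const std::string &error)
    {
        assert(!error.empty());
        error_ = error;
        finish();
    }

    bool done() const { return done_; }
    bool hasError() const { return !error_.empty(); }
    const T &value() const { return value_; }
    const std::string &error() const { return error_; }

    // Runs cb on completion, or right away if already completed.
    void onDone(Callback cb)
    {
        if (done_) cb(*this); else callbacks_.push_back(std::move(cb));
    }

private:
    void finish()
    {
        assert(!done_);
        done_ = true;
        auto cbs = std::move(callbacks_);
        callbacks_.clear();
        for (auto &cb : cbs) cb(*this);
    }

    T value_{};
    std::string error_;
    bool done_ = false;
    std::vector<Callback> callbacks_;
};

// "Normal" chaining: func gets only the value and is skipped on error,
// with the error going straight from the input to the result.
template <typename Out, typename In, typename Func>
std::shared_ptr<Future<Out>> chain(std::shared_ptr<Future<In>> input, Func func)
{
    auto result = std::make_shared<Future<Out>>();
    input->onDone([result, func](const Future<In> &in) {
        if (in.hasError())
            result->fail(in.error());
        else
            result->complete(func(in.value()));
    });
    return result;
}
```

If the first future of a chain built this way fails, none of the chained functions run, and the same error shows up in the last future of the chain.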

The third option (and these options are not mutually exclusive, they all can be available to use as needed) is to chain a function specifically as an error handler, to be run on error only. Which is even more like a "catch" statement than the previous case. But there is a catch: this makes the chain branch, which means that eventually the normal and error paths have to join back together with an AllOf. Which is not only a pain to always add explicitly but it also implies that the error path somehow has to complete even if there is no error, so again there has to be a chain cancellation logic for the error handlers but working in the opposite way, ignoring the functions on success. That's probably not worth the trouble, so the handling of errors by a separate function makes sense mostly as a one-off where depending on the success of the input future either the normal function or the error handler function gets called, sending their result to the same promise, so the fork gets immediately joined back. This is just like doing an if-else in one function but has the benefit of allowing the composition, reusing the same error handling function with many normal processing functions, being truly like a "catch" statement. This pattern is particularly convenient for adding some higher-level details to the error object (as in building a stack trace in normal programming).

The next item is the error handling in AllOf, or in its generalization to map-reduce. How much do we care that some of the branches have failed? A typical case would probably be where we want all of them to succeed, so for that AllOf should collect all the errors from all the branches into one error object.
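The error-collecting join can be sketched as a counter that also accumulates the errors. This is a single-threaded illustration with strings standing in for error objects, and the class name AllOfErrors with its methods is made up; a real implementation would need an atomic counter or a lock, and would complete a future instead of calling a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error-collecting join: waits for `count` branches and hands
// the list of all their errors (empty on full success) to one callback.
class AllOfErrors {
public:
    using Done = std::function<void(const std::vector<std::string> &errors)>;

    AllOfErrors(std::size_t count, Done done)
        : remaining_(count), done_(std::move(done))
    {
        assert(count > 0);
    }

    // Called once per branch; an empty string means the branch succeeded.
    void branchDone(const std::string &error = "")
    {
        assert(remaining_ > 0);
        if (!error.empty()) errors_.push_back(error);
        if (--remaining_ == 0) done_(errors_);
    }

private:
    std::size_t remaining_;
    std::vector<std::string> errors_;
    Done done_;
};
```

The join fires only after the last branch reports in, so a failure in one branch doesn't hide the failures in the others.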

What about a semaphore? There in a natural way any returned error would cause the whole semaphore to be cancelled. If the semaphore represents a limited-parallelized loop, that's what we'd probably want anyway. Well, due to the parallelism, there might be hiccups where some more iterations get scheduled by the other futures completing normally at the same time as the one with the error. One possible race comes from the semaphore logic picking one future from the run queue, executing it, and then coming back when its result promise completes. The queue itself is unlocked after the head future got picked from it, so another completed future would pick the next head future from the queue before the first one completes the loop of error propagation. Another possible race comes from the part where it's OK to add more work to the semaphore while running on one of its own chains. Suppose that instead of pre-generating all the iteration head futures in advance we just put into the semaphore one future with a function that on running generates the head of one iteration (but doesn't start it yet!), then reattaches itself to the semaphore, and then lets that iteration run by chaining it to the input future. Then if this iteration generation function doesn't check for errors, the reattached copy can grab a successfully completed future and run an iteration. So it would help to also add explicit logic to the semaphore that just cancels all the outstanding incoming futures once one of them gets an error. And the iteration generation function could also pay attention to errors and stop generating once an error is seen.

Of course, not every semaphore needs to be self-collapsing on error, some of them are used for general synchronization, and should ignore the errors.

The most complicated question of error handling is: can we stop the ongoing parallel iterations of a parallel loop when one iteration gets an error? This can be done by setting a flag in a common context, checking it in every function, and bailing out immediately with a special error if this flag is set. This is kind of a pain to do all over the place. So maybe this can be folded into the futures/promises themselves: create a cancellation object and attach it to the futures, so when completing a future with a cancellation object attached, it would check if the cancellation is true and replace the result with a cancellation error instead. Note that this would not happen everywhere but only on the futures where the cancellation object is attached. So when you call into a library, you can't attach the cancellation objects to the futures created inside the library. And you can't always quickly cancel the future that waits for the library call to return because the library might still be using some memory in your state (although this of course depends a lot on the library API, if all the memory is controlled by reference counters then we don't care, we can just let it run and ignore the result).
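A minimal sketch of a cancellation object attached to a future might look like this. The names Cancellation and IntFuture are hypothetical, the future is reduced to a bare result holder with a fixed int value type, and the error is again a plain string.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical cancellation flag, shared by the futures of one operation.
struct Cancellation {
    std::atomic<bool> cancelled{false};
};

// Minimal result holder standing in for a future; names are hypothetical.
class IntFuture {
public:
    void attach(std::shared_ptr<Cancellation> c) { cancel_ = std::move(c); }

    // Completing checks the attached flag and, if it is set, replaces
    // whatever result was offered with a cancellation error.
    void complete(int value, const std::string &error = "")
    {
        assert(!done_);
        done_ = true;
        if (cancel_ && cancel_->cancelled.load()) {
            value_ = 0;
            error_ = "cancelled";
        } else {
            value_ = value;
            error_ = error;
        }
    }

    bool done() const { return done_; }
    int value() const { return value_; }
    const std::string &error() const { return error_; }

private:
    std::shared_ptr<Cancellation> cancel_;
    bool done_ = false;
    int value_ = 0;
    std::string error_;
};
```

Only the futures that had attach() called on them react to the flag, any other future completes as usual, which matches the limitation with the futures created inside a library.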

Can we propagate the cancellation object between futures, so that they would even go through the insides of a library? Generally, yes, we can do it on chaining, but that takes some care.

First, the propagation must stop once we reach the end of the whole logical operation, and also must stop when we go to the void futures for the patterns like the asynchronous mutex. And stop even for non-void futures in the patterns like the cache, where one caller asking to cancel the read shouldn't cancel the read for everyone.

Second, the functions that create intermediate promises from scratch must have a way to propagate the cancellations from their inputs to these newly created promises.

Third, the libraries need to be able to do their own cancellations too, so it's not a single cancellation object but a set of cancellation objects per future, with the overhead of checking them all on every step (and yes, also with overhead of attaching them all to every step). Although if the sets are not modified often, maybe an optimized version can be devised where the overhead is taken at the set creation time and then the set consolidates the state of all the cancellations in it, making it necessary to attach only one set and check the state of only one set.

Fourth, what about the system calls to the OS, which on a microkernel OS would likely translate to calls in the other processes? The cancellation state cannot be read from other address spaces. Which basically means that as we cross the address space boundary, we need to create a matching cancellation object (and here treating the whole set of cancellation objects as one object helps too) on the other side, remember this match on our side, and then have a system call that would propagate the cancellation to the other side. Fairly complicated but I think doable. Of course, at some point this whole path will get down to the hardware, and there we won't be able to actually interrupt an ongoing operation, but we can arrange to ignore its result and return. And there are things that can't be ignored, for example an app might suddenly stop caring whether its buffer write has succeeded or not, but a filesystem can't ignore whether a metadata block write succeeded or not. However this filesystem shouldn't keep the app waiting, if the app has lost interest, the filesystem can sort out its metadata writes in the background.

Fifth, between this filesystem write example and the cache example, a cancellation flag also needs to have a future connected to it, that would get completed with a cancellation error when the cancellation is triggered. We can then chain from this future directly to the result future of the cache read or block write, "overtaking" the normal result to essentially do an "anyOf", with the first completion setting the result (including error) into the future and any following completion attempts to set the result getting ignored. A catch is that when one path completes, the other will still hold a reference on the result future, potentially causing the unpleasant memory leaks. And also the cancellation future would keep accumulating these chainings to it after each operation under it gets normally completed. Maybe the cancellation objects would be short-lived and this wouldn't be a real problem. Or maybe this will require to think of a way for un-chaining once it gets overtaken by completion of another path.

The final thing to say is that the C++ coroutines don't seem smart enough to translate the error handling in promises to look like exception handling at high level. And this is a very important subject, so maybe the coroutines are not the answer yet.

Asynchronous programming 9 - composition

2025-02-12T04:47:00.003-05:00

Consider this traditional function:

void writeHeader()
{
    char buf[512];
    // ... populate the buffer ...
    write(buf, 512);
}

and its asynchronous version:

void writeHeader(
    shared_ptr<FutureBase> input,
    shared_ptr<HeaderCtx> ctx,
    shared_ptr<WritePromise> result)
{
    char buf[512];
    // ... populate the buffer ...
    write(buf, 512)->chain(result);
}

What is wrong with it? The buffer on the stack gets freed before the write completes and the next scheduled function fills it with garbage. The next version:

struct HeaderCtx {
    ...
    char buf[512];
};

void writeHeader(
    shared_ptr<FutureBase> input,
    shared_ptr<HeaderCtx> ctx,
    shared_ptr<WritePromise> result)
{
    // ... populate the buffer ...
    write(ctx->buf, 512)->chain(result);
}

Potentially better, with buffer in the context (remember, the context is an analog of a stack frame in the normal functions) but now the context gets freed immediately after writeHeader() returns too! So no, not really better. What we need is to keep the context alive until the write completes. It can be done like this:

void empty(
    shared_ptr<FutureBase> input,
    shared_ptr<void> ctx,
    shared_ptr<Promise<void>> result)
{}

void writeHeader(
    shared_ptr<FutureBase> input,
    shared_ptr<HeaderCtx> ctx,
    shared_ptr<WritePromise> result)
{
    // ... populate the buffer ...
    auto wres = write(ctx->buf, 512);
    wres->chain(result);
    wres->chain(empty, ctx);
}

or in a slightly different version:

void empty(
    shared_ptr<FutureBase> input,
    shared_ptr<void> ctx,
    shared_ptr<PromiseBase> result)
{
    input->chain(result);
}

The empty function does nothing, it's just a placeholder for the context to be kept alive in a chained promise until the write completes. Note that the same empty function can be used in all the places where this functionality is needed.
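The effect of the placeholder can be checked with a self-contained sketch. It doesn't use the futures library from the snippets above: the Completion class and the startWrite() function are hypothetical stand-ins, with a lambda capturing the context playing the role of the chained empty function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Minimal single-threaded completion signal; all names are hypothetical.
class Completion {
public:
    void onDone(std::function<void()> cb)
    {
        if (done_) cb(); else callbacks_.push_back(std::move(cb));
    }

    void complete()
    {
        assert(!done_);
        done_ = true;
        // Move the callbacks out, so that everything they hold gets
        // released as soon as they have run.
        auto cbs = std::move(callbacks_);
        callbacks_.clear();
        for (auto &cb : cbs) cb();
    }

private:
    bool done_ = false;
    std::vector<std::function<void()>> callbacks_;
};

struct HeaderCtx {
    char buf[512];
};

// Stand-in for an asynchronous write: it only hands back the completion
// that the "device" will signal later, and would use buf until then.
std::shared_ptr<Completion> startWrite(const char *buf, std::size_t len)
{
    (void)buf;
    (void)len;
    return std::make_shared<Completion>();
}

std::shared_ptr<Completion> writeHeader(std::shared_ptr<HeaderCtx> ctx)
{
    std::memset(ctx->buf, 0, sizeof(ctx->buf));  // ... populate the buffer ...
    auto wres = startWrite(ctx->buf, sizeof(ctx->buf));
    // The do-nothing callback owns a reference to ctx until wres completes.
    wres->onDone([ctx]() { (void)ctx; });
    return wres;
}
```

A std::weak_ptr pointing to the context can be used to observe this: it stays valid after writeHeader() returns and the caller drops its own reference, and expires only after complete() gets called on the returned object.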

Which brings us to the point that instead of writing custom snippets for everything, we might be able to compose a good deal of computation out of pre-defined functions.

One repeating example has been storing the result of a computation in a variable. It can be done as a reusable function that gets an address to store as its context (and that's one of the examples where the context would be better as just a pointer instead of a shared_ptr) and stores the value of a given type from its input future. Considering that a future has two separate meanings, returning the value and signaling the completion, we could even define a separate specialized kind of future that would store the value at a given address instead of keeping it internally.

Another obvious possible composition is in collecting the arguments of the asynchronous functions. It would make sense to be able to compute the arguments in parallel, then call the function. And it's not that hard to do. An asynchronous function in any case consists of multiple plain functions: "header function" and "continuation functions", with the context passed to the continuation functions being the stack frame of the asynchronous function, with the context allocated and needed arguments copied into the context by the header part. How about we make the function arguments into a structure and pass it as context to the header part of the asynchronous function? Which would now become not called directly but chained to the completion of the context. Which in turn would be driven by AllOf for completion of the computation of all the arguments (stored into the structure on completion as discussed above), and sometimes perhaps one more function, telling that the previous computation in the sequence has completed. Not every argument has to be computed asynchronously, they could be assigned synchronously, and then there just won't be a future for this argument to include into AllOf. To reduce the overhead, potentially the arguments structure can be passed not as a shared_ptr but as a plain pointer, owned by the calling function (as the arguments are on the stack for the plain functions) - then of course the calling function needs to make sure that the argument structure lives throughout the call, as has been shown above with the buffer.

Well, if you're using coroutines, the compiler would probably do all that for you, generating just enough of the small functions on the fly. If the coroutines don't work for you, the missing custom fragments can probably be filled in the modern C++ with lambdas. Lambdas can be combined with the macros too if you really want to.

One thing that can be said about large collections of small functions calling each other through a scheduler, is that they'll never be very efficient. Although they can be a little more efficient if instead of returning back they could be made to jump straight to the next function in the chain, such as if the entry address of the next function is pushed onto the stack instead of the return address. In fact, I've started writing this series because of something that I've read recently, about a virtual machine in an ancient DBMS that worked exactly like this, instead of returning from a function having an instruction (PC = (RCHAIN)+), so that a sequence of function addresses to call would be prepared in memory, RCHAIN initialized pointing to the start of it, and then calling this instruction to jump to the first function in the sequence.

P.S. I've read the description of the B language (http://cm.bell-labs.co/who/dmr/kbman.html), and it also worked like this, with each snippet ending with "jmp *(r3)+". This was called "threaded code". I also remember from reading almost 40 years ago that Forth also used the threaded code. Looks like it was a popular approach in the 1970s, and now it has a new use!
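
A rough illustration of the threaded code idea in portable C++, as a sketch only: the names Vm, Op and run_thread() are made up, and since plain C++ can't jump straight into the next function, a small loop does the "fetch the next address and advance" step on behalf of each snippet.

```cpp
#include <cassert>
#include <vector>

// Hypothetical miniature VM state: a data stack plus the pointer into a
// prepared sequence of function addresses (the role of RCHAIN or r3).
struct Vm;
using Op = void (*)(Vm &);

struct Vm {
    std::vector<int> stack;
    const Op *next = nullptr;
};

void push2(Vm &vm) { vm.stack.push_back(2); }
void push3(Vm &vm) { vm.stack.push_back(3); }
void add(Vm &vm)
{
    int b = vm.stack.back(); vm.stack.pop_back();
    int a = vm.stack.back(); vm.stack.pop_back();
    vm.stack.push_back(a + b);
}

// Runs a null-terminated sequence of operations. In real threaded code
// each snippet would end with the jump itself instead of returning here.
int run_thread(const Op *code)
{
    Vm vm;
    vm.next = code;
    while (*vm.next != nullptr) {
        Op op = *vm.next++; // fetch the address and advance the pointer
        op(vm);
    }
    assert(vm.stack.size() == 1);
    return vm.stack.back();
}

int run_example()
{
    static const Op code[] = {push2, push3, add, nullptr};
    return run_thread(code);
}
```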

Asynchronous programming 8 - loops and a semaphore

2025-02-09T17:57:00.003-05:00

The two basic varieties of loops are where you either execute everything sequentially or schedule all the iterations in parallel. Suppose a loop consists of three parts A, B, C, connected by futures, and suppose the loop has 100 iterations. In the sequential form it unrolls into:

-> A1 -> B1 -> C1 -> A2 -> B2 -> C2 -> ... -> A100 -> B100 -> C100 ->

In the parallel form it becomes

   /->A1 -> B1 -> C1 ---\
->          ...          ->
   \->A100->B100->C100 -/

An important thing to keep in mind for both forms is to prevent inlining at least at the start of each iteration. In the parallel form the inlining would kill the parallelism, and in either form it will likely overflow the stack by the recursive calls (300 of them in this example). This is a very good reason to always avoid the inlining, and the tail scheduling slot from the previous installment is a good solution there.

In the parallel form all the futures at the end of each iteration are joined into one (usually ignoring the return value and becoming a void future) with an AllOf primitive, where the final future gets completed only after all the inputs get completed. It can be easily implemented by counting the completed futures, with pseudo-code like:

class AllOf {
public:
AllOf(count):
ctx_(make_shared<Context>{count})
{}

// inputs - a container of input future references
template<typename Container>
AllOf(const Container &inputs):
   ctx_(make_shared<Context>{inputs.size()})
{
    for (const auto &inf : inputs) {
      addInput(inf);
    }
}

void addInput(shared_ptr<FutureBase> input)
{
    // Note that the result promise is ignored
    input->chain(advance, ctx_);
}

shared_ptr<Future<void>> getResult() const
{
    return ctx_->result_->to_future();
}

static void advance(
    shared_ptr<FutureBase> input,
    shared_ptr<Context> ctx,
    shared_ptr<Promise<void>> result /*ignored*/)
{
    if (atomic_decrease(ctx->count_) == 0) {
      ctx->result_->complete();
    }
}
protected:
struct Context : public ContextBase {
    atomic_int count_;
shared_ptr<Promise<void>> result_ = make_promise<void>();
};
shared_ptr<Context> ctx_;
};

// used like this:
AllOf res(100);
for (int i = 0; i < 100; i++) {
    res.addInput(schedule_iteration(ctx));
}
res.getResult()->chain(...)

Although fundamentally this doesn't have to be just a counter, it can as well run a map-reduce aggregation pattern, where the "map" part is done in parallel as loop iterations, and "reduce" combines the results under a mutex in a class similar to AllOf here, and after all the inputs are processed, completes the output promise.

Note that there is no need to even collect all the inputs in an array as some libraries do, the constructor from a container is just a convenience. Once they're properly chained together, only the count matters. And it's easy to make a variation of the implementation where the count would be accumulated dynamically as the inputs are connected, and then finalized (it's important to not send out the result before finalization).
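
The variation with the dynamically accumulated count can be sketched like this. It's a simplified stand-in and not the AllOf class from above: the name DynamicAllOf is made up, and plain callbacks are used instead of futures and promises. The trick assumed here is to start the count at 1, with that extra unit released by finalize(), so the result can't be sent out before the finalization.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for AllOf with a dynamically accumulated count.
class DynamicAllOf {
public:
    explicit DynamicAllOf(std::function<void()> on_done)
        : ctx_(std::make_shared<Context>(std::move(on_done)))
    {}

    // Registers one more input and returns the callback that the input
    // must call on its completion.
    std::function<void()> addInput()
    {
        ctx_->pending_.fetch_add(1);
        auto ctx = ctx_;
        return [ctx] { advance(*ctx); };
    }

    // Declares that no more inputs will be added, must be called once.
    void finalize() { advance(*ctx_); }

private:
    struct Context {
        explicit Context(std::function<void()> f) : on_done_(std::move(f)) {}
        // Starts at 1: the extra unit belongs to the pending finalize().
        std::atomic<int> pending_{1};
        std::function<void()> on_done_;
    };

    static void advance(Context &ctx)
    {
        if (ctx.pending_.fetch_sub(1) == 1) { // this was the last unit
            ctx.on_done_();
        }
    }

    std::shared_ptr<Context> ctx_;
};

// How many times the result was signaled at three points of the run.
struct Trace { int before_finalize, after_finalize, at_end; };

Trace run_example()
{
    int done = 0;
    DynamicAllOf all([&done] { ++done; });
    auto in1 = all.addInput();
    auto in2 = all.addInput();
    Trace t;
    in1(); // an input completes before the finalization
    t.before_finalize = done;
    all.finalize();
    t.after_finalize = done;
    in2(); // the last input completes
    t.at_end = done;
    return t;
}
```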

Now, what if you want to do something in between, a parallel loop with a limited parallelism? An executor with a limited number of threads won't work here because it limits the computational parallelism but here we want to limit the resource parallelism. Think for example of each iteration opening, processing, and closing a file descriptor, with a limited number of descriptors to be open at a time. Here we need a semaphore implemented with futures. A semaphore is kind of a mix of the AllOf pattern shown here and the asynchronous mutex pattern shown before. Of course, the easiest way is to just serialize the loop by partitioning it into K parts, each part fully serial. But that's not the same thing as the semaphore, where the next iteration gets started when any of the running iterations completes: a semaphore is better at balancing the load.

A path to implementing a semaphore would be to keep two lists, one of the completed futures, another one of waiting promises, one of the lists being non-empty at each time. If a new promise gets chained when there is a spare completed future on the list, they can be "annihilated" by using the future to complete the promise and forgetting both (but pre-rigging the chain of execution off that promise that will return a completed future to the semaphore at the end). When a completed future gets added and there is a promise waiting, they can also be "annihilated" in the same way. Otherwise the future or promise gets added to the appropriate list. If we start by adding all the promises first, and telling the semaphore the degree of parallelism K, then the semaphore can also signal the completion of work when it collects all K futures on their list, completing another special promise.
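
Here is a minimal sketch of that semaphore, with some liberties taken: the name AsyncSemaphore is made up, the list of spare completed futures is reduced to a plain count of tokens, and the waiting promises are represented by functions that start an iteration. The functions get called inline for brevity, a real implementation would rather schedule them.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical callback-based sketch of the semaphore.
class AsyncSemaphore {
public:
    explicit AsyncSemaphore(int k) : tokens_(k) {}

    // Starts the function right away if there is a spare token, otherwise
    // queues it until some running iteration releases its token.
    void acquire(std::function<void()> start)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (tokens_ == 0) {
                waiting_.push_back(std::move(start));
                return;
            }
            --tokens_; // the token and the request "annihilate"
        }
        start(); // called outside of the mutex
    }

    // Returns a token at the end of an iteration, handing it directly to
    // the oldest waiting request if there is one.
    void release()
    {
        std::function<void()> next;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (waiting_.empty()) {
                ++tokens_;
                return;
            }
            next = std::move(waiting_.front());
            waiting_.pop_front();
        }
        next();
    }

private:
    std::mutex mutex_;
    int tokens_;
    std::deque<std::function<void()>> waiting_;
};

// Simulates 5 iterations limited to 2 at a time; each one "opens a file"
// on start and "closes" it when the driver loop completes it.
// Returns the highest number of simultaneously open files.
int run_example()
{
    AsyncSemaphore sem(2);
    int open = 0, max_open = 0, started = 0;
    for (int i = 0; i < 5; i++) {
        sem.acquire([&] {
            ++started;
            ++open;
            if (open > max_open) max_open = open;
        });
    }
    // Each completion lets the next waiting iteration start.
    int finished = 0;
    while (finished < started) {
        --open;
        ++finished;
        sem.release();
    }
    assert(started == 5);
    return max_open;
}
```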

Finally, there is a pattern of retry loop for manual programming that allows reducing the number of small functions. Sometimes you need to prepare multiple resources before starting an operation:

if (resource_a == nullptr) {
resource_a = acquire_a(); // may sleep
}
if (resource_b == nullptr) {
resource_b = acquire_b(); // may sleep
}
operation(resource_a, resource_b);

An interesting property of this preparation is that it can be restarted from scratch with little cost. So instead of writing little fragment functions to do one check and one assignment per function, we can write one function and schedule it recursively after each acquisition, making a state machine to remember where we left off last time:

void startOperation(
shared_ptr<FutureBase> input,
shared_ptr<Context> ctx,
shared_ptr<Promise<Sometype>> result)
{
switch(ctx->state) {
case AFTER_A:
    ctx->resource_a =
      static_cast<Future<ResourceA> *>(input.get())->value();
    break;
case AFTER_B:
    ctx->resource_b =
      static_cast<Future<ResourceB> *>(input.get())->value();
    break;
}
ctx->state = INIT;

if (ctx->resource_a == nullptr) {
    ctx->state = AFTER_A;
    acquire_a()->chain(startOperation, ctx);
    return;
}
if (ctx->resource_b == nullptr) {
    ctx->state = AFTER_B;
    acquire_b()->chain(startOperation, ctx);
    return;
}
operation(ctx->resource_a, ctx->resource_b)->chain(result);
}

It doesn't look like the most efficient way of execution but surprisingly, if the resources are typically already acquired, it is, doing the whole check and initiating the operation in a single function. It's also a very efficient way to make the program more readable by humans, without a myriad of tiny functions.

P.S. See more about the loops in the installment on error handling.

Asynchronous programming 7 - caching

2025-02-08T10:21:00.004-05:00

Caching of values becomes easy with the futures because the futures take care of almost all the needed synchronization. The combination of futures and shared_ptr also transparently solves the issues with what to do with the ownership of an object discarded from cache if it's still used by a reader (that reader will have the last shared_ptr) and what to do if the cache is full of incomplete reads (add the entry to the cache anyway, so that if anyone else tries to read it in the meantime, they will wait for it and won't initiate another read, and then chain the reading from the completion of the previous read). The pseudo-code looks like this:

class Cache {
public:
shared_ptr<Future<Value>> read(Key k) {
    scopelock(mutex_);
    auto it = data_.find(k);
    if (it != data_.end()) {
      // bump the key in LRU order
      lru_.remove(times_[k]);
      times_[k] = ++generation_;
      lru_[generation_] = k;
      // note that the future might not be completed yet!
      // but that's OK, the caller will wait for it
      return it->second;
    }
    tryRemoveOldL();
    // the new key gets always inserted, even if the cache is overflowing
    times_[k] = ++generation_;
    lru_[generation_] = k;
    if (data_.size() < SIZE_LIMIT) {
      data_[k] = reading_ = readValue(k);
    } else {
      // wait for the last read to complete before reading this key
      auto ctx = make_shared<WaitCtx> {this, k};
      shared_ptr<Promise<Value>> result = make_promise<Value>();
      auto prev_reading = reading_;
      data_[k] = reading_ = result->to_future();
      prev_reading->chain(readDelayed, ctx)->chain(result);
    }
    return data_[k];
}

protected:
// context for waiting to read a key
struct WaitCtx {
    Cache *cache;
    Key key;
};

// If the cache is overflowing, tries to remove the oldest completed
// element, expects the object to be already locked.
// Returns true if either the cache wasn't overflowing or an element
// got successfully discarded from it.
bool tryRemoveOldL() {
    if (data_.size() >= SIZE_LIMIT) {
      for (it in lru_) {
        if (data_[it->second]->is_completed()) {
          Key kold = it->second;
          data_.remove(kold);
          lru_.remove(times_[kold]);
          times_.remove(kold);
          return true;
        }
      }
      return false;
    }
    return true;
}

shared_ptr<Future<Value>> readValue(Key k) {
    ...
}

static void readDelayed(
    shared_ptr<Future<Value>> input,
    shared_ptr<WaitCtx> ctx,
    shared_ptr<Promise<Value>> result)
{
    scopelock(ctx->cache->mutex_);
    if (ctx->cache->tryRemoveOldL()) {
      ctx->cache->readValue(ctx->key)->chain(result);
    } else {
      // oops, at least our input element should have completed,
      // this means it got stolen from us in a race between its
      // completion and this function running, need to delay again
      auto prev_reading = ctx->cache->reading_;
      ctx->cache->reading_ = result->to_future();
      prev_reading->chain(readDelayed, ctx)->chain(result);
    }
}

Mutex mutex_;
map<Key, shared_ptr<Future<Value>>> data_;
map<Key, int64_t> times_;
map<int64_t, Key> lru_;
int64_t generation_ = 0;
shared_ptr<Future<Value>> reading_;
};

This code is much simpler than the example I've shown in "The Practice of Parallel Programming", so finally the asynchronous programming is good for something!

However the scheduling of the asynchronous functions doesn't mesh well with the object methods, those have to be made into the static methods (perhaps there is some better solution?).

There is also a special case of caching, the lazy reading of exactly one object. That requires only a mutex and a future. Check under mutex if the future reference is null, and if so then initiate the read and remember the result future from it. Otherwise just return the future, and it doesn't matter at this point if the future is completed or not, it will be the caller's responsibility to wait for it.
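
The lazy reading can be sketched as follows. The Future class here is a made-up minimal one, without any locking inside and good only for a single-threaded demonstration, and LazyValue with its Reader are likewise hypothetical names, not a part of any real library.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical minimal future: a value and the functions chained to it.
template <typename T>
class Future {
public:
    void chain(std::function<void(const T &)> fn)
    {
        if (completed_) {
            fn(value_); // already completed: run right away
        } else {
            chained_.push_back(std::move(fn));
        }
    }
    void complete(T value)
    {
        value_ = std::move(value);
        completed_ = true;
        for (auto &fn : chained_) fn(value_);
        chained_.clear();
    }
private:
    bool completed_ = false;
    T value_{};
    std::vector<std::function<void(const T &)>> chained_;
};

// Lazily reads exactly one object: the first caller initiates the read,
// everyone gets the same future, completed or not.
template <typename T>
class LazyValue {
public:
    using Reader = std::function<std::shared_ptr<Future<T>>()>;
    explicit LazyValue(Reader reader) : reader_(std::move(reader)) {}

    std::shared_ptr<Future<T>> get()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!future_) {
            future_ = reader_(); // initiate the read only once
        }
        return future_;
    }
private:
    std::mutex mutex_;
    Reader reader_;
    std::shared_ptr<Future<T>> future_;
};

struct Result { int reads; int sum; };

Result run_example()
{
    int reads = 0;
    std::shared_ptr<Future<int>> pending; // completed later by the "I/O"
    LazyValue<int> lazy([&] {
        ++reads;
        pending = std::make_shared<Future<int>>();
        return pending;
    });
    int sum = 0;
    lazy.get()->chain([&](const int &v) { sum += v; }); // starts the read
    lazy.get()->chain([&](const int &v) { sum += v; }); // waits for the same read
    pending->complete(21);                               // the read finishes
    lazy.get()->chain([&](const int &v) { sum += v; }); // already completed
    return Result{reads, sum};
}
```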

Asynchronous programming 6 - inlining done right

2025-02-08T09:10:00.000-05:00

To recap, "inlining" is when we complete a future that has some function chained to it (directly or through other futures), and that function gets immediately executed from the completion library call. The opposite is "trampolining" where this function gets put into a scheduler queue and executed later.

Inlining allows to save on the cost of scheduling, and also keeps the cache hot: completing a future means that we've just put some value into it, and so reading that value (and other values it's connected to) immediately means that it will still be in the cache.

However inlining can also be self-defeating: suppose we want to complete ten futures, each with a function chained to it. If trampolined, ten CPUs can pick them from the scheduler queue and execute in parallel. But inlining would inadvertently cause them to be executed sequentially.

The reality is that inlining is only efficient when it's done at the tail of the current function. On the other hand, the issues with inlining (stack overflows and bad interactions with mutexes and serializing the parallel execution) can be avoided if the inlined function was called only after the current function returns.

Put this way, the straightforward solution is to replace inlining with a special case of trampolining via a "delayed scheduling slot": have a special thread-local variable in the scheduler, sufficient to hold a reference to a single scheduled function. Whenever a future is completed, put one chained function there and schedule the rest as usual. If the delayed slot is already used, then it can be either left as-is and all the new functions scheduled as usual, or in the hope that the later completions have a hotter cache, move the old contents of the delayed slot into the normal scheduling queue and put the new function there. Then when the current asynchronous function is completed, have the scheduler code check the delay slot, and if not empty, call the function from there.

This can be expressed in pseudocode:

thread_local<Function> delaySlot;

complete_future(Future fut)
{
FunctionList funcs;
FutureList recfut;

recfut.insert(fut);

for (f in recfut) {
    scopelock(f);

    if (f.completed)
      continue; // a double completion, nothing to do
    f.completed = true;

    funcs.merge(f.chained_functions);
    f.chained_functions.clear();

    recfut.merge(f.chained_futures);
    f.chained_futures.clear();
}
if (delaySlot.empty() && funcs.size() == 1) {
    delaySlot = funcs.front();
} else if (!funcs.empty()) {
    scopelock(schedulerQueue);
    if (!delaySlot.empty()) {
      schedulerQueue.insert(delaySlot);
    }
    delaySlot = funcs.front();
    funcs.pop_front();
    for (fun in funcs) {
      schedulerQueue.insert(fun);
    }
}
}

scheduler_thread()
{
while (true) {
    Function f;
    if (!delaySlot.empty()) {
      f = delaySlot;
      delaySlot.clear();
    } else {
      f = getNextFromQueue();
    }
    execute(f);
}
}

Another interesting point is that cache locality gets improved by unfair scheduling, inserting the new functions at the front of the scheduling queue, with the premise that the more recent inserts will have a hotter cache. It's not exactly unfair either: Remember that in asynchronous programming the sequential execution gets broken up into a sequence of separate small functions. And so the most recently scheduled function is likely the continuation of the previous function, and running it first is completely fair, with the scheduling queue becoming a representation of the call stack, the innermost "return addresses" being at the front of the queue.
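
The effect of the insertion order is easy to see on a toy single-threaded scheduler. This is only an illustration with made-up names (Scheduler, run_example), not the Triceps code: two logical threads A and B are each split into three small functions, every function scheduling its own continuation.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical single-threaded scheduler; `lifo` selects the insertion at
// the front of the queue (unfair) instead of at the back (fair).
class Scheduler {
public:
    explicit Scheduler(bool lifo) : lifo_(lifo) {}
    void schedule(std::function<void()> fn)
    {
        if (lifo_) queue_.push_front(std::move(fn));
        else queue_.push_back(std::move(fn));
    }
    void run()
    {
        while (!queue_.empty()) {
            auto fn = std::move(queue_.front());
            queue_.pop_front();
            fn();
        }
    }
private:
    bool lifo_;
    std::deque<std::function<void()>> queue_;
};

// Returns the order of execution of the fragments.
std::string run_example(bool lifo)
{
    Scheduler sched(lifo);
    std::string trace;
    std::function<void(char, int)> step = [&](char name, int n) {
        trace += name;
        trace += static_cast<char>('0' + n);
        trace += ' ';
        // schedule the continuation of the same logical thread
        if (n < 3) sched.schedule([&step, name, n] { step(name, n + 1); });
    };
    sched.schedule([&] { step('A', 1); });
    sched.schedule([&] { step('B', 1); });
    sched.run();
    return trace;
}
```

With the insertion at the front each logical thread runs to the end before the other one starts ("B1 B2 B3 A1 A2 A3"), while the insertion at the back interleaves them ("A1 B1 A2 B2 A3 B3").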

This is very similar to Triceps's scheduling logic, following from the same reasoning. Or to put it differently, this is the reason why Triceps's scheduling logic should also be used for the asynchronous programming in general.

Asynchronous programming 5 - critical sections the right way

2025-02-08T08:10:00.001-05:00

Here is how to do the critical sections (i.e. sleeplock mutex analog) properly in the asynchronous way. Consider that the chains of futures are always computed sequentially, one before another. So that's what we need to do to have the critical sections computed sequentially: arrange them into a dynamically built chain. Note that there can be more than one action chained to the same future, and that's what gets used for synchronization: the same future at the end of critical section gets two actions chained to it. One is used to return the result and continue computation in the usual way, another one only looks for completion and starts the next thread's critical section.

Suppose we have pseudo-code like this:

funcA() {
...
// this future completes when the code can enter the
// critical section
shared_ptr<Future<void>> futB = waitSyncB();
auto futX = futB->chain(funcCritC, ctx);
return futX;
}

// this function represents the critical section for serialization
// that may require sleep
funcCritC(
shared_ptr<Future<void>> input,
shared_ptr<SomeCtx> ctx,
shared_ptr<Promise<Sometype>> result)
{
...
auto futD = funcSleepD();
futD->chain(result);
}

The execution of this code can be described as a chain:

funcA -> waitSyncB -> futB -> funcCritC -> funcSleepD ->futD -> futX -> ...

With the critical section in it being the part

futB -> funcCritC -> funcSleepD -> futD -> futX

that may have a sleep in the middle (and funcSleepD also being a part of the critical section). This chain can execute only when futB allows it, telling that the critical section became free. futB has no value, it's used purely for synchronization. So if we have three threads executing this code (let's mark them with suffixes 1, 2, 3), they should be arranged in a chain

funcCritC1 -> funcSleepD1 ->futD1 ->futX1 -> futB2 -> funcCritC2 -> funcSleepD2 ->futD2 ->futX2 -> futB3 -> funcCritC3 -> funcSleepD3 ->futD3 -> futX3 ->

Note that the instances of futD here grow an extra connection. In addition to being connected to futX in the original chain, they become connected to funcCritC of the next thread.

... futB1 -> funcCritC1 -> funcSleepD1 ->futD1 -> futX1 -> ...
                                            |
                                            V
                                      futB2 -> funcCritC2 -> funcSleepD2 ->futD2 -> futX2

This basically means that the function waitSyncB() should create the connection FutX(N) -> FutB(N+1). And no serial executors are needed, this chain will execute sequentially on any executor!

Here is the pseudo-code of an implementation:

class Serial {
public:
Serial()
    : tail_(make_shared<Future<void>>())
{
    // the critical section starts free
    tail_->complete();
}

void serialize(
    shared_ptr<Future<void>> &head,
    shared_ptr<Promise<void>> &newtail)
{
    newtail = make_shared<Promise<void>>();
    head = newtail->to_future();
    atomic_swap(tail_, head);
}

protected:
shared_ptr<Future<void>> tail_;
};

Then the code of funcA() becomes:

Serial asyncMutex;

funcA() {
...
// this future completes when the code can enter the
// critical section
shared_ptr<Future<void>> futB;
// we need to complete this future when exiting the
// critical section
shared_ptr<Promise<void>> futB_end;
asyncMutex.serialize(futB, futB_end);
auto futX = futB->chain_trampolined(funcCritC, ctx);
futX->chain(futB_end);
return futX;
}

There is a bit of general ugliness of the asynchronous programming where we need to know the result future before creating the chain that produces it (very similar to the issue with Triceps Labels) but otherwise it's straightforward and simple to use.

The chaining with chain_trampolined() is used here to request the next function to be run through a scheduler queue to avoid the situation when a long chain of waiting threads gets built up while one thread is waiting inside the critical section, and then everything in that chain tries to execute inlined (by direct calls rather than scheduling in the executor) and runs out of stack. In fact, this is such a typical and thorny issue that probably everything should be trampolined by default, and inlining (direct calls) done only when explicitly requested. But the creators of the asynchronous libraries tend to have an opposite opinion.

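Another thing to note is that atomic_swap() here is pseudo-code. The closest thing in the standard library is the atomic exchange on shared pointers: the std::atomic_exchange() overloads for shared_ptr (since C++11, deprecated in C++20) and std::atomic<std::shared_ptr<T>>::exchange() (since C++20), both of which are typically implemented with a lock inside. If these are not available or not good enough, you can make one by either using a traditional mutex as a spinlock, with regular swap under it, or define your own version of shared_ptr just for the futures that does have an atomic swap. It's such a convenient operation that making your own shared_ptr is worth it.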
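
For illustration, here is a minimal sketch of the variant with a regular swap under a mutex. To keep it short and self-contained it uses a made-up class Event that serves as both the void future and its promise, with no locking inside, so this is a single-threaded demonstration of the chaining order and not a production implementation.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal void future and promise in one object.
class Event {
public:
    void chain(std::function<void()> fn)
    {
        if (completed_) fn();
        else chained_.push_back(std::move(fn));
    }
    void complete()
    {
        completed_ = true;
        auto fns = std::move(chained_);
        chained_.clear();
        for (auto &fn : fns) fn();
    }
private:
    bool completed_ = false;
    std::vector<std::function<void()>> chained_;
};

class Serial {
public:
    Serial() : tail_(std::make_shared<Event>())
    {
        tail_->complete(); // the critical section starts free
    }

    // head: completes when the caller may enter the critical section;
    // newtail: the caller must complete it when leaving the section.
    void serialize(std::shared_ptr<Event> &head, std::shared_ptr<Event> &newtail)
    {
        newtail = std::make_shared<Event>();
        head = newtail;
        // a regular swap under a short-held mutex instead of an atomic swap
        std::lock_guard<std::mutex> lock(mutex_);
        std::swap(tail_, head);
    }
private:
    std::mutex mutex_;
    std::shared_ptr<Event> tail_;
};

// Three logical threads enter the critical section; returns the trace of
// the order in which their sections ran.
std::string run_example()
{
    Serial async_mutex;
    std::string trace;
    std::vector<std::shared_ptr<Event>> ends;
    for (char name : {'1', '2', '3'}) {
        std::shared_ptr<Event> begin, end;
        async_mutex.serialize(begin, end);
        ends.push_back(end);
        begin->chain([&trace, name] { trace += name; });
    }
    trace += '|';            // only the first section has been entered so far
    for (auto &end : ends) { // each section exits in turn, freeing the next
        end->complete();
    }
    return trace;
}
```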

Asynchronous programming 4 - a look under the carpet

2025-02-05T09:52:00.005-05:00

In this part I want to go over some simplifications I've made in the first part, mostly because some things are wrong and should never be used. Here I want to talk about them, and also about the solutions for the same problems that should be used instead.

Back then I've said that there is no await() in asynchronous programming, but usually there is. Just that it should never be used because it leads to very bad deadlocks. In particular, people tend to use it in combination with the serial executor as a way of locking, to run some blocking code without releasing the executor thread. If some of the called code wants to run on the same executor (essentially, doing a recursive lock), the code will wait on its queue forever and thus deadlock. It's not really await()'s problem, the same would happen with any recursive lock including the patterns that I'll show soon, but people are less aware of the issue with await() and proudly feel like they've "cheated the system".

And I've already mentioned that there is no good reason to use the serial executor at all, there are better patterns. These better patterns rely on something that I haven't mentioned yet: the mutexes. Which are available with asynchronous programming but need to be treated somewhat differently than in common programs. In asynchronous programming they need to be treated as spinlocks, to protect a short and quick chunk of code. Sometimes they can even be replaced by the lock-free code (finally a good use for it!).

Instead the futures should be used as the mechanism for the long waits. A future has a nice property that avoids the race conditions: it might get completed after a function got chained to it, or before a function got chained to it, the function will run in either case when the future gets completed. So as I showed before, sometimes we don't even care about the value returned in a future but care about the fact of its completion to let some other code run. But a thing to keep in mind is that if a future is already completed before we chain a function to it, the function will usually run immediately on chaining, although there is no guarantee of that. This leads to the difficult to diagnose errors when some function assumes that it has an exclusive access to some data while it assembles a chain of futures and functions but in reality the first future in the chain sometimes completes on another CPU before the whole chain is assembled. Then the functions in the chain start running on the other CPUs, and our function in question at some point ends up with chaining another function that accesses the same data to an already completed future, and that another function gets called immediately, accessing the same data.

This mostly happens with the serial executors, when both the current function and the chained one rely on the same serial executor for serialization (another reason to never use the serial executors). The executor gets specified in the chaining arguments, but since it's the same executor as the currently running one, the chaining thinks that it's fine to call directly. But it can also happen on any executor, while using mutexes in a slightly easier to diagnose pattern, where one function assembles a chain under mutex, and one of the functions in the chain tries to lock the same mutex, which becomes a recursive lock, and everything deadlocks.

Hence the rule: either do no chaining under a locked mutex, or if you really need to, make sure that the first future in the chain won't get completed until after you unlock the mutex. In the last case you'd usually start with creating a promise, then build a chain on its future side, and finally after unlocking the mutex you'd chain that first promise to some future that might have been completed.
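
This rule can be illustrated with a small sketch. The class Event is a made-up minimal void future and promise in one object, single-threaded and with inlining on chaining, and the flag `locked` exists only to detect in the demonstration whether the chained function got called while the mutex was held (where a real program would deadlock).

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical minimal void future and promise in one object.
class Event {
public:
    void chain(std::function<void()> fn)
    {
        if (completed_) fn(); // inlined right from chain()
        else chained_.push_back(std::move(fn));
    }
    void complete()
    {
        completed_ = true;
        auto fns = std::move(chained_);
        chained_.clear();
        for (auto &fn : fns) fn();
    }
private:
    bool completed_ = false;
    std::vector<std::function<void()>> chained_;
};

struct Counter {
    std::mutex mutex;
    bool locked = false; // mirrors the mutex state, for the demonstration only
    int value = 0;
};

// Builds a chain under the mutex without letting it run there.
// Returns the final counter value, or -1 if the chained function got
// called while the mutex was still held.
int run_example()
{
    Counter cnt;
    bool ran_under_lock = false;

    auto maybe_completed = std::make_shared<Event>();
    maybe_completed->complete(); // e.g. completed early by another CPU

    auto first = std::make_shared<Event>(); // our own promise, not completed
    {
        std::lock_guard<std::mutex> lock(cnt.mutex);
        cnt.locked = true;
        // The chained function locks the same mutex, so it must not get
        // inlined here; chaining to our own promise guarantees that.
        first->chain([&cnt, &ran_under_lock] {
            if (cnt.locked) {
                ran_under_lock = true; // a real program would deadlock here
                return;
            }
            std::lock_guard<std::mutex> inner(cnt.mutex);
            ++cnt.value;
        });
        cnt.locked = false;
    }
    // After unlocking, connect the chain to the possibly completed future.
    maybe_completed->chain([first] { first->complete(); });
    return ran_under_lock ? -1 : cnt.value;
}
```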

Another thing that I didn't mention is that usually the executors have a direct way to schedule a function to be executed on them. The trouble is that the signature of that function is usually different than a function that gets chained to a future, because with direct scheduling there are no future and no result promise arguments to the function. So if you need a function used both ways, you can't, because the signatures are different. In this situation, instead of the direct scheduling, you can use chaining on a future that is either pre-completed or gets completed right after chaining. However with plain chaining it will cause the function to be called right there (and this is known as "inlined" as opposed to scheduling on an executor which is known as "trampolining"). So you'd have to use the kind of chaining that allows to explicitly disable the inlining. Or if this option is not available in your asynchronous library, then there is no other choice than to do an explicit scheduling.

Disabling the immediate inlined execution on chaining also resolves the other potential issues mentioned above (at the cost of additional overhead of scheduling). Or if it's not available, a chain can be made to run through an explicit scheduling with pseudo-code like this (pseudo, since it plays a bit loose with the types):

// it didn't occur to me before but the contexts do have to have
// a virtual base class for their destruction to work
// correctly in shared_ptr
struct SchedBarrierCtx : public ContextBase {
AsyncFunctionBase func;
shared_ptr<FutureBase> input;
shared_ptr<ContextBase> ctx;
shared_ptr<Scheduler> sched;
shared_ptr<PromiseBase> output;
};

template <typename Tin, typename Tout, typename Tctx>
shared_ptr<Future<Tout>> sched_barrier(
shared_ptr<Future<Tin>> input,
AsyncFunction<Tin, Tout, Tctx> func,
shared_ptr<Tctx> ctx,
shared_ptr<Scheduler> sched)
{
auto barrier_ctx = make_shared<SchedBarrierCtx> {
func, input, ctx, sched, /*output*/ nullptr};
// no need to specify the executor for chain(),
// because barrier_trampoline1() will do that anyway,
// and it's cheaper to inline it on any current executor
return input->chain(barrier_trampoline1, barrier_ctx);
}

void barrier_trampoline1(
shared_ptr<FutureBase> input,
shared_ptr<SchedBarrierCtx> ctx,
shared_ptr<PromiseBase> result)
{
ctx->output = result;
ctx->sched->schedule(barrier_trampoline2, ctx);
}

void barrier_trampoline2(shared_ptr<SchedBarrierCtx> ctx)
{
ctx->func(ctx->input, ctx->ctx, ctx->output);
}

The arguments for the chained function get passed through scheduling by saving them in the barrier context.

Note that, barrier or not, the scheduled function can still complete before chain() returns! It's not very probable, because it requires another CPU to pick the scheduled work and complete it while the current CPU gets delayed by something else (perhaps an interrupt handler in the kernel), or for the kernel scheduler to do something unusual, but it's possible. The only thing guaranteed here is that the chained function will run in another kernel thread, and so if that kernel thread blocks, the one that called the chaining can still continue.

Asynchronous programming 3 - some assistance

2025-02-02T16:42:00.017-05:00

I was saying it 20 years ago, and 15 years ago in the TPOPP book, and I'm still saying it now: the asynchronous programming has to be assisted by a compiler, otherwise it's just a huge pain of doing manually things that a compiler normally does. Fortunately, I think now we have an out-of-the-box solution: the C++ coroutines in C++20, as described for example here: https://en.cppreference.com/w/cpp/language/coroutines . I haven't quite tried to do an actual implementation with them but it looks like the right thing. You define your Promise class (note that coroutines don't differentiate between the Future and Promise sides and call everything a Promise), and then the coroutine statements take that Promise class as a template argument and arrange the splitting of the sequential code into fragments. And you do the explicit parallelism on your own.

Another solution that I played with, doing a partial implementation, would work with plain C too: a preprocessor. It can be done in some smart way, as a whole pre-parser like cfront of yore, or a lot can be achieved even with the standard C preprocessor. The only trick is to generate the unique function names, and this can be done by using the macro __LINE__ (pasted through an extra level of macros, so that it gets expanded into the number before pasting). Since the line number stays the same within a macro invocation, each invocation gets a unique number that can be used repeatedly within the macro body. In modern C++, of course, we could also use the lambdas, making the naming issue moot, it's more of a plain C issue.
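
The name generation trick on its own can be sketched like this. The macro names (CONCAT, UNIQUE_NAME, DEFINE_FRAGMENT) and the little registration table are made up for the illustration; the point is only that the same line number gets used twice within one macro invocation and differs between the invocations.

```cpp
#include <cassert>

// Two-level concatenation: the extra level makes the preprocessor expand
// __LINE__ into a number before pasting. A direct prefix##__LINE__ would
// paste the literal token __LINE__ instead.
#define CONCAT_RAW(a, b) a##b
#define CONCAT(a, b) CONCAT_RAW(a, b)
#define UNIQUE_NAME(prefix) CONCAT(prefix, __LINE__)

// Hypothetical table where the generated functions register themselves,
// since their names are not known to the rest of the code.
using Fragment = int (*)();
static Fragment fragments[8];
static int fragment_count = 0;

struct Registrar {
    explicit Registrar(Fragment f) { fragments[fragment_count++] = f; }
};

// Defines a uniquely named function and registers it; both uses of
// UNIQUE_NAME within one invocation see the same line number.
#define DEFINE_FRAGMENT(prefix, value) \
    static int UNIQUE_NAME(prefix)() { return (value); } \
    static Registrar UNIQUE_NAME(prefix##_reg)(UNIQUE_NAME(prefix));

// Each invocation must be on its own line to get its own number.
DEFINE_FRAGMENT(cont, 10)
DEFINE_FRAGMENT(cont, 20)

// Calls all the registered fragments and adds up their results.
int run_example()
{
    int sum = 0;
    for (int i = 0; i < fragment_count; i++) sum += fragments[i]();
    return sum;
}
```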

The most difficult part is that we'll need to use the same call and return macros in both the "header" part of the function and the "continuation" part. Which means that all the functions have to have the same result type, and return the value in the same way. So let's take the example from the last post and reformat it to fit into this approach. The original example from the previous installment was:

struct FuncContext {
int a;
};

shared_ptr<Future<int>> func()
{
auto ctx = make_shared<FuncContext>();
shared_ptr<Future<int>> futa = get_a();
return futa->chain(func2, ctx) // use default executor
->chain(func3, ctx);
}

void func2(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
ctx->a = arg->value();
get_b(ctx->a)->chain(result);
}

void func3(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
int b = arg->value();
result->return_value(ctx->a + b);
}

To get the same return type throughout we change the "header" part to return void and pass the returned future back via an argument.

The other problem is the type of that return promise's value: carrying it through all the "continuation" parts is difficult, so we'd have to revert to the base promise type that doesn't care about the return value and cast it only when setting the value. This base type has to exist for the scheduler to juggle all these promises in its queues. Also, remember, the premise here is that coroutines are not available, which would often mean plain C, and there the promises can't be templatized in the first place.

The code becomes:

struct FuncContext {
int a;
};

void func(shared_ptr<Future<int>>* result_future)
{
auto ctx = make_shared<FuncContext>();
auto result = make_shared<Promise<int>>();
*result_future = result->to_future();
shared_ptr<Future<int>> fut_cont;
get_a(&fut_cont);
fut_cont->chain(func2, ctx)->chain(result);
}

void func2(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<PromiseBase> result) {
ctx->a = arg->value();
shared_ptr<Future<int>> fut_cont;
get_b(ctx->a, &fut_cont);
fut_cont->chain(func3, ctx)->chain(result);
}

void func3(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<PromiseBase> result) {
int b = arg->value();
static_cast<Promise<int>*>(result.get())->return_value(ctx->a + b);
}

Then we want to make it look like this:

ASYNC_FUNC_0ARG(func, int, {
int a; // this is the context
}) {
ASYNC_CALL_0ARG(func, ctx->a, int, get_a);
ASYNC_CALL_1ARG(func, int b, int, get_b, ctx->a);
ASYNC_FUNC_RETURN(int, ctx->a + b);
} ASYNC_FUNC_END

Here for simplicity I've just used separate macros for definitions and calls of functions with different number of arguments. It's definitely possible to use the macros with variable number of arguments, just it's not something that I use often and I'm too lazy to look it up now. The invocation of ASYNC_FUNC_END is needed to balance out the curly braces. The name of the calling function is needed in the CALL macros to refer to the context type name, this unfortunately can't be avoided, and then incidentally it can be used to generate the names of continuation functions. Alternatively, we could define the function name as a macro before the function definition and undef it afterwards, then everything in between could just use that macro for function name.

There is a bit of ugliness but still, looks much shorter and simpler than before, doesn't it? Now all we do is to define the macros that will translate one into another by copy-pasting from the long example (I haven't actually tried these macros right now, so they might contain small bugs but it shows the idea, and I did get a similar system working in the past):

#define ASYNC_FUNC_0ARG(fname, func_return_type, context_body) \
struct fname##Context context_body; \
void fname(shared_ptr<Future<func_return_type>>* result_future) \
{ \
using return_type = func_return_type; \
auto ctx = make_shared<fname##Context>(); \
auto result = make_shared<Promise<return_type>>(); \
*result_future = result->to_future();

#define ASYNC_FUNC_END }

// A plain fname##__LINE__ would paste the literal token __LINE__,
// so the pasting goes through two levels to get the line number expanded first.
#define ASYNC_PASTE2(a, b) a##b
#define ASYNC_PASTE(a, b) ASYNC_PASTE2(a, b)
#define ASYNC_CONT(fname) ASYNC_PASTE(fname, __LINE__)

#define ASYNC_CALL_0ARG(fname, assign_to, call_return_type, call) \
    shared_ptr<Future<call_return_type>> fut_cont; \
    call(&fut_cont); \
    static void ASYNC_CONT(fname)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result); \
    fut_cont->chain(ASYNC_CONT(fname), ctx)->chain(result); \
} \
} \
static void ASYNC_CONT(fname)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result) { \
assign_to = arg->value(); \
{

#define ASYNC_CALL_1ARG(fname, assign_to, call_return_type, call, call_arg1) \
    shared_ptr<Future<call_return_type>> fut_cont; \
    call(call_arg1, &fut_cont); \
    static void ASYNC_CONT(fname)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result); \
    fut_cont->chain(ASYNC_CONT(fname), ctx)->chain(result); \
} \
} \
static void ASYNC_CONT(fname)(shared_ptr<fname##Context> ctx, shared_ptr<Future<call_return_type>> arg, shared_ptr<PromiseBase> result) { \
assign_to = arg->value(); \
{

#define ASYNC_FUNC_RETURN(return_type, expr) \
static_cast<Promise<return_type>*>(result.get())->return_value(expr)

There are a couple more things to explain in the ASYNC_CALL macros. One is that they have to declare the continuation function before using it, this is something that I've glossed over before, because if you write these continuation functions manually, you'd collect all the declarations up front. But if they're generated on the fly, the declarations also have to come on the fly. These functions can be static because they're not called from outside the file. The second thing is that the current function gets closed with two curly braces, and the next one gets opened with two curly braces. This is because ASYNC_FUNC_0ARG opens the function with a curly brace for the generated definitions, and then another brace comes after the macro, and then we need to maintain the same brace depth throughout.

Note that the execution of the asynchronous functions here is strictly sequential, no ifs nor loops. However similar macros can be made for ifs and loops, and if I ever get around to transforming this text into a chapter for a newer version of my book on parallel programming, I'll do them too. They'd be ugly but still better than writing things manually. And a specialized preprocessor like cfront can reduce the ugliness of having to repeat the names that can't be remembered between the C preprocessor macros and to explicitly specify the level of nesting for the ifs and loops.

Asynchronous programming 2 - filling in the types

2025-01-25T17:10:00.003-05:00

For the sake of a quick introduction, I've glossed over some things in part 1. Here I want to come back and show them.

Let's start with a small code snippet similar to what was shown in part 1:

func(context, arg)
{
...
}

Future fut1;
fut1.chain(func, context, executor);

Note that there are no types in this snippet, I've dropped them to avoid getting mired in them. Let's fill them in, and the answer might vary depending on the specific asynchronous library.

Let's start with the function argument arg. Note that it's not explicitly mentioned anywhere in chain(). That's because the argument comes from the future fut1, it's the value that becomes stored in it. So if, suppose, the type of fut1 is actually Future<int>, the argument might actually be

int arg

but the more typical solution is to pass the whole input future as an argument:

Future<int> arg

Except that normally the futures wouldn't be copied but passed by reference. And considering in how many places they get referred to, the only reasonable way is to use either reference counting or garbage collection. Reference counting is more natural for C++ and C, so the type would become:

shared_ptr<Future<int>> arg

Next, what is the function's return value? Being an asynchronous function, its return value must be returned through a Promise. Moreover, that Promise's Future side needs to be returned at the time of the chaining, so the chaining becomes (assuming that the returned value is of type double):

shared_ptr<Future<int>> fut1;
shared_ptr<Future<double>> fut2 = fut1->chain(func, context, executor);

But how will the function know where to return that value? It has to receive that result Promise as an argument too:

void func(context, shared_ptr<Future<int>> arg, shared_ptr<Promise<double>> result);
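To make these types concrete, here is a minimal sketch of how they could fit together. It's not any particular library: the Executor, Future and Promise classes below are made up for the illustration, single-threaded, with the executor being just a queue that gets drained by an explicit run() call, and the Context struct is a hypothetical stand-in for whatever the function needs.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using std::make_shared;
using std::shared_ptr;

// Hypothetical executor: a FIFO queue drained by an explicit run() call.
class Executor {
public:
    void post(std::function<void()> job) { queue_.push_back(std::move(job)); }
    void run() {
        while (!queue_.empty()) {
            auto job = std::move(queue_.front());
            queue_.pop_front();
            job();
        }
    }
private:
    std::deque<std::function<void()>> queue_;
};

template <typename T> class Promise;

template <typename T>
class Future : public std::enable_shared_from_this<Future<T>> {
public:
    const T& value() const { return value_; }
    bool completed() const { return done_; }

    // fn receives the context, this (input) future, and the promise for its result.
    // The returned future is the receiving side of that result promise.
    template <typename Ctx, typename R>
    shared_ptr<Future<R>> chain(
        void (*fn)(shared_ptr<Ctx>, shared_ptr<Future<T>>, shared_ptr<Promise<R>>),
        shared_ptr<Ctx> ctx, shared_ptr<Executor> exec)
    {
        auto result = make_shared<Promise<R>>();
        auto self = this->shared_from_this();
        auto job = [fn, ctx, self, result] { fn(ctx, self, result); };
        if (done_) exec->post(job);
        else waiting_.push_back([exec, job] { exec->post(job); });
        return result->to_future();
    }

private:
    friend class Promise<T>;
    void complete(T v) {
        value_ = std::move(v);
        done_ = true;
        auto ready = std::move(waiting_);
        waiting_.clear();
        for (auto& w : ready) w();  // schedule everything chained to this future
    }
    T value_{};
    bool done_ = false;
    std::vector<std::function<void()>> waiting_;
};

template <typename T>
class Promise {
public:
    shared_ptr<Future<T>> to_future() const { return fut_; }
    void return_value(T v) { fut_->complete(std::move(v)); }
private:
    shared_ptr<Future<T>> fut_ = make_shared<Future<T>>();
};

struct Context { double scale; };

// The asynchronous function: int in (through arg), double out (through result).
void func(shared_ptr<Context> ctx, shared_ptr<Future<int>> arg,
          shared_ptr<Promise<double>> result)
{
    result->return_value(arg->value() * ctx->scale);
}

double demo_typed_chain() {
    auto executor = make_shared<Executor>();
    auto context = make_shared<Context>(Context{0.5});
    auto prom1 = make_shared<Promise<int>>();
    shared_ptr<Future<int>> fut1 = prom1->to_future();
    shared_ptr<Future<double>> fut2 = fut1->chain(func, context, executor);

    prom1->return_value(42);     // completes fut1, which schedules func
    assert(!fut2->completed());  // func hasn't run yet, it's only queued
    executor->run();             // now func runs and completes fut2
    return fut2->value();
}
```

Here demo_typed_chain() returns 21.0. Note how the result type of chain() is deduced from the type of the promise in the function's last argument.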

Since the result is returned through an argument, the normal function's return type becomes void. It's the responsibility of the function to make sure that the result promise will be completed, however this doesn't have to happen by function's return time. Instead it can schedule some other code that will complete this promise later (for example, by chaining it from some other future that it creates). Which, yes, is another potential source of errors when the promise completion gets forgotten in one of the branches of execution and the rest of logic gets deadlocked waiting for it. The way to debug this is to have the library keep track of the futures that have some dependency chained to them but haven't been completed and haven't been scheduled to run and haven't been chained to something else. However this can also be a normal intermediate state of a future being still prepared, or of a future stored in some data structure to be found and completed later, so the program can't just abort every time on seeing such a future. Instead it has to be a tool that can run and list all the suspicious futures whenever a deadlock is suspected. Or there can be a special flag that would let the future be temporarily exempted, that gets cleared on exiting the constructing scope unless explicitly preserved. Then any non-compliant future without this flag can be an immediate reason for a program abort, but if the flag gets mismanaged, the deadlocks could still happen. As I've said many times before, the asynchronous programming is fragile and hard to debug.

The executor would generally also be a shared_ptr. The final part is the context. Which is normally also a shared_ptr to some object. What object, depends on the function. Consider a classic function:

int func()
{
int a = get_a();
int b = get_b(a);
return a+b;
}

If the functions get_a() and get_b() can block (and I've made get_b() dependent on a to make the execution sequential), in the asynchronous form this function gets split:

struct FuncContext {
int a;
};
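The fragments that go with this context can be sketched as follows. To keep it self-contained and runnable, the sketch brings its own toy Future and Promise classes (made up for this illustration, single-threaded, running the chained code immediately with no executor), uses typed Promise<int> results for simplicity, and has hypothetical get_a() and get_b() that complete their futures right away.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using std::make_shared;
using std::shared_ptr;

template <typename T> class Promise;

// Toy future: whatever is chained to it runs immediately on completion.
template <typename T>
class Future : public std::enable_shared_from_this<Future<T>> {
public:
    const T& value() const { return value_; }

    // Chain a continuation function; returns the future of its result.
    template <typename Ctx, typename R>
    shared_ptr<Future<R>> chain(
        void (*fn)(shared_ptr<Ctx>, shared_ptr<Future<T>>, shared_ptr<Promise<R>>),
        shared_ptr<Ctx> ctx)
    {
        auto result = make_shared<Promise<R>>();
        auto self = this->shared_from_this();
        when_done([fn, ctx, self, result] { fn(ctx, self, result); });
        return result->to_future();
    }

    // Chain a promise of the same type: it gets completed with this future's value.
    void chain(shared_ptr<Promise<T>> target) {
        auto self = this->shared_from_this();
        when_done([self, target] { target->return_value(self->value()); });
    }

private:
    friend class Promise<T>;
    void when_done(std::function<void()> cb) {
        if (done_) cb(); else waiting_.push_back(std::move(cb));
    }
    void complete(T v) {
        value_ = std::move(v);
        done_ = true;
        auto ready = std::move(waiting_);
        waiting_.clear();
        for (auto& cb : ready) cb();
    }
    T value_{};
    bool done_ = false;
    std::vector<std::function<void()>> waiting_;
};

template <typename T>
class Promise {
public:
    shared_ptr<Future<T>> to_future() const { return fut_; }
    void return_value(T v) { fut_->complete(std::move(v)); }
private:
    shared_ptr<Future<T>> fut_ = make_shared<Future<T>>();
};

struct FuncContext {
    int a = 0;
};

// Hypothetical asynchronous callees; here they complete right away.
void get_a(shared_ptr<Future<int>>* out) {
    auto p = make_shared<Promise<int>>();
    *out = p->to_future();
    p->return_value(10);
}
void get_b(int a, shared_ptr<Future<int>>* out) {
    auto p = make_shared<Promise<int>>();
    *out = p->to_future();
    p->return_value(a * 2);
}

// Last fragment: b is used only here, so it doesn't go into the context.
void func3(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
    int b = arg->value();
    result->return_value(ctx->a + b);
}

// Middle fragment: runs after get_a() completes, calls get_b().
void func2(shared_ptr<FuncContext> ctx, shared_ptr<Future<int>> arg, shared_ptr<Promise<int>> result) {
    ctx->a = arg->value();
    shared_ptr<Future<int>> fut_cont;
    get_b(ctx->a, &fut_cont);
    fut_cont->chain(func3, ctx)->chain(result);
}

// Head part: takes its arguments like a common function, returns a future.
void func(shared_ptr<Future<int>>* result_future) {
    auto ctx = make_shared<FuncContext>();
    auto result = make_shared<Promise<int>>();
    *result_future = result->to_future();
    shared_ptr<Future<int>> fut_cont;
    get_a(&fut_cont);
    fut_cont->chain(func2, ctx)->chain(result);
}

int demo_split_func() {
    shared_ptr<Future<int>> fut;
    func(&fut);
    assert(fut != nullptr);
    return fut->value();  // a + b = 10 + 20
}
```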

This highlights how the asynchronous code is typically written:

There are two kinds of asynchronous functions: the "head parts" of the actual meaningful high-level functions, like func(), and the split-out internal fragments of the meaningful functions, like func2() and func3(). They're usually written differently, the heads taking the arguments just like the common functions and returning a future with the result, where the fragments are tied to some future representing the result of another asynchronous function call, do the next step of computation until calling another asynchronous function, and then return the result of that function as their result (at least in this pattern where the fragments are pre-chained in advance).
The context carries the values between the fragments, and is an analog of a stack frame in a normal function. It's possible to fine-tune each step's context but that's usually more trouble than it's worth, so other than for some obvious optimizations (such as b here not getting stored in the context because it's used in only one fragment), it's much easier and better to just carry the same context throughout all the fragments.

Note that all the dynamic objects are nicely auto-destroyed by reference counting after the function completes, and in the meantime are held alive in the scheduling queues and future chains. However the implication there is that a value stays alive as long as the future containing it stays alive, and if that future is kept for a long time, the value would also be kept.

Why would a future be kept for a long time? Because a future represents both a value and the fact of completion, and the fact of completion might be interesting for much longer than the value, as will be shown in the future installments. In this case it might be useful to chain a future with a value to a future without a value:

shared_ptr<Future<SomeType>> fut_a;
shared_ptr<Promise<void>> prom_b;
...
fut_a->chain(prom_b);

However normally the chaining expects that the types of values on both sides are the same. So this is a special case of converting to void that should be included in the library. If it isn't in the library, it can be implemented as:

template<typename T>
void convert_to_void_impl(shared_ptr<void> ctx, shared_ptr<Future<T>> input, shared_ptr<Promise<void>> result)
{
result->return_value();
}

template<typename T>
shared_ptr<Future<void>> chain_to_void(shared_ptr<Future<T>> input) {
return input->chain(convert_to_void_impl<T>, nullptr, input->getExecutor());
}

using an intermediate function to change the type. And if some particular library supports no void future, you can always use an int future instead and never look at its value, just be satisfied that it has some value.

Asynchronous programming 1 - Futures and Promises and Executors

2025-01-21T07:00:00.001-05:00

A little while ago I've worked on an OS project that was written in an entirely asynchronous way, without explicit synchronization. Which is pretty much what I've described in my book in Section 6.4 "Queues as the sole synchronization mechanism" but in a new-fashioned way. So I want to write up the lessons from that experience before I completely forget them.

First, how the basic programming goes in this paradigm. The basic synchronization unit there is a "future" that is used to signal the completion of some operation that may involve a wait, and return a value resulting from that operation. In pseudocode it looks like this:

Future f = start_some_operation(args);

In the plain programming with futures, you have to wait for the future to be completed before reading the value from it (for now we'll skip over the part of how the values of different types are passed through the futures - basically in C++ it means that the future object would be a template with the value type as its parameter, and much more painful in plain C):

MyValue v = await(f);

However the fully asynchronous programming allows no waits. Instead you chain the future to call the next function:

f.chain(another_operation, context);

When the future becomes completed, it will use its result value to call:

another_operation(context, value)

How does a future get completed? There are two varieties of doing this. First, there is something in the runtime environment that completes the futures when the wait is finished. In a fully asynchronous OS this can be the kernel itself, or in a POSIX environment this would be some glue code that translates the end of a system call in some background thread to a completion of a future. Second, this could be done from another chunk of code right in the same environment. Fundamentally, both ways are the same, it's always some other code completing the future, just it can be a part of the same process and address space and logical unit or be somewhere outside of it.

The futures really have two separate APIs, one for the receiving side where the returned value gets tied to some other code, one for the sending side that puts the returned value into the future and marks it completed. For this reason, the sending API is sometimes given a separate name, a "promise". So it's a promise to return a value, which eventually gets fulfilled, and the value comes out on the other side as the completion of a future.

How does the function chained to the future execute? There are two ways: one is to call it immediately, another one is to schedule it for execution later with some kind of a scheduler. They have different trade-offs: the immediate execution can be faster but as the function gets called, it grows the current stack, and a long enough chain can overflow the available stack size. The scheduling is slower but limits the stack growth and can be used across the process boundaries, such as when the kernel code completes a userspace future. This has a very close relation to how things work in Triceps: a future is similar to a Triceps Label, except for the part that the Labels are connected once into a pipeline that is then used to send a continuous stream of data, while the futures are built into ephemeral pipelines that are used only once and then discarded, and new pipelines need to be built for more data. Unlike futures, Triceps always schedules the data that comes through a Label to avoid the stack depth troubles, and also provides certain guarantees of the scheduling order for the labels within a Unit.

However the futures model also has an analogy of a Triceps Unit, called an "executor". It's a scheduler with its queue, and also some CPU threads to perform the execution, potentially in a multithreaded way. Looking back, the idea of running short thread snippets on multiple CPUs in https://babkin-cep.blogspot.com/2020/02/scheduling-policy-for-burst.html is exactly such a combination of asynchronous logic with a multithreaded executor.

There can be multiple executors in a process, and not all of them are multithreaded. The single-threaded executors have a special use: they are used as mutexes. Since the functions in the scheduling queue of a single-threaded executor run sequentially, this has the same effect as serializing them on a mutex. So you'd have a separate single-threaded executor for every place where you'd lock a mutex in a common program. Well, sort of, as long as you don't try to do any other asynchronous operation when holding the lock - then as soon as you leave the current function, your lock gets released, so the serial executors are more like spinlocks without performance advantages, and you need to build special patterns to make sleeplocks. All the caveats of locking still apply, and you can cause deadlocks just as in the regular programs (they manifest a little differently, as all futures being incomplete, but still lock up the execution). If you want, you can also create executors with specific scheduling order logic, such as a single-threaded executor with the same logic as a Triceps Unit.

The multithreaded executors are generally all equivalent (since they provide no guarantees about the order of execution), so you'd normally have only one, with as many threads as there are physical CPUs. With some exceptions. You may have snippets executing at different priorities by using different executors (again, one executor per priority is enough). Or you may want to limit the load imposed by some code by limiting its parallelism, then making a separate executor with a smaller number of threads makes sense.

When creating more executors, don't forget that the underlying resources are not unlimited, and their threads will be fighting it out for the real CPUs at the OS level. And yes, the threads may get preempted at the OS level at the most inconvenient times. So this is the case where having both some OS support and coordination between the executors within the process to synchronize the preemption at the points where the threads switch over to handling the next future can really help.

With the executors, the chaining of the code to a future gets an extra argument, the executor to schedule this code on:

f.chain(another_operation, context, executor);

And then the chained operation can be executed as a direct function call only if it should run on the same executor as the code that completes the future, and only if the executor's scheduling logic allows it.

To give a small example, this pseudo-code with locking:

foo()
{
before_lock();
lock(mutexA);
inside_lock();
unlock(mutexA);
after_lock();
}

becomes:

foo()
{
before_lock();
p = new_promise<int>();
p.to_future().chain(inside_lock, NULL, singleExecutorA);
p.return_value(0);
}

inside_lock(context, value)
{
...
p = new_promise<int>();
p.to_future().chain(after_lock, NULL, multiExecutor);
p.return_value(0);
}

after_lock(context, value)
{
...
}

A lot more code. With the chaining now built into the functions themselves. And this example also glossed over the fact that if foo() returned a value, it must now return a promise of this value instead (so yes, the non-blocking and potentially blocking functions become different). This code also assumes that mutexA was held as a spinlock only for a very short period of time and no asynchronous operations were called from inside_lock(). Things get very confusing very fast, and organizing them in certain patterns does help. More on that in the next installments.
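For a version that can actually be compiled and run, here is a simplified sketch of the same flow. It's an illustration under a few assumptions: the Executor is a made-up FIFO queue that is drained by hand instead of by threads, and the sequence of creating a promise, chaining a function to its future and completing it at once is collapsed into posting the function directly to the executor's queue.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical executor: a FIFO queue of jobs. A real one would own threads;
// here run_one() is called by hand to keep the sketch deterministic.
class Executor {
public:
    void post(std::function<void()> job) { queue_.push_back(std::move(job)); }
    bool run_one() {
        if (queue_.empty()) return false;
        auto job = std::move(queue_.front());
        queue_.pop_front();
        job();
        return true;
    }
private:
    std::deque<std::function<void()>> queue_;
};

struct World {
    Executor singleExecutorA;  // plays the role of mutexA
    Executor multiExecutor;    // everything unlocked runs here
    std::vector<std::string> log;
};

void after_lock(World& w) { w.log.push_back("after_lock"); }

void inside_lock(World& w) {
    w.log.push_back("inside_lock");
    // "Unlocking" is scheduling the rest on another executor and returning.
    w.multiExecutor.post([&w] { after_lock(w); });
}

void foo(World& w) {
    w.log.push_back("before_lock");
    // "Locking" is scheduling the critical section on the serial executor.
    w.singleExecutorA.post([&w] { inside_lock(w); });
}

std::vector<std::string> demo_lock_as_executor() {
    World w;
    foo(w);
    foo(w);  // two callers contend for the same "mutex"
    // Drain both queues until no work is left.
    while (w.singleExecutorA.run_one() || w.multiExecutor.run_one()) {}
    assert(w.log.size() == 6);
    return w.log;
}
```

The two inside_lock() calls run one after another from the queue of singleExecutorA, which is what serializes them.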

realSEUDO and evaluation of activation profiles

2024-12-29T14:20:00.000-05:00

(This is the part 3 of discussion of realSEUDO).

When the activation profiles are generated, there are different ways to scale them that basically depend on how we normalize the reference images of the cells.

For example, apparently one traditional way to normalize them is to make the sum of all the pixel values equal to 1. A consequence of this is that the larger cells have more bleak reference images, and so their found coefficients for least-squared-error fit to match the same absolute brightness in the frame will be higher, since the formula is basically (coefficient = frame_brightness / reference_image_brightness).

The SEUDO (noise suppression) algorithm does a different normalization: it considers the cell image as a multi-dimensional vector (with each pixel being a dimension) and normalizes it to the euclidean length (i.e. norm2) of 1. This way all the cell images and all the gaussian kernels (that get normalized in the same way) have the same euclidean length and duke it out being equal in this parameter. But it still means that the larger cells will have more bleak images with the same consequences, although less so than in the first kind of normalization.
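A tiny numeric sketch of the effect of these two normalizations, with made-up uniformly lit "cells" of 4 and 16 pixels (the function names are mine, and the code assumes non-empty images with a non-zero sum):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scale the image so that the sum of all the pixel values is 1.
std::vector<double> normalize_sum(std::vector<double> img) {
    double sum = 0;
    for (double p : img) sum += p;
    for (double& p : img) p /= sum;
    return img;
}

// Scale the image so that its euclidean length (norm2) is 1.
std::vector<double> normalize_norm2(std::vector<double> img) {
    double sq = 0;
    for (double p : img) sq += p * p;
    double len = std::sqrt(sq);
    for (double& p : img) p /= len;
    return img;
}

// coefficient = frame_brightness / reference_image_brightness.
// For a uniformly lit cell any pixel of the reference image will do.
double coefficient_for(const std::vector<double>& normalized_ref, double frame_brightness) {
    assert(!normalized_ref.empty());
    return frame_brightness / normalized_ref[0];
}
```

With the frame brightness of 1, the sum normalization gives the coefficients of 4 for the small cell and 16 for the large one, while the norm2 normalization gives 2 and 4: the same kind of disparity but a smaller one.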

The differing coefficients mean that the activation profiles of different cells are not really comparable, they will be scaled all differently. It also means that as the cell images are being learned and change frame-to-frame, the coefficients they produce in the recognition part will differ wildly. RealSEUDO solves the last problem by generating the update coefficients for the previous frames as the learning happens. Fortunately, the updates are easy: since the reference images are scaled by a multiplier, the recognition results can be easily adjusted by changing this multiplier (which for them is actually a divisor, since the relation is reciprocal).

But I think that in general the property of comparability between different cells and between different recordings is very important, and nobody seems to pay much attention to it. From the mathematical standpoint it's "we can always normalize to the mean and variance". But that normalization works only if we have seen all the cells reach the same maximal absolute brightness. If some cell never reached more than half that, the normalization will make it twice as bright as it was. But if in some session it never reaches more than half brightness and in another session it lights all the way up, its profiles not only won't be comparable to the other cells but even to the same cell in another session. That's why I think that we should look for some absolutes.

The differing coefficients also create a different way of inequality in recognition: the larger cells with higher coefficients get an advantage in LASSO and experience a weaker push towards 0, and the gaussian kernels, being usually smaller than even the small cells, get a stronger push towards 0. Which might be good or not. I haven't experimented with this, so I don't know which way is better in reality, but it definitely looks like something worth experimenting with.

To make the profiles comparable, I think the right way is to put them into the range between 0 and 1, at least more-or-less. The easy way to look at it is that we take the range of pixel brightness and remap it to [0, 1], then we take the brightest activation of a cell and rescale its brightness so that the brightest saturated pixels also become 1. Then the maximal coefficient also becomes 1. There are a couple of problems left though.

First, it might affect LASSO, and a different scaling might work better for LASSO. This is easy to solve: since the coefficient changes reciprocally to the scaling of the reference image in normalization, LASSO can be run with one kind of normalization and then a single multiplication operation can convert it to another kind.

Second, there is the problem of noise, at both the low and high ends. At the low end the black isn't black, it's a kind of frothing gray with some level of background lighting. At the high end there are very bright pixels that go above the normal brightness. At the low end realSEUDO makes an assumption that the cell activations aren't very dense, and so the median pixel brightness in the image represents the close-to-average background lighting, and then the difference between that and the 10-percentile brightness represents the level of background noise (although these levels may need to be adjusted if the assumptions are incorrect, and there is also an adjustment for unevenness of background lighting). So we take this difference and go up by that much from the median to cut the noise off, and that becomes our 0 level. At the high end it collects the brightest levels of every pixel that ever was assigned to the cell, and then takes not the very brightest ones but one standard deviation up from the mean brightness as the reference brightness, to cut off the very bright anomalous pixels.
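The arithmetic of these two cut-offs can be sketched as follows. This is not the realSEUDO code, just the formulas from the paragraph above: it leaves out the adjustment for the uneven background lighting, assumes non-empty inputs, and picks one simple convention for the percentile and the population form of the standard deviation, both of which are my assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Simple percentile: the element at index floor(pct/100 * (n-1)) of a sorted copy.
double percentile(std::vector<double> v, double pct) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    std::size_t idx = static_cast<std::size_t>(pct / 100.0 * (v.size() - 1));
    return v[idx];
}

// Low end: go up from the median by the distance between the median
// and the 10th percentile, and use that as the 0 level.
double zero_level(const std::vector<double>& frame) {
    double median = percentile(frame, 50);
    double low = percentile(frame, 10);
    return median + (median - low);
}

// High end: one standard deviation up from the mean of the brightest
// levels collected for the pixels of a cell.
double reference_brightness(const std::vector<double>& peaks) {
    assert(!peaks.empty());
    double mean = 0;
    for (double p : peaks) mean += p;
    mean /= peaks.size();
    double var = 0;
    for (double p : peaks) var += (p - mean) * (p - mean);
    return mean + std::sqrt(var / peaks.size());
}
```

For example, for a frame with the pixel values 0 through 10 the median is 5 and the 10th percentile is 1, so the 0 level lands at 9. And for the peaks {2, 4, 4, 4, 5, 5, 7, 9} the mean is 5 and the standard deviation is 2, so the reference brightness is 7 rather than the maximum of 9.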

And here we come to the actual comparison that tries to tell which results are better and which are worse. The trouble is that in a real recording we don't have the ground truth. Instead the CNMF recognition is used as the assumed ground truth, so the comparison is not really how good the recognition is but how close it is to CNMF. If you make something better than CNMF, you'd get a worse rating than if you match CNMF exactly. This could be resolved by generating the artificial images of known pure signal with noise added on top. And there is a package, NAOMi, that does that. However, unbelievably, NAOMi doesn't save the original profiles it used for generation, producing the CNMF-based recognition as a reference instead, re-creating the same issue.

So a consequence of this is that even though SEUDO is supposed to clean up the noise, in reality it produces the best correlation to CNMF when tuned for a very, very mild noise suppression, much milder than the original SEUDO paper found was the optimal trade-off between suppressing the noise and introducing distortion.

Then the comparison by correlation has a couple of fine points. One is that the selection of zero level matters. Raising the zero level cuts off all the lower fluctuations, and doesn't scale with the correlation's normalization. The best correlation will be with the same selection of zero level.

Another is that the lateral shifts on time axis matter a lot. The very first step of noise suppression in realSEUDO is done by averaging a few frames, and the way it tags the averaged frames is off-center, by the last included frame. So shifting the trace to almost center the averaging makes a big improvement on the correlation. Not quite centering though, for the best result it still has to lag by a frame. Which makes me think that this lag by one frame (or another way to put it, off-by-one difference in frame numbering) is also built into CNMF that is used as the reference.

But comparing the time profiles is insufficient in evaluating the match, the shape of the cells matters too. Figure 4 in the appendix to realSEUDO paper contains a small example of matching profiles together with cell shapes as detected by different algorithms. This is a small subset of the full matching that was auto-generated. It's technically a post-processing, not a part of realSEUDO, but if we want to compare the different algorithms, we need to build some automated ways to find the differences in their outputs. Our version of differencing reused a part of the realSEUDO logic. Unlike correlation that produces a close-or-not score, realSEUDO's evaluation differentiates separate cases, with a score for what it thinks is the best fitting case:

Two shapes are unrelated
One image subsumes another closely enough to be two versions of the same image
One image subsumes another and is a combination of multiple images
Two shapes overlap but are distinct

Just as realSEUDO uses these scores for merge-or-split decisions, the same can be applied to matching, on both space and time. And so, for example, it recognized that CNMF's cell 35 got found in a similar shape by realSEUDO as cell 4, but by OnACID as two separate cells 7 and 22. Of course, without knowing the ground truth we still can't tell, which way is more correct, other than by eyeballing the video at the times of discrepancies.

LASSO-ISTA-FISTA

2024-12-22T13:04:00.002-05:00

This is a follow-up to the previous post, talking about the details of the optimization algorithms (in the math sense, i.e. trying to find a minimum of a function by gradient descent) LASSO-ISTA-FISTA that we've modified for realSEUDO. To remind, FISTA is a momentum gradient descent on top of ISTA which is the logic for selecting the step size for the simple gradient descent on top of LASSO which is the idea of adding a bias to the function being optimized to drive the dimensions with small values towards 0. The amount of this bias is adjusted by the parameter lambda.

LASSO minimizes a function like

||f(x) - y||^2 + lambda*|x|

Here x is a vector (you can think of it as coordinates in an n-dimensional space), in our case of analyzing the brain cell videos it's the vector of activation levels for all the cells and all the SEUDO's gaussian kernels. y is also a vector (of a different size) that represents the brightness of all the pixels in an image (an image can be also thought of as a matrix but here it doesn't matter, we just string out all the rows into a vector). f(x) is a vector-to-vector function that maps the activation levels to the image in pixels, in our case multiplying the pixel values in the dictionary image of each cell by its activation level and then adding up the brightness of each pixel from all the sources. So f(x)-y is the error between the actual image and our match, and also a vector. Then the operation ||_|| converts the error vector to a single number by taking its euclidean length (AKA norm2) in n-dimensional space (with the n here being the size of vector y). Taking its square lets us skip the square root of the euclidean computation, since both with and without the square root the minimum is at the same value of x. The bias is the lambda*|x| part. |x| is the norm1 of the vector, computed as the sum of absolute values of all the elements, and lambda adjusts the strength of its effects.
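For those who prefer the usual math notation, the same thing can be written out as below. The symbol d_i for the dictionary image of source i (strung out into a vector of the same size as y) is just a shorthand introduced for this formula.

```latex
\min_{x} \; \| f(x) - y \|_2^2 + \lambda \| x \|_1 ,
\qquad
f(x) = \sum_{i} x_i \, d_i ,
\qquad
\| x \|_1 = \sum_{i} | x_i |
```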

In case you wonder what these functions look like, here is a simple example. It's a smooth function with creases added along the axes by the bias function. The effect of the bias is that the gradient in every dimension gets lambda added to it on the positive side and subtracted on the negative side. And since the minimum is at zero gradient, this makes the dimensions with shallow gradients around 0 (which means that they don't matter much) to be pushed to 0.

The idea of LASSO is to interleave the descent steps: one according to the gradient of main function, another one according to the gradient of the bias function, which still requires computing the gradient of the main function at the new point. Which not only doubles the number of heaviest steps (gradient computation) but also tends to dither around the minimum, taking steps in different directions according to the tug-of-war of two gradients. Well, this is stupid, so why not just add two gradients together to start with? This worked very well, but turned out to be not new: someone beat us to it not that many years ago.

By the way, in our case we're only looking for non-negative values, so we just cut off the negative side, looking at only one n-dimensional quadrant, and then the gradient bias will be simply lambda in each dimension, makes the computation a tiny bit cheaper, and also a bit easier to reason about, since the creases get cut off and the function becomes smooth in all the acceptable range.
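Here is a minimal sketch of one such combined step in the non-negative quadrant. It's an illustration and not the realSEUDO code: the names are made up, the dictionary is stored as a plain vector of images (assumed non-empty and all of the same size), and the step size is simply passed in as an argument.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Dict = std::vector<Vec>;  // dict[i] is the image of source i, as a vector of pixels

// f(x): the sum of the dictionary images weighted by the activation levels.
Vec render(const Dict& dict, const Vec& x) {
    Vec img(dict[0].size(), 0.0);
    for (std::size_t i = 0; i < dict.size(); ++i)
        for (std::size_t j = 0; j < img.size(); ++j)
            img[j] += x[i] * dict[i][j];
    return img;
}

// One descent step with the gradients of the main function and of the bias
// added together, restricted to x >= 0.
Vec descent_step(const Dict& dict, const Vec& y, Vec x, double lambda, double step_size) {
    Vec err = render(dict, x);  // becomes f(x) - y
    for (std::size_t j = 0; j < err.size(); ++j) err[j] -= y[j];
    for (std::size_t i = 0; i < x.size(); ++i) {
        double grad = lambda;  // the bias gradient is just lambda when x >= 0
        for (std::size_t j = 0; j < err.size(); ++j)
            grad += 2.0 * dict[i][j] * err[j];  // d/dx_i of ||f(x) - y||^2
        x[i] = std::max(0.0, x[i] - step_size * grad);  // cut off the negative side
    }
    return x;
}

Vec demo_descent_step() {
    Dict dict = {{1.0, 0.0}, {0.0, 1.0}};  // two sources, one pixel each
    Vec y = {3.0, 0.1};
    Vec x = descent_step(dict, y, Vec{0.0, 0.0}, 1.0, 0.5);
    assert(x.size() == 2);
    return x;
}
```

In this toy example the bright source lands at 2.5 (pulled down from 3 by the bias), and the dim source with the brightness of 0.1 gets pushed all the way to 0.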

The goal of ISTA is to select the gradient step that is as big as possible but doesn't overshoot the minimum. The safe step size depends on the maximum steepness, i.e. maximum second derivative of the function that can be encountered in stepping towards minimum (which basically means between the starting point and minimum, unless we overshoot), known as the Lipschitz constant L, and the step size is its reciprocal 1/L. This limit comes basically from Newton's method: we want to get to the point where the first derivative is zero, so how far can we step to not overstep that point? 1/L. In fact, if the second derivative is constant, this step would bring us exactly to the minimum.
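The last statement is easy to check in one dimension. Take a function g with the minimum at x^* and a constant second derivative L (the names g and x^* are introduced just for this check):

```latex
g(x) = g(x^*) + \tfrac{L}{2} (x - x^*)^2
\;\;\Rightarrow\;\;
g'(x) = L \, (x - x^*)
\;\;\Rightarrow\;\;
x - \tfrac{1}{L} \, g'(x) = x - (x - x^*) = x^*
```

And if the actual second derivative along the way is smaller than L, the first derivative is smaller too, so the step of 1/L falls short of the minimum instead of overshooting it.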

The momentum descent in FISTA makes the good estimation of basic step less important but still plenty important, as the big steps grow the momentum much faster. TFOCS (the Matlab library for gradient descent) does a periodic sampling of the current steepness by looking at the change of gradient during the step. But TFOCS is a general library, could we do better for our specific kind of function? We've looked into the static estimation of the magnitude of the second derivative, and it worked worse than TFOCS. But out of the ways of estimation, one that worked better, was to take a separate derivative in each dimension (i.e. for dimension m the sum of dx_i/dx_m for all i) and use it separately for that dimension. This is something that only an engineer would do and a mathematician won't: try something and see if it works. And this something is "compute and use a separate L in every dimension", so that in stepping the gradient gets distorted according to these partial values of L, slowing down the stepping in the dimensions with steep second derivative and speeding up in the dimensions that are nearly flat. Now, as I've come back to it, and looked at it again analytically, my first reaction was "WTF, this can't make things better!" but it did. I even went back to the code and tried again, and it still did.

The thing is, as you get more than two dimensions, the negative terms with x_i*x_j kick in and the function stops being consistently convex in all dimensions, the gradient doesn't point exactly towards the minimum any more, and the second derivative along the direction of gradient stops being constant. So the static estimation worked worse than the sampling done by TFOCS, and we switched to the sampling too. But then how about applying the idea of separate L by dimension to the sampling? As it turns out, it still worked well, although not always. In all the early examples we tried, it consistently worked better than TFOCS but then we've encountered more examples where it worked a little worse too. It all depends, and maybe there is a way to pick and choose, which way to use for a specific case.

And then the idea that we can treat the dimensions separately came in handy for solving the overshooting problem with the momentum descent, by stopping the momentum in the dimensions that either have the gradient change sign or go out of acceptable range. This worked very, very well. And didn't leave much of FISTA as it is: momentum descent with brakes. The gradually growing brakes got replaced with the sudden stopping by separate dimensions.
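To show the idea of the sudden stopping by dimension, here is a minimal sketch. Beware that it's a simplification and not the realSEUDO code: it uses a plain fixed-coefficient momentum instead of the FISTA schedule, a fixed step size, and the function and parameter names are made up for the illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Momentum descent over x >= 0 with the momentum stopped separately
// in each dimension.
Vec momentum_descent(const std::function<Vec(const Vec&)>& gradient, Vec x,
                     double step, double momentum, int iterations)
{
    Vec velocity(x.size(), 0.0), prev_grad(x.size(), 0.0);
    for (int it = 0; it < iterations; ++it) {
        Vec grad = gradient(x);
        for (std::size_t i = 0; i < x.size(); ++i) {
            // Brake 1: the gradient changed sign, so this dimension overshot.
            if (grad[i] * prev_grad[i] < 0) velocity[i] = 0;
            velocity[i] = momentum * velocity[i] - step * grad[i];
            x[i] += velocity[i];
            // Brake 2: this dimension went out of the acceptable range.
            if (x[i] < 0) { x[i] = 0; velocity[i] = 0; }
        }
        prev_grad = std::move(grad);
    }
    return x;
}

// Toy function (x0 - 3)^2 + (x1 + 1)^2: its minimum within x >= 0 is at (3, 0).
Vec demo_momentum_with_brakes() {
    auto gradient = [](const Vec& x) { return Vec{2 * (x[0] - 3), 2 * (x[1] + 1)}; };
    Vec x = momentum_descent(gradient, Vec{0.0, 2.0}, 0.1, 0.9, 200);
    assert(x.size() == 2);
    return x;
}
```

In this toy example the first dimension overshoots 3 a few times, losing its momentum on each overshoot, and settles there, while the second dimension runs into the boundary, stops, and stays at 0.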

All these ideas can be applied to any momentum descent, including TFOCS and other libraries. Whether they would work well, depends on the particular functions being minimized but that's OK, that's why TFOCS already has multiple options to be used in different situations. Although the fact that sudden stopping seems to work decently well even with such jagged functions as in the NNs suggests that it's likely to work well in many if not all cases.

Triceps and realSEUDO at NeurIPS

2024-12-22T02:37:00.003-05:00

I've just come back from the NeurIPS 2024 conference. With Triceps listed as my organization. I've done a project with mentoring my family member at JHU, and well, since it's not work-related, using my day job would be wrong, and I'm not at JHU either (and I've paid for attending the conference myself, JHU turned out to be too cheap for that). So Triceps it is, especially that Triceps is used in the project. And that's, by the way, exactly the project I've mentioned before as an application of Triceps, just I couldn't tell the details before it got published. Well, now we've had a poster session at NeurIPS (which people tell me is kind of a big deal, I've just had fun), and the paper is out: "realSEUDO for real-time calcium imaging analysis" at https://neurips.cc/virtual/2024/poster/94683 or https://www.researchgate.net/publication/380895097_realSEUDO_for_real-time_calcium_imaging_analysis or https://arxiv.org/abs/2405.15701. Here is what it does, in simple words.

The calcium imaging is a brain imaging technique, in live (although not exactly unharmed) brain. People have bred a strain of mice whose neurons light up when they activate. So they take a mouse, open up its skull, and stick a camera with a microscope on it. This allows to record a movie of a living brain where you can see what is going on in it literally cell by cell, the brightness of the cells corresponding to their activation level.

Once you get the movie, you need to analyze it: figure out where the cells are, and their activation levels. The "gold standard" of the detection algorithm is CNMF. It solves an optimization problem (in the mathematical sense, i.e. finding the minimum of a function that represents the error), for two mappings at once: what pixels belong to what cell, and what cells activate at what level in each frame (activation profile). More exactly, it deals not with separate pixels but with gaussian kernels. "Kernel" in the math speak is any repeatable element (so for another example, "CUDA kernels" are the CUDA functions that are repeated across the GPU), but in this case it's a sprite that gets stenciled repeatedly, centered on each pixel. "Gaussian" is the shape of the sprite, a 2-dimensional representation of the normal distribution where the brightness corresponds to the weight, so it's an approximation of a circle that is brightest in the center and then smoothly goes to nothing at the edges.
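To make the "gaussian sprite" more tangible, here is a small Python sketch that builds one. The function name and the choice of size and sigma are just for illustration, not the values used in any of the algorithms mentioned:

```python
import numpy as np

def gaussian_sprite(size, sigma):
    """Return a size x size array with a 2D gaussian bump, peak value 1 in the center."""
    half = (size - 1) / 2.0
    coords = np.arange(size) - half
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))

sprite = gaussian_sprite(size=7, sigma=1.5)
```

The center pixel gets the brightness of 1, and the corners of this 7x7 sprite are already below 0.02, so the sprite looks like a fuzzy circle.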

The trouble is that CNMF works only on the whole movie after it has been collected. You can't use it to decode frame-by-frame in real time because it needs to see all the frames at once. So you can't identify cells and then immediately use this knowledge to alter the stimulus and see what effect this has on the brain. There is another algorithm, OnACID that does frame-by-frame decoding using a variation of CNMF logic, except that it requires a starting period of 100 frames or so to get the initial cells, and then the quality of identifying the cells is much worse than with CNMF, and it's still not very fast, substantially slower than the 30 frames per second even on an 80-CPU machine we used.

In the same area but in a little different niche, SEUDO is the JHU professor's previous project, a technique to reduce noise in the activation profiles. It can be used together with the other algorithms. Aside from the random noise, there is an inherent cross-talk: the neurons are stacked in multiple layers and can have very long dendrites that can overlap with many other neurons, adding little weak light spots that get registered as weak activations on these other neurons. So the idea there is that to recognize the activations we both use the known cell shapes and also fill the image with gaussian kernels centered on every pixel, and solve the minimization problem on all of them, then throw away the values for gaussian kernels, thus discarding the cross-talk from unknown sources. The little light spots that don't fit well with the known cells get attributed to these gaussian kernels, and so the cell activation profiles become smoother. The problem is that it's slow, and quickly becomes slower as you increase the size of the kernels. If you take a kernel of 30x30 pixels, you get an extra 900 coefficients in the quadratic equation you're trying to minimize (and that's for every kernel, which gets centered on every pixel).

The goal was to get this whole thing to work in real time (i.e. at least 30 fps), and we've achieved it. It works in real time (even including SEUDO, which is really an optional extra processing on top), and produces the quality that is generally way better than OnACID, and fairly close to CNMF. Well, arguably, at least sometimes better than CNMF but the problem is that there is no existing accepted way to rate "better or worse" (the unscientific method is by eyeballing the small fragments of the movies), the accepted way of rating is "same or different from CNMF". I'll talk a bit more about it later, in a separate post.

So the work really has 5 parts:

The optimization (in programming sense) and parallelization in C++ of the optimization (in math sense) algorithm
Improvements to the optimization (in the math sense) algorithms
An off-shoot of trying to apply some of these improvements to optimization algorithms, and more, to the neural network training problem (the things I did in Triceps)
Improvements to the SEUDO algorithm
The completely new logic for automatically finding the cells in the movies (that got the name of realSEUDO)

The first part is kind of straightforward (read my book :-) with only a few things worth mentioning. First, outdoing Matlab and the TFOCS package in it is not easy, it's already well optimized and parallelized, but doable. There are two keys to it. One is to match what TFOCS does: represent the highly sparse matrix of coefficients by "drawing" the sprites and thus regenerating the matrix on the fly whenever it gets used, otherwise the matrix just won't fit into the memory. The computation of a gradient requires two passes of drawing. TFOCS does it by generating two complementary drawing functions. We use two passes with one drawing function but computing different sums: the first pass goes by columns and computes the intermediate sums (common subexpressions) that get stored, then the second pass goes by rows and computes the final sums from the intermediate ones. The two passes with stored intermediate values reduce the complexity from O(n^3) to O(n^2) compared to doing everything in one pass. The second key was to make the pixel drawing function into a template rather than a function call, because the overhead of a function call in the inner loop has a higher cost than the rest of computation. And another thing that needs mentioning is that to pass data between Matlab and C++ code you have to use Matlab's C API. Their C++ API is extremely slow, adding much more overhead than the computation itself.
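The idea of the two passes with stored intermediate sums can be sketched in Python like this. The sketch assumes a plain least-squares error 0.5*|A*x - b|^2, with a small dense random matrix A standing in for the matrix that the real code regenerates by drawing the sprites. All the names here are illustrative, and it's not the C++ code from the project:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))   # stand-in for the matrix produced by "drawing" the sprites
b = rng.random(6)        # stand-in for the pixels of the frame
x = rng.random(4)        # current values of the coefficients

def gradient_two_pass(A, b, x):
    # Pass 1: one intermediate sum per pixel (the residual), stored
    residual = A @ x - b
    # Pass 2: one final sum per coefficient, reusing the stored residuals
    return A.T @ residual

def gradient_one_pass(A, b, x):
    # Same gradient without the stored values: every term of every
    # coefficient recomputes its residual from scratch
    n_pixels, n_coeffs = A.shape
    g = np.zeros(n_coeffs)
    for m in range(n_coeffs):
        for p in range(n_pixels):
            g[m] += A[p, m] * (A[p, :] @ x - b[p])
    return g

g_fast = gradient_two_pass(A, b, x)
g_slow = gradient_one_pass(A, b, x)
```

Both functions produce the same gradient, but the one-pass version has an extra nested sum hiding inside the innermost expression, which is where the extra power of n comes from.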

But that's still not fast enough. Then it was the turn of improving the optimization (in math sense) algorithms. There are three: FISTA which is a momentum gradient descent on top of ISTA which is the logic for selecting the step size for the simple gradient descent on top of LASSO which is the idea of adding a bias to the function being optimized to drive the dimensions with small values towards 0. The amount of this bias is adjusted by the parameter lambda (wait, we'll come to its effect yet).
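For reference, one step of ISTA in its textbook form looks like this in Python: a gradient step of size 1/L followed by the "soft thresholding" that implements the LASSO bias controlled by lambda. This is the generic textbook version, not our code, and the toy gradient below is made up for the illustration:

```python
import numpy as np

def soft_threshold(v, threshold):
    """Shrink every value towards 0 by threshold, zeroing the small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def ista_step(x, grad, L, lam):
    """One ISTA step: gradient step with size 1/L, then the LASSO shrinkage."""
    return soft_threshold(x - grad(x) / L, lam / L)

# Toy smooth part: f(x) = 0.5 * |x - target|^2, so the gradient is x - target, L = 1
target = np.array([3.0, 0.5, -2.0])
grad = lambda x: x - target

x_next = ista_step(np.zeros(3), grad, L=1.0, lam=1.0)
```

With lambda of 1 the large values get pulled towards 0 by 1, and the small value 0.5 gets driven all the way to 0, which is exactly the point of LASSO. FISTA adds the momentum on top of this step.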

I've started writing up the LASSO-ISTA-FISTA summary but even in a summary form it got quite long. So I'll put it into a separate follow-up post. The summary-of-a-summary is that a weird but serendipitous experiment gave me an idea of treating the gradient dimensions separately, and that led to the idea of the momentum stopping on overshoot by dimension. The momentum gradient descent is kind of like a ball rolling down a well, observed with a stroboscope: each step of a descent is like the passage of time to the next strobe flash. So the next time we see the ball, it might already be past the minimum and the momentum carrying it up the opposite wall. Eventually it will stop, go back, and likely overshoot the minimum again, in the other direction. In the real world the ball eventually stops because of friction that drains its energy. In the momentum descent the friction is simulated by gradually reducing the momentum coefficient (which means growing friction). The trouble is, you don't know in advance, how much friction to put in. With too little friction, the ball will bounce around a lot, with too much friction it will be rolling in molasses. So people do things like resetting the momentum coefficient after some large number of steps. But if we look at each dimension, there is an easy indication of an overshoot: the gradient in that dimension changes sign. So at that time we can just immediately kill the momentum, and we don't even have to kill the whole momentum, we can kill it in just that one dimension. And this worked really well, to the point that the friction became completely unnecessary. The other place where the sudden stopping helps is when the variables try to go out of range (for example, in our case we have the boundary condition that each dimension is non-negative).
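A minimal Python sketch of the momentum descent with brakes by dimension might look like this. It's a simplified illustration and not the code from the paper: the fixed step size, the momentum coefficient beta = 1 (i.e. no friction at all), the function name, and the toy function are all assumptions made for the example:

```python
import numpy as np

def momentum_descent_with_brakes(grad, x0, step, n_steps=200, beta=1.0, lower=0.0):
    """Momentum descent where the momentum is killed separately in each dimension."""
    x = np.array(x0, dtype=float)
    velocity = np.zeros_like(x)
    g = grad(x)
    for _ in range(n_steps):
        velocity = beta * velocity - step * g
        x_new = x + velocity
        # Brake 1: a dimension tries to leave the allowed range,
        # so clamp it to the boundary and stop it there
        out_of_range = x_new < lower
        x_new[out_of_range] = lower
        velocity[out_of_range] = 0.0
        # Brake 2: the gradient in a dimension has changed sign, which
        # means an overshoot, so kill the momentum in that dimension only
        g_new = grad(x_new)
        overshot = g_new * g < 0
        velocity[overshot] = 0.0
        x, g = x_new, g_new
    return x

# Toy bowl with different steepness by dimension and the minimum at (3, 0.5)
a = np.array([1.0, 10.0])
center = np.array([3.0, 0.5])
grad = lambda x: a * (x - center)

x_min = momentum_descent_with_brakes(grad, x0=[0.0, 5.0], step=0.05)
```

In this toy run the steep second dimension overshoots on the second step, tries to go negative, gets clamped to 0 and stopped, while the flat first dimension keeps accumulating the momentum undisturbed for a few more steps.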

This worked very well for a smooth (well, mostly, except for the creases added by LASSO bias, but then again the creases are at the sign change boundaries, and we always stay on the positive side) quadratic function. Would it work on the much more jagged functions used in the training of neural networks? This is what I've tried in the NN code in Triceps, and as it turns out, the answer is sort of yes, at least in certain limited ways, as I wrote at length before in the other posts, and the fraction of dimensions experiencing the stopping can also be used to auto-adjust the training step size (i.e. the training schedule - the existing algorithms like Adam rely on preset constants, here it adjusts automatically). It's currently been applied to plain gradient descent only, and not beating the stochastic gradient descent in the current form, but on the other hand, it does make the plain gradient descent with momentum not only work at all but work not that much worse than the stochastic descent. Check out the figure with NN training speed comparison in the appendix. Now at NeurIPS I've learned things that gave me more ideas for marrying this logic with the stochastic descent, so there are more things to try in the future.

For SEUDO, the trick has turned out to be the sparsity. It still works decently well when the kernels are placed less frequently, up to about 1/3 of the diameter. So with the sprites 30x30 you get a slow-down by a factor of 900, but place these sprites in a grid with the step of 10 pixels, and you get the factor of 100 back. It would probably be even better to place them on a triangular grid (basically, the same grid but with offset on alternating rows, and different horizontal and vertical step sizes to form the equilateral triangles of neighboring centers), to keep the distances between the centers even in every direction, but we haven't got around to do that (yet?).
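The arithmetic of the sparse placement, and the triangular grid that we haven't tried, can be sketched in Python like this. The function names and the 300x300 image size are made up for the illustration:

```python
import math

def square_grid(height, width, step):
    """Kernel centers on a square grid with the given step in pixels."""
    return [(r, c) for r in range(0, height, step) for c in range(0, width, step)]

def triangular_grid(height, width, step):
    """Kernel centers where the neighbors form equilateral triangles with side step."""
    row_step = step * math.sqrt(3) / 2
    centers = []
    row = 0
    y = 0.0
    while y < height:
        # Alternating rows are shifted by half a step
        x = step / 2 if row % 2 else 0.0
        while x < width:
            centers.append((y, x))
            x += step
        row += 1
        y = row * row_step
    return centers

dense = square_grid(300, 300, 1)      # a kernel centered on every pixel
sparse = square_grid(300, 300, 10)    # a kernel centered on every 10th pixel each way
tri = triangular_grid(300, 300, 10)
```

The sparse square grid has 100 times fewer centers than the dense one. In the triangular grid the rows are packed a little closer than the columns, which makes the distance to the diagonal neighbors the same 10 pixels as to the neighbors in the same row.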

And the way to make the detection of the cell images in the video (that's what realSEUDO is, for which incidentally the plain SEUDO is optional) fast was to do it with the old-fashioned image processing. It goes in two stages (well, and a noise reduction pre-stage, and it also expects that the image is already stabilized): the first stage finds the contiguous spots above the background brightness, then adds time as the third dimension and builds the sausage-like shapes of the cell activations. They are pretty jagged at the ends, as the cells start to light up in little spots, then these spots gradually expand and merge, and going dark again looks similar but in the opposite direction. The sausage gets eventually squashed along the time axis to build the reference cell image. The trouble is, there are multiple layers of cells in the brain, with overlapping cells sometimes activating at the same time. So the second stage is used to figure out these overlaps between the cells: when do we see the same cell but at a different brightness level, when do we see a completely different overlapping cell, and when both (or more) overlapping cells are lighting up at the same time. So it merges and splits the cell images accordingly, and generates the events telling what it's doing. At the same time it tries to fit the known cells into the current frame and generates the events for the cell activations (Triceps is used for reporting the events as a more efficient way to collect or forward them than Matlab). The splitting/merging code is very important but purely empirical, with a few important tunable coefficients chosen by trial and error at the moment. It's a tour-de-force of empiric engineering beating science. But perhaps some merging of both approaches is possible, maybe using the detected shapes as a starting point for CNMF-like further optimization. 
If the starting point is already very close to the optimum, the short path to the optimum would be fast (as we see in reusing the detected activation coefficients from the previous frame as the starting point for the current frame).
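The very first step of the first stage, finding the contiguous spots above the background brightness in one frame, is the classic flood fill. Here is a minimal Python sketch of it, assuming a fixed brightness threshold and 4-connected neighbors. It's an illustration of the old-fashioned image processing in general, not the realSEUDO code, and the tiny frame is made up:

```python
from collections import deque

def find_spots(frame, threshold):
    """Label the 4-connected groups of pixels that are brighter than threshold."""
    height, width = len(frame), len(frame[0])
    labels = [[0] * width for _ in range(height)]
    n_spots = 0
    for r0 in range(height):
        for c0 in range(width):
            if frame[r0][c0] <= threshold or labels[r0][c0]:
                continue
            # Found a bright pixel that is not in any spot yet: start a new
            # spot and flood-fill it through the bright neighbors
            n_spots += 1
            labels[r0][c0] = n_spots
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and frame[nr][nc] > threshold and not labels[nr][nc]):
                        labels[nr][nc] = n_spots
                        queue.append((nr, nc))
    return n_spots, labels

frame = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 7],
    [0, 0, 0, 7, 7],
]
n_spots, labels = find_spots(frame, threshold=5)
```

This frame contains two spots. Stacking such labeled frames along the time axis and connecting the spots that overlap between the consecutive frames is what would produce the sausage-like shapes.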

And there are other little things, like partitioning a large image into segments and processing them in parallel, and then splicing the cell shapes and activation profiles back together (this is not a new idea, CNMF does it too). Since all the high-level logic has been done in Matlab, the parallelism is a bit weirdly shaped by Matlab's abilities: there is parallelism on top with the partitioning by segments and on the bottom with the C++ code running FISTA but there is a good deal of parallelism opportunities in the middle of the splitting/merging logic that aren't used. The trouble is basically that Matlab doesn't have shared read-only data, it starts each parallelization by an expensive copying of the whole relevant state, and then ends by copying back the result. The other reason to use the segments comes from the overhead on the cell shapes: the way Matlab works, the shapes have to be represented as image-sized matrices, quadratically increasing the overhead of all the operations as the image grows. The sparse matrices (as the C++ part of the code already does) would fix that, there is such a recent new feature in Matlab, or obviously if moving the code from Matlab to another language. BTW, these are much simpler sparse matrices than ones produced in FISTA equations, here we know that all the non-0 values are located in a small bounding box, so they're trivial to implement.
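To show why the sparse matrices with a bounding box are trivial, here is a Python sketch of one: a small dense patch plus its position in the image. The class and method names are made up, and this is not the representation from the actual code, just the idea:

```python
import numpy as np

class BoxedShape:
    """Cell shape stored as a small dense patch plus its position in the image."""

    def __init__(self, top, left, patch):
        self.top = top
        self.left = left
        self.patch = np.asarray(patch, dtype=float)

    def overlap_dot(self, other):
        """Sum of products of two shapes, touching only the intersection of the boxes."""
        top = max(self.top, other.top)
        left = max(self.left, other.left)
        bottom = min(self.top + self.patch.shape[0], other.top + other.patch.shape[0])
        right = min(self.left + self.patch.shape[1], other.left + other.patch.shape[1])
        if top >= bottom or left >= right:
            return 0.0  # the boxes don't intersect
        a = self.patch[top - self.top:bottom - self.top,
                       left - self.left:right - self.left]
        b = other.patch[top - other.top:bottom - other.top,
                        left - other.left:right - other.left]
        return float(np.sum(a * b))

    def to_dense(self, height, width):
        """Expand into an image-sized matrix, as the Matlab code has to do now."""
        full = np.zeros((height, width))
        full[self.top:self.top + self.patch.shape[0],
             self.left:self.left + self.patch.shape[1]] = self.patch
        return full

cell_a = BoxedShape(2, 3, np.ones((4, 4)))
cell_b = BoxedShape(4, 5, 2 * np.ones((3, 3)))
overlap = cell_a.overlap_dot(cell_b)
```

The cost of overlap_dot depends only on the size of the intersection of the two boxes and not on the size of the image, unlike the same computation on the image-sized matrices.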

There are also things to be said about the evaluation of detected profiles (which also comes up in splicing the segments), but it's a largish subject for another post.

P.S. The code used to produce the performance graphs in the appendix can be found in https://sourceforge.net/p/triceps/code/HEAD/tree/trunk/cpp/nn/demo/seudo-paper/. The data was pulled from the logs with scripts and then the graphs were built in Matlab, but those scripts will need to be changed from Matlab to something more easily available before publishing them.

P.P.S. Some graphs from the appendix, training of a neural network with the methods implemented in Triceps:

P.P.P.S. No, I don't work at Microsoft Research. This attribution is a screw-up made by Adam Charles, which I'm working on straightening.

why compilers are called compilers

2024-09-28T00:47:00.000-04:00

An interesting tidbit from a recent IEEE Spectrum magazine https://spectrum.ieee.org/from-punch-cards-to-python: The very first "A-0 Compiler" literally compiled the program by pulling out its fragments from various tapes. As the article says:

But she needed a library of frequently used instructions for the computer to reference and a system to translate English to machine code. That way, the computer could understand what task to complete.

Such a library didn’t exist, so Hopper built her own. It included tapes that held frequently used instructions for tasks that she called subroutines. Each tape stored one subroutine, which was assigned a three-number call sign so that the UNIVAC I could locate the correct tape. The numbers represented sets of three memory addresses: one for the memory location of the subroutine, another for the memory location of the data, and the third for the output location, according to the Stanford presentation.

“All I had to do was to write down a set of call numbers, let the computer find them on the tape, and do the additions,” she said in a Centre for Computing History article. “This was the first compiler.”

unusual aggregations

2024-05-15T02:43:00.006-04:00

I've been doing a sizable amount of SQL querying, and it turns out the modern systems have some new and interesting aggregation functions. Like for example MAX_BY and MIN_BY that return the value of one expression from the row where another expression is maximal or minimal. And there are more creative functions if your system supports nested fields, array fields, or map fields.

The other real nice syntax is using the indexes of the fields in the GROUP BY clause instead of the actual expressions - this avoids the need to write the same expressions twice. It could be made even better by specifying the names of the output fields instead of indexes.

There are a few more aggregation functions that would be nice to have.

One is being able to specify aggregation in a group vs aggregation across all records. A frequent problem is to find the percentage of record count in a group relative to the whole set. This would be easy to implement as (COUNT(*) * 100 / COUNT_GLOBAL(*)). And this could be generalized to nested aggregations (similarly to how Triceps allows the nested indexes) by specifying the nesting level. For example, if we have 4 fields in a GROUP BY clause, the generalized form COUNT_G(4, *) would be equivalent to COUNT(*), COUNT_G(0, *) will be the global count, and the intermediate values would do the grouping in the larger groups by fewer fields.
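The proposed semantics of COUNT_G can be modeled in Python like this. COUNT_G is the hypothetical function from the paragraph above, it doesn't exist in any SQL system that I know of, and the rows here are made up:

```python
from collections import Counter

def count_g(rows, group_fields, level):
    """Count the rows grouped by the first `level` fields of the GROUP BY list."""
    return Counter(tuple(row[f] for f in group_fields[:level]) for row in rows)

rows = [
    {"country": "US", "city": "NYC"},
    {"country": "US", "city": "NYC"},
    {"country": "US", "city": "LA"},
    {"country": "CA", "city": "Toronto"},
]
fields = ["country", "city"]   # as in GROUP BY country, city

per_group = count_g(rows, fields, 2)         # same as COUNT(*)
per_country = count_g(rows, fields, 1)       # the intermediate level COUNT_G(1, *)
global_count = count_g(rows, fields, 0)[()]  # the global COUNT_G(0, *)

# The percentage of each group relative to the whole set
percent = {key: n * 100 / global_count for key, n in per_group.items()}
```

At level 0 the grouping key is empty, so all the rows fall into one group, and that gives the global count.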

It would also be nice to have some simpler syntax to specify the two-pass querying. It can be done with joins but it's very annoying to write those joins manually in plain SQL. I'm not sure yet what would be the good syntax, just have an idea of the semantics for now.

One example of that would be the grouping by percentiles. On the first pass you'd compute the percentiles of some expression, on the second pass you'd group the records by where they fit between these percentiles. So if you do the percentiles with a step of 10%, you'd get ten records with the summaries of these percentiles. Well, if you store the records you work on (as in a CEP system), it doesn't even have to be two passes, it can easily be a single pass where you decide in which bucket to fit a record, and then adjust all the buckets accordingly.
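Outside of SQL the two passes are easy to show, for example in Python with numpy. This is only a sketch of the semantics on made-up data (the numbers from 1 to 100), with the buckets summarized by a count and a sum:

```python
import numpy as np

values = np.arange(1, 101, dtype=float)   # made-up data: 1, 2, ..., 100

# Pass 1: compute the percentile boundaries with a step of 10%
percents = np.arange(10, 101, 10)
boundaries = np.percentile(values, percents)

# Pass 2: find the bucket for each record by where it fits between
# the boundaries, then aggregate by bucket
bucket = np.searchsorted(boundaries, values, side="left")
bucket = np.minimum(bucket, len(boundaries) - 1)   # guard for the maximal value

counts = np.bincount(bucket, minlength=len(boundaries))
sums = np.bincount(bucket, weights=values, minlength=len(boundaries))
```

As expected, it produces ten summary records with ten values in each.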

Another example of two-pass querying would be bringing the CEP windows into the plain SQL, in case you want to identify some interesting records, and then do an aggregation on the other related records that preceded them within a time interval. Again, if you have the records come in the time order and cached for that time interval, it can be done with a single pass, but even if that is not available, two passes would do it in any circumstances. Very, very useful for time series analysis.

Or if you think about two-pass aggregation as being done with a map-reduce aggregator, the first pass would seed the mapping function and instantiate the reducers, where the second pass would feed the records into these reducers.

Note also that if you write such a join manually, you end up doing essentially a transposition of a normal query, where instead of specifying an expression per record, you specify an expression per row. Something like this (in pseudo-SQL, since remembering the exact syntax boggles me, I can only copy-paste fragments but not freely write them):

WITH input AS (
    SELECT ..
),
percentiles AS (
    -- this is the transposition that propagates through
    -- the rest of the query
    SELECT low, high
    FROM (VALUES (0, 20), (20, 40), (40, 60), (60, 80), (80, 100))
        AS p(low, high)
),
boundaries AS (
    SELECT
        percentiles.high AS percent,
        PERCENTILE(input.field, percentiles.low) AS low,
        PERCENTILE(input.field, percentiles.high) AS high
    FROM percentiles, input
    GROUP BY percentiles.low, percentiles.high
)
SELECT
    percent,
    ... whatever aggregation ...
FROM boundaries, input
WHERE
    input.field >= boundaries.low
    AND input.field < boundaries.high
GROUP BY percent

It would be nice to have an easier way to do such transpositions too, even aside from the two-pass queries.

Another idea for processing the time series is the ability to build a group from a set of records with sequential IDs/ timestamps. Suppose we have time in slices, where an event might be happening or not happening in each slice, and we want to consider an uninterrupted sequence of slices with events as one event. It's easy to build a join that would put together two consecutive events. But how about a hundred events? This would be easy to do with a group-building function that tracks the first and last timestamp in the group, looks for a new record being consecutive, and merging the groups if needed. Note that this is not an aggregation function, really. It's a new concept, a group building function that replaces the fixed function of all the values in a group having the same value. It would also work for building a window around an interesting record (with or without merging the close interesting records into the same window), where the window becomes the group.
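The semantics of such a group-building function can be sketched in Python. This simplified version sorts the slices first, so it never has to merge two existing groups like an incremental implementation would. The names and the max_gap parameter are made up for the illustration:

```python
def build_groups(slice_ids, max_gap=1):
    """Build the groups of slices as (first, last) pairs, where each next
    slice in a group is at most max_gap after the previous one."""
    groups = []
    for s in sorted(slice_ids):
        if groups and s - groups[-1][1] <= max_gap:
            # Consecutive with the last group, extend it
            groups[-1] = (groups[-1][0], s)
        else:
            groups.append((s, s))
    return groups

events = [3, 1, 2, 7, 8, 12]          # the slices where the event happens
runs = build_groups(events)           # only the strictly consecutive slices
windows = build_groups(events, max_gap=4)   # also merging the close ones
```

With the default gap the result is three groups, (1, 3), (7, 8) and (12, 12), and with the gap of 4 they all merge into one window (1, 12).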

comparing the FloatNeuralNet options

2023-10-17T12:58:00.000-04:00

I've done a plot of the errors with multiple variations of the FloatNeuralNet options, using the Leaky ReLU as the least common denominator. I've added some too: for example, I've never tried before to do the momentum method without my improvements, so I've done a custom build without them. The results are interesting, I should use more graphs in the future (and add a proper graph generation rather than scraping the debugging log).

First of all, none of the non-stochastic implementations has done as well as the stochastic one. Well, they do better for the same training rate, but the catch is that the stochastic approach is stable at a much higher training rate. It works fine with a 100 times higher training rate for the same initial seed, and starts getting rough but still even faster at a 1000 times higher training rate.

The second observation is that without my changes the momentum method doesn't work at all. At all. It just breaks down right away.

The third observation is that my auto-adjustment of the training rate starts quite fast, somewhere between stochastic method's 100x and 1000x training rates but then flattens out and falls behind. So even as-is it's probably a valuable technique for quick short training.

As for why it flattens, my hypothesis is this: Imagining the optimization as a ball rolling down the hill along with the gradient, killing the momentum in the dimensions where the gradient keeps changing sign prevents the ball from wobbling in the trough and jumping out of it. It works essentially as a shock absorber. But what if the trough doesn't go predominantly in one dimension? What if it's going at a diagonal in multiple dimensions? Then the damping that prevents wobbling will also be slowing down the ball. This is I think what happens when the optimization slows down. Fortunately, the mathematicians have a way to find the dominant direction of the trough and rotate to it, the Principal Component Analysis does that. Then, I guess, the same momentum method can be applied to the rotated dimensions, and then rotated back. Unfortunately, this whole procedure is probably quite expensive. And as the ball rolls, the rotation will change, so the post-rotation dimensions will change, and it's not clear, how to match them between iterations. Maybe the rotations should be recomputed only infrequently, when the progress slows down: this will both amortize their costs and keep stability between iterations.

trends

2023-07-02T13:03:00.000-04:00

Some time ago Linkedin started sending weekly e-mail with the trends for "your kind of jobs". Every week since it's been reporting a 10% drop in job postings, then went over 10%. Well, last week it held at 0, then this week it dropped another 13%.

At the same time the government reports tout the job growth. So, is it a drop in high-paying jobs and growth in low-paying jobs? (Which is not exactly a great thing). As it happens, I know someone who runs a job posting web site that is used predominantly for low-to-mid paying jobs. And guess what, he says that the postings on his site have also sharply dropped around April. So no, not a growth of low-paying jobs either.

What does it all tell us about the government statistics? As the saying goes, "lies, damned lies, and statistics". Maybe they'll "revise" it a few months later as they've done with statistics on income after inflation, where the loudly touted slight growth above inflation had quietly turned into a ~5% loss.

the first actual use of Triceps

2023-06-03T12:55:00.002-04:00

I've recently got Triceps used for a real application, for the first time ever, after approximately 13 years. Not really a production usage, rather a scientific demo, but it's an outside project, not part of Triceps itself. Embedded in Matlab, of all things. No, not a full Matlab API, but an application written in C++ embedded as a native function into Matlab. Which is exactly what Triceps was designed for.

But really Matlab and Triceps with a full Matlab API might be a match made in heaven. Matlab really sucks at two things: the data structures where you can append and search by value, and parallelism. And those are things where Triceps excels. Yeah, Matlab has the parfor loops but the trouble there is that the parallel threads (they're more typically processes but logically still threads) are stateless. All the input data is packed up and sent to them (at a great overhead), and then the results are packed up and sent back. You can't just preserve state in a thread between two calls, it has to be sent from scratch every time. And no, it doesn't seem to be the same constant in shared memory read in parallel by multiple threads. It actually gets copied for every thread. So parfor only works well when you send a small amount of data, process it for some long time, and then send a small result back. Not well when you want your thread to make queries to a large database. But keeping state is what a Triceps thread does. The Triceps threads are also easy to organize in pipelines. (Yeah, Matlab does have some pipelines in one of extensions, but they seem as cumbersome as parfor). And the read-only shared memory would work too if queried through Triceps and only small results of the queries get returned to Matlab. It could work really, really awesomely together. The trouble of course is that I personally don't have much of an incentive to do this work.

That's the good part. What went badly? Well, Triceps in C++ feels rather dated. It's a project that was started in 2010, before C++11, and it feels like that. I didn't even aim for C++ to be the main language, but more as an API for embedding into the other languages. But now it can be done much more smoothly straight in C++. So if I ever get to it, a modernization of the C++ API is in order. Some libraries are dated too, in particular NSPR. It also proved to be a pain in building: I haven't thought much about it before, but the definitions used in building of the applications that use Triceps have to be the same as when building Triceps itself. So if Triceps is built with NSPR, and the application doesn't include the right definitions for the Triceps headers to use NSPR, it crashes in a surprising way. Fortunately, the standard C++ library now has APIs for the atomic values, so transition to that API is now in order. On the other hand, shared_ptr is a more complicated question, and keeping the Triceps Autoref might still be better for efficiency.

Bayes 29: computation for numeric values (and Python)

2023-04-15T02:09:00.000-04:00

As promised in the previous installment, here is how the data science handles the Bayesian inference directly on the numeric values. Remember from that last installment, the formula for inference by weight is:

W(H[i]|E) = W(H[i]) * P(E|H[i])

So what they do is instead of using the probability P(E|H[i]), they use the probability density p(E|H[i]) (note the lowercase p) from a previously computed distribution, this time E meaning not just a specific event variable but a specific numeric value of a specific variable. Which really makes sense, since that's what the probability density means: the probability that this specific value would come out in the distribution for this variable. Since it's a conditional distribution, only the training cases for the hypothesis H[i] are used to compute the distribution.

The typical distribution used here is the normal distribution. From what I understand, it's not only because it's the typical one occurring in reality, but also because for any original distribution, once you start repeatedly pulling the examples from it, the values produced by these examples become described by the normal distribution (if I remember right, this comes out of the Central Limit Theorem). The normal distribution has two parameters: mu (the mean value) and sigma (standard deviation, equal to the square root of the variance). The probability density function then is:

p(x) = (1/sqrt(2*pi*sigma^2)) * exp( -(x-mu)^2 / (2*sigma^2) )

Or, since sigma here is always squared, we can also replace the squared sigma with the variance var:

p(x) = (1/sqrt(2*pi*var)) * exp( -(x-mu)^2 / (2*var) )
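In plain Python, without any libraries, the density formula and the inference by weight with it can be written like this. It's a minimal sketch for one numeric variable, where the per-hypothesis values of mu and var are assumed to be already known, and the numbers are made up:

```python
import math

def normal_density(x, mu, var):
    """p(x) for the normal distribution with the mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def update_weights(weights, x, mu, var):
    """W(H[i]|E) = W(H[i]) * p(E|H[i]) for one numeric value x of one variable."""
    return [w * normal_density(x, m, v) for w, m, v in zip(weights, mu, var)]

# Two hypotheses with equal prior weights, the means of 0 and 1, variance of 1
weights = update_weights([1.0, 1.0], x=1.0, mu=[0.0, 1.0], var=[1.0, 1.0])
```

The value of 1.0 falls right on the mean for the second hypothesis, so its weight comes out higher than for the first one.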

The values of mu and var would be different for each hypothesis H[i]. So we'd split all the training cases into subsets by mutually exclusive hypotheses, and then compute:

mu[E][i] = sum(cases[i][E]) / N[i]
var[E][i] = sum((cases[i][E] - mu[E][i])^2) / (N[i] - 1)

Here N[i] is the count of the training cases with the outcome H[i], and cases[i] are all the cases with that outcome, effectively N[i] = count(cases[i]). The reason why we use (N[i] - 1) instead of plain N[i] is that it computes not a variance but a sample variance. This comes out of a rather interesting distinction between the probability theory and statistics: the probability theory describes the random distributions where the parameters of these distributions are known in advance, while statistics describes how to deduce these parameters from looking at the samples of the data. Here we obviously don't somehow know the distribution in advance but have to deduce it from looking at the training cases, i.e. the data samples, so we have to use the statistics and its sample variance. Technically speaking, mu here is also not the true mean but the sample mean, but the formula for it is the same anyway. However the need to divide by (N[i] - 1) comes from the sample mean producing a different value that is offset from the true mean, with the sample variance counter-adjusting for this offset. And the reason this offset matters is that the sample mean is computed from the very same cases, so it always sits at least as close to them as the true mean does, and the sum of the squared differences from it comes out systematically too small. Dividing by (N[i] - 1) instead of N[i] compensates for that on average (this is known as Bessel's correction). If we magically knew the true mean, we'd use the plain N[i] for the sample variance too.

And here is a good spot to talk about Python. As much as I detest the language, I see why it's popular for the data science: it has some very good libraries for this purpose, such as numpy, sklearn, matplotlib. They've even added an operator of matrix multiplication @ into the language for the convenience of these libraries. And the point of the matrix operations is not only that they allow writing things down in a form that is short and comfortable for the data scientists (although confusing as hell for the rest of us), but also that they move all the tight loops into the compiled C and C++ code that is way faster than the Python code, and let the library internally parallelize the computations as needed. So the computations in a slow interpreted language with questionable parallelism become quite fast and quite parallel. Aside from that, the libraries provide other conveniences. Sklearn even has a bunch of popular data sets built right into it. So in Python the code to compute mu and var looks like this:

import numpy as np

# X is a numpy matrix, where each row corresponds to a training case
# and each column to an input variable (i.e. a numeric event);
# Y is a numpy vector containing the outcomes for all the training cases;
# mu and var are indexed by outcome and get one row per outcome
# (e.g. dictionaries or pre-allocated matrices, an assumption here)

# Find the subset of the training cases that have the outcome i,
# it produces a vector of the same size with True for
# each included case and False for each excluded case.
selector = (Y == i)

# Select the subset of training cases for i, keeping all the columns
subset = X[selector, :]

# Count the included cases, which is the number of rows in the subset
n = subset.shape[0]

# Compute the sample mean, by averaging across all rows (axis=0). 
# The result is a row with a column per input.
mu[i] = subset.mean(axis = 0)

# Compute the sample variance, again producing a row with a
# column per input. The tricky part is that mu[i] is a single row,
# so the subtraction operator automatically considers it as a matrix
# that has as many rows as subset, with all the rows being the same.
# Sum adds up the columns across all the rows (axis=0) into one row.
# And division by a scalar divides each column by the same scalar.
# The result is a row with a column per input.
var[i] = np.square(subset - mu[i]).sum(axis=0) / (n - 1)
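
As a quick sanity check of the snippet above, here is a minimal self-contained sketch on made-up data. The arrays X and Y and the helper name class_stats are my own inventions for illustration. It computes the per-outcome sample mean and sample variance the same way, and compares the result to numpy's built-in var() with ddof=1, which is the library's way of asking for the (n - 1) divisor.

```python
import numpy as np

def class_stats(X, Y, i):
    """Return the rows (mu, var) for the training cases with outcome i."""
    subset = X[Y == i, :]
    n = subset.shape[0]
    mu = subset.mean(axis=0)
    # mu is a single row, so it gets broadcast across all rows of subset.
    var = np.square(subset - mu).sum(axis=0) / (n - 1)
    return mu, var

# Made-up training data: 6 cases, 2 inputs, 2 outcomes.
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0],
              [7.0, 1.0], [9.0, 3.0], [8.0, 2.0]])
Y = np.array([0, 0, 0, 1, 1, 1])

mu0, var0 = class_stats(X, Y, 0)   # mu0 is [2, 12], var0 is [1, 4]

# numpy can do the same in one call: ddof=1 selects the (n - 1) divisor.
builtin_var0 = X[Y == 0].var(axis=0, ddof=1)
```

With the default ddof=0 numpy divides by plain n, so forgetting the argument silently produces the smaller, biased value.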

But wait, there is more. Note how the probability density function has an exponent function in it. Instead of computing the weights by taking the exponents and multiplying, we could compute the logarithms of the weights, by adding up the logarithms (and the logarithm of the exponent is the argument of the exponent, saving this function computation). So the formula becomes:

logW(H[i]|E) = logW(H[i]) + log(p(E|H[i])) =
  = logW(H[i]) + log(1/sqrt(2*pi*var[i, E])) 
             - (x[E] - mu[i, E])^2 / (2*var[i, E])

The part log(1/sqrt(2*pi*var[i, E])) is a constant that can be pre-computed once after the training, so the computation with a few multiplications and additions is quite efficient.
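
Here is a rough sketch of how this log-weight formula might look in numpy, under the assumption that mu and var are matrices with a row per hypothesis and a column per input. The function name log_weights and all the numbers are made up for illustration, this is not how any particular library does it.

```python
import numpy as np

def log_weights(x, log_prior, mu, var):
    """Return the log-weight of every hypothesis for one input row x.

    log_prior has one element per hypothesis; mu and var have
    one row per hypothesis and one column per input.
    """
    # The constant part log(1/sqrt(2*pi*var)); a real classifier would
    # compute it once after the training, not on every call.
    log_norm = -0.5 * np.log(2 * np.pi * var)
    # x is a single row, so it gets broadcast across the rows of mu.
    # Sum adds up the log-densities across the inputs (axis=1).
    log_p = (log_norm - np.square(x - mu) / (2 * var)).sum(axis=1)
    return log_prior + log_p

# Made-up parameters for 2 hypotheses and 2 inputs.
log_prior = np.log(np.array([0.5, 0.5]))
mu = np.array([[2.0, 12.0], [8.0, 2.0]])
var = np.array([[1.0, 4.0], [1.0, 1.0]])
x = np.array([2.5, 11.0])

lw = log_weights(x, log_prior, mu, var)
best = int(np.argmax(lw))   # index of the hypothesis with the largest weight

# To get back to the probabilities, take the exponents and normalize.
# Subtracting the maximum first keeps exp() from underflowing to all zeros.
w = np.exp(lw - lw.max())
prob = w / w.sum()
```

Picking the winner needs only the comparison of the log-weights, so the exponents at the end are optional and needed only when the actual probabilities are wanted.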

Bayes 28: computation by weight revisited

2023-04-10T02:03:00.000-04:00

I've found out recently how the people in the data science compute the Bayesian inference for values from an arbitrary numeric range (as opposed to yes/no events). As it turns out, they do it using a computation by weights. Ever since discovering the computation by weight on my own, I've been wondering why nobody else uses it, since it's so much simpler than by probabilities. So the answer is similar to what I've discovered before for the connection between the Bayes and neural networks: the people in the field do know about it and do use it, only the simplified popular explanations don't.

I wanted to write down the explanation of how they do it (at least, for myself in the future), and so I went back to read what I wrote before about the Bayesian computation by weight, and found that what I wrote before is quite complicated. I wrote it down as I discovered it when experimenting with all those extra parameters that make a Bayesian system work in reality, and so that explanation is also full of discussion of those parameters. Also, I've learned some things since then.

Before going into the new ground, let me try a new take on the same thing: discuss the Bayesian inference by weight, only now in its most basic form.

Let's start again with an event E and hypothesis H. In the modern terms, they call this machine a Bayesian classifier, and call the mutually-exclusive hypotheses classes. The events are the inputs, and based on the values of the inputs the machine is trying to decide, which hypothesis is most likely to be true. The classic Bayesian formula for a binary event E is:

P(H|E) = P(H) * P(E|H) / P(E)

Here P(H) is the prior (i.e. previous) probability that the hypothesis is true, P(E) is the prior probability that the event E will be found true (after we learn the actual value of the event this probability collapses), P(E|H) is the conditional probability that the event would be true for the cases where hypothesis H is true, and finally P(H|E) is the new (posterior) probability of H after we learn that E is true. If we learn that E is false, we would compute P(H|~E) instead, for the complementary event ~E.

Where did that formula come from? Well, when talking about probabilities, it really helps to draw a diagram that starts with the range of all the possible cases and then splits it into parts with the appropriate weights. Let's draw this field as a square, in beautiful ASCII art. And then split it with a horizontal line so that the area above the line matches the number of the cases resulting in the hypothesis H1 being true, and below the line with H1 being false (i.e. the complementary hypothesis ~H1, which we'll also name H2, being true). This is for a very simple classification with two complementary hypotheses: H1 = ~H2 and H2 = ~H1.

+---------------+
|               |
|               | H1
|               |
+---------------+
|               |
|               | H2 = ~H1
|               |
+---------------+

Now let's do the same, but split the same square vertically. The left side will match the event E1, and the right side will be its complementary ~E1:

+---------+-----+
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
|         |     |
+---------+-----+
    E1      ~E1

Now let's take the second picture and split each vertical part horizontally, into two parts, that again match H1 on top and H2 on the bottom. I've marked these parts with the letters a, b, c, d.

+---------+-----+
| a       | b   |
|         |     |
|---------|     | H1
| c       |     |
|         |     |
|         |-----|
|         | d   | H2
+---------+-----+
    E1      ~E1

Here the parts a and b correspond to H1, and their total area is equal to the original area (within the ASCII-art limitations) of the upper part of the split in the first picture. The parts c and d correspond to H2, and again the sum of their areas is equal to the original area of the lower part. But obviously they're split differently on the left and right sides, meaning that if we know that E1 is true, H1 has a lower probability than H2, but if E1 is false, H1 has a higher probability. And if we don't differentiate based on E1, the left and right parts average out.

The areas of these parts a...d are the weights of four sub-divisions:

a: E & H1
b: ~E & H1
c: E & H2 (or equivalently E & ~H1)
d: ~E & H2 (or equivalently ~E & ~H1)

The reason why I prefer to use H2 instead of ~H1 is that this notation allows to generalize more obviously to more than two hypotheses: H3, H4, and so on. Each additional hypothesis would add two areas to the picture, one on the left and one on the right.

Now we can express the probabilities through relations of these areas (and areas can also be called weights):

P(H1) = (a + b) / (a + b + c + d)
P(H2) = (c + d) / (a + b + c + d)
P(E) = (a + c) / (a + b + c + d)
P(~E) = 1 - P(E) = (b + d) / (a + b + c + d)
P(E|H1) = a / (a + b)
P(E|H2) = c / (c + d)
P(H1|E) = a / (a + c)
P(H2|E) = c / (a + c)
P(H1|~E) = b / (b + d)
P(H2|~E) = d / (b + d)

The general principle for P(x|y) is that the area that satisfies both conditions x and y becomes the numerator, and the area that satisfies y becomes the denominator.

Alright, let's substitute these formulas into the Bayesian formula:

P(H1) * P(E|H1) / P(E) = P(H1) * (1/P(E)) * P(E|H1)
  = ((a + b) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (a / (a + b))
  = a / (a + c)
  = P(H1|E)
  
P(H2) * P(E|H2) / P(E) = P(H2) * (1/P(E)) * P(E|H2)
  = ((c + d) / (a + b + c + d))
    * ((a + b + c + d) / (a + c))
    * (c / (c + d))
  = c / (a + c)
  = P(H2|E)
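
The same check can be done on concrete numbers. This is a small sketch with made-up case counts for the four areas (the values 3, 4, 6, 2 are purely illustrative), using exact fractions so that the equality can be compared without any rounding.

```python
from fractions import Fraction

# Made-up counts of cases in the four areas of the picture:
# a = E & H1, b = ~E & H1, c = E & H2, d = ~E & H2.
a, b, c, d = 3, 4, 6, 2
total = a + b + c + d

p_h1 = Fraction(a + b, total)
p_h2 = Fraction(c + d, total)
p_e = Fraction(a + c, total)
p_e_given_h1 = Fraction(a, a + b)
p_e_given_h2 = Fraction(c, c + d)

# The Bayesian formula...
bayes_h1 = p_h1 * p_e_given_h1 / p_e
bayes_h2 = p_h2 * p_e_given_h2 / p_e

# ...produces the same values as reading them directly off the areas.
assert bayes_h1 == Fraction(a, a + c)   # P(H1|E) = 1/3
assert bayes_h2 == Fraction(c, a + c)   # P(H2|E) = 2/3
```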

So that's where that formula comes from and how it gets proven. Note that the computation of both probabilities involves the final division by the same value (a + c). If we just want to compare them, this division by the same value doesn't matter and we can skip it. Instead we'll just get a and c, which are also the weights W(H1|E) and W(H2|E), and if we want to get the probabilities, we can normalize by dividing by their sum. Or I guess rather than W(H1|E) we could say W(H1 & E) but that's the beauty of working with the weights instead of probabilities: the probabilities require normalization at each step, dividing by the sum of all possible weights, while with weights the sum is kept implicit, and W(H1|E) = W(H1 & E) = W(E|H1). When expressed in weights, the formulas become simpler:

W(H1) = a + b
W(H1|E) = W(H1) * P(E|H1) = (a + b) * (a / (a + b)) = a
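
As a minimal sketch of the computation by weight, here is the same thing on made-up numbers. The areas a=3, b=4, c=6, d=2 and the function names apply_event and normalize are invented for this illustration. The weights get multiplied by the conditional probabilities, and the division by the sum is done only once at the end, if the probabilities are needed at all.

```python
from fractions import Fraction

def apply_event(weights, p_e_given_h, e_is_true):
    """Multiply the weight of each hypothesis by P(E|H) or P(~E|H)."""
    return [w * (p if e_is_true else 1 - p)
            for w, p in zip(weights, p_e_given_h)]

def normalize(weights):
    """Convert the weights to probabilities by dividing by their sum."""
    total = sum(weights)
    return [w / total for w in weights]

# Made-up areas a=3, b=4, c=6, d=2 give the starting weights
# W(H1) = a + b and W(H2) = c + d.
weights = [Fraction(7), Fraction(8)]
# P(E|H1) = a / (a + b), P(E|H2) = c / (c + d)
p_e_given_h = [Fraction(3, 7), Fraction(6, 8)]

w_true = apply_event(weights, p_e_given_h, True)     # [a, c] = [3, 6]
w_false = apply_event(weights, p_e_given_h, False)   # [b, d] = [4, 2]
probs = normalize(w_true)                            # [1/3, 2/3]
```

The same two functions work unchanged for any number of hypotheses, since nothing in them depends on the list having exactly two elements.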

That's pretty much it. But there is one more question to consider: what if we have more than one event? We usually do have multiple events. After we compute P(Hi|E1) (or W(Hi|E1)), now we have a smaller rectangle left, with the level of the horizontal divider shifted from where it was in the previous square (or rectangle). What do we do with the next event? There are multiple ways to look at it.

One way is to hope that the events are completely independent from each other. This basically means that as we look at each part produced on splitting by E1 (E1 and ~E1), and further split each of these parts by E2, the horizontal lines in each vertical quarter shift according to the current level of H in the part being split, with the result that P(E2|E1,Hi) = P(E2|Hi). It would look something like this:

+----+----+--+--+
| e  | f  |g |h |
|----|    |  |  |
|    |    |  |  | H1
|    |----|  |  |
|    |    |--|  |
|    |    |  |  |
|    |    |  |--| H2
+----+----+--+--+
 E1   E1   ~E1 ~E1
 E2   ~E2  E2  ~E2

The equality P(E2|E1,H1) = P(E2|H1) = P(E2|~E1,H1) would imply (even though it doesn't quite look like it in ASCII art):

e / (e + f) = (e + g) / (e + f + g + h) = g / (g + h)

That's a simple-minded implementation (in the scientific circles they call it naive, sometimes even spelled with two "i"s). The problem there is that when the assumption is not true and the events are strongly dependent, this can drive probabilities in weird ways.
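
To illustrate how a broken independence assumption drives the probabilities, here is a small sketch (the function name naive_posterior and all the numbers are made up). It applies the events one by one as if they were independent, and then feeds in the same event twice, pretending that E2 is a separate event when it's really an exact duplicate of E1.

```python
from fractions import Fraction

def naive_posterior(priors, p_e_given_h, observed):
    """Apply all the events as if they were independent.

    priors[i] is P(Hi), p_e_given_h[i][k] is P(Ek|Hi),
    observed[k] is True or False for the event Ek.
    """
    weights = list(priors)
    for k, is_true in enumerate(observed):
        for i in range(len(weights)):
            p = p_e_given_h[i][k]
            weights[i] *= p if is_true else 1 - p
    total = sum(weights)
    return [w / total for w in weights]

priors = [Fraction(1, 2), Fraction(1, 2)]

# One event with P(E1|H1) = 3/4 and P(E1|H2) = 1/4, found true.
once = naive_posterior(priors, [[Fraction(3, 4)], [Fraction(1, 4)]], [True])

# The same event counted twice, as E1 and its duplicate E2.
twice = naive_posterior(priors,
                        [[Fraction(3, 4)] * 2, [Fraction(1, 4)] * 2],
                        [True, True])
```

Here once comes out as P(H1) = 3/4, while twice comes out as P(H1) = 9/10, even though the duplicate event brought no new information.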

Another way would be to build the tree of exact splits: split the slices produced by E1 into slices for E2, then for E3, and so on, and for each vertical slice after each split find the exact proportion of cases. This is obviously much more complex, and complexity grows exponentially with the number of events.
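
A bare-bones sketch of the exact splits could look like this, assuming that the training cases come as pairs of (tuple of event values, hypothesis). The function names and the training cases are made up. Every distinct combination of event values gets its own slice with its own counts, so with n binary events there can be up to 2**n slices.

```python
from collections import Counter
from fractions import Fraction

def build_exact_table(cases):
    """Count the training cases per hypothesis in each exact slice."""
    table = {}
    for events, hyp in cases:
        table.setdefault(events, Counter())[hyp] += 1
    return table

def exact_posterior(table, events):
    """Return P(H | exact combination of events) from the counts."""
    counts = table[events]
    total = sum(counts.values())
    return {hyp: Fraction(n, total) for hyp, n in counts.items()}

# Made-up training cases with two events (E1, E2).
cases = [
    ((True, True), "H1"), ((True, True), "H1"), ((True, True), "H2"),
    ((True, False), "H2"),
    ((False, True), "H1"),
    ((False, False), "H2"), ((False, False), "H2"),
]

table = build_exact_table(cases)
post = exact_posterior(table, (True, True))   # H1: 2/3, H2: 1/3
```

Besides the exponential growth, a combination that never showed up in the training cases has no slice at all (this sketch would raise a KeyError for it), which is another practical problem with this approach.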

The third way (I think I've just realized it from reading the materials about data science) would be to track the splits by events pair-wise, do the vertical splits by each pair: (E1, E2), (E1, E3), ..., (E1, En), (E2, E3), (E2, E4), ..., (E2, En), and so on. I haven't quite thought through yet the adjustments that would need to be computed for each split. I guess, fundamentally it would be a representation of the covariance matrix (another word I've learned). But I guess there are two potential ways to do it: one would be to adjust the positions of the horizontal lines after the second split, another would be to adjust the positions of the vertical lines for the second split. I'm not sure which one is more correct. Maybe either way would adjust the ratio of the areas on the left and right to the same value. But either way, after you apply E1, adjust the probabilities of E2, E3, and the rest of the events according to this matrix. Then after applying E2, adjust E3, E4, and so on. It won't be a perfect match that can be produced with the exact splits but it would easily catch the events that are duplicates and near-duplicates of each other.

VNC How-to

2023-03-06T00:31:00.001-05:00

Theoretically speaking, an X terminal can work remotely through a tunnel with ssh -X. But in practice it does that very, very slowly. I don't understand why, but the X applications do a huge number of synchronous requests, which become very slow when the RTT is high. The thing that allows using it over long distances is called VNC. Its documentation is surprisingly poor, and so are the recorded talks, but I've finally figured it out.

All the VNC varieties grow from one source: the original VNC that was developed at a research lab in Cambridge (Olivetti & Oracle Research Lab, later part of AT&T) and published as Open Source, with its authors later going commercial as RealVNC. (NX and its Open Source offshoot FreeNX, which tend to come up in the same searches, are a different technology from a commercial company, NoMachine.) The next major branch was TightVNC, and then TigerVNC branched off (but from what I understand, TigerVNC is still backwards-compatible with TightVNC). Nowadays either TightVNC or TigerVNC or both are included in the Linux distributions as packages.

Running is fairly easy. On the remote machine start:

xvncserver -geometry 1800x1100 -alwaysshared :1

Change the geometry and display number to taste (or just skip the display number altogether, it will pick the first free one). "-alwaysshared" means that it will allow multiple parallel connections. For some reason it doesn't allow setting the DPI (the display resolution), but you can make a copy of the script and add the option -dpi to the X server command (though it also looks like almost nothing nowadays pays any attention to the DPI set in the X server).

You can start this from an SSH session, and it will keep running after you close the session. It doesn't use SSH tunneling but opens its own socket, and does its own encryption and password on it (it asks to create the password on the first start). Caveat: apparently some 10 years ago a bug was found where VNC allowed bypassing the password on connection. So better not open it to the wide Internet, just in case, but then connect the display through an SSH tunnel that forwards to the VNC server (it's easy, just one option on the client).

To kill the server later, do

xvncserver -kill :1

To connect to the server, run on the display side:

xvncviewer -via user@gateway host:1

Here host is the name of the host with the VNC server, and "-via" means to go through an SSH tunnel before connecting to the host. The screen and all applications stay alive between the connections, which is awesome (just like RDP).

That's it, just two commands. It's somewhat inconvenient that Alt-Tab for window switching gets consumed by the local machine, but you can redefine an alternative combination on the remote machine instead (at least in the civilized session managers like MATE or Cinnamon).

A little on how it works: the VNC server on the remote machine is itself a special X server (Xvnc) that draws into a frame buffer in memory instead of a real screen. The applications connect to it locally on the remote machine, so all the synchronous X requests get handled right there without crossing the network. And the remote server stays alive all the time, so it preserves the session state between connections. But there is more to it, since the protocol between the server and the viewer is not the regular X protocol under encryption but a separate one (RFB, the remote frame buffer protocol) that ships the changed pieces of the screen one way and the keyboard and mouse events the other way. There are even multiple versions and encodings of this protocol, but fortunately the client and server are smart enough to negotiate them, and it just works.