To recap, "inlining" is when we complete a future that has some function chained to it (directly or through other futures), and that function gets executed immediately, right from the completion library call. The opposite is "trampolining", where this function gets put into a scheduler queue and executed later.
Inlining saves the cost of scheduling and also keeps the cache hot: completing a future means that we've just stored some value into it, so if the chained function reads that value (and the other values connected to it) right away, it will still be in the cache.
However, inlining can also be self-defeating: suppose we want to complete ten futures, each with a function chained to it. If trampolined, ten CPUs can pick them up from the scheduler queue and execute them in parallel. Inlining, however, would inadvertently force them to execute sequentially on the completing thread.
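For concreteness, here is a minimal C++ sketch contrasting the two strategies. The Future type, its single chained callback, and the schedulerQueue are invented for the illustration, not taken from any particular library:

    #include <deque>
    #include <functional>

    struct Future {
        int value = 0;
        std::function<void(int)> chained; // the function chained to this future
    };

    std::deque<std::function<void()>> schedulerQueue; // stand-in for the scheduler's queue

    // Inlining: the chained function runs right inside the completion call,
    // while the freshly written value is still hot in the cache.
    void complete_inline(Future &fut, int v) {
        fut.value = v;
        if (fut.chained)
            fut.chained(fut.value);
    }

    // Trampolining: the chained function is queued and runs later, possibly on
    // another thread, possibly in parallel with the callbacks of other futures.
    void complete_trampolined(Future &fut, int v) {
        fut.value = v;
        if (fut.chained)
            schedulerQueue.push_back([&fut] { fut.chained(fut.value); });
    }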
The reality is that inlining is only efficient when it's done at the tail of the current function. Conversely, the problems with inlining (stack overflows, bad interactions with mutexes, and the serialization of parallel execution) are all avoided if the inlined function gets called only after the current function returns.
Put this way, the straightforward solution is to replace inlining with a special case of trampolining through a "delayed scheduling slot": a thread-local variable in the scheduler, large enough to hold a reference to a single scheduled function. Whenever a future is completed, put one chained function into that slot and schedule the rest as usual. If the slot is already occupied, there are two options: either leave it as-is and schedule all the new functions normally, or, on the premise that the later completion has a hotter cache, move the old contents of the slot into the normal scheduling queue and put the new function into the slot. Then, when the current asynchronous function completes, the scheduler code checks the delay slot and, if it's not empty, calls the function from there.
This can be expressed in pseudocode:
thread_local<Function> delaySlot;

complete_future(Future fut)
{
    FunctionList funcs;
    FutureList recfut;
    recfut.insert(fut);
    for (f in recfut) {
        scopedlock(f);
        if (f.completed)
            continue; // a double completion, nothing to do
        f.completed = true;
        funcs.merge(f.chained_functions);
        f.chained_functions.clear();
        recfut.merge(f.chained_futures); // the chained futures get completed recursively
        f.chained_futures.clear();
    }
    if (delaySlot.empty() && funcs.size() == 1) {
        delaySlot = funcs.front();
    } else if (!funcs.empty()) {
        scopedlock(schedulerQueue);
        if (!delaySlot.empty()) {
            // displace the older function into the normal queue
            schedulerQueue.insert(delaySlot);
        }
        delaySlot = funcs.front();
        funcs.pop_front();
        for (fun in funcs) {
            schedulerQueue.insert(fun);
        }
    }
}

scheduler_thread()
{
    while (true) {
        Function f;
        if (!delaySlot.empty()) {
            f = delaySlot; // the delayed function runs first, its cache is the hottest
            delaySlot.clear();
        } else {
            f = getNextFromQueue();
        }
        execute(f);
    }
}
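The same idea can be rendered in runnable, if simplified, C++. This is only a sketch: it assumes one function per scheduling call, a mutex-protected global queue, and that futures are completed on scheduler threads; it implements the "displace the older slot contents" option described above, and the names mirror the pseudocode but the types are invented for the example:

    #include <deque>
    #include <functional>
    #include <mutex>

    using Function = std::function<void()>;

    thread_local Function delaySlot;     // one delayed function per scheduler thread
    std::mutex queueMutex;
    std::deque<Function> schedulerQueue;

    // Called from the future completion for each chained function.
    void schedule(Function f) {
        if (!delaySlot) {
            delaySlot = std::move(f);    // keep the hottest continuation local
        } else {
            std::lock_guard<std::mutex> lk(queueMutex);
            schedulerQueue.push_back(std::move(delaySlot)); // displace the older one
            delaySlot = std::move(f);
        }
    }

    Function getNextFromQueue() {
        std::lock_guard<std::mutex> lk(queueMutex);
        if (schedulerQueue.empty())
            return nullptr;              // a real scheduler would block here instead
        Function f = std::move(schedulerQueue.front());
        schedulerQueue.pop_front();
        return f;
    }

    void scheduler_thread() {
        while (true) {
            Function f;
            if (delaySlot) {
                f = std::move(delaySlot); // the delayed function goes first
                delaySlot = nullptr;
            } else {
                f = getNextFromQueue();
            }
            if (f)
                f();
        }
    }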
Another interesting point is that cache locality is also improved by unfair scheduling: inserting the new functions at the front of the scheduling queue, on the premise that the more recent inserts have a hotter cache. It's not exactly unfair either: remember that in asynchronous programming the sequential execution gets broken up into a sequence of separate small functions. So the most recently scheduled function is likely the continuation of the previous one, and running it first is completely fair. The scheduling queue then becomes a representation of the call stack, with the innermost "return addresses" at the front of the queue.
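A sketch of that variant, again with names invented for the example: the only real change from the queue above is that new work goes to the front of the deque, so it pops back out in call-stack order.

    #include <deque>
    #include <functional>
    #include <mutex>

    std::mutex stackMutex;
    std::deque<std::function<void()>> workStack; // front = innermost "return address"

    // LIFO scheduling: the most recently scheduled continuation,
    // with the hottest cache, is the first one to run.
    void schedule_lifo(std::function<void()> f) {
        std::lock_guard<std::mutex> lk(stackMutex);
        workStack.push_front(std::move(f));
    }

    std::function<void()> next_lifo() {
        std::lock_guard<std::mutex> lk(stackMutex);
        if (workStack.empty())
            return nullptr;
        std::function<void()> f = std::move(workStack.front());
        workStack.pop_front();
        return f;
    }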
This is very similar to Triceps's scheduling logic, and follows from the same reasoning. Or to put it differently, this is why Triceps's scheduling logic should also be used for asynchronous programming in general.