Saturday, May 9, 2020

consistent time and loops

It's obvious that the graph loops cannot be treated like the rest of the links with consistent time, or there would be no pipelining: we'd be forced to wait for an update from the looping link every time we send a record down the loop. Or we might end up sending a whole lot of records down the loop before reading back from the looping link, accumulating them all in the queue of the looping link.

So what is the right thing to do? I think it depends on the circumstances, on what we're trying to achieve. I can think of the following uses:

1. Record the processing in real time and be able to reproduce it exactly in an accelerated playback.

2. Process the records in accelerated time to start with but, just as in real time, don't care too much about the exact outcome of the first run as long as it's within reasonable constraints (while still being able to replay it later in exactly the same way).

3. Process records in accelerated time to start with, and make sure that they get treated in an exactly consistent way, any run from scratch producing the exact same result.

The case (1) can be reasonably resolved by treating the looping links like the inputs from the external systems where we don't particularly care about the exact synchronization: re-timestamp the incoming records, log them, and then process them with the new timestamp. On replay, read the logged records with their timestamps and process them in the same way.

A few more words about how the re-timestamping would work: tag the record in the log with both the original timestamp and the new one, and use the new timestamp in the further processing. A simple-minded way to handle the replay would be to just read the log, and ignore the records sent through the looping link on the replay (since they presumably should be the same). A smarter way would be to receive the records from the link and compare them with the records in the log (including the original timestamp). If they match, all is well, and the new timestamp from the log can be used. If they diverge then something somewhere has changed, and either the replay should be aborted or some mitigation should be done, perhaps in the same way as for the case (2).
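
As an illustration, here is a minimal sketch of what the log entry and the replay check could look like. The types and function names are hypothetical, made up just for this sketch, and not part of any existing Triceps API:

.
// A hypothetical sketch of the re-timestamping log for the case (1).
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct LoggedRecord {
    uint64_t origTs;   // timestamp the record carried on the looping link
    uint64_t newTs;    // timestamp assigned at the re-timestamping point
    std::string data;  // opaque record payload
};

// On the live run: remember both timestamps in the log, then process
// the record using the new one.
void logLoopedRecord(std::vector<LoggedRecord> &log,
                     uint64_t origTs, uint64_t newTs, const std::string &data)
{
    log.push_back(LoggedRecord{origTs, newTs, data});
}

// On the replay: compare the record arriving from the looping link with
// the logged one. If they match, reuse the logged new timestamp; if they
// diverge, return nothing, so that the caller can abort the replay or do
// some mitigation, perhaps as in the case (2).
std::optional<uint64_t> replayLoopedRecord(const LoggedRecord &logged,
                                           uint64_t origTs, const std::string &data)
{
    if (logged.origTs == origTs && logged.data == data)
        return logged.newTs;
    return std::nullopt;
}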

So, how to solve the case (2)? The trouble there is that the model's clock gets driven by the incoming records that get queued, allowing the clock to go far ahead before the records get processed and arrive back through a looping link. The solution seems to be to limit how far the records can be delayed at the re-timestamping point. Suppose we choose some time interval that is large enough to let the records go through the loop (to allow the queues to do an efficient pipelining) but not outlandishly large; let's call this interval L. Say, 1 second or 0.1 second. Set it as the limit. Then when a record arrives through the looping link (such as, in the figure below, B receiving a record back from D) with a timestamp T, we re-stamp it with a timestamp that is the smaller of: (T + L) and the lowest timestamp of the next record in the incoming queues (going up the links as usual to get the timestamp of the record they would send next; and since the records from the looping links are processed first, this in effect means "go just before that record").

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This would essentially allow the looping links to "run behind the clock" by up to L. So if B receives from A a record with timestamp T2, it can treat the looping link "almost normally", by asking it if it has any earlier records, but using the time limit (T2 - L) instead of T2 (as it would for the normal "downstream" links). This is very much the same as the case (1), only it limits by how much the clock could run ahead. This solution would actually work fine for both the cases (1) and (2). And since it preserves the high priority of the looping links, it would prevent them from buffering too much data.

It can also be stretched to cover (3) by always using (T + L) for the looped records, and not the optimistic minimum. The trouble there is that a large number of records might enter the loop in the interval L, and they will collect on the looping link's queue. But it should be fine for the small test models with a small amount of data.
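
Here is a minimal sketch of that re-stamping rule, with a flag for the strict variant just described. The function and parameter names are made up, and the lowest upcoming timestamp is assumed to be already computed by the caller by walking up the incoming links:

.
#include <algorithm>
#include <cstdint>

// origTs:       timestamp T carried by the record on the looping link
// limitL:       the chosen interval L (say, 0.1 or 1 second in clock units)
// lowestNextTs: lowest timestamp of the next record in the incoming queues
// strict:       true for the case (3), where the result must not depend
//               on the run-time timing at all
uint64_t restampLooped(uint64_t origTs, uint64_t limitL,
                       uint64_t lowestNextTs, bool strict)
{
    uint64_t delayed = origTs + limitL;
    if (strict)
        return delayed; // case (3): always delay by exactly L
    // cases (1) and (2): take the earlier of the two, which in effect
    // places the looped record just before the next incoming record
    return std::min(delayed, lowestNextTs);
}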

I've thought about fixing this issue by using a semi-blocking queue on the looping link: essentially compose the looping link from a blocking queue with the usual limit (and special logic) followed by a non-blocking queue. The re-timestamping would happen when moving the records from the blocking queue to the non-blocking one. If the non-blocking queue is empty, treat the records in the blocking queue as delayed by L relative to their original timestamps, and move them to the non-blocking queue when the time (T + L) is reached. Unless the loop is about to deadlock. Then instead of deadlocking, move the first record from the blocking queue on the looping link to the non-blocking queue, re-stamping it with the lowest timestamp of the next record in the incoming queues, as in the solution for (2). Since the total depth of the queues in a loop is fixed for a given model, that would make the ordering of the records fully predictable, even if they get processed before (T + L).
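
Here is a minimal single-threaded sketch of such a composed queue, with the deadlock detection and the computation of the lowest incoming timestamp left to the caller; all the names are made up for the sketch:

.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <string>

struct Rec {
    uint64_t ts;      // the record's timestamp
    std::string data; // opaque payload
};

class SemiBlockingLoopQueue {
public:
    SemiBlockingLoopQueue(std::size_t limit, uint64_t limitL)
        : limit_(limit), limitL_(limitL) {}

    // Returns false if the bounded stage is full and the writer would block.
    bool push(const Rec &r) {
        if (bounded_.size() >= limit_)
            return false;
        bounded_.push_back(r);
        return true;
    }

    // The timestamp at which the next record would become available: if the
    // non-blocking stage is empty, the records in the blocking stage are
    // treated as delayed by L relative to their original timestamps.
    std::optional<uint64_t> nextTs() const {
        if (!unbounded_.empty())
            return unbounded_.front().ts;
        if (!bounded_.empty())
            return bounded_.front().ts + limitL_;
        return std::nullopt;
    }

    // Move the records whose delayed timestamp (T + L) has been reached
    // into the non-blocking stage, re-stamping them with (T + L).
    void advanceTo(uint64_t now) {
        while (!bounded_.empty() && bounded_.front().ts + limitL_ <= now) {
            Rec r = bounded_.front();
            bounded_.pop_front();
            r.ts += limitL_;
            unbounded_.push_back(r);
        }
    }

    // Called when the loop is about to deadlock: move the first record
    // early, re-stamping it with the lowest timestamp of the next record
    // in the incoming queues, as in the solution for (2).
    void breakDeadlock(uint64_t lowestNextTs) {
        if (bounded_.empty())
            return;
        Rec r = bounded_.front();
        bounded_.pop_front();
        r.ts = lowestNextTs;
        unbounded_.push_back(r);
    }

    // The loop's head node reads from the non-blocking stage.
    bool pop(Rec &out) {
        if (unbounded_.empty())
            return false;
        out = unbounded_.front();
        unbounded_.pop_front();
        return true;
    }

private:
    std::size_t limit_;          // the usual queue limit of the blocking stage
    uint64_t limitL_;            // the interval L
    std::deque<Rec> bounded_;    // the "blocking" part with the limit
    std::deque<Rec> unbounded_;  // the non-blocking part
};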

But this fix has its own trouble: for consistency it requires that the loop must be fully deadlocked before breaking this deadlock. Not just that the queues to and from the top node are full, but that every queue in the loop is full, which is not easy to track. Triceps already finds the loops between the Trieads, so that part is not a problem, but tracking at runtime how the queues become full, with all the possibilities of intertwined loops, might add a good deal of overhead. So it might not be a practical fix.

This might need more thinking or maybe it's just a problem that doesn't need to be fixed.

In the meantime, there is one more aspect to the loops. A looping link might be used essentially to send a timing event, telling the nodes upstream to re-examine their state. In the example below, the looping link from D to B would be driven by the scheduler on D:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+    scheduler
   |       \    /        |
   |        \  / /+------+
   |         V V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

In this case adding the delay L looks weird, since the time of the scheduler event might already be well past L from the original record that triggered it. It would make sense to make a special scheduler driven by a looping link instead:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
scheduler  /   \
   ^    | /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This scheduler's notion of time would be driven by D but it would generate the events for B, with a special limitation that the events must be scheduled at least L into the future. Any lower delays would be rounded up to L.
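
As a tiny sketch (again with made-up names), that scheduling rule boils down to:

.
#include <algorithm>
#include <cstdint>

// now:    the scheduler's current time, as driven by the looping link from D
// delay:  the delay requested when scheduling an event for B
// limitL: the loop interval L
// Returns the time at which the event would actually fire.
uint64_t scheduleFromLoop(uint64_t now, uint64_t delay, uint64_t limitL)
{
    // any delay shorter than L gets rounded up to L
    return now + std::max(delay, limitL);
}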

Oh, and one more thing about the models that would run predictably every time from scratch: their infinite-precision clock can't use a single common record counter for the high-precision part, because that would cause race conditions. Instead each node would have to keep its own record counter, and the low bits of the timestamp would consist of (node_id, per_node_counter). Which might actually be for the best, since it would remove the contention point of the common counter. The schedulers can be given their own node ids. And come to think of it, the schedulers should probably be given lower ids than the input nodes, because the timed events should probably be processed before the records coming from outside at the same time.
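
A minimal sketch of such a composite timestamp, with fixed-width fields standing in for the "infinite-precision" low bits and all the names made up:

.
#include <cstdint>
#include <tuple>

struct FullTimestamp {
    uint64_t clockTime;   // the common coarse part of the model's clock
    uint32_t nodeId;      // schedulers get lower ids than the input nodes
    uint64_t nodeCounter; // incremented independently by each node

    bool operator<(const FullTimestamp &other) const {
        // at the same clock time, order by node id first, so that the
        // scheduler events sort before the records coming from outside
        return std::tie(clockTime, nodeId, nodeCounter)
            < std::tie(other.clockTime, other.nodeId, other.nodeCounter);
    }
};

// Each node stamps its records from its own counter, so there is no
// single shared counter to contend on.
FullTimestamp stampRecord(uint64_t clockTime, uint32_t nodeId, uint64_t &perNodeCounter)
{
    return FullTimestamp{clockTime, nodeId, perNodeCounter++};
}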
