Saturday, May 9, 2020

consistent time and loops

It's obvious that the graph loops cannot be treated like the rest of the links with consistent time, or there would be no pipelining: we'd be forced to wait for an update from the looping link every time we send a record down the loop. Or we might end up sending a whole lot of records down the loop before reading back from the looping link, and they would all accumulate in the queue of that link.

So what is the right thing to do? I think it depends on the circumstances, on what we're trying to achieve. I can think of the following uses:

1. Record the processing in real time and be able to reproduce it exactly in an accelerated playback.

2. Process records in accelerated time to start with, but as in real time, don't care too much about the exact outcome of the first run as long as it's within reasonable constraints (but be able to replay it again in exactly the same way).

3. Process records in accelerated time to start with, and make sure that they get treated in an exactly consistent way, any run from scratch producing the exact same result.

The case (1) can be reasonably resolved by treating the looping links like the inputs from the external systems, where we don't particularly care about the exact synchronization: re-timestamp the incoming records, log them, and then process them with the new timestamp. On replay, read the logged records with their timestamps, and process them in the same way.

A few more words about how the re-timestamping would work: tag the record in the log with both the original timestamp and the new one, and use the new timestamp in the further processing. A simple-minded way to handle the replay would be to just read the log, and ignore the records sent through the looping link on the replay (since they presumably should be the same). A smarter way would be to receive the records from the link and compare them with the records in the log (including the original timestamp). If they match, all is well, and the new timestamp from the log can be used. If they diverge, then something somewhere has changed, and either the replay should be aborted or some mitigation should be done, perhaps in the same way as for the case (2).
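
To make the replay check a bit more concrete, here is a minimal sketch in C++; the types and names (LoopLogEntry, Record, replayTimestamp) are made up for illustration and are not the actual Triceps API:

.
#include <cstdint>
#include <optional>

struct Record { /* the payload fields would go here */ };

struct LoopLogEntry {
    int64_t origTimestamp; // timestamp the record carried on the looping link
    int64_t newTimestamp;  // timestamp assigned at the re-timestamping point
    Record record;         // the record itself, kept for comparison on replay
};

// On replay, compare the record arriving from the looping link with the
// logged one; if they match, reuse the logged new timestamp, otherwise
// report a divergence so the caller can abort or mitigate.
std::optional<int64_t> replayTimestamp(
    const LoopLogEntry &logged, const Record &arrived, int64_t arrivedOrigTimestamp,
    bool (*sameRecord)(const Record &, const Record &))
{
    if (arrivedOrigTimestamp == logged.origTimestamp
            && sameRecord(logged.record, arrived))
        return logged.newTimestamp; // all is well, use the logged new timestamp
    return std::nullopt; // divergence: abort the replay or mitigate as in case (2)
}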

So, how to solve the case (2)? The trouble there is that the model's clock gets driven by the incoming records, which get queued, allowing the clock to go far ahead before the records get processed and arrive back through a looping link. The solution seems to be to limit how long the records can be delayed at the re-timestamping point. Suppose we choose some time interval that is large enough to let the records go through the loop (to allow the queues to do an efficient pipelining) but not outlandishly large; let's call this interval L. Say, 1 second or 0.1 second. Set it as the limit. Then when a record arrives through the looping link (such as, in the figure below, B receiving a record back from D) with a timestamp T, we re-stamp it with a timestamp that is the smaller of: (T + L) and the lowest timestamp of the next record in the incoming queues (going up the links as usual to get the timestamp of the record they would send next, and since the records from the looping links are processed first, this in effect means "go just before that record").

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This would essentially allow the looping links to "run behind the clock" by up to L. So if B receives from A a record with timestamp T2, it can treat the looping link "almost normally", by asking it if it has any earlier records, but using the time limit (T2 - L) instead of T2 (as it would for the normal "downstream" links). This is very much the same as the case (1), only it limits by how much the clock could run ahead. This solution would actually work fine for both the cases (1) and (2). And since it preserves the high priority of the looping links, it would prevent them from buffering too much data.

It can also be stretched to cover (3) by always using (T + L) for the looped records, and not the optimistic minimum. The trouble there is that a large number of records might enter the loop in the interval L, and they will collect on the looping link's queue. But it should be fine for the small test models with a small amount of data.
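
Spelled out as code, the re-stamping rule might look like the following sketch; the names restampLoopRecord and nextIncoming are made up for illustration:

.
#include <algorithm>
#include <cstdint>

// Re-stamp a record that arrived on a looping link with timestamp T.
// L is the chosen loop allowance, nextIncoming is the lowest timestamp of
// the next record in the normal incoming queues (found by going up the
// links as described above).
int64_t restampLoopRecord(int64_t T, int64_t L, int64_t nextIncoming)
{
    // Cases (1) and (2): the optimistic minimum lets the looped record go
    // just before the next "normal" record, or at T + L at the latest.
    return std::min(T + L, nextIncoming);
}

int64_t restampLoopRecordStrict(int64_t T, int64_t L)
{
    // Case (3): always use T + L, so the result doesn't depend on what
    // happens to be queued on the other inputs at the moment.
    return T + L;
}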

I've thought about fixing this issue by using a semi-blocking queue on the looping link: essentially compose the looping link from a blocking queue with the usual limit (and special logic) followed by a non-blocking queue. The re-timestamping would happen when moving the records from the blocking queue to the non-blocking one. If the non-blocking queue is empty, treat the records from the blocking queue as delayed by L from their original timestamps, and move them to the non-blocking queue when the time (T + L) is reached. Unless the loop is about to deadlock. Then instead of deadlocking, move the first record from the blocking queue on the looping link to the non-blocking queue, re-stamping it with the lowest timestamp of the next record in the incoming queues, as in the solution for (2). Since the total depth of the queues in a loop is fixed for a given model, that would make the ordering of the records fully predictable, even if they get processed before (T + L).
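
A structural sketch of such a semi-blocking looping link might look like this; all the names here are illustrative, a thought experiment rather than the Triceps implementation:

.
#include <cstddef>
#include <cstdint>
#include <deque>

struct StampedRecord {
    int64_t timestamp;
    // ... payload fields ...
};

// The looping link as two stages: a bounded (blocking) queue feeding an
// unbounded (non-blocking) one, with the re-timestamping done as records
// move between them.
class LoopingLink {
public:
    LoopingLink(std::size_t limit, int64_t L) : limit_(limit), L_(L) {}

    // The sender side: returns false when the bounded stage is full and
    // the sender would have to block.
    bool send(const StampedRecord &r) {
        if (bounded_.size() >= limit_)
            return false;
        bounded_.push_back(r);
        return true;
    }

    // Normal path: once the time (T + L) is reached, move the record to
    // the unbounded stage, re-stamped as delayed by L.
    void advanceTo(int64_t now) {
        while (!bounded_.empty() && bounded_.front().timestamp + L_ <= now) {
            StampedRecord r = bounded_.front();
            bounded_.pop_front();
            r.timestamp += L_;
            unbounded_.push_back(r);
        }
    }

    // Deadlock break: every queue in the loop is full, so release the first
    // record early, re-stamped just before the next normal incoming record,
    // as in the solution for case (2).
    void breakDeadlock(int64_t nextIncoming) {
        if (bounded_.empty())
            return;
        StampedRecord r = bounded_.front();
        bounded_.pop_front();
        r.timestamp = nextIncoming;
        unbounded_.push_back(r);
    }

private:
    std::size_t limit_;
    int64_t L_;
    std::deque<StampedRecord> bounded_;    // the blocking stage with the usual limit
    std::deque<StampedRecord> unbounded_;  // the non-blocking stage read by the loop head
};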

But this fix has its own trouble: for consistency it requires that the loop must be fully deadlocked before breaking this deadlock. Not just that the queues to and from the top node are full, but that every queue in the loop is full. Which is not easy to track. Triceps already finds the loops between the Trieads, so that part is not a problem, but tracking at runtime how the queues become full, with all the possibilities of intertwined loops, might add a good deal of overhead. So it might not be a practical fix.

This might need more thinking or maybe it's just a problem that doesn't need to be fixed.

In the meantime, there is one more aspect to the loops. A looping link might be used essentially to send a timing event, telling the nodes upstream to re-examine their state. In the example below, the looping link from D to B would be driven by the scheduler on D:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+    scheduler
   |       \    /        |
   |        \  / /+------+
   |         V V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

In this case adding the delay L looks weird, since the time of the scheduler event might already be well past L from the original record that triggered it. It would make sense to make a special scheduler driven by a looping link instead:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
scheduler  /   \
   ^    | /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This scheduler's notion of time would be driven by D but it would generate the events for B, with a special limitation that the events must be scheduled at least L into the future. Any lower delays would be rounded up to L.
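
In code this limitation is trivial; a sketch with made-up names:

.
#include <algorithm>
#include <cstdint>

// A scheduler driven by a looping link: an event requested for time `when`
// must land at least L past the loop's current time `loopNow`; any lower
// delay gets rounded up to L.
int64_t scheduleOnLoop(int64_t loopNow, int64_t when, int64_t L)
{
    return std::max(when, loopNow + L);
}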

Oh, and one more thing about the models that would run predictably every time from scratch: their infinite-precision clock can't use a single common record counter for the high-precision part, because that would cause race conditions. Instead each node would have to keep its own record counter, and the low bits of the timestamp would consist of (node_id, per_node_counter). Which might actually be for the best, since it would remove the contention point of the common counter. The schedulers can be given their own node ids. And come to think of it, the schedulers should probably be given lower ids than the input nodes, because the timed events should probably be processed before the records coming from outside at the same time.
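
A sketch of what such a timestamp might look like; the layout and names are my illustration, not a fixed format:

.
#include <cstdint>
#include <tuple>

// The "low bits" of the timestamp become a (node_id, per_node_counter)
// pair, so each node (and each scheduler, with its own node id) allocates
// counts without contending on a common counter.
struct Timestamp {
    int64_t time;    // the coarse clock reading
    int32_t nodeId;  // schedulers get lower ids than the input nodes, so their
                     // timed events sort before outside records at the same time
    int64_t count;   // this node's own record counter

    bool operator<(const Timestamp &other) const {
        return std::tie(time, nodeId, count)
            < std::tie(other.time, other.nodeId, other.count);
    }
};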

Tuesday, May 5, 2020

consistent time

I've done some more thinking on the issues of consistent time in Triceps models, and came up with a design. It's not a final design, more of a set of design notes, so that I won't forget them until I get to an implementation. But I think it's quite interesting.

Let's start with the models that are the unidirectional graphs, without any loops; they are much easier to reason about.

The basic premise is that the records need to be timestamped when they enter the model, and then processed in the order of these timestamps. The time issues tend to be highlighted in the diamond-shaped graphs, so let's start with one:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
          /     \
          V      V
        +---+  +---+
        | B |  | C |
        +---+  +---+
           \    /
            \  /
             V V
           +-----+
           |  D  |
           +-----+

An input record arrives, gets stamped, goes to the node (i.e. a Triead, a Triceps thread, but let's talk in more generic terms for now) A, where it can cause the records to be sent to nodes B and/or C. The records produced by A would have the same timestamp as the incoming record. B and C in turn may produce their own records (which again would get the same timestamp) and send them to D.

When two records with the same timestamp arrive from the same node, they have the natural queuing order, and they can be predictably processed in that order. But what if they arrive from different nodes, say B and C? The result of processing might depend on the order, which basically means that we have to define priorities between all the inputs. Here D has the inputs from B and C: one of them (say, B) would get a higher priority, and its records will be processed before the records with the same timestamp from the other input (C).
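
For example, the ordering at D could be expressed with a comparison like this sketch; the names are made up for illustration:

.
#include <cstdint>
#include <tuple>

// A record queued at a merge point like D: records are processed in
// timestamp order, ties between different inputs are broken by a fixed
// input priority (B before C), and ties within one input by arrival order.
struct QueuedRecord {
    int64_t timestamp;
    int inputPriority;  // e.g. 0 for the B input, 1 for the C input
    int64_t seq;        // arrival order within one input, for stability
};

bool processedBefore(const QueuedRecord &a, const QueuedRecord &b)
{
    return std::tie(a.timestamp, a.inputPriority, a.seq)
        < std::tie(b.timestamp, b.inputPriority, b.seq);
}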

Well, that's easy to say, but what should D do when it receives a record with timestamp T from C, and nothing yet from B? For how long should it wait for something from B before it decides that it can process the record from C?

One way is to always send the timestamp metadata along each path even if there are no actual records. But this looks like a lot of unnecessary overhead. And it also has another potential issue: suppose we have two external inputs that get timestamped. The input 1 gets a lot of records for a period of time, the input 2 gets none. How often should the input 2 send its timestamp metadata even if it sends no data records? Well, maybe there is another solution.

Another way to resolve it would be for D to ask B: do you ever plan to send me anything with timestamp T or earlier? If not, then D can safely go and process the record from C. If yes, then D would have to wait for a notification from B. So the request from D to B can be asynchronous. If B is ready, it will send the timestamp metadata as a callback to D right away; if not, it will send either timestamped records or a bare timestamp later.

But B might not be able to answer this directly. It might have to consult A with the same question first. And A would have to consult the timestamper on its input. If A has two inputs, it would consult both timestampers.
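
The shape of such an asynchronous request might be something like this sketch; it's an illustration of the idea, not the Triceps API:

.
#include <cstdint>
#include <functional>

// D asks its input B: "do you ever plan to send me anything with timestamp
// T or earlier?" B either answers through the callback right away (if it
// already knows it's past T), or calls back later, possibly after asking
// its own upstream nodes the same question.
class UpstreamInput {
public:
    virtual ~UpstreamInput() = default;

    // Request a notification once this input is known to be past `t`.
    // The callback receives the highest timestamp the input has reached,
    // which may be later than `t` (allowing the caller to amortize).
    virtual void requestPast(int64_t t, std::function<void(int64_t reached)> done) = 0;
};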

Well, these callbacks would be very similar to just sending the timestamp metadata records all the time, and there would be additional delays involved with the requests upstream. But there are positive things too:

* It tells us how often the inactive timestamper would get queried and send its metadata: more or less, for every record that comes in through any timestamper.

* In reality, not every node in the model would get input for every incoming record. If none of the nodes at some level produce records, the propagation would stop there and won't go farther downstream. And when the requests are sent upstream, they also would reach only to the timestampers that could send input to this node.

* The requests can be of two types: that the time T has been reached and that the time T has been passed (see the sketch after this list). Because the node D prefers B over C, when it gets a record from C with timestamp T, it would have to make sure that B is past T before processing that record. But if it gets a record from B with timestamp T, knowing that C is at T (and not yet past it) would be enough to start processing that record.

* The requests can be amortized. I.e. if the node D has records with timestamps T1 and T2 queued from node C, it can ask B once to send the timestamp updates until it reaches past T2. In the same way, if D asks B whether it's past T2 and B knows that it's already past a later timestamp T3, it can send the notification about T3 right away, and D will be able to use this knowledge later.

* Finally, this communication doesn't have to go through the regular queues. The more efficient data structures can be used to propagate the timestamps between the nodes. The timestamp information is cumulative: once B has notified that it had processed the input up to the timestamp T3, there is no point in keeping the information about it processing the timestamps T1 and T2, they're implied in T3.

* But if the communication goes across the network, the round trip time can become prohibitively expensive for the upstream requests. So the downstream timestamps should be sent on their own, although the cumulative approach would still apply. Of course, if there are multiple streams sent over the network between two systems, all these streams should just be sent across using the same sequencer, so there would be no issue with synchronization between them. But there would still be the synchronization issue between multiple external systems, and between them and the local scheduled events. Yet another possible solution for that would be to just restamp them with the local time if the relative ordering doesn't matter.
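
Putting the "reached"/"passed" distinction and the cumulative bookkeeping together, the per-input state that a node like D keeps could look like this sketch; the names are illustrative:

.
#include <cstdint>

// Cumulative knowledge about one upstream input: only the latest known
// position matters, the earlier notifications are implied by it.
struct InputTimeState {
    int64_t reached = INT64_MIN; // highest timestamp the input is known to be at
    bool passed = false;         // whether it's known to be strictly past `reached`

    // A later notification subsumes all the earlier ones.
    void update(int64_t t, bool isPast) {
        if (t > reached || (t == reached && isPast)) {
            reached = t;
            passed = isPast;
        }
    }

    // "Has this input reached T?" - enough to ask of the lower-priority
    // input before processing the preferred input's record at T.
    bool knownReached(int64_t t) const { return reached >= t; }

    // "Is this input past T?" - required of the preferred input before
    // processing a record at T from the other input.
    bool knownPassed(int64_t t) const {
        return reached > t || (reached == t && passed);
    }
};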

Well, let's move on to the next problem. When can a node say that it's past a particular timestamp? It depends on the clock granularity. As long as the clock is at T, there might be more records coming in, so it can't say that it's past T yet. So if the clock granularity is 1 millisecond, we'd be stuck waiting for that millisecond to end. What we need is a clock with infinite granularity. It can be simulated by keeping a single global record counter, and building the timestamp as a pair (time, count). The counter would be like the lower bits of the timestamp. This would also allow amortizing the calls to read the clock: if we have 100 records queued up at the timestamper, we can just increase the global counter by 100, and assign each record the same time and the next count in the allocated sequence.
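
A sketch of such a clock with the amortized counter; this is illustrative, not the actual implementation:

.
#include <atomic>
#include <cstdint>
#include <utility>

// The timestamp becomes a pair (time, count), compared lexicographically;
// the global counter provides the "lower bits".
using FineTimestamp = std::pair<int64_t, int64_t>;

class FineClock {
public:
    // Allocate `n` consecutive counts at once and return the first one:
    // a timestamper with 100 queued records calls this once with n = 100
    // and hands out (sameTime, first), (sameTime, first + 1), ...
    int64_t allocateCounts(int64_t n) {
        return counter_.fetch_add(n, std::memory_order_relaxed);
    }

private:
    std::atomic<int64_t> counter_{0};
};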

Why not forget the time altogether and move on to just the counter? We'd need the time to be able to replay the records later at an accelerated pace, and still handle the time-based events.

Basically, the timed scheduler would run as a timestamper. And yes, it would also have to use the same clock, and do the same (time, count) pairs allocated from that clock. If D has some time-based events, the diagram would become:

.
           input
             |
             V
        timestamper <----------------------------- clock
             |                                       |
             V                                       |
           +---+                                     |
           | A |                                     |
           +---+                                     |
           /   \                                     |
          /     \                                    |
          V      V                                   |
        +---+  +---+                                 |
        | B |  | C |                                 |
        +---+  +---+    scheduler/timestamper <------+
           \    /        +
            \  //+-------+
             V VV
           +-----+
           |  D  |
           +-----+

This would mean that when D gets a record from B, it would have to check the time not only in C but also in the timestamper. The timestamper can also do the amortization by telling the time of its next scheduled event. But that's a bit tricky: first, there would have to be some count part of the timestamp associated with that future time. There are two ways to go about it: either first tell the time without the count part (essentially, with it at 0), or allocate the count part in advance, when the event gets scheduled, and then just use that count when the timer fires. Second, it might happen that the processing of an incoming record would schedule a new, earlier event. But that's not a problem because the events can't be scheduled in the past. Since the scheduler would run as a part of D and synchronous with it, it could update its estimation and move the next event forward.

It would also have an interesting interaction with the replay: if we want to replay a timestamped log of incoming events and get the model to behave in exactly the same way as before, the scheduler events would have to happen at exactly the same times too. Which means them getting the exact same count part of the timestamp, and that normally won't be stable due to the races for the clock. But it can be resolved by recording the exact time of all the scheduled events into the same log and reusing it during the replay, the schedulers receiving their timestamps not from the clock but from the log, same as the incoming records.

What if we want to change the model between the reruns? Well, in this case we can discard the log for the schedulers, and just get fresh count values for them, and write the values into the new log. The running sequence would be slightly different than the first time, but since the model has changed, it wouldn't matter. Or we could even reuse the part of the log that is still applicable, merge it with the events from the new schedulers, and write it into the new log. Either way, once the new log gets written, it can be reused again to produce the exact same result on the next replay.

Another interesting thing about the replay is how we transition from a replay to the live performance. If we have a recorded log from yesterday, want to replay it and then continue with today's data, what do we do with all the scheduled events that would have fired overnight? This basically suggests that we need to have some kind of "time reset events" that would be used when the time gets moved abruptly. They would allow the application logic to properly reset all the events that have been scheduled for the meantime: either just cancel them or execute them once, depending on the application semantics.

I think this should work quite well for the graphs without loops. The loops add a good deal of trouble. More on them later.

Sunday, May 3, 2020

TLB coherence

The discussion described in the last post got me thinking: why don't we have hardware coherence for the page address translation cache (TLB), done through the same bus transactions as the usual cache snooping? In a multi-threaded environment, invalidating the TLB across all the threads is a pain that requires cross-CPU interrupts.

In the simplest case the CPU that drives a TLB invalidation could just do a bus transaction that specifies a virtual address, and every CPU would invalidate its TLB entry that matches this address. If the user processes use the address space randomization, there would be few conflicts between processes, and the kernel address space is common for everyone. In a more complicated way, the virtual address can be accompanied by the process or memory protection id, so that only the entries on the CPUs running that process would be invalidated. And finally, each page translation entry has a unique identifier: the physical memory address where the CPU had read it from. If the TLB keeps this address as a tag (keeping enough bits to identify the page of the page directory is enough, since the rest of the bits are determined by the virtual address), it can be used as a tag in the TLB invalidation transaction, and only the exact TLB entries matching this tag would be invalidated.
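
To make the three variants concrete, the fields of such an invalidation transaction would be roughly these; a thought-experiment sketch, not any real bus protocol:

.
#include <cstdint>
#include <optional>

// The hypothetical TLB invalidation broadcast on the bus.
struct TlbInvalidate {
    uint64_t virtualAddress;               // simplest case: drop any entry matching this address
    std::optional<uint32_t> processId;     // variant 2: limit the effect to one process / protection id
    std::optional<uint64_t> entryPhysAddr; // variant 3: physical address of the page table entry,
                                           // used as an exact tag for the entry to drop
};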

Which made me wonder if anyone else had thought of this before. And after a bit of searching, it has turned out that they did: https://arxiv.org/pdf/1701.07517.pdf. These guys have approached the problem from the hypervisor standpoint, but the implementation is exactly the same, using the physical memory address of the entry as a tag. There is no date on the article, but judging by the citation list, it could not have been written earlier than 2016.