Sergey Babkin on CEP and stuff: 2020

Friday, December 25, 2020

time for more analog computers?

In case if you haven't heard about them, the analog computers were unlike the digital computers. They were the electrical models assmebled to simulate the other real-world events. For example, if you wanted to model a car suspension, you'd have an oscillator circuit to simulate a spring, and a resistor to simulate a shock absorber. They were purpose-assembled for every specific application.

Well, now we have another thing that purports to simulate the real-world systems: the units in the neural networks that are supposed to be like the real neurons. The race to cram more "neurons" into the neural networks has led to the values in the "neurons" to be represented as very short 8-bit fixed-point numbers. And the models still work. Which means that the "neurons" are quite well at tolerating the small imprecisions in the data, which is the typical issue with the analog signals. But in return the analog signals are much more resistant to large disruptions: i.e. if you change one single highest bit in a digital representation of a number, this small disruption changes the value of a number by half its whole range, something that doesn't happen with the analog signals. There already are ideas of improving the reliability of computations by making them more resistant to large errors by accepting the small errors (I think I've read about them in IEEE's Spectrum).

So, the next logical step: why not make each "neuron" in a neural network an analog machine? Instead of 8 bits, have one analog signal, and do all the multiplication and addition on the analog signals. It looks like the perfect match, a combination of these two ideas. Has anyone already done that? If not, why not? Is there a way to do that?

P.S. It turns out, people are already doing this. I've recently read in Spectrum that people are using photonic computations for the machine learning, and that the photonic computations are limited to this niche because of their limited precision. I can't really think of a reason for the limited precision other than these computations being analog.

P.P.S. It's 2022 now, slightly more than a year after the original post, and IEEE Spectrum is reporting that IBM is working on a way to use the memory (like Flash, but other similar types too) as an analog summator, doing the "neuron" processing in a memory cell.

Sunday, September 27, 2020

a book on probability

I've stumbled upon the book "Introduction to probability" by Anderson, Seppalainen, and Valko. It's a college textbook.

Right in the second chapter it does a very good introduction into the Bayesian formula, deriving it from the descriptions of weights of events, in the same way as I struggled to reinvent on my own. And right there it also provides the "General version of Bayes' formula" for the enumerated events, that took me so much time to reinvent on my own. Well, another case of "everything is invented before us" :-) If only we knew, what books to read at the right time :-) On the other hand, I'm not sure if I would have recognized the importance of that formula if I didn't have a use for it in mind first.

In case if you wonder, here is the formula:

If B1...Bn are events that partition the sample space, then for any event A we have:

P(A) = sum from i=1 to n of (P(A & Bi)) = sum from i=1 to n of (P(A|Bi)*P(Bi))

Then

P(Bk|A) = P(A & Bk) / P(A) = P(A|Bk)*P(Bk) / (sum from i=1 to n of (P(A|Bi)*P(Bi)))

Wednesday, September 23, 2020

cheap malloc-ed stacks

I've accidentally realized that the disjointed stacks (ones composed of the separately allocated chunks) can be pretty inexpensive if they get allocated in fixed pages, and each stack frame is guaranteed to be less than a page (which is fairly common in, say, the OS kernel). And the stack pages can be made larger than the physical pages.

All that the function needs to do at the start of the call is check that there is enough room to the end of the page. Which is easy enough by checking that (SP & PAGE_SIZE) < (PAGE_SIZE-FRAME_SIZE). Well, taking a page from the RISC book, we'd also want to reserve some room at the end of the page for the leaf functions that could then skip the whole check, but it doesn't change the general approach.

If a new page needs to be allocated, it can be taken from a pool of free stack pages (a linked list), and would be linked to the previous page, then the frame created in it (copying the old stack pointer onto the new stack, as in the "classic" Intel x86 sequence of "push BP" that saves the pointer to the previous stack frame). The next page can also be found already linked to the current page from the last time, and then the allocation can be skipped. The function return then doesn't need to do anything special, just "pop SP" and return. And then if the thread blocks, the scheduler can return the unused pages beyond the last active frame back to the pool.

There obviously is some overhead but not a real huge amount of it.

Saturday, May 9, 2020

consistent time and loops

It's obvious that the graph loops cannot be treated like the rest of the links with consistent time, or there would be no pipelining: we'd be forced to wait for an update from the looping link every time we send a record down the loop. Or we might end up sending a whole lot of records down the loop before reading back from the looping link, accumulating a whole lot of records in the queue of the looping link.

So what is the right thing to do? I think it depends on the circumstances, on what we're trying to achieve. I can think of the following uses:

1. Record the processing at real time and be able to reproduce it exactly in an accelerated playback.

2. Process records in accelerated time to start with, but as in real time don't care too much about the exact outcome on the first run as long as it's within the reasonable constraints. (But be able to replay it again in exactly the same way).

3. Process records in accelerated time to start with, and make sure that they get treated in an exactly consistent way, any run from scratch producing the exact same result.

The case(1) can be reasonably resolved by treating the looping links like the inputs form the external systems where we don't particularly care about the exact synchronization: re-timestamping the incoming records, logging them, and then processing with the new timestamp. On replay, read the logged records with their timestamps, and process them in the same way.

A few more words about how the re-timestamping would work: tag the record in the log with both the original timestamp and the new one, and use the new timestamp in the further processing. A simple-minded way to handle the replay would be to just read the log, and ignore the records sent through the looping link on the replay (since they presumably should be the same). A smarter way would be to receive the records form the link and compare them with the records in the log (including the original timestamp). If they match, all is well, the new timestamp from the log can be used. If they diverge then something somewhere has changed and either the replay should be aborted or some mitigation should be done, perhaps in the same way as for the case (2).

So, how to solve the case (2)? The trouble there is that the model's clock gets driven by the incoming records, that get queued, allowing the clock to go far ahead before the records get processed and arrive back through a looping link. The solution seems to be to limit, how far can the records be delayed at the re-timestamping point. Suppose we choose some time interval that is large enough to let the records go through the loop (to allow the queues to do an efficient pipelining) but not outlandishly large, let's call this interval L. Say, 1 second or 0.1 second. Set it as the limit. Then when a record arrives through the looping link (such as, in the figure below, B receiving a record back from D) with a timestamp T, we re-stamp it with a timestamp that is the smaller of: (T + L) and the lowest timestamp of the next record in the incoming queues (going up the links as usual to get the timestamp of the record they would send next, and since the records from the looping links are processed first, this in effect means "go just before that record").

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This would essentially allow the looping links to "run behind the clock" by up to L. So if B receives from A a record with timestamp T2, it can treat the looping link "almost normally", by asking it if it has any earlier records, but using the time limit (T2 - L) instead of T2 (as it would for the normal "downstream" links). This is very much the same as the case (1), only it limits by how much the clock could run ahead. This solution would actually work fine for both the cases (1) and (2). And since it preserves the high priority of the looping links, it would prevent them from buffering too much data.

It can also be stretched to cover (3) by always using (T + L) for the looped records, and not the optimistic minimum. The trouble there is that a large number of records might enter the loop in the interval L, and they will collect on the looping link's queue. But it should be fine for the small test models with a small amount of data.

I've thought about fixing this issue by using a semi-blocking queue on the looping link: essentially compose the looping link from a blocking queue with the usual limit (and special logic) followed by a non-blocking queue. The re-timestamping would happen when moving the records from the blocking queue to the non-blocking one. If the non-blocking queue is empty, treat the records from the blocking queue as having the timestamps behind their original ones by L, and move them to the non-blocking queue when the time (T + L) is reached. Unless the loop is about to go deadlocked. Then instead of deadlocking, move the first record from the blocking queue on the looping link to the non-blocking queue, re-stamping it with the lowest timestamp of the next record in the incoming queues as in the solution for (2). Since the total depth of the queues in a loop is fixed for a given model, that would make the ordering of the records fully predictable, even if they get processed before (T + L).

But this fix has its own trouble: for consistency it requires that the loop must be fully deadlocked before breaking this deadlock. Not just that the queues to and from the top node are full, but that every queue in the loop is full. Which is not easy to track. Triceps already finds the loops between the Trieads, so that part is not a problem, but tracking at runtime how the queues become full, with all the possibilities of intertwined loops, might add a good deal of overhead. So it might not be a practical fix.

This might need more thinking or maybe it's just a problem that doesn't need to be fixed.

In the meantime, there is one more aspect to the loops. A looping link might be used essentially to send a timing event, telling the nodes upstream to re-examine their state. In the example below, the looping link from D to B would be driven by the scheduler on D:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+    scheduler
   |       \    /        |
   |        \  / /+------+
   |         V V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

In this case adding the delay L looks weird, since the time of the scheduler event might already be well past L from the original record that triggered it. It would make sense to make a special scheduler driven by a looping link instead:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
scheduler  /   \
   ^    | /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This scheduler's notion of time would be driven by D but it would generate the events for B. With a special limitation that the events must be scheduled by at least L into the future. Any lower delays would be rounded up to L.

Oh, and one more thing about the models that would run predictably every time from scratch: their infinite-precision clock can't use a single common record counter for the high-precision part, because that would cause race conditions. Instead each node would have to keep its own record counter, and the low bits of the timestamp would consist of (node_id, per_node_counter). Which might actually be to the best, since it would remove the contention point of the common counter. The schedulers can be given their own node ids. And to think of it, the schedulers should probably be given the lower ids than the input nodes, because the timed events should probably be processed before the records coming from outside at the same time.

Tuesday, May 5, 2020

consistent time

I've done some more thinking on the issues of consistent time in Triceps models, and came up with a design. It's not a final design, more of design notes, so that I won't forget them until I get to an implementation. But I think it's quite interesting.

Let's start with the models that are the unidirectional graphs, without any loops, they are much easier to reason about.

The basic premise is that the records need to be timestamped when they enter the model, and then processed in the order of these timestamps. The time issues tend to be highlighted in the diamond-shaped graphs, so let's start with one:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
          /     \
          V      V
        +---+  +---+
        | B |  | C |
        +---+  +---+
           \    /
            \  /
             V V
           +-----+
           |  D  |
           +-----+

An input record arrives, gets stamped, goes to the node (i.e. Triead, a triceps thread, but let's talk in more generic terms for now) A, where it can cause the records to be sent to nodes B and/or C. The records produced by A would have the same timestamp as the incoming record. B and C in turn may produce their own records (which again would get the same timestamp) and send them to D.

When two records with the same timestamp arrive from the same node, they have the natural queuing order, and they can be predictably processed in that order. But what if they arrive from different nodes, say B and C? The result of processing might depend on the order, which basically means that we have to define priorities between all the inputs. Here D has the inputs from B and C: one of them (say, B) would get a higher priority, and its records will be processed before the records with the same timestamp from the other input (C).

Well, that's easy to say, but what should D do when it receives a record with timestamp T from C, and nothing yet from B. For how long should it wait for something from B before it decides that it can process the record from C?

One way is to always send the timestamp metadata along each path even if there are no actual records. But this looks like a lot of unnecessary overhead. And it also has another potential issue: suppose we have two external inputs that get timestamped. The input 1 gets a lot of records for a period of time, the input 2 gets none. How often should the input 2 send its timestamp metadata even if it sends no data records? Well, maybe there is another solution.

Another way to resolve it would be for D to ask B, do you ever plan to send me anything with timestamp T or earlier? If not then D can safely go and process the record from C. If yes then D would have to wait for a notification from B. So the request from D to B can be asynchronous. If B is ready, it will send a timestamp metadata as a callback to D right away, if not then it will send either timestamped records or a bare timestamp later.

But B might not be able to answer this directly. It might have to consult A with the same question first. And A would have to consult the timestamper on its input. If A has two inputs, it would consult both timestampers.

Well, these callbacks would be very similar to just sending the timestamp metadata records all the time, and here would be additional delays involved with the requests upstream. But there are positive things too:

* It tells us, how often the inactive timestamper would get queried and send its metadata: more or less, for every record that comes in through any timestamper.

* In reality, not every node in the model would get input for every incoming record. If none of the nodes at some level produce records, the propagation would stop there and won't go farther downstream. And when the requests are sent upstream, they also would reach only to the timestampers that could send input to this node.

* The requests can be of two types: that the time T has been reached and that the time T has been passed. Because the node D prefers B over C, when gets a record from C with timestamp T, it would have to make sure that B is past T before processing that record. But if it gets a record from B with timestamp T, knowing that C is at T (and not yet past it) would be enough to start processing that record.

* The requests can be amortized. I.e. if the node D has records with timestamps T1 and T2 queued from node C, it can ask B once to send the timestamp updates until it reaches past T2. In the same way, if D asks B whether it's past T2 and B knows that it's already past a later timestamp T3, it can send the notification about T3 right away, and D will be able to use this knowledge later.

* Finally, this communication doesn't have to go through the regular queues. The more efficient data structures can be used to propagate the timestamps between the nodes. The timestamp information is cumulative: once B has notified that it had processed the input up to the timestamp T3, there is no point in keeping the information about it processing the timestamps T1 and T2, they're implied in T3.

* But if the communication goes across the network, the round trip time can become prohibitively expensive for the upstream requests. So the downstream timestamps should be sent on their own, although the cumulative approach would still apply. Of course, if there are multiple streams sent over the network between two systems, all these streams should just be sent across using the same sequencer, so there would be no issue with synchronization between them. But there would still be the synchronization issue between multiple external systems, and between them and the local scheduled events. Yet another possible solution for that would be to just restamp them with the local time if the relative ordering doesn't matter.

Well, let's move on to the next problem. When can a node say that it's past a particular timestamp? It depends on the clock granularity. As long as the clock is at T, there might be more records coming in, so it can't say that it's past T yet. So if the clock granularity is 1 millisecond, we'd be stuck waiting for that millisecond to end. What we need is a clock with infinite granularity. It can be simulated by keeping a single global record counter, and building the timestamp as a pair (time, count). The counter would be like the lower bits of the timestamp. This would also allow to amortize the calls to read the clock: if we have 100 records queued up at the timestamper, we can just increase the global counter by 100, and assign each record the same time and the next count in the allocated sequence.

Why not forget the time at all and move on to just the counter? We'd need time to be able to replay the records later at an accelerated pace, and still handle the time-based events.

Basically, the timed scheduler would run as a timestamper. And yes, it would also have to use the same clock, and do the same (time, count) pairs allocated from that clock. If D has some time-based events, the diagram would become:

.
           input
             |
             V
        timestamper <----------------------------- clock
             |                                       |
             V                                       |
           +---+                                     |
           | A |                                     |
           +---+                                     |
           /   \                                     |
          /     \                                    |
          V      V                                   |
        +---+  +---+                                 |
        | B |  | C |                                 |
        +---+  +---+    scheduler/timestamper <------+
           \    /        +
            \  //+-------+
             V VV
           +-----+
           |  D  |
           +-----+

This would mean that when D gets a record from B, it would have to check the time not only in C but also in the timestamper. The timestamper can also do the amortization by telling the time of its next scheduled event. But that's a bit tricky: first, there would have to be some count part of the timestamp associated with that future time. There are two ways to go about it: either first tell the time without the count part (essentially, with it at 0), or allocate the count part in advance, when the event gets scheduled, and then just use that count when the timer fires. Second, it might happen that the processing of an incoming record would schedule a new, earlier event. But that's not a problem because the events can't be scheduled in the past. Since the scheduler would run as a part of D and synchronous with it, it could update its estimation and move the next event forward.

It would also have an interesting interaction with the replay: if we want to replay a timestamped log of incoming events and get the model to behave in exactly the same way as before, the scheduler events would have to happen at exactly the same times too. Which means them getting the exact same count part of the timestamp, and that normally won't be stable due to the races for the clock. But it can be resolved by recording the exact time of all the scheduled events into the same log and reusing it during the replay, the schedulers receiving their timestamps not from the clock but from the log, same as the incoming records.

What if we want to change the model between the reruns? Well, in this case we can discard the log for the schedulers, and just get the count values for them form fresh, and write the values into the new log. The running sequence would be slightly different than the first time, but since the model has changed, it wouldn't matter. Or we could even reuse the part of the log that is still applicable, merge it with the events from the new schedulers, and write into the new log. Either way, once the new log gets written, it can be reused again to produce the exact same result on the next replay.

Another interesting thing about the replay, is how do we transition from a replay to the live performance? If we have a recorded log from yesterday, want to replay it and then continue with today's data, what do we do with all the scheduled events that would have fired overnight? This basically suggests that we need to have some kind of "time reset events" that would be used when the time gets moved abruptly. It would allow the application logic to reset properly all the events that have been scheduled for the meantime- either just cancel them or execute them once, depending on the application semantics.

I think this should work quite well for the graphs without loops. The loops add a good deal of trouble. More on them later.

Sunday, May 3, 2020

TLB coherence

The discussion described in the last post got me thinking, why don't we have a hardware consistency for the page address translation cache (TLB), done through the same bus transactions as the usual cache snooping? In a multi-threaded environment, invalidating the TLB across all the threads is a pain that requires the cross-CPU interrupts.

In the simplest case the CPU that drives a TLB invalidation could just do a bus transaction that specifies a virtual address, and every CPU would invalidate its TLB entry that matches this address. If the user processes use the address space randomization, there would be few conflicts between processes, and the kernel address space is common for everyone. In a more complicated way, the virtual address can be accompanied by the process or memory protection id, so that only the pages of the CPUs running that process would be invalidated. And finally, each page translation entry has a unique identifier: the physical memory address where the CPU had read it from. If the TLB keeps this address as a tag (keeping enough bits to identify the page of the page directory is enough, since the rest of bits are determined by the virtual address), it can be used as a tag in the TLB invalidation transaction, and only the exact TLB entrues matching this tag would be invalidated.

Which made me wonder if anyone else had thought of this before. And after a bit of searching, it has turned out that they did: https://arxiv.org/pdf/1701.07517.pdf. These guys have approached the problem from the hypervisor standpoint, but the implementation is exactly the same, using the physical memory address of the entry as a tag. There is no date on the article, but judging by the citation list, it could not have been written earlier than 2016.

Sunday, April 19, 2020

statistical and password-based memory protection

I've recently had an interesting online conversation with Terry Lambert (whom I know from the FreeBSD times) about the memory protection that resulted in I think a new and interesting idea, that I want to write down here.

Terry was talking about the statistical memory protection. I'm not sure if he invented it, or at least the term, since I can't find it anywhere else. Terry gives the related links but none of them are exactly about that:

https://en.wikipedia.org/wiki/Capability-based_addressing
CSDL | IEEE Computer Society - Fast Protection-Domain Crossing in the CHERI Capability-System Architecture
Address space layout randomization - Wikipedia
Intro to System Software, Chapter 18 [Iowa State CS]
ERIM: Secure, Efficient In-process Isolation with Protection ... [PDF]
Sharing and Protection in a Single Address Space Operating ... [PDF]
Mondrian Memory Protection - UT CS [PDF]
Scheduling and Load Balancing in Parallel and Distributed Systems

The statistical memory protection works by having a large common address space where the protection is achieved by allocating memory at random addresses, so the address itself serves as a password. Plus the same physical page can be mapped at different addresses with different permissions, so knowing one address would give the write access while another one the read-only access.

As far as I see it, the major benefit of this system would be in the ability to pass the complex data structures linked by pointers between processes. With the address map being common, the pointers would work in any process. This gives cheap and convenient "mailboxes" for exchanging messages between processes.

This would have multiple potential problems too.

One is the “password leakage” that is also mentioned in one of the papers. Once a password gets out, it can be saved and passed around. With the separate addresses for RO and RW access to the same page this could get pretty bad: if one of these pointers in an RO page accidentally refers to the RW address, the permission leaks. And actually the whole passing of pointer-connected structures breaks down because the writer and reader would have to use different pointers (with RO and RW addresses).

Another issue, the memory segments would have to be allocated at the globally unique random addresses, which would have to be a synchronized operation. Considering that the segments have to be spaced far enough form each other, that would consume a sizable number of bits and degrade security of the passwords.

This would also be susceptible to the brute-force attacks: maybe the attack won’t enumerate the whole address space but it can start with a random point and just see what can be found there. It won’t always be lucky, but sometimes it would. Another way of the attack would be to stick incorrect pointers into the shared pages, hoping that the other process reading the pointer from that page would cause some damage.

The brute-force attacks can be frustrated by keeping a fault counter, and killing the process that has too many faults, but it's harder than it seems at first. The fault counters would have to have a scope greater than a process. Because if you kill a process, nothing stops the user from spawning another one. Also nothing would stop the user from spawning a lot of processes, each doing fewer probes than the fault limit and exiting. But on the other hand, with a greater scope, a malicious application would be able to screw the user completely, so drawing this scope boundary won’t be easy. A possible solution for that problem is to introduce a delay in process's scheduling after a fault, and these delays can even be made to increase if user's global rate is high. This would dramatically slow down the attacks without affecting the correctly running code.

There is one more reason for why the fault counting would have issues: if we allow to transfer the pointers through the global memory, there is no way to know whose fault is the faulty pointer, did the current process get it from someone else?

I've come up with an idea to improve on the statistical memory protection, let's call it password-based memory protection: the same can be achieved in a better and cheaper way by associating a long (say, 128 bit) “password” with a page, effectively making the addresses in the “password-protected” pages 192-bit long (64 bit for the address and 128-bit password). There can be separate passwords for reading-only and reading-writing a page. Possibly execute-only too. But the passwords can be kept separate in some context, so the “short” 64-bit part of the address would be the same for the reader and writer. The context can be inherited from the page: when we load a pointer from a page, we set in the CPU register the password part of it equal to the password of the page. This would also allow to catch the malicious addresses in the shared pages by mismatch of the password. And the whole address space can be split into the private and shared parts, with only the shared part using the passwords. This would allow the private segments to allocate pages without getting a global lock on the address space. And the whole 128-bit space can be used for the password.

It would still be randomly susceptible to the brute-force attacks, as well as to the password leakage.

Some progress can be made on the fault-blaming through. If only the pointers without passwords cold be transferred through the password-protected memory then when we read such a pointer, we can implicitly add the password of the protected page to it and immediately verify that the pointer is valid, that this password also matches the page referred by the pointer. If it doesn't, it means that someone tried to slip in an invalid pointer. If our process has read-only access to the page, we can even say for sure that if was another process's fault and not penalize this one (unless of course it mapped the same page twice, and the other mapping is read-write).

Let's consider in more detail how the pointers would work. For the pointers-with-passwords let's use the "far" keyword from the days of MS-DOS. The common (AKA "near") pointers without password would be just pointers without a special keyword.

int *pn; // a common pointer
int far *pf; // a far pointer that includes the password

They generally can't be cast to each other, a far pointer has to be explicitly converted from a near pointer and a password:

pf = make_far(pn, password);

Typically we would use the common pointers in the structures that get passed through the password-protected memory, to store only the address but not password:

 
struct data {
  int *p1;
  int *p2;
};

This would also allow to use the same structure in the normal and password-protected memory.

But when a common pointer gets accessed through a far pointer, it gets treated in a special way: essentially, it becomes implicitly converted to a far pointer.

 
data far *dpf;
data *dpn; 

pn = dpf->p1; // incorrect, won't compile
dpn->p1 = dpf->p1; // incorrect, won't compile
pf = dpf->p1; // correct, implicitly copies the password 
  // from dpf to pf and verifies that pf is valid, throws an
  // exception if invalid
dpf->p1 = pn; // incorrect, won't compile
dpf->p1 = dpn->p1; // incorrect, won't compile
dpf->p1 = pf; // correct, verifies that dpf and pf have
  // the same password (throws an exception if the
  // passwords differ), and drops the password when
  // writing into p1
dpf->p2 = dpf->p1; // correct

It would actually be more correct to say that the reference to a pointer accessed through a far pointer turns implicitly into a reference to a far pointer, and handles the conversion between the kinds of pointers.

This of course wouldn't stop someone from copying pointers as data to or from the password-protected memory via memcpy (though it would of course have to be a special version of memcpy that accepts the far pointers as arguments), but that would be their own problem of forfeiting the hardware help in the protection.

On the assembly level, this can be done with the addition of the password registers in the CPU along with the instructions for accessing the password-protected pages. Something like this:

 
  ; load the address part of the far pointer
  load (r1), r2
  ; load the password part of the far pointer into a password register
  loadpwd (r1+16), pr1
  ; read a value at a password-protected  far pointer
  loadpp r2, pr1, r3
  ; assuming that the read value was a pointer, verify that it was correct,
  ; it must be pointing to a page that uses the same password from pr1
  verifyp r3, pr1, failure_label
  ; or maybe read and verify in one instruction
  loadpv r2, pr1, r3, failure_label
  ; store a far pointer
  store r3, (r5)
  storepwd pr1, (r5+16)

The next development to prevent the brute-force attacks and password leaks is to have the OS to track on the side, access to which segments is granted to the process.

Then the problem of the leaked passwords can be solved by changing them periodically. Instead of keeping the pair (password, address) in the far pointer, place the all the passwords for all the granted segments into a canonical location in the process. Then represent the far pointer as the pair (address of the password, address of the data). Whenever a password needs to be changed, change it in the canonical location (of which the OS would keep track). And when an access exception happens due to a stale password in the CPU register, the kernel can check the access rights of the process against the canonical list. If the rights are present, it can update the password in the register and return to retry the access.

But a consequence of keeping this special section is that the passwords themselves can be stored in a protected section of memory, otherwise inaccessible in the user mode. The password registers in the CPU would have two parts: the address of the password in the special section that can be loaded and stored by the process and the shadow value of the password itself, invisible to the process and used to check the access. Then the users won’t be able to brute-force the passwords at all because they won't be ever able to set the arbitrary passwords, all the password management would happen in the OS.

Then if the passwords are not user-visible, they don’t have to be long any more, 64 bits, or maybe even 32 bits would suffice as long as they are all unique. Another consequence would be that changing the password would not be needed most of the time, its only use would be to revoke someone’s access.

But wait, there is more. If we don't have the special hardware available, instead of checking on every access we can check when mapping the page into the address space of the process. It’s just that the shared pages would be mapped at the same virtual address in every process. So we’ve made a full circle back to the existing solution here. It feels like not disturbing the page table would be more efficient, but actually not: the segment would have to be allocated first anyway before making it available to any process, and if the page map is shared between all the processes, that means potentially disturbing all the processes, not only the ones that will use that segment.

The only other part of the implementation would be the verification that the pointers received in the shared pages are safe, that they point to the pages of he same source. This can be done in user space, allocating a “page id map”, where each page address is direct-mapped to the “source id” of this page (say, 4 bytes each). Look up the ids of the page that contained the pointer and of the pointed page, and verify that the ids are the same. The map itself can be sparse, the unused parts mapped to an all-0 page, so it would be cheap. Some kernel help might be useful for handling the sparseness and making sure that the page id map is consistent with the page table. The compiler support with the "far pointers" would also still be useful for the automated verification, even with the user space solution.

Wednesday, February 19, 2020

Scheduling policy for burst multithreading

Over a year ago I've been reading Eric Raymond's blog post on the limitations of multithreading: http://esr.ibiblio.org/?p=8223

That gave me an idea that on the desktop we routinely have the spare CPUs going to waste. If we could reduce the cost of cross-CPU interaction, that would allow to schedule efficiently even the small snippets of parallel code. And we can reduce this cost by scheduling multiple threads of one process together, even if they have nothing to do - then they would just park the CPU in user space, and wake up quickly, directly in the user space, with already pre-heated cache and TLB. And then we can use the fast low-overhead inter-thread communications directly in the user space.

And then I've realized that another way to look at it is kind of like superscalar execution on a higher level, sort of what they've tried to do in Itanium but better.

Well, this finally ended up in a defense publication:
https://www.tdcommons.org/dpubs_series/2896/

Not sure if it will end up in anything useful but at least the idea is out there and you can read the details of it.

Sergey Babkin on CEP and stuff