Thursday, December 2, 2021

retraining the neural networks

IEEE Spectrum printed an issue on AI, where they've described a way to teach the old neural networks the new tricks without making them forget the old tricks: by marking the more important connections as immutable when training the network on a data set for the new kind of classification.

Which looks like a generalization of the technique of "emergents": there everything below the top 2 layers is declared immutable, and the top 2 layers are re-trained for the new classification from scratch. The idea being that the first training had taught the network what details are important, and then the re-training assembles the new classifier from these details.

But the generalization goes farther: it can look at the weights of the connections, and if a weight is close to 0 (on the range of [-1, 1]), it can't be an important connection, and is a candidate for re-training, making the model learn the new details too. They say in the article that the old classifier degrades somewhat after this re-training, but that's not surprising: the set of the "emergents" has changed during the re-training while the old classifier section is still expecting the old set. It would make sense to freeze the network below the top 2 layers (the "new emergents") and do another training of the old part with that. The newly learned emergents might even be useful to improve the classification of the previously learned categories. They actually talk in the article about doing a periodic re-training with the previously trained subsets of data, but not quite in the same context.
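
To make the masking concrete, here is a minimal numpy sketch of the idea (the threshold and all the other specifics here are my guesses, not from the article):

import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, (4, 4))   # weights of one already-trained layer

# the connections with weights near 0 are deemed unimportant:
# only they get re-trained, the important ones stay immutable
trainable = np.abs(w) < 0.3

grad = rng.uniform(-0.1, 0.1, w.shape)  # gradient from the new data set
w -= 0.01 * grad * trainable            # masked gradient step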

Tuesday, September 7, 2021

"The Practice of Parallel Programming" available as a PDF

It kind of didn't occur to me initially to announce it here, but here we go: I've made the fully edited and typeset version of my book "The Practice of Parallel Programming" available as a PDF. It's exactly the same as the printed version, with a small intro to the Online Edition 1.0 added, and formatted to a page size that is more convenient for printing at home.

Available for download here:

Sunday, August 29, 2021

lexical analysis with regular expressions

Today I've got this simple idea of doing the lexical analysis (i.e. tokenization) in Perl with regular expressions: write all the possible tokens as alternatives in the regular expression, and use Perl's ability to embed code into the expressions to produce the token id. For example:

$s = "++abc"; # string to tokenize
$t = 0; # token ids returned here
while ($s =~ s/^\s*([+][+](?{ $t=1; })|[+](?{ $t=2; })|[a-z]+(?{ $t=3; }))//) {
  print $t, "=", $1, "\n";
}

This looks for the tokens of "+", "++", and a simple lowercase identifier. One caveat I've found is that if one alternative is a prefix of another (such as "+" and "++"), the longer alternative must go first. But otherwise it looks pretty simple, and should be efficient. I'm probably not the first one to discover it, but I've been looking for the various lexer solutions in Perl and Python and haven't seen this one yet. Of course, it would look better in the extended syntax, and with symbolic token names.

Doing the same in Python is a bit more difficult, since it doesn't allow executing code as part of the expression. So the match would produce just a string, and then the token id would be looked up by that string in a dictionary (which gets a little tricky if there could be more than one non-constant lexeme, but I guess those could be sorted out in a second round if needed). Something like this:

import re
tokens = { "++": 1, "+": 2, } # regexp can be auto-generated from reverse-ordered dict keys
lex = re.compile(r'^\s*(' + r'[+][+]' + r'|[+]' + r'|[a-z]+' + r')') # the string to parse
s = "++abc"
m = lex.match(s) # 3 is the token id for the identifiers that are variable
t = tokens[] if in tokens else 3 # consume the token from the string
s = s[m.span()[1]:]

It doesn't do the loop but goes through all the important steps.
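
And for reference, a sketch of how the full tokenizing loop could look (the token ids are as in the fragment above, the function name is made up):

import re

tokens = { "++": 1, "+": 2 }  # constant lexemes, the longer prefixes first
lex = re.compile(r'\s*([+][+]|[+]|[a-z]+)')

def tokenize(s):
    # returns the list of (token id, lexeme) pairs
    result = []
    pos = 0
    while True:
        m = lex.match(s, pos)
        if not m:
            break
        lexeme = m.group(1)
        result.append((tokens.get(lexeme, 3), lexeme))  # 3 = identifier
        pos = m.end()
    return result

print(tokenize("++abc + x"))  # [(1, '++'), (3, 'abc'), (2, '+'), (3, 'x')]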

Thursday, June 17, 2021

on Rust

I've attended a talk on the Rust programming language, and I've had a couple of realizations.

1. The new and interesting thing in Rust is its enforcement of "borrowing" vs "consumption" in the references, and the separation of the code that strictly follows the straight-jacketed "safe" rules from the "unsafe" code. The rest of it seems to be substandard compared to C++.

I think that most real code will end up having both the "safe" and "unsafe" parts, because the purely "safe" version is conducive only to writing the COBOL-like programs. But the separation forces you to encapsulate the unsafe parts, keep them separate and small, and then the rest of the code would just rely on them instead of spreading the unsafety all over the place.

But can the same concept be imported into C++? I think it can. There actually already are the syntactic means to express the borrowing and consumption in C++. Borrowing is "const &", and consumption is "&&".  So all we need is a way to ban the plain "&" and "*" in the parts of the code that we decree "safe".

2. A thing that looks annoying in Rust is its lack of classes: it has traits instead. In this respect it's similar to Go (just as annoying) and to the C++ templates (in an implicit way, and with their own even worse challenges). It obviously works, but in an annoying way.

So what is annoying? That the definitions of Rust functions end up containing a list of traits for each argument: Trait1 + Trait2 + Trait3. The same applies to Go. C++ is actually less annoying because it doesn't even require you to define the traits as such, it just uses them in templates. It pays for that by very non-obvious error messages when a trait is missing, but at least you don't have to drag this list around.

But what is a class (or an interface)? An interface is a list of methods. And each method is also a trait. So when we define an interface, what we really do is write

InterfaceA = TraitA1 + TraitA2 + TraitA3

If we could do all this "trait arithmetic", all the annoyance would go away. I wonder if Rust already supports something like that, and it just wasn't mentioned in the talk?

Monday, March 22, 2021

on automatic alerting

A few years ago I've read the book "Thinking, Fast and Slow" by Daniel Kahneman. One of the things that struck a chord for me there was the story of how the smaller counties have the cases of both higher and lower occurrence of the diseases than the larger counties. This is very much the same as what we see with the automatic alerting: we set up an alert for, say, request processing latency, and at night there is only one request during a time period, with an anomalously high latency, and it triggers an alert (and we solve this with kludges like "don't alert if there are less than 20 points").

It looked like the number of points needs to be taken into account, with the fewer points treated in a more relaxed way. But I couldn't figure out how to put it together.

Recently I've been reading some books on probability, and it looks like the solution happens to be well-researched: we have to treat the number of points as a sample from the full larger number of points. So if we get 1000 hits per unit of time during the day and only 50 hits during the nighttime, we can really see these 50 hits as a sample of the full traffic, with the sleeping people serving as a filter. And then instead of looking at "is the median over N" we should look at a different question: given the sample, what is the probability (i.e. confidence range) that the median of the full set of data would have been over N? That should solve the issue with the false alerts pretty nicely in an automatic way. I wonder why nobody I've heard of has already done this.
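
For illustration, here is a minimal Python sketch of that check done as a sign test (the function and the numbers are mine, not from the books):

from math import comb

def median_above_pvalue(sample, threshold):
    # If the true median of the full traffic were at or below the
    # threshold, each sampled point would be above it with
    # probability <= 1/2. Compute the binomial tail P(K >= k).
    n = len(sample)
    k = sum(1 for x in sample if x > threshold)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(median_above_pvalue([950], 500))               # 0.5, no alert
print(median_above_pvalue([600]*45 + [400]*5, 500))  # ~2e-9, alert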

Friday, December 25, 2020

time for more analog computers?

In case you haven't heard about them, the analog computers were unlike the digital computers. They were the electrical models assembled to simulate other real-world systems. For example, if you wanted to model a car suspension, you'd have an oscillator circuit to simulate a spring, and a resistor to simulate a shock absorber. They were purpose-assembled for every specific application.

Well, now we have another thing that purports to simulate the real-world systems: the units in the neural networks that are supposed to be like the real neurons. The race to cram more "neurons" into the neural networks has led to the values in the "neurons" being represented as very short 8-bit fixed-point numbers. And the models still work. Which means that the "neurons" are quite good at tolerating the small imprecisions in the data, which is the typical issue with the analog signals. But in return the analog signals are much more resistant to large disruptions: i.e. if you change one single highest bit in a digital representation of a number, this small disruption changes the value of the number by half its whole range, something that doesn't happen with the analog signals. There already are ideas of improving the reliability of computations by making them more resistant to the large errors by accepting the small errors (I think I've read about them in IEEE's Spectrum).
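
A quick Python illustration of that asymmetry, with a made-up 8-bit fixed-point format:

# quantize a value in [-1, 1) to 8-bit fixed point
def quantize(x):
    return max(-128, min(127, round(x * 128)))

x = 0.7
q = quantize(x)       # 90
print(q / 128 - x)    # quantization error ~0.003, tolerable

# flipping the highest bit of the byte moves the value by half the range
b = (q + 256) % 256   # the two's complement byte
b ^= 0x80             # a one-bit corruption
q2 = b - 256 if b >= 128 else b
print(q2 / 128 - x)   # error ~ -1.0, catastrophic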

So, the next logical step: why not make each "neuron" in a neural network an analog machine? Instead of 8 bits, have one analog signal, and do all the multiplication and addition on the analog signals. It looks like the perfect match, a combination of these two ideas. Has anyone already done that? If not, why not? Is there a way to do that? 

P.S. It turns out, people are already doing this. I've recently read in Spectrum that people are using photonic computations for the machine learning, and that the photonic computations are limited to this niche because of their limited precision.  I can't really think of a reason for the limited precision other than these computations being analog.

Sunday, September 27, 2020

a book on probability

I've stumbled upon the book "Introduction to probability" by Anderson, Seppalainen, and Valko. It's a college textbook. 

Right in the second chapter it gives a very good introduction to the Bayesian formula, deriving it from the descriptions of weights of events, in the same way as I struggled to reinvent on my own. And right there it also provides the "General version of Bayes' formula" for the enumerated events, that took me so much time to reinvent on my own. Well, another case of "everything is invented before us" :-) If only we knew what books to read at the right time :-) On the other hand, I'm not sure if I would have recognized the importance of that formula if I didn't have a use for it in mind first.

In case you wonder, here is the formula:

If B1...Bn are events that partition the sample space, then for any event A we have:

P(A) = sum from i=1 to n of (P(A & Bi)) = sum from i=1 to n of (P(A|Bi)*P(Bi))


P(Bk|A) = P(A & Bk) / P(A) = P(A|Bk)*P(Bk) / (sum from i=1 to n of (P(A|Bi)*P(Bi)))
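
And a small numeric example of the formula at work (the numbers are made up): three disjoint sources of requests B1, B2, B3 and an "error" event A.

p_b = [0.5, 0.3, 0.2]            # P(Bi), a partition of the sample space
p_a_given_b = [0.01, 0.05, 0.2]  # P(A|Bi)

p_a = sum(pa * pb for pa, pb in zip(p_a_given_b, p_b))
print(p_a)  # P(A) = 0.06

# P(Bk|A): which source most likely caused the error?
for k in range(3):
    print(k + 1, p_a_given_b[k] * p_b[k] / p_a)  # 0.083..., 0.25, 0.666...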

Wednesday, September 23, 2020

cheap malloc-ed stacks

I've accidentally realized that the disjointed stacks (ones composed of the separately allocated chunks) can be pretty inexpensive if they get allocated in fixed pages, and each stack frame is guaranteed to be less than a page (which is fairly common in, say, the OS kernel). And the stack pages can be made larger than the physical pages.

All that the function needs to do at the start of the call is check that there is enough room to the end of the page. Which is easy enough by checking that (SP & (PAGE_SIZE-1)) < (PAGE_SIZE - FRAME_SIZE), with PAGE_SIZE being a power of two. Well, taking a page from the RISC book, we'd also want to reserve some room at the end of the page for the leaf functions that could then skip the whole check, but it doesn't change the general approach.
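
Spelled out in Python for clarity (assuming, as the inequality implies, that the stack grows up within a page):

PAGE_SIZE = 4096   # the stack page size, a power of two
FRAME_SIZE = 256   # the frame size of this particular function

def frame_fits(sp):
    # the frame fits if the offset within the page leaves
    # at least FRAME_SIZE bytes to the end of the page
    return (sp & (PAGE_SIZE - 1)) <= PAGE_SIZE - FRAME_SIZE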

If a new page needs to be allocated, it can be taken from a pool of free stack pages (a linked list), and would be linked to the previous page, then the frame created in it (copying the old stack pointer onto the new stack, as in the "classic" Intel x86 sequence of "push BP" that saves the pointer to the previous stack frame). The next page can also be found already linked to the current page from the last time, and then the allocation can be skipped. The function return then doesn't need to do anything special, just "pop SP" and return. And then if the thread blocks, the scheduler can return the unused pages beyond the last active frame back to the pool.

There obviously is some overhead, but not a huge amount of it.

Saturday, May 9, 2020

consistent time and loops

It's obvious that the graph loops cannot be treated like the rest of the links with consistent time, or there would be no pipelining: we'd be forced to wait for an update from the looping link every time we send a record down the loop. Or we might end up sending a whole lot of records down the loop before reading back from the looping link, accumulating them all in the queue of the looping link.

So what is the right thing to do? I think it depends on the circumstances, on what we're trying to achieve. I can think of the following uses:

1. Record the processing at real time and be able to reproduce it exactly in an accelerated playback.

2. Process records in accelerated time to start with, but as in real time don't care too much about the exact outcome on the first run as long as it's within the reasonable constraints. (But be able to replay it again in exactly the same way).

3. Process records in accelerated time to start with, and make sure that they get treated in an exactly consistent way, any run from scratch producing the exact same result.

The case (1) can be reasonably resolved by treating the looping links like the inputs from the external systems, where we don't particularly care about the exact synchronization: re-timestamping the incoming records, logging them, and then processing with the new timestamp. On replay, read the logged records with their timestamps, and process them in the same way.

A few more words about how the re-timestamping would work: tag the record in the log with both the original timestamp and the new one, and use the new timestamp in the further processing. A simple-minded way to handle the replay would be to just read the log, and ignore the records sent through the looping link on the replay (since they presumably should be the same). A smarter way would be to receive the records from the link and compare them with the records in the log (including the original timestamp). If they match, all is well, the new timestamp from the log can be used. If they diverge then something somewhere has changed, and either the replay should be aborted or some mitigation should be done, perhaps in the same way as for the case (2).

So, how to solve the case (2)? The trouble there is that the model's clock gets driven by the incoming records, that get queued, allowing the clock to go far ahead before the records get processed and arrive back through a looping link. The solution seems to be to limit how far the records can be delayed at the re-timestamping point. Suppose we choose some time interval that is large enough to let the records go through the loop (to allow the queues to do an efficient pipelining) but not outlandishly large, let's call this interval L. Say, 1 second or 0.1 second. Set it as the limit. Then when a record arrives through the looping link (such as, in the figure below, B receiving a record back from D) with a timestamp T, we re-stamp it with a timestamp that is the smaller of: (T + L) and the lowest timestamp of the next record in the incoming queues (going up the links as usual to get the timestamp of the record they would send next, and since the records from the looping links are processed first, this in effect means "go just before that record").

           | A |
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |

This would essentially allow the looping links to "run behind the clock" by up to L. So if B receives from A a record with timestamp T2, it can treat the looping link "almost normally", by asking it if it has any earlier records, but using the time limit (T2 - L) instead of T2 (as it would for the normal "downstream" links). This is very much the same as the case (1), only it limits by how much the clock could run ahead. This solution would actually work fine for both the cases (1) and (2). And since it preserves the high priority of the looping links, it would prevent them from buffering too much data.
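
In pseudo-Python, the re-timestamping rule for the looped records could be sketched like this (the names are mine):

def restamp(t_looped, l_limit, upcoming):
    # t_looped: the original timestamp T of the record from the looping link
    # l_limit: the interval L
    # upcoming: the lowest timestamp of the next record in the incoming
    #   queues, or None if they are all empty
    if upcoming is None:
        return t_looped + l_limit
    return min(t_looped + l_limit, upcoming)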

It can also be stretched to cover (3) by always using (T + L) for the looped records, and not the optimistic minimum. The trouble there is that a large number of records might enter the loop in the interval L, and they will collect on the looping link's queue. But it should be fine for the small test models with a small amount of data.

I've thought about fixing this issue by using a semi-blocking queue on the looping link: essentially compose the looping link from a blocking queue with the usual limit (and special logic) followed by a non-blocking queue. The re-timestamping would happen when moving the records from the blocking queue to the non-blocking one. If the non-blocking queue is empty, treat the records from the blocking queue as having the timestamps behind their original ones by L, and move them to the non-blocking queue when the time (T + L) is reached. Unless the loop is about to go deadlocked. Then instead of deadlocking, move the first record from the blocking queue on the looping link to the non-blocking queue, re-stamping it with the lowest timestamp of the next record in the incoming queues as in the solution for (2). Since the total depth of the queues in a loop is fixed for a given model, that would make the ordering of the records fully predictable, even if they get processed before (T + L).

But this fix has its own trouble: for consistency it requires that the loop must be fully deadlocked before breaking this deadlock. Not just that the queues to and from the top node are full, but that every queue in the loop is full. Which is not easy to track. Triceps already finds the loops between the Trieads, so that part is not a problem, but tracking at runtime how the queues become full, with all the possibilities of intertwined loops, might add a good deal of overhead. So it might not be a practical fix.

This might need more thinking or maybe it's just a problem that doesn't need to be fixed.

In the meantime, there is one more aspect to the loops. A looping link might be used essentially to send a timing event, telling the nodes upstream to re-examine their state. In the example below, the looping link from D to B would be driven by the scheduler on D:

           | A |
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+    scheduler
   |       \    /        |
   |        \  / /+------+
   |         V V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |

In this case adding the delay L looks weird, since the time of the scheduler event might already be well past L from the original record that triggered it. It would make sense to make a special scheduler driven by a looping link instead:

           | A |
scheduler  /   \
   ^    | /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |

This scheduler's notion of time would be driven by D but it would generate the events for B. With a special limitation that the events must be scheduled by at least L into the future. Any lower delays would be rounded up to L.

Oh, and one more thing about the models that would run predictably every time from scratch: their infinite-precision clock can't use a single common record counter for the high-precision part, because that would cause race conditions. Instead each node would have to keep its own record counter, and the low bits of the timestamp would consist of (node_id, per_node_counter). Which might actually be for the best, since it would remove the contention point of the common counter. The schedulers can be given their own node ids. And to think of it, the schedulers should probably be given the lower ids than the input nodes, because the timed events should probably be processed before the records coming from outside at the same time.

Tuesday, May 5, 2020

consistent time

I've done some more thinking on the issues of consistent time in Triceps models, and came up with a design. It's not a final design, more of design notes, so that I won't forget them until I get to an implementation. But I think it's quite interesting.

Let's start with the models that are the unidirectional graphs, without any loops, they are much easier to reason about.

The basic premise is that the records need to be timestamped when they enter the model, and then processed in the order of these timestamps. The time issues tend to be highlighted in the diamond-shaped graphs, so let's start with one:

           | A |
           /   \
          /     \
          V      V
        +---+  +---+
        | B |  | C |
        +---+  +---+
           \    /
            \  /
             V V
           |  D  |

An input record arrives, gets stamped, goes to the node (i.e. a Triead, a Triceps thread, but let's talk in more generic terms for now) A, where it can cause the records to be sent to nodes B and/or C. The records produced by A would have the same timestamp as the incoming record. B and C in turn may produce their own records (which again would get the same timestamp) and send them to D.

When two records with the same timestamp arrive from the same node, they have the natural queuing order, and they can be predictably processed in that order. But what if they arrive from different nodes, say B and C? The result of processing might depend on the order, which basically means that we have to define priorities between all the inputs. Here D has the inputs from B and C: one of them (say, B) would get a higher priority, and its records will be processed before the records with the same timestamp from the other input (C).

Well, that's easy to say, but what should D do when it receives a record with timestamp T from C, and nothing yet from B? For how long should it wait for something from B before it decides that it can process the record from C?

One way is to always send the timestamp metadata along each path even if there are no actual records. But this looks like a lot of unnecessary overhead. And it also has another potential issue: suppose we have two external inputs that get timestamped. The input 1 gets a lot of records for a period of time, the input 2 gets none. How often should the input 2 send its timestamp metadata even if it sends no data records? Well, maybe there is another solution.

Another way to resolve it would be for D to ask B, do you ever plan to send me anything with timestamp T or earlier? If not then D can safely go and process the record from C. If yes then D would have to wait for a notification from B. So the request from D to B can be asynchronous. If B is ready, it will send a timestamp metadata as a callback to D right away, if not then it will send either timestamped records or a bare timestamp later.

But B might not be able to answer this directly. It might have to consult A with the same question first. And A would have to consult the timestamper on its input. If A has two inputs, it would consult both timestampers.

Well, these callbacks would be very similar to just sending the timestamp metadata records all the time, and there would be additional delays involved with the requests upstream. But there are positive things too:

* It tells us how often the inactive timestamper would get queried and send its metadata: more or less, for every record that comes in through any timestamper.

* In reality, not every node in the model would get input for every incoming record. If none of the nodes at some level produce records, the propagation would stop there and won't go farther downstream. And when the requests are sent upstream, they also would reach only to the timestampers that could send input to this node.

* The requests can be of two types: that the time T has been reached and that the time T has been passed. Because the node D prefers B over C, when it gets a record from C with timestamp T, it would have to make sure that B is past T before processing that record. But if it gets a record from B with timestamp T, knowing that C is at T (and not yet past it) would be enough to start processing that record.

* The requests can be amortized. I.e. if the node D has records with timestamps T1 and T2 queued from node C, it can ask B once to send the timestamp updates until it reaches past T2. In the same way, if D asks B whether it's past T2 and B knows that it's already past a later timestamp T3, it can send the notification about T3 right away, and D will be able to use this knowledge later.

* Finally, this communication doesn't have to go through the regular queues. The more efficient data structures can be used to propagate the timestamps between the nodes. The timestamp information is cumulative: once B has notified that it had processed the input up to the timestamp T3, there is no point in keeping the information about it processing the timestamps T1 and T2, they're implied in T3.

* But if the communication goes across the network, the round trip time can become prohibitively expensive for the upstream requests. So the downstream timestamps should be sent on their own, although the cumulative approach would still apply. Of course, if there are multiple streams sent over the network between two systems, all these streams should just be sent across using the same sequencer, so there would be no issue with synchronization between them. But there would still be the synchronization issue between multiple external systems, and between them and the local scheduled events. Yet another possible solution for that would be to just restamp them with the local time if the relative ordering doesn't matter.

Well, let's move on to the next problem. When can a node say that it's past a particular timestamp? It depends on the clock granularity. As long as the clock is at T, there might be more records coming in, so it can't say that it's past T yet. So if the clock granularity is 1 millisecond, we'd be stuck waiting for that millisecond to end. What we need is a clock with infinite granularity. It can be simulated by keeping a single global record counter, and building the timestamp as a pair (time, count). The counter would be like the lower bits of the timestamp. This would also allow to amortize the calls to read the clock: if we have 100 records queued up at the timestamper, we can just increase the global counter by 100, and assign each record the same time and the next count in the allocated sequence.
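
A sketch of such a timestamper (the structure here is my guess at one possible shape):

import time
from itertools import count

class Timestamper:
    counter = count()  # the single global record counter

    def stamp_batch(self, records):
        # read the clock once, then give each queued record the same
        # time and the next value of the global counter
        t = time.time()
        return [((t, next(Timestamper.counter)), r) for r in records]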

Why not forget the time at all and move on to just the counter? We'd need time to be able to replay the records later at an accelerated pace, and still handle the time-based events.

Basically, the timed scheduler would run as a timestamper. And yes, it would also have to use the same clock, and do the same (time, count) pairs allocated from that clock. If D has some time-based events, the diagram would become:

        timestamper <----------------------------- clock
             |                                       |
             V                                       |
           +---+                                     |
           | A |                                     |
           +---+                                     |
           /   \                                     |
          /     \                                    |
          V      V                                   |
        +---+  +---+                                 |
        | B |  | C |                                 |
        +---+  +---+    scheduler/timestamper <------+
           \    /        +
            \  //+-------+
             V VV
           |  D  |

This would mean that when D gets a record from B, it would have to check the time not only in C but also in the timestamper. The timestamper can also do the amortization by telling the time of its next scheduled event. But that's a bit tricky: first, there would have to be some count part of the timestamp associated with that future time. There are two ways to go about it: either first tell the time without the count part (essentially, with it at 0), or allocate the count part in advance, when the event gets scheduled, and then just use that count when the timer fires. Second, it might happen that the processing of an incoming record would schedule a new, earlier event. But that's not a problem because the events can't be scheduled in the past. Since the scheduler would run as a part of D and synchronous with it, it could update its estimation and move the next event forward.

It would also have an interesting interaction with the replay: if we want to replay a timestamped log of incoming events and get the model to behave in exactly the same way as before, the scheduler events would have to happen at exactly the same times too. Which means them getting the exact same count part of the timestamp, and that normally won't be stable due to the races for the clock. But it can be resolved by recording the exact time of all the scheduled events into the same log and reusing it during the replay, the schedulers receiving their timestamps not from the clock but from the log, same as the incoming records.

What if we want to change the model between the reruns? Well, in this case we can discard the log for the schedulers, and just generate the count values for them afresh, writing them into the new log. The running sequence would be slightly different than the first time, but since the model has changed, it wouldn't matter. Or we could even reuse the part of the log that is still applicable, merge it with the events from the new schedulers, and write into the new log. Either way, once the new log gets written, it can be reused again to produce the exact same result on the next replay.

Another interesting thing about the replay is: how do we transition from a replay to the live performance? If we have a recorded log from yesterday, want to replay it and then continue with today's data, what do we do with all the scheduled events that would have fired overnight? This basically suggests that we need to have some kind of "time reset events" that would be used when the time gets moved abruptly. It would allow the application logic to properly reset all the events that have been scheduled for the meantime: either just cancel them or execute them once, depending on the application semantics.

I think this should work quite well for the graphs without loops. The loops add a good deal of trouble. More on them later.

Sunday, May 3, 2020

TLB coherence

The discussion described in the last post got me thinking: why don't we have hardware consistency for the page address translation cache (TLB), done through the same bus transactions as the usual cache snooping? In a multi-threaded environment, invalidating the TLB across all the CPUs is a pain that requires the cross-CPU interrupts.

In the simplest case the CPU that drives a TLB invalidation could just do a bus transaction that specifies a virtual address, and every CPU would invalidate its TLB entry that matches this address. If the user processes use the address space randomization, there would be few conflicts between processes, and the kernel address space is common for everyone. In a more complicated way, the virtual address can be accompanied by the process or memory protection id, so that only the pages of the CPUs running that process would be invalidated. And finally, each page translation entry has a unique identifier: the physical memory address where the CPU had read it from. If the TLB keeps this address as a tag (keeping enough bits to identify the page of the page directory is enough, since the rest of the bits are determined by the virtual address), it can be used as a tag in the TLB invalidation transaction, and only the exact TLB entries matching this tag would be invalidated.

Which made me wonder if anyone else had thought of this before. And after a bit of searching, it has turned out that they did: These guys have approached the problem from the hypervisor standpoint, but the implementation is exactly the same, using the physical memory address of the entry as a tag. There is no date on the article, but judging by the citation list, it could not have been written earlier than 2016.

Sunday, April 19, 2020

statistical and password-based memory protection

I've recently had an interesting online conversation with Terry Lambert (whom I know from the FreeBSD times) about the memory protection that resulted in I think a new and interesting idea, that I want to write down here.

Terry was talking about the statistical memory protection. I'm not sure if he invented it, or at least the term, since I can't find it anywhere else. Terry gives the related links but none of them are exactly about that:

The statistical memory protection works by having a large common address space where the protection is achieved by allocating memory at random addresses, so the address itself serves as a password. Plus the same physical page can be mapped at different addresses with different permissions, so knowing one address would give the write access while another one the read-only access.

As far as I see it, the major benefit of this system would be in the ability to pass the complex data structures linked by pointers between processes. With the address map being common, the pointers would work in any process. This gives cheap and convenient "mailboxes" for exchanging messages between processes.

This would have multiple potential problems too.

One is the “password leakage” that is also mentioned in one of the papers. Once a password gets out, it can be saved and passed around. With the separate addresses for RO and RW access to the same page this could get pretty bad: if one of the pointers in an RO page accidentally refers to the RW address, the permission leaks. And actually the whole passing of pointer-connected structures breaks down, because the writer and the reader would have to use different pointers (with the RW and RO addresses respectively).

Another issue: the memory segments would have to be allocated at the globally unique random addresses, which would have to be a synchronized operation. Considering that the segments have to be spaced far enough from each other, that would consume a sizable number of bits and degrade the security of the passwords.

This would also be susceptible to the brute-force attacks: maybe the attack won’t enumerate the whole address space but it can start with a random point and just see what can be found there. It won’t always be lucky, but sometimes it would. Another way of the attack would be to stick incorrect pointers into the shared pages, hoping that the other process reading the pointer from that page would cause some damage.

The brute-force attacks can be frustrated by keeping a fault counter, and killing the process that has too many faults, but it's harder than it seems at first. The fault counters would have to have a scope greater than a process. Because if you kill a process, nothing stops the user from spawning another one. Also nothing would stop the user from spawning a lot of processes, each doing fewer probes than the fault limit and exiting. But on the other hand, with a greater scope, a malicious application would be able to screw the user completely, so drawing this scope boundary won’t be easy. A possible solution for that problem is to introduce a delay in process's scheduling after a fault, and these delays can even be made to increase if user's global rate is high. This would dramatically slow down the attacks without affecting the correctly running code.

There is one more reason why the fault counting would have issues: if we allow transferring the pointers through the global memory, there is no way to know whose fault the faulty pointer is: did the current process get it from someone else?

I've come up with an idea to improve on the statistical memory protection, let's call it password-based memory protection: the same can be achieved in a better and cheaper way by associating a long (say, 128 bit) “password” with a page, effectively making the addresses in the “password-protected” pages 192-bit long (64 bit for the address and 128-bit password). There can be separate passwords for reading-only and reading-writing a page. Possibly execute-only too. But the passwords can be kept separate in some context, so the “short” 64-bit part of the address would be the same for the reader and writer. The context can be inherited from the page: when we load a pointer from a page, we set in the CPU register the password part of it equal to the password of the page. This would also allow to catch the malicious addresses in the shared pages by mismatch of the password. And the whole address space can be split into the private and shared parts, with only the shared part using the passwords. This would allow the private segments to allocate pages without getting a global lock on the address space. And the whole 128-bit space can be used for the password.

It would still be randomly susceptible to the brute-force attacks, as well as to the password leakage.

Some progress can be made on the fault-blaming though. If only the pointers without passwords could be transferred through the password-protected memory, then when we read such a pointer, we can implicitly add the password of the protected page to it and immediately verify that the pointer is valid, that this password also matches the page referred by the pointer. If it doesn't, it means that someone tried to slip in an invalid pointer. If our process has read-only access to the page, we can even say for sure that it was another process's fault and not penalize this one (unless of course it mapped the same page twice, and the other mapping is read-write).

Let's consider in more detail how the pointers would work. For the pointers-with-passwords let's use the "far" keyword from the days of MS-DOS. The common (AKA "near") pointers without password would be just pointers without a special keyword.

int *pn; // a common pointer
int far *pf; // a far pointer that includes the password

They generally can't be cast to each other, a far pointer has to be explicitly converted from a near pointer and a password:

pf = make_far(pn, password);

Typically we would use the common pointers in the structures that get passed through the password-protected memory, to store only the address but not password:

struct data {
  int *p1;
  int *p2;
};

This would also allow to use the same structure in the normal and password-protected memory.

But when a common pointer gets accessed through a far pointer, it gets treated in a special way: essentially, it becomes implicitly converted to a far pointer.

data far *dpf;
data *dpn; 

pn = dpf->p1; // incorrect, won't compile
dpn->p1 = dpf->p1; // incorrect, won't compile
pf = dpf->p1; // correct, implicitly copies the password 
  // from dpf to pf and verifies that pf is valid, throws an
  // exception if invalid
dpf->p1 = pn; // incorrect, won't compile
dpf->p1 = dpn->p1; // incorrect, won't compile
dpf->p1 = pf; // correct, verifies that dpf and pf have
  // the same password (throws an exception if the
  // passwords differ), and drops the password when
  // writing into p1
dpf->p2 = dpf->p1; // correct

It would actually be more correct to say that the reference to a pointer accessed through a far pointer turns implicitly into a reference to a far pointer, and handles the conversion between the kinds of pointers.

This of course wouldn't stop someone from copying pointers as data to or from the password-protected memory via memcpy (though it would of course have to be a special version of memcpy that accepts the far pointers as arguments), but that would be their own problem of forfeiting the hardware help in the protection.

On the assembly level, this can be done with the addition of the password registers in the CPU along with the instructions for accessing the password-protected pages. Something like this:

  ; load the address part of the far pointer
  load (r1), r2
  ; load the password part of the far pointer into a password register
  loadpwd (r1+16), pr1
  ; read a value at a password-protected  far pointer
  loadpp r2, pr1, r3
  ; assuming that the read value was a pointer, verify that it was correct,
  ; it must be pointing to a page that uses the same password from pr1
  verifyp r3, pr1, failure_label
  ; or maybe read and verify in one instruction
  loadpv r2, pr1, r3, failure_label
  ; store a far pointer
  store r3, (r5)
  storepwd pr1, (r5+16)

The next development to prevent the brute-force attacks and password leaks is to have the OS track on the side which segments each process has been granted access to.

Then the problem of the leaked passwords can be solved by changing them periodically. Instead of keeping the pair (password, address) in the far pointer, place all the passwords for all the granted segments into a canonical location in the process. Then represent the far pointer as the pair (address of the password, address of the data). Whenever a password needs to be changed, change it in the canonical location (of which the OS would keep track). And when an access exception happens due to a stale password in the CPU register, the kernel can check the access rights of the process against the canonical list. If the rights are present, it can update the password in the register and return to retry the access.

But a consequence of keeping this special section is that the passwords themselves can be stored in a protected section of memory, otherwise inaccessible in the user mode. The password registers in the CPU would have two parts: the address of the password in the special section that can be loaded and stored by the process and the shadow value of the password itself, invisible to the process and used to check the access. Then the users won’t be able to brute-force the passwords at all because they won't be ever able to set the arbitrary passwords, all the password management would happen in the OS. 

Then if the passwords are not user-visible, they don’t have to be long any more, 64 bits, or maybe even 32 bits would suffice as long as they are all unique. Another consequence would be that changing the password would not be needed most of the time, its only use would be to revoke someone’s access.

But wait, there is more. If we don't have the special hardware available, instead of checking on every access we can check when mapping the page into the address space of the process. It’s just that the shared pages would be mapped at the same virtual address in every process. So we’ve made a full circle back to the existing solution here. It feels like not disturbing the page table would be more efficient, but actually not: the segment would have to be allocated first anyway before making it available to any process, and if the page map is shared between all the processes, that means potentially disturbing all the processes, not only the ones that will use that segment. 

The only other part of the implementation would be the verification that the pointers received in the shared pages are safe, that they point to the pages of the same source. This can be done in user space, by allocating a “page id map”, where each page address is direct-mapped to the “source id” of this page (say, 4 bytes each). Look up the ids of the page that contained the pointer and of the pointed page, and verify that the ids are the same. The map itself can be sparse, with the unused parts mapped to an all-0 page, so it would be cheap. Some kernel help might be useful for handling the sparseness and making sure that the page id map is consistent with the page table. The compiler support with the "far pointers" would also still be useful for the automated verification, even with the user space solution.

Wednesday, February 19, 2020

Scheduling policy for burst multithreading

Over a year ago I've been reading Eric Raymond's blog post on the limitations of multithreading:

That gave me an idea that on the desktop we routinely have the spare CPUs going to waste. If we could reduce the cost of cross-CPU interaction, that would allow to schedule efficiently even the small snippets of parallel code. And we can reduce this cost by scheduling multiple threads of one process together, even if they have nothing to do - then they would just park the CPU in user space, and wake up quickly, directly in the user space, with already pre-heated cache and TLB. And then we can use the fast low-overhead inter-thread communications directly in the user space.

And then I've realized that another way to look at it is kind of like superscalar execution on a higher level, sort of what they've tried to do in Itanium but better.

Well, this finally ended up in a defense publication:

Not sure if it will end up in anything useful but at least the idea is out there and you can read the details of it.

Monday, September 2, 2019

networking & financials

Looking at the Federal Reserve rates,

we can observe that once in a while they suddenly decide to raise the rates quickly, each time followed by a recession. As the people who caused the Great Depression said, they "fought to stop the speculations", and succeeded, bringing the whole economy down with them.

But the graph overall looks very reminiscent of what we see in the congestion control protocols. Except that the congestion control protocols act more sensibly: increase the transmission speed slowly, but when a catastrophe is detected, drop it sharply. It would be interesting to build an economic model and use the congestion control protocols to regulate the refinancing rate in it, starting with the classic TCP. I suspect that it might be able to do a lot better than Fed's current tea-leaf reading, since the Fed seems to be doing things backwards: raising the rates sharply and dropping them slowly. Since the catastrophes happen when the rates get too high, it would make a lot more sense to limit the rate of increases to something like 0.25% (25bp) per year, to creep up on that catastrophe slowly, detect it early in the making, and then react sharply.
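
In the congestion control terms, that policy is the classic additive-increase/multiplicative-decrease; a toy sketch of it:

def next_rate(rate, crisis_detected, step=0.0025, cut=0.5):
    # creep up by at most 25bp per period, drop sharply on trouble
    if crisis_detected:
        return rate * cut
    return rate + step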

Tuesday, July 9, 2019

graph equivalence 7: why it DOESN'T work, and a conjecture


I've got around to reading a bit more about the graph equivalence (i.e. isomorphism) problem. Looks like the state of the art algorithm is called "nauty". I don't understand it yet, I haven't found a straightforward enough description nor understood the complex kind of description. But based on the introductory page, some things are easy to notice.

First, they use the same signature-based approach for comparison of the graphs, they call it the "canonical labeling".

Second, they use the same approach of splitting the set of nodes into the equivalency classes, they call it "coloring" and "partition refinement". Which is good, this means that I've reinvented something that is known to work in general.

Third, right that page contains an example of a graph on which my algorithm doesn't work. It's the pink one in the section "Refinement: equitable partition". Here is a copy, with names assigned to the nodes:

.              -----A-----
.             /     |     \
.            /      |      \
.           B-------+------ H
.           |       |       |
.           |       |       |
.           C-------+-------G
.           |       |       |
.           |       |       |
.           D-------+-------F
.            \      |      /
.             \     |     /
.              -----E-----

I've been starting with the premise that the nodes would have tags, and the links will be directional, but with a basic unlabelled non-directional graph the things are much harder. All you've got to start with is the number of links in every node. If there is some asymmetry, it will be used to unravel the dependencies. But if every node has the same number of links then the logic is stuck, and as this example shows, it's possible to make all the nodes have the same number of links but not be all symmetrical.

Before I've read further about the existing algorithm, I want to take another independent stab at the solution. As I've described in the Part 4, I've started with the idea of describing the graph topology in terms of the spanning trees, rooted at each of the nodes, and then tried to simplify it with the hashes. This simplification has turned out to be wrong. So let's return to the spanning trees (with redundant links on the side), and take another approach to avoid the circular dependency. When computing the equivalency class (AKA "hash" AKA "color") of each node on each top-level iteration of the algorithm, instead of looking at the immediate neighbor nodes, let's take this node as the root of the spanning tree and look at its relation with each other node by building a path from the root to that node (if the graph is disconnected and no path exists, take an empty path, or just don't enter any path for that node). To account for the possible redundant links, we could find multiple paths: remove all the links but one from the destination node, find a path, then return to the original graph, and remove all but a different single link, find another path, and so on.

The path in general would include the tags on the nodes, the number of links connected to them (in the original graphs), and so on, to make it more distinctive. But if the nodes are untagged (initially belonging to the same equivalence class), and all have the same number of links, the only distinction between the paths would be their length.

Let's build the paths:

Root A:
  Node B can be reached via the paths:
    using the link AB: AB - 1 step
    using the link HB: AHB - 2 steps
    using the link CB: AHGCB or AEDCB - 4 steps
  Node C can be reached via the paths:
    ABC - 2 steps
    AHGC - 3 steps
    AEDC - 3 steps
  Node D can be reached via the paths:
    AED - 2 steps
    ABCD - 3 steps
    AEFD - 3 steps
  Node E can be reached via the paths:
    AE - 1 step
    ABCDE - 4 steps
    AHGFE - 4 steps
  Node F is similar to D
  Node G is similar to C
  Node H is similar to B

So overall the set of paths going from the root A can be described in the shorter form as:
  2x(1, 2, 4); (1, 4, 4); 4x(2, 3, 3)

Root B (using the shorter notation):
  Node A: 1, 2, 4
  Node C: 1, 3, 4
  Node D: 2, 3, 4
  Node E: 2, 3, 4
  Node F: 3, 3, 4
  Node G: 2, 2, 4
  Node H: 1, 2, 3

The resulting set of paths from the root B:
  (1, 2, 3); (1, 2, 4); (1, 3, 4); (2, 2, 4); 2x(2, 3, 4); (3, 3, 4)

Root C: (using the shorter notation):
  Nodes A and E: 2, 3, 3
  Nodes B and D: 1, 3, 4
  Nodes F and H: 2, 2, 3
  Node G: 1, 3, 3

The resulting set of paths from the root C:
  (1, 3, 3); 2x(1, 3, 4); 2x(2, 2, 3); 2x(2, 3, 3)

The node E will have the same paths as A; G will have the same paths as C; nodes D, F, H will have the same paths as B. As you can see, these three classes of nodes will have distinct sets of paths, so they will be successfully split into three classes indeed.
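
As a sanity check, here is a Python sketch of this path-set construction, run on the example graph (the representation and names are mine):

from collections import deque

# the example graph: the circle A..H plus the chords AE, BH, CG, DF
edges = ["AB", "BC", "CD", "DE", "EF", "FG", "GH", "HA",
         "AE", "BH", "CG", "DF"]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def shortest(g, src, dst):
    # plain BFS, returns None if dst is unreachable
    seen, q = {src}, deque([(src, 0)])
    while q:
        n, d = q.popleft()
        if n == dst:
            return d
        for m in g[n]:
            if m not in seen:
                seen.add(m)
                q.append((m, d + 1))
    return None

def signature(root):
    # for every target node, find a shortest path per each single
    # link retained at the target, and collect the sorted lengths
    sig = []
    for v in adj:
        if v == root:
            continue
        lens = []
        for u in adj[v]:
            mod = {n: links - {v} for n, links in adj.items()}
            mod[v] = {u}   # keep only the link u-v at the target
            mod[u].add(v)
            d = shortest(mod, root, v)
            if d is not None:
                lens.append(d)
        sig.append(tuple(sorted(lens)))
    return tuple(sorted(sig))

classes = {}
for n in sorted(adj):
    classes.setdefault(signature(n), []).append(n)
print(list(classes.values()))  # [['A', 'E'], ['B', 'D', 'F', 'H'], ['C', 'G']]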

This solves the problem for this specific example. Would it solve every possible example? I don't know, right now I can't think of a proof one way or another.

But let's estimate how difficult it would be to build these paths. If there are N nodes and total L links, first we'll have the loop that goes over each of the N root nodes. Then it would go over each of the (N-1) target nodes, and for each one of them go over up to (N-1) links that enter it, and find the shortest (or, with tags, also "lexicographically first") path between the root and the target node (which shouldn't take longer than L steps). Then we'll sort and collate the paths to each target, and concatenate them into the sorted set of paths, which should take within O(N^3). And then sort and collate the nodes by their paths, which should be within O(N^4) or so. Which means that it's all still polynomial.

So, if this approach works for every possible example, that would mean that it could solve each possible example in polynomial time. I've grown a bit more pessimistic about that, so probably it doesn't work for every possible example. But it would be interesting to understand, why.

As a side note, I've tried to reduce this graph a bit, and apply my new algorithm to the reduced version:

.              -----A-----
.             /     |     \
.            /      |      \
.           B-------+------ F
.           |       |       |
.           |       |       |
.           C-------+-------E
.            \      |      /
.             \     |     /
.              -----D-----

And the algorithm produced the same path sets for every node. Only then did I notice that in the reduced graph all the nodes really are equivalent! This graph can be seen as a triangular prism, with the top face ABF and bottom face CDE:

.              A
.             /|\
.            / | \
.           B--+- F
.           |  |  |
.           |  |  |
.           |  D  |
.           | / \ |
.           |/   \|
.           C-----E

And all the corners in this prism really are equivalent! The extra horizontal line in the original example adds the magic to make the nodes different.

Another side note: it's easy to construct a similar directional graph too. Just replace each "horizontal" and "vertical" non-directional link that has a pair of nodes at the end with a pair of arrows, one going one way, another one going back (between the same pair of nodes, or inserting another pair of nodes for the "back" link below the original pair). And as a final touch, turn the links that go "around the circle" into the arrows, all of them going either clockwise or counter-clockwise.

P.S. A good reference page:


Friday, March 8, 2019

defense against SQL injections

I've been reading the question on Quora:
Why can't SQL change so that injection attacks are no longer possible?

People there were referring to the example of getting through the value escaping. It works like this:

$iId = mysql_real_escape_string("1 OR 1=1");    
$sSql = "SELECT * FROM table WHERE id = $iId";

Well, this could easily be fixed by converting the variable to int before using it, but it's easy to forget. They also give examples of the Unicode character sets getting mishandled, and of using the wrong kind of quotes in the query.

But reading it gave me the idea that it might be actually easy enough to resolve: instead of using the plain strings, use the strings tagged with their expected type. I.e. “this is a part of a query”, “this should be an integer”, “this should be an external string”. It can be as simple as using a separate sanitizer function for each data type. So for example the string sanitizer would, among other things, add the apostrophes around the value. And the int sanitizer would check that the value is a valid integer. Then even if you accidentally use the string sanitizer for an int, your query just won’t compile. And the example from the injection attack would become
  1. $sSql = "SELECT * FROM table WHERE id = " + sanitize_int("1 OR 1=1");
The sanitizer would either throw an exception or cut down its argument to something like “1”, neither becoming a dangerous attack.
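
A minimal sketch of such per-type sanitizers in Python (the names are mine):

def sanitize_int(s):
    # accepts only a valid integer, anything else raises ValueError
    return str(int(str(s), 10))

def sanitize_str(s):
    # quotes the value and escapes the special characters in it
    return "'" + str(s).replace("\\", "\\\\").replace("'", "''") + "'"

sql = "SELECT * FROM table WHERE id = " + sanitize_int("1 OR 1=1")  # raises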

But it can be done better yet, by building the SQL query as a list of tagged objects and then passing this list of objects directly to the SQL compiler. The overloaded "+" can be used to make it look neat:
$sSql = sql_q("SELECT * FROM table WHERE id = ") + sql_int($arg);
Then the sanitizing would be done directly in the SQL compiler. Of course, it’s not 100% fool-proof, because someone might still use the sql_q() tag on an external value, but it’s still much better. And there are some ways to do better, like Perl can use its tainting logic, and I think I’ve seen a C++ function that accepted only an array of characters as its argument.
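
And a sketch of the tagged-fragment idea (the class and functions are made up for illustration):

class SqlFragment:
    # a list of (type tag, text) pairs to be fed to the SQL compiler
    def __init__(self, parts):
        self.parts = parts
    def __add__(self, other):
        return SqlFragment(self.parts + other.parts)

def sql_q(text):
    return SqlFragment([("sql", text)])

def sql_int(value):
    return SqlFragment([("int", str(value))])

q = sql_q("SELECT * FROM table WHERE id = ") + sql_int(7)
print(q.parts)  # [('sql', 'SELECT * FROM table WHERE id = '), ('int', '7')]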

Essentially, it injects the fragments of syntax into the statement, to be used by the SQL parser. The parser would then know that the contents of sql_int() has to be parsed as a single integer lexeme. And no matter what the character encoding, the compiler would know to treat the contents of sql_string() as a string.

This is really very similar to the parameter substitution in the prepared queries but it's much easier to read (no more need to count which "?" matches which parameter), and adds the explicit type information rather than having the SQL compiler guess, and allows to build the SQL statements dynamically, adding clauses and subqueries in a dynamic way. You know, things like “and if this condition is true, add a GROUP BY”, “for each parameter in the list, generate a query and make a UNION of them”, “and if another condition is true, instead of reading directly from a table, read from another sub-query”.

It can also be taken another step further, by creating types for the constant expressions and subqueries. The contents of sql_int_expr() wouldn't have to be a single integer, but can conveniently be a string containing a whole integer constant expression. Again, the compiler would ensure that it really is one. And the contents of sql_subquery() would be built from the same kind of fragments and guarantee that it's a whole subquery. So an accidentally misplaced parenthesis in it could be diagnosed more directly, without overflowing into the parent query.

Tuesday, December 25, 2018

Triceps 2.1.0 released

After a long break, Triceps 2.1.0 is finally released:

It includes two important features:

1. The fast compiled Ordered index. The code for it has been sitting in SVN for three years, and I have finally found time to update the documentation to replace the old ordered index implemented in Perl with the new one.

2. Triceps now builds and works with the modern versions of C++ and Perl. This was harder to do than it sounds, since the C++ standard has changed in an incompatible and unpleasant way. But the good news is that the code produced by the new compiler is a good deal faster than by the old one.

Friday, December 7, 2018

Principal component analysis

Today I've learned about the existence of the Principal component analysis. It's supposed to find the orthogonal dimensions that best describe the variance of the data set. Perhaps this is the real answer for removing the duplication in the dimensions that I've been thinking about some time ago. And it had been invented in 1901! I haven't understood it yet, will need to read up on it. For now just making a note so that I won't forget about it.

Saturday, December 1, 2018

Triceps in flux

I've finally updated my machine to a modern version of Linux (Mint 19), and now I'm going through, so to say, bringing Triceps into the Century of the Fruit Bat. Both C++ and Perl languages have changed, (and as has been mentioned in a previous post, Ghostscript had functionality removed).

Most annoyingly, C++ doesn't allow to call even the non-virtual methods on the NULL pointers any more. Well, it does allow to call them, but auto-removes any checks for (this == NULL) inside the methods. And I've been using this in multiple places. But now it's all worked around by using the special static objects, and the code builds again.

The tests still fail though, but now it's hopefully all trivial: some message got changed because of the different hashing, and also the Perl tests fail because the syntax of regexps got changed in Perl 5.26, it doesn't accept the plain "{" in the patterns any more. But I'm working on it.

Friday, November 23, 2018

converting EPS to SVG

I've tried to build my book on a recent Linux Mint and failed: I've been using Ghostscript to convert the figures in EPS to the SVG format, for inclusion into PDF, and it turns out that Ghostscript doesn't support SVG any more. I've tried to download and build the source but the source didn't have it either. I remember, it got broken a long time ago, it was crashing on my previous attempt to use a newer Linux. I've found a bug report from 2013 on Ghostscript's web site. Apparently, instead of fixing the SVG driver, they've just dropped it.

But I was able to find another way, with Inkscape. The conversion command looks like this:

inkscape -D "$SRC_EPS" --export-plain-svg="$DST_SVG"

After this change the build worked like a charm.

The Triceps build will need a similar change. I'm about to try it.