Sunday, September 27, 2020

a book on probability

I've stumbled upon the book "Introduction to Probability" by Anderson, Seppalainen, and Valko. It's a college textbook.

Right in the second chapter it gives a very good introduction to the Bayes formula, deriving it from the descriptions of the weights of events, in the same way that I once struggled to reinvent on my own. And right there it also provides the "General version of Bayes' formula" for the enumerated events, the one that took me so much time to reinvent. Well, another case of "everything has been invented before us" :-) If only we knew what books to read at the right time :-) On the other hand, I'm not sure that I would have recognized the importance of that formula if I didn't have a use for it in mind first.

In case you wonder, here is the formula:

If B1...Bn are events that partition the sample space, then for any event A we have:

P(A) = sum from i=1 to n of (P(A & Bi)) = sum from i=1 to n of (P(A|Bi)*P(Bi))

Then 

P(Bk|A) = P(A & Bk) / P(A) = P(A|Bk)*P(Bk) / (sum from i=1 to n of (P(A|Bi)*P(Bi)))
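
To make it concrete, here is a tiny numeric illustration in C++ (all the numbers are made up): an event A observed over a partition of three classes B1, B2, B3.

#include <cstdio>

int main() {
    // P(Bi): the prior probabilities of the partition, must add up to 1.
    double pB[3] = {0.5, 0.3, 0.2};
    // P(A|Bi): the probability of the event A within each class.
    double pAgivenB[3] = {0.01, 0.05, 0.20};

    // P(A) = sum of P(A|Bi)*P(Bi) over the whole partition.
    double pA = 0;
    for (int i = 0; i < 3; i++)
        pA += pAgivenB[i] * pB[i];

    // The general Bayes formula: P(Bk|A) = P(A|Bk)*P(Bk) / P(A).
    for (int k = 0; k < 3; k++)
        printf("P(B%d|A) = %g\n", k + 1, pAgivenB[k] * pB[k] / pA);
    return 0;
}

The three posteriors it prints add up to 1, as they should for a partition.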

Wednesday, September 23, 2020

cheap malloc-ed stacks

I've accidentally realized that the disjoint stacks (ones composed of separately allocated chunks) can be pretty inexpensive if they get allocated in fixed-size pages, and each stack frame is guaranteed to be less than a page (which is fairly common in, say, an OS kernel). And the stack pages can be made larger than the physical pages.

All that the function needs to do at the start of the call is check that there is enough room to the end of the page, which is easy enough by checking that (SP & (PAGE_SIZE-1)) < (PAGE_SIZE - FRAME_SIZE). Well, taking a page from the RISC book, we'd also want to reserve some room at the end of the page for the leaf functions, which could then skip the whole check, but it doesn't change the general approach.

If a new page needs to be allocated, it can be taken from a pool of free stack pages (a linked list), and would be linked to the previous page, then the frame created in it (copying the old stack pointer onto the new stack, as in the "classic" Intel x86 sequence of "push BP" that saves the pointer to the previous stack frame). The next page can also be found already linked to the current page from the last time, and then the allocation can be skipped. The function return then doesn't need to do anything special, just "pop SP" and return. And then if the thread blocks, the scheduler can return the unused pages beyond the last active frame back to the pool.
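
Here is a minimal user-space simulation of this scheme, just to make the bookkeeping concrete. The names and the layout (PageHeader, the upward-growing stack, the frames under one page) are made up for illustration, and the leaf-function reserve and the returning of pages to the pool are omitted:

#include <cstdint>
#include <cstdlib>

constexpr uintptr_t PAGE_SIZE = 4096; // a stack page, may be larger than a physical page

struct PageHeader {
    PageHeader *prev; // the page that holds the caller's frame
    PageHeader *next; // the continuation page cached from the last overflow
};

static PageHeader *freeList = nullptr; // the pool of free stack pages

static PageHeader *allocPage() {
    PageHeader *p = freeList;
    if (p != nullptr)
        freeList = p->prev;
    else
        p = static_cast<PageHeader *>(std::aligned_alloc(PAGE_SIZE, PAGE_SIZE));
    p->next = nullptr;
    return p;
}

// The check at the start of a non-leaf function: is there enough room for
// the frame to the end of the current page? If not, move to the next page.
uintptr_t ensureFrame(uintptr_t sp, uintptr_t frameSize) {
    if ((sp & (PAGE_SIZE - 1)) < PAGE_SIZE - frameSize)
        return sp; // the common case, no overhead beyond this check
    PageHeader *cur = reinterpret_cast<PageHeader *>(sp & ~(PAGE_SIZE - 1));
    if (cur->next == nullptr)
        cur->next = allocPage(); // no page cached from last time, take one from the pool
    cur->next->prev = cur; // the analog of "push BP", linking back to the previous page
    return reinterpret_cast<uintptr_t>(cur->next) + sizeof(PageHeader);
}

int main() {
    PageHeader *first = allocPage();
    uintptr_t sp = reinterpret_cast<uintptr_t>(first) + sizeof(PageHeader);
    for (int i = 0; i < 100; i++)
        sp = ensureFrame(sp, 512) + 512; // simulate 100 nested 512-byte frames
    return 0;
}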

There obviously is some overhead, but not a huge amount of it.

Saturday, May 9, 2020

consistent time and loops

It's obvious that the graph loops cannot be treated like the rest of the links with consistent time, or there would be no pipelining: we'd be forced to wait for an update from the looping link every time we send a record down the loop. Or we might end up sending a whole lot of records down the loop before reading back from the looping link, accumulating a large backlog in the queue of the looping link.

So what is the right thing to do? I think it depends on the circumstances, on what we're trying to achieve. I can think of the following uses:

1. Record the processing at real time and be able to reproduce it exactly in an accelerated playback.

2. Process records in accelerated time to start with but, as in real time, don't care too much about the exact outcome on the first run, as long as it's within the reasonable constraints. (But be able to replay it again in exactly the same way.)

3. Process records in accelerated time to start with, and make sure that they get treated in an exactly consistent way, any run from scratch producing the exact same result.

The case (1) can be reasonably resolved by treating the looping links like the inputs from the external systems, where we don't particularly care about the exact synchronization: re-timestamping the incoming records, logging them, and then processing with the new timestamp. On replay, read the logged records with their timestamps, and process them in the same way.

A few more words about how the re-timestamping would work: tag the record in the log with both the original timestamp and the new one, and use the new timestamp in the further processing. A simple-minded way to handle the replay would be to just read the log, and ignore the records sent through the looping link on the replay (since they presumably should be the same). A smarter way would be to receive the records from the link and compare them with the records in the log (including the original timestamp). If they match, all is well, and the new timestamp from the log can be used. If they diverge, then something somewhere has changed, and either the replay should be aborted or some mitigation should be done, perhaps in the same way as for the case (2).

So, how to solve the case (2)? The trouble there is that the model's clock gets driven by the incoming records, which get queued, allowing the clock to go far ahead before the records get processed and arrive back through a looping link. The solution seems to be to limit how far the records can be delayed at the re-timestamping point. Suppose we choose some time interval that is large enough to let the records go through the loop (to allow the queues to do an efficient pipelining) but not outlandishly large; let's call this interval L. Say, 1 second or 0.1 second. Set it as the limit. Then when a record arrives through the looping link (such as, in the figure below, B receiving a record back from D) with a timestamp T, we re-stamp it with a timestamp that is the smaller of: (T + L) and the lowest timestamp of the next record in the incoming queues (going up the links as usual to get the timestamp of the record they would send next, and since the records from the looping links are processed first, this in effect means "go just before that record").

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This would essentially allow the looping links to "run behind the clock" by up to L. So if B receives from A a record with timestamp T2, it can treat the looping link "almost normally", by asking it if it has any earlier records, but using the time limit (T2 - L) instead of T2 (as it would for the normal "downstream" links). This is very much the same as the case (1), only it limits by how much the clock could run ahead. This solution would actually work fine for both the cases (1) and (2). And since it preserves the high priority of the looping links, it would prevent them from buffering too much data.
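
Here is a sketch of this re-stamping rule, with made-up types; "lowestUpstream" stands for the lowest timestamp of the next record in the incoming non-looping queues, found by going up the links as described above:

#include <algorithm>
#include <cstdint>

typedef int64_t Timestamp; // say, in microseconds

const Timestamp L = 100000; // the chosen loop delay limit, here 0.1 second

// Re-stamp a record that arrived through a looping link with timestamp t.
Timestamp restampLooped(Timestamp t, Timestamp lowestUpstream) {
    // The looped record may run behind the clock by at most L; but if the
    // normal inputs are lagging even more, go just before their next record.
    return std::min(t + L, lowestUpstream);
}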

It can also be stretched to cover (3) by always using (T + L) for the looped records, and not the optimistic minimum. The trouble there is that a large number of records might enter the loop in the interval L, and they will collect on the looping link's queue. But it should be fine for the small test models with a small amount of data.

I've thought about fixing this issue by using a semi-blocking queue on the looping link: essentially, compose the looping link from a blocking queue with the usual limit (and special logic), followed by a non-blocking queue. The re-timestamping would happen when moving the records from the blocking queue to the non-blocking one. If the non-blocking queue is empty, treat the records from the blocking queue as having the timestamps behind their original ones by L, and move them to the non-blocking queue when the time (T + L) is reached. Unless the loop is about to deadlock. Then instead of deadlocking, move the first record from the blocking queue on the looping link to the non-blocking queue, re-stamping it with the lowest timestamp of the next record in the incoming queues, as in the solution for (2). Since the total depth of the queues in a loop is fixed for a given model, that would make the ordering of the records fully predictable, even if they get processed before (T + L).

But this fix has its own trouble: for consistency it requires that the loop must be fully deadlocked before breaking this deadlock. Not just that the queues to and from the top node are full, but that every queue in the loop is full. Which is not easy to track. Triceps already finds the loops between the Trieads, so that part is not a problem, but tracking at runtime how the queues become full, with all the possibilities of intertwined loops, might add a good deal of overhead. So it might not be a practical fix.

This might need more thinking or maybe it's just a problem that doesn't need to be fixed.

In the meantime, there is one more aspect to the loops. A looping link might be used essentially to send a timing event, telling the nodes upstream to re-examine their state. In the example below, the looping link from D to B would be driven by the scheduler on D:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
   +----+ /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+    scheduler
   |       \    /        |
   |        \  / /+------+
   |         V V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

In this case adding the delay L looks weird, since the time of the scheduler event might already be well past L from the original record that triggered it. It would make sense to make a special scheduler driven by a looping link instead:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
scheduler  /   \
   ^    | /     \
   |    V V      V
   |    +---+  +---+
   |    | B |  | C |
   |    +---+  +---+
   |       \    /
   |        \  /
   |         V V
   |       +-----+
   |       |  D  |
   |       +-----+
   |        |
   +--------+

This scheduler's notion of time would be driven by D but it would generate the events for B. With a special limitation that the events must be scheduled by at least L into the future. Any lower delays would be rounded up to L.

Oh, and one more thing about the models that would run predictably every time from scratch: their infinite-precision clock can't use a single common record counter for the high-precision part, because that would cause race conditions. Instead each node would have to keep its own record counter, and the low bits of the timestamp would consist of (node_id, per_node_counter). Which might actually be for the best, since it would remove the contention point of the common counter. The schedulers can be given their own node ids. And come to think of it, the schedulers should probably be given lower ids than the input nodes, because the timed events should probably be processed before the records coming from outside at the same time.
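
Here is a sketch of what these low bits might look like; the field widths (16 bits of node id, 48 bits of counter) are arbitrary, chosen purely for illustration:

#include <cstdint>

struct NodeClock {
    uint32_t nodeId;      // the schedulers get lower ids than the input nodes
    uint64_t counter = 0; // private to this node, so there is no contention

    // Compose the low bits of the timestamp as the pair (node_id, counter);
    // comparing these values compares the node ids first, so at the same
    // time part the events of the lower-id nodes sort first.
    uint64_t nextLowBits() {
        return (uint64_t(nodeId) << 48) | (counter++ & 0xFFFFFFFFFFFFull);
    }
};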

Tuesday, May 5, 2020

consistent time

I've done some more thinking on the issues of consistent time in Triceps models, and came up with a design. It's not a final design, more of a set of design notes, so that I won't forget them until I get to an implementation. But I think it's quite interesting.

Let's start with the models that are unidirectional graphs, without any loops; they are much easier to reason about.

The basic premise is that the records need to be timestamped when they enter the model, and then processed in the order of these timestamps. The time issues tend to be highlighted in the diamond-shaped graphs, so let's start with one:

.
           input
             |
             V
        timestamper
             |
             V
           +---+
           | A |
           +---+
           /   \
          /     \
          V      V
        +---+  +---+
        | B |  | C |
        +---+  +---+
           \    /
            \  /
             V V
           +-----+
           |  D  |
           +-----+

An input record arrives, gets stamped, and goes to the node (i.e. a Triead, a Triceps thread, but let's talk in more generic terms for now) A, where it can cause the records to be sent to nodes B and/or C. The records produced by A would have the same timestamp as the incoming record. B and C in turn may produce their own records (which again would get the same timestamp) and send them to D.

When two records with the same timestamp arrive from the same node, they have the natural queuing order, and they can be predictably processed in that order. But what if they arrive from different nodes, say B and C? The result of processing might depend on the order, which basically means that we have to define priorities between all the inputs. Here D has the inputs from B and C: one of them (say, B) would get a higher priority, and its records will be processed before the records with the same timestamp from the other input (C).

Well, that's easy to say, but what should D do when it receives a record with timestamp T from C, and nothing yet from B? For how long should it wait for something from B before it decides that it can process the record from C?

One way is to always send the timestamp metadata along each path even if there are no actual records. But this looks like a lot of unnecessary overhead. And it also has another potential issue: suppose we have two external inputs that get timestamped. The input 1 gets a lot of records for a period of time, the input 2 gets none. How often should the input 2 send its timestamp metadata even if it sends no data records? Well, maybe there is another solution.

Another way to resolve it would be for D to ask B: do you ever plan to send me anything with timestamp T or earlier? If not, then D can safely go and process the record from C. If yes, then D would have to wait for a notification from B. So the request from D to B can be asynchronous. If B is ready, it will send the timestamp metadata as a callback to D right away; if not, it will send either the timestamped records or a bare timestamp later.

But B might not be able to answer this directly. It might have to consult A with the same question first. And A would have to consult the timestamper on its input. If A has two inputs, it would consult both timestampers.

Well, these callbacks would be very similar to just sending the timestamp metadata records all the time, and there would be additional delays involved with the requests upstream. But there are positive things too:

* It tells us how often the inactive timestamper would get queried and send its metadata: more or less, for every record that comes in through any timestamper.

* In reality, not every node in the model would get input for every incoming record. If none of the nodes at some level produce records, the propagation would stop there and won't go farther downstream. And when the requests are sent upstream, they also would reach only to the timestampers that could send input to this node.

* The requests can be of two types: that the time T has been reached and that the time T has been passed. Because the node D prefers B over C, when it gets a record from C with timestamp T, it would have to make sure that B is past T before processing that record. But if it gets a record from B with timestamp T, knowing that C is at T (and not yet past it) would be enough to start processing that record.

* The requests can be amortized. I.e. if the node D has records with timestamps T1 and T2 queued from node C, it can ask B once to send the timestamp updates until it reaches past T2. In the same way, if D asks B whether it's past T2 and B knows that it's already past a later timestamp T3, it can send the notification about T3 right away, and D will be able to use this knowledge later.

* Finally, this communication doesn't have to go through the regular queues. The more efficient data structures can be used to propagate the timestamps between the nodes. The timestamp information is cumulative: once B has notified that it had processed the input up to the timestamp T3, there is no point in keeping the information about it processing the timestamps T1 and T2, they're implied in T3.

* But if the communication goes across the network, the round trip time can become prohibitively expensive for the upstream requests. So the downstream timestamps should be sent on their own, although the cumulative approach would still apply. Of course, if there are multiple streams sent over the network between two systems, all these streams should just be sent across using the same sequencer, so there would be no issue with synchronization between them. But there would still be the synchronization issue between multiple external systems, and between them and the local scheduled events. Yet another possible solution for that would be to just restamp them with the local time if the relative ordering doesn't matter.

Well, let's move on to the next problem. When can a node say that it's past a particular timestamp? It depends on the clock granularity. As long as the clock is at T, there might be more records coming in, so it can't say that it's past T yet. So if the clock granularity is 1 millisecond, we'd be stuck waiting for that millisecond to end. What we need is a clock with infinite granularity. It can be simulated by keeping a single global record counter, and building the timestamp as a pair (time, count). The counter would be like the lower bits of the timestamp. This would also allow amortizing the calls to read the clock: if we have 100 records queued up at the timestamper, we can just increase the global counter by 100, and assign each record the same time and the next count in the allocated sequence.
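
Here is a sketch of such a clock in C++, with the amortization of the counter; the names and the time granularity are arbitrary:

#include <atomic>
#include <chrono>
#include <cstdint>

struct Timestamp {
    int64_t time;   // the wall-clock part, for replay and timed events
    uint64_t count; // the "lower bits": the global record counter
    bool operator<(const Timestamp &other) const {
        return time < other.time
            || (time == other.time && count < other.count);
    }
};

static std::atomic<uint64_t> globalCounter(0);

// Stamp a batch of n queued records with one clock read and one atomic
// increment: the record i of the batch gets (ts.time, ts.count + i).
Timestamp stampBatchStart(uint64_t n) {
    Timestamp ts;
    ts.time = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    ts.count = globalCounter.fetch_add(n);
    return ts;
}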

Why not forget the time altogether and move on to just the counter? We'd need the time to be able to replay the records later at an accelerated pace, and still handle the time-based events.

Basically, the timed scheduler would run as a timestamper. And yes, it would also have to use the same clock, and do the same (time, count) pairs allocated from that clock. If D has some time-based events, the diagram would become:

.
           input
             |
             V
        timestamper <----------------------------- clock
             |                                       |
             V                                       |
           +---+                                     |
           | A |                                     |
           +---+                                     |
           /   \                                     |
          /     \                                    |
          V      V                                   |
        +---+  +---+                                 |
        | B |  | C |                                 |
        +---+  +---+    scheduler/timestamper <------+
           \    /        +
            \  //+-------+
             V VV
           +-----+
           |  D  |
           +-----+

This would mean that when D gets a record from B, it would have to check the time not only in C but also in the timestamper. The timestamper can also do the amortization by telling the time of its next scheduled event. But that's a bit tricky: first, there would have to be some count part of the timestamp associated with that future time. There are two ways to go about it: either first tell the time without the count part (essentially, with it at 0), or allocate the count part in advance, when the event gets scheduled, and then just use that count when the timer fires. Second, it might happen that the processing of an incoming record would schedule a new, earlier event. But that's not a problem because the events can't be scheduled in the past. Since the scheduler would run as a part of D and synchronous with it, it could update its estimation and move the next event forward.

It would also have an interesting interaction with the replay: if we want to replay a timestamped log of incoming events and get the model to behave in exactly the same way as before, the scheduler events would have to happen at exactly the same times too. Which means them getting the exact same count part of the timestamp, and that normally won't be stable due to the races for the clock. But it can be resolved by recording the exact time of all the scheduled events into the same log and reusing it during the replay, the schedulers receiving their timestamps not from the clock but from the log, same as the incoming records.

What if we want to change the model between the reruns? Well, in this case we can discard the log for the schedulers, just produce the count values for them afresh, and write the values into the new log. The running sequence would be slightly different than the first time, but since the model has changed, it wouldn't matter. Or we could even reuse the part of the log that is still applicable, merge it with the events from the new schedulers, and write it into the new log. Either way, once the new log gets written, it can be reused again to produce the exact same result on the next replay.

Another interesting thing about the replay is, how do we transition from a replay to the live performance? If we have a recorded log from yesterday, want to replay it and then continue with today's data, what do we do with all the scheduled events that would have fired overnight? This basically suggests that we need to have some kind of "time reset events" that would be used when the time gets moved abruptly. It would allow the application logic to properly reset all the events that have been scheduled for the meantime: either just cancel them or execute them once, depending on the application semantics.

I think this should work quite well for the graphs without loops. The loops add a good deal of trouble. More on them later.

Sunday, May 3, 2020

TLB coherence

The discussion described in the last post got me thinking: why don't we have hardware consistency for the page address translation cache (TLB), done through the same bus transactions as the usual cache snooping? In a multi-threaded environment, invalidating the TLB across all the threads is a pain that requires cross-CPU interrupts.

In the simplest case the CPU that drives a TLB invalidation could just do a bus transaction that specifies a virtual address, and every CPU would invalidate its TLB entry that matches this address. If the user processes use the address space randomization, there would be few conflicts between processes, and the kernel address space is common for everyone. In a more complicated way, the virtual address can be accompanied by the process or memory protection id, so that only the pages of the CPUs running that process would be invalidated. And finally, each page translation entry has a unique identifier: the physical memory address where the CPU had read it from. If the TLB keeps this address as a tag (keeping enough bits to identify the page of the page directory is enough, since the rest of the bits are determined by the virtual address), it can be used as a tag in the TLB invalidation transaction, and only the exact TLB entries matching this tag would be invalidated.

Which made me wonder if anyone else had thought of this before. And after a bit of searching, it has turned out that they did: https://arxiv.org/pdf/1701.07517.pdf. These guys have approached the problem from the hypervisor standpoint, but the implementation is exactly the same, using the physical memory address of the entry as a tag. There is no date on the article, but judging by the citation list, it could not have been written earlier than 2016.

Sunday, April 19, 2020

statistical and password-based memory protection

I've recently had an interesting online conversation with Terry Lambert (whom I know from the FreeBSD times) about memory protection, which resulted in what I think is a new and interesting idea that I want to write down here.

Terry was talking about the statistical memory protection. I'm not sure if he invented it, or at least the term, since I can't find it anywhere else. Terry gave some related links, but none of them were exactly about that.

The statistical memory protection works by having a large common address space where the protection is achieved by allocating memory at random addresses, so the address itself serves as a password. Plus the same physical page can be mapped at different addresses with different permissions, so knowing one address would give the write access while another one the read-only access.

As far as I see it, the major benefit of this system would be in the ability to pass the complex data structures linked by pointers between processes. With the address map being common, the pointers would work in any process. This gives cheap and convenient "mailboxes" for exchanging messages between processes.

This would have multiple potential problems too.

One is the “password leakage” that is also mentioned in one of the papers. Once a password gets out, it can be saved and passed around. With the separate addresses for RO and RW access to the same page this could get pretty bad:  if one of these pointers in an RO page accidentally refers to the RW address, the permission leaks. And actually the whole passing of pointer-connected structures breaks down because the writer and reader would have to use different pointers (with RO and RW addresses). 

Another issue: the memory segments would have to be allocated at globally unique random addresses, which would have to be a synchronized operation. Considering that the segments have to be spaced far enough from each other, that would consume a sizable number of bits and degrade the security of the passwords.

This would also be susceptible to the brute-force attacks: maybe the attack won't enumerate the whole address space, but it can start with a random point and just see what can be found there. It won't always be lucky, but sometimes it will be. Another form of attack would be to stick incorrect pointers into the shared pages, hoping that the other process reading the pointer from that page would cause some damage.

The brute-force attacks can be frustrated by keeping a fault counter, and killing the process that has too many faults, but it's harder than it seems at first. The fault counters would have to have a scope greater than a process, because if you kill a process, nothing stops the user from spawning another one. Also nothing would stop the user from spawning a lot of processes, each doing fewer probes than the fault limit and exiting. But on the other hand, with a greater scope, a malicious application would be able to screw the user completely, so drawing this scope boundary won't be easy. A possible solution for that problem is to introduce a delay in the process's scheduling after a fault, and these delays can even be made to increase if the user's global fault rate is high. This would dramatically slow down the attacks without affecting the correctly running code.

There is one more reason why the fault counting would have issues: if we allow the pointers to be transferred through the global memory, there is no way to know whose fault the faulty pointer is: did the current process produce it, or did it get it from someone else?

I've come up with an idea to improve on the statistical memory protection; let's call it the password-based memory protection: the same can be achieved in a better and cheaper way by associating a long (say, 128-bit) "password" with a page, effectively making the addresses in the "password-protected" pages 192 bits long (64 bits for the address and a 128-bit password). There can be separate passwords for reading-only and reading-writing a page. Possibly execute-only too. But the passwords can be kept separately, in some context, so the "short" 64-bit part of the address would be the same for the reader and writer. The context can be inherited from the page: when we load a pointer from a page, we set the password part of it in the CPU register equal to the password of the page. This would also allow catching the malicious addresses in the shared pages by a mismatch of the password. And the whole address space can be split into the private and shared parts, with only the shared part using the passwords. This would allow the private segments to allocate pages without getting a global lock on the address space. And the whole 128-bit space can be used for the password.

It would still be randomly susceptible to the brute-force attacks, as well as to the password leakage.

Some progress can be made on the fault-blaming though. If only the pointers without passwords could be transferred through the password-protected memory, then when we read such a pointer, we can implicitly add the password of the protected page to it and immediately verify that the pointer is valid, that this password also matches the page referred to by the pointer. If it doesn't, it means that someone tried to slip in an invalid pointer. If our process has read-only access to the page, we can even say for sure that it was another process's fault and not penalize this one (unless of course it mapped the same page twice, and the other mapping is read-write).


Let's consider in more detail how the pointers would work. For the pointers-with-passwords let's use the "far" keyword from the days of MS-DOS. The common (AKA "near") pointers without password would be just pointers without a special keyword.
 

int *pn; // a common pointer
int far *pf; // a far pointer that includes the password

They generally can't be cast to each other; a far pointer has to be explicitly constructed from a near pointer and a password:

pf = make_far(pn, password);

Typically we would use the common pointers in the structures that get passed through the password-protected memory, to store only the address but not the password:

struct data {
  int *p1;
  int *p2;
};

This would also allow using the same structure in the normal and the password-protected memory.

But when a common pointer gets accessed through a far pointer, it gets treated in a special way: essentially, it gets implicitly converted to a far pointer.

 
data far *dpf;
data *dpn; 

pn = dpf->p1; // incorrect, won't compile
dpn->p1 = dpf->p1; // incorrect, won't compile
pf = dpf->p1; // correct, implicitly copies the password 
  // from dpf to pf and verifies that pf is valid, throws an
  // exception if invalid
dpf->p1 = pn; // incorrect, won't compile
dpf->p1 = dpn->p1; // incorrect, won't compile
dpf->p1 = pf; // correct, verifies that dpf and pf have
  // the same password (throws an exception if the
  // passwords differ), and drops the password when
  // writing into p1
dpf->p2 = dpf->p1; // correct

It would actually be more correct to say that the reference to a pointer accessed through a far pointer turns implicitly into a reference to a far pointer, and handles the conversion between the kinds of pointers.

This of course wouldn't stop someone from copying pointers as data to or from the password-protected memory via memcpy (though it would of course have to be a special version of memcpy that accepts the far pointers as arguments), but that would be their own problem of forfeiting the hardware help in the protection.

On the assembly level, this can be done with the addition of the password registers in the CPU along with the instructions for accessing the password-protected pages. Something like this:


 
  ; load the address part of the far pointer
  load (r1), r2
  ; load the password part of the far pointer into a password register
  loadpwd (r1+16), pr1
  ; read a value at a password-protected  far pointer
  loadpp r2, pr1, r3
  ; assuming that the read value was a pointer, verify that it was correct,
  ; it must be pointing to a page that uses the same password from pr1
  verifyp r3, pr1, failure_label
  ; or maybe read and verify in one instruction
  loadpv r2, pr1, r3, failure_label
  ; store a far pointer
  store r3, (r5)
  storepwd pr1, (r5+16)


The next development to prevent the brute-force attacks and password leaks is to have the OS track, on the side, which segments each process has been granted access to.

Then the problem of the leaked passwords can be solved by changing them periodically. Instead of keeping the pair (password, address) in the far pointer, place all the passwords for all the granted segments into a canonical location in the process. Then represent the far pointer as the pair (address of the password, address of the data). Whenever a password needs to be changed, change it in the canonical location (of which the OS would keep track). And when an access exception happens due to a stale password in the CPU register, the kernel can check the access rights of the process against the canonical list. If the rights are present, it can update the password in the register and return to retry the access.

But a consequence of keeping this special section is that the passwords themselves can be stored in a protected section of memory, otherwise inaccessible in the user mode. The password registers in the CPU would have two parts: the address of the password in the special section, which can be loaded and stored by the process, and the shadow value of the password itself, invisible to the process and used to check the access. Then the users won't be able to brute-force the passwords at all, because they won't ever be able to set arbitrary passwords; all the password management would happen in the OS.

Then, if the passwords are not user-visible, they don't have to be long any more: 64 bits, or maybe even 32 bits, would suffice as long as they are all unique. Another consequence would be that changing the password would not be needed most of the time; its only use would be to revoke someone's access.

But wait, there is more. If we don't have the special hardware available, instead of checking on every access we can check when mapping the page into the address space of the process. It’s just that the shared pages would be mapped at the same virtual address in every process. So we’ve made a full circle back to the existing solution here. It feels like not disturbing the page table would be more efficient, but actually not: the segment would have to be allocated first anyway before making it available to any process, and if the page map is shared between all the processes, that means potentially disturbing all the processes, not only the ones that will use that segment. 

The only other part of the implementation would be the verification that the pointers received in the shared pages are safe, that they point to the pages of the same source. This can be done in user space, allocating a "page id map", where each page address is direct-mapped to the "source id" of this page (say, 4 bytes each). Look up the ids of the page that contained the pointer and of the pointed-to page, and verify that the ids are the same. The map itself can be sparse, with the unused parts mapped to an all-0 page, so it would be cheap. Some kernel help might be useful for handling the sparseness and making sure that the page id map is consistent with the page table. The compiler support with the "far pointers" would also still be useful for the automated verification, even with the user-space solution.
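
A sketch of that user-space check, with made-up names; here the map is passed in as an argument, while in reality it would be set up with the kernel help as described:

#include <cstdint>

constexpr uintptr_t PAGE_SHIFT = 12; // assuming 4K pages

// Verify that a pointer read from a shared page points back into a page
// with the same source id; returns false for a possibly malicious pointer.
// pageIdMap is the sparse direct-mapped array of 4-byte source ids.
bool pointerIsSafe(const uint32_t *pageIdMap, const void *container, const void *ptr) {
    uint32_t containerId = pageIdMap[reinterpret_cast<uintptr_t>(container) >> PAGE_SHIFT];
    uint32_t pointeeId = pageIdMap[reinterpret_cast<uintptr_t>(ptr) >> PAGE_SHIFT];
    return containerId != 0 && containerId == pointeeId;
}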


Wednesday, February 19, 2020

Scheduling policy for burst multithreading

Over a year ago I read Eric Raymond's blog post on the limitations of multithreading: http://esr.ibiblio.org/?p=8223

That gave me an idea: on the desktop we routinely have the spare CPUs going to waste. If we could reduce the cost of cross-CPU interaction, that would allow scheduling even the small snippets of parallel code efficiently. And we can reduce this cost by scheduling multiple threads of one process together, even if they have nothing to do: then they would just park the CPU in user space, and wake up quickly, directly in the user space, with an already pre-heated cache and TLB. And then we can use the fast low-overhead inter-thread communications directly in the user space.

And then I've realized that another way to look at it is kind of like superscalar execution on a higher level, sort of what they've tried to do in Itanium but better.

Well, this finally ended up in a defense publication:
https://www.tdcommons.org/dpubs_series/2896/

Not sure if it will end up in anything useful but at least the idea is out there and you can read the details of it.

Monday, September 2, 2019

networking & financials

Looking at the Federal Reserve rates

https://www.macrotrends.net/2015/fed-funds-rate-historical-chart ,

we can observe that once in a while they suddenly decide to raise the rates quickly, each time followed by a recession. As the people who caused the Great Depression said, they "fought to stop the speculations", and succeeded, bringing the whole economy down with them.

But the graph overall looks very reminiscent of what we see in the congestion control protocols. Except that the congestion control protocols act more sensibly: increase the transmission speed slowly, but when a catastrophe is detected, drop it sharply. It would be interesting to build an economic model and use the congestion control protocols to regulate the refinancing rate in it, starting with the classic TCP. I suspect that it might do a lot better than the Fed's current tea-leaf reading, since the Fed seems to be doing things backwards: raising the rates sharply and dropping them slowly. Since the catastrophes happen when the rates get too high, it would make a lot more sense to limit the rate of increases to something like 0.25% (25 bp) per year, to creep up on that catastrophe slowly, detect it early in the making, and then react sharply.
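
Just to make the analogy concrete, here is a toy sketch of that classic TCP-style rule (additive increase, multiplicative decrease) transplanted onto a rate; every number in it is made up:

#include <algorithm>

struct RatePolicy {
    double rate = 1.0;         // the current rate, in percent
    double stepPerYear = 0.25; // the slow additive increase, 25 bp per year
    double backoff = 0.5;      // the sharp multiplicative decrease

    void yearPassed(bool troubleDetected) {
        if (troubleDetected)
            rate = std::max(0.0, rate * backoff); // drop sharply
        else
            rate += stepPerYear; // creep up slowly
    }
};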

Tuesday, July 9, 2019

graph equivalence 7: why it DOESN'T work, and a conjecture

<<Prev

I've gotten around to reading a bit more about the graph equivalence (i.e. isomorphism) problem. It looks like the state-of-the-art algorithm is called "nauty". I don't understand it yet; I haven't found a straightforward enough description, nor understood the complex kind of description. But based on the introductory page http://pallini.di.uniroma1.it/Introduction.html, some things are easy to notice.

First, they use the same signature-based approach for comparison of the graphs, they call it the "canonical labeling".

Second, they use the same approach of splitting the set of nodes into the equivalency classes, they call it "coloring" and "partition refinement". Which is good, this means that I've reinvented something that is known to work in general.

Third, that very page contains an example of a graph on which my algorithm doesn't work. It's the pink one in the section "Refinement: equitable partition". Here is a copy, with names assigned to the nodes:

.
.              -----A-----
.             /     |     \
.            /      |      \
.           B-------+------ H
.           |       |       |
.           |       |       |
.           C-------+-------G
.           |       |       |
.           |       |       |
.           D-------+-------F
.            \      |      /
.             \     |     /
.              -----E-----

I've been starting with the premise that the nodes would have tags and the links would be directional, but with a basic unlabelled non-directional graph things are much harder. All you've got to start with is the number of links in every node. If there is some asymmetry, it will be used to unravel the dependencies. But if every node has the same number of links, then the logic is stuck, and as this example shows, it's possible to make all the nodes have the same number of links but not be all symmetrical.

Before I've read further about the existing algorithm, I want to take another independent stab at the solution. As I've described in the Part 4, I've started with the idea of describing the graph topology in terms of the spanning trees, rooted at each of the nodes, and then tried to simplify it with the hashes. This simplification has turned out to be wrong. So let's return to the spanning trees (with redundant links on the side), and take another approach to avoid the circular dependency. When computing the equivalency class (AKA "hash" AKA "color") of each node on each top-level iteration of the algorithm, instead of looking at the immediate neighbor nodes, let's take this node as the root of the spanning tree and look at its relation with each other node by building a path from the root to that node (if the graph is disconnected and no path exists, take an empty path, or just don't enter any path for that node). To account for the possible redundant links, we could find multiple paths: remove all the links but one from the destination node, find a path, then return to the original graph, and remove all but a different single link, find another path, and so on.

The path in general would include the tags on the nodes, the number of links connected to them (in the original graphs), and so on, to make it more distinctive. But if the nodes are untagged (initially belonging to the same equivalence class), and all have the same number of links, the only distinction between the paths would be their length.

Let's build the paths:

Root A:
  Node B can be reached via the paths:
    using the link AB: AB - 1 step
    using the link HB: AHB - 2 steps
    using the link CB: AHGCB or AEDCB - 4 steps
  Node C can be reached via the paths:
    ABC - 2 steps
    AHGC - 3 steps
    AEDC - 3 steps
  Node D can be reached via the paths:
    AED - 2 steps
    ABCD - 3 steps
    AEFD - 3 steps
  Node E can be reached via the paths:
    AE - 1 step
    ABCDE - 4 steps
    AHGFE - 4 steps
  Node F is similar to D
  Node G is similar to C
  Node H is similar to B

So overall the set of paths going from the root A can be described in the shorter form as:
  2x(1, 2, 4); (1, 4, 4); 4x(2, 3, 3)

Root B (using the shorter notation):
  Node A: 1, 2, 4
  Node C: 1, 3, 4
  Node D: 2, 3, 4
  Node E: 2, 3, 4
  Node F: 3, 3, 3
  Node G: 2, 2, 4
  Node H: 1, 2, 3

The resulting set of paths from the root B:
  (1, 2, 3); (1, 2, 4); (1, 3, 4); (2, 2, 4); 2x(2, 3, 4); (3, 3, 3)

Root C: (using the shorter notation):
  Nodes A and E: 2, 3, 3
  Nodes B and D: 1, 3, 4
  Nodes F and H: 2, 2, 3
  Node G: 1, 3, 3

The resulting set of paths from the root C:
  (1, 3, 3); 2x(1, 3, 4); 2x(2, 2, 3); 2x(2, 3, 3)

The node E will have the same paths as A; G will have the same paths as C; nodes D, F, H will have the same paths as B. As you can see, these three classes of nodes will have distinct sets of paths, so they will be successfully split into three classes indeed.

This solves the problem for this specific example. Would it solve every possible example? I don't know, right now I can't think of a proof one way or another.

But let's estimate how difficult it would be to build these paths. If there are N nodes and total L links, first we'll have a loop that goes over every one of the N root nodes. Then it would go over each of the (N-1) target nodes, and for each one of them go over up to (N-1) links that enter it, and find the shortest (or, with tags, also "lexicographically first") path between the root and the target node (which shouldn't take longer than L steps). Then we'll sort and collate the paths to each target, and concatenate them into the sorted set of paths, which should take within O(N^3). And then sort and collate the nodes by their paths, which should be within O(N^4) or so. Which means that it's all still polynomial.
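
To check that estimate (and the example above), here is a runnable sketch of this algorithm for the untagged non-directional case. The observation it relies on: the shortest path from the root to a target through one particular link (u, target) is the shortest path from the root to u that avoids the target, plus one step; an unreachable target gets recorded as 0.

#include <algorithm>
#include <iostream>
#include <queue>
#include <vector>

typedef std::vector<std::vector<int> > Graph; // adjacency lists

// BFS distances from root in the graph with the node "banned" removed;
// -1 means unreachable.
static std::vector<int> bfsAvoiding(const Graph &g, int root, int banned) {
    std::vector<int> dist(g.size(), -1);
    std::queue<int> q;
    dist[root] = 0;
    q.push(root);
    while (!q.empty()) {
        int n = q.front(); q.pop();
        for (size_t i = 0; i < g[n].size(); i++) {
            int m = g[n][i];
            if (m != banned && dist[m] < 0) {
                dist[m] = dist[n] + 1;
                q.push(m);
            }
        }
    }
    return dist;
}

// The signature of a root: the sorted set of per-target path length tuples.
static std::vector<std::vector<int> > rootSignature(const Graph &g, int root) {
    std::vector<std::vector<int> > sig;
    for (int t = 0; t < (int)g.size(); t++) {
        if (t == root)
            continue;
        std::vector<int> dist = bfsAvoiding(g, root, t);
        std::vector<int> lens; // one path length per link entering t
        for (size_t i = 0; i < g[t].size(); i++) {
            int u = g[t][i];
            lens.push_back(dist[u] < 0 ? 0 : dist[u] + 1);
        }
        std::sort(lens.begin(), lens.end());
        sig.push_back(lens);
    }
    std::sort(sig.begin(), sig.end());
    return sig;
}

int main() {
    // The 8-node example: the ring A..H plus the chords B-H, C-G, D-F,
    // and the axis A-E.
    const int A = 0, B = 1, C = 2, D = 3, E = 4, F = 5, G = 6, H = 7;
    int edges[12][2] = {
        {A,B}, {B,C}, {C,D}, {D,E}, {E,F}, {F,G}, {G,H}, {H,A},
        {B,H}, {C,G}, {D,F}, {A,E},
    };
    Graph g(8);
    for (int i = 0; i < 12; i++) {
        g[edges[i][0]].push_back(edges[i][1]);
        g[edges[i][1]].push_back(edges[i][0]);
    }
    for (int n = 0; n < 8; n++) {
        std::cout << (char)('A' + n) << ":";
        std::vector<std::vector<int> > sig = rootSignature(g, n);
        for (size_t i = 0; i < sig.size(); i++) {
            std::cout << " (";
            for (size_t j = 0; j < sig[i].size(); j++)
                std::cout << (j ? "," : "") << sig[i][j];
            std::cout << ")";
        }
        std::cout << "\n";
    }
    return 0;
}

Running it reproduces the path sets computed by hand above and splits the nodes into the three classes {A, E}, {B, D, F, H}, {C, G}.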

So, if this approach works for every possible example, that would mean that it could solve each possible example in polynomial time. I've grown a bit more pessimistic about that, so probably it doesn't work for every possible example. But it would be interesting to understand, why.

As a side note, I've tried to reduce this graph a bit, and apply my new algorithm to the reduced version:

.
.              -----A-----
.             /     |     \
.            /      |      \
.           B-------+------ F
.           |       |       |
.           |       |       |
.           C-------+-------E
.            \      |      /
.             \     |     /
.              -----D-----

And the algorithm produced the same path sets for every node. Only then did I notice that in the reduced graph all the nodes really are equivalent! This graph can be seen as a triangular prism, with the top face ABF and bottom face CDE:

.
.              A
.             /|\
.            / | \
.           B--+- F
.           |  |  |
.           |  |  |
.           |  D  |
.           | / \ |
.           |/   \|
.           C-----E


And all the corners in this prism really are equivalent! The extra horizontal line in the original example adds the magic to make the nodes different.

Another side note: it's easy to construct a similar directional graph too. Just replace each "horizontal" and "vertical" non-directional link that has a pair of nodes at the end with a pair of arrows, one going one way, another one going back (between the same pair of nodes, or inserting another pair of nodes for the "back" link below the original pair). And as a final touch, turn the links that go "around the circle" into the arrows, all of them going either clockwise or counter-clockwise.

P.S. A good reference page: http://algorist.com/problems/Graph_Isomorphism.html

<<Prev
 

Friday, March 8, 2019

defense against SQL injections

I've been reading the question on Quora:
Why can't SQL change so that injection attacks are no longer possible?

People there were referring to the example of getting past the value escaping. It works like this:

$iId = mysql_real_escape_string("1 OR 1=1");    
$sSql = "SELECT * FROM table WHERE id = $iId";

Well, this could easily be fixed by converting the variable to int before using it, but it's easy to forget. They also give examples of the Unicode character sets getting mishandled, and of using the wrong kind of quotes in the query.


But reading it gave me the idea that it might be actually easy enough to resolve: instead of using the plain strings, use the strings tagged with their expected type. I.e. “this is a part of a query”, “this should be an integer”, “this should be an external string”. It can be as simple as using a separate sanitizer function for each data type. So for example the string sanitizer would among other things add the apostrophes around the value. And the int sanitizer would check that the value is a valid integer. Then even if you accidentally use the string sanitizer for an int, your query just won’t compile. And the example from the injection attack would become
  1. $sSql = "SELECT * FROM table WHERE id = " + sanitize_int("1 OR 1=1");
The sanitizer would either throw an exception or cut down its argument to something like “1”, neither becoming a dangerous attack.

But it can be done better yet, by building the SQL query as a list of tagged objects and then passing this list of objects directly to the SQL compiler. The overloaded "+" can be used to make it look neat:
$sSql = sql_q("SELECT * FROM table WHERE id = ") + sql_int($arg);
Then the sanitizing would be done directly in the SQL compiler. Of course, it's not 100% fool-proof, because someone might still use the sql_q() tag on an external value, but it's still much better. And there are some ways to do better: Perl, for example, can use its tainting logic, and I think I've seen a C++ function that accepted only an array of characters as its argument.
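
Here is a minimal sketch of these tagged fragments in C++ (sql_q() and sql_int() above are hypothetical anyway, so these types are equally made up):

#include <stdexcept>
#include <string>

struct SqlFragment {
    std::string text; // already validated as safe for its tag
};

// The trusted query text, written by the programmer, never taken from outside.
SqlFragment sql_q(const std::string &s) { return SqlFragment{s}; }

// An external value that must be a single integer lexeme
// (for brevity, only the plain digits are accepted here).
SqlFragment sql_int(const std::string &s) {
    if (s.empty() || s.find_first_not_of("0123456789") != std::string::npos)
        throw std::invalid_argument("sql_int: not an integer: " + s);
    return SqlFragment{s};
}

SqlFragment operator+(const SqlFragment &a, const SqlFragment &b) {
    return SqlFragment{a.text + b.text};
}

// Usage: throws instead of quietly building a dangerous query.
// SqlFragment q = sql_q("SELECT * FROM table WHERE id = ") + sql_int(arg);

In a real implementation the fragments would keep their tags all the way to the SQL compiler instead of collapsing into a string, but the typing discipline is the same.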

Essentially, it injects the fragments of syntax into the statement, to be used by the SQL parser. The parser would then know that the contents of sql_int() has to be parsed as a single integer lexeme. And no matter what the character encoding, the compiler would know to treat the contents of sql_string() as a string.

This is really very similar to the parameter substitution in the prepared queries but it's much easier to read (no more need to count which "?" matches which parameter), and adds the explicit type information rather than having the SQL compiler guess, and allows to build the SQL statements dynamically, adding clauses and subqueries in a dynamic way. You know, things like “and if this condition is true, add a GROUP BY”, “for each parameter in the list, generate a query and make a UNION of them”, “and if another condition is true, instead of reading directly from a table, read from another sub-query”.

It can also be taken another step further, by creating types for the constant expressions and subqueries. The contents of sql_int_expr() wouldn't have to be a single integer, but could conveniently be a string containing a whole integer constant expression. Again, the compiler would ensure that it really is one. And the contents of sql_subquery() would be built from the same kind of fragments, guaranteeing that it's a whole subquery. So an accidentally misplaced parenthesis in it could be diagnosed more directly, without overflowing into the parent query.

Tuesday, December 25, 2018

Triceps 2.1.0 released

After a long break, Triceps 2.1.0 is finally released: http://triceps.sourceforge.net/

It includes two important features:

1. The fast compiled Ordered index. The code for it has been sitting in SVN for three years, and I have finally found time to update the documentation to replace the old ordered index implemented in Perl with the new one.

2. Triceps now builds and works with the modern versions of C++ and Perl. This was harder to do than it sounds, since the C++ standard has changed in an incompatible and unpleasant way. But the good news is that the code produced by the new compiler is a good deal faster than by the old one.

Friday, December 7, 2018

Principal component analysis

Today I've learned about the existence of the Principal component analysis. It's supposed to find the orthogonal dimensions that best describe the variance of the data set. Perhaps this is the real answer for removing the duplication in the dimensions that I was thinking about some time ago. And it was invented in 1901! I haven't understood it yet; I will need to read up on it. For now I'm just making a note so that I won't forget about it.

Saturday, December 1, 2018

Triceps in flux

I've finally updated my machine to a modern version of Linux (Mint 19), and now I'm going through, so to say, bringing Triceps into the Century of the Fruit Bat. Both the C++ and Perl languages have changed (and, as mentioned in a previous post, Ghostscript has had functionality removed).

Most annoyingly, C++ doesn't allow calling even the non-virtual methods on NULL pointers any more. Well, it does allow calling them, but auto-removes any checks for (this == NULL) inside the methods. And I've been using this in multiple places. But now it's all worked around by using special static objects, and the code builds again.

The tests still fail though, but now it's hopefully all trivial: some messages got changed because of the different hashing, and the Perl tests fail because the syntax of regexps changed in Perl 5.26, which doesn't accept a plain "{" in the patterns any more. But I'm working on it.

Friday, November 23, 2018

converting EPS to SVG

I've tried to build my book on a recent Linux Mint and failed: I've been using Ghostscript to convert the figures in EPS to the SVG format, for inclusion into PDF, and it turns out that Ghostscript doesn't support SVG any more. I've tried to download and build the source, but the source didn't have it either. I remember it got broken a long time ago; it was crashing on my previous attempt to use a newer Linux. I've found a bug report from 2013 on Ghostscript's web site. Apparently, instead of fixing the SVG driver, they've just dropped it.

But I was able to find another way, with Inkscape. The conversion command looks like this:

inkscape -D "$SRC_EPS" --export-plain-svg="$DST_SVG"

After this change the build worked like a charm.


The Triceps build will need a similar change. I'm about to try it.

Wednesday, November 7, 2018

how to reinvent the bicycle 2

I've been thinking more about the problem of teaching the engineering
approach to problem solving, and I've decided to try using another
interview problem as an example. It's an older problem, and I really like
it, but I don't use it any more because the other people liked and used it
too, and the solutions are already available on the Internet.

I'm not just going to give the solution: blindly memorizing it is of no
use to anyone. I want to show how the better solutions can be born out of
the bad solutions. I'm going to start with the worst solution I can think
of, and then gradually show the better solutions. The puzzle for you, the
reader, is to use the issues in these solutions as hints towards
the better solutions, that would take you as far ahead as possible.

I wrote those solutions as I would do at an interview, without actually
compiling and running the code on a computer. So they might contain bugs,
from trivial bugs to the major ones. Whenever I've noticed a major bug
after writing a solution and following through its execution manually, I've
left the buggy version in, and added the fixed version. Sometimes a solution
was so overcomplicated that I didn't notice a bug in it until I moved on to
another, simpler solution. There still might be bugs, but hopefully not too
bad ones. It just shows that the overcomplicated approaches should be
avoided.

The problem is to write a matcher for the very simple regular expressions,
that include only the operators "." (any character) and "*" (any number of
repeats of the previous character). The "*" is greedy, consuming as many
matching characters as possible. There is no escape character like
backslash. The string must match completely, as if the regexp implicitly
had "^" at the font and "$" at the end.

For the Windows people who don't know regular expressions, the regexps are
a way of pattern matching, kind of like the file patterns (which are known
in the Unix world as globs) but different. Most of the characters in the
regexps have the literal meaning, just as in the file patterns. The
character "." is like "?" in the file patterns, matching exactly one any single
character. The character "*" is different than in the file patterns. It
means zero or more repetitions of the previous element, so the file pattern
"*" is analogous to the regexp ".*".

The function declaration will be:

bool match(const char *pattern, const char *text);

Let's start with the analysis.

The first thing to notice about this problem is that some regular expressions
in it are impossible to match. The "a*a" will never match anything because
the greedy "a*" will consume all the "a"s, and the second "a" will never encounter
a match. The same goes for ".*" followed by anything, because ".*" will
consume everything to the end of the string.

The other thing to consider is, could the strings be in UTF-8? UTF-8 can
be handled transparently in many ways but this is not one of them. If
treated simple-mindedly, the "*" after a multi-byte Unicode character would
expect the repeat of its last byte. And a "." would match a single byte of
a multibyte character. Specifically for UTF-8, this problem
can be handled by keeping track of the continuation. Every byte of a
multi-byte character in UTF-8 has the high bit 0x80 set, and the first one of
them has also the bit 0x40 set, so you can parse where the character
starts. I've looked up the exact bitmasks on the Internet; it's not
something that I'd expect anyone to know off-hand at the interview, nor the
details of the exact UTF-8 encoding, but the general principle that UTF-8
is easy to parse is reasonable knowledge. However, this kind of parsing is a
bad idea anyway, because it would mess up the handling of the strings in
the single-byte 8-bit encodings. A much better approach is to use the
library functions to either convert the whole strings to wchar_t (requires
the allocation of extra memory) or to parse the strings
character-by-character as they get read (better). This is important to
mention to show that you understand the multi-byte encoding issues, and can
bring you bonus points at an interview. But the problem itself is not about
that, the multi-byte support can be trivially added after the general
solution is found, so let's assume that we don't need to care about that
for now, and the strings are in ASCII.
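
For reference, here is what the character-by-character parsing mentioned above might look like, as a minimal sketch that assumes well-formed input:

#include <cstddef>

// Returns the length in bytes of the UTF-8 character starting at p:
// the continuation bytes match the bit pattern 10xxxxxx.
size_t utf8CharLen(const char *p) {
    if (((unsigned char)*p & 0x80) == 0)
        return 1; // plain ASCII
    size_t n = 1;
    while (((unsigned char)p[n] & 0xC0) == 0x80)
        n++; // count the continuation bytes
    return n;
}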

The first solution goes in the most complicated way I can think of. You
might have attended a college course on parsing that talked about the
finite machine matcher for the regular expressions. The most unsophisticated
approach is to push this way blindly.

Each node of the finite machine graph could be a plain array of the
possible exits from that node, but to make things easier to write, let's
use a C++ map:

using namespace std;

struct Node {
    map<char, Node*> exits_;
};

The parsing of the pattern will start like this:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    Node *parsed = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        ...
    }

}

Since we're dynamically allocating the Nodes, we need to take
care of freeing them too. A simple way to do it is to use the smart
pointers such as shared_ptr or unique_ptr. But we can't store them
right in the Node, because the Nodes may have circular references. The
unique_ptr will just refuse to do so (if you insist with std::move(),
you'll get NULL pointers in unexpected places, and will probably
crash), and shared_ptr would create a link loop and leak the memory. It's
possible to arrange the entries in a plain list, and then free them at the
end manually. But the easier way to do it is to keep the reference counts
in a separate vector, and let the nodes connect in whatever ways with the
pointers. And these should be the real pointers, not the weak references.
The weak references are needed when you might need them to become strong at
some point. Here we have no such need, so save the overhead.

We can also use the holder to find the first and last allocated node, so we
don't need the separate parsed pointer.

What should the loop do? Since the Nodes contain the map of exit links and
not entry links, on each literal character in the pattern we'll need to
create a new node, and then update the _previous_ node with the exit link
to the new one. So we need to start by allocating one initial node before
the loop, to always have a previous node.

When we see a '.', we populate the map for every character code except '\0' to
point to the next node. If we had to deal with the wide characters, this
wouldn't work, because the number of characters would be too great. Then
we'd have to add a special "default" link. But for ASCII it's good enough.
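
Just to illustrate that idea of the "default" link (a sketch of an
alternative, not used in the code below; the names are mine):

struct WideNode {
    map<wchar_t, WideNode*> exits_;
    WideNode *deflt_ = nullptr; // taken when no explicit exit matches

    WideNode *next(wchar_t c) const {
        auto it = exits_.find(c);
        return it != exits_.end() ? it->second : deflt_;
    }
};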

When we see '*', we don't allocate a new node but copy the exit links of
the previous node into the last node, which makes the last node point to
itself (so we should also keep a pointer to the node before last).

And when we see a '\0', we will add to the last node an exit that maps the
'\0' to a NULL pointer, this being a signal to stop. We could also add a
separate boolean field final_ for this purpose, but a null pointer is just
as good, and safer in that we won't need a separate code path for
checking the finality, and will always check all the pointers for NULL.

Here is the loop fleshed out:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->exits_[(char)c] = holder.back().get();
                }
            } else {
                previous->exits_[*p] = holder.back().get();
            }
        }
    }
    holder.back()->exits_['\0'] = nullptr;
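    // ... the matching of the text against this graph will be added later ...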

}

This code has to deal with some situations that "shouldn't happen", the
invalid patterns that either start with a "*" or have multiple "*" in a
row. This is something that needs to be called out explicitly. Two ways to
deal with it are to either return some kind of an error indication (such as
throw an exception) or just process them in some not officially defined
way. This code takes the approach of the silent handling, simply because
it's easier to do in a small code snippet. From the caller's standpoint it
might be either good or bad: the good is that the caller won't have to
handle the errors, the bad is that the author of the pattern might be
surprised by its effect. Whatever choice you make, you need to be aware
of this trade-off. The silent handling here is that the "*" at the start is
treated as a literal (so the pattern "*a" matches only the text "*a"), and
multiple "*" in a row are treated as one.

This code is also not correct in the handling of "*", in two ways! How do I
know this? I tried to run through it manually on the examples of what I see
as the corner cases: "", "a", ".", "a*", "a*b", "a*a", ".*", ".*a" with the
matching and mismatching strings, and I actually tried the more complex
cases first.

The first defect is that it doesn't handle the greediness right. A pattern
like ".*a" will not be unmatchable; instead it would mean what the full
regexps can express as "[^a]*a". Which might be not bad by itself but
isn't what the problem statement says. Worse yet, due to the same mistake
the pattern "a*a" will be equivalent to a single "a", which is outright
wrong. The fix for this problem is to check whether the map value at this
key has been already set, and if so then don't overwrite it:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (exits_.find(c) == exits_.end()) {
            exits_[c] = to;
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            holder.back()->exits_ = previous->exits_;
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                }
            } else {
                previous->set_exit(*p, holder.back().get());
            }
        }
    }
    holder.back()->set_exit('\0', nullptr);

}

The second problem is even worse: "*" ends up meaning "1 or more
repetitions" instead of "0 or more repetitions", like the operator "+" in
the full regexps. To fix that, on the next character after a "*" we need to
add the new link to both the previous and pre-previous node.

The state machine graph for the expression "a*b" used to look like this:

 
      a
     +-+
   a \ /  b    \0
O-----O-----O-----X

Now it will look like this:

 
      a
     +-+
   a \ /  b    \0
O-----O-----O-----X
 \         /
  +-------+
     b

The mistake I've made here is that I didn't start writing out the
explicit graph examples until after finding the bug. It could have been
prevented by thinking through the graphs from the start.

The code change is:

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    Node *previous = nullptr;
    Node *before_star = nullptr;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && previous != nullptr) {
            if (before_star == nullptr) {
                holder.back()->exits_ = previous->exits_;
                before_star = previous;
            }
        } else {
            previous = holder.back().get();
            holder.push_back(make_shared<Node>());

            if (*p == '.')  {
                for (int c = 1; c <= 255; c++) {
                    previous->set_exit((char)c, holder.back().get());
                    if (before_star) {
                        before_star->set_exit((char)c, holder.back().get());
                    }
                }
            } else {
                previous->set_exit(*p, holder.back().get());
                if (before_star) {
                    before_star->set_exit(*p, holder.back().get());
                }
            }
            before_star = nullptr;
        }
    }
    holder.back()->set_exit('\0', nullptr);
    if (before_star) {
        before_star->set_exit('\0', nullptr);
    }

}

This finally took care of the pattern parsing. Mostly. It still has a bug
with handling the sequences of starred characters, such as matching the
pattern "a*b*c" to the string "c". It will not accept "c" because at each
point we update only one previously starred node, not the whole sequence of
them. This is pretty tricky code, and I didn't even notice this error
until I started doing this problem the next, easier way, and noticed it
there, and then realized that the first version had this issue too.

As it has turned out after I drew the graph and traced the code, it's even
worse than that. In the "a*b*c", it would accept any mix of "a"s and "b"s,
rather than strictly "a"s followed by "b"s, caused by the copying of all the
links from the node before the star.

Well, let's fix it. The graph for "a*b*c" used to look like this:

 
      a      a,b 
     +-+     +-+ 
   a \ /  b  \ /  c       \0
O-----O-------O-------O-------X
 \     \     /       /
  +-----\---+       /
   b     +---------+
              c

It should look like this:

 
      a       b 
     +-+     +-+ 
   a \ /  b  \ /  c       \0
O-----O-------O-------O-------X
|\     \     /       /|
| +-----\---+       / |
|  b     +---------+  |
|             c       |
 \                   /
  +-----------------+
          c

Whenever there is a sequence of starred characters, there has got to be a
link from before each starred node to each following starred node, and to
the node past the stars. The only way I could figure this out was with the
graph diagram.

This is very difficult to do by just following the graph, because there is
no indication in the graph of which link goes to the following node and
which links go to the other nodes, including the node itself.

The better way to resolve it is to notice that we're appending the nodes to
the vector sequentially anyway. So we can use the vector order to go
through them in sequence. Thus the variable before_star turns from a
pointer into a vector index:

struct Node {
    map<char, Node*> exits_;

    void set_exit(char c, Node *to) {
        if (c == '.') {
            for (int i = 1; i <= 255; i++) {
                if (exits_.find((char)i) == exits_.end()) {
                    exits_[(char)i] = to;
                }
            }
        } else {
            if (exits_.find(c) == exits_.end()) {
                exits_[c] = to;
            }
        }
    }
};

bool match(const char *pattern, const char *text) {
    vector<shared_ptr<Node> > holder;
    holder.push_back(make_shared<Node>());
    int first_link = 0;
    char last_char = '\0';

    enum State {
        AFTER_STAR,
        AFTER_LITERAL,
    };
    State state = AFTER_LITERAL;

    for (const char *p = pattern; *p != 0; p++) {
        if (*p == '*' && holder.size() > 1) {
            // a star turns the last node into a self-loop
            holder.back()->set_exit(last_char, holder.back().get());
            state = AFTER_STAR;
        } else {
            last_char = *p;
            if (state == AFTER_LITERAL) {
                first_link = holder.size() - 1;
            }

            holder.push_back(make_shared<Node>());
            for (int i = first_link; i < (int)holder.size() - 1; i++)
                holder[i]->set_exit(*p, holder.back().get());

            state = AFTER_LITERAL;
        }
    }
    // the implicit '\0' at the end gets linked in the same way as a
    // literal character would, except that no new node gets allocated
    if (state == AFTER_LITERAL) {
        first_link = holder.size() - 1;
    }
    for (int i = first_link; i < (int)holder.size(); i++)
        holder[i]->set_exit('\0', nullptr);

}

The setting of all the exits on a "." became needed in multiple places, so
it moved into set_exit(). The parsing of the star syntax became more
complicated, requiring us to remember the last character, and to keep the
state for detecting where a sequence of starred characters ends.
The "classic" way of state handling is to put the state processing as a
switch statement in the loop, but that would have made the code way too
painful and repetitious, so I've decided to skip that example of doing
things per an unsuitable pattern and move right on to the more
straightforward version above.

Now let's do the pattern matching:

bool match(const char *pattern, const char *text) {
    // ... skipped the pattern parsing ...

    Node *cur = holder.front().get();
    for (const char *t = text; cur != nullptr; t++) {
        auto it = cur->exits_.find(*t);
        if (it == cur->exits_.end())
            return false;
        cur = it->second;
    }
    return true;
}

It's done, but it hadn't been a small feat, with all the memory management
thrown in, and all the opportunities to make mistakes in the pattern
building. It can be powered through as a show of brute force, but I don't
think that I've done it all in under an hour, which is the usual interview
time slot.
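
By the way, the manual tracing can be backed by a tiny test driver along
these lines (my sketch, using the plain assert), which would catch the
bugs described above:

#include <cassert>

int main() {
    // the impossible patterns from the analysis
    assert(!match("a*a", "aa"));
    assert(!match(".*a", "xa"));

    // the basic and greedy cases
    assert(match("", ""));
    assert(match("a*b", "b"));
    assert(match("a*b", "aaab"));
    assert(match(".*", "anything"));
    assert(!match("a", ""));
    assert(!match(".", ""));
    return 0;
}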

I hope that at the very least by the time I've started talking about
building the holder vector, you've got a better idea: this vector goes in
parallel with the original pattern string, so why not just work directly
with the pattern string? Indeed, this is a much easier way. But it also has
its harder and easier versions.

Again, let's start with the harder version. The loop will be working in
very much the same way as the matching loop in the parsed-pattern version
(as some textbooks would teach), but will read the pattern directly from
the string.

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';

    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points after the last literal
            // character but possibly before the following star

            if (*p == '*') {
                if (last_pattern == '.' || *t == last_pattern) {
                    if (*t == 0)
                        return true;
                    break;
                }
                ++p;
            }

            last_pattern = *p;
            if (*p == '.' || *t == *p) {
                if (*t == 0)
                    return true;
                ++p;
                break;
            }

            if (*p == 0)
                return false;
            ++p;
            if (*p != '*')
                return false;
        }
    }

    return false; // never reached
}

The "for" loop here is interesting, without an exit condition. This is
because the matching of '\0' follows the common pattern, and is done in the
same loop code, with the only addition being the check for the end of lines
when we detect a match of the literals. Otherwise, if we'd checked the text
and pattern for EOL in the "for" loop condition, we'd have to repeat the
inner "while" loop once more after the "for" loop, to account for the case
when the pattern ends with a sequence of starred characters.

The inner loop is necessary to handle the sequences of multiple starred
characters, such as "a*b*c" matching the "c". If we don't do the loop, "c"
would get compared to "b" and the match would be considered failed. Since
this code is much smaller and simpler than the parsed-pattern version, this
becomes much easier to notice and implement - just add a loop!

Speaking of the stars, this version takes a different approach to the
improperly placed stars in the pattern, just because it was easier to do
this way. The leading star gets ignored, and the second star in a row gets
treated as a literal. Which is OK, since this is the behavior for the
undefined patterns (unless you're writing code that must be compatible with
some existing library and need to repeat its quirks).

This code also contains a bug. It happily matches the '.' followed by
anything in the pattern to an end of line in the text. To fix that bug, the
conditions of the form

if (*p == '.' || *t == *p)

need to be replaced with

if (*t == *p || *t != 0 && *p == '.')

There is another way to take the same general approach but keep the main
loop more classic, by using a function for the internal while loop, and
then this function can be easily called twice. It isn't any better or
worse, just different (and I've included the bug fix into it):

bool match_one(char &last_pattern, const char *&p, char c) {
    while (true) { // for starred sequences in pattern
        // at this point p always points after the last literal
        // character but possibly before the following star

        if (*p == '*') {
            if (c == last_pattern
            || c != 0 && last_pattern == '.') {
                break;
            }
            ++p;
        }

        last_pattern = *p;
        if (c == *p || c != 0 && *p == '.') {
            ++p;
            break;
        }

        if (*p == 0)
            return false;
        ++p;
        if (*p != '*')
            return false;
    }

    return true;
}

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';

    const char *p = pattern;
    for (const char *t = text; *t != 0; t++) {
        if (!match_one(last_pattern, p, *t))
            return false;
    }
    if (!match_one(last_pattern, p, '\0'))
        return false;

    return true;
}

And as yet another variation of the same, you could also look ahead for the
stars after each pattern character and save this state in a separate variable:

bool match(const char *pattern, const char *text) {
    char last_pattern = '\0';
    bool last_starred = false;

    const char *p = pattern;
    for (const char *t = text; ; t++) {
        while (true) { // for starred sequences in pattern
            // at this point p always points before the next literal
            // character

            if (last_starred) {
                if (last_pattern == '.' || *t == last_pattern) {
                    break;
                }
            }

            last_pattern = *p++;
            last_starred = (last_pattern != 0 && *p == '*');
            if (last_starred) // skip over the star
                ++p;

            if (*t == last_pattern
            || *t != 0 && last_pattern == '.') {
                if (*t == 0)
                    return true;
                break;
            }

            if (!last_starred)
                return false;
        }
    }

    return false; // never reached
}

I think that this version is much simpler, and in the real world, if I had
started writing the previous version, I would never have finished it but
switched to this one instead.

But even this version is not great. Having to keep state feels like too much
work. It would be much easier to go the other way around, iterate through
the pattern and consume the matching characters from the text. Then the
state will become embodied in the code rather than variables, and will be
simpler:

bool match(const char *pattern, const char *text) {
    const char *t = text;
    for (const char *p = pattern; *p != 0; p++) {
        if (p[1] == '*') {
            if (*p == '.') {
                while (*t)
                    ++t;
            } else {
                while (*t == *p)
                    ++t;
            }
            ++p; // adjust to consume the star
        } else if (*p == '.') {
            if (*t++ == 0)
                return false;
        } else {
            if (*t++ != *p)
                return false;
        }
    }
    return *t == 0;
}

This version is much smaller and much easier to follow. It does an
explicit selection by the type of each pattern element, so each of them
has its own code fragment, instead of the logic being spread through the
code and mixed with the logic of the other elements. And all this makes
the creation of bugs more difficult.

A slightly special thing about this version is that it looks ahead by two
characters, not just one, to detect that the current pattern character is
followed by a star. It's not something that's usually taught but it makes
the code a lot easier.

This version also has a theoretical foundation: it's a recursive-descent
LL(1) parser of the text. Except that the regular expressions define a
non-recursive language, so there is no recursion. It really is perfectly
per textbook, you've just got to pick the right textbook! It also parses
not a fixed grammar but one given in the regular expression pattern. So
it's an LL(2) parser of the pattern, with the nested LL(1) parsers of the
matching substrings in the text. The 2 in LL(2) means that we're looking
ahead by 2 characters. It can also be parsed by an LL(1) parser, as has been
shown before, but looking ahead by 2 characters makes it easier.

It is the version that kind of came to my mind almost right away when I
first thought about this problem. But I can't really say that it just popped
into my mind out of nowhere. I do size up the different approaches in my
mind, and try the ones that look simpler first. It doesn't mean that this
first estimation is always right. Sometimes I go pretty deep along one
approach before deciding to abandon it, and applying the lessons learned to
another approach. And sometimes this other approach ends up even worse,
but the lessons learned there help to get through the logjam on the first
approach.

So if you start with the poor approaches, you can still arrive at the
better ones by listening to the hints that the code gives to you as you
write it. When you see an easier way to go, use it. You can also power
through the difficult approaches to the successful end but that tends to be
much more difficult than switching the approach to a simpler one.

Thursday, October 25, 2018

how to reinvent the bicycle

I've been recently conducting some job interviews, and met a spate of junior engineer candidates with a similar issue: they could quickly come up with a pretty good overall idea of the solution, and they could write the code as such, but they would fail to translate this overall idea into the code. They couldn't divide the overall idea into components and then step by step work through the details and interdependencies of these components and sub-components, even with the intense hinting from my side.

This is something that should be done almost mechanically, with little mental effort. And yet they could not do it: they would try to do it by intuition, but their intuition was not strong enough to handle this complex problem, and they didn't know how to use the mechanical approach either. The hints didn't help much because they didn't cause the right mechanistic associations.

The problem I ask is actually quite difficult, too difficult for a perfectly adequate junior-to-midlevel engineer, and I'm not sure if I myself would have solved it well some 20 years ago (I know that I can solve it now, it came from my real-life experience where I had to do this thing from scratch real fast). So I don't expect a good solution from this category of candidates, a so-so solution is plenty good enough for them (although some of them do very well, producing a fully completed optimal solution). But there is a difference between producing a complete solution that is not that good and getting the right overall idea, figuring out the conceptual parts that I consider difficult and important (that the first group never figures out), only to fail miserably to work out all the details necessary to write the code. Not that the code doesn't get produced at all (though sometimes it doesn't), but what gets produced has glaring holes and is much worse than the code produced by the first group. Interestingly, I see a good deal more women than men in this second group. The first group is, I think, about even between men and women; this sounds like a statistical impossibility but this second group is small compared to the first one and doesn't shift the overall averages that much. Granted, the sample size is small, so it might well be a statistical fluctuation, or maybe not.

I don't want to discuss the exact problem I use, because I don't want it to spread over the internets, making me invent another one. But I've come up with a good analogy that illustrates the issue. Some time ago I've read about the artists that would ask people to draw a bicycle, and then would produce a bicycle in this drawn shape as an art object. It's suitable to be only an art object because it's completely non-functional. If I were to draw a bicycle without thinking, I would also produce something like that. But spending some thought, any engineer should be able to reproduce a proper bicycle from the general logic: the function of the main components (wheels, steering, seat, pedals, chain) and the general considerations of the strength of the frame that shape it. The same logic can be used to check that none of the main components were forgotten: for example, if you forget about the chain, the pedals would be left disconnected from the rear wheel, so you'd have to recall it. Each component might be very non-trivial (the said chain took a long time to invent), but once you know the components, it should be impossible to put them into a wrong place. If an engineer figured out the chain but couldn't put it into the right place, there is something very wrong with this engineer.

The problem, I think, is that people are not really taught to do this kind of thinking in programming. The books and college courses describe the syntax of the programming languages and the general picture but leave a void between these layers. People learn this on their own from the examples and practice. But the examples and practice tend to train the intuition, and people are left to figure out the mechanistic approach on their own, and they either figure it out or they don't. It looks like quite a few of the generally smart people either don't or take a long time to develop it. Not that there is anything wrong with intuition (it's my favorite thing), but the mechanistic approach allows you to stretch a good deal beyond the immediate reach of intuition, and to grow the future intuition.

I've recently seen a question on Quora, approximately "As you gain more experience, do you still write code that works but you don't know why?", and this I think is exactly the difference between the intuitive and mechanistic solutions. The intuition might give you some code that works, or possibly doesn't. The mechanistic approach lets you verify that what the intuition provided actually does what it's supposed to do, and provides the stepping stones for the intuition to go further: both to fix what is going wrong, and to do the more complex multi-leap designs.

By the way, the functional programming in Haskell and such seems to be an area where people are exhibiting this issue in a particularly exaggerated way: they write the high-level code hoping that the compiler will do the mechanistic work for them, and the compiler does but often not in the way they expect.

Programming is not the only area with this kind of a teaching problem. I think math has the same issue. The way the proofs of various theorems are taught is usually not the way how the authors had discovered them. These proofs get edited and adjusted a lot to make them look easier to understand. But this loses the teaching aspect of how to create the new proofs.

So, how would you teach it? Perhaps by walking through the solutions of the complex problems, showing step by step, how the initial general ideas get gradually solidified, deducing the requirements for their elements, and recursively for the sub-elements, resolving the conflicts along the way, with setbacks and retries.

The bicycle example suggests that there is probably a general transferable skill too, and this skill can be trained by the puzzle games like the classic "The Incredible Machine" where the goal is to build a Rube Goldberg contraption to accomplish the particular goal from a set of components. Like in the real life, the tasks there might include the extra components that look useful but don't really work out, or provide multiple ways to reach the goal. This is of course different from programming in the way that you need to achieve only one exact goal, while in programming you have to solve a whole class of related goals that include the corner cases, but still is a good point to start.

So the better way to teach for programming would probably be a combination of the "lecture material" and of the puzzles where the students have to build the working code from a number of provided hints.

I actually know an old book that does something like this: Jon Bentley's "Programming Pearls". It's built around a somewhat different idea but maybe it's close enough.