Wednesday, March 8, 2017

OS support for thread stopping

I've recently had done some thinking about how can an OS make the shutdown of a multithreaded application or its part easier, to stop the threads quickly but relatively nicely on request. I think a very helpful support can be pretty easy to implement:

  • Have system call to request the shutdown of an OS-level thread.
  • This call would make any system calls that could block non-blocking (except for the thread synchronization calls, like mutex or condition variable operations), would interrupt any current such blocking calls, and return an error. This includes the timed sleeps, and even the thread synchronization calls with timeouts would return an error immediately instead of sleeping. Optionally, it might also just fail any file operations altogether, possibly except for close(). Note that the file descriptors as such wouldn't be revoked altogether, they would just kind of stop working in one thread, but the other threads would still work.
  • A problem with this might be in such things as logging: when your thread is stopping, it might still write the useful log files. Revoking these log files would be  Well, one solution for this would be to buffer the logs in memory and write from a separate thread. Another might be to keep some kind of exception list.
  • The threads might also beget threads, and keeping track of all of them is complicated. So there is a need for grouping them, like the process groups in the OS, or Triceps Triead fragments, and then stopping the whole group.
Yet another separate approach might be to say "but if you want the ability to stop something, why don't you run it as a separate process and kill if when needed?". This would make a lot of sense for such things as PAM authentication modules.

There are problems with the separate processes too. But after some thinking, they seem to be fairly easy to solve.

When you think about processes, usually a separate binary comes to mind but it doesn't have to. You can just do a fork() and run from the same binary, and you even get the whole address space copy-on-write out of the box. Though I guess it might be interesting to NOT copy-on-write the whole address space. Instead just allocate some communication area for passing the arguments and results, and get rd of the rest, maybe even re-running the run-time loader from scratch in the new context (that would be kind of like the Java class loader spaces).

Communication with a separate process is kind of painful. But if you map an anonymous memory segment, that segment can be inherited by the child process. Indeed, should be inherited to act as the communication region (truly shared, not COW) even if the rest of address space is dropped. And then you can happily create the mutexes and condition variables in it for synchronization. The arguments and results would have to be serialized to pass through this region but it shouldn't be such a big deal. It would be also possible to keep some large data dictionaries in this communication space.

Of course the callbacks wouldn't work across the process boundary. But there is a solution for this too: create a local thread pool (or simply a thread) for callbacks that would get the requests from the remote process though a portion of the same communications region, execute the callback, and return the results back there.

I think this looks feasible even without any new technology needed, just needs a helper library to implement the interface. This isn't even a new concept, it's a classic microkernel architecture without the common memory space. Yeah, copying the data in the serialized form adds an overhead but it's not that much overhead compared to say JVM, and people use Java and .NET all over the place with no regrets.

Sunday, March 5, 2017

Bayes 25 & AdaBoost: the partial confidence problem

The code in the previous example works only with the full-confidence events, those that have the confidence of 0 or 1. Unfortunately I haven't found a good way to handle the partial confidence yet. I've found a way but it has problems. Let me show you.

Previously I was treating the partial confidence as a superposition of 0 and 1. This approach cannot be properly used here because it doesn't cover the connections between the events. To give an example, in the previous post the code had correctly found that in tab24_01.txt the events E1and E2 are the same, so after E1 is applied, E2 becomes ignored (it's AdaBoost probabilities become 0.5). So now if we then do a computation and find that E1 is unknown (i.e. its confidence is 0.5) but E2 is known, E2 will still be ignored because its table entry says to always ignore it.

The correct way should handle this situation properly: if E1 is unknown, this should bring up the table values for E2 to compensate for it and treat E2 as E1 was formerly.

The obvious way to do it is to treat the partial confidence as a superposition of "the event is unknown" and "the event is known". So if the event is known, we process it as before. If the event is unknown, we would follow the unknown branch and then apply all the following events as if this events wasn't there. And if the event is partially known, we would split the weights of all the outcomes proportionally and handle one part with the known branch and another one with the unknown branch.

So the confidence of 0.7 would mean a superposition of 0.4 of unknown and 0.6 of the known value "1". The confidence 0.1 would mean a superposition of 0.8 of unknown and 0.2 of the known value "0". So if we denote C(E) as the event confidence, K(E) as the known weight in the superposition, and U(E) as the unknown weight in the superposition, we can write them as:

K(E) = abs(C(E) - 0.5) * 2
U(E) = 1 - K(E)

The downside of this obvious approach is that the table and computation becomes more complex. As we go through more events, two branches become four, then eight, then sixteen, and so on. The formerly quadratic complexity O(Noutcomes * Nevents) becomes exponential O(Noutcomes*Nevents*2^Nevents). I don't know yet if there is a better way.

Maybe it would be possible to build a table of "gradients" between the events: i.e. if the confidence of the event A changes, how are the probabilities of another event B affected. The problem though is that this gradient depends on all the events in the sequence between A and B, so tis idea doesn't seem to really fly.

Maybe another idea is workable, that would reduce the exponential cost to a cubic one: after splitting the superposition, instead of following two branches, just take the probabilities for each following event from both branches and mix them by the weight of the branches:

P(H|Ei) = K(Ej) * P(H|KjEi) + U(Ej) * P(H|UjEi)

It's better than exponential but still not great, with the run time O(Noutcomes*Nevents*Nevents). And the table of these pre-computed values would still seem to be exponential, each event depending on all the possible combinations of the previous events.

Wednesday, March 1, 2017

Bayes 24: AdaBoost as a solution to the correlated events

When writing the whitepaper, I've realized that AdaBoost is very good at finding the cross-correlation of its partial hypotheses (that map to the Bayesian events), and it always picks the next step with the minimal cross-correlation. The Bayesian computations do have an issue with cross-correlation, as described in the whitepaper and near the end of Part 8 of my notes: if there are multiple events that duplicate or mirror each other, all of them are treated as independent and drive the probabilities off. This looked like AdaBoost could give a hint of how to solve this problem, so I've started thinking about it, and I think I've found a solution.

In retrospect the solution is pretty obvious: instead of computing the Bayesian model directly, treat it as an AdaBoost model whose set of the partial hypotheses equals to the set of the Bayesian events, with an extra limitation that each one of them must be used exactly once. If there are multiple equivalent events and they happen to go one after another, AdaBoost will treat the first one as real, and the following ones will get their error rate of 0.5 and will be effectively no-ops.

Well, if there are other events interspersed in between these equivalent copies, they won't be no-ops. These other events will shift the weights of the training cases, so that the second an further copies will find themselves drifted away from 0.5. The good news here is that AdaBoost is pretty good at removing the partial correlation too but the bad news is that the order matters, to fully remove the duplication the equivalent events should go one after another. And in general, it looks like the best removal of the cross-correlation happens when the closely related events follow closely one after another. This kind of order can be implemented if we reverse the AdaBoost preference of the partial hypotheses and make it always pick the one that has its training error rate closest to 0.5. Since (unlike the classic AdaBoost) the set of partial hypotheses is limited and each element from it must be used only once, the better ones will eventually be used as well.

To experiment with these ideas, I wrote the code example It doesn't reorder the events, just computes the AdaBoost values for whatever order it's given, allowing to experiment with the various orders. But it has a couple of advanced features from the later chapters of the book on AdaBoost: it does the multi-class (but single-label) version that also partitions the error rate computation for  the sign of the value of the partial hypothesis.

The multi-class means that it can work not only with two outcomes (a yes/no decision) but with multiple possible outcomes (i.e. Bayesian hypotheses), just like the normal Bayesian table. And just like the normal Bayesian table, these outcomes must be mutually exclusive, so each case may have only one outcome (single-label). The multi-class logic works in the way that is similar to the one described in Part 10: the resulting table is just a combination of the individual per-outcome tables. From the standpoint of each outcome, the single-label property means that the rest of the rows of the table taken together represent the opposite of this outcome, so there is no need to compute that opposite explicitly.

The partitioning of the error rate computation not only makes the result better but also comes very handy to prove that the opposite of one outcome is equivalent to the rest of the rows of the table. The AdaBoost training error e is the ratio of the total weights of the training cases where the value of the event E (that is, the Bayesian event, in AdaBoost terms it's a partial hypothesis) is not equal to the value of the outcome (Bayesian hypothesis) H. And we'll compute it separately for E (i.e. E=1) and ~E (i.e. E=0). It ends up expressable as probabilities that can be computed by summing and dividing weights W of the training cases:

e(H,E) = P(~H|E) = (sum for cases i where H(i)=0 and E=1 (Wi))/(sum for cases j where E=1 (Wj))
e(H,~E) = P(H|~E) = (sum for cases i where H(i)=1 and E=0 (Wi))/(sum for cases j where E=0 (Wj))

So if we pick a particular outcome and denote it H1, we can show that its opposite ~H1 is equivalent to the sum of the rest of hypotheses H2...HN by showing that

P(~H1|E) = sum for i in [2, N] (P(Hi|E))

which after removing the common denominator means

(sum for cases j where H1(E)=0 and E=1 (Wj)) = (sum for i in [2, N] (sum for cases k where Hi(E)=1 and E=1 (Wk))

The single-label property means that each case is marked with exactly one outcome, and all the cases where the outcome H1 is absent must have some other outcome Hi present, so these sums are indeed equal.

So it is all self-consistent. The key part of the computation is:

for my $hyp (values %abhyp) {
        if ($conf) {
            $hyp->{wt} *= $hyp->{ppos}[$evi];
        } else {
            $hyp->{wt} *= (1. - $hyp->{pneg}[$evi]);

And the matching part of the table building is:

foreach my $c (@case) {
        my $hyp = $c->{hyp}[0];
        # this code really handles only the TC values of 0 and 1,
        # for the arbitrary values it would have to do an interpolation
        if ($c->{tc}[$i] >= 0.5) {
            $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
        } else {
            $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);

The training undoes the weight change that will be done in the computation.

Now let's try some examples. Here is the example that I used in the whitepaper:

$ perl -n tab24_01.txt in24_01_01.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here the option "-n" means "don't normalize the weights" (or when normalized to the sum of 1 they will be the same as probabilities). The two values printed in the AdaBoostian table separated by a "^" are the P(H|E) and P(~H|~E).

As you can see, the result is not correct either (because of the XOR problem) but it's better than with the plain Bayes version that gave hyp1 the probability 0.667. And since E1 and E2 are the same, E2 ended up with the values of 0.5 in the table, and applying it didn't change the weight.

If we change the order of events, E2 gains some use back:

$ perl -n tab24_02.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E3          ,E2          ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E3             ,E2             ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E2, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455

Rather surprisingly, the result ended up more correct. Or maybe it's not that surprising: after all, AdaBoost is designed to gain more information from available data.

Yet another interesting result get produced from repeating the same pair of events multiple times:

$ perl -n tab24_03.txt in24_03_01.txt 
Original training cases:
!      ,       ,E01         ,E02         ,E11         ,E12         ,E21         ,E22         ,E31         ,E32         ,E41         ,E42         ,E51         ,E52         ,E61         ,E62         ,E71         ,E72         ,E81         ,E82         ,E91         ,E92         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E01            ,E02            ,E11            ,E12            ,E21            ,E22            ,E31            ,E32            ,E41            ,E42            ,E51            ,E52            ,E61            ,E62            ,E71            ,E72            ,E81            ,E82            ,E91            ,E92            ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,0.49694^0.49526,0.49916^0.49946,0.49990^0.49985,0.49997^0.49998,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,0.50306^0.50474,0.50084^0.50054,0.50010^0.50015,0.50003^0.50002,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
--- Applying event E01, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E02, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E11, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455
--- Applying event E12, c=1.000000
hyp1    w=0.18831 p=0.54242
hyp2    w=0.15886 p=0.45758
--- Applying event E21, c=1.000000
hyp1    w=0.09400 p=0.54159
hyp2    w=0.07956 p=0.45841
--- Applying event E22, c=1.000000
hyp1    w=0.04699 p=0.54149
hyp2    w=0.03979 p=0.45851
--- Applying event E31, c=1.000000
hyp1    w=0.02349 p=0.54147
hyp2    w=0.01990 p=0.45853
--- Applying event E32, c=1.000000
hyp1    w=0.01175 p=0.54146
hyp2    w=0.00995 p=0.45854

--- Applying event E92, c=1.000000
hyp1    w=0.00000 p=0.54146
hyp2    w=0.00000 p=0.45854

AdaBoost converges pretty quickly to the weights that make both events get the probability 0.5 and stop the further changes.

This seems to work, although it needs more testing. One more limitation of this implementation is that it works only with the events that have the value of either 0 or 1, not anything in between. I'll talk more about this in the next post.