Tuesday, August 1, 2017

Bayes 27 & AdaBoost: another problem with partial confidence

The solution from the previous installment can handle the partial confidence but it has its own issue. To demonstrate it, let's look at a run with the table from the part 26 but where both of the fully-correlated events E1 and E2 get a confidence of 0.7:

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333,
+      ,       ,               ,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667,
+      ,       ,               ,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.700000
hyp1    w=1.70000 p=0.56667
hyp2    w=1.30000 p=0.43333
--- Applying event E2, c=0.700000
hyp1    w=0.91800 p=0.60554
hyp2    w=0.59800 p=0.39446
--- Applying event E3, c=1.000000$ perl ex27_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt
hyp1    w=0.36720 p=0.50579
hyp2    w=0.35880 p=0.49421

The probabilities move after applying E1, after E2 they move again. The interesting question is, should have they moved after E2? It really depends on how E1 and E2 have been measured: are their confidences derived independently or tracking back to the same measurement?

The code assumes that they are independent, so if both are independently measured at the confidence of 0.7, together they provide a higher confidence. But if they both come from the same test, this assumption is wrong.

If they came from the same test then E2 carries no additional information after E1 has been processed. In this case we should have treated the value C(E2)=0.7 (i.e. C(E2)=C(E1)) as having the meaning "I don't know", and then the differences from it would carry the new information.

But this approach has its own problems. First of all, it's fine if the confidence values C(E) represent the measured probabilities. But if they represent some expert estimation then an expert saying "I don't know" and giving the confidence of 0.5 would shift the probabilities instead of leaving them unchanged. Well, it it could be re-scaled automatically but then how do we know that the expert means "I don't know" and not "I've carefully considered everything and I think that it's a 50-50 chance"? So it sounds like getting it right would require using two separate numbers: the absolute confidence that tells how sure we are of the estimation, and separately the estimation. Which would create all kinds of messes.

The second problem, if we want to do such a re-scaling, to which value do we re-scale? It's reasonably easy if two events are copies of each other but what if the correlation is only partial? There actually is an a rather easy answer to the problem: the needed value would be P(E), the probability that the event is true at that point (after the previous events have been considered). I've actually went and modified the code to compute the P(E) based on the weights only to find out that I've forgotten an important property of the AdaBoost logic. Let me show its output:

$ perl ex27_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1                     ,E2                     ,E3                     ,
hyp1   ,3.00000,0.66667^0.66667@0.66667,0.66667^0.66667@0.66667,0.40000^0.33333@0.50000,
+      ,       ,                       ,0.50000^0.50000@0.50000,0.40000^0.33333@0.50000,
hyp2   ,3.00000,0.33333^0.33333@0.33333,0.33333^0.33333@0.33333,0.60000^0.66667@0.75000,
+      ,       ,                       ,0.50000^0.50000@0.50000,0.60000^0.66667@0.75000,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.700000
Expected P(E)=0.500000
hyp1    w=1.70000 p=0.56667
hyp2    w=1.30000 p=0.43333
--- Applying event E2, c=0.700000
Expected P(E)=0.513333
hyp1    w=0.91800 p=0.60554
hyp2    w=0.59800 p=0.39446
--- Applying event E3, c=1.000000
Expected P(E)=0.598615
hyp1    w=0.36720 p=0.50579
hyp2    w=0.35880 p=0.49421

The computation table now has one more element printed after "@", the estimation of event's probablity for the particular branch based on the previous events. During the computation these values get mixed together based on the weight of the branches. But notice that when applying E2, the message says:

Expected P(E)=0.513333

It's 0.513333, not 0.7! The reason is two-fold. First, AdaBoost works by adjusting the weights of the training cases such as to bring the probability of the last seen event to 0.5. It just does the correction opposite to what I was trying to achieve. Second, these are the wrong probabilities in the table. These probabilities are computed per-branch based on the absolute confidence. They don't care if the actual confidence was 0.7 or 0.3, these values are the same from the standpoint of the absolute confidence. But this difference matters a lot when trying to predict P(E)!

Instead we need to honestly compute the P(E) based on the previous events, not per the AdaBoostian table but per the original training cases. But if we do that, we end up again with a table of exponential size. The trick with ignoring all the events but the last few might still work though. The idea would be similar: since we try to order the closely-correlated events close to each other, the previous few events would be the ones with the strongest effects. And if the farther events aren't closely correlated, they won't have a big effect anyway, so they can be ignored. Maybe I'll try this next. Or maybe it's simply not worth the trouble.

Monday, July 10, 2017

Bayes 26 & AdaBoost: more on the partial confidence

I think I've found at least a partial solution to the problem of partial confidence that avoids going exponential.

It grows right out of the problem definition: the problem really exists only for the closely related events. If two events are unrelated, then the confidence of one wouldn't affect the other in any noticeable way. But for the related events, the first one takes up the bulk of the effect, and then the effect of the following events becomes greatly diminished. If two events are actually the same (like E1 and E2 in tab24_01.txt in the part 24), the second event gets stripped of any effect whatsoever. But if at the execution time we find that we have no information about the first event and good information about the second event, it's too late to make use of the second event because its probabilities are already hardcoded in the table and condemn it to having no effect. But if we could change the probability at run time then we could use the second event instead. Better yet, the probabilities of the following events would be unaffected because they don't care which one of the duplicate events provided the information. For a sequence of not equal but closely correlated events, if the first one is excluded from the table preparation, the others still seem to converge in their effects to something close to the original sequence, and the following events don't get affected that much.

Since the logic in the part 24 assumes that the closely-related events would be placed in the sequence next to each other, the solution might be to compute an event's table entry in multiple versions, depending on whether the values of a limited number of the previous events are known or not. It would still be exponential for this number of previous events. But since we expect a fast convergence and hope that not too many event values will be unknown, this number of previous events can be small, and the exponent of it would be reasonably small too.

By the way, I've come up with a name for the K(E) described in the previous part: the "absolute confidence". The confidence C(E) has the range from 0 showing that we're completely sure  that the event is false, though 0.5 showing that we have no idea, to 1 showing that the event is true. The absolute confidence

K(E) = abs(2*C(E)  - 1)

has the range from 0 showing that we have no idea to 1 showing that we're completely sure that the event is true or false.

I had to draw a picture to figure it out how it would work. Suppose that we want to take 3 previous events into the consideration. The datapoints with the probability values for these events can then be seen as nodes of a tree, with the nodes for each event aligned vertically under the event name:

E1         E2       E3       E4        E5
                      +------o
                    0/
              +-----o
            0/      1\
            /         +------o
      +----o
     /      \         +------o
   0/       1\      0/
   /          +-----o
  /                 1\
 /                    +------o 
o                              
 \                            0+-------o
  \                   +------o<
   \                0/        1+-------o
   1\          +----o
     \        /     1\        0+-------o
      \     0/        +------o<
       \    /                 1+-------o
        +--o                  
            \                 0+-------o
            1\        +------o<
              \     0/        1+-------o
               +----o
                    1\        0+-------o
                      +------o<
                              1+-------o

E1 still has only one probability value because it has no preceding events. But E2 has two possible probabilities, depending on whether E1 was known (K(E1)=0) or unknown (K(E1)=1). E3 has four probabilities, depending on the possible combinations of E1 and E2. And E4 has eight probabilities.

When we come to E5, it can't just continue branching exponentially.It should depend on only three previous events, so we drop the dependency on E1. This is done by abandoning the subtree with K(E1)=0 and continuing branching the subtree with K(E1)=1. This is a reasonable thing to do because we assume that the combination {0, 0, 0} "should never happen" (i.e no more than 3 events in a row would be unknown, otherwise use a higher value of depth), and the other mirroring pairs of branches of {0, b, c} and {1, b, c} would converge to about the same probability value. Thus the events starting from E5 would also have 8 probability values each.

Then when we do the computation on the built table, we can have any K(E) in range [0, 1], so we do an interpolation between the branches. Suppose we're computing E4, and have 8 branches {E1, E2, E3} to interpolate. We do it by adding them up with the following weights:

W{0, 0, 0} = (1-K(E1))*(1-K(E2))*(1-K(E3))
W{0, 0, 1} = (1-K(E1))*(1-K(E2))*K(E3)
W{0, 1, 0} = (1-K(E1))*K(E2)*(1-K(E3))
W{0, 1, 1} = (1-K(E1))*K(E2)*K(E3)
W{1, 0, 0} = K(E1)*(1-K(E2))*(1-K(E3))
W{1, 0, 1} = K(E1)*(1-K(E2))*K(E3)
W{1, 1, 0} = K(E1)*K(E2)*(1-K(E3))
W{1, 1, 1} = K(E1)*K(E2)*K(E3)

Or the other way to look at it is that we start with the weight of 1 for every branch, then for E1 multiply the weight of every branch where K(E1)=0 by the actual 1-K(E1), and where K(E1)=1 by the actual K(E1). Then similarly for E2, multiply the weight of every branch where K(E2)=0 by the actual 1-K(E2), and where K(E2)=1 by the actual K(E2). And so on. The sum of all the weights will always be 1.

The physical meaning of this interpolation is that we split the computation into multiple branches and finally add them up together. Since the computation is additive, we can put them back together right away rather than waiting for the end.

And since I was implementing the partial confidences in the final computation, I've also implemented the support for the partial confidences in training: it's done by logically splitting the training case into two weighted sub-cases with the event values of 0 and 1.

All this is implemented in the script ex26_01run.pl, as a further modification of ex24_01run.pl. Without the extra parameters it works just as ex24_01 does. The new parameter -p sets the number of previous events whose absolute confidence affect the current event. Here is an example of a run with the same training table as before and the same input:

$ perl ex26_01run.pl -n tab24_01.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000in a row 
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

The option -n disables the normalization of the weight of the hypotheses, just as before. The output is the same as ex24_01 produced. In the training data E1 and E2 are duplicates, so the probabilities computed for E2 in the adaboostian table (P(H|E) and P(~H|~E), they are separated by "^") are at 0.5, meaning that it has no effect.

And here is the same training data but with support for one previous unknown event, and the input data having E1 unknown (i.e. C(E1) = 0.5):

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in26_01_01.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333,
+      ,       ,               ,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667,
+      ,       ,               ,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.500000
hyp1    w=1.50000 p=0.50000
hyp2    w=1.50000 p=0.50000
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here E2 and E3 have not one pair of probabilities in the adaboostian table but two pairs, as shown in the branching picture above. The extra branches are shown on the continuation lines that start with a "+". The branches for E2 show that if E1 is known (the second branch), E2 will have no effect as in the previous example, but if E1 is unknown (the first branch), E2's probabilities would be the same as for E1, so E2 would replace it and converge to the same values.

As you can see in the run log, after E1 with confidence 0.5 (i.e. absolute confidence 0) is applied, both hypotheses still have the equal weight. But then after E2 gets applied, the weights and probabilities of hypotheses become the same as in the previous example, thus the knowledge of E2 compensating for the absent knowledge of E1.

An interesting thing to notice is that the probabilities for E3  are the same in either case. This is because this table can handle only one unknown event in a row, which means that at least one of E1 or E2 should be known, and E2 is able to fully compensate for the unknown E1, so both ways converge to the same probabilities of E3.

Now some information about how it all is implemented:

Some of the arrays have gained an extra dimensions, to be able to keep track of the multiple branches.The weight of the training case {wt} has become an array instead of a scalar, and the probabilities {ppos} and {pneg} of the hypotheses became 2-dimensional arrays.

The function compute_abtable() that computes the adaboostean table had gained an extra nested loop that goes through all the branches. As the final step of processing every event it prunes the branches coming from the previous event and then splits each of these branches in two. The support of the partial confidence in the training data is done by replacing the statement

if ($c->{tc}[$i] >= 0.5) {
    $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
} else {
    $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);
} 

with

push @{$c->{wt}}, 
    $c->{tc}[$i] * $oldwt->[$br] / $abhyp{$hyp}->{ppos}[$i][$br]
    + (1. - $c->{tc}[$i]) * $oldwt->[$br] / (1. - $abhyp{$hyp}->{pneg}[$i][$br]);

The function apply_abevent() that does the computation based on the table has been updated to manage the branches during this step. It had gained the extra argument of an array that keeps the history of the absolute confidence of the few previous events. This history is used to compute the weights of the branches:

for (my $br = 0; $br <= $maxbr; $br++) {
    my $w = 1.; 
    for (my $bit = 0; $bit <= $#$evhist; $bit++) {
        if ($br & (1 << $bit)) {
            $w *= $evhist->[$bit];
        } else {
            $w *= 1. - $evhist->[$bit];
        }   
    }   
    push @brwt, $w; 
}   

Then an update based on the current event is computed for each branch, and the weight of the hypothesis is updated per these weighted bracnhes:

my $p = 0.; 

# each branch is weighed per the observed history
for (my $br = 0; $br <= $maxbr; $br++) {
    $p += $brwt[$br] * ( 
        $conf * $hyp->{ppos}[$evi][$br]
        + (1. - $conf) * (1. - $hyp->{pneg}[$evi][$br])
    );  
}   

$hyp->{cwt} *= $p; 

And of course the functions that print the tables and normalize the data have been updated too.

All this stuff works on the small examples. I'm not sure yet if it would work well in the real world, I'd need a bigger example for that, and I don't have one yet. Also, I've noticed one potential issue even on the small examples but more about it in the next installment.

Sunday, June 4, 2017

Neuron in the Bayesian terms, part 3

I want to elaborate on why the second trick is so important. Basically, because in general the necessary condition of the chances changing on an event in the opposite way (by dividing or multiplying by the same number) is not true. It takes a specifically defined event to make it true.

In general if an event is true, the chance of a hypothesis after applying this event is:

Chance(H|E) = P(H/E) / P(~H|E)

The probabilities change on an event as follows:

P(H|E) = P(H) * P(E|H) / P(E)
P(~H|E) = P(~H) * P(E|~H) / P(E)

So then the chance changes (after substituting and reducing the formula):

Chance(H|E) = Chance(H) *   P(E|H) / P(E|~H)

If an event is false, the chance changes in a similar but different way:

Chance(H|~E) = P(H/~E) / P(~H|~E)
= Chance(H) *   P(~E|H) / P(~E|~H) 

Being able to do the "opposites" requires that

Chance(H|E) / Chance(H) = 1/ ( Chance(H|~E) / Chance(H) )

Chance(H|E) / Chance(H) = Chance(H) / Chance(H|~E)


Which then means after doing the substitution:

P(E|H) / P(E|~H) = P(~E|~H) /  P(~E|H)

Which is not always true. But if we take a page from the AdaBoost's book and define an event such that 

P(E|H) = P(~E|~H)
and
P(E|~H) =  P(~E|H)

Then it all magically works. Again, this magic is not needed for the Bayesian computation in general, it's needed only to fit it to the formula used in the neural networks and in AdaBoost.

There are versions of AdaBoost that use the different multipliers for when x[i] < 0 and x[i] > 0. Which means that

l[i]^x[i]

gets replaced with

if (x[i] > 0) {
  lpos[i]^x[i]
} else {
  lneg[i]^x[i]
}

AdaBoost in this situation still uses the same definition of E=(x == y) for the reasons of its internal logic,  but if we don't require this logic then we can define E=(x > 0) and easily implement the chance computation for it, since the requirement of both multipliers being the opposites would disappear.

This opens two possibilities for variations of the neurons:

(1) One with still the definition E=(x==y) but different multipliers for x[i] of different sign.
(2) One with the definition E=(x > 0)

A network with such neurons would obviously take twice the memory space for its trained multipliers (but not for the data being computed). I'm not even sure if it would be trainable at all, especially the version (2), since AdaBoost has good reasons to keep its definition of E for the convergence. But maybe it would, and considering that AdaBoost in this modification converges faster, maybe the training of the neural network would converge faster too. Might be worth a try.

emergents

I've attended a talk about the commercial use of the pre-trained neural networks. The idea is that you take a pre-trained network, throw away the top couple of layers, then put the blank layers on top instead and train only these new layers for your purpose.

The presenter gave an example of taking the Google's general image classifier network and teaching it to recognize your specific images instead: "huggable" vs "non-huggable". It was smart enough to recognize that a knitted cactus is huggable while a real one is not.

Well, it's kind of not surprising: the top layers of the classifier network are trained to recognize the various kinds of images but to do that the layer below it must produce the characteristics that allow to recognize these various kinds of images. If the original classifier has the wide knowledge (and it does), the previous layer would know what is important to recognize a very wide rage of images. They call the output of that previous layer "the emergents". And then when you use them as inputs, the new top layer would have an easy job classifying them for the new goal. You could probably even do a Bayesian classifier on top of them without much difference.

Saturday, June 3, 2017

Neuron in the Bayesian terms, part 2

I've been trying to describe in proper formulas my insight that a neuron in neural networks is a kind of a Bayesian machine(OK, this is apparently known to the people who do the neural networks professionally but it was an insight for me), and I think that I've finally succeeded. Let's start from the beginning:

A neuron implements the function

y = nonlinear(sum(w[i] * x[i]))

Here y is the output of the neuron, x[i] is the array of its inputs, w[i] is the array of some weights determined by training. Let's define the range of y and x[i] as [-1, 1] (if the values are in the other ranges, they can be trivially scaled to the range [-1, 1]  by adjusting the weights w[i]).

The nonlinear function is generally some kind of a sigmoid function that matches what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value. Except of course that this nonlinear function deals not with the probabilities in range [0, 1] but with the values in range [-1, 1], -1 meaning "false" and 1 meaning "true".

The Bayesian formula uses the multiplication, not addition, so the first trick here is that one can be converted into another using a logarithm. To do that let's introduce a new value l[i], of which w[i] will be the logarithm:


l[i] = exp(w[i])
w[i] = ln(l[i])


This can be substituted into the expression for the sum:

sum (w[i] * x[i]) = sum( ln(l[i]) * x[i] )

Using the rule ln(a)*b = ln(a^b), the sum can be converted further:

sum( ln(l[i]) * x[i] ) = sum( ln(l[i]^x[i]) )

Then using the rule ln(a) +ln(b) = ln(a * b) the sum can be converted into a product:

sum( ln(l[i]^x[i]) ) = ln( product( l[i]^x[i] ) )

Now let's see how this product can be connected to the Bayesian computation. The classic Bayesian formula for probabilities is:

 P(H|E) = P(H) * P(E|H) / P(E)

So as each Bayesian event E[i] gets applied, we can say that the final probability will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

Well, technically I maybe should have used another letter instead of i for the index, maybe j, but since the i in both formulas will be soon connected to each other, I'll just use it in both places from the start.

Instead of computing with probabilities we will be computing with chances. In case if you haven't used the chances before nor seen my mentioning of them in the previous posts, a chance is a relation of the probability of the hypothesis being true to the probability of it being false. For a simple example, if we have 5 doors, with prizes behind 2 of them and no prizes behind the remaining 3, then the chance of getting a prize by opening a random door is 2 to 3, or 2/3 (but the probability is 2/5). In a formula this gets expressed as:

Chance(H) = P(H) / P(~H) = P(H) / (1 - P(H))

So if we have two mutually exclusive hypotheses H and ~H, the probabilities for them will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )
P(~H) = P0(~H) * product( P(E[i]|~H) / P(E[i]) )


And the chances will be:


Chance(H) = P(H) / P(~H) 
= ( P0(H) * product( P(E[i]|H) / P(E[i]) ) ) / ( P0(~H) * product( P(E[i]|~H) / P(E[i]) ) )
= (P0(H)/P0(~H)) * product( P(E[i]|H) / P(E[i]|~H) )


If the initial probabilities of both hypotheses are equal, P0(H) = P0(~H) = 0.5, then their relation will be 1 and can be thrown out of the formula:
 
Chance(H) = product( P(E[i]|H) / P(E[i]|~H) )

Well, almost. The consideration above glanced over the question of what do we do if the event is false? The answer is that in this case the factor in the product should use the probabilities of ~E instead of E: 

Chance(H) = product( 
  if (E is true) {
    P(E[i]|H) / P(E[i]|~H)
  } else {
    P(~E[i]|H) / P(~E[i]|~H)
  }
)

Now comes the turn of the second major trick: what kind of events to use for the Bayesian formula. We'll use the same kind of events as for AdaBoost, the event being "x[i] accurately predicts y", so if H is "y > 0" and thus ~H is "y < 0", the conditional probability will be:

P(E[i]|H) = P(y == x[i])
P(~E[i]|~H) = P(y == x[i])

An interesting property of this definition is that the conditional probabilities of these events get computed across the whole range of the training data, without differentiation between x[i] < 0 and x[i] > 0. This means that

P(E|H) = P(~E|~H)
P(E|~H) = P(~E|H)

We then use the general property that

P(E|H) = 1 - P(~E|H)
P(E|H~) = 1 - P(~E|~H)

and get

P(E|~H) = P(~E|H) = 1 - P(E|H)

So after substituting all these equivalencies the computation of the chances becomes nicely symmetrical:

Chance(H) = product( 
  if (E is true) {
    P(E[i]|H) / (1 - P(E[i]|H))
  } else {
    (1 - P(E[i]|H)) / P(E[i]|H)
  }
)

When x[i] is in range [-1, 1], and especially if x[i] is in the set {-1, 1}, the if/else can be replaced by the power of x[i]: when x[i]==1, it will leave the expression as-is, when x==-1, the expression will be "turned over":

Chance(H) = product(  ( P(E[i]|H) / (1 - P(E[i]|H)) )^x[i] )

We can now interpret the original formula in terms of the chances. That formula was:

y = nonlinear(sum(w[i] * x[i]))
= nonlinear( ln( product( l[i]^x[i] ) ) )

Here we can substitute the product:

y = nonlinear( ln( Chance(y > 0) ) )

A logarithm is a very convenient thing to apply on the chance to obtain a symmetric range of values (-infinity, +infinity) centered at 0 instead of [0, +infinity) "centered" at 1. And if the weights are properly selected, the actual range of y will be not (-infinity, +infinity) but [-1, 1].

This substitution of the product will mean:

l[i] = P(E[i]|H) / (1 - P(E[i]|H))
w[i] =  ln( P(E[i]|H) / (1 - P(E[i]|H)) )

So the original weights in the neuron have the "physical meaning" of the logarithms of the chance multipliers.

Wednesday, March 8, 2017

OS support for thread stopping

I've recently had done some thinking about how can an OS make the shutdown of a multithreaded application or its part easier, to stop the threads quickly but relatively nicely on request. I think a very helpful support can be pretty easy to implement:

  • Have system call to request the shutdown of an OS-level thread.
  • This call would make any system calls that could block non-blocking (except for the thread synchronization calls, like mutex or condition variable operations), would interrupt any current such blocking calls, and return an error. This includes the timed sleeps, and even the thread synchronization calls with timeouts would return an error immediately instead of sleeping. Optionally, it might also just fail any file operations altogether, possibly except for close(). Note that the file descriptors as such wouldn't be revoked altogether, they would just kind of stop working in one thread, but the other threads would still work.
  • A problem with this might be in such things as logging: when your thread is stopping, it might still write the useful log files. Revoking these log files would be  Well, one solution for this would be to buffer the logs in memory and write from a separate thread. Another might be to keep some kind of exception list.
  • The threads might also beget threads, and keeping track of all of them is complicated. So there is a need for grouping them, like the process groups in the OS, or Triceps Triead fragments, and then stopping the whole group.
Yet another separate approach might be to say "but if you want the ability to stop something, why don't you run it as a separate process and kill if when needed?". This would make a lot of sense for such things as PAM authentication modules.

There are problems with the separate processes too. But after some thinking, they seem to be fairly easy to solve.

When you think about processes, usually a separate binary comes to mind but it doesn't have to. You can just do a fork() and run from the same binary, and you even get the whole address space copy-on-write out of the box. Though I guess it might be interesting to NOT copy-on-write the whole address space. Instead just allocate some communication area for passing the arguments and results, and get rd of the rest, maybe even re-running the run-time loader from scratch in the new context (that would be kind of like the Java class loader spaces).

Communication with a separate process is kind of painful. But if you map an anonymous memory segment, that segment can be inherited by the child process. Indeed, should be inherited to act as the communication region (truly shared, not COW) even if the rest of address space is dropped. And then you can happily create the mutexes and condition variables in it for synchronization. The arguments and results would have to be serialized to pass through this region but it shouldn't be such a big deal. It would be also possible to keep some large data dictionaries in this communication space.

Of course the callbacks wouldn't work across the process boundary. But there is a solution for this too: create a local thread pool (or simply a thread) for callbacks that would get the requests from the remote process though a portion of the same communications region, execute the callback, and return the results back there.

I think this looks feasible even without any new technology needed, just needs a helper library to implement the interface. This isn't even a new concept, it's a classic microkernel architecture without the common memory space. Yeah, copying the data in the serialized form adds an overhead but it's not that much overhead compared to say JVM, and people use Java and .NET all over the place with no regrets.

Sunday, March 5, 2017

Bayes 25 & AdaBoost: the partial confidence problem

The code in the previous example works only with the full-confidence events, those that have the confidence of 0 or 1. Unfortunately I haven't found a good way to handle the partial confidence yet. I've found a way but it has problems. Let me show you.

Previously I was treating the partial confidence as a superposition of 0 and 1. This approach cannot be properly used here because it doesn't cover the connections between the events. To give an example, in the previous post the code had correctly found that in tab24_01.txt the events E1and E2 are the same, so after E1 is applied, E2 becomes ignored (it's AdaBoost probabilities become 0.5). So now if we then do a computation and find that E1 is unknown (i.e. its confidence is 0.5) but E2 is known, E2 will still be ignored because its table entry says to always ignore it.

The correct way should handle this situation properly: if E1 is unknown, this should bring up the table values for E2 to compensate for it and treat E2 as E1 was formerly.

The obvious way to do it is to treat the partial confidence as a superposition of "the event is unknown" and "the event is known". So if the event is known, we process it as before. If the event is unknown, we would follow the unknown branch and then apply all the following events as if this events wasn't there. And if the event is partially known, we would split the weights of all the outcomes proportionally and handle one part with the known branch and another one with the unknown branch.

So the confidence of 0.7 would mean a superposition of 0.4 of unknown and 0.6 of the known value "1". The confidence 0.1 would mean a superposition of 0.8 of unknown and 0.2 of the known value "0". So if we denote C(E) as the event confidence, K(E) as the known weight in the superposition, and U(E) as the unknown weight in the superposition, we can write them as:

K(E) = abs(C(E) - 0.5) * 2
U(E) = 1 - K(E)

The downside of this obvious approach is that the table and computation becomes more complex. As we go through more events, two branches become four, then eight, then sixteen, and so on. The formerly quadratic complexity O(Noutcomes * Nevents) becomes exponential O(Noutcomes*Nevents*2^Nevents). I don't know yet if there is a better way.

Maybe it would be possible to build a table of "gradients" between the events: i.e. if the confidence of the event A changes, how are the probabilities of another event B affected. The problem though is that this gradient depends on all the events in the sequence between A and B, so tis idea doesn't seem to really fly.

Maybe another idea is workable, that would reduce the exponential cost to a cubic one: after splitting the superposition, instead of following two branches, just take the probabilities for each following event from both branches and mix them by the weight of the branches:

P(H|Ei) = K(Ej) * P(H|KjEi) + U(Ej) * P(H|UjEi)

It's better than exponential but still not great, with the run time O(Noutcomes*Nevents*Nevents). And the table of these pre-computed values would still seem to be exponential, each event depending on all the possible combinations of the previous events.

Wednesday, March 1, 2017

Bayes 24: AdaBoost as a solution to the correlated events

When writing the whitepaper, I've realized that AdaBoost is very good at finding the cross-correlation of its partial hypotheses (that map to the Bayesian events), and it always picks the next step with the minimal cross-correlation. The Bayesian computations do have an issue with cross-correlation, as described in the whitepaper and near the end of Part 8 of my notes: if there are multiple events that duplicate or mirror each other, all of them are treated as independent and drive the probabilities off. This looked like AdaBoost could give a hint of how to solve this problem, so I've started thinking about it, and I think I've found a solution.

In retrospect the solution is pretty obvious: instead of computing the Bayesian model directly, treat it as an AdaBoost model whose set of the partial hypotheses equals to the set of the Bayesian events, with an extra limitation that each one of them must be used exactly once. If there are multiple equivalent events and they happen to go one after another, AdaBoost will treat the first one as real, and the following ones will get their error rate of 0.5 and will be effectively no-ops.

Well, if there are other events interspersed in between these equivalent copies, they won't be no-ops. These other events will shift the weights of the training cases, so that the second an further copies will find themselves drifted away from 0.5. The good news here is that AdaBoost is pretty good at removing the partial correlation too but the bad news is that the order matters, to fully remove the duplication the equivalent events should go one after another. And in general, it looks like the best removal of the cross-correlation happens when the closely related events follow closely one after another. This kind of order can be implemented if we reverse the AdaBoost preference of the partial hypotheses and make it always pick the one that has its training error rate closest to 0.5. Since (unlike the classic AdaBoost) the set of partial hypotheses is limited and each element from it must be used only once, the better ones will eventually be used as well.

To experiment with these ideas, I wrote the code example ex24_01run.pl. It doesn't reorder the events, just computes the AdaBoost values for whatever order it's given, allowing to experiment with the various orders. But it has a couple of advanced features from the later chapters of the book on AdaBoost: it does the multi-class (but single-label) version that also partitions the error rate computation for  the sign of the value of the partial hypothesis.

The multi-class means that it can work not only with two outcomes (a yes/no decision) but with multiple possible outcomes (i.e. Bayesian hypotheses), just like the normal Bayesian table. And just like the normal Bayesian table, these outcomes must be mutually exclusive, so each case may have only one outcome (single-label). The multi-class logic works in the way that is similar to the one described in Part 10: the resulting table is just a combination of the individual per-outcome tables. From the standpoint of each outcome, the single-label property means that the rest of the rows of the table taken together represent the opposite of this outcome, so there is no need to compute that opposite explicitly.

The partitioning of the error rate computation not only makes the result better but also comes very handy to prove that the opposite of one outcome is equivalent to the rest of the rows of the table. The AdaBoost training error e is the ratio of the total weights of the training cases where the value of the event E (that is, the Bayesian event, in AdaBoost terms it's a partial hypothesis) is not equal to the value of the outcome (Bayesian hypothesis) H. And we'll compute it separately for E (i.e. E=1) and ~E (i.e. E=0). It ends up expressable as probabilities that can be computed by summing and dividing weights W of the training cases:

e(H,E) = P(~H|E) = (sum for cases i where H(i)=0 and E=1 (Wi))/(sum for cases j where E=1 (Wj))
e(H,~E) = P(H|~E) = (sum for cases i where H(i)=1 and E=0 (Wi))/(sum for cases j where E=0 (Wj))

So if we pick a particular outcome and denote it H1, we can show that its opposite ~H1 is equivalent to the sum of the rest of hypotheses H2...HN by showing that

P(~H1|E) = sum for i in [2, N] (P(Hi|E))

which after removing the common denominator means

(sum for cases j where H1(E)=0 and E=1 (Wj)) = (sum for i in [2, N] (sum for cases k where Hi(E)=1 and E=1 (Wk))

The single-label property means that each case is marked with exactly one outcome, and all the cases where the outcome H1 is absent must have some other outcome Hi present, so these sums are indeed equal.

So it is all self-consistent. The key part of the computation is:

for my $hyp (values %abhyp) {
        if ($conf) {
            $hyp->{wt} *= $hyp->{ppos}[$evi];
        } else {
            $hyp->{wt} *= (1. - $hyp->{pneg}[$evi]);
        }
    }

And the matching part of the table building is:

foreach my $c (@case) {
        my $hyp = $c->{hyp}[0];
        # this code really handles only the TC values of 0 and 1,
        # for the arbitrary values it would have to do an interpolation
        if ($c->{tc}[$i] >= 0.5) {
            $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
        } else {
            $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);
        }   
    }   

The training undoes the weight change that will be done in the computation.

Now let's try some examples. Here is the example that I used in the whitepaper:

$ perl ex24_01run.pl -n tab24_01.txt in24_01_01.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here the option "-n" means "don't normalize the weights" (or when normalized to the sum of 1 they will be the same as probabilities). The two values printed in the AdaBoostian table separated by a "^" are the P(H|E) and P(~H|~E).

As you can see, the result is not correct either (because of the XOR problem) but it's better than with the plain Bayes version that gave hyp1 the probability 0.667. And since E1 and E2 are the same, E2 ended up with the values of 0.5 in the table, and applying it didn't change the weight.

If we change the order of events, E2 gains some use back:

$ perl ex24_01run.pl -n tab24_02.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E3          ,E2          ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E3             ,E2             ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E2, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455

Rather surprisingly, the result ended up more correct. Or maybe it's not that surprising: after all, AdaBoost is designed to gain more information from available data.

Yet another interesting result get produced from repeating the same pair of events multiple times:

$ perl ex24_01run.pl -n tab24_03.txt in24_03_01.txt 
Original training cases:
!      ,       ,E01         ,E02         ,E11         ,E12         ,E21         ,E22         ,E31         ,E32         ,E41         ,E42         ,E51         ,E52         ,E61         ,E62         ,E71         ,E72         ,E81         ,E82         ,E91         ,E92         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E01            ,E02            ,E11            ,E12            ,E21            ,E22            ,E31            ,E32            ,E41            ,E42            ,E51            ,E52            ,E61            ,E62            ,E71            ,E72            ,E81            ,E82            ,E91            ,E92            ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,0.49694^0.49526,0.49916^0.49946,0.49990^0.49985,0.49997^0.49998,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,0.50306^0.50474,0.50084^0.50054,0.50010^0.50015,0.50003^0.50002,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
--- Applying event E01, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E02, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E11, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455
--- Applying event E12, c=1.000000
hyp1    w=0.18831 p=0.54242
hyp2    w=0.15886 p=0.45758
--- Applying event E21, c=1.000000
hyp1    w=0.09400 p=0.54159
hyp2    w=0.07956 p=0.45841
--- Applying event E22, c=1.000000
hyp1    w=0.04699 p=0.54149
hyp2    w=0.03979 p=0.45851
--- Applying event E31, c=1.000000
hyp1    w=0.02349 p=0.54147
hyp2    w=0.01990 p=0.45853
--- Applying event E32, c=1.000000
hyp1    w=0.01175 p=0.54146
hyp2    w=0.00995 p=0.45854
...

--- Applying event E92, c=1.000000
hyp1    w=0.00000 p=0.54146
hyp2    w=0.00000 p=0.45854

AdaBoost converges pretty quickly to the weights that make both events get the probability 0.5 and stop the further changes.

This seems to work, although it needs more testing. One more limitation of this implementation is that it works only with the events that have the value of either 0 or 1, not anything in between. I'll talk more about this in the next post.

Wednesday, February 22, 2017

a better explanation of Bayes-AdaBoost connection

After some re-reading, my previous posts on the Bayesian interpretation of AdaBoost have been kind of cryptic. It's no wonder because I've been writing them as I was understanding things, so they came out as more of a personal reminder rather than the general explanation. But I've got around and wrote another version as a whitepaper, that goes all the way from scratch and hopefully is much more understandable:

https://sourceforge.net/p/exbayes/code/HEAD/tree/whitepapers/AdaboostBayes.pdf?format=raw

Monday, February 20, 2017

Bayesian examples now on SourceForge

I've been working on one more post on the Bayes-AdaBoost cross-over, and I've finally got tired of pasting the substantial examples right in the post. So I've created a SourceForge project for them and uploaded all the examples there:

https://sourceforge.net/p/exbayes/code/HEAD/tree/v1/

The proper SVN checkout recipe from there is:

svn checkout svn://svn.code.sf.net/p/exbayes/code/ exbayes

Saturday, February 4, 2017

The mythical 10x programmer, or the silver bullet explained

Recently I've been talking about some of the experiences from Aleri times when my interlocutor responded "so I guess you're one of those 10x programmers". This made me pause. Like any programmer per Larry Wall's recipe, I have a good amount of hubris but can I really say that I'm a 10x programmer? After some thinking, I guess sometimes I am and sometimes I'm not. This got me thinking further, about what was different between the situations when I was and when I wasn't. And I think the difference is very much about the organization and methodology, and thus can probably be applied to any programmers to improve their productivity. I've tried to distill the knowledge, and here it goes. By the way, Fred Brooks also called the goal of 10x productivity increase the "silver bullet", so here is maybe your silver bullet. Maybe. I'm not entirely sure that the recipe will work out of the box, it will probably need at least some debugging. Or maybe it won't work but it's still worth a try.

Let me start with describing a couple of cases when I could fairly be said to have been a 10x programmer. There were some other good cases too but on a couple of occasions I've actually had a benchmark.

One case was at Aleri, working on the CEP system. Here is the real short version of the events: At the time there were 3 main players in the CEP world: StreamBase, Aleri and Coral8, all having an approximate parity in their products. Around 2009 Coral8 apparently ran out of money, so Aleri used it this opportunity to consume it. And then it had two approximately equal products and set about merging them. The merging didn't go that great and Aleri ran out of money too, eventually consumed in 2010 by Sybase. See, even though the Coral8 product was about the same from the functional standpoint, it had the code base about 10x the size, and had maybe not quite 10x but a lot more people working on it. Since the Coral8 code base was so much larger and there were many more people familiar with it, it was taken as the foundation of the merged product. And this merging then crashed and burned. Another merging attempt was then done under Sybase, this time doing it the other way around, and it worked much better. So it's an interesting example where the 10x was there, then it wasn't, and then it was again.

Another case was the UnitedLinux. The full history of UnitedLinux is an interesting story of dog-eat-dog in the world of the Linux vendors but I won't go into the full funny details, just touch up on the important parts. One of conditions of the UnitedLinux consortium set by SuSE was that SuSE does all the general development. Everyone else does just their branding customizations and additional packages. So Caldera where I worked at the time (soon to be renamed SCO, with the SCO Linux brand of UnitedLinux) had closed the Linux development office in Germany, with most of the people heading to SuSE. When UnitedLinux was released, the various consortium members (I remember Connectiva, and whoever else) held their release parties and posted pictures with great many people. We didn't have any party at SCO: see, the SCO Linux was done by only two developers and half a tester (and a bit of a graphical artist), and Ron was in California while I was in New Jersey, so having a party would have been difficult. And we've had it ready earlier: the final image was sent to testing about 20 minutes after downloading the drop from SuSE. I think I still have the commemorative t-shirt somewhere, one thing Caldera was big on was the t-shirts.

So what was special in these cases? In a very short summary, I'd say that it was the compact code done with the familiar technologies, incrementally, with low interdependencies, and with high level of script automation. Oh, and you need to have the good engineers as well. But just the good engineers aren't enough: I wouldn't say that the engineers at Coral8 were any worse than at Aleri, just they've got mired in the slow methodologies.

Let's look at the details.

The compact code pretty much means the straightforward and to-the-point code. There is a classic book "The practice of programming" (and I liked it so much that I've patterned the name of my book after it) that shows the code like this: when you have a problem, solve it in the most straightforward way. There are lots of published systems of various code decorations, and all of them are wrong.  The decorations don't make the code better, they make it worse, more difficult to read and modify. The extra layers of abstractions don't help anything either, they make the code more difficult to both understand and modify (not to mention writing it in the first place). There is an article about "worse is better" written by Richard Gabriel from the MIT AI labs when he discovered that the simpler Unix code worked better than his "better" over-abstracted code. But this phrase is misleading, better is better, it's just that the over-abstractions are worse, not better. This absolutely doesn't mean that the code should be flat abstraction-wise. The code that is too flat is also an unmaintainable spaghetti mess. The abstraction layers are needed. But they have to be introduced for a good reason, not on a whim and not by the bureaucratic rules. The good functions are usually pretty large. The good classes are usually pretty large. The good rule of thumb for making a chunk of code into a new abstraction is to notice when you do the same thing over and over again. Then it's time to put this code into one place and call it from the other places. A good abstraction reduces the size of code, does not increase it.

The problem with the overly-abstracted code is really the same as with the  under-abstracted flat code: there are only so many things you can keep in your head at a time. In the code that is too flat you need to keep too many interdependencies of that flat level in your head. In the code that is too abstracted you need to keep too many abstractions and their APIs and calling patterns in your head. The optimal spot is in the middle.

A lot of things are easier to just do than to use a library. If you do the logging from a shell script, a 10-line shell function will not only work much better than a shell version of log4j but will also be a lot easier to call (not to mention avoiding an extra dependency). Or an iteration over a container with a plain for-loop is not only easier to write and understand but also much easier to remember than the STL templates for that.

An important thing is to keep the documentation close to the code. This means both the good comments in the code and the user documentation.

The basic rule behind the good comments is very simple: the comments must be one (or sometimes more) step more abstract than the code. A comment should be explaining either why something is done (or not done), especially if the reasoning is not obvious, or telling what is done as a short summary in the higher-level terms. As long as this rule is followed, the more comments is pretty much the better (but there is only so much you can write without breaking this rule).

Keeping the user documentation together with the code, in the same versioning system and in the same tree hierarchy is very important. This way the documentation can be updated in parallel with the code and will never become obsolete. And at the end of the release cycle all the changes can be found and re-edited. This principle implies that the documentation must be kept in a text-based format that is easy to diff, just as the code is, and that the tools used on it must not change that exact text format at random. Which generally means that the editing should be done with a normal text editor, and the WYSIWYG tools should go into the garbage bucket. The pain inflicted by them is much greater than the gain. The ease of versioning and diffing is much more important than the ease of visual formatting. While on the subject of documentation, the technical writers need to either really be technical or be kept in check. The precise meaning of documentation is important, and I've seen too often an edit to make a sentence smoother also making it ambiguous or entirely meaningless. And "smoother" is not always "better" either. I've bought a book of Richard Feynman's letters, more for the collection purposes since the letters are usually fairly boring but unexpectedly it also turned out to be a great reading. Except for the part where it included the magazine articles about his Nobel prize. The style of these articles was completely different from Feynman's, it was the usual unreadable journalist drivel. Don't turn your documentation into the journalist drivel.

To give another example of where I think smooth is bad, the sentence above "An important thing is to keep the documentation close to the code" is better in that place than "Keeping the documentation close to the code is important". The complex sentence slows you down and gives you a breather before the text switches to a new subject, emphasizing that switch. Also, it brings the word "important" closer to the start of the sentence where it's more noticeable.

Next principle, the familiar technologies, affects the speed of development a lot. It's kind of obvious but the counter-point is "if we use that latest and greatest new tool, we'll be done much faster". And sometimes the latest and greatest tool really does that. But most of the time the familiarity is much more important. If you have an idea of how the goal can be straightforwardly achieved with the familiar tools, that's usually the best bet by a wide margin. If not, learn the new tool. Sometimes this would also give you an idea of how the goal can be achieved easier with your old tools. And a lot of technologies are just crap anyway. So the overall priority should be not "the goal for the tool" but "the tool for the goal".

This doesn't mean that the new tools shouldn't be learned or that everyone has to be familiar with every tool to be used. Usually one person having the familiarity with a tool is enough to frame the code around it, giving everyone else a big head start in learning it. It's much easier to add code to an existing unfamiliar framework than to build a new framework with an unfamiliar tool. And there are the times to use the familiar tools and times to learn the new ones. There is the cadence of development when the things start amorphous and slow, and then as the solutions for various items start falling together the project starts crystallizing and picking up the pace until it rolls at full steam to the goal. Then the next cycle starts. The slow part of the cycle is a good time to learn the new tools. But things tend to go a lot faster when only one thing is new: a new tool for an old purpose or an old tool for a new purpose. Of course sometimes a bullet has to be bitten and a new tool learned together with the new purpose but that goes slower.


On to the incrementality. It's nothing new, Eric Raymond wrote about it in "The Cathedral and the Bazaar". But it still gets lost surprisingly often, even in the "agile methodologies". Instead of "we'll make this great new thing to replace that dingy old thing" the right approach is "we'll transform this dingy old thing into this great new thing". The programs are not physical objects, they're more like blueprints for the physical objects. It's not like you have to raze a bridge to build a new bridge in its place. Instead a program's run is more like a physical object. Every time you restart a program an old bridge gets razed and a new one gets built for you according to the new plan. Or I guess another good example from the world of the physical things is the automobile manufacturing: the car models might stay but their internal mechanics is updated pretty regularly. At some point the cars would look the same but the new cars from that point on would get say a better alternator. The Chevy small block V8 engine has been around for more than 50 years. The only item that stayed for all this time is the dimensions of the cylinders. Everything else has been replaced. But it hasn't been replaced in one go, it has been a product of a long evolution with many gradual small changes.

Every time you hear about an order of magnitude budget and time overruns, it's when someone tried to re-do a program from scratch. Fred Brooks talks about this in his "Mythical man-month", so the effect has been known since at least 1970s, and yet it keeps being repeated. And guess what, that's not limited to software either. Have you ever heard of Tucker? It was a new automobile company in 1940s that decided to design its car completely from scratch, with all the newest and greatest technology. They've never got the cars working, they kept falling apart until the company ran out of money, and its founder got sued by the government for all the dubious financial activities he'd done to delay the running out of money. Here someone might ask "but what about the iPhone, what about Tesla?" They are actually the examples of the evolutionary development. iPhone was built around the clever idea of a big-screen phone used as an application platform but it wasn't developed all at once, and Apple didn't try to reinvent the cell communications chipsets at the same time. Some of the evolutionary moments of iPhone are widely known: such as, they started with a stylus as was then typical, and then Steve Jobs said "screw this, use the fingers instead". And the last-moment decision of replacing the plastic screens with glass had been a subject of many moralizing articles. Tesla is best seen in the contrast to GM's EV1. GM tried to develop the whole technology chain from scratch, spent a huge amount of time and money, and failed. While Tesla piggybacked on the already existing technologies and packaged them together into a car (that they also took off the Lotus's shelf). Even then Tesla's path wasn't easy, they've had their tuckeresque moments with the failing transmissions (which they eventually simply got rid of) and with the battery cooling for a few years before they were able to release their first car.

One of the worst statements I've heard about why a system needed to be rewritten from scratch went like this: "it has accumulated all these little features that make making changes without breaking them difficult, so we have to re-write everything to get rid of them and make the changes simple again". It completely misses the point: it's these little features that make the product worthwhile. Doing the 80% of the functionality is easy but nobody needs only 80% of the functionality (and in reality that particular project overran by years). The code is a living document that preserves the knowledge of the subject area. Discarding the code means discarding this knowledge which then has to be re-learned. This doesn't mean that the best solution isn't sometimes to throw away the old code and re-write it from scratch. But it's usually done by replacing the code in small chunks and checking that the functionality is preserved after the move of every chunk. To give an analogy from biology, the human (and animal) organs develop by growing the new cells within a scaffolding of the other neighboring cells that shapes them and makes them properly differentiate for the function. The old code provides such a scaffolding for the new chunks.

I want to illustrate this last point with a couple of examples. I'll leave the company and all the details in the first example unnamed, since the example is kind of embarassing.  This company decided to make an appliance with its software. It had all the software bits and pieces that it has been selling to the customers, so the only thing that needed to be done was put it all together, like the customers do. Simple, right? Turns out, no, the bits and pieces would not work together, and no customer had actually put them all together because they didn't work. Each piece worked in a way, so each responsible team was happy, but not together. And getting them to work together took a large amount of time and work. Another similar example comes from Microsoft where I've seen a very similar issue with the PowerShell support in the Windows clusters: there are many components involved and each one has its own PowerShell commandlets for management but they don't mesh well together with each other, need to be combined in the very non-obvious ways with no big goal-oriented commandlets, and some things turn out to be not really doable at all.

The obvious objection is "but how do we test all these changes"? This is where the automated testing comes in, and it's such a big subject that I wrote a whole separate post about that. But it's really a bogus question. When you write a completely new system,you still test each chunk of code as you write it, and if you don't have the automated testing then you just test it manually. The difference is that in an incomplete system your chunk will have only the limited interactions, so you would be able to test only some of its effects. In a complete system you get a much more complete coverage. And if the old system doesn't have any automated tests then it's a good way to start them: for each chunk write the tests before replacing it, and then use them to check that the the replaced chunk still works in the same way. Another important point, a chunk doesn't have to be all in one place, it might be distributed, such as an one-line change in a thousand files.

The small changes don't have to go together with the frequent releases. But every time you do a release one way or another, you get the issue "we have to complete all the changes before the release!" and it gets worse if you want to do the more frequent releases. Well, you definitely do want to make sure that the things don't get accidentally broken at the last moment, so the commits have to slow down just before the release. But it doesn't mean that you can't include the changes at all. Any change is fine to include as long as it doesn't break the existing functionality. The new functionality doesn't have to work, as long as it doesn't get called in the production environment nobody cares about that, it only has to not break the old functionality. This way you can integrate the changes a lot earlier, and then spend a lot less effort merging them together.

In the post on testing I've already said that an important part of the incrementality is the incremental tests, being ability to run easily the immediate subset of tests affected by a change. But a prerequisite for it is the incrementality of the build: the ability to build quickly all the components dependent on the change, all the way to the final package, while skipping the build of the unchanged parts of the tree. "Quickly" here means right away, not in some kind of a nightly build. This does wonders for the turnaround time. Obviously, in some cases, like a fundamental library, pretty much all of the tree will depend on it. But then the incrementality should be interpreted as the ability to build a few of the representative projects based on it to test that they didn't break. If the full nightly test shows that some dependent components got broken, the ability to rebuild them quickly also makes wonders for the turnaround of the investigation.

Another thing about the build is that I think the purely rules-based builds are bad. It's not that having the common rules is bad, they're actually quite convenient. But the build system must also allow to do the quick one-offs.  These one-offs are very important for the automation of the build. They allow to write quickly say a small Perl script that would do some kind of transformation or code generation. A build system without such one-offs turns the hour-long tasks into the week-long ones. So just please do support the full proper make and the scripting environment, not some crap like Cmake. The counter-argument is of course "but we need to build on multiple platforms including Windows". Yeah, building on Windows sucks, and the only decent solution I see it to bring a decent environment of the portable tools to each of your build platforms.

Another very simple but important thing about the build is that it must detect the errors. The Java-based tools in particular are notorious for ignoring the status codes, so when something breaks, the only way you know it is by the result files being absent. Or not even by them being absent but them failing in the tests. This is bad. All the errors have to be detected as early as possible.

This looks like a good place for another related rant, on reliability. Basically, when something works it should work reliably. Einstein said "The definition of insanity is doing the same thing over and over again and expecting different results." But in reality we don't control everything, and when these hidden changes cause a different result of the same actions, it's just maddening and time-consuming. The good code, including the tests and build, should work reliably. If the underlying systems are known to have outages, it should retry and still work reliably on top of them and not propagate this unreliability up. Every step the unreliability propagates makes it more difficult to fix, so it's best nipped in the bud. There of course are the reasonable limits too, the calls should not get stuck retrying forever, and probably not ever for an hour. Although retrying for an hour might be fine for a scheduled nightly run, but for a manual run it's way too long. And when a program is retrying it should not be silent, it should be telling what its is doing.

Next item is the low interdependencies. Basically, it means that the people should be able to work without stepping too much on each other. You might ask: but doesn't the principle of the small code base go against it? More code should allow more people to work on its part, right? No, it doesn't work like this. The catch is that if you have more code solving the same problem, any changes to it mean that you have to change about the same percentage of the code. This is why a sprawling code base is slow in development, because any change requires changing a lot of code. But this is also why it doesn't allow more people to work on it in parallel. So the corollary is that if you add too many people to a project, they will start stepping on each other a lot and the productivity will drop.

The other major kind of interaction is between the people in general. If you need to do a 5-minute task, it takes you 5 minutes. But if you need someone else to do a 5-minute task, it's likely to take you a day of interaction. Which still might be faster than learning how to do it yourself but if the task is repetitive then this becomes a major drag. To work fast, this drag has to be reduced and eliminated whenever possible. One of the consequences is that the posessive code ownership is bad. Now, having someone knowledgeable available to talk about a part of code is good, and it also helps to discuss the general directions of the code to avoid breaking things for other people. But there should not be a single-person or single-team bottleneck for changes to a certain part of code, nor any single person controlling what is allowed to go in. There are other reasons to avoid these things too but even the speed of development should be a sufficient reason. Another corollary is that the strict code style requirements are bad. The small style variations just don't matter, and the enforcement is a lot of pain with no gain whatsoever.

Another case where the excessive interaction shows is planning. Have you seen these beautiful Gantt charts and critical path diagrams, with the carefully planned moments of interdependencies? Well, they're all garbage. They might be an interesting exercise on the way to the better understanding of the project but things never work out like this in reality. In reality all the interdependency points end up as a mess. So there is no point in overplanning. Instead I really like the saying of my past manager: "We can plan an aggressive schedule because we don't mind it slipping once in a while". When combined with the idea of the incrementality, this points to the following concept: the detail level of the plans must be tied to their distance. A detailed planning of the far-off items doesn't make sense because they are sure to change before we get there. These items should be on the list, so that we can keep track of the direction, but only in the general terms. Only the very near-term items need to be planned in detail, and the farther in the future the less detail is needed. Though the very near-term items have their Heisenbergian uncertainty as well. In my experience the minimal unit of time for planning is a week. If you have five items that you expect to take one day each, they will never work out like this. Some of them will take an hour and some will take three days. But if you lump all five together, they will take together a week with a decent amount of certainty.

Since I've already talked about the scripted automation throughout the text, this is a wrap for this post. The things I've described look kind of small but they really are important, and I believe that taken together the do amount to the 10x difference in productivity. What do you think?

P.S. Someone else's thoughts on the subject of 10x: http://antirez.com/news/112


Friday, January 6, 2017

Bayesian networks & neural networks

I've been recently reading about the neural networks. By the way, here are some introductory links:

I can most highly recommend this document that explains in a very clear and simple way how the training of the neural networks through the backpropagation works:
http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf

A simple introductory series of 6 (and maybe growing to more) articles starting with this one:
https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.hxsoanuo2

Some other links:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://colah.github.io/posts/2015-01-Visualizing-Representations/
http://mmlind.github.io/Deep_Neural_Network_for_MNIST_Handwriting_Recognition/

Also, the apparently the father of the deep neural networks is G.E. Hinton, and you may also want to search for the articles by Harry Shum. Hinton's home page is:
https://www.cs.toronto.edu/~hinton/
that seems to have a bunch of the links to his courses but I haven't looked at them yet.


As you can see from the introductory reading, each neuron in a neural network is a pretty simple machine: it takes some input values, multiples them by some coefficients, adds the results up, and then passes the result through a nonlinear (usually some kind of a sigmoid) function. The whole thing can be written as an expression:

result = nonlinear( sum [for inputs i] (inputi * Ci) )

The nonlinear part is pretty much what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value.

The sum part is pretty much the same as the sum done in AdaBoost. AdaBoost does the sum of logarithms. And I've shown in the previous posts that this sum can be converted to a logarithm of a product, and then the product can be seen as a Bayesian computation expressed as chances, and the logarithm being a part of the decision-making process that converts the resulting chance value to a positive-or-negative value. So we can apply the same approach to the sum in the neuron, say that the values it sums up are logarithms, convert it to the logarithm of a product, make the logarithm a part of the final nonlinear function, and then the remaining product can be seen as a Bayesian computation on chances.

This pretty much means that a neuron can be seen as a Bayesian machine.

And guess what, apparently there is also such a thing as a Bayesian network. There people take multiple Bayesian machines, and connect the results of one set of machines as the input events to the Bayesian machines of the next level. For all I can tell, the major benefit of the handling of the problem when some hypotheses are indicated by a XOR of the input events, similarly to the splitting of the hypotheses into two and then merging them afterwards like I've shown before but instead of the external arithmetics the logic being folded into the Bayesian computation of the second level.

But if a neuron can be seen as a Bayesian machine then the neural networks can also be seen as the Bayesian networks! The only difference being that in the Bayesian networks the first-level (and in general intermediate-level) hypotheses are hand-picked while the neural networks find these intermediate hypotheses on their own during the training.

I wonder if this is something widely known, or not known, or known to be wrong?

Monday, December 26, 2016

On testing

I've started writing another post (hope to complete it somewhere soon), and as one of the offshoots it brought my thoughts to the automated testing. The automated testing had been a great new thing, then not so new and kind of a humbug thing, then "instead of testing we'll just run a canary in production", then maybe it's doing a bit of a come-back. I think the fundamental reason for automated testing dropping from the great new thing is (besides the natural cycle of newer greater things) is that there are multiple ways to do it and a lot of these ways being wrong. Let me elaborate.

I think that the most important and most often missed point about the automated testing is this: Tests are a development tool. The right tests do increase the quality of the product but first and foremost they increase the productivity of the developers while doing so.

A popular analogy of the programs is that a program is like a house, and writing a program is like building a house. I'd say not, this analogy is wrong and misleading. A program is not like a house, it's like a drawing or a blueprint of a house. A house is an analogy of what a program produces.

The engineering drawings are not done much with the pencil and paper nowadays, they're usually done in a CAD system. A CAD system not only records the drawing but also allows to model the items on the drawing and test that they will perform adequately in reality before they're brought to reality. The automated tests are the CAD for programs.

The CADs for drawings require entering the extra information to be able to do their modeling. So do the automated tests. It's an overhead but if done right this overhead is an acceptable one, bringing more benefits than overhead.

The first obvious benefit of the automated tests is that they make the programs robust. To bring the house analogy again, like a house of bricks, not a house of cards. You might remember the saying about "If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization". Not with the automated tests, not any more. With the automated tests you don't get into the situation "I've just changed one little thing and it all suddenly fell apart" any more.

But that's again just the reliability. Where is the productivity coming from? The major benefit for the productivity is that you don't need to do so much analysis any more. You can always change things quickly and see what happens in a test environment. This is also very useful for tracing the dependencies: if you change something and the product breaks, you see where it breaks. This is a major, major item that makes the programming much more tractable and turns the devilishly complicated modifications into the reasonably straightforward ones.

The programming tasks tend to come in two varieties: The first variety is the straightforward one: you just sit and write what the program needs to be doing, or even better copy a chunk of the existing code and modify it to do the new thing you need. The second variety happens when the thing you need to do hits the limitations of the infrastructure you have. It usually has some kind of a paradox in it, when you need to do this and that but if you do this then that falls apart and vice versa.

When you have this second kind of a task on your hands, you need to stretch your infrastructure to support both this and that at the same time. And this is the situation when things tend to get broken. I really like to solve such problems in two steps:

1. Stretch the infrastructure to support the new functionality but don't add the new functionality.
2. Add the new functionality.

At the end of step 1 the program works differently inside but in exactly the same way on the outside. Having the automated tests, they all still pass in exactly the same way as before. And only after the second step there will be the new functionality that will require the new tests and/or modification of the old tests.

But as I've said before, not all the tests are useful. I have formulated the following principles of the useful tests:

  • Easily discoverable.
  • Fast running. 
  • Fast starting.
  • Easy to run both by the developer and in an automated scheduler.
  • As much end-to-end as possible.
  • Test support built into the product and visible as an external API.

What do these things mean? Let's look in detail.

Easily discoverable means that the tests must live somewhere next to the code. If I change a line of code, I must be able to find the most relevant tests easily and run them. It's an easy mantra: changed something, run the tests on it. And doing so must be easy, even if the code is not familiar to the developer.

Fast running means that the tests should try not to take too much time. This comes partially to trying to make the tests fast and partially to the ability to run only the relevant subsets of the tests. The way of work here is: changed a few lines, run the basic tests, change a few more lines, run the tests again. The basic subset for some lines should be easy to specify and take seconds, or at worst a few minutes to run. This is sometimes quite complicated when testing the handling of the time-based events. For time-based events it's very good to have a concept of the application time that can be consistently accelerated at will. Another typical pitfall is the situation "I've requested to do this, how do I find out if it's done yet, before running the next step of the test?". This is something that is easy to resolve by adding the proper asynchronous notifications to the code, and making everyone's life easier along the way.  The fast running concept is also useful for debugging of the failed tests: how long does it take to re-run the test and reproduce the problem or verify the fix?

Fast starting is related to the fast running but is a bit different. It has to do with the initialization. The initialization should be fast. If you're running a large test suite of ten thousand tests, you can afford to have a lengthy initialization up front. If you run one small test, you can't have this lengthy initialization every time. If you really need this initialization, there must be a way to do it once and then repeatedly re-run the individual test.

The other aspect of the fast starting is that the programmer must be able to run the tests directly from the development environment. The repeating cycle is write-compile-test. The tests must directly and automatically consume the newly compiled code. There must not be any need to commit the code nor to run it through some kind of an official build.

This partially overlaps with the next requirement: the tests being easy to run both directly by the programmer and in an automated system that goes together with the automated official build. The need to run the tests directly comes from the need for the efficiency of development and from the use of the tests as a CAD. There shouldn't be any special hoops to jump through for including the tests into the official builds either, it should just work once checked in. This is another place where keeping the tests next to the code comes in: both the tests and the code must be a part of the same versioning system and must come in as a single commit. Well, not necessary literally as a single commit, personally I like to do a lot of small partial commits on my personal branch as I write the code and tests, but when it comes up into the official branch, the code of the program and of the tests must come together.

The end-to-end part is very important from both the standpoint of the cost of the tests and of their usefulness. It needs a good amount of elaboration. There had been much widespread love professed for the unit tests, and then also as much disillusionment. I think the reason for this is that the unit tests as they're often understood are basically crap. People try to test each function, and this tends to both require a lot of code and be fairly useless, as many functions just don't have enough substance to test. Mind you, there are exceptions: if a function does something complicated or is a part of a public API, it really should have some tests. But if a function is straightforward, there is no point in looking at it in isolation. Any issues will come up anyway as a part of a bigger test of the code that uses this function.

The worst possible thing is the use of the mocks: those result in just the tautologies when the same thing is written twice and compared. These "tests" test only that the function goes through the supposedly-right motions, not that these motions are actually right and produce the proper result. This produces many horrible failures on deployment to production. A real test must check that the result is correct. If some API you use changes under it and creaks the result, a proper test will detect it. If you really have to use mocks, you must use them at least one level down. I.e. if you're testing a function in API1 that calls API2 that calls API3, you might sometimes get something useful by mocking the functions in API3 but never mock the functions in API2.

Another benefit of the integrated testing is that it often turns up bugs where you didn't expect to. The real programs are full of complex interactions, and the more you exercise these interactions, the better are your tests.

So my concept of an unit test is an "integrated unit test": a test for one feature of an externally visible API. Note that there might be some complex internal components that should be seen as a layer of API and tested accordingly, especially if they are reused in multiple places. But the general rule is the same. This both increases the quality and cuts down on the amount of the tautology code.And it also facilitates the stretch-then-extend model I've described above: if a test tests what the API call does, not how it does that then you can change the underlying implementation without any change to the tests, and verify that all the tests still pass, your external API hasn't changed. This is a very important and fundamental ability of tests-as-CAD, if your tests don't have it then they are outright crap.

Some readers might now say: but doesn't that end-to-end approach contradict the requirement of the tests being fast? Doesn't this integration include the lengthy initialization? When done right, no, it doesn't. The answer comes two-fold: First, when the tests are more integrated, there are fewer of them, and due to the integration they cover more of the internal interactions. Second, this forces you to make the initialization not lengthy. It might require  a little extra work but it's worth it, and your customers will thank you for that.

This brings us to the last point: when you've built the support of the testing into your product, keep it there for the public releases and make it publicly available. This will make the life of your customers a lot easier, allowing them to easily build their own tests on top of your infrastructure. Instead of writing the stupid mocks, it's much better to have a side-door in the underlying APIs that would make them return the values you need to test your corner cases, and do it internally consistent with the rest of their state. I.e. if you mock some kind of an underlying error, how do you know that you mock it right, in the same way as the underlying API would manifest it? You don't. But if an underlying API has a way to ask it to simulate this error, it would simulate properly, as it does in the "real life".

Some people might now ask: But what about security? But what about performance?

As far as the security goes, security-through-obscurity is a bad idea anyway. Obviously, don't give the test access to the random entities, have a separate "access panel" in your API for the tests that would only allow the authorized connections. And if your product is a library then the security-through-obscurity is a REAL bad idea, and there are no random entities to start with. Make things accessible to your user. There is no good reason for a class to have any private members other than for the non-existing components. The most protection a class might ever need is protected. The same, there is no good reason for a class to have any final members. And if you're worried that someone will use the testing part of the API to do something that the main API doesn't allow then the solution is simple: make the main API allow it, otherwise you're crippling it intentionally.

As far as the performance goes, the overhead is usually pretty low. Note that you don't need to embed the test support all over the place, only at the substantial API layers. These API layers usually deal with the larger-scale concepts, not the ones you'd find at the bottom of a triple-nested loop. Sometimes there are exceptions to this rule but nothing that can't be resolved in some way, and the result after resolution is always better than before.

Hope that I've convinced you that the right testing infrastructure makes a world of difference in the software development, not only improves the quality but also makes it faster and cheaper. With the right tests you'll never get stuck with "it's working somehow, don't touch it" or "we will need to study for a year before making that change".  Or to use again the building analogy (yeah, I've just decried it as wrong but I'll use it differently), the tests are not like the scaffolding on a building that you discard after the construction is completed, they are like the maintenance passages and hatches that keep the building in an inhabitable condition throughout its life.