Monday, July 10, 2017

Bayes 26 & AdaBoost: more on the partial confidence

I think I've found at least a partial solution to the problem of partial confidence that avoids going exponential.

It grows right out of the problem definition: the problem really exists only for the closely related events. If two events are unrelated, then the confidence of one wouldn't affect the other in any noticeable way. But for the related events, the first one takes up the bulk of the effect, and then the effect of the following events becomes greatly diminished. If two events are actually the same (like E1 and E2 in tab24_01.txt in the part 24), the second event gets stripped of any effect whatsoever. But if at the execution time we find that we have no information about the first event and good information about the second event, it's too late to make use of the second event because its probabilities are already hardcoded in the table and condemn it to having no effect. But if we could change the probability at run time then we could use the second event instead. Better yet, the probabilities of the following events would be unaffected because they don't care which one of the duplicate events provided the information. For a sequence of not equal but closely correlated events, if the first one is excluded from the table preparation, the others still seem to converge in their effects to something close to the original sequence, and the following events don't get affected that much.

Since the logic in the part 24 assumes that the closely-related events would be placed in the sequence next to each other, the solution might be to compute an event's table entry in multiple versions, depending on whether the values of a limited number of the previous events are known or not. It would still be exponential for this number of previous events. But since we expect a fast convergence and hope that not too many event values will be unknown, this number of previous events can be small, and the exponent of it would be reasonably small too.

By the way, I've come up with a name for the K(E) described in the previous part: the "absolute confidence". The confidence C(E) has the range from 0 showing that we're completely sure  that the event is false, though 0.5 showing that we have no idea, to 1 showing that the event is true. The absolute confidence

K(E) = abs(2*C(E)  - 1)

has the range from 0 showing that we have no idea to 1 showing that we're completely sure that the event is true or false.

I had to draw a picture to figure it out how it would work. Suppose that we want to take 3 previous events into the consideration. The datapoints with the probability values for these events can then be seen as nodes of a tree, with the nodes for each event aligned vertically under the event name:

E1         E2       E3       E4        E5
                      +------o
                    0/
              +-----o
            0/      1\
            /         +------o
      +----o
     /      \         +------o
   0/       1\      0/
   /          +-----o
  /                 1\
 /                    +------o 
o                              
 \                            0+-------o
  \                   +------o<
   \                0/        1+-------o
   1\          +----o
     \        /     1\        0+-------o
      \     0/        +------o<
       \    /                 1+-------o
        +--o                  
            \                 0+-------o
            1\        +------o<
              \     0/        1+-------o
               +----o
                    1\        0+-------o
                      +------o<
                              1+-------o

E1 still has only one probability value because it has no preceding events. But E2 has two possible probabilities, depending on whether E1 was known (K(E1)=0) or unknown (K(E1)=1). E3 has four probabilities, depending on the possible combinations of E1 and E2. And E4 has eight probabilities.

When we come to E5, it can't just continue branching exponentially.It should depend on only three previous events, so we drop the dependency on E1. This is done by abandoning the subtree with K(E1)=0 and continuing branching the subtree with K(E1)=1. This is a reasonable thing to do because we assume that the combination {0, 0, 0} "should never happen" (i.e no more than 3 events in a row would be unknown, otherwise use a higher value of depth), and the other mirroring pairs of branches of {0, b, c} and {1, b, c} would converge to about the same probability value. Thus the events starting from E5 would also have 8 probability values each.

Then when we do the computation on the built table, we can have any K(E) in range [0, 1], so we do an interpolation between the branches. Suppose we're computing E4, and have 8 branches {E1, E2, E3} to interpolate. We do it by adding them up with the following weights:

W{0, 0, 0} = (1-K(E1))*(1-K(E2))*(1-K(E3))
W{0, 0, 1} = (1-K(E1))*(1-K(E2))*K(E3)
W{0, 1, 0} = (1-K(E1))*K(E2)*(1-K(E3))
W{0, 1, 1} = (1-K(E1))*K(E2)*K(E3)
W{1, 0, 0} = K(E1)*(1-K(E2))*(1-K(E3))
W{1, 0, 1} = K(E1)*(1-K(E2))*K(E3)
W{1, 1, 0} = K(E1)*K(E2)*(1-K(E3))
W{1, 1, 1} = K(E1)*K(E2)*K(E3)

Or the other way to look at it is that we start with the weight of 1 for every branch, then for E1 multiply the weight of every branch where K(E1)=0 by the actual 1-K(E1), and where K(E1)=1 by the actual K(E1). Then similarly for E2, multiply the weight of every branch where K(E2)=0 by the actual 1-K(E2), and where K(E2)=1 by the actual K(E2). And so on. The sum of all the weights will always be 1.

The physical meaning of this interpolation is that we split the computation into multiple branches and finally add them up together. Since the computation is additive, we can put them back together right away rather than waiting for the end.

And since I was implementing the partial confidences in the final computation, I've also implemented the support for the partial confidences in training: it's done by logically splitting the training case into two weighted sub-cases with the event values of 0 and 1.

All this is implemented in the script ex26_01run.pl, as a further modification of ex24_01run.pl. Without the extra parameters it works just as ex24_01 does. The new parameter -p sets the number of previous events whose absolute confidence affect the current event. Here is an example of a run with the same training table as before and the same input:

$ perl ex26_01run.pl -n tab24_01.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000in a row 
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

The option -n disables the normalization of the weight of the hypotheses, just as before. The output is the same as ex24_01 produced. In the training data E1 and E2 are duplicates, so the probabilities computed for E2 in the adaboostian table (P(H|E) and P(~H|~E), they are separated by "^") are at 0.5, meaning that it has no effect.

And here is the same training data but with support for one previous unknown event, and the input data having E1 unknown (i.e. C(E1) = 0.5):

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in26_01_01.txt 
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333,
+      ,       ,               ,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667,
+      ,       ,               ,0.50000^0.50000,0.60000^0.66667,
--- Initial weights
hyp1    w=3.00000 p=0.50000
hyp2    w=3.00000 p=0.50000
--- Applying event E1, c=0.500000
hyp1    w=1.50000 p=0.50000
hyp2    w=1.50000 p=0.50000
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here E2 and E3 have not one pair of probabilities in the adaboostian table but two pairs, as shown in the branching picture above. The extra branches are shown on the continuation lines that start with a "+". The branches for E2 show that if E1 is known (the second branch), E2 will have no effect as in the previous example, but if E1 is unknown (the first branch), E2's probabilities would be the same as for E1, so E2 would replace it and converge to the same values.

As you can see in the run log, after E1 with confidence 0.5 (i.e. absolute confidence 0) is applied, both hypotheses still have the equal weight. But then after E2 gets applied, the weights and probabilities of hypotheses become the same as in the previous example, thus the knowledge of E2 compensating for the absent knowledge of E1.

An interesting thing to notice is that the probabilities for E3  are the same in either case. This is because this table can handle only one unknown event in a row, which means that at least one of E1 or E2 should be known, and E2 is able to fully compensate for the unknown E1, so both ways converge to the same probabilities of E3.

Now some information about how it all is implemented:

Some of the arrays have gained an extra dimensions, to be able to keep track of the multiple branches.The weight of the training case {wt} has become an array instead of a scalar, and the probabilities {ppos} and {pneg} of the hypotheses became 2-dimensional arrays.

The function compute_abtable() that computes the adaboostean table had gained an extra nested loop that goes through all the branches. As the final step of processing every event it prunes the branches coming from the previous event and then splits each of these branches in two. The support of the partial confidence in the training data is done by replacing the statement

if ($c->{tc}[$i] >= 0.5) {
    $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
} else {
    $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);
} 

with

push @{$c->{wt}}, 
    $c->{tc}[$i] * $oldwt->[$br] / $abhyp{$hyp}->{ppos}[$i][$br]
    + (1. - $c->{tc}[$i]) * $oldwt->[$br] / (1. - $abhyp{$hyp}->{pneg}[$i][$br]);

The function apply_abevent() that does the computation based on the table has been updated to manage the branches during this step. It had gained the extra argument of an array that keeps the history of the absolute confidence of the few previous events. This history is used to compute the weights of the branches:

for (my $br = 0; $br <= $maxbr; $br++) {
    my $w = 1.; 
    for (my $bit = 0; $bit <= $#$evhist; $bit++) {
        if ($br & (1 << $bit)) {
            $w *= $evhist->[$bit];
        } else {
            $w *= 1. - $evhist->[$bit];
        }   
    }   
    push @brwt, $w; 
}   

Then an update based on the current event is computed for each branch, and the weight of the hypothesis is updated per these weighted bracnhes:

my $p = 0.; 

# each branch is weighed per the observed history
for (my $br = 0; $br <= $maxbr; $br++) {
    $p += $brwt[$br] * ( 
        $conf * $hyp->{ppos}[$evi][$br]
        + (1. - $conf) * (1. - $hyp->{pneg}[$evi][$br])
    );  
}   

$hyp->{cwt} *= $p; 

And of course the functions that print the tables and normalize the data have been updated too.

All this stuff works on the small examples. I'm not sure yet if it would work well in the real world, I'd need a bigger example for that, and I don't have one yet. Also, I've noticed one potential issue even on the small examples but more about it in the next installment.