Wednesday, March 1, 2017

Bayes 24: AdaBoost as a solution to correlated events

When writing the whitepaper, I realized that AdaBoost is very good at finding the cross-correlation of its partial hypotheses (which map to the Bayesian events), and it always picks the next step with the minimal cross-correlation. The Bayesian computations do have an issue with cross-correlation, as described in the whitepaper and near the end of Part 8 of my notes: if there are multiple events that duplicate or mirror each other, all of them get treated as independent and drive the probabilities off. It looked like AdaBoost could give a hint of how to solve this problem, so I started thinking about it, and I think I've found a solution.

In retrospect the solution is pretty obvious: instead of computing the Bayesian model directly, treat it as an AdaBoost model whose set of partial hypotheses equals the set of Bayesian events, with the extra limitation that each of them must be used exactly once. If there are multiple equivalent events and they happen to go one after another, AdaBoost will treat the first one as real, while the following ones will get an error rate of 0.5 and become effectively no-ops.

Well, if there are other events interspersed between these equivalent copies, they won't be no-ops: the other events will shift the weights of the training cases, so the second and further copies will find themselves drifted away from 0.5. The good news is that AdaBoost is pretty good at removing such partial correlation too; the bad news is that the order matters: to fully remove the duplication, the equivalent events should go one after another. And in general, it looks like the best removal of the cross-correlation happens when the closely related events follow closely one after another. This kind of order can be arranged if we reverse the usual AdaBoost preference among the partial hypotheses and always pick the one whose training error rate is closest to 0.5 (a sketch of this selection is shown below). Since, unlike in the classic AdaBoost, the set of partial hypotheses is limited and each element of it must be used exactly once, the better ones will eventually get used as well.
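To illustrate the idea, here is a minimal sketch of that selection rule. It is not part of ex24_01run.pl (which applies the events in whatever order it's given), and the function and argument names are made up for the example; the errors passed in are assumed to be recomputed with the current case weights after each applied event.

# Hypothetical helper: given the current weighted training error of each
# not-yet-used event, pick the event whose error is closest to 0.5, so that
# the near-duplicates of the already used events get consumed right away.
sub pick_next_event {
    my ($error, $used) = @_; # hashref {event name => error}, hashref of used event names
    my ($best, $bestdiff);
    for my $ev (sort keys %$error) {
        next if $used->{$ev};
        my $diff = abs($error->{$ev} - 0.5);
        if (!defined($best) || $diff < $bestdiff) {
            ($best, $bestdiff) = ($ev, $diff);
        }
    }
    return $best; # undef when all the events have been used up
}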

To experiment with these ideas, I wrote the code example ex24_01run.pl. It doesn't reorder the events, it just computes the AdaBoost values for whatever order it's given, which allows experimenting with the various orders. But it does have a couple of advanced features from the later chapters of the book on AdaBoost: it implements the multi-class (but single-label) version, and it also partitions the error rate computation by the sign of the value of the partial hypothesis.

The multi-class part means that it can work not only with two outcomes (a yes/no decision) but with multiple possible outcomes (i.e. Bayesian hypotheses), just like the normal Bayesian table. And just like in the normal Bayesian table, these outcomes must be mutually exclusive, so each case may have only one outcome (single-label). The multi-class logic works in a way similar to the one described in Part 10: the resulting table is just a combination of the individual per-outcome tables. From the standpoint of each outcome, the single-label property means that the rest of the rows of the table taken together represent the opposite of this outcome, so there is no need to compute that opposite explicitly.

The partitioning of the error rate computation not only makes the result better but also comes in very handy for proving that the opposite of one outcome is equivalent to the rest of the rows of the table. The AdaBoost training error e is the ratio of the total weight of the training cases where the value of the event E (that is, the Bayesian event; in AdaBoost terms it's a partial hypothesis) is not equal to the value of the outcome (Bayesian hypothesis) H. And we'll compute it separately for E (i.e. E=1) and ~E (i.e. E=0). It ends up expressible as probabilities that can be computed by summing and dividing the weights W of the training cases:

e(H,E) = P(~H|E) = (sum for cases i where H(i)=0 and E=1 (Wi))/(sum for cases j where E=1 (Wj))
e(H,~E) = P(H|~E) = (sum for cases i where H(i)=1 and E=0 (Wi))/(sum for cases j where E=0 (Wj))
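
Here is a minimal sketch of these two formulas in code. The case fields wt, tc, and hyp follow the fragments of ex24_01run.pl quoted below; the function itself and the rest of the names are made up for the illustration.

# Hypothetical helper computing the partitioned training errors of one
# outcome H and one event E directly per the formulas above:
#   e(H,E)  = P(~H|E) - weight of cases with E=1 and not H, over weight with E=1
#   e(H,~E) = P(H|~E) - weight of cases with E=0 and H, over weight with E=0
sub partitioned_error {
    my ($cases, $outcome, $evi) = @_; # arrayref of cases, outcome name, event index
    my ($err_e, $tot_e, $err_ne, $tot_ne) = (0., 0., 0., 0.);
    for my $c (@$cases) {
        my $has_h = ($c->{hyp}[0] eq $outcome); # is this case marked with the outcome H?
        if ($c->{tc}[$evi] >= 0.5) { # the event is present in this case (E=1)
            $tot_e += $c->{wt};
            $err_e += $c->{wt} unless $has_h;
        } else { # the event is absent (E=0)
            $tot_ne += $c->{wt};
            $err_ne += $c->{wt} if $has_h;
        }
    }
    return ($tot_e ? $err_e / $tot_e : 0.5,    # e(H,E)
        $tot_ne ? $err_ne / $tot_ne : 0.5);    # e(H,~E)
}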

So if we pick a particular outcome and denote it H1, we can show that its opposite ~H1 is equivalent to the sum of the rest of the hypotheses H2...HN by showing that

P(~H1|E) = sum for i in [2, N] (P(Hi|E))

which after removing the common denominator means

(sum for cases j where H1(j)=0 and E=1 (Wj)) = (sum for i in [2, N] (sum for cases k where Hi(k)=1 and E=1 (Wk)))

The single-label property means that each case is marked with exactly one outcome, and all the cases where the outcome H1 is absent must have some other outcome Hi present, so these sums are indeed equal.
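
As a quick numeric sanity check of this identity, take the E1 column of the first example below: the cases with E1=1 are two hyp1 cases and one hyp2 case, all with the weight 1 (the data layout here is made up just for the check).

# Check that P(~hyp1|E1) equals the sum of P(Hi|E1) over the other outcomes
# (here the only other outcome is hyp2), on the cases with E1=1.
my @e1 = ( [ 'hyp1', 1. ], [ 'hyp1', 1. ], [ 'hyp2', 1. ] ); # [outcome, weight]
my ($total, $not_h1, $others) = (0., 0., 0.);
for my $c (@e1) {
    $total  += $c->[1];
    $not_h1 += $c->[1] if $c->[0] ne 'hyp1';
    $others += $c->[1] if $c->[0] eq 'hyp2';
}
printf("%.5f == %.5f\n", $not_h1/$total, $others/$total); # prints 0.33333 == 0.33333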

So it is all self-consistent. The key part of the computation is:

# for each outcome, update its weight by this event's table entry
for my $hyp (values %abhyp) {
    if ($conf) {
        # the event value is 1 (confirmed): multiply by P(H|E)
        $hyp->{wt} *= $hyp->{ppos}[$evi];
    } else {
        # the event value is 0: multiply by P(H|~E) = 1 - P(~H|~E)
        $hyp->{wt} *= (1. - $hyp->{pneg}[$evi]);
    }
}

And the matching part of the table building is:

# for each training case, divide out the factor that the computation
# will later multiply in for this case's own outcome
foreach my $c (@case) {
    my $hyp = $c->{hyp}[0]; # the single outcome this case is marked with
    # this code really handles only the TC values of 0 and 1,
    # for the arbitrary values it would have to do an interpolation
    if ($c->{tc}[$i] >= 0.5) {
        $c->{wt} /= $abhyp{$hyp}->{ppos}[$i];
    } else {
        $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]);
    }
}

The training undoes the weight change that will be done in the computation: each case's weight gets divided by the same factor that the computation would multiply in for this case's outcome if the event arrived with the same value as in this case. This is the usual AdaBoost reweighting, and it's what drives the table values of the subsequent duplicate events towards 0.5.
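
For reference, here is roughly the shape of the data that these two fragments operate on. This is my reconstruction, not a quote from ex24_01run.pl: the names %abhyp, wt, ppos and pneg come from the fragments above, and the values are taken from the AdaBoostian table of the first example below.

# Assumed layout: one entry per outcome (Bayesian hypothesis), holding its
# starting weight and the per-event pair P(H|E), P(~H|~E) printed as "p^p".
my %abhyp = (
    hyp1 => {
        wt   => 3.,                            # starting weight of the outcome
        ppos => [ 0.66667, 0.50000, 0.40000 ], # P(H|E), indexed by event
        pneg => [ 0.66667, 0.50000, 0.33333 ], # P(~H|~E), indexed by event
    },
    hyp2 => {
        wt   => 3.,
        ppos => [ 0.33333, 0.50000, 0.60000 ],
        pneg => [ 0.33333, 0.50000, 0.66667 ],
    },
);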

Now let's try some examples. Here is the example that I used in the whitepaper:

$ perl ex24_01run.pl -n tab24_01.txt in24_01_01.txt
Original training cases:
!      ,       ,E1          ,E2          ,E3          ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E2             ,E3             ,
hyp1   ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333,
hyp2   ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E2, c=1.000000
hyp1    w=1.00000 p=0.66667
hyp2    w=0.50000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.40000 p=0.57143
hyp2    w=0.30000 p=0.42857

Here the option "-n" means "don't normalize the weights" (or when normalized to the sum of 1 they will be the same as probabilities). The two values printed in the AdaBoostian table separated by a "^" are the P(H|E) and P(~H|~E).

As you can see, the result is not correct either (because of the XOR problem), but it's better than with the plain Bayes version that gave hyp1 the probability 0.667. And since E1 and E2 are the same, E2 ended up with the values of 0.5 in the table, and applying it didn't change the weights.

If we change the order of events, E2 gains some use back:

$ perl ex24_01run.pl -n tab24_02.txt in24_01_01.txt 
Original training cases:
!      ,       ,E1          ,E3          ,E2          ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E1             ,E3             ,E2             ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,
--- Applying event E1, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E3, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E2, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455

Rather surprisingly, the result ended up more correct. Or maybe it's not that surprising: after all, AdaBoost is designed to gain more information from available data.

Yet another interesting result gets produced by repeating the same pair of events multiple times:

$ perl ex24_01run.pl -n tab24_03.txt in24_03_01.txt 
Original training cases:
!      ,       ,E01         ,E02         ,E11         ,E12         ,E21         ,E22         ,E31         ,E32         ,E41         ,E42         ,E51         ,E52         ,E61         ,E62         ,E71         ,E72         ,E81         ,E82         ,E91         ,E92         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,0.00000/1.00,
AdaBoostian table:
!      ,       ,E01            ,E02            ,E11            ,E12            ,E21            ,E22            ,E31            ,E32            ,E41            ,E42            ,E51            ,E52            ,E61            ,E62            ,E71            ,E72            ,E81            ,E82            ,E91            ,E92            ,
hyp1   ,3.00000,0.66667^0.66667,0.40000^0.33333,0.47368^0.48276,0.49694^0.49526,0.49916^0.49946,0.49990^0.49985,0.49997^0.49998,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
hyp2   ,3.00000,0.33333^0.33333,0.60000^0.66667,0.52632^0.51724,0.50306^0.50474,0.50084^0.50054,0.50010^0.50015,0.50003^0.50002,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,0.50000^0.50000,
--- Applying event E01, c=1.000000
hyp1    w=2.00000 p=0.66667
hyp2    w=1.00000 p=0.33333
--- Applying event E02, c=1.000000
hyp1    w=0.80000 p=0.57143
hyp2    w=0.60000 p=0.42857
--- Applying event E11, c=1.000000
hyp1    w=0.37895 p=0.54545
hyp2    w=0.31579 p=0.45455
--- Applying event E12, c=1.000000
hyp1    w=0.18831 p=0.54242
hyp2    w=0.15886 p=0.45758
--- Applying event E21, c=1.000000
hyp1    w=0.09400 p=0.54159
hyp2    w=0.07956 p=0.45841
--- Applying event E22, c=1.000000
hyp1    w=0.04699 p=0.54149
hyp2    w=0.03979 p=0.45851
--- Applying event E31, c=1.000000
hyp1    w=0.02349 p=0.54147
hyp2    w=0.01990 p=0.45853
--- Applying event E32, c=1.000000
hyp1    w=0.01175 p=0.54146
hyp2    w=0.00995 p=0.45854
...

--- Applying event E92, c=1.000000
hyp1    w=0.00000 p=0.54146
hyp2    w=0.00000 p=0.45854

AdaBoost converges pretty quickly to the case weights that make both events of the pair come out at the probability 0.5 and stop any further changes.

This seems to work, although it needs more testing. One more limitation of this implementation is that it works only with the events that have the value of either 0 or 1, not anything in between. I'll talk more about this in the next post.
