## Tuesday, August 2, 2016

### Bayes 21: mutual exclusivity and independence revisited

I've been thinking about using AdaBoost on multi-class problems, and I accidentally realized what is going on when I combined the mutually exclusive and the independent computation of hypotheses, as described in part 10.

Basically, the question boils down to the following: if the mutually exclusive computation shows that multiple hypotheses have risen to the top in about equal proportions (and consequently each of them gets a probability below 0.5), how could each of them end up above 0.5 in the independent computation? If every training case resulted in only one true hypothesis, the probabilities computed both ways would be exactly the same, because in the independent version the probability of ~H would be exactly equal to the sum of the probabilities of the other hypotheses.

The answer lies in the training cases where multiple hypotheses were true. With the weight-based approach, it becomes easy to see that if a new case matches such a training case, it brings all the hypotheses of that case to the top simultaneously. So the independent approach simply emphasizes the handling of these training cases. Equivalently, such training cases can be labeled with pseudo-hypotheses, and then the weights of these pseudo-hypotheses can be added to the "pure" hypotheses. For example, let's consider relabeling the example from part 10:

# tab09_01.txt and tab10_01.txt
                evA evB evC
1 * hyp1,hyp2   1   1   0
1 * hyp2,hyp3   0   1   1
1 * hyp1,hyp3   1   0   1

Let's relabel it as:

            evA evB evC
1 * hyp12   1   1   0
1 * hyp23   0   1   1
1 * hyp13   1   0   1

The probabilities of the original hypotheses can then be computed by postprocessing:

P(hyp1) = P(hyp12)+P(hyp13)
P(hyp2) = P(hyp12)+P(hyp23)
P(hyp3) = P(hyp23)+P(hyp13)

And this would give the same result as the independent computations for every hypothesis.
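The equivalence can be checked with a minimal sketch in Python. It assumes a simplified version of the weight-based computation (each case weight is multiplied by the event confidence when the case expects the event to be true and by its complement otherwise, with the confidence capped away from 0 and 1). The function names, the cap value of 0.01, and the data layout are made up for this illustration and are not taken from the earlier parts.

```python
# Training table from the example: (weight, true hypotheses, event outcomes).
CASES = [
    (1.0, ("hyp1", "hyp2"), {"evA": 1, "evB": 1, "evC": 0}),
    (1.0, ("hyp2", "hyp3"), {"evA": 0, "evB": 1, "evC": 1}),
    (1.0, ("hyp1", "hyp3"), {"evA": 1, "evB": 0, "evC": 1}),
]


def apply_events(cases, observed, cap=0.01):
    """Return the case weights after applying the observed event confidences.

    The cap (an assumed value) keeps any single event from zeroing a weight.
    """
    weights = []
    for weight, _hyps, outcomes in cases:
        for ev, conf in observed.items():
            c = min(max(conf, cap), 1.0 - cap)
            weight *= c if outcomes[ev] == 1 else 1.0 - c
        weights.append(weight)
    return weights


def independent(cases, observed):
    """P(H) computed separately for every H: weight of cases with H over total."""
    weights = apply_events(cases, observed)
    total = sum(weights)
    names = sorted({h for _w, hyps, _o in cases for h in hyps})
    return {
        name: sum(w for w, case in zip(weights, cases) if name in case[1]) / total
        for name in names
    }


def via_pseudo_hypotheses(cases, observed):
    """Mutually exclusive pseudo-hypotheses (hyp12, hyp23, hyp13), then summed."""
    weights = apply_events(cases, observed)
    total = sum(weights)
    pseudo = {}
    for w, case in zip(weights, cases):
        # The tuple of hypotheses plays the role of the pseudo-hypothesis label.
        pseudo[case[1]] = pseudo.get(case[1], 0.0) + w / total
    result = {}
    for hyps, p in pseudo.items():
        for name in hyps:
            result[name] = result.get(name, 0.0) + p
    return result


if __name__ == "__main__":
    observed = {"evA": 1.0, "evB": 1.0, "evC": 0.0}  # matches the first case
    print(independent(CASES, observed))
    print(via_pseudo_hypotheses(CASES, observed))
```

With an input that matches the first training case, both functions should put hyp1 and hyp2 above 0.5 at the same time, and the two dictionaries should agree up to rounding, since both sum the same case weights in a different order.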

So this approach works well when the combined set of symptoms for multiple hypotheses has been seen in training, and it is not much good for combinations that haven't been seen in training. Combining the independent and mutually exclusive computations, with a low acceptance threshold for the independent ones, tempers this effect only slightly.