I think the problem of removing the commonality can be boiled down to the following: Suppose we have an event A that gives the prediction with some effect *a* on the training cases, and an event B that gives the prediction *b* on the training cases. These predictions have some part *ab* (that's a 2-letter identifier, not a multiplication of *a* and *b*) in common. Then the effects can be written as:

A: a = a' + ab

B: b = b' + ab

Where *a'* and *b'* are the differences between the original *a* and *ab*, and *b* and *ab*. Here the operator "+" means not addition but application of the events in the deduction one after another. So "a + b" means "apply a, then apply b". The reason why I chose the sign "+" is that this combined application should act like addition, i.e. have the commutative and associative properties like

x + y = y + x

(x + y) + z = x + (y + z)

No, I don't have a proof why they do. But the problem with the AdaBoost hypotheses that I'm trying to solve is that the result there depends on the order of partial hypotheses (i.e. Bayesian events). I want to be able to decompose these events in such a way as to make the result independent of the order, hence the commutative requirement. The exact operation that satisfies these rules will need to be found but for now we can follow through the results of these rules.

So, going by these rules, if we apply both events, that will be:

a + b = a' + ab + b' + ab = a' + b' + 2*ab

The part *ab* gets applied twice. To remove this duplication we've got to get rid of one copy of *ab*. We can do this by splitting the original two events into three events A', B' and AB (again, that's a 2-letter identifier, not a multiplication):

A': a'

B': b'

AB: ab

a = a' + ab; b = b' + ab

The new event AB gets the common part, while A and B lose it and become A' and B'. Then with all events applied we get the result without duplication:

a' + b' + ab

If we want to add the fourth event C, we've got to get its overlap with all 3 previous events. This requires a whole bunch of multi-letter identifiers:

a'c would be the overlap between a' and c

b'c would be the overlap between b' and c

abc would be the overlap between ab and c

And the identifiers with one more prime added would be the "remainders":

a' = a'' + a'c

b' = b'' + b'c

ab = ab' + abc

c = c' + a'c + b'c + abc

Then the result of applying all these events without duplication will be:

a'' + a'c + b'' + b'c + ab' + abc + c'
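As a sanity check on this bookkeeping, here is a small sketch that treats each piece (a'', a'c and so on) as an opaque label and uses multiset addition as a stand-in for "+". That stand-in is purely an assumption for illustration (it happens to be commutative and associative, and it keeps duplicates); the real operation is still to be found. All it does is count how many times each piece gets applied.

```python
from collections import Counter

# Each original effect written out as the pieces it decomposes into.
# Counter addition is only a hypothetical stand-in for "+".
a = Counter(["a''", "a'c", "ab'", "abc"])  # a = a' + ab = (a'' + a'c) + (ab' + abc)
b = Counter(["b''", "b'c", "ab'", "abc"])  # b = b' + ab = (b'' + b'c) + (ab' + abc)
c = Counter(["c'", "a'c", "b'c", "abc"])   # c = c' + a'c + b'c + abc

# Applying A, B and C as they are: the shared pieces get counted repeatedly.
naive = a + b + c

# Applying the split events instead: every piece exactly once.
split = Counter(naive.keys())

for piece in sorted(naive):
    print(f"{piece:4} naive={naive[piece]} split={split[piece]}")
```

With three original events this produces 7 distinct pieces, i.e. 2^3 - 1, which is where the growth by a power of 2 comes from.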

This gives the outline of the solution but it still has two problems: the number of events after splitting grows exponentially (a power of 2) with the number of the original events, and the concrete meaning of the operation "+" needs to be defined. And actually there is one more operation that needs to be defined: what does the "overlap" mean and how do we compute the value of *ab* from the values of *a* and *b*?

The answers to both problems might be connected. One way to define the overlap would be to associate each training case with exactly one event that predicts it well. Then when we split *a* and *b* into *a'*, *b'*, and *ab*, we would be splitting the whole set of training cases into 3 parts: one part gets predicted well by both A and B, one only by A, and one only by B.
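A minimal sketch of this three-way split, assuming each effect is represented simply as the set of ids of the training cases that the event predicts well. The function name `split_pair` and the sample ids are made up for the example.

```python
def split_pair(a, b):
    """Split two sets of training-case ids into (a', b', ab).

    ab holds the cases predicted well by both events,
    a' the cases predicted well only by A, b' only by B.
    """
    ab = a & b
    return a - ab, b - ab, ab


# Hypothetical data: ids of the training cases each event predicts well.
a = {0, 1, 2, 3}
b = {2, 3, 4}

a_prime, b_prime, ab = split_pair(a, b)
print(a_prime, b_prime, ab)
```

The three resulting sets are disjoint, and together they cover exactly the cases that A or B covered, so no training case ends up counted twice.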

Then the problem with the exponential growth can be answered in a two-fold way. First, the number of the events after splitting can't grow higher than the number of the training cases. Second, we can put an artificial cap on the number of events that we're willing to use (Emax), and select up to this many events, ones that predict the most training cases each. We can then either sweep the remaining training cases under the carpet, saying that they are one-offs that would cause overfitting, or split them somehow with partial strength among the events that get included.
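Both points can be sketched in a few lines, again assuming that each original event is represented as the set of training-case ids it predicts well. Every training case lands in exactly one split event, identified by the combination of original events that cover it, so there can never be more split events than cases. The names `refine` and `e_max` are hypothetical, and what to do with the leftover cases is deliberately left open.

```python
def refine(events, e_max=None):
    """Split overlapping events into disjoint ones.

    events: dict mapping an event name to the set of training-case ids
            it predicts well (an assumed representation).
    e_max:  optional cap on the number of split events to keep.
    Returns (kept, leftover): kept is a list of (names, cases) pairs,
    largest first; leftover is the set of cases in the dropped events.
    """
    groups = {}
    for case in set().union(*events.values()):
        # The combination of original events covering this case
        # identifies the split event it belongs to.
        names = frozenset(n for n, cases in events.items() if case in cases)
        groups.setdefault(names, set()).add(case)

    # Largest split events first; ties broken by name for determinism.
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))
    kept = ranked if e_max is None else ranked[:e_max]
    dropped = [] if e_max is None else ranked[e_max:]
    leftover = set().union(*(cases for _, cases in dropped))
    return kept, leftover


# Hypothetical data: 3 events over 8 training cases.
events = {"A": {0, 1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5, 6, 7}}

everything, _ = refine(events)
kept, leftover = refine(events, e_max=2)
for names, cases in kept:
    print(sorted(names), sorted(cases))
print("leftover:", sorted(leftover))
```

Here the 3 events could in principle split into 7 pieces, but only 5 of the combinations actually occur in the data, and the cap then keeps the 2 largest.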

The last sentence also suggests the third way of tackling the growth: if we define some way to do the splitting, instead of multiplying the events we could just split the power they have on a particular training case. But I think this would create issues in case of the partial confidence during the execution of the model that can be handled better with the combined events.

Returning to the definition of "+", I couldn't figure out yet how to make such an operation directly in AdaBoost. Maybe it can be done through logarithms; that will need some more thinking. It requires the ability to say that some training case doesn't get affected by a partial hypothesis/event. But there seems to be an easy way to do it in the Bayesian way: mark the confidence of that event for that training case as 0.5. And then these transformed events/partial hypotheses can be fed into AdaBoost too.
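To see why a confidence of 0.5 means "no effect", here is a minimal sketch. It assumes the confidence enters the Bayesian update by linearly mixing the "event true" and "event false" cases, which is one common way to handle partial confidence and is an assumption here, not something this post defines. The function name and the sample probabilities are made up.

```python
import math


def bayes_update(p_h, p_e_given_h, p_e, confidence):
    """Update P(H) by one event known with the given confidence.

    confidence = 1: the event is known to be true  -> P(H|E)
    confidence = 0: the event is known to be false -> P(H|~E)
    confidence = 0.5: nothing is known about the event
    """
    num = confidence * p_e_given_h + (1 - confidence) * (1 - p_e_given_h)
    den = confidence * p_e + (1 - confidence) * (1 - p_e)
    return p_h * num / den


# Hypothetical probabilities, chosen to be mutually consistent.
p_h, p_e_given_h, p_e = 0.3, 0.8, 0.4

known_true = bayes_update(p_h, p_e_given_h, p_e, 1.0)
undecided = bayes_update(p_h, p_e_given_h, p_e, 0.5)

print(known_true, undecided)
assert math.isclose(undecided, p_h)  # the event left P(H) unchanged
```

At 0.5 both the numerator and the denominator of the correction collapse to 0.5 no matter what P(E|H) and P(E) are, so the event drops out of the computation for that case.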

This fits very well with the underlying goal of boosting: to find the approximately equal rate of the predominantly correct votes for each training case. After the transformation there will be exactly one right vote for each training case, and the rest of votes will be "undecided" (or at least sort of, subject to the limitations introduced by the mixing of training cases).

Another consequence is that such splitting would make the produced events fully independent of each other: each event would have an exactly 50% correlation with each other event, meaning that they can be applied in any order. And this is exactly what we've been aiming for.

So it all looks good and simple but I'm not sure yet if I've missed something important that will take the whole construction apart. There might well be.

P.S. After a little thinking I've realized that the idea of the number of events being limited by the number of training cases is pretty much equivalent to doing the computation directly on the training cases by weight.
