Saturday, July 2, 2016

AdaBoost 6: multiple confidence ranges & other thoughts

One of the chapters I quickly read through was on using a better estimation of the probability of success of the partial hypotheses (in the AdaBoost sense), with the following example:

Suppose the underlying algorithm builds a decision stump and finds that it can classify the cases with a value above some limit pretty well (say, 90% of them are "+1"), while the cases with values under this limit are undifferentiated (say, 49% of them are "+1"). In this case, if we allow the underlying algorithm to return not only +1 and -1 but also 0 for when it can't differentiate well, the boosting goes much better.
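To make this example concrete, here is a minimal sketch in Python (the function and parameter names, and the cutoff margin, are mine, just for illustration, not from the book) of a stump that is allowed to return 0 when one side of the threshold is undifferentiated:

def stump_predict(x, threshold, frac_above, frac_below, margin=0.1):
    # Three-valued decision stump: vote +1 or -1 only when this side of the
    # threshold is well-differentiated, return 0 (abstain) when the fraction
    # of "+1" cases on it is too close to 0.5.
    # frac_above, frac_below: fraction of "+1" training cases on each side
    # (0.9 and 0.49 in the example above); margin is an arbitrary cutoff.
    frac = frac_above if x > threshold else frac_below
    if frac >= 0.5 + margin:
        return +1
    if frac <= 0.5 - margin:
        return -1
    return 0  # can't differentiate well, don't vote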

Since part 3 I've been wondering whether the algorithm can be improved in an obvious way: by using separate confidence values for the situations when the underlying algorithm returns +1 or -1 (or in the equivalent Bayesian terms, 1 or 0), and maybe doing even better than that by forming multiple "bands" of training cases by some rule and assigning a separate confidence value to each band, and maybe going even further by computing an individual confidence value for each case. The example above shows that it can, and the book then goes on to the exact same idea of multiple bands, each with its own confidence value, and then to partial hypotheses returning an individual confidence value for each training case. So that idea really was obvious, and there is not much point in writing more about it. The only tricky part is the minimization criterion, but that becomes straightforward from the description in part 4.
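Just to illustrate the "bands" part of that idea, here is a rough Python sketch (all the names and the choice of band edges are my own assumptions, not the book's code) that assigns each band the weighted fraction of "+1" cases in it as that band's confidence:

import numpy as np

def band_confidences(values, labels, weights, edges):
    # values, labels (+1/-1), weights: per-training-case arrays;
    # edges: band boundaries picked by some rule.
    # Each band's confidence is the weighted fraction of "+1" cases in it.
    confidences = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_band = (values >= lo) & (values < hi)
        total = weights[in_band].sum()
        if total == 0:
            confidences.append(0.5)  # empty band: stay neutral
        else:
            positive = weights[in_band & (labels == 1)].sum()
            confidences.append(positive / total)
    return confidences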

One more idea would be to make the decision stumps not discrete ("1" on one side of the value, "0" on the other) but graduated, with the confidence being 0.5 at the value and growing towards 1 on one side and towards 0 on the other, possibly saturating at some distance from the center point value.

An easy example of saturation would be this: if we're classifying a set of points in a 2-dimensional space with decision stumps based on the coordinate values, this means that on each step we find some value of a coordinate (X or Y) and say that the points on one side of it have mostly the value "1" and on the other side they have mostly the value "0". Then the classic AdaBoost approach (without the square root in the confidence) takes the confidence C as the fraction of the points whose values got guessed right. Or, translating from the pair (value, confidence) to just the confidence, the pair (1, C) becomes C, and the pair (0, C) becomes (1-C). But the breaking point is usually not a single point, it's a range between the two coordinate values closest to this break. The classic approach is to pick the center of this range as the breaking point. But we could instead say that the confidence changes linearly between these two coordinate values and saturates to C and (1-C) beyond this range. This smooth transition could help with the training on data that has errors in it. The book contains an example of training on data with 20% of errors, which led to overfitting over many rounds of boosting. The smooth transitions could provide the averaging that would counteract this overfitting. Maybe.
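Here is how that linear transition might look in code, as a sketch (lo and hi being the two coordinate values bracketing the break, C the confidence on the high side; the names are my own):

def ramp_confidence(x, lo, hi, C):
    # Graduated stump: saturates to (1-C) below lo and to C above hi,
    # changes linearly in between, passing through 0.5 at the middle of
    # the range (which is where the classic approach puts the breaking point).
    if x <= lo:
        return 1.0 - C
    if x >= hi:
        return C
    t = (x - lo) / (hi - lo)  # position within the transition range, in [0..1]
    return (1.0 - C) + t * (2.0 * C - 1.0)  # linear blend from (1-C) to C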

Another thought that occurred to me is what if we're boosting while we're boosting? Or in other words, what if the partial algorithm runs a few rounds of boosting on its own? How can it be equivalently represented as a single round of boosting?

This looks like a good example where an individual confidence value for each training case would be handy. As we go through the execution stage, the weight of each example would be multiplied on each round of "nested boosting" in the same way as it would be on each round of normal boosting. I.e. if a round of boosting had the confidence C and guessed this particular training case right, the weight of this training case will be multiplied by C, and if it guessed this training case wrong, the weight would be multiplied by (1-C).

So if we define the effective confidence Ce_t of round t as:

Ce_t = C_t if the round guessed this case right, otherwise Ce_t = 1 - C_t

then we can see that the multiplier will be:

product for rounds t of (Ce_t)

Except that it's not properly scaled to be in the range [0..1]. To adjust the scaling, it has to become

product for rounds t of (Ce_t) / ( product for rounds t of (Ce_t) + product for rounds t of (1 - Ce_t) )

There won't be a single confidence value for all the training cases of such a composite round; each training case will have its own confidence value, with the probability of pointing towards 0 or 1 built into it.
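For a single training case that computation would look like this (a sketch, with names of my own choosing: per_round_C are the confidences C of the nested rounds, guessed_right says whether each nested round got this case right):

def composite_confidence(per_round_C, guessed_right):
    # Effective confidence of each nested round for this case:
    # Ce_t = C_t if the round guessed the case right, else 1 - C_t.
    # The composite confidence of the case is the product of Ce_t,
    # rescaled into [0..1] as in the formula above.
    num = 1.0
    den = 1.0
    for C, right in zip(per_round_C, guessed_right):
        ce = C if right else 1.0 - C
        num *= ce
        den *= 1.0 - ce
    return num / (num + den)

For example, three nested rounds with confidences 0.8, 0.7 and 0.6 that guessed a case right, right and wrong would give a composite confidence of about 0.86 for that case.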

One more interesting thing I've read is that after some number of rounds AdaBoost tends to start going in a circle through the same set of distributions (or sometimes not quite the same but very close). To me this looks like an indication that it's time to stop. Boosting the margins by going through this loop repeatedly looks like cheating, because if we look at its Bayesian meaning, that meaning implies that the events being examined must be essentially independent of each other. But if we repeatedly examine the same events, they're not independent, they're extra copies of the events that have already been seen. Thus re-applying them repeatedly doesn't seem right. On the other hand, the events in such a loop do form this loop because they represent each event in a balanced way. So maybe the right thing to do is to throw away everything before this loop too, and just leave one iteration of this loop as the only events worth examining. This would be interesting to test if/when I get around to doing it.
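Detecting the loop itself should be easy enough if I get to testing it: keep the weight distributions from the previous rounds and check whether the current one (nearly) repeats an earlier one. A rough sketch, with an arbitrary tolerance:

import numpy as np

def find_repeat(distributions, tol=1e-6):
    # distributions: list of per-round weight vectors, most recent last.
    # Returns the index of the earlier round whose distribution is (nearly)
    # the same as the latest one, or None if there is no repeat yet.
    last = np.asarray(distributions[-1])
    for i, d in enumerate(distributions[:-1]):
        if np.max(np.abs(np.asarray(d) - last)) < tol:
            return i
    return None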