Friday, June 24, 2016

AdaBoost 4 or the square root returns

I've had to return the library book on boosting. I've ordered my own copy, but on the last day I've paged through the more interesting chapters. This gave me some ideas on how the Bayesian model can be fit to those too, and I started writing it up, but I've got stopped at the question: what should get minimized? So I went back to the book (my own copy now) and I think that I finally understand it. Here is how it works:

The real measure that gets minimized on each round of boosting is Z. In the book the authors prove that it's not the training error as such but the upper bound on the training error. They write it (with dropped subscript t of the boosting round) as

Z = sum for all good cases i (Wi * exp(-Alpha)) + sum for all bad cases j (Wj * exp(Alpha))

Where W is the weight of a particular training case (same as D(i) in the classic AdaBoost terms, just since I've been using weights rather than a distribution, the letter W comes easier to my mind). They have a use for this exponent in the proofs but here it doesn't matter. Let's create a notation with a new variable S (the name doesn't mean anything, it's just a letter that hasn't been used yet) that will sweep the exponent under the carpet:

S = exp(Alpha)
Z = sum for all good cases i (Wi / S) + sum for all bad cases j (Wj * S)

S has a "physical meaning": it represents how the chances of a particular case change. The weights of the "good" cases (i.e. those that got predicted correctly on a particular round) get divided by S because we're trying to undo the effects of the changes from this round of boosting for the next round. Getting back to the Bayesian computation,

S = C / (1-C)

Where C is the confidence of the round's prediction acting on this case. When the produced boosting algorithm runs, if the round's prediction votes for this case (which we know got predicted by it correctly in training), its weight will be multiplied by C. If the prediction votes against this case, its weight will be multiplied by (1-C). Thus S measures how strongly this prediction can differentiate this case. For the cases that get mispredicted by this round, the relation is opposite, so their weights get multiplied by S instead of division.

The mathematical way to minimize Z is by finding the point(s) where its derivative is 0.

dZ/dS = sum for all good cases i (Wi / -S2) + sum for all bad cases j (Wj) = 0
sum for all good cases i (Wi / S2) = sum for all bad cases j (Wj)

And since for now we assume that S is the same for all the cases,

sum for all good cases i (Wi) = Wgood

then

And since we know that the training error e is proportional to Wbad and (1-e) is proportional to Wgood:

1-e = Wgood / (Wgood + Wbad)

then we can rewrite S as

S = sqrt( (1-e) / e )

Or substituting the expression of S through C,

C / (1-C) = sqrt( (1-e) / e )
C = sqrt(1-e)

Which means that the Bayesian confidence in AdaBoost really is measured as a square root of the "non-error". This square root can be put away in the basic case but it becomes important for the more complex variations.
[I think now that this part is not right, will update later after more thinking]

Returning to the minimization, this found value of S gets substituted into the original formula of Z, and this is what we aim to minimize on each round. For the classic AdaBoost it is:

Z = Wgood / sqrt((1-e) / e) + Wbad * sqrt((1-e) / e)

Since Wgood is proportional to (1-e) and Wbad is proportional to e (and in case if the sum of all the weights is 1, Wgood=1-e, and Wbad=e), we can substitute them:

Z = (1-e) / sqrt((1-e) / e) + e * sqrt((1-e) / e)
= sqrt(e * (1-e)) + sqrt(e * (1-e))
= 2 * sqrt(e * (1-e))

This value gets minimized when e gets closer to either 0 or 1 (because the formula is symmetric and automatically compensates for the bad predictions by "reverting" them). The classic AdaBoost algorithm then says "we'll rely on the ability of the underlying algorithm to do the same kind of reverting" and just picks the minimization of e towards 0. This makes the computation more efficient for a particular way of computation for C but the real formula is above. If we want to handle the more generic ways, we've got to start with this full formula and only then maybe simplify it for a particular case.

This formula can be also written in a more generic way, where each training case may have its own different value of confidence C and thus of S:

Z = sum for all good cases i(Wi / Si) + sum for all bad cases j(Wj * Sj)

And now I'll be able to talk about these more generic cases.

By the way, the difference of how the variation of AdaBoost for logistic regression works from how the classic AdaBoost works is in a different choice of measure for what gets minimized, a different formula for Z.