Before I start on that explanation, I want to talk about why boosting works at all. Yes, I've followed the proofs (more or less), but it took me a while to get an intuitive feeling for them, the "physical meaning" of the formulas. I want to share this understanding, which is much simpler than the proofs (and I hope that it's correct):
The premise of boosting is that we're able to find a number of methods (what AdaBoost calls "hypotheses") to predict the correct outcomes of the training cases, each method correctly predicting more than 50% of the training cases. Then if we collect a large-ish number of these methods, we can predict the correct outcomes of all the training cases simply by averaging the predictions of these methods. And the other cases will follow the training cases (unless overfitting happens). Since more than 50% of the cases are correctly predicted by each method, after the averaging more than 50% of the votes for each training case will be correct, and thus the result will be correct too. Of course, this depends on the correct predictions being distributed pretty evenly among the cases. If we have a thousand methods that predict the same cases correctly and the other cases incorrectly, then obviously after averaging those other cases will still be predicted incorrectly. So the selection of methods must somehow shuffle the preference for the cases, so that each newly picked method predicts well the cases that have been predicted poorly by the previously picked methods. That's it, that's the whole basic idea. It's a bit of an oversimplification but it's easy to understand.
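To make that idea concrete, here is a toy sketch in Python (my own example, not from the AdaBoost literature): three methods, each correct on only 3 or 4 of the 5 training cases but wrong on different ones, so the majority vote comes out correct on all 5.

# Toy illustration: three methods, each correct on only 3 or 4 of the
# 5 training cases, but wrong on different cases, so the majority vote
# comes out correct on all 5.
truth   = [+1, +1, -1, -1, +1]
methods = [
    [+1, +1, -1, +1, -1],  # correct on 3 of 5
    [+1, -1, -1, -1, +1],  # correct on 4 of 5
    [-1, +1, +1, -1, +1],  # correct on 3 of 5
]
for i, y in enumerate(truth):
    vote = sum(m[i] for m in methods)
    print("case", i, "majority", "+1" if vote > 0 else "-1", "truth", y)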
Now on to the explanation of the connection between AdaBoost and the Bayesian logic.
To start with, I want to show one more rewrite of the AdaBoost algorithm. It builds further on the version I've shown in the last installment. Now I'm returning to the notation of et and (1-et). These values are proportional to Wbadt and Wgoodt but are confined to the range [0, 1], which fits well into the Bayesian computations. More importantly, this new version gets rid of the square root in the computation of Dt+1(i). It wasn't the first thing I realized about the connection between AdaBoost and the Bayesian logic, it was actually the last one, but after it all the parts of the puzzle fell into place. So I want to show this key piece of the puzzle first.
The point of this computation is to readjust the weights of the training cases for the next round, so that the total weight of all the cases successfully predicted by the current round equals the total weight of the unsuccessfully predicted cases. There is more than one way to make the weights satisfy this condition; they're all proportional to each other and all just as good. The way the weights are modified in the classic form of the AdaBoost algorithm has been selected to make the modification symmetric and to allow writing the adjustment for both correct and incorrect cases as a single formula:
Dt+1(i) = Dt(i)*exp(-Alphat*yi*ht(xi)) / Zt
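Recall that in the classic algorithm Alphat = (1/2)*ln((1-et)/et), so for an incorrectly predicted case (where yi*ht(xi) = -1) this formula multiplies the weight by sqrt((1-et)/et), and for a correctly predicted case by sqrt(et/(1-et)), with Zt then normalizing the total weight back to 1. That's where the square roots come from.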
But if we're writing an honest if/else statement, it doesn't have to be symmetric. We can as well write either of:
if ht(xi) != yi {
  Dt+1(i) = Dt(i) / et
} else {
  Dt+1(i) = Dt(i) / (1-et)
}
or with a degenerate "else" part:
if ht(xi) != yi {
  Dt+1(i) = Dt(i) * (1-et) / et
} else {
  Dt+1(i) = Dt(i)
}
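Here is a quick numeric check in Python (my own sketch, with made-up weights) that both of these variants, as well as the classic square-root formula, end up with the correctly and incorrectly predicted cases carrying equal total weight, differing only by a common scale factor:

# Take an arbitrary weighted error e and arbitrary case weights, apply the
# three update rules, and verify that in each result the total weight of the
# correctly predicted cases equals the total weight of the incorrect ones.
from math import sqrt

D = [0.1, 0.3, 0.2, 0.4]             # current weights Dt(i)
correct = [True, False, True, True]  # whether ht(xi) == yi for each case
e = sum(d for d, c in zip(D, correct) if not c) / sum(D)   # et = 0.3

rules = {
    "asymmetric if/else":  lambda d, c: d / (1 - e) if c else d / e,
    "degenerate else":     lambda d, c: d if c else d * (1 - e) / e,
    "classic square root": lambda d, c: d * sqrt(e / (1 - e)) if c else d * sqrt((1 - e) / e),
}
for name, rule in rules.items():
    newD = [rule(d, c) for d, c in zip(D, correct)]
    good = sum(d for d, c in zip(newD, correct) if c)
    bad  = sum(d for d, c in zip(newD, correct) if not c)
    print(name, good, bad)   # good == bad for every rule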
Either way the result is the same, and there is no need to involve the square roots, so the formulas become simpler. After making this simplification, here is my latest and simplest version of the algorithm:
Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
- Train the basic algorithm using the weights Dt.
- Get weak hypothesis ht: X -> {-1, +1}.
- Aim: select ht to minimize the weighted error et:
  Wgoodt = 0; Wbadt = 0;
  for i = 1, ..., m {
    if ht(xi) == yi {
      Wgoodt += Dt(i)
    } else {
      Wbadt += Dt(i)
    }
  }
  et = Wbadt / (Wgoodt + Wbadt)
- Update:
  for i = 1, ..., m {
    if ht(xi) != yi {
      Dt+1(i) = Dt(i) * (1-et) / et
    } else {
      Dt+1(i) = Dt(i)
    }
  }
Output the final hypothesis:

H(x) {
  chance = 1;
  for t = 1, ..., T {
    if (ht(x) > 0) {
      chance *= (1-et)/et;
    } else {
      chance *= et/(1-et);
    }
  }
  return sign(chance - 1)
}
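To check that this version hangs together, here is a small runnable sketch of it in Python (my own translation, not from the book). It assumes numpy and uses single-feature threshold stumps as the weak hypotheses, which is just a convenient stand-in for "the basic algorithm"; the boosting part doesn't care what the basic algorithm is.

import numpy as np

def train_stump(X, y, D):
    # The "basic algorithm": pick the feature, threshold and sign with the
    # lowest weighted error et under the current weights D.
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (+1, -1):
                h = np.where(X[:, f] > thr, sign, -sign)
                e = D[h != y].sum() / D.sum()
                if best is None or e < best[0]:
                    best = (e, f, thr, sign)
    return best  # (et, feature, threshold, sign)

def adaboost(X, y, T):
    D = np.ones(len(y), dtype=float)               # D1(i) = 1
    rounds = []
    for t in range(T):
        e, f, thr, sign = train_stump(X, y, D)
        h = np.where(X[:, f] > thr, sign, -sign)
        D = np.where(h != y, D * (1 - e) / e, D)   # the update from above
        rounds.append((e, f, thr, sign))
    return rounds

def H(x, rounds):
    chance = 1.0
    for e, f, thr, sign in rounds:
        ht = sign if x[f] > thr else -sign
        chance *= (1 - e) / e if ht > 0 else e / (1 - e)
    return 1 if chance > 1 else -1

# A tiny usage example on made-up data:
X = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y = np.array([-1, -1, +1, -1, +1, +1])
rounds = adaboost(X, y, T=5)
print("predicted:", [H(x, rounds) for x in X])
print("actual:   ", list(y))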
Now this form can be translated to the Bayesian approach. I'll be using the form of Bayesian computations that works with weights rather than probabilities. The translation to this form is fairly straightforward:
There are two mutually exclusive Bayesian hypotheses (but to avoid confusion with the AdaBoost term "hypothesis", let's call them "outcomes" here): one that the result is true (1), another one that the result is false (0).
Each "hypothesis" ht(x) in AdaBoost terms becomes an "event" Evt in the Bayesian form (to avoid the terminological confusion I'll call it an event henceforth). Each event is positive: it being true predicts that the true outcome is correct, and vice versa. The training table looks like this:
Weight  Outcome  Ev1 ... EvT
1     * true     1   ... 1
1     * false    0   ... 0
Aside from this, we remember et from all the boosting rounds, and of course the way to compute each of the "events" chosen by the boosting.
Then when we run the model, we get the set of arguments x as the input and start computing the values of the events. When we compute the event Evt, we apply it with the confidence value C(Evt) = (1-et), as described in the fuzzy training logic. As a result, if Evt is true, the weight of the true outcome gets multiplied by (1-et) and the weight of the false outcome gets multiplied by et. If the event is false, the multiplication goes the opposite way.
In the end we compute the probability of the true outcome from the final weights: W(true)/(W(true)+ W(false)). If it's over 0.5, the true outcome wins. This logic is exactly the same as in the function H(x) of the AdaBoost algorithm.
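In code, this run-time computation might look like the following sketch (mine, not from the book), reusing the rounds list produced by the adaboost() sketch above as the list of events with their et values:

# The same decision as H(x), done through the Bayesian weights of the two
# outcomes; returns the probability of the true outcome along with the sign.
def bayes_h(x, rounds):
    w_true, w_false = 1.0, 1.0                    # initial weights of the outcomes
    for e, f, thr, sign in rounds:
        ev = (sign if x[f] > thr else -sign) > 0  # the value of the event Evt
        if ev:
            w_true  *= (1 - e)   # the event says "true", confidence (1-e)
            w_false *= e
        else:
            w_true  *= e         # the event says "false"
            w_false *= (1 - e)
    p_true = w_true / (w_true + w_false)
    return p_true, (1 if p_true > 0.5 else -1)

The ratio w_true/w_false here is exactly the chance computed in H(x), so p_true > 0.5 exactly when chance > 1, and the decision comes out the same, only now with a probability attached.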
Why is (1-et) used as the confidence value? Since it's the fraction of the training cases that got predicted correctly by Evt, it makes sense that we're only that much confident in this event. Well, for the first event it really is that fraction, but for the following events (1-et) is computed from a modified distribution of the training cases, with the adjusted weights. Does it still make sense?
As it turns out, it does. To find out why, we need to look at the weights of the training cases after they pass the filter of the first event. Instead of looking at the training table with two composite rows we need to step back and look at the table with the M original uncombined training cases:
CaseId  Weight  Outcome  Ev1 ... EvT
1       1     * true     1   ... 1
...
M       1     * false    0   ... 0
The cases that have the outcome of true still have 1 for all the events, and the ones with the false outcome have 0 for all the events. If we run the algorithm for an input that matches a particular training case, Ev1 will predict the outcome correctly for some of the training cases and incorrectly for the others. When it predicts correctly, the weight of that training case will be multiplied by the confidence C(Ev1) and will become (1-e1). But when it predicts incorrectly, the weight will be multiplied by e1. The distribution will become skewed. We want to unskew it for computing the confidence of Ev2, so we compensate by multiplying the weights of the cases that got predicted incorrectly by (1-e1)/e1. Lo and behold, it's the same thing that is done in AdaBoost:
if ht(xi) != yi {
  Dt+1(i) = Dt(i) * (1-et) / et
} else {
  Dt+1(i) = Dt(i)
}
So it turns out that on each step of AdaBoost the distribution Dt represents the compensation of the weight adjustments from the application of the previous events, and the value (1-et) is the proper adjusted fraction of the training cases that got predicted correctly. Amazing, huh? I couldn't have come up with this idea from scratch, but given an existing formula, it all fits together.
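A tiny numeric check of this compensation (my own arithmetic, with a made-up e1):

# Suppose e1 = 0.3: after applying Ev1, every correctly predicted case got its
# weight multiplied by 0.7, every incorrectly predicted one by 0.3. Multiplying
# the incorrect ones by (1-e1)/e1 brings them back to the same common factor
# 0.7, so the relative distribution of the cases is restored.
e1 = 0.3
w_correct   = 1.0 * (1 - e1)              # 0.7
w_incorrect = 1.0 * e1 * (1 - e1) / e1    # 0.3 * 0.7/0.3 = 0.7
print(w_correct, w_incorrect)             # the same scale for every case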
What is the benefit of looking at AdaBoost from the Bayesian standpoint? For one, we get the probability value for the result. For another, it looks like AdaBoost can be slightly improved by changing the initial weights of the true and false outcomes from 1:1 to their actual proportions among the M training cases. That's an interesting prediction to test. And for the third, maybe it can be used to improve the use of AdaBoost for multi-class classification. I haven't read the book that far yet, I want to understand the current state of affairs before trying to improve on it.
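If that prediction about the initial weights pans out, the change to the bayes_h() sketch above would be tiny, something like this (just my guess at where the prior would plug in, not a tested improvement):

# Hypothetical tweak (untested, not from the book): initialize the outcome
# weights from their counts among the M training cases instead of 1:1.
def bayes_h_with_prior(x, rounds, n_true, n_false):
    w_true, w_false = float(n_true), float(n_false)
    for e, f, thr, sign in rounds:
        ev = (sign if x[f] > thr else -sign) > 0
        w_true  *= (1 - e) if ev else e
        w_false *= e if ev else (1 - e)
    return w_true / (w_true + w_false)   # probability of the true outcome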