## Sunday, May 22, 2016

### AdaBoost in simpler formulas 2

I've been reading the book on boosting, and I'm about a third of the way through it :-) I've finally realized an important thing about how H(x) is built.

For easy reference, here is another copy of the AdaBoost algorithm from the previous installment, simplified slightly further by replacing (1-et)/et with Wgoodt/Wbadt, and et/(1-et) with Wbadt/Wgoodt as mentioned at the end of it and getting rid of et altogether:

Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
• Train the basic algorithm using the weights Dt.
• Get weak hypothesis ht: X -> {-1, +1}.
• Aim: select ht to minimize the weighted error Wbadt/Wgoodt:
    Wgoodt = 0; Wbadt = 0;
    for i = 1, ..., m {
        if ht(xi) = yi {
            Wgoodt += Dt(i)
        } else {
            Wbadt += Dt(i)
        }
    }
• Update,
    for i = 1, ..., m {
        if ht(xi) != yi {
            Dt+1(i) = Dt(i) * sqrt(Wgoodt/Wbadt)
        } else {
            Dt+1(i) = Dt(i) * sqrt(Wbadt/Wgoodt)
        }
    }
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (ln(sqrt(Wgoodt/Wbadt))*ht(x)) ).
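To make the listing above concrete, here is a minimal Python sketch of it. The weak hypotheses are threshold "stumps" on one-dimensional data, and the names `make_stump`, `train`, `predict`, and the toy data set are all my own assumptions for illustration, not anything from the book. The selection step simply tries every candidate and keeps the one with the smallest Wbadt/Wgoodt.

```python
import math

def make_stump(threshold, sign):
    """Hypothetical weak hypothesis: `sign` below the threshold, `-sign` otherwise."""
    return lambda x: sign if x < threshold else -sign

def train(xs, ys, candidates, rounds):
    """Simplified AdaBoost loop; returns one (h, w_good, w_bad) triple per round."""
    d = [1.0] * len(xs)  # D1(i) = 1, never normalized
    model = []
    for _ in range(rounds):
        best = None
        for h in candidates:
            w_good = sum(w for w, x, y in zip(d, xs, ys) if h(x) == y)
            w_bad = sum(w for w, x, y in zip(d, xs, ys) if h(x) != y)
            # Smaller Wbad/Wgood wins; cross-multiplied to avoid a division.
            if best is None or w_bad * best[1] < best[2] * w_good:
                best = (h, w_good, w_bad)
        h, w_good, w_bad = best
        if w_bad == 0:
            raise ValueError("perfect weak hypothesis: Wgood/Wbad is infinite")
        up = math.sqrt(w_good / w_bad)
        # Wrongly guessed cases get heavier, correctly guessed ones lighter.
        d = [w * up if h(x) != y else w / up for w, x, y in zip(d, xs, ys)]
        model.append(best)
    return model

def predict(model, x):
    """H(x) = sign(sum over t of ln(sqrt(Wgood_t/Wbad_t)) * h_t(x))."""
    total = sum(math.log(math.sqrt(g / b)) * h(x) for h, g, b in model)
    return (total > 0) - (total < 0)

# Toy data: no single stump gets everything right, three rounds do.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
candidates = [make_stump(t, s) for t in (0.5, 2.5, 4.5) for s in (+1, -1)]
model = train(xs, ys, candidates, rounds=3)
predictions = [predict(model, x) for x in xs]
print(predictions)  # [1, 1, -1, -1, 1, 1]
```

A real implementation would use a smarter base learner and handle the perfect-hypothesis case more gracefully; the sketch only mirrors the steps of the listing.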

I've been wondering, what's the meaning of ln() in the formula for H(x). Here is what it is:

First of all, let's squeeze everything under the logarithm. The first step would be to put ht(x) there:

H(x) = sign(sum for t=1,...,T ln( sqrt(Wgoodt/Wbadt)^ht(x) ) )

This happens by the rule ln(a)*b = ln(a^b).

Since ht(x) can be only +1 or -1, raising the value to that power means that, depending on the result of ht(x), the value is either taken as-is or 1 is divided by it. This is exactly what happens in the computation of Dt+1(i). The two formulas are getting closer.

The next step, let's stick the whole sum under the logarithm using the rule ln(a)+ln(b) = ln(a*b):

H(x) = sign(ln( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) ))
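A quick numeric check of the two logarithm rules used so far, as a small Python sketch. The per-round values in `rounds` are made up for illustration: each pair is a hypothetical Wgoodt/Wbadt ratio together with the vote ht(x) for one fixed input x.

```python
import math

# Hypothetical rounds: (Wgood_t / Wbad_t, h_t(x)) for one input x.
rounds = [(2.0, +1), (3.0, -1), (5.0, +1)]

# Original form: sum of ln(sqrt(ratio)) * h.
sum_form = sum(math.log(math.sqrt(r)) * h for r, h in rounds)

# Rewritten form: ln of the product of sqrt(ratio)^h.
product_form = math.log(math.prod(math.sqrt(r) ** h for r, h in rounds))

print(sum_form, product_form)
print(math.isclose(sum_form, product_form))
```

Both forms give the same number up to floating-point rounding, so moving everything under one logarithm changes nothing in the result.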

The expression under the logarithm becomes very similar to the formula for D as traced through all the steps of the algorithm, up to the final DT+1(i):

DT+1(i) = product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^(-yi*ht(xi)) )

So yeah, the cuteness of expressing the condition as a power comes in handy. And now the final formula for H(x) makes sense: the terms in it are connected with the terms in the computation of D.

The next question: what is the meaning of the logarithm? Note that its result is fed into the sign function. So the exact value of the logarithm doesn't matter in the end result; what matters is only whether it's positive or negative. The value of the logarithm is positive if its argument is > 1, and negative if it's < 1. So we can get rid of the logarithm and write the computation of H(x) as:

if ( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) > 1 ) then H(x) = +1 else H(x) = -1

Okay, if it's exactly = 1 then H(x) = 0 but we can as well push it to +1 or -1 in this case. Or we can write that

H(x) = sign( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) - 1 )

The next thing, we can pull the square root out of the product:

if ( sqrt( product for t=1,...,T ( (Wgoodt/Wbadt)^ht(x) ) ) > 1 ) then H(x) = +1 else H(x) = -1

But since the only operation on its result is the comparison with 1, taking the square root doesn't change the result of this comparison. If the argument of square root was > 1, the result will still be >1, and the same for < 1. So we can get rid of the square root altogether:

if ( product for t=1,...,T ( (Wgoodt/Wbadt)^ht(x) ) > 1 ) then H(x) = +1 else H(x) = -1
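To convince myself that dropping the logarithm and the square root really doesn't change the answer, here is a small Python sketch that compares the original formula with the simplified one for every possible combination of votes. The `ratios` list is a set of made-up Wgoodt/Wbadt values, and the helper names are my own.

```python
import itertools
import math

ratios = [2.0, 3.0, 5.0]  # hypothetical Wgood_t / Wbad_t values

def sign(v):
    return (v > 0) - (v < 0)

def h_original(votes):
    """sign of the sum of ln(sqrt(ratio)) * h, as in the original formula."""
    return sign(sum(math.log(math.sqrt(r)) * h for r, h in zip(ratios, votes)))

def h_simplified(votes):
    """sign of (product of ratio^h) - 1, with no logarithm and no square root."""
    return sign(math.prod(r ** h for r, h in zip(ratios, votes)) - 1)

# Every possible combination of +1/-1 votes from the three weak hypotheses.
all_votes = list(itertools.product((+1, -1), repeat=len(ratios)))
agree = all(h_original(v) == h_simplified(v) for v in all_votes)
print(agree)  # True
```

The chosen ratios never produce a product of exactly 1, so the borderline case doesn't come up in this check.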

The downside of course is that the computation becomes unlike the one for Dt+1(i). Not sure yet if this is important or not.

Either way, we can do one more thing to make the algorithm more readable, we can write the product as a normal loop:

chance = 1;
for t=1,...,T {
    if (ht(x) > 0) {
        chance *= Wgoodt/Wbadt;
    } else {
        chance *= Wbadt/Wgoodt;
    }
}
H(x) = sign(chance - 1)

Note that this code runs not at the training time but later, at the run time, with the actual input data set x. When the model runs, it computes the actual values ht(x) for the actual x and computes the result H(x).
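The same run-time loop as a Python sketch. The `model` here is a hand-written stand-in for what training would record, one (ht, Wgoodt, Wbadt) triple per round; the function name `classify` and all the numbers are assumptions for illustration only.

```python
def classify(model, x):
    """Return (chance, H(x)) for input x; model holds (h, w_good, w_bad) triples."""
    chance = 1.0
    for h, w_good, w_bad in model:
        if h(x) > 0:
            chance *= w_good / w_bad
        else:
            chance *= w_bad / w_good
    return chance, (chance > 1) - (chance < 1)

# Hypothetical trained model with three rounds of boosting.
model = [
    (lambda x: +1, 4.0, 2.0),                     # always votes +1
    (lambda x: +1 if x < 2.5 else -1, 3.0, 1.0),  # +1 for small x
    (lambda x: -1 if x < 4.5 else +1, 5.0, 1.0),  # +1 for large x
]

for x in (1, 3, 5):
    chance, label = classify(model, x)
    print(x, chance, label)
```

For x = 1 the chance works out to 2 * 3 / 5, a bit above 1, so the answer is +1; for x = 3 it is 2 / 3 / 5, well below 1, so the answer is -1.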

I've named the variable "chance" for a good reason: it represents the chance that H(x) is positive. The chance can be expressed as a ratio of two numbers A/B. The number A represents the positive "bid", and the number B the negative "bid". The chance and probability are connected and can be expressed through each other:

chance = p / (1-p)
p = chance / (1+chance)

The chance of 1 matches the probability of 0.5. Initially we have no knowledge about the result, so we start with the chance of 1, and with each t the chance gets updated according to the hypothesis picked on that round of boosting.
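The two conversion formulas, written out as a tiny Python sketch (the function names are mine):

```python
def probability_to_chance(p):
    """chance = p / (1 - p); undefined for p = 1."""
    return p / (1 - p)

def chance_to_probability(chance):
    """p = chance / (1 + chance)."""
    return chance / (1 + chance)

print(chance_to_probability(1.0))   # 0.5
print(chance_to_probability(3.0))   # 0.75
print(probability_to_chance(0.75))  # 3.0
```

A chance of 3 reads as "3 to 1 in favor", which is the probability of 0.75, and a chance below 1 maps to a probability below 0.5.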

The final thing to notice is that in the Bayesian approach we do a very similar thing: we start with the prior probabilities (here there are two possible outcomes, with the probability 0.5 each), and then look at the events and multiply the probability accordingly. At the end we see which hypothesis wins. Thus I get the feeling that there should be a way to express the boosting in the Bayesian terms, for a certain somewhat strange definition of events. Freund and Schapire describe a lot of various ways to express the boosting, so why not one more. I can't quite formulate it yet, it needs more thinking. But the concept of "margins" maps very nicely to the Bayesian approach.

In case you wonder what the margins are: as the rounds of boosting go, after each round H(x) can be computed for each training case xi. At some point they all start matching the training results yi; however, the boosting can be run further, and more rounds can still improve the results on the test data set. This happens because the sign() function in H(x) collapses the details, and the further improvements are not visible on the training data. But if we look at the argument of sign(), such as the result of the logarithm in the original formula, we'll notice that the values keep moving away from the 0 boundary, representing more confidence. This extra confidence then helps make better decisions on the test data. This distance between the result of the logarithm and 0 is called the margin. Well, in the Bayesian systems we also have the boundary of the probability (in the simplest case for two outcomes, 0.5), and when a Bayesian system has more confidence, it drives the resulting probabilities farther from 0.5 and closer to 0 or 1. Very similar.
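Here is a small Python sketch of the margin as described above, i.e. the plain distance of the argument of sign() from 0, tracked round by round for a single training case. The `history` values are invented for illustration, and the book's formal definition of the margin may be normalized differently, so treat this only as a picture of the idea.

```python
import math

# Hypothetical history for one training case with label y = +1:
# (Wgood_t / Wbad_t, h_t(x_i)) for each round t.
y = +1
history = [(2.0, +1), (3.0, +1), (1.5, +1), (2.5, +1)]

total = 0.0
margins = []
for ratio, vote in history:
    total += math.log(math.sqrt(ratio)) * vote
    # Multiply by y so that a positive margin means a correct answer.
    margins.append(y * total)

labels = [(m > 0) - (m < 0) for m in margins]
print(margins)
print(labels)
```

The label is already right after the first round and never changes, but the margin keeps growing with every round, which is the extra confidence hidden by sign().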