## Friday, March 18, 2016

I'm reading a book about the "boosting": the machine learning idea that a better algorithm can be built by combining multiple instances of a simple algorithm, each instance trained on the same set of training data but with different weights assigned to each case.

The very basic concept itself goes along the same way as I've been doing manually for the Bayesian system: after a system is trained on a set of training cases, re-test it on the same set of data and find what cases got diagnosed wrong (the proper word for it, as it turns out, is "training error"). Then try to improve the recognition of these cases. But unlike me the boosting does it in an automated way: it doesn't try to change the underlying algorithm, instead it simply produces the second training table, and the end result is produced by running the expert system with both tables then combining their results in a weighted fashion. Boosting can be done for hundreds of rounds, producing the systems that run the basic algorithm hundreds of times. For Bayesian algorithms, I think it might be possible to combine these many tables into one instead of running them separately, although I'm not quite sure about that.

There is also a limitation on the boosting: looks like it's mostly sorted out for the cases where there are only two mutually exclusive hypotheses. Apparently it starts having problems even with multiple mutually exclusive hypotheses, let alone the overlapping ones.

Well, getting back to the book, it's "Boosting", written by the inventors of the concept in general and of the AdaBoost (Adaptive Boost) algorithm in particular, Shapire and Freund.

I've spent some time understanding the AdaBoost algorithm, and I feel that it can be expressed in a simpler form. The authors' definition of the algorithm is like this (minus the mathematical characters that I don't know how to write in HTML and wrote in words or similar Latin characters):

Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:
• Train the basic algorithm using the distribution Dt.
• Get weak hypothesis ht: X -> {-1, +1}.
• Aim: select ht to minimalize the weighted error:
et = Pri~Dt[ht(xi) != yi].
• Choose Alphat = 1/2 * ln((1-et)/et).
• Update, for i = 1, ..., m:
Dt+1(i) = Dt(i)/Zt * {
exp(-Alphat) if ht(xi)=yi;
exp(Alphat) if ht(xi) != yi
}
= Dt(i)*exp(-Alphat*yi*ht(xi)) / Zt
where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (Alphat*ht(x)) ).

Some explanations are in order. The pairs (xi, yi) are the training cases. xi represents the whole set of symptoms, yi represents the outcome (in Bayesian terms, we could also say "hypotheses" but here the word hypothesis has a different meaning). Only two outcomes are possible, and unlike the Bayesian tradition of representing them as 0 and 1, here they are represented as -1 and +1. This representation allows the clever trick in the formula for Dt+1(i), replacing the conditional choice of the sign for Alphat with multiplication of (yi*ht(xi)). If these values don't match, the result will be -1, if they do match then 1. The concept of hypothesis here is different from the Bayesian one, here the "hypothesis" means the whole basic algorithm with a computed training table. ht(xi) means the computation of the outcome by this trained algorithm by feeding the symptoms of a training case into it.

The epic expression Pri~Dt[ht(xi) != yi] means the weighted probability of a training case outcome being computed incorrectly when it's fed back into the basic algorithm with the current round's training table (the training table is computed independently in each round based only on the set of training cases and on the distribution of their weights for this round). Basically, it's the relation of the weight of incorrectly computed cases to the total weight of all cases. The underlying basic algorithm tries to make up its table to minimize this number to its best ability. But since it's not perfect and might be far from perfect, the number of erroneous cases is unlikely to become 0.

There are T rounds of boosting. Each round trains the same basic algorithm on the same set of training cases but assigns different weights to the importance of these cases, according to Dt. The weights in Dt are adjusted for each round such as to sum up to 1. Zt is thus the "normalization factor": the sum of values in Dt+1 before this adjustment. From the algorithmic standpoint, there is no good reason to do this normalization: weights are weights, who cares if they sum up to 1 or not, this algorithm doesn't depend on it in any way. There is a reason why the authors use the Zt as it is, because it turns up in the proofs about the properties of the algorithm. But here it's easier to place it into the calculation of et.

Note also that Alphat is first computed as a logarithm, and then an exponent is taken on it. This logarithm and exponent cancel each other out except for the sign. This and the manipulation of the sign by (yi*ht(xi)) look mathematically cute but confuse the hell out of it. Well, they have a reason for using a logarithm because they use this Alphat to prove some properties, but like Zt in this place it only complicates things.

Getting rid of all these complications, here is the simplified version:

Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
• Train the basic algorithm using the weights Dt.
• Get weak hypothesis ht: X -> {-1, +1}.
• Aim: select ht to minimalize the weighted error et:
Wgoodt = 0; Wbadt = 0;
for i = 1, ..., m {
if ht(xi) = yi {
Wgoodt += Dt(i)
} else {
}
}
• Update,
for i = 1, ..., m {
if ht(xi) != yi; {
Dt+1(i) = Dt(i) * sqrt((1-et)/et)
} else {
Dt+1(i) = Dt(i) * sqrt(et/(1-et))
}
}
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (ln(sqrt((1-et)/et))*ht(x)) ).

I think it becomes much easier to understand in this form. In particular, it becomes much easier to see an important property: each round adjusts the weights such as if the last table were used on them, it would produce the weighted error of exactly 50%, which would mean that it has no idea what is going on. The following round is then forced to come up with the new solution and scrape up a new way to do some differentiation. This is where the "Adaptive" in the algorithm's name comes from. But they say that this is not how they've come up with this formula, they say that they've come up with this particular value of Alphat in order to minimize Zt. I don't understand yet, how exactly they've done it. I've tried to compute it and I get a different answer. I must be making an error somewhere, need to think more about it. And they try to minimize Zt because it represents the upper bound on the training error (again, I can follow the explanation but I can't tell on my own yet, exactly why, will need to read this section a couple more times).

I don't understand yet the motivation for choosing the per-round multipliers in H(x) either. I can see some things about it. For example, the logarithm nicely handles the case of et > 0.5. If more than half the cases get misclassified by the basic algorithm, this situation is easy to improve: just flip the sign of the outcomes, and suddenly less than half of them will be misclassified. In this case the value under the logarithm will be less than 1, and the value of the logarithm will be negative, thus automatically flipping the sign! But I don't know why the logarithm is the right choice the rest of the time.

We could also replace (1-et)/et with Wgoodt/Wbadt, and et/(1-et) with Wbadt/Wgoodt but this might be detracting too much from the properties of et.

P.S. Read more by the tag learning_boosting.