Saturday, August 27, 2016

AdaBoost 9: Boost by majority afterthought

After some time I've realized that all this monkeying with the conditional probabilities in the Bayesian table is not necessary. You can just throw away a whole training case or a part of it and continue like nothing happened: the probabilities should stay consistent anyway. After all, the point of adjusting the weights after each round opposite to how the run-time weights would be changed is to give each training case an equal chance. But if we don't want to give some training case an equal chance then there is no point in treating it equally: an ignored training case can be simply ignored.

Another thought is that it looks like the two-pass approach can be used to find what training cases to throw away in a dynamic way. We can do it by splitting the set of available cases randomly in half. Then use one half for the first pass of training of N rounds and remember the weights throughout it. Then test the effectiveness of this training on the second half of the cases. But not just run the N rounds in the test. Instead, keep the test results for using only 1 round, 2 rounds, 3 rounds, and so on all the way to the N rounds. Then see the number of rounds on which the test did best, say K rounds. Going back to the training weights, we can find which training cases were not getting guessed well at K rounds, and mark them as outliers. Then repeat the same thing swapping the two halves, and find the outliers in the second half. Then throw away the outliers and do the second pass of training on the rest of the cases.
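
The selection of K and the marking of the outliers for one half could be sketched like this. It is only an illustration: the names `best_round_count` and `mark_outliers`, the layout of the inputs, and the rule "weight above `factor` times the mean weight" as the meaning of "not getting guessed well" are all assumptions made for the sake of the example. The numbers are made up.

```python
def best_round_count(test_errors):
    """Return K, the number of rounds (1-based) with the lowest test error.

    test_errors[i] is the error rate on the test half when only the
    first i+1 rounds of the trained formula are used.
    """
    best_index = min(range(len(test_errors)), key=lambda i: test_errors[i])
    return best_index + 1


def mark_outliers(weights_by_round, k, factor=3.0):
    """Return the indexes of the training cases whose weight after round k
    exceeds `factor` times the mean weight (an assumed rule, not one from
    the book).

    weights_by_round[r][i] is the weight of case i after round r+1.
    """
    weights = weights_by_round[k - 1]
    mean = sum(weights) / len(weights)
    return [i for i, w in enumerate(weights) if w > factor * mean]


# Hypothetical data: 4 rounds, 5 training cases in the first half.
test_errors = [0.30, 0.22, 0.25, 0.27]
weights_by_round = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.5, 0.6, 0.5, 0.4, 8.0],
    [0.3, 0.4, 0.3, 0.2, 20.0],
    [0.2, 0.2, 0.2, 0.1, 50.0],
]
k = best_round_count(test_errors)
outliers = mark_outliers(weights_by_round, k)
print(k, outliers)
```

The same two calls would then be repeated with the halves swapped, and the union of the two outlier lists removed before the second pass.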

Saturday, August 20, 2016

AdaBoost 8: Boost by majority

When I wrote before

The premise of boosting is that we're able to find a number of methods (what they call "hypotheses" in AdaBoost) to predict the correct outcomes of the training cases, each method correctly predicting more than 50% of the training cases. Then if we collect a large-ish number of these methods, we can predict the correct outcomes of all the training cases simply by averaging the predictions of these methods. And the other cases will follow the training cases (unless an overfitting happens). Since more than 50% of the cases are correctly predicted by each method, after the averaging more than 50% of the votes for each training case will be correct, and thus the result will be correct too. Of course this depends on the correct predictions being distributed pretty evenly among the cases. If we have a thousand methods that predict correctly the same cases and incorrectly the other cases, obviously after averaging these other cases will still be predicted incorrectly. So the selection of methods must somehow shuffle the preference for the cases, so that the next picked method will predict well the cases that have been predicted poorly by the previous picked methods. That's it, that's the whole basic idea. It's a bit of an oversimplification but it's easy to understand.

I really did mean it as an oversimplification, since AdaBoost uses the Bayesian decisions to do much better than the simple majority counting. Little did I know that there actually is the method of Boost By Majority (BBM) that does just the counting. It has some other differences but more about that later.

The simple averaging can be simulated through the Bayesian means too. Just use the same confidence for each event. Incidentally, that's what the NonAdaBoost algorithm, also known as epsilon-Boost does: it looks for the weak hypotheses that have at least a fixed "edge" gamma (i.e. the probability of the right guess being at least 0.5+gamma) and then always sets the confidence C=0.5+gamma, and uses the same value to adjust the weights of the training cases.

The NonAdaBoost is essentially a version of AdaBoost with a fixed confidence, and suffering from this defect. But Boost By Majority has another big difference: the way it adjusts the weight of the training cases. The formula it uses is pretty complicated, so I won't even try to reproduce it here. But here is the gist: it keeps track of how many rounds are left to the end of the boosting and what is the balance of the votes collected by each training case. If the absolute value of the balance is higher than the number of rounds left, it means that the fate of this case can't be changed any more: it's either guaranteed to be guessed right if the balance is positive or guaranteed to be guessed wrong if the balance is negative, so the algorithm gives up on these cases. It gives the most weight to the most undecided cases, and spreads the rest of the weights in the bell shape of the binomial distribution. The result is that unlike AdaBoost and NonAdaBoost, BBM doesn't concentrate on the hardest cases that are likely to be the noise in the data anyway, and thus reduces the overfitting.
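
The give-up rule alone can be sketched in a few lines. This is not the BBM weight formula, which is not reproduced here, only the check for whether a case is already decided. The function name `fate` and the representation of the balance as "right votes minus wrong votes" are assumptions for illustration.

```python
def fate(balance, rounds_left):
    """Classify a training case by its vote balance (right votes minus
    wrong votes so far) and the number of rounds still to run."""
    if abs(balance) > rounds_left:
        # The remaining votes can't flip the sign of the balance any more.
        return "right" if balance > 0 else "wrong"
    return "undecided"


print(fate(5, 3))    # the 3 remaining votes can't overturn a balance of +5
print(fate(-4, 3))   # decided in the other direction
print(fate(2, 3))    # still open
```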

The last chapter of the book is about a combination of AdaBoost and BBM called BrownBoost (from the Brownian motion), or also "boosting in continuous time". It starts with the idea that if the returned partial hypothesis has a higher edge than minimally needed, it might still have enough edge after the re-weighting of the training cases, and then it can be directly reused on the next round too without searching for a new one, and so on until its edge wears off. This gets developed into an algorithm that uses a real number in the range [0,1) instead of the round count, with the actual rounds moving the current point on it by the varying amounts. The speed of the movement is determined by the pre-set desired training error. This training error gets reached when the end of the range is reached. If the target error is set to 0, the algorithm behaves in the same way as AdaBoost.

The downside is that the algorithm is complex: there isn't even a formula for determining the confidence values for each partial hypothesis. Instead you get a system of two equations that connect this confidence value and the advancement through the range, to be solved numerically. In the great scheme of things it's not a big deal: compared to the finding of the partial hypotheses this shouldn't be such a big overhead. But it's not easy to think of. And the proof of the validity of this algorithm is quite complicated.

I can't help thinking of a couple of simpler ideas.

The first idea, or should I say guess, is that when we do AdaBoost, it fits into a Bayesian model. So if we keep this Bayesian model consistent, the boosting should still be valid. Obviously, I have no proper proof of that but it looks like a reasonable assumption. There is an easy way to take some training case (or a part of its weight) out of rotation and still keep the Bayesian model consistent.

Remember, previously I've described that we start with a Bayesian table that contains each individual training case

CaseId Weight Outcome  Ev1 ... EvT
1       1 *    true     1   ... 1
M       1 *    false    0   ... 0

Which then gets conflated into a two-line table, all the cases with the true outcome combined into one line, and the false ones into another line. The conditional probabilities in the table get averaged during conflation but since it's an averaging of all ones (or all zeroes), the result is still one (or zero).

Weight Outcome  Ev1 ... EvT
1 *    true     1   ... 1
1 *    false    0   ... 0

To take a training case out of rotation, we just change its conditional probability in all the following events (that match the AdaBoost's partial hypotheses) to 0.5. This says that no future events will affect it. And accordingly we set the AdaBoost weight of it to 0. Such decisions can be made according to any algorithm, matching any desired curve.

For example, suppose that we decide to take a case out of rotation when its weight relative to the sum of all weights reaches 0.1 (this is probably not a very good rule, since it allows at most 10 cases to be excluded, but simple for the sake of demonstration). Suppose that it's a case with the true ("1") outcome. And suppose that the weights of all the true cases total to the same value as those of all the false cases, each group having the relative weight of 0.5 (not very likely in reality but just as good a number as any other).

After the disablement, the conditional probability of the true cases will become ((0.5*0.1) + (1 * 0.4))/0.5 = 0.9.
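
The same arithmetic as a small sketch, with the conflation written as a weighted average of the conditional probabilities. The helper name `conflated_probability` is made up for this example.

```python
def conflated_probability(parts):
    """Weighted average of the conditional probabilities.

    parts is a list of (conditional_probability, relative_weight) pairs
    for the training cases that get conflated into one table row.
    """
    total_weight = sum(w for _, w in parts)
    return sum(p * w for p, w in parts) / total_weight


# The disabled case: probability 0.5 at the relative weight 0.1;
# the rest of the true cases: probability 1 at the relative weight 0.4.
p_true = conflated_probability([(0.5, 0.1), (1.0, 0.4)])
print(round(p_true, 2))
```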

Weight Outcome  Ev1 ... EvN-1 EvN ...
1 *    true     1   ... 1     0.9 ...
1 *    false    0   ... 0     0   ...

Once a training case gets disabled, it stays disabled for all the future rounds, and the AdaBoost keeps acting only on the weights of those training cases that still have 1 or 0 for the conditional probability. Obviously, the more you disable, the less will be the effect of the following rounds when the Bayesian computation runs.

Interestingly though the edges of the partial hypotheses will be higher. Remember, the training cases that get thrown away get it for being difficult to predict. So suppose the partial hypothesis EvN would have returned the confidence of 0.8 if that case wasn't thrown away and had guessed that case wrong. When we throw away the stubborn case, that would become 0.8 out of former 0.9, so the confidence becomes 0.8/0.9 = 0.89, an improvement!
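
A sketch of this recomputation, assuming that the weights are normalized to the total of 1 and that the confidence is simply the share of the weight that was guessed right. The function name and its parameters are hypothetical.

```python
def confidence_after_removal(confidence, removed_weight, removed_guessed_right=False):
    """Confidence of a partial hypothesis after one case is thrown away.

    confidence is the share of the total weight (taken as 1) guessed right
    before the removal; removed_weight is the relative weight of the
    removed case.
    """
    right = confidence
    if removed_guessed_right:
        # A correctly guessed case takes its weight out of the "right" share.
        right -= removed_weight
    return right / (1.0 - removed_weight)


# The stubborn case of the weight 0.1 was guessed wrong by EvN.
new_confidence = confidence_after_removal(0.8, 0.1)
print(round(new_confidence, 2))
```

If the thrown-away case had been guessed right instead, the same function shows the confidence going down: (0.8 - 0.1) / 0.9 is about 0.78.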

However all this throwing-away has no effect on the previous rounds, these are already set in stone. Which begs an easy solution which is my second idea: why not do two passes of AdaBoost? After the first pass look at the final weights of the training cases to determine the most stubborn ones. Throw them away and do the second pass from scratch. After all, BrownBoost requires an adjustment of the target training error which gets done by running it multiple times with the different values and then selecting the best one. Doing two passes of AdaBoost isn't any worse than that.
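
The step between the two passes could look like the sketch below. How many cases count as "the most stubborn" is not specified above, so the `fraction` parameter is an assumption, as are the function name and the made-up weights.

```python
def drop_most_stubborn(cases, final_weights, fraction=0.1):
    """Return the cases left after removing the given fraction of them
    that ended the first pass with the highest weights."""
    n_drop = int(len(cases) * fraction)
    # Indexes ordered from the highest final weight to the lowest.
    order = sorted(range(len(cases)), key=lambda i: final_weights[i], reverse=True)
    dropped = set(order[:n_drop])
    return [c for i, c in enumerate(cases) if i not in dropped]


# Hypothetical first-pass result: case "c" has accumulated a huge weight.
cases = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
final_weights = [0.2, 0.1, 9.0, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.1]
kept = drop_most_stubborn(cases, final_weights, fraction=0.1)
print(kept)
```

The second pass would then be run from scratch on `kept`, with all the weights reset to their initial values.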

Saturday, August 6, 2016

Bayes 23: enumerated events revisited

Another idea that I've glimpsed from the book on boosting is the handling of the enumerated events, previously described in the part 19. The part 6 of my notes on boosting describes how the decision stumps can be treated as only "half-stumps": actionable if the answer is "yes" and non-actionable if the answer is "no" (or vice versa). This is actually the same thing as I complained of before as the mistreatment of the Bayesian formula where the negative answer is treated the same as the lack of answer. But taken in the right context, it makes sense.

If we take a complementary pair of such half-symptoms (asking the same question, with one of them equating the negative answers with "don't know", and another one equating the positive answers with "don't know"), their combined effect on the probability of the hypotheses will be exactly the same as of one full symptom. In the weight-based model, the weights of the hypotheses after the complementary pair will be only half of those after one full symptom but they will all be scaled proportionally, making no difference. Alternatively, if we equate the negative answers not with "don't know" but with "irrelevant", even the weights will stay the same.
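
A small check of this claim in the weight-based model, assuming that "don't know" multiplies every weight by 0.5 and that "irrelevant" leaves the weights untouched. The function names and the two-hypothesis numbers are invented for the example.

```python
def apply_full(weights, p_event, answer):
    """Weight update by a full symptom; answer is True or False."""
    return [w * (p if answer else 1.0 - p) for w, p in zip(weights, p_event)]


def apply_half(weights, p_event, answer, acts_on, skip="dont_know"):
    """Weight update by a half-symptom that acts only on the answer `acts_on`."""
    if answer == acts_on:
        return apply_full(weights, p_event, answer)
    if skip == "dont_know":
        return [w * 0.5 for w in weights]
    return list(weights)  # "irrelevant": the symptom is simply skipped


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


weights = [1.0, 1.0]   # two hypotheses with equal starting weights
p_event = [0.9, 0.2]   # P(E|H) for each hypothesis
answer = True

full = apply_full(weights, p_event, answer)
pair = apply_half(weights, p_event, answer, acts_on=True)
pair = apply_half(pair, p_event, answer, acts_on=False)
pair_irrelevant = apply_half(weights, p_event, answer, acts_on=True, skip="irrelevant")
pair_irrelevant = apply_half(pair_irrelevant, p_event, answer, acts_on=False, skip="irrelevant")

print(full, pair, pair_irrelevant)
print(normalize(full), normalize(pair))
```

The weights after the pair come out as half of the weights after the full symptom, and the normalized probabilities coincide.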

The interesting thing is that these half-symptoms can be straightforwardly extended to the multiple-choice questions. Each choice can be equated with one half-symptom. So if the answer to this choice is "yes" then it takes effect, if "no" then it gets skipped. In the end exactly one choice takes effect. Or potentially the answer can also be "none of the above" and then this symptom will be simply skipped. It should also be relatively straightforward to accommodate the answers like "it's probably one of these two", taking both answers at half-weight. I didn't work through the exact formulas yet but I think it shouldn't be difficult.

The approach of taking the answer at a partial weight also provides a possible answer to "should we treat this problem as model-specific or generic?": it allows mixing both together, taking say the model-specific approach at the weight 0.99 and the generic at 0.01. Then if the model-specific approach finds a matching hypothesis, great, if not then the answer found with the generic approach will outweigh it. This weight of the generic approach should be higher than the confidence cap of the "can't happen" answer: the generic weight of 0.01 would probably work decently well together with the capping probability of 0.001.

Bayes 22: overfitting revisited

Previously I've been saying that I didn't experience overfitting in the Bayesian models, and pretty much discounted it. Now I've read a model of overfitting in the book on AdaBoost, and I understand why. Here is the gist, with some of my thoughts included.

The overfitting happens when the model starts picking the peculiarities of the particular training set rather than the general properties. It's down to the noise in the data: if the data contains random noise, only the cases without the noise can be predicted well on the general principles, and the noisy ones are bound to be mispredicted. The training data also contains noise. Since the noise is random, the noise in the test data (and in the future real-world cases) won't follow the noise in the training data closely. If the model starts following the noise in the training data too closely, it will mispredict the well-behaved cases in the test data, in addition to the noisy test cases. For all I can tell, this means that the overfitting magnifies the noise in the quadratic proportion, with probabilities:

P(good prediction) = P(no noise in the training data) * P(no noise in the test data)
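
To get a feel for the numbers, here is the same formula evaluated for a few noise levels, under the simplifying assumption that the training and test data have the same fraction of noisy cases. The function name is made up.

```python
def p_good_prediction(noise_rate):
    """P(good prediction) of a fully overfitted model, assuming the same
    share of noisy cases in the training data and in the test data."""
    p_clean = 1.0 - noise_rate
    return p_clean * p_clean


for noise_rate in (0.05, 0.1, 0.2):
    print(noise_rate, round(p_good_prediction(noise_rate), 4))
```

With 10% of noisy cases this gives 0.81, compared to the 0.9 that would be expected from the noise in the test data alone.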

If the model makes the decisions based on the "yes-no" questions (AKA binary symptoms), picking up the general trends takes a relatively small number of yes-no questions, because their effects are systematic. The effects of the noise are random, so each noisy training case is likely to require at least one extra yes-no question to tell it apart. If there is a substantial number of noisy cases in the training data, a lot of extra questions would be needed to tell them all apart. So the rule of thumb is, if there are few questions in the model compared to the number of the training cases, not much overfitting will happen.

In the models I was working with, there were tens of thousands of the training cases and only hundreds of symptoms. So there wasn't such a big chance of overfitting in general. Even if you say "but we should count the symptoms per outcome", there still were only low hundreds of outcomes, and if we multiply 100 symptoms by 100 outcomes, it's still only 10K decision points in the table, the same order of magnitude as the number of the training cases.

There also was very little noise as such in the data I've dealt with. If you do diagnostics, you get the natural testing: if the fix doesn't work, the client will come back. There is of course the problem of whether you've changed too many parts. It can be controlled to a degree by looking for training only at the cases where the fix was done at the first attempt. Though you still can't get the complete confidence for the cases where more than one part was changed. And of course if you don't look at the cases that required multiple attempts, it means that you're not learning to diagnose the more difficult cases.

But there was a particular kind of noise even in this kind of fairly clean data: the noise of multiple problems occurring or not occurring together in various combinations. If the model is sensitive to whether it had seen a particular combination or not, the randomness of the combinations means that they represent a random noise. And I've spent quite a bit of effort on reducing this dependence in the logic and on reducing this noise by preprocessing the training data. Which all amounts to reducing the overfitting. So I was wrong, there was an overfitting, just I didn't recognize it.

Actually, I think this can be used as a demonstration of the relation between the number of symptoms and amount of overfitting. If we're looking to pick only one correct outcome, the number of questions is just the number of questions, which was in hundreds for me. Much lower than the number of the training cases, and very little overfitting had a chance to happen. Yes, there were hundreds of possible outcomes but only one of them gets picked, and the number of questions that are used is the number of questions that affect it. But if we're looking at picking correctly all the outcomes, the number of questions gets multiplied by the number of outcomes. In the system I worked on, the total was comparable to the number of training cases, and the overfitting became noticeable. It would probably become even worse if the Bayesian table contained the rows not just for the outcomes but for the different ways to achieve these outcomes, like I've described in this series of posts. So with extra complexity you win on precision but the same precision magnifies the effects of overfitting. The sweet spot should be somewhere in the middle and depend a lot on the amount of noise in the data.

AdaBoost 7: multi-class & unseen combinations

The basic idea behind the multi-class (and also multi-label, i.e. where each case may have more than one outcome) AdaBoost can be described as boosting the recognition of all the outcomes in parallel. It takes the "yes or no" dichotomies for all the outcomes, and on each round it tries to find such a partial hypothesis where the sum of Z for all of them is minimal. This is very similar to what was described in the part 6, where multiple ranges were added, each with its own confidence. The difference is in the formula for the final computation: in the multi-class version there is a separate formula for each class that uses only the confidence values that all the training rounds computed for this particular class.

There is also a possible optimization for the situation where there may be only one outcome per case (i.e. single-label), saying that the mapping between the dichotomies used in the logic above and the outcomes doesn't have to be one-to-one. Instead each outcome can be mapped to a unique combination (or multiple combinations) of the dichotomies. The dichotomies can be selected in some smart way, say if we're trying to recognize the handwritten digits, they can be "pointy vs roundy", "a loop on the bottom vs no loop at the bottom" etc. Or just by dividing the outcomes in half in some blind way, like "0,1,2,3,4 vs 5,6,7,8,9", "0,2,4,6,8 vs 1,3,5,7,9" etc.
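
A minimal sketch of such a mapping for the digits, using four blind dichotomies taken from the bits of the binary representation of the digit (this particular choice is an assumption made only because it is guaranteed to give a unique combination per digit). The decoding by the smallest number of mismatches is also just one simple possibility, not something prescribed above.

```python
OUTCOMES = range(10)


def codeword(digit):
    """Answers of the four dichotomies for a digit:
    dichotomy k asks "is bit k of the digit set?"."""
    return tuple((digit >> k) & 1 for k in range(4))


CODEBOOK = {d: codeword(d) for d in OUTCOMES}


def decode(answers):
    """Pick the outcome whose codeword differs from the answers of the
    dichotomy classifiers in the fewest positions."""
    def distance(d):
        return sum(a != c for a, c in zip(answers, CODEBOOK[d]))
    return min(OUTCOMES, key=distance)


# The dichotomy classifiers answered (1, 1, 1, 0): bits 0, 1 and 2 are set.
print(decode((1, 1, 1, 0)))
```

With only four dichotomies there is no redundancy, so a single wrong dichotomy may produce a wrong digit. Adding more dichotomies would let the decoding tolerate some mistakes.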

Returning to the multi-label situation, one of the problems I've experienced with it is the ability to recognize the combinations of outcomes that weren't present in the training data. That is, the outcomes were present but none of the training cases had exactly this combination. For the basic diagnostics, this can be discounted by saying "but what's the percentage of such cases" but when you start pushing the quality of diagnosis around 95%, it turns out that a great many of the remaining misdiagnosed cases fall into this category.

AdaBoost doesn't have any built-in solution for this problem. The solution it produces is only as good as the underlying algorithm. There is nothing in AdaBoost that puts the pressure on the underlying algorithm to recognize the combinations that aren't present in the training data. If the underlying algorithm can do it anyway (and perhaps despite the pressure from AdaBoost), the resulting combined formula will be able to do it too. If it can't then the combined formula won't either. The simple algorithms like the decision stumps can't.

But maybe some multi-pass schemes can be devised. Run the boosting once, get a set of the candidate symptoms (i.e. partial hypotheses). Use these symptoms on the training cases to try to differentiate, which symptom is related to which outcome. Then run the boosting the second time from scratch, only this time with the relevance knowledge mixed in: whenever a symptom that is close to an irrelevant one is tested on a training case, make it return "I don't know", i.e. the confidence 0.5. This will shift the choice of symptoms. Obviously, if using the resulting formula, the same pruning of irrelevance has to be applied there in the same way. The symptoms from the second pass can be re-tested for relevance, and if any change is found, the third pass can be made, and so on.

Or even better, perhaps this logic can be merged directly into each round of the underlying algorithm in one pass of AdaBoost: when a candidate partial hypothesis (i.e. symptom) is found, measure its relevance right there and change its Z-value accordingly. Pick the candidate that has the lowest Z-value even after it has been corrected for the relevance. Include the relevance information into the partial hypothesis.

Tuesday, August 2, 2016

Bayes 21: mutual exclusivity and independence revisited

I've been thinking about the use of AdaBoost on the multi-class problems, and I've accidentally realized what is going on when I've used the combination of the mutually exclusive and of independent computation of hypotheses, as described in the part 10.

Basically, the question boils down to the following: if the mutually exclusive computation shows that multiple hypotheses have risen to the top in about equal proportions (and consequently every one of them getting the probability below 0.5), how could it happen that in the independent computation each one of them would be above 0.5? If the training data had each case resulting in only one true hypothesis, the probabilities computed both ways would be exactly the same, in the independent version the probability of ~H being exactly equal to the sum of probabilities of the other hypotheses.

The answer lies in the training cases where multiple hypotheses were true. If we use the weight-based approach, it becomes easy to see that if a new case matches such a training case, it would bring all the hypotheses in this case to the top simultaneously. So the independent approach simply emphasizes the handling of these training cases. Equivalently, such training cases can be labeled with pseudo-hypotheses, and then the weight of these pseudo-hypotheses be added to the "pure" hypotheses. For example, let's consider re-labeling of the example from the part 10:

# tab09_01.txt and tab10_01.txt
              evA evB evC
1 * hyp1,hyp2 1   1   0
1 * hyp2,hyp3 0   1   1
1 * hyp1,hyp3 1   0   1

Let's relabel it as:

          evA evB evC
1 * hyp12 1   1   0
1 * hyp23 0   1   1
1 * hyp13 1   0   1

The probabilities of the original hypotheses can then be postprocessed as:

P(hyp1) = P(hyp12)+P(hyp13)
P(hyp2) = P(hyp12)+P(hyp23)
P(hyp3) = P(hyp23)+P(hyp13)

And this would give the same result as the independent computations for every hypothesis.
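
The postprocessing itself can be sketched on the relabeled table as follows. The sketch computes the mutually exclusive probabilities of the pseudo-hypotheses with the weight-based approach and then sums them up. The function names and the way of passing the known events as a dictionary are assumptions for this example, and it expects that at least one training case is compatible with the known events.

```python
TRAINING = [
    # (weight, pseudo-hypothesis, (evA, evB, evC))
    (1.0, "hyp12", (1, 1, 0)),
    (1.0, "hyp23", (0, 1, 1)),
    (1.0, "hyp13", (1, 0, 1)),
]
MEMBERS = {
    "hyp12": ("hyp1", "hyp2"),
    "hyp23": ("hyp2", "hyp3"),
    "hyp13": ("hyp1", "hyp3"),
}


def pseudo_probabilities(known_events):
    """Mutually exclusive probabilities of the pseudo-hypotheses.

    known_events maps the index of an event (0 for evA, 1 for evB,
    2 for evC) to its observed value, 0 or 1.
    """
    weights = {}
    for weight, name, events in TRAINING:
        for index, value in known_events.items():
            p = events[index]  # P(E|H), always 0 or 1 in this table
            weight *= p if value else 1 - p
        weights[name] = weight
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def original_probabilities(pseudo):
    """Add up the pseudo-hypotheses that contain each original hypothesis."""
    result = {"hyp1": 0.0, "hyp2": 0.0, "hyp3": 0.0}
    for name, p in pseudo.items():
        for hyp in MEMBERS[name]:
            result[hyp] += p
    return result


# Only evA is known, and it is true.
probs = original_probabilities(pseudo_probabilities({0: 1}))
print(probs)
```

Note that the resulting probabilities of hyp1, hyp2 and hyp3 add up to more than 1, which is exactly what the independent computation allows and the mutually exclusive one doesn't.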

So this approach works well for when the combined set of symptoms for multiple hypotheses had been seen in training, and not much good for the combinations that haven't been seen in the training. The combined use of the independent and mutually-exclusive combinations with a low acceptance threshold for the independent computations tempers this effect only slightly.