Previously I've been saying that I didn't experience overfitting in the Bayesian models, and pretty much discounted it. Now I've read a model of overfitting in the book on AdaBoost, and I understand why. Here is the gist, with some of my thoughts included.

The overfitting happens when the model starts picking the peculiarities of the particular training set rather than the general properties. It's down to the noise in the data: if the data contains random noise, only the cases without the noise can be predicted well on the general principles, and the noisy ones are bound to be mispredicted. The training data also contains noise. Since the noise is random, the noise in the test data (an in the future real-world cases) won't follow the noise in the training data closely. If the model starts following the noise in the training data too closely, it will mispredict the well-behaved cases in the test data, in addition to the noisy test cases. For all I can tell, this means that the overfitting magnifies the noise in the quadratic proportion, with probabilities:

P(good prediction) = P(no noise in the training data) * P(no noise in the test data)

If the model makes the decisions based on the "yes-no" questions (AKA binary symptoms), picking up the general trends takes a relatively small number of yes-no questions, because their effects are systematic. The effects of the noise are random, so each noisy training case is likely to require at least one extra yes-no question to tell it apart. If there is a substantial number of noisy cases in the training data, a lot of extra questions would be needed to tell them all apart. So the rule of thumb is, if there are few questions in the model compared to the number of the training cases, not much overfitting will happen.

In the models I was working with, there were tens of thousands of the training cases and only hundreds of symptoms. So there wasn't such a big chance of overfitting in general. Even if you say "but we should count the symptoms per outcome", there still were only low hundreds of outcomes, and if we multiply 100 symptoms by 100 outcomes, it's still only 10K decision points in the table, the same order of magnitude as the number of the training cases.

There also was very little noise as such in the data I've dealt with. If you do diagnostics, you get the natural testing: if the fix doesn't work, the client will come back. There is of course the problem of whether you've changed too many parts. It can be controlled to a degree by looking for training only at the cases where the fix was done at the first attempt. Though you still can't get the complete confidence for the cases where more than one part was changed. And of course if you don't look at the cases that required multiple attempts, it means that you're not learning to diagnose the more difficult cases.

But there was a particular kind of noise even in this kind of fairly clean data: the noise of multiple problems occurring or not occurring together in various combinations. If the model is sensitive to whether it had seen a particular combination or not, the randomness of the combinations means that they represent a random noise. And I've spent quite a bit of effort on reducing this dependence in the logic and on reducing this noise by preprocessing the training data. Which all amounts to reducing the overfitting. So I was wrong, there was an overfitting, just I didn't recognize it.

Actually, I think this can be used as a demonstration of the relation between the number of symptoms and amount of overfitting. If we're looking to pick only one correct outcome, the number of questions is just the number of questions, which was in hundreds for me. Much lower than the number of the training cases, and very little overfitting had a chance to happen. Yes, there were hundreds of possible outcomes but only one of them gets picked, and the number of questions that are used is the number of questions that affect it. But if we're looking at picking correctly all the outcomes, the number of questions gets multiplied by the number of outcomes. In the system I worked on, the total was comparable to the number of training cases, and the overfitting became noticeable. It would probably become even worse if the Bayesian table contained the rows not just for the outcomes but for the different ways to achieve these outcomes, like I've described in this series of posts. So with extra complexity you win on precision but the same precision magnifies the effects of overfitting. The sweet spot should be somewhere in the middle and depend a lot on the amount of noise in the data.

## No comments:

## Post a Comment