## Tuesday, October 6, 2015

### Bayes 3: the magic formula

The next thing is the conditional probability. It's written as P(H|E) or P(E|H).

P(H|E) means the probability of the hypothesis H being true given that we know the event E is true.

P(E|H) means the probability of the event E being true given that the hypothesis H turns out to be true.

And now finally the magic formula discovered by Bayes:

P(H|E) = P(H) * P(E|H) / P(E)

It tells us how the probability of a hypothesis gets adjusted when we learn some related data. I haven't watched "House, MD" closely enough to give an example from the medical area, so I'll give one about car repair.

Suppose our hypothesis is "the battery is discharged". One way to test it is to turn on the headlights (with the engine not running) and see whether they come on at full brightness. The headlights coming on at full brightness would be our event E.

Suppose our original ("prior") probability P(H) was 0.1. And suppose the probability of the headlights coming on at full brightness in general, P(E), was 0.7. You might think that no way would the headlights shine brightly with a discharged battery, but the headlights consume a lot less current than the starter, and the battery might be capable of providing that small amount of current without too much of a voltage drop, so the headlights might still look decently bright to the eye. So let's say that P(E|H), the probability of the headlights shining brightly with a discharged battery, is 0.05.

Where do all these numbers come from? Right now I've pulled them out of my imagination. In a real expert system they would come from two sources: the training data and the computation of the previous events. Some people might say that the numbers might come from expert opinions, but spit them in the eye: getting good probabilities without a good set of training data is pretty much impossible. We'll get to the details of the training data a little later; for now let's continue with the example.

So we've tested the headlights, found them shining bright, and now can plug the numbers into the formula:

P(H|E) = 0.1 * 0.05 / 0.7 = 0.00714
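
The same arithmetic as a minimal Python sketch. The function name `posterior` and its parameter names are made up here for illustration; the numbers are the ones from the battery example.

```python
def posterior(p_h, p_e_given_h, p_e):
    """Bayes formula: P(H|E) = P(H) * P(E|H) / P(E)."""
    return p_h * p_e_given_h / p_e


# Numbers from the battery example (imagined values, not real training data).
p_h = 0.1           # prior P(H): the battery is discharged
p_e_given_h = 0.05  # P(E|H): headlights bright despite a discharged battery
p_e = 0.7           # P(E): headlights bright in general

print(round(posterior(p_h, p_e_given_h, p_e), 4))  # prints 0.0071
```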

As you can see, the probability of this hypothesis took a steep plunge. Why did it happen? Basically, the adjustment depends on the relation between how probable this event is in general (P(E)) and how probable it is if this hypothesis is true (P(E|H)). The formula can be rearranged to emphasise this:

P(H|E) = P(H) * ( P(E|H) / P(E) )

If P(E|H) is greater than P(E), then the probability of the hypothesis gets pulled up. If it's less, then the probability gets pulled down.
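
A small sketch of this rule, with the ratio P(E|H) / P(E) treated as a multiplier on the prior. The helper names `adjustment_factor` and `direction` are hypothetical, and the value 0.9 in the second call is an invented P(E|H) for some other hypothesis that makes bright headlights more likely than usual.

```python
def adjustment_factor(p_e_given_h, p_e):
    """The ratio P(E|H) / P(E) that multiplies the prior P(H)."""
    return p_e_given_h / p_e


def direction(p_e_given_h, p_e):
    """Tell which way the event E moves the probability of H."""
    factor = adjustment_factor(p_e_given_h, p_e)
    if factor > 1:
        return "up"
    if factor < 1:
        return "down"
    return "unchanged"


print(direction(0.05, 0.7))  # the discharged battery example: prints down
print(direction(0.9, 0.7))   # invented P(E|H) above P(E): prints up
```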

What if we try the experiment and the result is false? Obviously, we can't just use this formula, since E is not true. But then the event ~E will be true, and we should instead use the similar formula for ~E and also adjust the probability of the hypothesis:

P(H|~E) = P(H) * P(~E|H) / P(~E)

Where do P(~E|H) and P(~E) come from? Again, from the training data, and they are connected with the direct values in the obvious way:

P(~E|H) = 1 - P(E|H)
P(~E) = 1 - P(E)

A lot of examples of Bayesian logic I've found on the Internet don't adjust the probabilities if an event is found to be false, and this is very, very wrong. Basically, they treat the situation "the event is known to be false" as "the event is unknown". If you have a spam filter and know that a message doesn't contain the word "viagra", you shouldn't treat it the same way as if you didn't know whether this word is present. If you do, the result will be that you drive the probability of the message being spam way high on seeing any of the suspicious words. In case you wonder whether ignoring the negative event is sheer stupidity, the answer is no, it isn't. There are reasons why people do this: they are trying to compensate for the other deficiencies of the Bayesian method. Only a rare spam message indeed would contain the whole set of suspicious words, so if you properly test each word and apply it as positive or negative, you end up with pretty low probability values. I'll talk more about it later, although I don't have the perfect solutions either.

Continuing the example, suppose the headlights didn't come on brightly, and we have the probabilities:

P(~E|H) = 1 - 0.05 = 0.95
P(~E) = 1 - 0.7 = 0.3

Then we can substitute them into the formula:

P(H|~E) = 0.1 * 0.95 / 0.3 = 0.31667
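
The negative case can be sketched the same way in Python. The function name `posterior_given_not_e` is made up for illustration; it takes the same three inputs as the positive formula and derives the complements from them.

```python
def posterior_given_not_e(p_h, p_e_given_h, p_e):
    """P(H|~E) = P(H) * P(~E|H) / P(~E), built from the complements."""
    p_not_e_given_h = 1 - p_e_given_h  # P(~E|H) = 1 - P(E|H)
    p_not_e = 1 - p_e                  # P(~E) = 1 - P(E)
    return p_h * p_not_e_given_h / p_not_e


# Battery example: the headlights did NOT come on brightly.
print(round(posterior_given_not_e(0.1, 0.05, 0.7), 5))  # prints 0.31667
```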

The hypothesis of a dead battery took quite a boost!

Now let me show you something else. Suppose the prior probability P(H) was not 0.1 but 0.5. We put the data with this single adjustment into the same formula and get:

P(H|~E) = 0.5 * 0.95 / 0.3 = 1.58333

Shouldn't the probability be in the range between 0 and 1? It sure should. If you get a probability value outside of this range, it means that something is wrong with your input data. Either you've used some expert estimates of the probabilities and these experts turned out to be no good (that's why using a proper set of training data is important), or you forgot to adjust the probabilities for the previous computations you've done. Or maybe the rounding errors from the previous computations have accumulated to become greater than the actual values.
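
One possible way to catch such inconsistent inputs up front, sketched in Python. It relies on the law of total probability, P(E) = P(H) * P(E|H) + P(~H) * P(E|~H): since P(E|~H) must lie between 0 and 1, a given P(H) and P(E|H) allow only a limited range of P(E). The function names `consistent_p_e_range` and `check_inputs` are hypothetical, and this is just one sanity check under that assumption, not a complete validation.

```python
def consistent_p_e_range(p_h, p_e_given_h):
    """Range of P(E) values compatible with the given P(H) and P(E|H).

    P(E) = P(H)*P(E|H) + P(~H)*P(E|~H), and P(E|~H) lies in [0, 1].
    """
    low = p_h * p_e_given_h   # case P(E|~H) = 0
    high = low + (1 - p_h)    # case P(E|~H) = 1
    return low, high


def check_inputs(p_h, p_e_given_h, p_e):
    """True if P(E) does not contradict P(H) and P(E|H)."""
    low, high = consistent_p_e_range(p_h, p_e_given_h)
    return low <= p_e <= high


print(check_inputs(0.1, 0.05, 0.7))  # original prior: prints True
print(check_inputs(0.5, 0.05, 0.7))  # prior changed to 0.5: prints False
```

With P(H) = 0.5 and P(E|H) = 0.05, P(E) can be at most 0.525, so the value 0.7 contradicts the other two numbers, and that contradiction is what shows up as a "probability" above 1.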