## Wednesday, October 28, 2015

### Bayes 11: what it all really means

The Bayes table computation is good at one thing (hm, was the robot Bender built with it?): picking one of a set of mutually-exclusive hypotheses based on the independent events. It's kind of bad on the other things. The other things need to be made more like this one thing before feeding it into the computation.

The "independent events" part means that the table knows only whether an event should happen or not, but nothing about whether two events should be happening together. Let me demonstrate with an example. Suppose the training data consists of these cases:

```# tab11_01.txt
evA evB
1 * hyp1 1   0
1 * hyp1 0   1
1 * hyp2 1   1
1 * hyp2 0   0```

The hypothesis hyp1 happens when evA and evB are different, and hyp2 happens if they are the same. But once we translate these cases into a table of probabilities, this information becomes lost, and hyp1 and hyp2 start looking the same:

```# tab11_01.txt
!,,evA,evB
hyp1,0.5,0.5,0.5
hyp2,0.5,0.5,0.5```

Here is an example of a run:

```# in11_01_01.txt
evA,1
evB,0```

```\$ perl ex06_01run.pl -v tab11_01.txt in11_01_01.txt
!      ,       ,evA    ,evB    ,
hyp1   ,0.50000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,0.50000,
--- Applying event evA, c=1.000000
P(E)=0.500000
H hyp1 *0.500000/0.500000
H hyp2 *0.500000/0.500000
!      ,       ,evA    ,evB    ,
hyp1   ,0.50000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,0.50000,
--- Applying event evB, c=0.000000
P(E)=0.500000
H hyp1 *0.500000/0.500000
H hyp2 *0.500000/0.500000
!      ,       ,evA    ,evB    ,
hyp1   ,0.50000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,0.50000,```

It has no idea, thinks that both hypotheses are equally probable. It's real obvious in this small specially targeted example but if you work with a large table where the results are only sometimes misdirected by this effect, it takes a while to notice.

But don't despair, there is a workaround. We can split each hypothesis in two, such as hyp1 becomes hyp1a and hyp1b, run them through the table independently, and then at the end of computation add up the probabilities of hyp1a and hyp1b together to get the probability of hyp1.

```# tab11_01.txt
evA evB
1 * hyp1a 1  0
1 * hyp1b 0  1
1 * hyp2a 1  1
1 * hyp2b 0  0```

```!,,evA,evB
hyp1a,0.25,1,0
hyp1b,0.25,0,1
hyp2a,0.25,1,1
hyp2b,0.25,0,0```

Now it can pick the result with confidence:

```\$ perl ex06_01run.pl -v tab11_02.txt in11_01_01.txt
!      ,       ,evA    ,evB    ,
hyp1a  ,0.25000,1.00000,0.00000,
hyp1b  ,0.25000,0.00000,1.00000,
hyp2a  ,0.25000,1.00000,1.00000,
hyp2b  ,0.25000,0.00000,0.00000,
--- Applying event evA, c=1.000000
P(E)=0.500000
H hyp1a *1.000000/0.500000
H hyp1b *0.000000/0.500000
H hyp2a *1.000000/0.500000
H hyp2b *0.000000/0.500000
!      ,       ,evA    ,evB    ,
hyp1a  ,0.50000,1.00000,0.00000,
hyp1b  ,0.00000,0.00000,1.00000,
hyp2a  ,0.50000,1.00000,1.00000,
hyp2b  ,0.00000,0.00000,0.00000,
--- Applying event evB, c=0.000000
P(E)=0.500000
H hyp1a *1.000000/0.500000
H hyp1b *0.000000/0.500000
H hyp2a *0.000000/0.500000
H hyp2b *1.000000/0.500000
!      ,       ,evA    ,evB    ,
hyp1a  ,1.00000,1.00000,0.00000,
hyp1b  ,0.00000,0.00000,1.00000,
hyp2a  ,0.00000,1.00000,1.00000,
hyp2b  ,0.00000,0.00000,0.00000,```

The same approach can also be used to solve the problem of the mutually-exclusive vs independent hypotheses. Just assign the cases with two hypotheses to a new pseudo-hypothesis, and then add up the computed probability of this hypothesis to the probabilities of both constituent hypotheses before choosing the winners. The obvious downside is that you end up with more hypotheses to compute.

But there is more. Notice that the table of probabilities looks very much the same as the training cases. That's not coincidental. It's the insight into how the Bayesian logic works, what's the meaning of it. We can get the exact same effect as with the Bayesian computations just by working with the training data and scratching out the lines in it.

```         evA evB
1 * hyp1 1  0
1 * hyp1 0  1
1 * hyp2 1  1
1 * hyp2 0  0```

Each line in it describes some identical cases and the count of these cases. We can also think of this count as the weight of this line. The prior probability of a hypothesis is the sum of the weights (or counts, if you prefer) of all its lines divided by the sum of the weights from all the lines (i.e. the total count of cases in the training data):

`P(H) = sum(weight(H)) / sum(weight(*))`

Then we start applying the events. If the event is true, we throw away all the lines where this event is false. Or if the event is false, we throw away all the lines where this event is true. Technically, we can express this "throwing away" by changing the weight of these lines to 0. Or we can actually remove these lines from the list of the cases.

For example, if evA is true, the table becomes:

```         evA evB
1 * hyp1 1  0
1 * hyp2 1  1```

The probabilities of the hypotheses after applying evA are computed in the same way as before, by dividing the sum of weights for this hypothesis by the sum of all weights. Here each hypothesis had its sum reduced from 2 to 1 but the sum of all weights got also reduced from 4 to 2, so they both stayed at the probability of 0.5. Compare that with the log of the program above.

Looking as the Bayes formula,

`P(H|E) = P(H) * P(E|H) / P(E)`

it says that P(H) gets multiplied by the ratio of the event probabilities. P(E|H) is the ratio of the sum of weights of cases for this hypothesis where the expected event E is true to the sum of weights of all cases for this hypothesis:

`P(E|H) = sum(weight(H,E)) / sum(weight(H))`

P(E) is the ratio of the sum of weights of cases for all hypotheses where the expected event E is true to the sum of weights of all cases altogether:

`P(E) = sum(weight(E)) / sum(weight(*))`

Let's substitute all these values into the formula for P(H|E):

```P(H|E) = ( sum(weight(H)) / sum(weight(*)) )
* ( sum(weight(H,E)) / sum(weight(H)) )
/ ( sum(weight(E)) / sum(weight(*)) )
= ( sum(weight(H)) * sum(weight(H,E)) * sum(weight(*)) )
/ ( sum(weight(*)) * sum(weight(H)) * sum(weight(E)) )
= sum(weight(H,E)) / sum(weight(E))```

But after the lines that don't match E are thrown away, sum(weight(E)) is really the sum of weights of all the remaining lines: sum(weightNew(*)). And sum(weight(H,E)) is really the sum of all the remaining lines for the hypothesis H: sum(weightNew(H)). And their ratio is Pnew(H). Which means that

`P(H|E) = sum(weightNew(H)) / sum(weightNew(*)) = Pnew(H)`

It all matches up, both ways are equivalent! When we compute the probabilities by the bayesian table, what we really do is follow through the list of the training cases and throw away the ones that aren't matching until we're left only with the matching ones. Or maybe with no matching cases at all, that's what happens when we see the division by 0 with the probability tables.

The only major difference is that when we use the classic tables, we scrunch up all the cases for a hypotheses into one line and take the average for whether the event should be present or not as its probability, thus losing any information about its cross-correlation with the other events. And if we follow the training list of cases, we preserve this information. Fundamentally, the bayesian approach turns out to be the same as following the binary decision tree, with some muddling thrown in.

If you're not sure, why is it equivalent to a tree, here is why: because we can organize the training cases in the form of a tree, making the decision one event at a time. The training data in this example can be represented as a tree shown here in ASCII-art:

```|                  |
|                  V
|               0 / \  1
|         +------<evA>------+
|         |       \ /       |
|         V                 V
|     0  / \  1         0  / \  1
|    +--<evB>--+       +--<evB>--+
|    |   \ /   |       |   \ /   |
|    V         V       V         V
|   hyp2      hyp1    hyp1      hyp2
```

Each level of the tree contains the comparisons for the same event but of course the meaning of the result depends on the path taken from the root of the tree, i.e. on the result of examination of the previous events. Some comparison outcomes may not continue to the next levels, meaning that such a combination of events is considered impossible according to the training data.

The meaning of the fuzzy logic with the event confidence can also be easily seen through the training cases: If an event is true with confidence C(E), we multiply the weight of the lines where the event is true by C(E) and the weight of the lines where the event is false by 1-C(E).

If the C(E) happens to be 0.5, all the weights of all the lines will be multiplied by 0.5. But since this also means that the sum of all the weights will also be equal 0.5 of the original ones, the ratios won't change, and all the P(H) will stay the same.

It also gives an insight into the effects of the confidence capping. Suppose we have a training set like this:

```         evA evB
1 * hyp1 1  0
1 * hyp2 0  1
1 * hyp3 0  0```

And the input data is

```evA,1
evB,1```

If we use the cap of 0.01, C(evA) will be set to 0.99. Applying it will result in the weights:

```            evA evB
0.99 * hyp1 1  0
0.01 * hyp2 0  1
0.01 * hyp3 0  0```

Then evB will also be capped to C(evB) = 0.99, and result in:

```              evA evB
0.0099 * hyp1 1  0
0.0099 * hyp2 0  1
0.0001 * hyp3 0  0```

When the model went through the "eye of the needle", the weights of all the lines have suffered badly. But some have suffered much worse than the others. The line for hyp3 didn't match either of the input events, so it got penalized twice, while hyp1 and hyp2 got penalized only once and ended up with the weights that are 99 times higher than hyp3. As far as the probabilities are concerned, hyp1 and hyp2 have grown from about 0.33 to about 0.5, at the expense of hyp3. But knowing their weights, we can say that in the grand scheme of things neither of them looks particularly hot, unless one of these events got misreported or maybe affected by some weird interaction of multiple hypotheses being true.

The absolute weight of the lines remaining at the end of computation can also be seen as an indication of confidence in the result: if the weight of a line is least 1, we can say that the input might plausibly match at least 1 real case from the training data. If it's less than 1, then the result is more like a guess, although it might be a good guess based on the uncertain input data.