Saturday, October 31, 2015

Bayes 13: the relevance

Before showing the code that computes by the table of training cases, I want to talk through one more aspect.

What causes the "impossible" cases that haven't been seen in the training data? Some of them are just rare occurrences that weren't observed during training. Some of them are the results of errors made when evaluating and entering the event data. But in my experience most of them are caused by the situations when multiple hypotheses are true and their evidences interact with each other.

It's not such a big deal for the hypotheses that occur often (i.e. they have the high prior P(H)) and also occur together relatively often. Then there will be the training cases reflecting these interactions, and they can be recognized fairly well. However if you have 200 hypotheses, that means the possibility of 40000 combinations of two of them occurring together. If you have less than 40000 training cases, you're guaranteed not to see them all. And if you remember that the trueness of two hypotheses is usually rather rare, and three or more hypotheses might happen to be true, you won't have a good chance to see all the reasonable combinations until you have millions of training cases.

One way to work around this problem is to try isolating the hypotheses from each other. I've already mentioned it in the part 10: if we can figure out, what events are relevant for a hypothesis, we can ignore the other events that are relevant for the other hypotheses, thus reducing the interaction between these hypotheses.

How can we figure out, which events are relevant and which aren't? One way is to use the human expertise and the knowledge about the structure of the object. Such as, if a brake light on a car is out, it has no connection with the blown engine head gasket.

Some amount of this kind of knowledge engineering is inevitable because the Bayesian logic is notoriously bad at structuring the data. Someone's got to prepare the meaningful events and combine the knowledge into this form before feeding it to this model. For example, the question "is the compression in the rear-most cylinders low?" can be used as an indication that the engine has been mildly overheated. But it combines multiple pieces of knowledge:

• How many cylinders does this car have, and which are the rear-most? Depending on an inline or V (or flat) engine configuration, there might be one or two rear-most cylinders.
• What is the measured compression in these cylinders?
• What is the normal range of compression for this car model?
• Is the coolant flow going from the front of the engine towards the rear? That's the normal situation but in some cars the direction might be opposite, and then we'd have to look at the front-most cylinders instead.

All this would be pretty hard to enter as events into a Bayesian model. But if the data gets combined and pre-processed, it becomes one convenient event. This pre-processing would use the knowledge about the car model to find the cylinders to look at, and then would combine the compression measurement with the lower end of the factory-specified range to give a confidence value that the compression is low.

The problem is obviously that someone has to program this pre-processing. And the same goes for the expert estimation of relevance of symptoms for a hypothesis: someone has to do this work.

The second problem with the expert estimations is that the experts' knowledge might be limited. I believe I've read about the history of the medicine where some easy-to-test symptoms of some illnesses weren't discovered until a much later time, and in the meantime everyone went by the roundabout and unreliable way. But I don't remember the details of this story: what illnesses, what symptoms? Well, we'd like such symptoms to be discovered automatically from the data. But if we force the expert knowledge onto the model, it would never discover such dependencies because it would be forced to follow the human opinions.

So how do we discover this relevance automatically? I don't have a good answer. I have some ideas but they've occurred to me fairly late, and I've had only a very limited time to experiment with some of them, and I haven't done any experimentation with the ones that occurred to me later yet. I'll describe them anyway but not now yet.

Right now let's discuss a different, simpler aspect: suppose we somehow know which events are relevant to which hypotheses, how do we use this knowledge in the processing?

In the part 10 I've shown a way to do it with the independent tables for each hypothesis: to treat an event as irrelevant to a hypothesis, either drop the event from the hypothesis table altogether or set the probabilities of both P(E|H) and P(E|~H) to 0.5.

The computation on training table allows to do the same for the combined table. Just add an additional relevance value to every cell of the table, to every event of every training case. If this value is 1, process the event on this case as before. If this value is 0, leave the weight of this case unchanged when applying this event.

There is also an obvious extension: rather than have the relevance as discrete 0 or 1, make it a number in the range [0..1]. The processing would follow the same principle as for the event confidence: logically split the case into two cases with their relative weights determined by the relevance value R(E|I), one case with the event being fully relevant, another with the event being fully irrelevant, process them separately, and then put back together:

W(I|E) = W(I)*(1 - R(E|I)) + W(I)*R(E|I)*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )

This concept can also be applied to the classic Bayesian probability tables. It's straightforward: treat the prior probabilities as weights, do the same computation, then divide each resulting weight by the sum of all these weights to get the posterior probability.

Bayes 12: fuzzy training

Now let's consider that the training data might also contain the event results with only partial confidence. Previously I've been talking about the confidence only when the model runs and computes its prediction based on the training data. But what is the training data? It's the same kind of information about the cases from the past where we've already ascertained the correct result. Which means that the event information that went into their diagnosis might also be fuzzy. Just as we might be unable to tell the outcome of an event for sure in the future, we might have had the same problem in the past as well. So we need to be able to handle this in the training data.

In the last installment I've shown that during the computation, the confidence C(E) can be handled by splitting the event into a superposition of two events, a positive one that will multiply the weights of the matching cases by C(E) and the negative one that will multiply the weights of the matching cases by 1-C(E). The same approach can be taken to the training data. If we have a case (with the weight of 1) that has an event with a fractional confidence, let's call it TC(E|I) for "training confidence of event E for the case I", it can be seen as two cases: one with the weight TC(E|I) and 1 for the outcome of that event and another one with the weight 1-TC(E|I) and 0 for the outcome of that event. The outcomes of the other events would be copied between both cases.

For example, if we have a case:

Wt        evA evB evC
1 * hyp1  1   0   0.75

we can split it into two cases:

Wt           evA evB evC
0.75 * hyp1  1   0   1
0.25 * hyp1  1   0   0

Incidentally, this is the reverse operation of what happens when we're building a classic probability table out of the individual training cases: there we add up the weights of the individual cases and take the weighted average for the outcomes of every event. Well, obviously, if we combine the individual cases, the original weight of each case will be 1, so the sum of the weights will be equal to the count of the cases, and the P(E|H) is a simple average of the outcomes for this event. But in the generalized form it can handle the other weights as well.

Let's go through the computation needed for processing of these sub-cases. When we get the confidence value C(E) during the computation, the weights of the sub-cases will be modified as follows. The weight of the first sub-case that has the event outcome as 1, will be:

W(I1|E) = W(I1)*C(E)

The weight of the second sub-case that has the event outcome as 0, will be:

W(I2|E) = W(I2)*(1 - C(E))

If it were the first event we looked at, the old weights will be:

W(I1) = TC(E)
W(I2) = 1 - TC(E)

And in this case the results can also be written as:

W(I1|E) = TC(E)*C(E)
W(I2|E) = (1 - TC(E))*(1 - C(E))

If we want to clump them back together into one case, the weight of that case will be:

W(I|E) = W(I1|E) + W(I2|E) = TC(E)*C(E) + (1 - TC(E))*(1 - C(E))

That's assuming that it was only one case split, with the original W(I) = 1. If it was the combination of multiple cases, when we need to take its original weight into account:

W(I|E) = W(I) * ( TC(E)*C(E) + (1 - TC(E))*(1 - C(E)) )

Compare that with the probability formula from the 4th installment:

Pc(P, C) = P*C + (1-P)*(1-C)
P(H|C(E)) = P(H) * Pc(P(E|H), C(E))/Pc(P(E), C(E))

It's the same. The upper half of the fraction mirrors the weight computation directly, and the lower part shows that the sum of the weights of all cases will change in the same way.

We've got two interesting consequences:

First, now we handle the training cases with the fuzzy events.

Second, with this addition the logic for the computation directly by the training cases can also handle the probability table entries, or any mix thereof.

With the computation directly by the training events, there is a problem with handling the new cases that don't exactly match any of the training cases. This is known as over-fitting. With the computation by probability tables, there is the problem of losing the cross-correlation data of the events (the problem of not matching any training case isn't exactly absent but it's at least reduced). With this new tool we can adjust the processing to be at any mid-point in the spectrum between these boundaries, and thus smoothly adjust the level of the fitting.

For example, we can split each case into two identical sub-cases, one with the weight 0.1, another with the weight 0.9. Then combine all the cases with weight 0.1 for the same hypothesis into one mixed-case similarly to how we've computed the probability tables, and leave the cases with weight 0.9 by themselves. This means that the result will be a mix of 10% of the probability-table processing and 90% of the training-event processing. And this split-point can be chosen anywhere within the range [0..1].

Even if you just want to go exclusively by the classic probability tables, this way of processing is more efficient because it saves on division on every step. You don't need to computate the divisor for every event, you don't need to divide every probability value, and you reduce the rounding errors because you skip these extra computations.

Wednesday, October 28, 2015

Bayes 11: what it all really means

The Bayes table computation is good at one thing (hm, was the robot Bender built with it?): picking one of a set of mutually-exclusive hypotheses based on the independent events. It's kind of bad on the other things. The other things need to be made more like this one thing before feeding it into the computation.

The "independent events" part means that the table knows only whether an event should happen or not, but nothing about whether two events should be happening together. Let me demonstrate with an example. Suppose the training data consists of these cases:

# tab11_01.txt
evA evB
1 * hyp1 1   0
1 * hyp1 0   1
1 * hyp2 1   1
1 * hyp2 0   0

The hypothesis hyp1 happens when evA and evB are different, and hyp2 happens if they are the same. But once we translate these cases into a table of probabilities, this information becomes lost, and hyp1 and hyp2 start looking the same:

# tab11_01.txt
!,,evA,evB
hyp1,0.5,0.5,0.5
hyp2,0.5,0.5,0.5

Here is an example of a run:

# in11_01_01.txt
evA,1
evB,0

\$ perl ex06_01run.pl -v tab11_01.txt in11_01_01.txt
!      ,       ,evA    ,evB    ,
hyp1   ,0.50000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,0.50000,
--- Applying event evA, c=1.000000
P(E)=0.500000
H hyp1 *0.500000/0.500000
H hyp2 *0.500000/0.500000
!      ,       ,evA    ,evB    ,
hyp1   ,0.50000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,0.50000,
--- Applying event evB, c=0.000000
P(E)=0.500000
H hyp1 *0.500000/0.500000
H hyp2 *0.500000/0.500000
!      ,       ,evA    ,evB    ,
hyp1   ,0.50000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,0.50000,

It has no idea, thinks that both hypotheses are equally probable. It's real obvious in this small specially targeted example but if you work with a large table where the results are only sometimes misdirected by this effect, it takes a while to notice.

But don't despair, there is a workaround. We can split each hypothesis in two, such as hyp1 becomes hyp1a and hyp1b, run them through the table independently, and then at the end of computation add up the probabilities of hyp1a and hyp1b together to get the probability of hyp1.

# tab11_01.txt
evA evB
1 * hyp1a 1  0
1 * hyp1b 0  1
1 * hyp2a 1  1
1 * hyp2b 0  0

!,,evA,evB
hyp1a,0.25,1,0
hyp1b,0.25,0,1
hyp2a,0.25,1,1
hyp2b,0.25,0,0

Now it can pick the result with confidence:

\$ perl ex06_01run.pl -v tab11_02.txt in11_01_01.txt
!      ,       ,evA    ,evB    ,
hyp1a  ,0.25000,1.00000,0.00000,
hyp1b  ,0.25000,0.00000,1.00000,
hyp2a  ,0.25000,1.00000,1.00000,
hyp2b  ,0.25000,0.00000,0.00000,
--- Applying event evA, c=1.000000
P(E)=0.500000
H hyp1a *1.000000/0.500000
H hyp1b *0.000000/0.500000
H hyp2a *1.000000/0.500000
H hyp2b *0.000000/0.500000
!      ,       ,evA    ,evB    ,
hyp1a  ,0.50000,1.00000,0.00000,
hyp1b  ,0.00000,0.00000,1.00000,
hyp2a  ,0.50000,1.00000,1.00000,
hyp2b  ,0.00000,0.00000,0.00000,
--- Applying event evB, c=0.000000
P(E)=0.500000
H hyp1a *1.000000/0.500000
H hyp1b *0.000000/0.500000
H hyp2a *0.000000/0.500000
H hyp2b *1.000000/0.500000
!      ,       ,evA    ,evB    ,
hyp1a  ,1.00000,1.00000,0.00000,
hyp1b  ,0.00000,0.00000,1.00000,
hyp2a  ,0.00000,1.00000,1.00000,
hyp2b  ,0.00000,0.00000,0.00000,

The same approach can also be used to solve the problem of the mutually-exclusive vs independent hypotheses. Just assign the cases with two hypotheses to a new pseudo-hypothesis, and then add up the computed probability of this hypothesis to the probabilities of both constituent hypotheses before choosing the winners. The obvious downside is that you end up with more hypotheses to compute.

But there is more. Notice that the table of probabilities looks very much the same as the training cases. That's not coincidental. It's the insight into how the Bayesian logic works, what's the meaning of it. We can get the exact same effect as with the Bayesian computations just by working with the training data and scratching out the lines in it.

evA evB
1 * hyp1 1  0
1 * hyp1 0  1
1 * hyp2 1  1
1 * hyp2 0  0

Each line in it describes some identical cases and the count of these cases. We can also think of this count as the weight of this line. The prior probability of a hypothesis is the sum of the weights (or counts, if you prefer) of all its lines divided by the sum of the weights from all the lines (i.e. the total count of cases in the training data):

P(H) = sum(weight(H)) / sum(weight(*))

Then we start applying the events. If the event is true, we throw away all the lines where this event is false. Or if the event is false, we throw away all the lines where this event is true. Technically, we can express this "throwing away" by changing the weight of these lines to 0. Or we can actually remove these lines from the list of the cases.

For example, if evA is true, the table becomes:

evA evB
1 * hyp1 1  0
1 * hyp2 1  1

The probabilities of the hypotheses after applying evA are computed in the same way as before, by dividing the sum of weights for this hypothesis by the sum of all weights. Here each hypothesis had its sum reduced from 2 to 1 but the sum of all weights got also reduced from 4 to 2, so they both stayed at the probability of 0.5. Compare that with the log of the program above.

Looking as the Bayes formula,

P(H|E) = P(H) * P(E|H) / P(E)

it says that P(H) gets multiplied by the ratio of the event probabilities. P(E|H) is the ratio of the sum of weights of cases for this hypothesis where the expected event E is true to the sum of weights of all cases for this hypothesis:

P(E|H) = sum(weight(H,E)) / sum(weight(H))

P(E) is the ratio of the sum of weights of cases for all hypotheses where the expected event E is true to the sum of weights of all cases altogether:

P(E) = sum(weight(E)) / sum(weight(*))

Let's substitute all these values into the formula for P(H|E):

P(H|E) = ( sum(weight(H)) / sum(weight(*)) )
* ( sum(weight(H,E)) / sum(weight(H)) )
/ ( sum(weight(E)) / sum(weight(*)) )
= ( sum(weight(H)) * sum(weight(H,E)) * sum(weight(*)) )
/ ( sum(weight(*)) * sum(weight(H)) * sum(weight(E)) )
= sum(weight(H,E)) / sum(weight(E))

But after the lines that don't match E are thrown away, sum(weight(E)) is really the sum of weights of all the remaining lines: sum(weightNew(*)). And sum(weight(H,E)) is really the sum of all the remaining lines for the hypothesis H: sum(weightNew(H)). And their ratio is Pnew(H). Which means that

P(H|E) = sum(weightNew(H)) / sum(weightNew(*)) = Pnew(H)

It all matches up, both ways are equivalent! When we compute the probabilities by the bayesian table, what we really do is follow through the list of the training cases and throw away the ones that aren't matching until we're left only with the matching ones. Or maybe with no matching cases at all, that's what happens when we see the division by 0 with the probability tables.

The only major difference is that when we use the classic tables, we scrunch up all the cases for a hypotheses into one line and take the average for whether the event should be present or not as its probability, thus losing any information about its cross-correlation with the other events. And if we follow the training list of cases, we preserve this information. Fundamentally, the bayesian approach turns out to be the same as following the binary decision tree, with some muddling thrown in.

If you're not sure, why is it equivalent to a tree, here is why: because we can organize the training cases in the form of a tree, making the decision one event at a time. The training data in this example can be represented as a tree shown here in ASCII-art:

|                  |
|                  V
|               0 / \  1
|         +------<evA>------+
|         |       \ /       |
|         V                 V
|     0  / \  1         0  / \  1
|    +--<evB>--+       +--<evB>--+
|    |   \ /   |       |   \ /   |
|    V         V       V         V
|   hyp2      hyp1    hyp1      hyp2

Each level of the tree contains the comparisons for the same event but of course the meaning of the result depends on the path taken from the root of the tree, i.e. on the result of examination of the previous events. Some comparison outcomes may not continue to the next levels, meaning that such a combination of events is considered impossible according to the training data.

The meaning of the fuzzy logic with the event confidence can also be easily seen through the training cases: If an event is true with confidence C(E), we multiply the weight of the lines where the event is true by C(E) and the weight of the lines where the event is false by 1-C(E).

If the C(E) happens to be 0.5, all the weights of all the lines will be multiplied by 0.5. But since this also means that the sum of all the weights will also be equal 0.5 of the original ones, the ratios won't change, and all the P(H) will stay the same.

It also gives an insight into the effects of the confidence capping. Suppose we have a training set like this:

evA evB
1 * hyp1 1  0
1 * hyp2 0  1
1 * hyp3 0  0

And the input data is

evA,1
evB,1

If we use the cap of 0.01, C(evA) will be set to 0.99. Applying it will result in the weights:

evA evB
0.99 * hyp1 1  0
0.01 * hyp2 0  1
0.01 * hyp3 0  0

Then evB will also be capped to C(evB) = 0.99, and result in:

evA evB
0.0099 * hyp1 1  0
0.0099 * hyp2 0  1
0.0001 * hyp3 0  0

When the model went through the "eye of the needle", the weights of all the lines have suffered badly. But some have suffered much worse than the others. The line for hyp3 didn't match either of the input events, so it got penalized twice, while hyp1 and hyp2 got penalized only once and ended up with the weights that are 99 times higher than hyp3. As far as the probabilities are concerned, hyp1 and hyp2 have grown from about 0.33 to about 0.5, at the expense of hyp3. But knowing their weights, we can say that in the grand scheme of things neither of them looks particularly hot, unless one of these events got misreported or maybe affected by some weird interaction of multiple hypotheses being true.

The absolute weight of the lines remaining at the end of computation can also be seen as an indication of confidence in the result: if the weight of a line is least 1, we can say that the input might plausibly match at least 1 real case from the training data. If it's less than 1, then the result is more like a guess, although it might be a good guess based on the uncertain input data.

Saturday, October 24, 2015

Bayes 10: independence and relevance

Since the Bayes logic by its nature works only with the mutually exclusive hypotheses, another way to use it for the independent hypotheses is by splitting a single probability table into multiple probability tables, one per hypothesis. Each hypothesis will get its own table that deals with only two outcomes: H and ~H. This way the overlap of multiple hypotheses in the same case gets represented accurately.

For example, let's look again at the training data from the previous installment:

# tab09_01.txt and tab10_01.txt
evA evB evC
1 * hyp1,hyp2 1   1   0
1 * hyp2,hyp3 0   1   1
1 * hyp1,hyp3 1   0   1

But this time let's compute the independent tables. I'll show only one table for hyp2, the other tables get computed in the same way:

# tab10_01.txt
!,,evA,evB,evC
hyp2,0.66667,0.5,1,0.5
~hyp2,0.33333,1,0,1

The total number of cases is 3, and the hypothesis hyp2 is true for two of them. So the prior probability P(hyp2) = 0.66667. The prior probability P(~hyp2) = 1 - P(hyp2) = 0.33333. The event evA is true in 1 of 2 cases where the hypothesis hyp2 is true, so P(evA|hyp2) = 0.5. The event evA is true in 1 of 1 cases where the hypothesis hyp2 is false, so P(evA|~hyp2) = 1. And so on.

By the way, if you think that you can compute P(E|~H) for the independent tables by deriving it from the table for the mutually-exclusive hypotheses using a formula like the following, you're wrong.

P(E|~H) = count(E|~H) / count(~H) = ( count(E)-count(E|H) ) / count(~H)
= ( P(E)*count(all) - P(E|H)*count(H) ) / ( count(all) - count(H) )
|divide by count(all) = ( P(E) - P(E|H)*P(H) ) / ( 1 - P(H) )

This formula describes the dependency correctly only for the mutually-exclusive hypotheses. But we had to distort the data to make it fit the mutually-exclusive approach. If you use this formula, you'll just get the other form of the distorted data. Instead you need to go back to the training cases and compute the probabilities directly from it to get the benefit of the independent-hypothesis approach.

Let's try it with the input that points to all 3 hypotheses being true:

# in10_01_02.txt
evA,1
evB,1
evC,1

\$ perl ex06_01run.pl -v tab10_01.txt in10_01_02.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
~hyp2  ,0.33333,1.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666665
H hyp2 *0.500000/0.666665
H ~hyp2 *1.000000/0.666665
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.50000,0.50000,1.00000,0.50000,
~hyp2  ,0.50000,1.00000,0.00000,1.00000,
--- Applying event evB, c=1.000000
P(E)=0.500004
H hyp2 *1.000000/0.500004
H ~hyp2 *0.000000/0.500004
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,1.00000,0.50000,1.00000,0.50000,
~hyp2  ,0.00000,1.00000,0.00000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.500000
H hyp2 *0.500000/0.500000
H ~hyp2 *1.000000/0.500000
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,1.00000,0.50000,1.00000,0.50000,
~hyp2  ,0.00000,1.00000,0.00000,1.00000,

It shows very clearly that hyp2 is true. Another input, this time pointing to a different hypothesis:

# in10_01_03.txt
evA,1
evB,0
evC,0

\$ perl ex06_01run.pl -v -c 0.01 tab10_01.txt in10_01_03.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
~hyp2  ,0.33333,1.00000,0.00000,1.00000,
--- Applying event evA, c=0.990000
P(E)=0.666665
H hyp2 *0.500000/0.663332
H ~hyp2 *0.990000/0.663332
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.50252,0.50000,1.00000,0.50000,
~hyp2  ,0.49748,1.00000,0.00000,1.00000,
--- Applying event evB, c=0.010000
P(E)=0.502516
H hyp2 *0.010000/0.497534
H ~hyp2 *0.990000/0.497534
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.01010,0.50000,1.00000,0.50000,
~hyp2  ,0.98990,1.00000,0.00000,1.00000,
--- Applying event evC, c=0.010000
P(E)=0.994950
H hyp2 *0.500000/0.014949
H ~hyp2 *0.010000/0.014949
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.33782,0.50000,1.00000,0.50000,
~hyp2  ,0.66218,1.00000,0.00000,1.00000,

Shows fairly clearly that the hyp2 is unlikely in this case.

Now the other input, with all 3 hypotheses being false (and also an "impossible input" from the standpoint of the training data, so we'll use the capping option):

# in10_01_01.txt
evA,0
evB,0
evC,0

\$ perl ex06_01run.pl -v -c 0.01 tab10_01.txt in10_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
~hyp2  ,0.33333,1.00000,0.00000,1.00000,
--- Applying event evA, c=0.010000
P(E)=0.666665
H hyp2 *0.500000/0.336668
H ~hyp2 *0.010000/0.336668
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.99010,0.50000,1.00000,0.50000,
~hyp2  ,0.00990,1.00000,0.00000,1.00000,
--- Applying event evB, c=0.010000
P(E)=0.990099
H hyp2 *0.010000/0.019703
H ~hyp2 *0.990000/0.019703
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.50252,0.50000,1.00000,0.50000,
~hyp2  ,0.49748,1.00000,0.00000,1.00000,
--- Applying event evC, c=0.010000
P(E)=0.748742
H hyp2 *0.500000/0.256233
H ~hyp2 *0.010000/0.256233
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.98058,0.50000,1.00000,0.50000,
~hyp2  ,0.01942,1.00000,0.00000,1.00000,

Whoops, it shows that hyp2 is very likely, even though it shouldn't be. I was very enthusiastic about the approach with the independent tables until I've discovered this effect, that the "impossible data" can easily drive the computation way wrong.

In most cases this issue can be resolved by using both the mutually-exclusive and independent approaches. First run the mutually-exclusive table and find the candidate hypotheses. Then run the independent computation for these candidate hypotheses, and accept them only if their probability gets over the acceptable probability limit (the limits may be chosen differently for the mutually-exclusive and the independent cases). This approach generally works well to pick the relevant hypotheses in case if the mutually-exclusive approach picks a large number of them as probable.

In this particular case though the result of the mutually-exclusive computation is:

\$ perl ex09_02run.pl -v -c 0.01  tab09_02.txt in10_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.010000
P(E)=0.666660
H hyp1 *0.010000/0.336673
H hyp2 *0.500000/0.336673
H hyp3 *0.500000/0.336673
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00990,1.00000,0.50000,0.50000,
hyp2   ,0.49503,0.50000,1.00000,0.50000,
hyp3   ,0.49503,0.50000,0.50000,1.00000,
--- Applying event evB, c=0.010000
P(E)=0.747503
H hyp1 *0.500000/0.257447
H hyp2 *0.010000/0.257447
H hyp3 *0.500000/0.257447
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.01923,1.00000,0.50000,0.50000,
hyp2   ,0.01923,0.50000,1.00000,0.50000,
hyp3   ,0.96143,0.50000,0.50000,1.00000,
--- Applying event evC, c=0.010000
P(E)=0.980658
H hyp1 *0.500000/0.028955
H hyp2 *0.500000/0.028955
H hyp3 *0.010000/0.028955
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33204,1.00000,0.50000,0.50000,
hyp2   ,0.33204,0.50000,1.00000,0.50000,
hyp3   ,0.33204,0.50000,0.50000,1.00000,
--- Result:
hyp1   ,0.33204,
hyp2   ,0.33204,
hyp3   ,0.33204,

It also thinks that all three hypotheses are true! Well, it's one of these cases when the addition of the hypothesis "ok" as described above would help a lot.

The other approach to resolving this issue comes from the analysis of why did the probabilities get driven like this. Looking at the log of the independent computation for hyp2, we can see that P(hyp2) gets driven up a lot after applying evA=0 and evC=0. What does the table of probabilities contain for these events? Let's look at it again:

# tab10_01.txt
!,,evA,evB,evC
hyp2,0.66667,0.5,1,0.5
~hyp2,0.33333,1,0,1

The probabilities P(evA|hyp2) and P(evC|hyp2) equal to 0.5, which means "these events don't matter for this hypothesis"! However they turn out to matter a lot for ~hyp2, where P(evA|~hyp2) and P(evC|~hyp2) equal to 1. We can try to doctor the probability table, saying that if these events are irrelevant for hyp2, they should also be irrelevant for ~hyp2.

To generalize, we can say that if P(E|H) is within a certain range around 0.5 (say, 0.25..0.75, or fine-tune this range to some other width), we'll consider the event irrelevant for the hypothesis and replace its probability with 0.5 for both H and ~H. We could as well just remove it from the table altogether but then the program will complain about an unknown event name, so disabling the event by altering its probabilities is more convenient. We'll get the table for hyp2 that looks like this:

# tab10_02.txt
!,,evA,evB,evC
hyp2,0.66667,0.5,1,0.5
~hyp2,0.33333,0.5,0,0.5

\$ perl ex06_01run.pl -v -c 0.01 tab10_02.txt in10_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
~hyp2  ,0.33333,0.50000,0.00000,0.50000,
--- Applying event evA, c=0.010000
P(E)=0.500000
H hyp2 *0.500000/0.500000
H ~hyp2 *0.500000/0.500000
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
~hyp2  ,0.33333,0.50000,0.00000,0.50000,
--- Applying event evB, c=0.010000
P(E)=0.666670
H hyp2 *0.010000/0.336663
H ~hyp2 *0.990000/0.336663
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.01980,0.50000,1.00000,0.50000,
~hyp2  ,0.98020,0.50000,0.00000,0.50000,
--- Applying event evC, c=0.010000
P(E)=0.500000
H hyp2 *0.500000/0.500000
H ~hyp2 *0.500000/0.500000
!      ,       ,evA    ,evB    ,evC    ,
hyp2   ,0.01980,0.50000,1.00000,0.50000,
~hyp2  ,0.98020,0.50000,0.00000,0.50000,

The result is much better! The model had picked up on the idea of what events are relevant for what hypothesis.

This approach is not perfect, since the relevance computation depends a lot on the randomness in the mix of the cases in the training data but it's definitely a big improvement. I'll talk more about the ways of finding the relevance in the future installments.

Bayes 9: mutual exclusivity

You might have already noticed that no matter what, the sum of all the P(H) stays equal to 1 (within the rounding error). This is not an accident. It means that the Bayes logic works with the complete set of mutually-exclusive hypotheses: exactly one of them must be true and they must cover all the possible eventualities of life. The weights of the hypotheses change during the calculations but these properties stay, so the sum of all the weights must equal to 1.

In this sense the spam detection is an ideal application for the Bayes logic: there are only two hypotheses, either the message is spam or it isn't, and thus exactly one of them must be true. The other uses, such as the diagnostics of the repairs or illnesses, are not so clear-cut. In them having more than one hypothesis true is perfectly valid. The Bayesian approach can still be used for them though it requires a certain amount of monkeying around. But fundamentally it's an abuse of the Bayesian system, not something it has been intended for.

To demonstrate the problems, I've made a simple set of training data where every case had resulted in two true hypotheses:

# tab09_01.txt
evA evB evC
1 * hyp1,hyp2 1   1   0
1 * hyp2,hyp3 0   1   1
1 * hyp1,hyp3 1   0   1

I've translated it to the table of probabilities exactly as before. There are 2 row with hyp1 and 3 rows total, to P(hyp1) is 2/3 = 0.66667. There are 2 rows with both hyp1 and evA, so P(evA|hyp1) = 2/2 = 1. There is 1 row with both hyp1 and evB, so P(evB|hyp1) = 1/2 = 0.5. And so on. It results in the table of probabilities:

# tab09_01.txt
!,,evA,evB,evC
hyp1,0.66667,1,0.5,0.5
hyp2,0.66667,0.5,1,0.5
hyp3,0.66667,0.5,0.5,1

Note that this table is different, the sum of all P(H) in it is 2, not 1, because every case lists 2 hypotheses.

Let's try to run it on this input:

# in09_01_01.txt
evA,0
evB,0
evC,1

\$ perl ex06_01run.pl -v tab09_01.txt in09_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.50000,0.50000,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
hyp3   ,0.66667,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=1.333340
Impossible events, division by -0.33334

Whoops, it crashes right away. P(E) = 1.33334 doesn't look right, the probabilities shouldn't be higher than 1. And this results in the large negative divisor in the formula. Good that the verification had caught it! If we unroll the computation, the root cause of the problem is that the sum of P(H) is 2. P(E) is calculated by summing its probabilities for every hypothesis, weighing them by P(H). If the sum of P(H) is 2 then the value of P(E) will be in the range of [0..2], which is completely unexpected by the rest of the computation.

One way to attempt resolving this problem is by re-weighting P(E), bringing it back into the range [0..1]. It can be done by dividing it by the sum of P(H). I've implemented this in the code of ex09_01run.pl. The computation of P(E) in the function apply_event() gets changed a little:

# ex09_01run.pl
my \$pe = 0.;
my \$sumph = 0.;
for (my \$i = 0; \$i <= \$#hypname; \$i++) {
\$pe += \$phyp[\$i] * \$peh[\$i][\$evi];
\$sumph += \$phyp[\$i]
}
\$pe /= \$sumph;

Let's try this new version:

\$ perl ex09_01run.pl -v tab09_01.txt in09_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.50000,0.50000,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
hyp3   ,0.66667,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=0.666667
H hyp1 *0.000000/0.333333
H hyp2 *0.500000/0.333333
H hyp3 *0.500000/0.333333
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,1.00000,0.50000,1.00000,0.50000,
hyp3   ,1.00000,0.50000,0.50000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.750000
H hyp1 *0.500000/0.250000
H hyp2 *0.000000/0.250000
H hyp3 *0.500000/0.250000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.00000,0.50000,1.00000,0.50000,
hyp3   ,2.00001,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=1.000000
H hyp1 *0.500000/1.000000
H hyp2 *0.500000/1.000000
H hyp3 *1.000000/1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.00000,0.50000,1.00000,0.50000,
hyp3   ,2.00001,0.50000,0.50000,1.00000,

Much better, it didn't crash. But it produced the result with hyp2 having the probability of 2. Not really surprising if you remember that the sum of probabilities stays the same, so if some hypotheses become improbable, the extra probability has to go somewhere, in this case to the only probably hypothesis left, hyp3. But the probability of 2 is not a right thing in the great scheme of things.

By the way, in the real world where you have the sum of probabilities from the training data closer to 1, and lots of events to consider, the effect will be less pronounced. You might almost never see the P(H) above 1. Which doesn't mean that the effect is not present, it will still be messing with you. This example is specially constructed to highlight this effect and show why you should care.

So, how can we get rid of the probabilities of 2? The simple way that springs to mind is to do the same thing as we've done before for P(E): scale down the values proportionally to make them sum up to 1 and enter these scaled-down values into the table of probabilities.

This can be thought of as changing the computation of the table of probabilities in the following way: There are 2 row with hyp1 and 6 hypotheses total in the outcomes of all the cases, to P(hyp1) is 2/6 = 0.33333. And so on. The computation of P(E|H) doesn't change.

The table of probabilities becomes:

# tab09_02.txt
!,,evA,evB,evC
hyp1,0.33333,1,0.5,0.5
hyp2,0.33333,0.5,1,0.5
hyp3,0.33333,0.5,0.5,1

Since now the sum of P(H) is 1 (within the rounding), we can as well use the original code to compute with this table:

\$ perl ex06_01run.pl -v tab09_02.txt in09_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=0.666660
H hyp1 *0.000000/0.333340
H hyp2 *0.500000/0.333340
H hyp3 *0.500000/0.333340
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.49999,0.50000,1.00000,0.50000,
hyp3   ,0.49999,0.50000,0.50000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.749978
H hyp1 *0.500000/0.250022
H hyp2 *0.000000/0.250022
H hyp3 *0.500000/0.250022
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.00000,0.50000,1.00000,0.50000,
hyp3   ,0.99988,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.999880
H hyp1 *0.500000/0.999880
H hyp2 *0.500000/0.999880
H hyp3 *1.000000/0.999880
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.00000,0.50000,1.00000,0.50000,
hyp3   ,1.00000,0.50000,0.50000,1.00000,

The probabilities stay nicely within the range of [0..1].

Let's look at another approach. Instead of all this we can just split the cases in the training data that have two outcomes into two cases with a single outcome (and the same approach can be easily generalized for the larger number of outcomes):

evA evB evC
1 * hyp1 1   0   1
1 * hyp1 1   1   0
1 * hyp2 1   1   0
1 * hyp2 0   1   1
1 * hyp3 0   1   1
1 * hyp3 1   0   1

When we compute the table of probabilities for them, it turns out to be the same one as we've got in the other way, after the scaling-down of the P(H):

# tab09_02.txt
!,,evA,evB,evC
hyp1,0.33333,1,0.5,0.5
hyp2,0.33333,0.5,1,0.5
hyp3,0.33333,0.5,0.5,1

So we can go in two ways and get the same result. Which is good. In the probability calculations the things are interconnected, and if we can arrive to the same result in two ways, it means that the result makes sense.

But why did we go the first way? To keep the prior probabilities of the events more in line with the reality. The scaling down of the probabilities partially negates this effect. It can be compensated by setting the lower probability boundary for the acceptable hypotheses. For example, if your probabilities have initially added up to 1.2 (i.e. 20% of the training cases had a two-hypothesis outcome) and your original boundary was 0.9, then it would be reasonable to set the boundary to 0.9/1.2 = 0.75. I believe that this is a large part of the reason why I've been finding the boundary values around 0.7 working well for me. But I don't have an exact proof.

Though of course if one hypothesis wins, its probability would be driven to 1 anyway. But there is nothing wrong with that, if this probability would have cleared the higher boundary of 0.9, it would clear the boundary of 0.75 as well.

In our example where the sum of probabilities added up to 2, the boundary would look as 0.9/2 = 0.45. Let's look at the input that indicates two hypotheses being true:

# in09_01_02.txt
evA,0
evB,1
evC,1

\$ perl ex06_01run.pl -v tab09_02.txt in09_01_02.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=0.666660
H hyp1 *0.000000/0.333340
H hyp2 *0.500000/0.333340
H hyp3 *0.500000/0.333340
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.49999,0.50000,1.00000,0.50000,
hyp3   ,0.49999,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
P(E)=0.749978
H hyp1 *0.500000/0.749978
H hyp2 *1.000000/0.749978
H hyp3 *0.500000/0.749978
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.666667
H hyp1 *0.500000/0.666667
H hyp2 *0.500000/0.666667
H hyp3 *1.000000/0.666667
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,1.00000,0.50000,
hyp3   ,0.50000,0.50000,0.50000,1.00000,

The two hypotheses nicely clear the boundary of 0.45.

But what if the input data indicates all 3 hypothesis being true? Such a case wasn't in the training data but that doesn't mean that it's impossible, it might be just relatively rare:

# in09_01_03.txt
evA,1
evB,1
evC,1

\$ perl ex06_01run.pl -v tab09_02.txt in09_01_03.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666660
H hyp1 *1.000000/0.666660
H hyp2 *0.500000/0.666660
H hyp3 *0.500000/0.666660
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.50000,1.00000,0.50000,0.50000,
hyp2   ,0.25000,0.50000,1.00000,0.50000,
hyp3   ,0.25000,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
P(E)=0.625000
H hyp1 *0.500000/0.625000
H hyp2 *1.000000/0.625000
H hyp3 *0.500000/0.625000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.40000,1.00000,0.50000,0.50000,
hyp2   ,0.40000,0.50000,1.00000,0.50000,
hyp3   ,0.20000,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.600000
H hyp1 *0.500000/0.600000
H hyp2 *0.500000/0.600000
H hyp3 *1.000000/0.600000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,

Whoops, none of the hypotheses have cleared the 0.45 boundary. The absolute boundary, even if it's lowered, is actually not such a great idea. A much better approach is to pick the few highest-ranked hypotheses that have the close probabilities and sum up to a non-corrected absolute boundary. In this example 0.33333+0.33333+0.33333=0.99999, and all the values are the same, so it makes sense to accept all the hypotheses.

To formalize this approach, the following condition can be used:

Pick N top hypotheses, for which P(H) >= limit/N

Pretty much, go hypothesis by hypothesis and add 1 to N while the next hypothesis satisfies this condition. However I've found that this can produce pretty long tails of hypotheses where the next one is just a little less probable that the previous one, enough to stay within the condition. This would be OK if they were all true but they aren't, and this generates the expensive false positives. So I've added one more condition to keep this sequence more flat:

P(H) of the Nth hypothesis must be >= 0.5*P(H) of the first hypothesis

This can be added with the following code:

# ex09_02run.pl.txt
# in options
our \$boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome
# ...
# pick and print the most probable hypotheses
sub pick_hypotheses()
{
my \$w = 7; # width of every field
my \$tfmt = "%-\${w}.\${w}s,"; # format of one text field
my \$nw = \$w-2; # width after the dot in the numeric fields field
my \$nfmt = "%-\${w}.\${nw}f,"; # format of one numeric field (does the better rounding)

print "--- Result:\n";
# the order is reverse!
my @order = sort {\$phyp[\$a] <=> \$phyp[\$b]} (0..\$#phyp);

my \$half_first = \$phyp[\$order[0]] / 2.;

my \$n = \$#phyp + 1;
for my \$i (@order) {
my \$ph = \$phyp[\$i];
if (\$ph >= \$half_first && \$ph >= \$boundary / \$n) {
printf(\$tfmt, \$hypname[\$i]);
printf(\$nfmt, \$ph);
printf("\n");
} else {
# reduce the count only if not started printing yet
--\$n;
}
}
}
# main()
while (\$ARGV[0] =~ /^-(.*)/) {
if (\$1 eq "v") {
\$verbose = 1;
} elsif (\$1 eq "c") {
shift @ARGV;
\$cap = \$ARGV[0]+0.;
} elsif (\$1 eq "b") {
shift @ARGV;
\$boundary = \$ARGV[0]+0.;
} else {
confess "Unknown switch -\$1";
}
shift @ARGV;
}
&print_table;
if (\$#ARGV >= 1) {
&apply_input(\$ARGV[1]);
&pick_hypotheses();
}

The hypotheses are tested in the backwards order because otherwise it would be impossible to say, which value of N to use when testing the first hypothesis.

It might be still useful to put the limit on either the absolute value of P(H) that can be accepted or equivalently on what the maximal reasonable N can be, to avoid the situations where the probability is spread among a large number of hypotheses because the model has no good match to the input data. If N is over 10 then the results are probably invalid and none of them should be accepted. I'll also show a different way to look at this problem in the next installment.

To give an example of the result produced by this code, here are its results for the inputs shown before:

\$ perl ex09_02run.pl tab09_02.txt in09_01_02.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.49999,0.50000,1.00000,0.50000,
hyp3   ,0.49999,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.66667,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.50000,0.50000,
hyp2   ,0.50000,0.50000,1.00000,0.50000,
hyp3   ,0.50000,0.50000,0.50000,1.00000,
--- Result:
hyp2   ,0.50000,
hyp3   ,0.50000,

\$ perl ex09_02run.pl tab09_02.txt in09_01_03.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.50000,1.00000,0.50000,0.50000,
hyp2   ,0.25000,0.50000,1.00000,0.50000,
hyp3   ,0.25000,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.40000,1.00000,0.50000,0.50000,
hyp2   ,0.40000,0.50000,1.00000,0.50000,
hyp3   ,0.20000,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.33333,1.00000,0.50000,0.50000,
hyp2   ,0.33333,0.50000,1.00000,0.50000,
hyp3   ,0.33333,0.50000,0.50000,1.00000,
--- Result:
hyp1   ,0.33333,
hyp2   ,0.33333,
hyp3   ,0.33333,

And finally I want to return to the version of the code that scaled the sum of P(H) to 1 when computing P(E), in ex09_01run.pl. An interesting thing about it shows in the case if the sum of probabilities is not quite 1 due to the rounding, such as 3 hypotheses at 0.33333 each. With this version of the code, the sum stays faithfully at 0.99999 and doesn't get pulled up to 1 by the computations. Which might make the result a little more true, introducing less of the rounding errors. But in reality this little difference probably doesn't matter.

Tuesday, October 20, 2015

Bayes 8: impossible

By thins point I'm done with re-telling the things I've read, and we're going into the things I've discovered in practice, and the solutions that I've come up with. They aren't always perfect, and often I have more than one solution with different pros and cons, and you might be able to find some better ones, but you can't deal with the real data without some sort of these solutions.

Let's look at another example fed into the same table as before:

evC,0
evA,0
evB,0

The event evC goes first to highlight an interesting peculiarity. When we run this input into the system, we get:

\$ perl ex06_01run.pl tab06_01_01.txt in08_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.000000
Impossible events, division by -2.22045e-16
at ex06_01run.pl line 131

If we feed the same events in a different order

evA,0
evB,0
evC,0

we get:

\$ perl ex06_01run.pl tab06_01_01.txt in08_01_01a.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,1.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,1.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
Impossible events, division by 0
at ex06_01run.pl line 131

What happened here? Why is there a division by 0, or even by the negative numbers? The negative number comes out of a rounding error. The data in the table has been rounded to only 5 digits after the decimal points to start with, and as the computations run, they introduce more and more rounding errors. Because of that the probability might go slightly out of the normal range, and go very slightly below 0 (as here, looks like it went negative by the lest-significant bit) or above 1. This very tiny negative value should have been 0 if everything had been perfect. If you really want, you could probably compare all the resulting probabilities for being in the proper range and bring them into it if they diverge slightly. Remember though that if your probabilities start to diverge substantially from the proper range, it means that you're getting some formulas terribly wrong, and you better take notice of that. Because of this I wouldn't recommend just blibdly forcing the values into the range, it's easier to let them be, and the small divergences won't matter much while the large divergences will become noticeable and alarming.

So why did it try to divide by 0? Let's run the computation once more with a verbose output:

\$ perl ex06_01run.pl -v tab06_01_01.txt in08_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
P(E)=0.777779
H hyp1 *0.333330/0.222221
H hyp2 *0.000000/0.222221
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.000000
P(E)=1.000000
Impossible events, division by -2.22045e-16

The reason is that P(E)=1, based on the previous seen events the model thinks that this event must absolutely be true, and we're telling it that it's false. The division by 0 means that the events form a combination that is impossible from the standpoint of the model.

But why is it impossible? Let's look again at the training data:

evA evB evC
4 * hyp1 1   1   1
2 * hyp1 1   0   0
3 * hyp2 0   0   1

No lines in it contain evC=0 and evA=0. The model knows only as much as the training data tells it. We're trying to feed the input (0, 0, 0) which says that none of the symptoms are present, and which wasn't present in the training data.

For this particular case we can add an item to the training data, representing the situation when there is nothing wrong with the subject of the expert system:

evA evB evC
4    * hyp1 1   1   1
2    * hyp1 1   0   0
3    * hyp2 0   0   1
0.01 * ok   0   0   0

I've added it with a real small count 0.01, so that it would not affect the probabilities of other outcomes very much. How can a count be not a whole number? Remember, it's all relative, from the computation standpoint it's not really counts but weights. Having 9 "real" cases and 0.01 "ok" case is the same as having 900 "real" cases and 1 "ok" case.

This training data translates to the table:

!,,evA,evB,evC
hyp1,0.66593,1,0.66667,0.66667
hyp2,0.33296,0,0,1
ok,0.00111,0,0,0

And let's apply the same input data:

\$ perl ex06_01run.pl -v tab08_01.txt in08_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66593,1.00000,0.66667,0.66667,
hyp2   ,0.33296,0.00000,0.00000,1.00000,
ok     ,0.00111,0.00000,0.00000,0.00000,
--- Applying event evC, c=0.000000
P(E)=0.776916
H hyp1 *0.333330/0.223084
H hyp2 *0.000000/0.223084
H ok *1.000000/0.223084
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.99502,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
ok     ,0.00498,0.00000,0.00000,0.00000,
--- Applying event evA, c=0.000000
P(E)=0.995024
H hyp1 *0.000000/0.004976
H hyp2 *1.000000/0.004976
H ok *1.000000/0.004976
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
ok     ,1.00000,0.00000,0.00000,0.00000,
--- Applying event evB, c=0.000000
P(E)=0.000000
H hyp1 *0.333330/1.000000
H hyp2 *1.000000/1.000000
H ok *1.000000/1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
ok     ,1.00000,0.00000,0.00000,0.00000,

After the first event evC the hypothesis "ok" inches up a little bit but after the event evA it jumps all the way to 1, becoming the winning one. The former impossibility now became the strong indication for "ok". Note that the hypothesis hyp2 would have surged up in the same way on evA but its probability is already 0 by that time and no amount of multiplication can make it change from 0.

Now let's look at another input set:

evA,0
evB,1
evC,1

It should come as no surprise to you that it also causes a division by 0, as this combination wasn't present in the training data either. In this case it might be reasonable to just stop and say "we don't know, this doesn't match any known outcome". Or it might mean that more than one thing is wrong with the thing we're trying to diagnose, and the symptoms of these illnesses are interacting with each other. It might be still work trying to continue the diagnosis to try figuring out the leading hypotheses. Of course, if there are only two hypotheses available, saying that "both might be true" sounds less useful, but with a large number of hypotheses narrowing down to a small number of them is still useful.

The trick I've come up with for this situation is to cap the confidence in the events. Instead of saying "I'm absolutely sure at confidence 1" let's say "I'm almost absolutely sure at confidence 0.99". And the same thing for 0, replacing it with 0.01. The code I've shown before contains an option (-c) that implements this capping without having to change the input data.

\$ perl ex06_01run.pl -v -c 0.01 tab06_01_01.txt in08_01_02.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.010000
P(E)=0.666670
H hyp1 *0.010000/0.336663
H hyp2 *0.990000/0.336663
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.01980,1.00000,0.66667,0.66667,
hyp2   ,0.98020,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.990000
P(E)=0.013202
H hyp1 *0.663337/0.022938
H hyp2 *0.010000/0.022938
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.57267,1.00000,0.66667,0.66667,
hyp2   ,0.42733,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.990000
P(E)=0.809113
H hyp1 *0.663337/0.802931
H hyp2 *0.990000/0.802931
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.47311,1.00000,0.66667,0.66667,
hyp2   ,0.52689,0.00000,0.00000,1.00000,

The first event ~evA drives the evidence strongly towards the hypothesis hyp2 but then the impossible event evB about evens things out, and evC moves the balance a little towards hyp2 again.

The value of the "unsurety cap" will vary the outcomes a little. If you use the smaller cap values, such as 0.00001, it will drive the balance a little sharper towards hyp2, a larger cap, such as 0.1, will produce the more even results. This effect gets more pronounced if you have hundreds of events to look at, and the larger cap values will muddle your results a lot and you won't be able to pick any hypothesis ever. In reality the values like 1e-8 are more suitable for the cap.

The same trick with capping would also work on all events being negative without the hypothesis "ok" but you probably wouldn't want to get a rather random diagnosis when there is nothing wrong with the patient. It's a fine line between being able to handle the cases that vary just a little from the known training data and giving the random diagnoses on the data that doesn't resemble anything known at all.

You might want to detect and count the cases when model goes though the "eye of a needle" of an almost-impossible event, and if it happens more that a certain small number of times, give up and admit that there is no good hypothesis that matches the data. How do you detect them? These are the cases when either the expected P(E) is very close to 0 and the event comes true, or the expected P(E) is very close to 1 and the event comes false. How close is "close"? It depends on your cap value. If you use a small enough cap, something within (cap*10) should be reasonable. Having a small cap value really pays in this case. Note that you still need the capping to make it work. You can't just go without capping, detect the division by 0 and drop the event that caused it. The division by 0 is caused not by one event but by a combination of multiple events and dropping one of them that happened to be the last one isn't right. The capping approach allows you to bring the probabilities back into the reasonable range when you see the last event in an "impossible" sequence.

It also happens that not all the impossible event combinations cause the division by 0. Look at this input:

evA,1
evB,0
evC,1

\$ perl ex06_01run.pl -v tab06_01_01.txt in08_01_03.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666670
H hyp1 *1.000000/0.666670
H hyp2 *0.000000/0.666670
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *1.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.666670
H hyp1 *0.666670/0.666670
H hyp2 *1.000000/0.666670
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

It doesn't match any input in the training data, yet the model never tries to divide by 0 without any capping needed, and it's absolutely sure that the hypothesis hyp1 is correct. Why? As it happens, this input (1, 0, 1) represents a mix of inputs (1, 1, 1) and (1, 0, 0), both of which point to hyp1. When these training data are minced up into the table of probabilities, the information about the correlation of evB and evC that is showing in the cases for hyp1 becomes lost. It just thinks "if evA is true and evB might be true, and evC might be true, then it's hyp1". There are some ways to deal with it, and we'll look at them yet.

Sunday, October 18, 2015

Bayes 7: training

Where does the table of probabilities come from? From the training data. What is the training data? It's the information about the previously known cases with the known correct diagnosis and all the known symptoms.

Let's look at what kind of training data (admittedly, pulled from the thin air) I used to create the table in the last example. To remind, the table was:

!,,evA,evB,evC
hyp1,0.66667,1,0.66667,0.66667
hyp2,0.33333,0,0,1

Basically, you start by writing down the known cases. Some of them might have happened in the same way more than once, so instead of writing down the same case multiple times, you just write the count of this kind of cases.

I've made up the following imaginary cases:

----     evA evB evC
4 * hyp1 1   1   1
2 * hyp1 1   0   0
3 * hyp2 0   0   1

9 cases total, with hypothesis hyp1 happeing with two different combinations of symptoms (4 and 2 cases), and the hypothesis hyp2 happening with one combination of symptoms (3 cases).

The prior probabilities P(H) of hypotheses get computed by dividing the number of cases with this hypothesis by the total number of cases:

P(hyp1) = (4+2)/(4+2+3) = 6/9 = 0.66667
P(hyp2) = 3/(4+2+3) = 3/9 = 0.33333

The conditional probabilities of the events P(E/H) are computed by dividing the number of cases where this event is true for this hypothesis by the total number of the cases for this hypothesis:

P(evA|hyp1) = (4+2)/(4+2) = 6/6 = 1
P(evB|hyp1) = 4/(4+2) = 4/6 = 0.66667
P(evC|hyp1) = 4/(4+2) = 4/6 = 0.66667

P(evA|hyp2) = 0/3 = 0
P(evB|hyp2) = 0/3 = 0
P(evC|hyp2) = 3/3 = 1

These numbers go into the table. Very simple. Of course, with the real-sized data sets of many thousands of cases you'd have to do a lot more calculations but it's fundamentally straightforward. At least for this basic approach. There are more complicated approaches, we'll get to them later.

Bayes 6: basic code example

After going through in words, I want to show you the code, with a simple example. The code is fairly independent of the actual knowledge that is contained in the probability tables, so I'll make the probability tables loadable.

The probability tables format will be like this:

# this is a simple table example
!,,evA,evB,evC
hyp1,0.66667,1,0.66667,0.66667
hyp2,0.33333,0,0,1

The data lines contain the comma-separated values (CSV). The lines, except the one that starts with a "!" describe the hypotheses. The first field is the hypothesis name, the second field is the prior probability P(H), and the rest of the fields are the conditional probabilities P(E|H) for every event and this hypothesis. The lines may contain a trailing comma, it will get ignored:

hyp_name,P(H),P(E1/H),P(E2/H),...

The line that starts with "!" contains the event names. If this line is absent, the events will be assigned the numeric names, starting from 1. The first two fields of the line are ignored, and the rest contain the names of the events, forming the table header, where each table column correcponds to an event and its conditional probabilities.

The same table can also be formatted with the nice alignment:

!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,

When my code runs, it will print its information about the tables in this aligned format. I've arbitrarily limited the width of the columns at 7, so any longer names will be truncated in the print-out. Good enough for the examples. The spaces around the hypothesis and event names will be ignored.

The other data item will be the input data describing a particular case. It consists of the CSV lines in format

event_name,confidence

The event name must match one of the names defined in the table, and the confidence is in the range [0..1], with 0 meaning that the event is definitely false, and 1 meaning that it's definitely true.

Now to the code. Eventually I'll make the code downloadable as files but for now I'll probably do the changes in it for some time, so I'l just copy-paste the current version here to reduce the confusion between the code and text. I'll talk more about how and why it works but first just the pasted code and some examples of its use.

#!/usr/bin/perl
#
# Running of a Bayes expert system.

use strict;
use Carp;

our @evname; # the event names
our %evhash; # hash of event names to indexes
our @hypname; # the hypothesis names
our @phyp; # the hypothesis probabilities
our @peh; # probabilities P(E/H) by hypothesis by event

# options
our \$verbose = 0; # verbose printing during the computation
our \$cap = 0; # cap on the confidence, factor accounting for fuzziness of training data,
# limits the confidence to the range [\$cap..1-\$cap]

{
my \$filename = shift;

@evname = ();
%evhash = ();
@hypname = ();
@phyp = ();
@peh = ();

my \$nev = undef; # number of events minus 1

confess "Failed to open '\$filename': \$!\n"
unless open(INPUT, "<", \$filename);
while(<INPUT>) {
chomp;
s/,\s*\$//; # remove the trailing comma if any
if (/^\#/ || /^\s*\$/) {
# a comment line
} elsif (/^\!/) {
# row with event names
@evname = split(/,/); # CSV format, the first 2 elements gets skipped
shift @evname;
shift @evname;
} else {
my @s = split(/,/); # CSV format for hypothesis
push @hypname, shift @s;
push @phyp, shift @s;

if (defined \$nev) {
if (\$nev != \$#s) {
close(INPUT);
my \$msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
\$nev+1, \$#s+1, \$_);
confess \$msg;
}
} else {
\$nev = \$#s;
}
# the rest of fields are probabilities, just take the whole array by reference
push @peh, \@s;
}
}
close(INPUT);

if (\$#evname >= 0) {
if (\$#evname != \$nev) {
my \$msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
\$nev+1, \$#evname+1);
confess \$msg;
}
} else {
for (my \$i = 0; \$i <= \$nev; \$i++) {
push @evname, (\$i+1)."";
}
}

for (my \$i = 0; \$i <= \$#evname; \$i++) {
\$evname[\$i] =~ s/^\s+//;
\$evname[\$i] =~ s/\s+\$//;
\$evhash{\$evname[\$i]} = \$i;
}
for (my \$i = 0; \$i <= \$#hypname; \$i++) {
\$hypname[\$i] =~ s/^\s+//;
\$hypname[\$i] =~ s/\s+\$//;
}
}

sub print_table()
{
my \$w = 7; # width of every field
my \$tfmt = "%-\${w}.\${w}s,"; # format of one text field
my \$nw = \$w-2; # width after the dot in the numeric fields field
my \$nfmt = "%-\${w}.\${nw}f,"; # format of one numeric field (does the better rounding)

# the title row
printf(\$tfmt, "!");
printf(\$tfmt, "");
foreach my \$e (@evname) {
printf(\$tfmt, \$e);
}
print("\n");
# the hypotheses
for (my \$i = 0; \$i <= \$#hypname; \$i++) {
printf(\$tfmt, \$hypname[\$i]);
printf(\$nfmt, \$phyp[\$i]);
foreach my \$e (@{\$peh[\$i]}) {
printf(\$nfmt, \$e);
}
print("\n");
}
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event(\$\$) # (evi, conf)
{
my (\$evi, \$conf) = @_;

# compute P(E)
my \$pe = 0;
for (my \$i = 0; \$i <= \$#hypname; \$i++) {
\$pe += \$phyp[\$i] * \$peh[\$i][\$evi];
}

if (\$verbose) {
printf("P(E)=%f\n", \$pe);
}

#compute the divisor
my \$div = \$pe*\$conf + (1. - \$pe)*(1. - \$conf);
if (\$div < 1e-10) {
confess (sprintf("Impossible events, division by %g\n", \$div));
}

# update the hypothesis probabilities
# (it works more efficiently if the array \$peh is indexed by the event first
# but for the simple example who cares)
for (my \$i = 0; \$i <= \$#hypname; \$i++) {
my \$mul = \$peh[\$i][\$evi]*\$conf + (1. - \$peh[\$i][\$evi])*(1. - \$conf);
if (\$verbose) {
printf("H %s *%f/%f\n", \$hypname[\$i], \$mul, \$div);
}
\$phyp[\$i] *= \$mul/\$div;
}
}

# Apply an input file
sub apply_input(\$) # (filename)
{
my \$filename = shift;

confess "Failed to open the input '\$filename': \$!\n"
unless open(INPUT, "<", \$filename);
while(<INPUT>) {
chomp;
next if (/^\#/ || /^\s*\$/); # a comment

my @s = split(/,/);
\$s[0] =~ s/^\s+//;
\$s[0] =~ s/\s+\$//;

confess ("Unknown event name '" . \$s[0] . "' in the input\n")
unless exists \$evhash{\$s[0]};
my \$evi = \$evhash{\$s[0]};

my \$conf = \$s[1] + 0.;
if (\$conf < \$cap) {
\$conf = \$cap;
} elsif (\$conf > 1.-\$cap) {
\$conf = 1. - \$cap;
}
printf("--- Applying event %s, c=%f\n", \$s[0], \$conf);
&apply_event(\$evi, \$conf);
&print_table;
}
close(INPUT);
}

# main()
while (\$ARGV[0] =~ /^-(.*)/) {
if (\$1 eq "v") {
\$verbose = 1;
} elsif (\$1 eq "c") {
shift @ARGV;
\$cap = \$ARGV[0]+0.;
} else {
confess "Unknown switch -\$1";
}
shift @ARGV;
}
&print_table;
if (\$#ARGV >= 1) {
&apply_input(\$ARGV[1]);
}

How to run it: Suppose the table above is locate din the file tab06_01_01.txt, and the code is in ex06_01run.pl. The first the code can do is just load the table and print it back with the nice formatting:

\$ perl ex06_01run.pl tab06_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,

Next, it can apply the input data for computation. Suppose the input data is in in06_01_01_01.txt:

evA,1
evB,0
evC,0

Let's apply it:

\$ perl ex06_01run.pl tab06_01_01.txt in06_01_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

It prints the table with the new probabilities after applying every event, so that we can trace, how the probabilities change. The only thing that changes is really the second column, P(H) of the hypotheses. The conditional probabilities of the events don't change.

In this case the very first event decided the fate: it drove the probability all the way for hyp1, and the rest of the events just stayed consistent with that. If we reorder the events, the intermediate computations will change:

evB,0
evA,1
evC,0

\$ perl ex06_01run.pl tab06_01_01.txt in06_01_01_04.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.40000,1.00000,0.66667,0.66667,
hyp2   ,0.60000,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

If evB is applied first, it drives the probabilities toward hyp2, but then the next event brings it back to the full glory of hyp1. Which basically means that if evA is true, there is no way hyp2 could happen, so hyp1 wins big time.

The order in which the events are applied doesn't matter. Either way it will come to the same result. In reality it might not be the best result since the table in this form doesn't represent the cross-correlation between the events quite right. But for a given table the result is the same independently of the order in which the events are applied.

The Bayes formula is applied for one event at a time in the function apply_event(). The formula is exactly as described before.

P(H|C(E)) = P(H) * ( C(E)*(0+P(E|H)) + (1-C(E))*(1-P(E|H)) )
/ ( C(E)*(0+P(E)) + (1-C(E))*(1-P(E)) )

The only catch is that P(E) is not present anywhere in the table. It has to be calculated from the other available data. This is because the probabilities of the events change after the knowledge about the previous events has been taken into the account. It's the biggest effect of the cross-correlation of the events and it absolutely has to be taken into account or you'll be seeing the probability values around 15 or so. There are other subtler effects as well but more about that later.

The event probability is computed from adding up its weighed conditional probabilities:

P(E) = sum(P(H) * P(E/H))

Since the P(H) values change after every events and P(E/H) stay the same, the values of P(E) also change.

If used with the option -v, the program will print the verbose intermediate data as it preforms the computations. With the original event order it will print:

\$ perl ex06_01run.pl -v tab06_01_01.txt in06_01_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666670
H hyp1 *1.000000/0.666670
H hyp2 *0.000000/0.666670
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *1.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *0.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

With the different order:

\$ perl ex06_01run.pl -v tab06_01_01.txt in06_01_01_04.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.444449
H hyp1 *0.333330/0.555551
H hyp2 *1.000000/0.555551
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.40000,1.00000,0.66667,0.66667,
hyp2   ,0.60000,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.400001
H hyp1 *1.000000/0.400001
H hyp2 *0.000000/0.400001
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *0.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

As you can see, the probabilities for evA and evB were wildly different in the first two rounds. But either way they were applied, by the time the evC gets into the fray, it has the same probability in both cases.

Sunday, October 11, 2015

Bayes 5: the big picture

It's not enough to just have the magic formulas, an expert system has to actually use them. So, how do you use them?

You start with the table of probabilities built from the training data. We'll look at the build later, for now let's just assume that the table is there.

Each event represents some symptom, so you look at the symptoms one by one, check if they are present, making the events true or false (or some confidence value in between), and adjust the probabilities of the hypotheses. At the end we pick the hypothesis that is most likely. It's usually a good idea to pick some absolute probability requirement as well. Basically, if you have 98 hypotheses at 1% probability and one hypothesis at 2% probability, it probably doesn't mean that the 2% hypothesis is the right one. It probably means that the system has no clue about the right answer, and the answer it gives should be "I don't know". You usually want the winning hypothesis to be at least somewhere in the vicinity of 70% probability. The optimal value might take some fine-tuning of your system. There is another important consideration to this point: are the hypotheses mutually exclusive or could multiple of them be true at the same time? I'll discuss this consideration in detail later, and for mow just leave it at that.

There are two ways to look for the symptoms. In one way you have some automated way to look for all the symptoms you know: in medicine that might be some panel of lab tests, in car diagnostics it might be some 107-point inspection, and of course as far as the computers or software are concerned, it's just the perfect application for the automated tests. The uses like the spam detection are just the perfect example: all the information you can get is right there in the e-mail, just go and extract it. If you have the results for all the symptoms, you just go and apply them all, one after another, and at the end you have the probabilities for all the hypotheses and can pick the best one.

In the other way the examination of a symptom requires some manual procedure, such as having a doctor poke the patient in some way, or ordering some test. There is some cost associated with performing the test. In some cases it might even be an an attempt at fixing the issue, mixing the symptoms and hypotheses a bit: such as "try this antibiotic" or "replace the alternator, did the problem go away"? When the symptoms are like this, there are two consequences: first, to save on the cost of diagnostics, you want to stop after examining as few symptoms as possible, and second, the order of examining the symptoms matters.

I've dealt in reality only with the situation when all the symptoms are available up-front, so I'll re-tell the solutions for the expensive symptoms I've read in that old British book (perhaps there are some better ones invented by now?).

A certain way to stop early is in the case if one hypothesis becomes certainly more probable than all the others. What does that mean, "certainly"? It means that if we look at all the symptoms we haven't examined yet and suppose that they go the worst way for this hypothesis, it would still beat all the other hypotheses, satisfying the requirements to accept it as the best one.

Another certain way to stop is if no hypothesis, even if all the yet unexamined symptoms go the best way for them, could satisfy the acceptance requirements. It's just the early admission of impending defeat without the useless thrashing beforehand.

The way you do it is by calculating the best and worst cases for each hypothesis after applying each symptom. You can also optimize it by remembering which hypotheses start failing the cut unconditionally, and skipping them in all the future checks. Obviously, if you have all the events available up-front, there is no point in stopping early, it's faster to just apply all the events rather than re-calculating the best and worst cases after every event.

How exacly do you compute the best and the worst cases? Like this:

• Start with the current probability of the hypothesis, and assign it as teh prior values Pbest(H) and Pworst(H)
• For all the un-examined symptoms:
• Check if P(E|H)/P(E) > 1 (i.e. the event has a positive correlation with this hypothesis)
• if this check is true, compute: Pbest(H) *= P(E|H)/P(E); Pworst(H) *= P(~E|H)/P(~E)
• else, compute Pbest(H): *= P(~E|H)/P(~E); Pworst(H) *= P(E|H)/P(E)

Now, what is the best next symptom to try out? The one that will have the strongest effect by affecting the probabilities of the hypotheses in the strongest way. I don't remember how that book defined the strength but here are some logical considerations, and it would probably work out in the same way as they did in that book.

We can use the difference between the probabilities as the measure of the effect of the event on a hypothesis:

abs(P(H|E) - P(H|~E))

Another possible way would be to divide the probabilities by each other but that is fraught with troubles. Let me show you why. Suppose we
define the effect as

max(P(H|E), P(H|~E)) / min(P(H|E), P(H|~E))

Max() and min() are needed to keep the value always above 1. But the minimum may always become 0, which would result in the infinities, which is not good. If we try to divide the other way around, min by max, we will produce the same results for 0/0.0001 and 0/0.9999, which is obviously also a bad metric. Thus the subtraction it is.

And then the obvious way to get the effect of an event on the whole set of hypotheses is to add up its effects on each individual hypotheses. Whichever event gets the highest metric is the most important event.

But there is also the question of the cost of checking the symptom. As everyone who performs the diagnostics knows, the reasonable approach is to try the cheapest things first. Which means that we should probably multiply that cumulative effect of an event by the cost of checking it, and use that as the metric.

The cost approach can be taken further: each hypothesis means a diagnosis that implies a prescription for a certain treatment. That treatment also has some cost. And again the reasonable thing is to hope for the diagnoses with the cheap treatments. So when computing the effect of an event on the whole set of hypotheses we can give the preference to the hypotheses with a cheaper treatment, for example by multiplying the effect on this hypothesis by 1/cost_of_treatment or log(1/cost_of_treatment) or some such.

What if a single test provides the knowledge about multiple symptoms? Then obviously we need to look for the choice not of the next symptom but of the whole next test. The effect of a test can be computed by adding up the effects of all the symptoms it resolves. Well, not necessarily exactly adding, this has issues with the correlated symptoms, but it's a reasonable first approach, and I'll talk more about this issue later.

Another interesting variation (that's not from the book, that's my own musings) is if the hypothesis are used to mean not just a diagnosis but a confirmed diagnosis, meaning "we digagnosed, treated, and confirmed that the treatment had cured the problem". In this case not just the symptoms as such but also the prescriptions for the treatment become the event (with the event being true if the treatment had worked). This nicely folds the knowledge about the alternative treatments into the same model. Obviously, this approach needs to have some stricter criteria of which treatments are worthy enough to try, the same as for the hypotheses in the more basic approach, to prevent the system from thrashing in vain when it's unable to make any useful conclusion.

This approach can be merged with the automated tests as well: We start by applying the results of all the automated tests as events, and then choose one of the possible treatments as the "next event to test" using the principles described above. If it fixes the issue, great, our job is done. Otherwise we can try the next treatment, and maybe re-run the automated tests, see if anything has changed, and take the changes into account before recommending the next treatment.

Wednesday, October 7, 2015

Bayes 4: the fuzzy logic

Before moving on to the other aspects, I want to wrap up with the basic formula, this time in its fuzzy-logic version. As I've already mentioned, sometimes it's hard to tell if the event is true or not. It would be nice to have some way to express the idea of the event being very likely or slighlty likely or slightly or very unlikely. It would be even nicer to represent our confidence in the event happening as a continuous range, with 1 meaning "the event is definitely true", 0 meaning "the event is definitely false", 0.5 meaning "we can't tell at all if the event is true or false", and the points in between representing the values in between. Let's use the letter C to denote this confidence, or maybe C(E) to show the confidence of which event we're talking about.

There is a way to generalize the Bayes formula for this confidence range. Way back when I've read it in that British book without explantions and marveled at how people come up with these formulas but in the more recent times I was able to re-create it from scratch from the logical consideration, so I'll show you, where does this marvelous formula come from (and a little later I'll show you where the bayes formula comes from too).

First, let's look at the boundary conditions. Kind of obviously, with C(E)=1 the formula must give the same result as the basic Bayes formula with E being true, with C(E)=0 it must give the same result as the basic formula for E being false, and for C(E)=0.5 it must leave the P(H) unchanged.

Let's look at the basic formulas again:

P(H|E) = P(H) * ( P(E|H) / P(E) )
P(H|~E) = P(H) * ( P(~E|H) / P(~E) )

If we substitute

P(~E|H) = 1 - P(E|H)
P(~E) = 1 - P(E)

then the formulas become:

P(H|E) = P(H) * ( P(E|H) / P(E) )
P(H|~E) = P(H) * ( (1-P(E|H)) / (1-P(E)) )

In both cases we multiply the same P(H) by some coefficient. The coefficient in both cases is computed by division of something done with P(E|H) by something done with P(E). For symmetry we can write in the first formula P(E|H) as 0+P(E|H) and P(E) as 0+P(E), then the formulas become even more like each other:

P(H|E)  = P(H) * ( (0+P(E|H)) / (0+P(E)) )
P(H|~E) = P(H) * ( (1-P(E|H)) / (1-P(E)) )

If we represent the points 0+P(E|H) and 1-P(E|H) graphically as a line section, we can see that in both cases they are located at the distance P(E|H), only one is to the right from 0, and another one is to the left of 1:

0   P(E|H)         1-P(E|H)  1
.....|......|..............|......|.......
------>\--------------/<------

And the same works for P(E). This way it looks even more symmetrical.

The natural follow-up from here is that we can use the confidence value C(E) to split the sections [ 0+P(E|H), 1-P(E|H) ] and [ 0+P(E), 1-P(E) ] proportionally. For example, if C(E) = 0.75, it will look like this:

P(0.75 E|H)
|
0   P(E|H) |       1-P(E|H)  1
.....|......|...|..........|......|.......
0.25   0.75

The (0+P(E|H)) gets taken with the weight C(E) and (1-P(E|H)) gets taken with the weight 1-C(E) and they get added together. If C(E) is 0.75, it means that the split point will be located closer to (0+P(E|H)), dividing the original range at its quarter-point.

In the formula form this can be expressed as:

P(H|C(E)) = P(H) * ( C(E)*(0+P(E|H)) + (1-C(E))*(1-P(E|H)) )
/ ( C(E)*(0+P(E)) + (1-C(E))*(1-P(E)) )

Let's check that this formula conforms to the boundary conditions formulated above:

P(H|C(E)=1) = P(H) * ( 1*(0+P(E|H)) + (1-1)*(1-P(E|H)) )
/ ( 1*(0+P(E)) + (1-1)*(1-P(E)) )
= P(H) * P(E|H) / P(E)

P(H|C(E)=0) = P(H) * ( 0*(0+P(E|H)) + (1-0)*(1-P(E|H)) )
/ ( 0*(0+P(E)) + (1-0)*(1-P(E)) )
= P(H) * (1-P(E|H)) / (1-P(E))

P(H|C(E)=0.5) = P(H) * ( 0.5*(0+P(E|H)) + (1-0.5)*(1-P(E|H)) )
/ ( 0.5*(0+P(E)) + (1-0.5)*(1-P(E)) )
= P(H) * ( 0.5*P(E|H)) + 0.5*(1-P(E|H)) )
/ ( 0.5*P(E) + 0.5*(1-P(E)) )
= P(H) * ( 0.5*P(E|H)) + 0.5 - 0.5*P(E|H)) )
/ ( 0.5*P(E) + 0.5 - 0.5*P(E)) )
= P(H) * 0.5 / 0.5
= P(H)

They all match. So this is the right formula even though a bit complicated. We can make it a bit simpler by splitting it into two formulas, one representing the function for the proportional splitting, and another one computing the probability with the use of the first function:

Pc(P, C) = P*C + (1-P)*(1-C)
P(H|C(E)) = P(H) * Pc(P(E|H), C(E))/Pc(P(E), C(E))

Now you might complain "but I don't like the shape of this dependency I feel that the confidence should be represented in a more non-linear way in one way or another". Well, if it's too linear for you, you don't have to use it directly. You can define your own confidence function Cmy(E) and map its values into C(E) in any kind of non-linear form, and then use the resulting C(E) values for computation.

In fact, I like a variation of that trick myself. I like limiting the values of C(E) to something like 0.00001 and 0.99999 instead of 0 and 1. It helps the model recover from the situations that it would otherwise think impossible and provide a way to handle the problem of the overfitting. But more on that later.

Tuesday, October 6, 2015

Bayes 3: the magic formula

The next thing is the conditional probability. It's written as P(H|E) or P(E|H).

P(H|E) means the probability of the hypothesis H being true in case if we know that the event E is true.

P(E|H) means the probability of the event E being true in case if the hypothesis H turns out to be true.

And now finally the magic formula discovered by Bayes:

P(H|E) = P(H) * P(E|H) / P(E)

It tells us how the probability of a hypothesis gets adjusted when we learn some related data. I haven't watched "House, MD" closely enough to give an example from the medical area, so I'll give one about the car repair.

Suppose our hypothesis is "the battery is discharged". One way to test it is to turn on the headlights (with the engine not running) and see that they come on at full brightness, that would be our event.

Suppose our original ("prior") probability P(H) was 0.1. And suppose the probability of the headlights being able to come up was 0.7. You might think that no way would the headlights shine brightly with the discharged battery, but the headlights consume a lot less current than the starter, and the battery might be capable of providing the small amount of current without too much of a voltage drop, so the headlights might still look decently bright to the eye. So let's say that P(E|H), the probability of the headlights shining brightly with a discharged battery, is 0.05.

Where do all these numbers come from? Rights now I've pulled them out of my imagination. In a real expert system they would come from two sources: the training data and the computation of the previous events. Some people might say that the numbers might come from the expert opinions but spit them in the eye, getting the good probabilities without a good set of training data is pretty much impossible. We'll get to the details of the training data a little later, now let's continue with the example.

So we've tested the headlights, found them shining bright, and now can plug the numbers into the formula:

P(H|E) = 0.1 * 0.05 / 0.7 = 0.00715

As you can see, the probability of this hypothesis took a steep plunge. Why did it happen? Basically, the effect depends on the difference between how probable this effect is in general (P(E)) and how probable it is if this hypothesis is true (P(E|H)). The formula can be reformatted to emphasise this:

P(H|E) = P(H) * ( P(E|H) / P(E) )

If P(E|H) is greater than P(E) then the probability of the hypothesis gets pulled up. If it's less then the probability gets pulled down.

What if we try the experiment and the result is false? Obviously, we can't just use this formula, since E is not true. But then the event ~E will be true, and we should instead use the similar formula for ~E and also adjust the probability of the hypothesis:

P(H|~E) = P(H) * P(~E|H) / P(~E)

Where do P(~E|H) and P(~E) come from? Again, from the training data, and they are connected with the direct values in the obvious way:

P(~E|H) = 1 - P(E|H)
P(~E) = 1 - P(E)

A lot of examples of Bayesian logic I've found on the Internet don't adjust the probabilities if an event is found to be false, and this is very, very wrong. Basically, they treat the situation "the event is known to be false" as "the event is unknown". If you have a spam filter and know that a message doesn't contain the word "viagra", you shouldn't treat it the same way as if you don't know if this word is present. The result will be that you'll be driving the probabilities of the message being spam way high on seeing any of the suspicious words. In case if you wonder if ignoring the negative event is sheer stupidity, the answer is no, it isn't, there are reasons to why people do this, they are trying to compensate for the other deficiencies of the bayesian method. Only a rare spam message indeed would contain the whole set of suspicious words, so if you properly test each word and apply it as positive or negative, you end up with pretty low probability values. I'll talk more about it later, although I don't have the perfect solutions either.

Continuing the example, suppose the headlights didn't come on brightly, and we have the probabilities

P(~E|H) = 1 - 0.05 = 0.95
P(~E) = 1 - 0.7 = 0.3

Then we can substitute them into the formula:

P(H|~E) = 0.1 * 0.95 / 0.3 = 0.31667

The hypothesis of a dead battery took quite a boost!

Now let me show you something else. Suppose the prior probability P(H) was not 0.1 but 0.5. We put the data with this single adjustment into the same formula and get:

P(H|~E) = 0.5 * 0.95 / 0.3 = 1.58333

Shouldn't the probability be in the range between 0 and 1? It sure should. If you get a probability value outside of this range, it means that something is wrong with your input data. Either you've used some expert estimations of probabilities and these experts turned out to be not any good (that's why using a proper set of training data is important), or you forgot to adjust the probabilities for the previous computations you've done. Or maybe the rounding errors from the previous computations have accumulated to become greater than the actual values.

Sunday, October 4, 2015

Bayes 2: hypotheses and events

A bayesian expert system deals with the probabilities of hypotheses. It tries to compute these probabilities based on the experimentation with the black box, and choose the most probable one (or maybe more than one).

The probability of a hypothesis H is commonly written as P(H). Naturally, it's a real number in the range between 0 ("definitely false") and 1 ("definitely true"). If you haven't dealt with the probabilities before, now you know this.

Each hypothesis H has an opposite hypothesis that I'll write as ~H (in the mathematical texts it's written as H with a line above it but it's inconvenient to type, so I chose a notation like the C language). If the hypothesis Hlupus is "the patient has lupus" then the opposite ~Hlupus will be "the patient doesn't have lupus". I'll also call ~H the "negative hypothesis", as opposed to the "positive" H.

The probabilities of positive and negative hypotheses are connected:

P(H) + P(~H) = 1

Which means that whatever happens, exactly one of H and ~H is true. We might not know for sure which one is true but if we think that H has the probability 0.7 (or in other words 70%) of being true, this means that it has the probability 0.3 of being false, while the opposite ~H must have the probability 0.3 of being true and 0.7 of being false. This is a very rigid connection, and we can use it for various deductions.

An event is not just an experiment but an experiment with a particular outcome. Not just "look at the patient's throat" but "I've looked at the patient's throat and found that it's sore". Just as the hypotheses, the events have their opposites too. If the event E is "I've looked at the patient's throat and found that it's sore", its opposite ~E will be "I've looked at the patient's throat and found that it's NOT sore". The experiment is the same but its outcome is opposite for the negative event. It's real important to note that the case "I haven't looked at the patient's throat yet" is different: until we look at the throat, neither E or ~E have happened. But once we look at the throat, we can tell that either E is true and ~E is false for this particular patient or the other way around, E is false and ~E is true. A typical mistake (one found in Wikipedia for example) is to mix up the cases when we haven't looked at the throat and when we looked at the throat and found that it's not sore. If you mix them up, your calculations will all be way wrong. Beware of it.

A clever reader might say "what if I looked at the patient's throat and can't say for sure whether it's sore or not, it's slightly reddish but not totally so"? Glad you asked, that's the department of the fuzzy logic, and I'll get to it later. But for now let's assume that we can tell whether it's one or the other. If we can't tell then we'll just treat it the same way as we haven't done the experiment at all (which is by the way consistent with how it's treated by the fuzzy logic).

We can't know the outcome of an experiment before we do the experiment, whether E or ~E will be true. But we can estimate the probability of E in advance, based on the other information we have collected so far. This advance estimation is called the probability of an event, P(E). For example, if we know that 30% of patients visiting a doctor have a sore throat, we can estimate P(Esorethroat)=0.3 based on our knowledge that this person came to see a doctor.

Just as with hypotheses, the probabilities of the complementary events are connected:

P(E) + P(~E) = 1

And it's really not any different from a hypothesis. It really is a hypothesis about the outcome of an experiment we haven't done yet, which then collapses to the true or false after the experiment is done and the result is observed.

Some experiments may have more than two possible outcomes. For example, suppose our black box under investigation contains some colored balls in it, and we can stick a hand in it and pull out a ball that might be of one of three colors: blue, yellow or green. This can be treated as three separate events:

Eblue: the ball is blue
Eyellow: the ball is yellow
Egreen: the ball is green

The total of all three probabilities will still be 1:

P(Eblue) + P(Eyellow) + P(Egreen) = 1

The connection between the probabilities of events means that these events are not independent. Observing any of these three events changes the probabilities of the other two events.

But the sum of probabilities in each pair will also be 1:

P(Eblue) + P(~Eblue) = 1
P(Eyellow) + P(~Eyellow) = 1
P(Egreen) + P(~Egreen) = 1

The connected events can be processed sequentially. We can predict the probabilities of the events that haven't been processed yet after each step of processing. We can achieve this in the real world by having an assistant pull out the ball out of the box and then asking him questions with the answers of "yes" or "no". Suppose we know for some reason that there is an equal probability for each color. Thus initially:

P(Eblue) = 1/3
P(~Eblue) = 2/3

P(Eyellow) = 1/3
P(~Eyellow) = 2/3

P(Egreen) = 1/3
P(~Egreen) = 2/3

Then we have the assistant pull out the ball and ask him: "Is the ball blue?" If the answer is "No" then the event probabilities change:

P(Eblue) = 0
P(~Eblue) = 1

P(Eyellow) = 1/2
P(~Eyellow) = 1/2

P(Egreen) = 1/2
P(~Egreen) = 1/2

The probabilities of Eblue and ~Eblue have collapsed to 0 and 1 but the remaining two outcomes are now equi-probable between two, not three.

Now we can ask the next question: "Is the ball yellow?". If the answer is again "No" then the probabilities change like this:

P(Eblue) = 0
P(~Eblue) = 1

P(Eyellow) = 0
P(~Eyellow) = 1

P(Egreen) = 1
P(~Egreen) = 0

At this point we also become quite sure that the ball must be green, and we can as well skip asking about it. Not in the real world though. In the real world something might have changed since we've made our original estimations and more colors became available to the box-builder. Or maybe the fourth outcome is extremely rare and our training data didn't contain any example of it. Thus it could happen that we ask "Is the ball green?" expecting to hear "Yes" and get a "No". Obviously, this would be a curve-ball that disturbs the model, but I'll show a possible way of dealing with it.