Tuesday, October 20, 2015

Bayes 8: impossible

By this point I'm done with re-telling the things I've read, and we're going into the things I've discovered in practice, and the solutions I've come up with. They aren't always perfect, and often I have more than one solution with different pros and cons, and you might be able to find better ones, but you can't deal with real data without solutions of some sort.

Let's look at another example fed into the same table as before:

evC,0
evA,0
evB,0

The event evC goes first to highlight an interesting peculiarity. When we run this input through the system, we get:

$ perl ex06_01run.pl tab06_01_01.txt in08_01_01.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.000000
Impossible events, division by -2.22045e-16
 at ex06_01run.pl line 131

If we feed the same events in a different order

evA,0
evB,0
evC,0

we get:

$ perl ex06_01run.pl tab06_01_01.txt in08_01_01a.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,1.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,1.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
Impossible events, division by 0
 at ex06_01run.pl line 131

What happened here? Why is there a division by 0, or even by a negative number? The negative number comes out of a rounding error. The data in the table has been rounded to only 5 digits after the decimal point to start with, and as the computations run, they introduce more and more rounding errors. Because of that the probability might go slightly out of the normal range: very slightly below 0 (as here, where it looks like it went negative by the least-significant bit) or above 1. This very tiny negative value should have been 0 if everything had been perfect. If you really want, you could compare all the resulting probabilities against the proper range and bring them back into it if they diverge slightly. Remember though that if your probabilities start to diverge substantially from the proper range, it means that you're getting some formulas terribly wrong, and you'd better take notice of that. Because of this I wouldn't recommend just blindly forcing the values into the range; it's easier to let them be, the small divergences won't matter much while the large divergences will become noticeable and alarming.
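If you do decide to bring the values back into range, a minimal sketch of such a check might look like this (the hash %phyp holding the per-hypothesis probabilities is my assumption about the data layout for illustration, not the actual code of ex06_01run.pl):

use strict;
use warnings;

# A made-up example of the current hypothesis probabilities, one of them
# driven slightly negative by the accumulated rounding errors.
my %phyp = (hyp1 => 1.0, hyp2 => -2.22045e-16);

# Tolerate only the tiny excursions that look like rounding errors;
# anything bigger means that some formula has gone really wrong.
my $eps = 1e-10;
for my $h (keys %phyp) {
    die "Probability of $h diverged badly: $phyp{$h}\n"
        if $phyp{$h} < -$eps || $phyp{$h} > 1. + $eps;
    $phyp{$h} = 0. if $phyp{$h} < 0.;
    $phyp{$h} = 1. if $phyp{$h} > 1.;
}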

So why did it try to divide by 0? Let's run the computation once more with a verbose output:

$ perl ex06_01run.pl -v tab06_01_01.txt in08_01_01.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
P(E)=0.777779
H hyp1 *0.333330/0.222221
H hyp2 *0.000000/0.222221
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.000000
P(E)=1.000000
Impossible events, division by -2.22045e-16

The reason is that P(E)=1: based on the previously seen events, the model thinks that this event must absolutely be true, and we're telling it that it's false. The division by 0 means that the events form a combination that is impossible from the standpoint of the model.
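You can see where the division comes from in the verbose lines above: when an event comes out false, each hypothesis gets multiplied by (1 - P(E|H)) and divided by (1 - P(E)), that is

P(H|~E) = P(H) * (1 - P(E|H)) / (1 - P(E))

and when the model expects P(E)=1, the divisor (1 - P(E)) becomes 0 (or, with the rounding errors, a tiny negative number).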

But why is it impossible? Let's look again at the training data:

         evA evB evC
4 * hyp1 1   1   1
2 * hyp1 1   0   0
3 * hyp2 0   0   1

No line in it contains both evC=0 and evA=0. The model knows only as much as the training data tells it. We're trying to feed in the input (0, 0, 0), which says that none of the symptoms are present, a combination that wasn't present in the training data.

For this particular case we can add an item to the training data, representing the situation when there is nothing wrong with the subject of the expert system:

            evA evB evC
4    * hyp1 1   1   1
2    * hyp1 1   0   0
3    * hyp2 0   0   1
0.01 * ok   0   0   0

I've added it with a really small count of 0.01, so that it would not affect the probabilities of the other outcomes very much. How can a count not be a whole number? Remember, it's all relative; from the computation standpoint these aren't really counts but weights. Having 9 "real" cases and 0.01 of an "ok" case is the same as having 900 "real" cases and 1 "ok" case.

This training data translates to the table:

!,,evA,evB,evC
hyp1,0.66593,1,0.66667,0.66667
hyp2,0.33296,0,0,1
ok,0.00111,0,0,0
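To double-check the numbers: the total weight is now 4 + 2 + 3 + 0.01 = 9.01, and the table values come out of it as

P(hyp1) = 6 / 9.01 = 0.66593
P(hyp2) = 3 / 9.01 = 0.33296
P(ok)   = 0.01 / 9.01 = 0.00111

while the per-event probabilities within each hypothesis, such as P(evB|hyp1) = 4/6 = 0.66667, stay unchanged, since the "ok" case doesn't belong to any of the old hypotheses.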

And let's apply the same input data:

$ perl ex06_01run.pl -v tab08_01.txt in08_01_01.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66593,1.00000,0.66667,0.66667,
hyp2   ,0.33296,0.00000,0.00000,1.00000,
ok     ,0.00111,0.00000,0.00000,0.00000,
--- Applying event evC, c=0.000000
P(E)=0.776916
H hyp1 *0.333330/0.223084
H hyp2 *0.000000/0.223084
H ok *1.000000/0.223084
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.99502,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
ok     ,0.00498,0.00000,0.00000,0.00000,
--- Applying event evA, c=0.000000
P(E)=0.995024
H hyp1 *0.000000/0.004976
H hyp2 *1.000000/0.004976
H ok *1.000000/0.004976
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
ok     ,1.00000,0.00000,0.00000,0.00000,
--- Applying event evB, c=0.000000
P(E)=0.000000
H hyp1 *0.333330/1.000000
H hyp2 *1.000000/1.000000
H ok *1.000000/1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
ok     ,1.00000,0.00000,0.00000,0.00000,

After the first event evC the hypothesis "ok" inches up a little bit, but after the event evA it jumps all the way to 1, becoming the winning one. The former impossibility has now become a strong indication for "ok". Note that the hypothesis hyp2 would have surged up in the same way on evA, but its probability is already 0 by that time, and no amount of multiplication can change it from 0.

Now let's look at another input set:

evA,0
evB,1
evC,1

It should come as no surprise to you that it also causes a division by 0, as this combination wasn't present in the training data either. In this case it might be reasonable to just stop and say "we don't know, this doesn't match any known outcome". Or it might mean that more than one thing is wrong with the thing we're trying to diagnose, and the symptoms of these illnesses are interacting with each other. It might still be worth continuing the diagnosis to try to figure out the leading hypotheses. Of course, if there are only two hypotheses available, saying that "both might be true" sounds less useful, but with a large number of hypotheses narrowing down to a small subset of them is still useful.

The trick I've come up with for this situation is to cap the confidence in the events. Instead of saying "I'm absolutely sure at confidence 1" let's say "I'm almost absolutely sure at confidence 0.99". And the same thing for 0, replacing it with 0.01. The code I've shown before contains an option (-c) that implements this capping without having to change the input data.
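The capping itself is just a clamping of the confidence value; a sketch of what the -c option effectively does might look like this (the variable and function names here are mine for illustration, not the ones from ex06_01run.pl):

use strict;
use warnings;

# $conf is the confidence read from the input line (0 or 1 for the absolute
# statements), $cap is the value given with the -c option.
sub cap_confidence {
    my ($conf, $cap) = @_;
    $conf = $cap if $conf < $cap;            # 0 becomes, say, 0.01
    $conf = 1. - $cap if $conf > 1. - $cap;  # and 1 becomes 0.99
    return $conf;
}

printf "%f\n", cap_confidence(0, 0.01);  # prints 0.010000
printf "%f\n", cap_confidence(1, 0.01);  # prints 0.990000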

$ perl ex06_01run.pl -v -c 0.01 tab06_01_01.txt in08_01_02.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=0.010000
P(E)=0.666670
H hyp1 *0.010000/0.336663
H hyp2 *0.990000/0.336663
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.01980,1.00000,0.66667,0.66667,
hyp2   ,0.98020,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.990000
P(E)=0.013202
H hyp1 *0.663337/0.022938
H hyp2 *0.010000/0.022938
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.57267,1.00000,0.66667,0.66667,
hyp2   ,0.42733,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.990000
P(E)=0.809113
H hyp1 *0.663337/0.802931
H hyp2 *0.990000/0.802931
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.47311,1.00000,0.66667,0.66667,
hyp2   ,0.52689,0.00000,0.00000,1.00000,

The first event ~evA drives the evidence strongly towards the hypothesis hyp2, but then the "impossible" event evB roughly evens things out, and evC moves the balance a little towards hyp2 again.

The value of the "unsurety cap" will vary the outcomes a little. A smaller cap value, such as 0.00001, will drive the balance a little sharper towards hyp2; a larger cap, such as 0.1, will produce more even results. This effect gets more pronounced if you have hundreds of events to look at: the larger cap values will muddle your results a lot and you won't ever be able to pick any hypothesis. In reality, values like 1e-8 are more suitable for the cap.

The same trick with capping would also handle the case of all events being negative even without the hypothesis "ok", but you probably wouldn't want to get a rather random diagnosis when there is nothing wrong with the patient. It's a fine line between handling the cases that vary just a little from the known training data and giving random diagnoses on data that doesn't resemble anything known at all.

You might want to detect and count the cases when the model goes through the "eye of a needle" of an almost-impossible event, and if it happens more than a certain small number of times, give up and admit that there is no good hypothesis that matches the data. How do you detect them? These are the cases when either the expected P(E) is very close to 0 and the event comes true, or the expected P(E) is very close to 1 and the event comes false. How close is "close"? It depends on your cap value. If you use a small enough cap, something within (cap*10) should be reasonable. Having a small cap value really pays off in this case. Note that you still need the capping to make it work. You can't just go without capping, detect the division by 0 and drop the event that caused it: the division by 0 is caused not by one event but by a combination of multiple events, and dropping the one that happened to be last isn't right. The capping approach allows you to bring the probabilities back into a reasonable range when you see the last event in an "impossible" sequence.
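A sketch of the detection itself might look like this (all the names here are made up for the illustration; $pe stands for the expected P(E) and $conf for the capped confidence of the incoming event):

use strict;
use warnings;

my $cap = 1e-8;       # the capping value, same as given to -c
my $max_narrow = 3;   # give up after this many near-impossible events
my $narrow = 0;       # how many have been seen so far

# Check each event before applying it to the hypotheses.
sub check_needle {
    my ($pe, $conf) = @_;
    my $margin = $cap * 10;
    # The event was expected to be almost surely false but came out (mostly)
    # true, or was expected to be almost surely true but came out false.
    if (($pe < $margin && $conf > 0.5)
    || ($pe > 1. - $margin && $conf < 0.5)) {
        die "No hypothesis matches the data well enough\n"
            if ++$narrow > $max_narrow;
    }
}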

It also happens that not all the impossible event combinations cause a division by 0. Look at this input:

evA,1
evB,0
evC,1

$ perl ex06_01run.pl -v tab06_01_01.txt in08_01_03.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666670
H hyp1 *1.000000/0.666670
H hyp2 *0.000000/0.666670
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *1.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.666670
H hyp1 *0.666670/0.666670
H hyp2 *1.000000/0.666670
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

It doesn't match any input in the training data, yet the model never tries to divide by 0, no capping needed, and it's absolutely sure that the hypothesis hyp1 is correct. Why? As it happens, this input (1, 0, 1) represents a mix of the inputs (1, 1, 1) and (1, 0, 0), both of which point to hyp1. When the training data gets minced up into the table of probabilities, the information about the correlation of evB and evC that shows in the cases for hyp1 gets lost. The model just thinks "if evA is true, and evB might be true, and evC might be true, then it's hyp1". There are ways to deal with this, and we'll look at them later.
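To put numbers on the lost correlation: in the training data for hyp1 the combination evB=0, evC=1 never occurs, so its real probability there is 0. But the table keeps only the separate values P(evB|hyp1) = 0.66667 and P(evC|hyp1) = 0.66667, and the computation effectively treats the events as independent within each hypothesis, so from the model's standpoint

P(evB=0, evC=1 | hyp1) = (1 - 0.66667) * 0.66667 = 0.22222

which is a perfectly possible combination.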
