As a reminder of what the problem is, consider a simple training table:
# tab17_01.txt
!,,evA,evB,evC
hyp1,1,1,0,0
hyp2,1,0,1,0
hyp3,1,0,0,1
One event indicates one hypothesis. For now let's ignore the idea of relevance, because it's not yet obvious how to compute it. I actually plan to work up to the computation of the relevance from the subtraction logic.
If we feed the input with all the events present (and with capping), that results in all the hypotheses getting equal probability:
# in17_01_01.txt
evA,1
evB,1
evC,1

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_01.txt
! , ,evA ,evB ,evC ,
hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1 0.33333
hyp2 0.33333
hyp3 0.33333
--- Applying event evA, c=0.990000
! , ,evA ,evB ,evC ,
hyp1 ,0.99000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.01000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.01000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.990000
! , ,evA ,evB ,evC ,
hyp1 ,0.00990,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.00990,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.990000
! , ,evA ,evB ,evC ,
hyp1 ,0.00010,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1 0.33333
hyp2 0.33333
hyp3 0.33333
If we feed the input with all the events absent (and with capping), that also results in all the hypotheses getting equal probability:
# in17_01_02.txt
evA,0
evB,0
evC,0

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_02.txt
! , ,evA ,evB ,evC ,
hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1 0.33333
hyp2 0.33333
hyp3 0.33333
--- Applying event evA, c=0.010000
! , ,evA ,evB ,evC ,
hyp1 ,0.01000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.99000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.99000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
! , ,evA ,evB ,evC ,
hyp1 ,0.00990,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.00990,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.98010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.010000
! , ,evA ,evB ,evC ,
hyp1 ,0.00980,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.00980,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.00980,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1 0.33333
hyp2 0.33333
hyp3 0.33333
How do we know that for the first input the right answer is "all three hypotheses are true" and for the second input the right answer is "all three hypotheses are false"? Note that if we look at the final weights, they are much higher for the second input (0.00980 versus 0.00010).
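To make the two runs above easier to follow, here is a minimal Python sketch that reproduces their final weights. It is not the code of ex16_01run.pl: the function name `run_table`, the table layout, and the update rule W *= TC*C + (1-TC)*(1-C) with C capped into [cap, 1-cap] are assumptions inferred from the printed output.

```python
def run_table(table, inputs, cap=0.01):
    """Apply event confidences to a training table.

    table:  {hypothesis: (initial_weight, {event: TC})}
    inputs: list of (event, confidence) pairs, applied in order
    Returns (final weights, normalized probabilities).
    """
    weights = {hyp: w for hyp, (w, _) in table.items()}
    for event, c in inputs:
        # Capping (assumed): keep the confidence inside [cap, 1 - cap].
        c = min(max(c, cap), 1.0 - cap)
        for hyp, (_, tc) in table.items():
            p = tc[event]
            # Assumed update rule: scale the weight by how well the
            # training confidence agrees with the input confidence.
            weights[hyp] *= p * c + (1.0 - p) * (1.0 - c)
    total = sum(weights.values())
    return weights, {hyp: w / total for hyp, w in weights.items()}


tab17_01 = {
    "hyp1": (1.0, {"evA": 1, "evB": 0, "evC": 0}),
    "hyp2": (1.0, {"evA": 0, "evB": 1, "evC": 0}),
    "hyp3": (1.0, {"evA": 0, "evB": 0, "evC": 1}),
}

w_ones, p_ones = run_table(tab17_01, [("evA", 1), ("evB", 1), ("evC", 1)])
w_zeros, p_zeros = run_table(tab17_01, [("evA", 0), ("evB", 0), ("evC", 0)])
print(w_ones, p_ones)
print(w_zeros, p_zeros)
```

Both inputs normalize to the same probabilities, and only the raw weights differ, which is why the probabilities alone cannot tell the two situations apart.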
The idea I've come up with is that we can take the set of the highly probable hypotheses (all three hypotheses in the examples above) and try to subtract the effects of all but one hypothesis in the set from the input. Then run the modified input through the table again and see if that one remaining hypothesis will pop up above all the others. If it will, it should be accepted. If it won't, it should be refused. Repeat the computation for every hypothesis in the set.
To do that, we need to decide what "subtract" means.
It seems reasonable to make the decision based on what probability this event has for this one hypothesis and what probability it has for all the other top hypotheses.
This can be interpreted in two ways depending on what case weights we're using for this computation: those from the original table or those from the result of the first computation. Using the weights from the result of the first computation seems to make more sense, since it favors the cases that have actually matched the input.
OK, suppose, we get these two probability values, how do we subtract the effects? Let's look at some examples of what results would make sense.
Let's name the probability of the event Pone (it can also be called P(E|H) fairly consistently with what we designated by it before, or TC(E|H)) in the situation where the one chosen hypothesis is true, and Prest in the situation where all the other top hypotheses are true. Let's call the actual event confidence C, and the confidence after subtraction Csub.
Some obvious cases would be if Pone and Prest are opposite:
Pone=0, Prest=1, C=1 => Csub=0
Pone=1, Prest=0, C=1 => Csub=1
Pone=0, Prest=1, C=0 => Csub=0
Pone=1, Prest=0, C=0 => Csub=1
Basically, if Prest and C are opposite, C stays as it was; if Prest and C are the same, C flips. The other way to say it is that Csub ends up matching Pone.
The less obvious cases are where both Pone and Prest point the same way. Should C stay? Should it move towards 0.5? One thing that can be said for sure is that C shouldn't flip in this situation. There are arguments for both staying and for moving towards 0.5. This situation means that all the remaining cases match the state of this event, so staying means that there is no reason to penalize one case just because the other cases match it. Moving towards 0.5 means that we say that the rest of the hypotheses can account well for this event by themselves, so let's try to eliminate the "also ran" hypothesis. Staying seems to make more sense to me.
The goal of the subtraction is that with the subtracted confidences applied, a valid hypothesis should be strongly boosted above all others. If it doesn't get boosted (i.e. if it still gets tangled with other hypotheses), it's probably not a valid hypothesis but just some random noise.
The only time I've used the subtraction approach with the real data, I did it in a simple way, and it still worked quite well. That implementation can be expressed as:
Csub = C * TCone/(TCone + TCrest)
Here TCone and TCrest are similar to Pone and Prest but represent the sums of weighted training confidences instead of probabilities:
TCone(E) = sum(TC(E|Hone)*W(Hone))
TCrest(E) = sum(TC(E|Hrest)*W(Hrest))
That implementation was asymmetric: if C is 1, Csub may become less than C, but if C is 0, Csub will stay at 0. It handles reasonably well the situations where the event is mostly positive for the top hypotheses but not the situations where the event is mostly negative for the top hypotheses.
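The simple formula can be sketched in Python as follows. The function name `subtract_simple`, the list-of-pairs layout, and the handling of a zero denominator (leave C unchanged) are assumptions made for illustration. The weights are all set to 1 here, as in tab17_01.txt.

```python
def subtract_simple(c, one, rest):
    """Csub = C * TCone / (TCone + TCrest).

    one:  list of (TC, weight) pairs for the cases of the chosen hypothesis
    rest: list of (TC, weight) pairs for the cases of the other top hypotheses
    """
    tc_one = sum(tc * w for tc, w in one)
    tc_rest = sum(tc * w for tc, w in rest)
    if tc_one + tc_rest == 0:
        # Assumption: with nothing to subtract, leave the confidence as-is.
        return c
    return c * tc_one / (tc_one + tc_rest)


# Columns of tab17_01.txt in the order hyp1, hyp2, hyp3; hyp1 is the chosen one.
columns = {"evA": [1, 0, 0], "evB": [0, 1, 0], "evC": [0, 0, 1]}


def csub_for_hyp1(c):
    return {
        ev: subtract_simple(c, [(tcs[0], 1.0)], [(tc, 1.0) for tc in tcs[1:]])
        for ev, tcs in columns.items()
    }


all_ones = csub_for_hyp1(1.0)
all_zeros = csub_for_hyp1(0.0)
print(all_ones)   # {'evA': 1.0, 'evB': 0.0, 'evC': 0.0}
print(all_zeros)  # {'evA': 0.0, 'evB': 0.0, 'evC': 0.0}
```

The asymmetry is visible directly in the formula: C is only ever multiplied by a factor between 0 and 1, so an input of 0 can never move.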
If we compute the values of Csub in this way for the first example above (C(evA)=1, C(evB)=1, C(evC)=1), we will get:
- For hyp1: Csub(evA)=1, Csub(evB)=0, Csub(evC)=0
- For hyp2: Csub(evA)=0, Csub(evB)=1, Csub(evC)=0
- For hyp3: Csub(evA)=0, Csub(evB)=0, Csub(evC)=1
These exactly match the training cases, so all three hypotheses will be accepted, with each hypothesis going to the probability 1 on its run.
If we compute the values of Csub in this way for the second example above (C(evA)=0, C(evB)=0, C(evC)=0), we will get:
- For hyp1: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
- For hyp2: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
- For hyp3: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
They stay the same as the original input, thus the results won't change: the probabilities of all the hypotheses will stay at 0.33 for each run, and all the hypotheses will be rejected.
The defect of this formula shows itself when the events are negative, their absence pointing towards the hypotheses, as in the following table:
# tab17_02.txt
!,,evA,evB,evC
hyp1,1,0,1,1
hyp2,1,1,0,1
hyp3,1,1,1,0
In this case the all-0 input should produce the result saying that all the hypotheses are true, and the all-1 input should have all the hypotheses as false.
For the all-1 input (C(evA)=1, C(evB)=1, C(evC)=1), we will get:
- For hyp1: Csub(evA)=0, Csub(evB)=0.5, Csub(evC)=0.5
- For hyp2: Csub(evA)=0.5, Csub(evB)=0, Csub(evC)=0.5
- For hyp3: Csub(evA)=0.5, Csub(evB)=0.5, Csub(evC)=0
Let's try applying the computed values for hyp1:
# in17_02_03.txt
evA,0
evB,0.5
evC,0.5

$ perl ex16_01run.pl -c 0.01 tab17_02.txt in17_02_03.txt
! , ,evA ,evB ,evC ,
hyp1 ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2 ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Probabilities
hyp1 0.33333
hyp2 0.33333
hyp3 0.33333
--- Applying event evA, c=0.010000
! , ,evA ,evB ,evC ,
hyp1 ,0.99000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2 ,0.01000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3 ,0.01000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Applying event evB, c=0.500000
! , ,evA ,evB ,evC ,
hyp1 ,0.49500,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2 ,0.00500,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3 ,0.00500,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Applying event evC, c=0.500000
! , ,evA ,evB ,evC ,
hyp1 ,0.24750,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2 ,0.00250,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3 ,0.00250,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Probabilities
hyp1 0.98020
hyp2 0.00990
hyp3 0.00990
hyp1 still comes out as true, even though with the all-1 input on this table every hypothesis should come out as false! This is the problem caused by the asymmetry of the formula.
My next idea of a better formula was this: instead of subtractions, just either leave Csub=C or "flip" it: Csub=(1-C). The idea of the flipping is that the direction of C, whether it's less or greater than 0.5, shows the "value" of the event while the distance between C and 0.5 shows the confidence as such. The operation of flipping keeps the confidence (i.e. the distance between C and 0.5) the same while changing the direction. And if C was 0.5, the flipping will have no effect.
C would be flipped in the situation where it points against this hypothesis but for another top hypothesis. This situation likely means that this event is a symptom of another hypothesis but not really relevant for this one.
The logic will be like this:
If TCone and C point in the same direction (i.e. both >0.5 or both <0.5)
  then Csub = C;
else if there exists another top hypothesis with TCother pointing in the same direction as C
  then Csub = (1-C);
else Csub = C;
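The rule above can be sketched in Python like this. The names `same_direction` and `subtract_flip` are made up for the illustration, and a value of exactly 0.5 is treated as pointing in neither direction, which is my reading of "both >0.5 or both <0.5".

```python
def same_direction(a, b):
    """True if both values are above 0.5 or both are below 0.5."""
    return (a > 0.5 and b > 0.5) or (a < 0.5 and b < 0.5)


def subtract_flip(c, tc_one, tc_others):
    """Leave C alone or flip it to (1 - C).

    c:         input confidence of the event
    tc_one:    training confidence of the event for the chosen hypothesis
    tc_others: training confidences for the other top hypotheses
    """
    if same_direction(tc_one, c):
        return c
    if any(same_direction(tc, c) for tc in tc_others):
        # The event argues against this hypothesis but for another one.
        return 1.0 - c
    return c


print(subtract_flip(1.0, 1, [0, 0]))  # 1.0: the event agrees with the hypothesis
print(subtract_flip(1.0, 0, [1, 0]))  # 0.0: explained by another hypothesis, flip
print(subtract_flip(1.0, 0, [0, 0]))  # 1.0: nobody explains it, leave as-is
```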
And instead of hypotheses, we can work with the individual training cases. Instead of picking the top hypotheses, pick the top cases. And then do the subtraction/flipping logic case-by-case. Except perhaps exclude the other cases of the same hypothesis from consideration of "exists another" for flipping.
Let's work this logic through our examples.
The first example is the table
# tab17_01.txt
!,,evA,evB,evC
hyp1,1,1,0,0
hyp2,1,0,1,0
hyp3,1,0,0,1
and the all-one input: C(evA)=1, C(evB)=1, C(evC)=1. From the previous computation we know that all three hypotheses are the top probable hypotheses.
Let's check hyp1:
- For evA, TC=1 and C=1, point the same way, so Csub=1.
- For evB, TC=0 and C=1, point opposite, and there is TC(evB|hyp2)=1, so flip to Csub=0.
- For evC, TC=0 and C=1, point opposite, and there is TC(evC|hyp3)=1, so flip to Csub=0.
We've got the subtracted values, let's run them through the processing:
# in17_01_03.txt
evA,1
evB,0
evC,0

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_03.txt
! , ,evA ,evB ,evC ,
hyp1 ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1 0.33333
hyp2 0.33333
hyp3 0.33333
--- Applying event evA, c=0.990000
! , ,evA ,evB ,evC ,
hyp1 ,0.99000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.01000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.01000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
! , ,evA ,evB ,evC ,
hyp1 ,0.98010,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.00990,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.010000
! , ,evA ,evB ,evC ,
hyp1 ,0.97030,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2 ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3 ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1 0.99980
hyp2 0.00010
hyp3 0.00010
It says that hyp1 is quite true. Similarly, hyp2 and hyp3 would show themselves as true.
For the second example, let's look at the same table and the inputs of all-0: C(evA)=0, C(evB)=0, C(evC)=0. From the previous computation we know that all three hypotheses are again the top probable hypotheses.
Let's check hyp1:
- For evA, TC=1 and C=0, point opposite, and there are two other hypotheses with TC=0, so flip to Csub=1.
- For evB, TC=0 and C=0, point the same, so Csub=0.
- For evC, TC=0 and C=0, point the same, so Csub=0.
It again produced (1, 0, 0), and hyp1 would also show as true! But that's not a good result, it's a false positive. This idea didn't work out well.
The problem is that we need to differentiate between the states of the event that say "there is nothing wrong" and "there is something wrong", and flip the event only if it was pointing in the direction of "there is something wrong". That's what my first asymmetric logic did, it always assumed that C=0 meant "there is nothing wrong".
If we have the additional information about which state of the event is "normal" and which is "wrong", that would solve this problem. If we don't have this information, we can try to deduce it. A simple assumption could be that if a symptom is specific to some cases, then in most of the training cases it will be in the "normal" state, and only in a few of these specific cases it will be in the "wrong" state.
Of course, there will be exceptions, for example if a medical diagnostic system has an event with the question "is the patient feeling unwell?" then the answer for this question in most cases will be true, even though this is not the "normal" state. But it doesn't seem to cause problems: for most patients and most hypotheses TC and C on this event will be pointing the same way, and there will be no need for flipping anyway.
So, let's update the logic rules:
If TCone and C point in the same direction (i.e. both >0.5 or both <0.5)
  then Csub = C;
else if there exists another top hypothesis with TCother pointing in the same direction as C
    and most cases in the original training data were pointing opposite C
  then Csub = (1-C);
else Csub = C;
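A Python sketch of the updated rule follows. The function names are hypothetical, and "most cases" is interpreted here as a strict majority by plain count of the training cases (ignoring their weights), which is an assumption rather than something fixed by the rule itself.

```python
def same_direction(a, b):
    """True if both values are above 0.5 or both are below 0.5."""
    return (a > 0.5 and b > 0.5) or (a < 0.5 and b < 0.5)


def opposite_direction(a, b):
    """True if one value is above 0.5 and the other is below 0.5."""
    return (a > 0.5 and b < 0.5) or (a < 0.5 and b > 0.5)


def subtract_flip_majority(c, tc_one, tc_others, tc_training):
    """Flip C only when it looks like the "something wrong" state.

    tc_training: training confidences of this event over all the cases
                 of the original table.
    """
    if same_direction(tc_one, c):
        return c
    explained = any(same_direction(tc, c) for tc in tc_others)
    # Assumed meaning of "most": a strict majority by count.
    against = sum(1 for tc in tc_training if opposite_direction(tc, c))
    if explained and 2 * against > len(tc_training):
        return 1.0 - c
    return c


# tab17_01.txt, all-0 input, hyp1, evA: most cases agree with C=0, no flip.
print(subtract_flip_majority(0.0, 1, [0, 0], [1, 0, 0]))  # 0.0
# tab17_02.txt, all-0 input, hyp1, evB: most cases disagree with C=0, flip.
print(subtract_flip_majority(0.0, 1, [0, 1], [1, 0, 1]))  # 1.0
```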
With this revised logic the revisited case of the all-0 inputs (C(evA)=0, C(evB)=0, C(evC)=0) for hyp1 will be:
- For evA, TC=1 and C=0, point opposite, but most training cases (2 of 3) also point to C=0, so leave Csub=0.
- For evB, TC=0 and C=0, point the same, so Csub=0.
- For evC, TC=0 and C=0, point the same, so Csub=0.
With this unchanged input, hyp1 will still finish with the probability of 0.33, and it won't make the cut. Neither hyp2 nor hyp3 will make it when processed in the same way.
Let's look at the example with the opposite table
# tab17_02.txt
!,,evA,evB,evC
hyp1,1,0,1,1
hyp2,1,1,0,1
hyp3,1,1,1,0
and again the all-0 input. In it the handling of hyp1 will be:
- For evA, TC=0 and C=0, point the same, so Csub=0.
- For evB, TC=1 and C=0, point opposite, most training cases (2 of 3) point to C=1, and there is hyp2 with TC(evB|hyp2) = 0, so flip to Csub=1.
- For evC, TC=1 and C=0, point opposite, most training cases (2 of 3) point to C=1, and there is hyp3 with TC(evC|hyp3) = 0, so flip to Csub=1.
This result of (C(evA)=0, C(evB)=1, C(evC)=1) will match the training case for hyp1 exactly, and drive its probability all the way up, just as we wanted.
This last logic has managed to handle all the examples fairly decently.
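To tie the pieces together, here is a small end-to-end Python sketch that runs the last logic over both tables and both inputs. Everything in it is an illustration under assumptions: the weight update rule W *= TC*C + (1-TC)*(1-C) with capping is inferred from the printed runs, "most cases" is a strict majority by count, all three hypotheses are taken as the top set, and "pops up above all the others" is approximated by a hypothetical threshold of 0.5 on the resulting probability.

```python
def same_direction(a, b):
    return (a > 0.5 and b > 0.5) or (a < 0.5 and b < 0.5)


def opposite_direction(a, b):
    return (a > 0.5 and b < 0.5) or (a < 0.5 and b > 0.5)


def subtract(c, tc_one, tc_others, tc_training):
    """The last version of the rule: flip only against the training majority."""
    if same_direction(tc_one, c):
        return c
    explained = any(same_direction(tc, c) for tc in tc_others)
    against = sum(1 for tc in tc_training if opposite_direction(tc, c))
    if explained and 2 * against > len(tc_training):
        return 1.0 - c
    return c


def probabilities(table, inputs, cap=0.01):
    """Assumed weight update with capping; all initial weights are 1."""
    weights = {hyp: 1.0 for hyp in table}
    for ev, c in inputs.items():
        c = min(max(c, cap), 1.0 - cap)
        for hyp, tc in table.items():
            weights[hyp] *= tc[ev] * c + (1.0 - tc[ev]) * (1.0 - c)
    total = sum(weights.values())
    return {hyp: w / total for hyp, w in weights.items()}


def accepted(table, inputs, top, threshold=0.5):
    """Re-run the table once per top hypothesis with the subtracted input."""
    result = []
    for one in top:
        sub = {}
        for ev, c in inputs.items():
            others = [table[hyp][ev] for hyp in top if hyp != one]
            training = [tc[ev] for tc in table.values()]
            sub[ev] = subtract(c, table[one][ev], others, training)
        # Hypothetical acceptance test: the hypothesis must dominate.
        if probabilities(table, sub)[one] > threshold:
            result.append(one)
    return result


tab17_01 = {
    "hyp1": {"evA": 1, "evB": 0, "evC": 0},
    "hyp2": {"evA": 0, "evB": 1, "evC": 0},
    "hyp3": {"evA": 0, "evB": 0, "evC": 1},
}
tab17_02 = {
    "hyp1": {"evA": 0, "evB": 1, "evC": 1},
    "hyp2": {"evA": 1, "evB": 0, "evC": 1},
    "hyp3": {"evA": 1, "evB": 1, "evC": 0},
}
all_ones = {"evA": 1.0, "evB": 1.0, "evC": 1.0}
all_zeros = {"evA": 0.0, "evB": 0.0, "evC": 0.0}
top = ["hyp1", "hyp2", "hyp3"]

print(accepted(tab17_01, all_ones, top))   # ['hyp1', 'hyp2', 'hyp3']
print(accepted(tab17_01, all_zeros, top))  # []
print(accepted(tab17_02, all_zeros, top))  # ['hyp1', 'hyp2', 'hyp3']
print(accepted(tab17_02, all_ones, top))   # []
```

In a real run the top set would come from the first pass over the table rather than being listed by hand, and the threshold would need tuning against real data.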