You might have already noticed that no matter what, the sum of all the P(H) stays equal to 1 (within the rounding error). This is not an accident. The Bayes logic works with a complete set of mutually exclusive hypotheses: exactly one of them must be true, and together they must cover all the possible eventualities. The weights of the hypotheses change during the calculations but these properties stay, so the sum of all the weights must stay equal to 1.
In this sense spam detection is an ideal application for the Bayes logic: there are only two hypotheses, either the message is spam or it isn't, so exactly one of them must be true. Other uses, such as the diagnostics of repairs or illnesses, are not so clear-cut: in them having more than one hypothesis true at the same time is perfectly valid. The Bayesian approach can still be used for them, though it requires a certain amount of monkeying around. But fundamentally that is an abuse of the Bayesian system, not something it has been intended for.
To demonstrate the problems, I've made a simple set of training data where every case had resulted in two true hypotheses:
# the training data, from which tab09_01.txt is computed
evA evB evC
1 * hyp1,hyp2 1 1 0
1 * hyp2,hyp3 0 1 1
1 * hyp1,hyp3 1 0 1
I've translated it to the table of probabilities exactly as before. There are 2 rows with hyp1 and 3 rows total, so P(hyp1) = 2/3 = 0.66667. There are 2 rows with both hyp1 and evA, so P(evA|hyp1) = 2/2 = 1. There is 1 row with both hyp1 and evB, so P(evB|hyp1) = 1/2 = 0.5. And so on. It results in the table of probabilities:
# tab09_01.txt
!,,evA,evB,evC
hyp1,0.66667,1,0.5,0.5
hyp2,0.66667,0.5,1,0.5
hyp3,0.66667,0.5,0.5,1
Note that this table is different: the sum of all P(H) in it is 2, not 1, because every case lists 2 hypotheses.
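For reference, here is a minimal standalone sketch of this computation (my own illustration, not one of the ex* example scripts; the in-memory layout of the training cases is my assumption):

use strict; use warnings;

# each training case: [ [hypotheses in the outcome], [event values] ]
my @cases = (
    [ ["hyp1","hyp2"], [1, 1, 0] ],
    [ ["hyp2","hyp3"], [0, 1, 1] ],
    [ ["hyp1","hyp3"], [1, 0, 1] ],
);

my (%nhyp, %nev); # per-hypothesis case counts and per-event "true" counts
for my $case (@cases) {
    my ($hyps, $evs) = @$case;
    for my $h (@$hyps) {
        $nhyp{$h}++;
        $nev{$h}[$_] += $evs->[$_] for 0..$#$evs;
    }
}

my $ncases = scalar @cases;
for my $h (sort keys %nhyp) {
    my $ph = $nhyp{$h} / $ncases; # P(H): cases with H out of all the cases
    my @peh = map { $_ / $nhyp{$h} } @{$nev{$h}}; # P(E|H) for every event
    printf "%s,%.5f,%s\n", $h, $ph, join(",", map { sprintf("%.5g", $_) } @peh);
}

Running it prints the same three rows as in the table above, with the P(H) column summing up to 2 (within the rounding).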
Let's try to run it on this input:
# in09_01_01.txt
evA,0
evB,0
evC,1
$ perl ex06_01run.pl -v tab09_01.txt in09_01_01.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.66667,1.00000,0.50000,0.50000,
hyp2 ,0.66667,0.50000,1.00000,0.50000,
hyp3 ,0.66667,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=1.333340
Impossible events, division by -0.33334
Whoops, it crashes right away. P(E) = 1.33334 doesn't look right, the probabilities shouldn't be higher than 1, and this results in a negative divisor in the formula. Good thing the verification caught it! If we unroll the computation, the root cause of the problem is that the sum of P(H) is 2. P(E) is calculated by summing its probabilities for every hypothesis, weighted by P(H). If the sum of P(H) is 2 then the value of P(E) will be in the range [0..2], which is completely unexpected by the rest of the computation.
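Unrolled for this example, the first event evA gives:

P(evA) = P(hyp1)*P(evA|hyp1) + P(hyp2)*P(evA|hyp2) + P(hyp3)*P(evA|hyp3)
       = 0.66667*1 + 0.66667*0.5 + 0.66667*0.5 = 1.33334

and since evA arrived with the confidence 0, the formula then divides by 1 - P(evA) = -0.33334, which is exactly the division the verification refused to do.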
One way to try to resolve this problem is by re-weighting P(E), bringing it back into the range [0..1]. That can be done by dividing it by the sum of P(H). I've implemented this in the code of ex09_01run.pl. The computation of P(E) in the function apply_event() changes a little:
# ex09_01run.pl
my $pe = 0.;
my $sumph = 0.;
for (my $i = 0; $i <= $#hypname; $i++) {
    # P(E) is the sum of P(E|H) over all the hypotheses, weighted by P(H)
    $pe += $phyp[$i] * $peh[$i][$evi];
    $sumph += $phyp[$i];
}
# scale P(E) back into the range [0..1] by dividing by the sum of the priors
$pe /= $sumph;
Let's try this new version:
$ perl ex09_01run.pl -v tab09_01.txt in09_01_01.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.66667,1.00000,0.50000,0.50000,
hyp2 ,0.66667,0.50000,1.00000,0.50000,
hyp3 ,0.66667,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=0.666667
H hyp1 *0.000000/0.333333
H hyp2 *0.500000/0.333333
H hyp3 *0.500000/0.333333
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,1.00000,0.50000,1.00000,0.50000,
hyp3 ,1.00000,0.50000,0.50000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.750000
H hyp1 *0.500000/0.250000
H hyp2 *0.000000/0.250000
H hyp3 *0.500000/0.250000
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.00000,0.50000,1.00000,0.50000,
hyp3 ,2.00001,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=1.000000
H hyp1 *0.500000/1.000000
H hyp2 *0.500000/1.000000
H hyp3 *1.000000/1.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.00000,0.50000,1.00000,0.50000,
hyp3 ,2.00001,0.50000,0.50000,1.00000,
Much better, it didn't crash. But it produced a result with hyp3 having the probability of 2. Not really surprising if you remember that the sum of probabilities stays the same, so if some hypotheses become improbable, the extra probability has to go somewhere, in this case to the only probable hypothesis left, hyp3. But a probability of 2 is still not a right thing in the grand scheme of things.
By the way, in the real world where the sum of probabilities from the training data is closer to 1, and there are lots of events to consider, the effect will be less pronounced, and you might almost never see a P(H) above 1. Which doesn't mean that the effect is not present, it will still be messing with you. This example is specially constructed to highlight the effect and show why you should care.
So, how can we get rid of the probabilities of 2? The simple way that springs to mind is to do the same thing as we've done before for P(E): scale down the values proportionally to make them sum up to 1 and enter these scaled-down values into the table of probabilities.
This can be thought of as changing the computation of the table of probabilities in the following way: there are 2 rows with hyp1 and 6 hypothesis entries total in the outcomes of all the cases, so P(hyp1) = 2/6 = 0.33333. And so on. The computation of P(E|H) doesn't change.
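Just to show the scaling-down itself, here is a minimal sketch of it (my own illustration, not part of the example scripts) that divides the P(H) values from tab09_01.txt by their sum:

use strict; use warnings;

# the priors P(H) as computed from the multi-hypothesis training cases
my @phyp = (0.66667, 0.66667, 0.66667);

my $sumph = 0.;
$sumph += $_ for @phyp;            # 2.00001
@phyp = map { $_ / $sumph } @phyp; # each becomes 0.33333, and the sum becomes 1

printf "%.5f\n", $_ for @phyp;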
The table of probabilities becomes:
# tab09_02.txt
!,,evA,evB,evC
hyp1,0.33333,1,0.5,0.5
hyp2,0.33333,0.5,1,0.5
hyp3,0.33333,0.5,0.5,1
Since the sum of P(H) is now 1 (within the rounding), we can just as well use the original code to compute with this table:
$ perl ex06_01run.pl -v tab09_02.txt in09_01_01.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=0.666660
H hyp1 *0.000000/0.333340
H hyp2 *0.500000/0.333340
H hyp3 *0.500000/0.333340
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.49999,0.50000,1.00000,0.50000,
hyp3 ,0.49999,0.50000,0.50000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.749978
H hyp1 *0.500000/0.250022
H hyp2 *0.000000/0.250022
H hyp3 *0.500000/0.250022
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.00000,0.50000,1.00000,0.50000,
hyp3 ,0.99988,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.999880
H hyp1 *0.500000/0.999880
H hyp2 *0.500000/0.999880
H hyp3 *1.000000/0.999880
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.00000,0.50000,1.00000,0.50000,
hyp3 ,1.00000,0.50000,0.50000,1.00000,
The probabilities stay nicely within the range of [0..1].
Let's look at another approach. Instead of all this we can just split the cases in the training data that have two outcomes into two cases with a single outcome each (the same approach easily generalizes to a larger number of outcomes; a small code sketch of such a split follows the data below):
evA evB evC
1 * hyp1 1 0 1
1 * hyp1 1 1 0
1 * hyp2 1 1 0
1 * hyp2 0 1 1
1 * hyp3 0 1 1
1 * hyp3 1 0 1
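To show what this split amounts to mechanically, here is a minimal standalone sketch (my own illustration; the in-memory representation of a case is my assumption, not a format used by the ex* scripts):

use strict; use warnings;

# each case: [ weight, [hypotheses in the outcome], [event values] ]
my @cases = (
    [ 1, ["hyp1","hyp2"], [1, 1, 0] ],
    [ 1, ["hyp2","hyp3"], [0, 1, 1] ],
    [ 1, ["hyp1","hyp3"], [1, 0, 1] ],
);

# split each multi-hypothesis case into one case per hypothesis,
# keeping the weight and the event values
my @split;
for my $case (@cases) {
    my ($weight, $hyps, $evs) = @$case;
    push @split, [ $weight, [$_], [@$evs] ] for @$hyps;
}

printf "%d * %s %s\n", $_->[0], $_->[1][0], join(" ", @{$_->[2]}) for @split;

It prints the same six single-hypothesis cases as listed above, just in a different order.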
When we compute the table of probabilities for them, it turns out to be the same as the one we've got the other way, after the scaling-down of P(H):
# tab09_02.txt
!,,evA,evB,evC
hyp1,0.33333,1,0.5,0.5
hyp2,0.33333,0.5,1,0.5
hyp3,0.33333,0.5,0.5,1
So we can go two different ways and get the same result. Which is good. In the probability calculations things are interconnected, and if we can arrive at the same result in two ways, it means that the result makes sense.
But why did we go the first way? To keep the prior probabilities of the hypotheses more in line with reality. The scaling-down of the probabilities partially negates this effect. It can be compensated for by setting a lower probability boundary for the acceptable hypotheses. For example, if your probabilities have initially added up to 1.2 (i.e. 20% of the training cases had a two-hypothesis outcome) and your original boundary was 0.9, then it would be reasonable to set the boundary to 0.9/1.2 = 0.75. I believe that this is a large part of the reason why I've been finding boundary values around 0.7 to work well for me. But I don't have an exact proof.
Though of course if one hypothesis wins outright, its probability would be driven towards 1 anyway. There is nothing wrong with that: if that probability would have cleared the higher boundary of 0.9, it would clear the boundary of 0.75 as well.
In our example where the sum of probabilities added up to 2, the boundary would become 0.9/2 = 0.45. Let's look at the input that indicates two hypotheses being true:
# in09_01_02.txt
evA,0
evB,1
evC,1
$ perl ex06_01run.pl -v tab09_02.txt in09_01_02.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
P(E)=0.666660
H hyp1 *0.000000/0.333340
H hyp2 *0.500000/0.333340
H hyp3 *0.500000/0.333340
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.49999,0.50000,1.00000,0.50000,
hyp3 ,0.49999,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
P(E)=0.749978
H hyp1 *0.500000/0.749978
H hyp2 *1.000000/0.749978
H hyp3 *0.500000/0.749978
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.66667,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.666667
H hyp1 *0.500000/0.666667
H hyp2 *0.500000/0.666667
H hyp3 *1.000000/0.666667
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.50000,0.50000,1.00000,0.50000,
hyp3 ,0.50000,0.50000,0.50000,1.00000,
The two hypotheses nicely clear the boundary of 0.45.
But what if the input data indicates all 3 hypotheses being true? Such a case wasn't in the training data but that doesn't mean that it's impossible, it might just be relatively rare:
# in09_01_03.txt
evA,1
evB,1
evC,1
$ perl ex06_01run.pl -v tab09_02.txt in09_01_03.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666660
H hyp1 *1.000000/0.666660
H hyp2 *0.500000/0.666660
H hyp3 *0.500000/0.666660
! , ,evA ,evB ,evC ,
hyp1 ,0.50000,1.00000,0.50000,0.50000,
hyp2 ,0.25000,0.50000,1.00000,0.50000,
hyp3 ,0.25000,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
P(E)=0.625000
H hyp1 *0.500000/0.625000
H hyp2 *1.000000/0.625000
H hyp3 *0.500000/0.625000
! , ,evA ,evB ,evC ,
hyp1 ,0.40000,1.00000,0.50000,0.50000,
hyp2 ,0.40000,0.50000,1.00000,0.50000,
hyp3 ,0.20000,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
P(E)=0.600000
H hyp1 *0.500000/0.600000
H hyp2 *0.500000/0.600000
H hyp3 *1.000000/0.600000
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
Whoops, none of the hypotheses have cleared the 0.45 boundary. An absolute boundary, even a lowered one, is actually not such a great idea. A much better approach is to pick the few highest-ranked hypotheses that have close probabilities and together sum up to the uncorrected absolute boundary. In this example 0.33333+0.33333+0.33333 = 0.99999, and all the values are the same, so it makes sense to accept all three hypotheses.
To formalize this approach, the following condition can be used:
Pick N top hypotheses, for which P(H) >= limit/N
Pretty much, go hypothesis by hypothesis and add 1 to N while the next hypothesis satisfies this condition. However I've found that this can produce pretty long tails of hypotheses where the next one is just a little less probable than the previous one, enough to stay within the condition. This would be OK if they were all true, but they aren't, and this generates expensive false positives. So I've added one more condition to keep the sequence more flat:
P(H) of the Nth hypothesis must be >= 0.5*P(H) of the first hypothesis
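For example (with made-up numbers), suppose the final probabilities are 0.5, 0.4 and 0.1, and the boundary is 0.8. Going from the bottom: 0.1 fails because 0.1 < 0.8/3 = 0.267, so N drops to 2; then 0.4 passes because 0.4 >= 0.8/2 = 0.4 and 0.4 >= 0.5/2 = 0.25, and so does 0.5 above it. The result is the top two hypotheses accepted with N = 2.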
This can be added with the following code:
# ex09_02run.pl
# in options
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome
# ...

# pick and print the most probable hypotheses
sub pick_hypotheses()
{
    my $w = 7; # width of every field
    my $tfmt = "%-${w}.${w}s,"; # format of one text field
    my $nw = $w-2; # width after the dot in the numeric fields
    my $nfmt = "%-${w}.${nw}f,"; # format of one numeric field (does the better rounding)

    print "--- Result:\n";

    # the order is reverse: from the least probable to the most probable!
    my @order = sort {$phyp[$a] <=> $phyp[$b]} (0..$#phyp);
    # half of P(H) of the top-ranked hypothesis, which comes last in this order
    my $half_first = $phyp[$order[-1]] / 2.;
    my $n = $#phyp + 1;

    for my $i (@order) {
        my $ph = $phyp[$i];
        if ($ph >= $half_first && $ph >= $boundary / $n) {
            printf($tfmt, $hypname[$i]);
            printf($nfmt, $ph);
            printf("\n");
        } else {
            # reduce the count only while nothing has been printed yet;
            # once the printing starts, this branch is not entered any more
            --$n;
        }
    }
}
# main()
while ($ARGV[0] =~ /^-(.*)/) {
    if ($1 eq "v") {
        $verbose = 1;
    } elsif ($1 eq "c") {
        shift @ARGV;
        $cap = $ARGV[0]+0.;
    } elsif ($1 eq "b") {
        shift @ARGV;
        $boundary = $ARGV[0]+0.;
    } else {
        confess "Unknown switch -$1";
    }
    shift @ARGV;
}

&load_table($ARGV[0]);
&print_table;
if ($#ARGV >= 1) {
    &apply_input($ARGV[1]);
    &pick_hypotheses();
}
The hypotheses are tested in the backwards order because otherwise it would be impossible to say which value of N to use when testing the first hypothesis.
It might still be useful to put a limit on either the absolute value of P(H) that can be accepted or, equivalently, on what the maximal reasonable N can be, to avoid the situations where the probability is spread among a large number of hypotheses because the model has no good match to the input data. If N comes out over 10 or so, the results are probably invalid and none of them should be accepted. I'll also show a different way to look at this problem in the next installment.
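For illustration, such a cap could be added to pick_hypotheses() along these lines (a minimal sketch; the variable $max_n and the cut-off value of 10 are my assumptions, not something present in ex09_02run.pl):

our $max_n = 10; # hypothetical cap on how many hypotheses may be accepted

# inside the loop in pick_hypotheses(): by the time the first hypothesis
# gets accepted, $n is not reduced any more, so it already holds the final
# count N and can be checked right there
if ($ph >= $half_first && $ph >= $boundary / $n) {
    if ($n > $max_n) {
        print "No good match, the probability is spread too thin\n";
        return;
    }
    printf($tfmt, $hypname[$i]);
    printf($nfmt, $ph);
    printf("\n");
} else {
    --$n;
}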
To give an example of the result produced by this code, here are its results for the inputs shown before:
$ perl ex09_02run.pl tab09_02.txt in09_01_02.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=0.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.49999,0.50000,1.00000,0.50000,
hyp3 ,0.49999,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.66667,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.00000,1.00000,0.50000,0.50000,
hyp2 ,0.50000,0.50000,1.00000,0.50000,
hyp3 ,0.50000,0.50000,0.50000,1.00000,
--- Result:
hyp2 ,0.50000,
hyp3 ,0.50000,
$ perl ex09_02run.pl tab09_02.txt in09_01_03.txt
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Applying event evA, c=1.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.50000,1.00000,0.50000,0.50000,
hyp2 ,0.25000,0.50000,1.00000,0.50000,
hyp3 ,0.25000,0.50000,0.50000,1.00000,
--- Applying event evB, c=1.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.40000,1.00000,0.50000,0.50000,
hyp2 ,0.40000,0.50000,1.00000,0.50000,
hyp3 ,0.20000,0.50000,0.50000,1.00000,
--- Applying event evC, c=1.000000
! , ,evA ,evB ,evC ,
hyp1 ,0.33333,1.00000,0.50000,0.50000,
hyp2 ,0.33333,0.50000,1.00000,0.50000,
hyp3 ,0.33333,0.50000,0.50000,1.00000,
--- Result:
hyp1 ,0.33333,
hyp2 ,0.33333,
hyp3 ,0.33333,
And finally I want to return to the version of the code that scaled the sum of P(H) to 1 when computing P(E), in ex09_01run.pl. An interesting thing about it shows up when the sum of probabilities is not quite 1 due to the rounding, such as with 3 hypotheses at 0.33333 each. With this version of the code the sum stays faithfully at 0.99999 and doesn't get pulled up to 1 by the computations. Which might make the result a little more true, introducing less of the rounding errors. But in reality this little difference probably doesn't matter.