The short summary of changes is:
- The %hyphash went away since it turned out to be unnecessary.
- Option "-compose N" added. N is the fraction in range [0..1] of what part of the weights needs to be composed into the per-hypothesis merged cases. 0 is the default and means that nothing will be composed. 1 means that all the data will be composed per hypothesis.
- Option "-msplit" added. It enables the splitting of the multi-hypothesis cases into multiple single-hypothesis cases (just as shown before, the same data gets copied to every hypothesis separately).
- The table loading code has been extended to allow the reading of the multi-hypothesis cases that are formatted on multiple lines. This allows to save the result of the composition and then load it for the diagnosis runs.
- The new logic is in the function compose_hypotheses(). A small standalone sketch of its per-event math follows right after this list.
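The sketch: for each hypothesis, the composite case adds up the weights of its cases; for each event, it tracks how much of that weight had the event relevant, and averages TC over only that relevant part. The toy numbers below are made up for the illustration (though they happen to match the evA column of the relevance demo at the end of this post):

#!/usr/bin/perl
# A sketch of the composition math for a single event of a single
# hypothesis: the composite's relevance R is the fraction of the weight
# where the event was relevant, and the composite's TC is the average
# of TC over only that relevant part of the weight.
use strict;

# toy cases: [ weight, TC(E|I), R(E|I) ]
my @cases = ([1., 1., 1.], [1., 0., 1.], [6., 0., 0.]);

my ($wt, $wtrel, $tcsum) = (0., 0., 0.);
foreach my $c (@cases) {
    my ($w, $tc, $r) = @$c;
    $wt    += $w;            # total weight of the hypothesis
    $wtrel += $w * $r;       # weight of the part where the event is relevant
    $tcsum += $w * $r * $tc; # TC summed over that relevant part
}
my $r_comp  = $wtrel / $wt;                          # 2/8 = 0.25
my $tc_comp = ($wtrel == 0.) ? 0. : $tcsum / $wtrel; # 1/2 = 0.5
printf("TC=%.5f R=%.2f\n", $tc_comp, $r_comp);       # prints "TC=0.50000 R=0.25"

Running it prints "TC=0.50000 R=0.25", the same values that compose_hypotheses() produces for that column in the demo at the end.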
Initially I wanted to show only the changed code but then decided that since I'm not uploading the files anywhere yet, it would be easier for the people who want to play with the code to copy-and-paste it than to assemble it from bits and pieces. So here we go, the whole new code:
# ex16_01run.pl

#!/usr/bin/perl
#
# Running of a Bayes expert system on a table of training cases.
# Includes the code to compose the training cases by hypothesis.
# And can parse back the multi-hypothesis cases printed in multi-line.

use strict;
use Carp;

our @evname; # the event names, in the table order
our %evhash; # hash of event names to indexes
our @case; # the table of training cases
    # each case is represented as a hash with elements:
    # "hyp" - array of hypothesis names that were diagnosed in this case
    # "wt" - weight of the case
    # "origwt" - original weight of the case as loaded from the table
    # "tc" - array of training confidence of events TC(E|I)
    # "r" - array of relevance of events R(E|I)
our %phyp; # will be used to store the computed probabilities

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor adding fuzziness to compensate
    # for overfitting in the training data;
    # limits the confidence to the range [$cap..1-$cap]
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome
our $composite = 0.; # fraction [0..1] of the cases to compose by hypothesis
our $mhyp_split = 0; # split the multi-hypothesis cases when computing the composites

# print formatting values
our $pf_w = 7; # width of every field
our $pf_tfmt = "%-${pf_w}.${pf_w}s"; # format of one text field
our $pf_nw = $pf_w-2; # width after the dot in the numeric fields
our $pf_nfmt = "%-${pf_w}.${pf_nw}f"; # format of one numeric field (does the better rounding)
our $pf_rw = 4; # width of the field R(E|I)
our $pf_rtfmt = "%-${pf_rw}.${pf_rw}s"; # format of text field of the same width as R(E|I)
our $pf_rnw = $pf_rw-2; # width after the dot for R(E|I)
our $pf_rnfmt = "%-${pf_rw}.${pf_rnw}f"; # format of the field for R(E|I)

sub load_table($) # (filename)
{
    my $filename = shift;

    @evname = ();
    %evhash = ();
    @case = ();

    my $nev = undef; # number of events minus 1

    confess "Failed to open '$filename': $!\n"
        unless open(INPUT, "<", $filename);

    my $prepend;
    while(<INPUT>) {
        chomp;
        s/,\s*$//; # remove the trailing comma if any
        if (/^\#/ || /^\s*$/) {
            # a comment line
        } elsif (/^\!/) {
            # row with event names
            @evname = split(/,/); # CSV format, the first 2 elements get skipped
            shift @evname;
            shift @evname;
        } elsif (/\+\s*$/) {
            # a split-line of hypothesis names
            $prepend .= $_;
        } else {
            $_ = $prepend . $_;
            $prepend = undef;
            my @s = split(/,/); # CSV format for a training case
            # Each line contains:
            # - list of hypotheses, separated by "+"
            # - weight (in this position it's compatible with the format
            #   of probability tables)
            # - list of event data that might be either of:
            #   - one number - the event's training confidence TC(E|I),
            #     implying R(E|I)=1
            #   - a dash "-" - the event is irrelevant, meaning R(E|I)=0
            #   - two numbers separated by a "/": TC(E|I)/R(E|I)

            my $c = { };

            my @hyps = split(/\+/, shift @s);
            for (my $i = 0; $i <= $#hyps; $i++) {
                $hyps[$i] =~ s/^\s+//;
                $hyps[$i] =~ s/\s+$//;
            }
            $c->{hyp} = \@hyps;
            $c->{origwt} = $c->{wt} = shift(@s) + 0.;

            if (defined $nev) {
                if ($nev != $#s) {
                    close(INPUT);
                    my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
                        $nev+1, $#s+1, $_);
                    confess $msg;
                }
            } else {
                $nev = $#s;
            }

            # the rest of the fields are the events in this case
            foreach my $e (@s) {
                if ($e =~ /^\s*-\s*$/) {
                    push @{$c->{r}}, 0.;
                    push @{$c->{tc}}, 0.;
                } else {
                    my @edata = split(/\//, $e);
                    push @{$c->{tc}}, ($edata[0] + 0.);
                    if ($#edata <= 0) {
                        push @{$c->{r}}, 1.;
                    } else {
                        push @{$c->{r}}, ($edata[1] + 0.);
                    }
                }
            }
            push @case, $c;
        }
    }
    close(INPUT);

    if ($prepend) {
        confess "The input contained the hanging hypothesis names: $prepend\n";
    }

    if ($#evname >= 0) {
        if ($#evname != $nev) {
            my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
                $nev+1, $#evname+1);
            confess $msg;
        }
    } else {
        for (my $i = 0; $i <= $nev; $i++) {
            push @evname, ($i+1)."";
        }
    }

    for (my $i = 0; $i <= $#evname; $i++) {
        $evname[$i] =~ s/^\s+//;
        $evname[$i] =~ s/\s+$//;
        $evhash{$evname[$i]} = $i;
    }
}

sub print_table()
{
    # the title row
    printf($pf_tfmt . ",", "!");
    printf($pf_tfmt . ",", "");
    foreach my $e (@evname) {
        printf($pf_tfmt . " " . $pf_rtfmt . ",", $e, "");
    }
    print("\n");
    # the cases
    for (my $i = 0; $i <= $#case; $i++) {
        my $c = $case[$i];
        # if more than one hypothesis, print each of them on a separate line
        for (my $j = 0; $j < $#{$c->{hyp}}; $j++) {
            printf($pf_tfmt . "+\n", $c->{hyp}[$j]);
        }
        printf($pf_tfmt . ",", $c->{hyp}[ $#{$c->{hyp}} ]);
        printf($pf_nfmt . ",", $c->{wt});
        for (my $j = 0; $j <= $#evname; $j++) {
            printf($pf_nfmt . "/" . $pf_rnfmt . ",", $c->{tc}[$j], $c->{r}[$j]);
        }
        print("\n");
    }
}

# Compute the hypothesis probabilities from weights
sub compute_phyp()
{
    %phyp = ();
    my $h;

    # start by getting the weights
    my $sum = 0.;
    for (my $i = 0; $i <= $#case; $i++) {
        my $w = $case[$i]->{wt};
        $sum += $w;
        foreach $h (@{$case[$i]->{hyp}}) {
            $phyp{$h} += $w;
        }
    }
    if ($sum != 0.) { # if 0 then all the weights are 0, leave them alone
        for $h (keys %phyp) {
            $phyp{$h} /= $sum;
        }
    }
}

# Print the probabilities of the hypotheses
sub print_phyp()
{
    printf("--- Probabilities\n");
    for my $h (sort keys %phyp) {
        printf($pf_tfmt . " " . $pf_nfmt . "\n", $h, $phyp{$h});
    }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
    my ($evi, $conf) = @_;

    # update the weights
    for (my $i = 0; $i <= $#case; $i++) {
        my $w = $case[$i]->{wt};
        my $r = $case[$i]->{r}[$evi];
        my $tc = $case[$i]->{tc}[$evi];
        $case[$i]->{wt} = $w * (1. - $r)
            + $w*$r*( $tc*$conf + (1. - $tc)*(1. - $conf) );
    }
}

# Apply an input file
sub apply_input($) # (filename)
{
    my $filename = shift;

    confess "Failed to open the input '$filename': $!\n"
        unless open(INPUT, "<", $filename);

    while(<INPUT>) {
        chomp;
        next if (/^\#/ || /^\s*$/); # a comment

        my @s = split(/,/);
        $s[0] =~ s/^\s+//;
        $s[0] =~ s/\s+$//;

        confess ("Unknown event name '" . $s[0] . "' in the input\n")
            unless exists $evhash{$s[0]};
        my $evi = $evhash{$s[0]};

        my $conf = $s[1] + 0.;
        if ($conf < $cap) {
            $conf = $cap;
        } elsif ($conf > 1.-$cap) {
            $conf = 1. - $cap;
        }

        printf("--- Applying event %s, c=%f\n", $s[0], $conf);
        &apply_event($evi, $conf);
        &print_table;
    }
    close(INPUT);
}

# compose the training cases by hypothesis
sub compose_hypotheses()
{
    if ($mhyp_split) {
        # the number of cases will grow, remember the index of the last original case
        my $lastcase = $#case;
        for (my $i = 0; $i <= $lastcase; $i++) {
            my $c = $case[$i];
            while ($#{$c->{hyp}} > 0) {
                # split a copy of the case for each resulting hypothesis in it
                my $hname = pop @{$c->{hyp}};
                push @case, {
                    hyp => [$hname],
                    wt => $c->{wt},
                    origwt => $c->{origwt},
                    tc => $c->{tc},
                    r => $c->{r},
                };
            }
        }
    }
    if ($composite <= 0.) {
        return; # nothing to do
    }

    # newly-generated composite hypotheses
    my %hyps; # keyed by the hypothesis names
    for (my $i = 0; $i <= $#case; $i++) {
        my $c = $case[$i];
        my $key = join("+", sort @{$c->{hyp}});

        if (!exists $hyps{$key}) {
            $hyps{$key} = {
                hyp => $c->{hyp},
                wt => 0.,
                wtrel => [], # weight of relevant part, by event
                tc => [], # initially will contain the sum
                r => [], # will be filled later
            };
        }
        my $hyp = $hyps{$key};

        my $wt = $c->{wt};
        $hyp->{wt} += $wt;
        for (my $e = 0; $e <= $#evname; $e++) {
            my $r = $c->{r}[$e];
            $hyp->{wtrel}[$e] += $wt * $r;
            $hyp->{tc}[$e] += $wt * $r * $c->{tc}[$e];
        }
    }

    if ($composite >= 1.) {
        # throw away the raw cases, since the hypotheses will replace them
        @case = ();
    } else {
        # 2nd pass: adjust the weight of the raw cases,
        # unless it's the only case of the hypothesis (then throw
        # away that hypothesis)
        for (my $i = 0; $i <= $#case; $i++) {
            my $c = $case[$i];
            my $key = join("+", sort @{$c->{hyp}});

            if ($hyps{$key}->{wt} == $c->{wt}) {
                delete $hyps{$key}; # hypothesis is a copy of the case
            } else {
                $c->{wt} *= (1. - $composite);
            }
        }
    }

    foreach my $h (sort keys %hyps) {
        my $hyp = $hyps{$h};
        my $wt = $hyp->{wt};
        for (my $e = 0; $e <= $#evname; $e++) {
            $hyp->{r}[$e] = $hyp->{wtrel}[$e] / $wt;
            if ($hyp->{wtrel}[$e] == 0.) {
                $hyp->{tc}[$e] = 0.;
            } else {
                $hyp->{tc}[$e] /= $hyp->{wtrel}[$e];
            }
        }
        # scale down the weight
        $hyp->{wt} *= $composite;
        $hyp->{origwt} = $hyp->{wt};
        delete $hyp->{wtrel};
        push @case, $hyp;
    }
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
    if ($1 eq "v") {
        $verbose = 1;
    } elsif ($1 eq "c") {
        shift @ARGV;
        $cap = $ARGV[0]+0.;
    } elsif ($1 eq "b") {
        shift @ARGV;
        $boundary = $ARGV[0]+0.;
    } elsif ($1 eq "compose") {
        shift @ARGV;
        $composite = $ARGV[0]+0.;
    } elsif ($1 eq "msplit") {
        $mhyp_split = 1;
    } else {
        confess "Unknown switch -$1";
    }
    shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
if ($composite > 0. || $mhyp_split) {
    &compose_hypotheses;
    printf "--- Composite ${pf_nfmt}:\n", $composite;
    &print_table;
}
&compute_phyp;
&print_phyp;
if ($#ARGV >= 1) {
    &apply_input($ARGV[1]);
    &compute_phyp;
    &print_phyp;
}
The basic example: loading of the multi-line cases and splitting of the multi-hypothesis cases:
# tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
$ perl ex16_01run.pl -msplit tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
The same but also fully composed by hypotheses:
$ perl ex16_01run.pl -msplit -compose 1 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,2.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,2.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,2.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
The same with only 0.5 of the weights composed:
$ perl ex16_01run.pl -msplit -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,1.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,1.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
It's a combination of the two examples above. The original split cases are left with half of their weight, and the other half of the weight has been composed per hypothesis.
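To check the arithmetic: after the split, each hypothesis has two cases of weight 1. Its composite first collects the full weight, 1+1 = 2, and then gets scaled down by the composed fraction: 2*0.5 = 1. Each raw split case keeps the rest of the weight: 1*(1-0.5) = 0.5. The composite's events average the cases as before; for example, hyp1's TC of evB is (1*1 + 1*0)/2 = 0.5.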
What if we try the composition without splitting?
$ perl ex16_01run.pl -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.66667
hyp3    0.66667
The result is exactly the same as the original. This is because the original had only one case for each hypothesis combination, so there was nothing to compose: the composite of a single case would be an exact copy of that case, and compose_hypotheses() detects this situation and throws such a composite away instead of pointlessly splitting the weight between two identical cases.
Now a demo of the relevance computation:
# tab16_02.txt
!,,evA,evB,evC
hyp1,1,1,0,-
hyp1,1,0,1,-
hyp1,6,-,-,1
$ perl ex16_01run.pl -compose 1 tab16_02.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/0.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/0.00,
hyp1   ,6.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,8.00000,0.50000/0.25,0.50000/0.25,1.00000/0.75,
--- Probabilities
hyp1    1.00000
According to the logic from the previous installment, the relevance of evA and evB gets set to 0.25 because 2/8 = 1/4 of the weight of their hypothesis has them relevant, and for evC the relevance gets set to 0.75 because 6/8 = 3/4 of the weight has it relevant.
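The TC values follow from the same weighted averaging, only over the relevant part of the weight: for evA, TC = (1*1 + 1*0)/2 = 0.5 (the case of weight 6 doesn't participate because evA has R=0 in it), and for evC, TC = (6*1)/6 = 1.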