Monday, November 2, 2015

Bayes 14: code for weight-based computation

As promised, here is the code that performs the computation directly from the table of the training cases, using the weights:

# ex14_01run.pl
#!/usr/bin/perl
#
# Running of a Bayes expert system on a table of training cases.

use strict;
use Carp;

our @evname; # the event names, in the table order
our %evhash; # hash of event names to indexes
our %hyphash; # hash of hypothesis names to the arrays of
  # references to all cases involving this hypothesis
our @case; # the table of training cases
  # each case is represented as a hash with elements:
  # "hyp" - array of hypotheis names that were diagnosed in this case
  # "wt" - weight of the case
  # "origwt" - original weight of the case as loaded form the table
  # "tc" - array of training confidence of events TC(E|I)
  # "r" - array of relevance of events R(E|I)
our %phyp; # will be used to store the computed probabilities

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor adding fuzziness to compensate
  # for overfitting in the training data;
  # limits the confidence to the range [$cap..1-$cap]
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome

# print formatting values
our $pf_w = 7; # width of every field
our $pf_tfmt = "%-${pf_w}.${pf_w}s"; # format of one text field
our $pf_nw = $pf_w-2; # width after the dot in the numeric fields
our $pf_nfmt = "%-${pf_w}.${pf_nw}f"; # format of one numeric field (does the better rounding)
our $pf_rw = 4; # width of the field R(E|I)
our $pf_rtfmt = "%-${pf_rw}.${pf_rw}s"; # format of text field of the same width as R(E|I)
our $pf_rnw = $pf_rw-2; # width after the dot for R(E|I)
our $pf_rnfmt = "%-${pf_rw}.${pf_rnw}f"; # format of the field for R(E|I)

sub load_table($) # (filename)
{
  my $filename = shift;

  @evname = ();
  %evhash = ();
  %hyphash = ();
  @case = ();

  my $nev = undef; # number of events minus 1

  confess "Failed to open '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    s/,\s*$//; # remove the trailing comma if any
    if (/^\#/ || /^\s*$/) {
      # a comment line
    } elsif (/^\!/) {
      # row with event names
      @evname = split(/,/); # CSV format, the first 2 elements get skipped
      shift @evname;
      shift @evname;
    } else {
      my @s = split(/,/); # CSV format for a training case
      # Each line contains:
      # - list of hypotheses, separated by "+"
      # - weight (in this position it's compatible with the format of probability tables)
      # - list of event data that might be either of:
      #   - one number - the event's training confidence TC(E|I), implying R(E|I)=1
      #   - a dash "-" - the event is irrelevant, meaning R(E|I)=0
      #   - two numbers separated by a "/": TC(E|I)/R(E|I)

      my $c = { };
      
      my @hyps = split(/\+/, shift @s);
      my %hypuniq;
      for (my $i = 0; $i <= $#hyps; $i++) {
        $hyps[$i] =~ s/^\s+//;
        $hyps[$i] =~ s/\s+$//;
        $hypuniq{$hyps[$i]} = 1;
      }
      foreach my $h (keys %hypuniq) {
        push @{$hyphash{$h}}, $c;
      }

      $c->{hyp} = \@hyps;
      $c->{origwt} = $c->{wt} = shift(@s) + 0.;

      if (defined $nev) {
        if ($nev != $#s) {
          close(INPUT);
          my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
            $nev+1, $#s+1, $_);
          confess $msg;
        }
      } else {
        $nev = $#s;
      }

      # the rest of fields are the events in this case
      foreach my $e (@s) {
        if ($e =~ /^\s*-\s*$/) {
          push @{$c->{r}}, 0.;
          push @{$c->{tc}}, 0.;
        } else {
          my @edata = split(/\//, $e);
          push @{$c->{tc}}, ($edata[0] + 0.);
          if ($#edata <= 0) {
            push @{$c->{r}}, 1.;
          } else {
            push @{$c->{r}}, ($edata[1] + 0.);
          }
        }
      }

      push @case, $c;
    }
  }
  close(INPUT);

  if ($#evname >= 0) {
    if ($#evname != $nev) {
      my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
        $nev+1, $#evname+1);
      confess $msg;
    }
  } else {
    for (my $i = 0; $i <= $nev; $i++) {
      push @evname, ($i+1)."";
    }
  }

  for (my $i = 0; $i <= $#evname; $i++) {
    $evname[$i] =~ s/^\s+//;
    $evname[$i] =~ s/\s+$//;
    $evhash{$evname[$i]} = $i;
  }
}

sub print_table()
{
  # the title row
  printf($pf_tfmt . ",", "!");
  printf($pf_tfmt . ",", "");
  foreach my $e (@evname) {
    printf($pf_tfmt . " " . $pf_rtfmt . ",", $e, "");
  }
  print("\n");
  # the cases
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    # if more than one hypothesis, print each of them on a separate line
    for (my $j = 0; $j < $#{$c->{hyp}}; $j++) {
      printf($pf_tfmt . "+\n", $c->{hyp}[$j]);
    }

    printf($pf_tfmt . ",", $c->{hyp}[ $#{$c->{hyp}} ]);
    printf($pf_nfmt . ",", $c->{wt});
    for (my $j = 0; $j <= $#evname; $j++) {
      printf($pf_nfmt . "/" . $pf_rnfmt . ",", $c->{tc}[$j], $c->{r}[$j]);
    }
    print("\n");
  }
}

# Compute the hypothesis probabilities from weights
sub compute_phyp()
{
  %phyp = ();
  my $h;

  # start by getting the weights
  my $sum = 0.;
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    $sum += $w;

    foreach $h (@{$case[$i]->{hyp}}) {
      $phyp{$h} += $w;
    }
  }

  if ($sum != 0.) { # if 0 then all the weights are 0, leave them alone
    for $h (keys %phyp) {
      $phyp{$h} /= $sum;
    }
  }
}


# Print the probabilities of the hypotheses
sub print_phyp()
{
  printf("--- Probabilities\n");
  for my $h (sort keys %phyp) {
    printf($pf_tfmt . " " . $pf_nfmt . "\n", $h, $phyp{$h});
  }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
  my ($evi, $conf) = @_;

  # update the weights
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    my $r = $case[$i]->{r}[$evi];
    my $tc = $case[$i]->{tc}[$evi];

    $case[$i]->{wt} = $w * (1. - $r)
      + $w*$r*( $tc*$conf + (1. - $tc)*(1. - $conf) );
  }
}


# Apply an input file
sub apply_input($) # (filename)
{
  my $filename = shift;

  confess "Failed to open the input '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    next if (/^\#/ || /^\s*$/); # a comment

    my @s = split(/,/);
    $s[0] =~ s/^\s+//;
    $s[0] =~ s/\s+$//;

    confess ("Unknown event name '" . $s[0] . "' in the input\n")
      unless exists $evhash{$s[0]};
    my $evi = $evhash{$s[0]};

    my $conf = $s[1] + 0.;
    if ($conf < $cap) {
      $conf = $cap;
    } elsif ($conf > 1.-$cap) {
      $conf = 1. - $cap;
    }
    printf("--- Applying event %s, c=%f\n", $s[0], $conf);
    &apply_event($evi, $conf);
    &print_table;
  }
  close(INPUT);
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
  if ($1 eq "v") {
    $verbose = 1;
  } elsif ($1 eq "c") {
    shift @ARGV;
    $cap = $ARGV[0]+0.;
  } elsif ($1 eq "b") {
    shift @ARGV;
    $boundary = $ARGV[0]+0.;
  } else {
    confess "Unknown switch -$1";
  }
  shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
&compute_phyp;
&print_phyp;
if ($#ARGV >= 1) {
  &apply_input($ARGV[1]);
  &compute_phyp;
  &print_phyp;
}

As you can see, the function apply_event() became much simpler, and there is no chance of division by 0 anymore. The function apply_input() stayed exactly the same; only the functions called by it have changed.
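
Spelled out in the notation of this series (W for the case weight, C for the event confidence from the input), apply_event() performs this update for each case I and event E:

W := W*(1 - R(E|I)) + W*R(E|I)*( TC(E|I)*C + (1 - TC(E|I))*(1 - C) )

With R(E|I)=1 this reduces to the plain multiplication by the degree of match between TC(E|I) and the confidence, and with R(E|I)=0 the weight stays unchanged, which is what makes the irrelevant events harmless.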

The inputs are backwards-compatible with the previously shown examples, although the field that formerly held the hypothesis probabilities now holds the case weights. The new code works with weights throughout and converts them to hypothesis probabilities only at the end, for display.

There are a couple of new features supported in the inputs. First, the field for hypothesis names is now allowed to contain multiple hypotheses, separated by the plus sign. That allows entering the cases with multi-hypothesis results.

Second, the fields that used to contain the conditional probabilities of the events can now contain both the training confidence values and the relevance values of the events. These fields may have one of three formats (a small parsing sketch follows the list):

  • A single number: the confidence value TC(E|I), similar to the previous P(E|H). The relevance R(E|I) is assumed to be 1.
  • The character "-": the event is not relevant for this case, meaning that the relevance value R(E|I) is 0 and TC(E|I) doesn't matter.
  • Two numbers separated by a "/": the first one is TC(E|I) and the second one is R(E|I).
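
Here is a minimal sketch of the per-field parsing, mirroring the logic in load_table() above (the helper name parse_event_field is my own, for illustration only):

sub parse_event_field($) # (field) -> (tc, r)
{
  my $e = shift;
  return (0., 0.) if ($e =~ /^\s*-\s*$/); # irrelevant: R(E|I)=0
  my @edata = split(/\//, $e);
  my $tc = $edata[0] + 0.; # TC(E|I)
  my $r = ($#edata <= 0) ? 1. : ($edata[1] + 0.); # R(E|I), defaults to 1
  return ($tc, $r);
}

# parse_event_field("0.75")     returns (0.75, 1)
# parse_event_field("-")        returns (0, 0)
# parse_event_field("0.75/0.5") returns (0.75, 0.5)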

Let's look at some examples.

The very first example I showed was this:

# tab06_01_01.txt
!,,evA,evB,evC
hyp1,0.66667,1,0.66667,0.66667
hyp2,0.33333,0,0,1

The very first input I showed was this:

# in06_01_01_01.txt
evA,1
evB,0
evC,0

Let's compare the old and the new results. Old:

$ perl ex06_01run.pl tab06_01_01.txt in06_01_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

New:

$ perl ex14_01run.pl tab06_01_01.txt in06_01_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.66667,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.33333,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.66667,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.22222,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.07407,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    1.00000
hyp2    0.00000

The result is the same, although the intermediate data is printed as weights, not probabilities, and the printed table contains the relevance information (all the relevance values are at 1 here).

This table of probabilities was produced from 6 cases for hyp1 and 3 cases for hyp2:

         evA evB evC
4 * hyp1 1   1   1
2 * hyp1 1   0   0
3 * hyp2 0   0   1
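
The folding is just a weighted average of the event values per hypothesis. A small illustrative sketch (not part of the expert system code) that reproduces the hyp1 row of the combined table:

# fold the raw hyp1 cases into one row by the weighted average of TC
use strict;
my @raw = ([4, 1, 1, 1], [2, 1, 0, 0]); # [weight, evA, evB, evC]
my $wsum = 0.;
my @tc = (0., 0., 0.);
foreach my $c (@raw) {
  $wsum += $c->[0];
  $tc[$_] += $c->[0] * $c->[$_+1] for (0..2);
}
printf("hyp1,%g,%.5f,%.5f,%.5f\n", $wsum, map { $_ / $wsum } @tc);
# prints: hyp1,6,1.00000,0.66667,0.66667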

Before entering the raw cases, let's look at the same combined probability table with weights entered directly instead of probabilities:

# tab14_01a.txt
!,,evA,evB,evC
hyp1,6,1,0.66667,0.66667
hyp2,3,0,0,1

$ perl ex14_01run.pl tab14_01a.txt in06_01_01_01.txt 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.99998,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.66665,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    1.00000
hyp2    0.00000

The end result is the same, and the intermediate weights are simply scaled up proportionally. Note that even though the input matches one of the original training cases one-to-one, the weight of hyp1 ends up at only 0.66665. That's because the table entry was produced by folding two kinds of cases, and the input matched only one-third of the total weight of the cases. Since this particular kind of case originally had the weight of 2, the result is approximately 2*1/3 = 0.67.
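
Tracing the hyp1 row through the weight updates shows where 0.66665 comes from (this is just the arithmetic of apply_event(); all relevances are 1 here, so only the match term matters):

w = 6
evA, c=1: w = 6 * (1*1 + 0*0)                   = 6
evB, c=0: w = 6 * (0.66667*0 + 0.33333*1)       = 1.99998
evC, c=0: w = 1.99998 * (0.66667*0 + 0.33333*1) = 0.66665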

If we enter the raw training cases into the table, the result changes:

# tab14_01b.txt
!,,evA,evB,evC
hyp1,4,1,1,1
hyp1,2,1,0,0
hyp2,3,0,0,1

$ perl ex14_01run.pl tab14_01b.txt in06_01_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,4.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,4.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    1.00000
hyp2    0.00000

Now the resulting probability is the same, but the final weight of 2 matches the weight of the matching case in the training table.

Let's see how it handles an impossible input:

# in08_01_01.txt
evC,0
evA,0
evB,0

$ perl ex14_01run.pl tab14_01a.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.99998,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evA, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.00000
hyp2    0.00000

Since this case has no match in the training table, the weights come out as 0. The computation of the probabilities would have required dividing the weights by their sum, but since the sum is 0, the code just leaves the probabilities at 0 to avoid the division by 0.

It's interesting to compare the results with capping for the two different kinds of tables:

$ perl ex14_01run.pl -c 0.01 tab14_01a.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,2.01998,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.03000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.02020,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.02970,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00680,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.02940,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.18784
hyp2    0.81216

$ perl ex14_01run.pl -c 0.01 tab14_01b.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,4.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.04000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,1.98000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.03000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00040,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.01980,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.02970,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.01960,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.02940,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.40005
hyp2    0.59995

They came out different. Why? In the second run, the first training case for hyp1 mismatches the values of all 3 input events. Its weight gets multiplied by 0.01 thrice and becomes very close to 0. The second training case for hyp1 and the training case for hyp2 mismatch only one input event each, so their weights get multiplied by 0.01 only once, and their relative weights decide the resulting probabilities. They were 2:3 to start with, and they stayed at 2:3 (the small deviation is contributed by the first training case for hyp1).

On the other hand, in the first run all the cases for hyp1 were lumped into one line, and the average content of that line was seriously tilted towards the case that mismatched all 3 events. Thus hyp1 ended up with a much lower weight, while hyp2 ended up with exactly the same weight as in the second run, so it outweighed hyp1 much more heavily.
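
The arithmetic behind the contrast, traced from the printouts above (all relevances are 1 here):

Folded table, hyp1 row (weight 6, TC = 1, 0.66667, 0.66667):
  evC, c=0.01: 6 * (0.66667*0.01 + 0.33333*0.99)       = 2.01998
  evA, c=0.01: 2.01998 * (1*0.01 + 0*0.99)             = 0.02020
  evB, c=0.01: 0.02020 * (0.66667*0.01 + 0.33333*0.99) = 0.00680

Raw table, the surviving hyp1 case (weight 2, TC = 1, 0, 0):
  evC, c=0.01: 2 * (0*0.01 + 1*0.99)       = 1.98000
  evA, c=0.01: 1.98000 * (1*0.01 + 0*0.99) = 0.01980
  evB, c=0.01: 0.01980 * (0*0.01 + 1*0.99) = 0.01960

In both runs hyp2 ends up at 3 * 0.01 * 0.99 * 0.99 = 0.02940, so the difference comes entirely from the hyp1 side.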

To look at the effects of the relevance values and of the training cases with multi-hypothesis results, let's revisit the example from the 9th and 10th parts. First, a training table with multiple hypotheses, as in part 9:

# tab14_02a.txt
!,,evA,evB,evC
hyp1+hyp2,1,1,1,0
hyp2+hyp3,1,0,1,1
hyp1+hyp3,1,1,0,1

With the input data:

# in09_01_01.txt
evA,0
evB,0
evC,1

$ perl ex14_01run.pl -c 0.01 tab14_02a.txt in09_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.66667
hyp3    0.66667
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,0.01000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,0.99000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,0.01000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,0.00010,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,0.00990,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,0.00990,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,0.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,0.00980,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,0.00980,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.50003
hyp2    0.50003
hyp3    0.99995

The capping was needed since the input doesn't match any of the training cases, but in the end it points pretty conclusively to the hypothesis hyp3. The multi-hypothesis cases are printed out in the intermediate results with one hypothesis per line (this format cannot be read back as input).

Now a variation of the example from part 10, where I've manually decided which events should be relevant to which hypotheses, using the same input as above. Only this table covers all 3 hypotheses at once, not one hypothesis at a time:

# tab14_02b.txt
!,,evA,evB,evC
hyp1,1,1,-,-
hyp2,1,-,1,-
hyp3,1,-,-,1

$ perl ex14_01run.pl tab14_02b.txt in09_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,0.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evC, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,0.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.00000
hyp2    0.00000
hyp3    1.00000

You can see in the printout that some of the relevances have been set to 1 and some to 0. This time it picked hyp3 with full certainty, without even a need for capping.

Now the same table but the input with all events true:

# in10_01_02.txt
evA,1
evB,1
evC,1

$ perl ex14_01run.pl tab14_02b.txt in10_01_02.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evB, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evC, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

As far as the weights are concerned, all the hypotheses got the full match. But the probabilities have naturally gotten split three-way. Let's contrast it with the result of another input:

# in08_01_01.txt
evC,0
evA,0
evB,0

$ perl ex14_01run.pl -c 0.01 tab14_02b.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,0.01000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.01000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,0.01000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.01000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,0.01000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,0.01000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

In this run the probabilities also split three-way. But how do we know that for the first input (1,1,1) we should pick all 3 hypotheses, and for the second input (0,0,0) we should pick none? Both are split evenly, so we can't pick based on the rule of even splitting discussed in part 9; both would fit it. One indication is that the second input required capping, or it would have produced the probabilities of 0. For another indication we can compare the final weights of the cases with their initial weights. In the first run they stayed the same. In the second run they got multiplied by 0.01 by the capping. Thus we can say that neither case really matched the input. Since only one multiplication by 0.01 was done, each case had only one mismatch. Is one mismatch that bad? Since each hypothesis has only one relevant event, we can say that yes, the mismatch in this one event is really bad, and all these hypotheses should be considered false.

With larger sets of training data, it would be possible to make decisions based on how many relevant events are available for each case, and what fraction of them is allowed to mismatch before we consider the case inapplicable. And if a hypothesis has no applicable cases, it would be considered false.
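
A minimal sketch of that idea, working over the globals of the code above (the helper name, the threshold, and the trick of estimating the mismatch count from the weight decay are all my assumptions, not part of the system):

# hypothetical post-processing: estimate per-case mismatches from the weight
# decay and mark the hypotheses that have no applicable cases as false;
# assumes capping was used, so each full mismatch scaled the weight by ~$cap
sub check_applicable($) # (max fraction of relevant events allowed to mismatch)
{
  my $maxfrac = shift;
  my %applicable;
  foreach my $c (@case) {
    my $nrel = grep { $_ > 0. } @{$c->{r}}; # count of relevant events
    next unless ($nrel > 0 && $c->{wt} > 0. && $c->{origwt} > 0.
      && $cap > 0. && $cap < 1.);
    # each mismatched event multiplied the weight by roughly $cap
    my $mismatches = log($c->{wt} / $c->{origwt}) / log($cap);
    if ($mismatches / $nrel <= $maxfrac) {
      $applicable{$_} = 1 foreach (@{$c->{hyp}});
    }
  }
  foreach my $h (sort keys %hyphash) {
    printf("%s: %s\n", $h, $applicable{$h} ? "applicable" : "false");
  }
}

For the examples above, &check_applicable(0.5) would report all three hypotheses as applicable for the input (1,1,1) (no weight decay, so 0 mismatches) and as false for the input (0,0,0) (one mismatch out of one relevant event in every case).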
