Monday, November 9, 2015

Bayes 16: code for hypothesis composition

This time I want to show the code that composes the separate cases into the merged hypotheses. I've shown before how to do it manually; now here is the code that does it.

The short summary of changes is:

  • The %hyphash went away since it turned out to be unnecessary.
  • Option "-compose N" added. N is a fraction in the range [0..1] specifying what part of the weights gets composed into the per-hypothesis merged cases. 0 is the default and means that nothing gets composed; 1 means that all the data gets composed per hypothesis.
  • Option "-msplit" added. It enables the splitting of the multi-hypothesis cases into multiple single-hypothesis cases (just as shown before, the same data gets copied to every hypothesis separately).
  • The table loading code has been extended to allow reading the multi-hypothesis cases that are formatted on multiple lines. This allows saving the result of the composition and then loading it for the diagnosis runs.
  • The new logic is in the function compose_hypotheses().
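
As a side illustration of the multi-line format, here is a minimal sketch (in Python, with made-up sample lines) of the joining rule that the extended loader follows: a line ending in "+" holds hypothesis names that get prepended to the next non-continuation line.

```python
# A sketch of the split-line joining rule: a trailing "+" means the line
# holds hypothesis names continued on a following line.
lines = [
    "hyp1   +",
    "hyp2   ,1.00000,1.00000/1.00,",
]
prepend = ""
records = []
for line in lines:
    line = line.rstrip().rstrip(",")  # drop the trailing comma, if any
    if line.endswith("+"):
        prepend += line               # partial list of hypothesis names
    else:
        records.append(prepend + line)
        prepend = ""
print(records)  # one joined record for the two physical lines
```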

Initially I wanted to show only the changed code, but then decided that since I'm not uploading the files anywhere yet, it would be easier for the people who want to play with the code to copy-and-paste the whole thing than to assemble it from bits and pieces. So here we go, the whole new code:

# ex16_01run.pl
#!/usr/bin/perl
#
# Running of a Bayes expert system on a table of training cases.
# Includes the code to compose the training cases by hypothesis.
# And can parse back the multi-hypothesis cases printed in multi-line.

use strict;
use Carp;

our @evname; # the event names, in the table order
our %evhash; # hash of event names to indexes
our @case; # the table of training cases
  # each case is represented as a hash with elements:
  # "hyp" - array of hypothesis names that were diagnosed in this case
  # "wt" - weight of the case
  # "origwt" - original weight of the case as loaded from the table
  # "tc" - array of training confidence of events TC(E|I)
  # "r" - array of relevance of events R(E|I)
our %phyp; # will be used to store the computed probabilities

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor adding fuzziness to compensate
  # for overfitting in the training data;
  # limits the confidence to the range [$cap..1-$cap]
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome
our $composite = 0.; # fraction [0..1] of the cases to compose by hypothesis
our $mhyp_split = 0; # split the multi-hypothesis cases when computing the composites

# print formatting values
our $pf_w = 7; # width of every field
our $pf_tfmt = "%-${pf_w}.${pf_w}s"; # format of one text field
our $pf_nw = $pf_w-2; # width after the dot in the numeric fields
our $pf_nfmt = "%-${pf_w}.${pf_nw}f"; # format of one numeric field (does the better rounding)
our $pf_rw = 4; # width of the field R(E|I)
our $pf_rtfmt = "%-${pf_rw}.${pf_rw}s"; # format of text field of the same width as R(E|I)
our $pf_rnw = $pf_rw-2; # width after the dot for R(E|I)
our $pf_rnfmt = "%-${pf_rw}.${pf_rnw}f"; # format of the field for R(E|I)

sub load_table($) # (filename)
{
  my $filename = shift;

  @evname = ();
  %evhash = ();
  @case = ();

  my $nev = undef; # number of events minus 1

  confess "Failed to open '$filename': $!\n"
    unless open(INPUT, "<", $filename);

  my $prepend;
  while(<INPUT>) {
    chomp;
    s/,\s*$//; # remove the trailing comma if any
    if (/^\#/ || /^\s*$/) {
      # a comment line
    } elsif (/^\!/) {
      # row with event names
      @evname = split(/,/); # CSV format, the first 2 elements get skipped
      shift @evname;
      shift @evname;
    } elsif (/\+\s*$/) {
      # a split-line of hypothesis names
      $prepend .= $_;
    } else {
      $_ = $prepend . $_;
      $prepend = undef;
      my @s = split(/,/); # CSV format for a training case
      # Each line contains:
      # - list of hypotheses, separated by "+"
      # - weight (in this position it's compatible with the format of probability tables)
      # - list of event data that might be either of:
      #   - one number - the event's training confidence TC(E|I), implying R(E|I)=1
      #   - a dash "-" - the event is irrelevant, meaning R(E|I)=0
      #   - two numbers separated by a "/": TC(E|I)/R(E|I)

      my $c = { };
      
      my @hyps = split(/\+/, shift @s);
      for (my $i = 0; $i <= $#hyps; $i++) {
        $hyps[$i] =~ s/^\s+//;
        $hyps[$i] =~ s/\s+$//;
      }

      $c->{hyp} = \@hyps;
      $c->{origwt} = $c->{wt} = shift(@s) + 0.;

      if (defined $nev) {
        if ($nev != $#s) {
          close(INPUT);
          my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
            $nev+1, $#s+1, $_);
          confess $msg;
        }
      } else {
        $nev = $#s;
      }

      # the rest of the fields are the events in this case
      foreach my $e (@s) {
        if ($e =~ /^\s*-\s*$/) {
          push @{$c->{r}}, 0.;
          push @{$c->{tc}}, 0.;
        } else {
          my @edata = split(/\//, $e);
          push @{$c->{tc}}, ($edata[0] + 0.);
          if ($#edata <= 0) {
            push @{$c->{r}}, 1.;
          } else {
            push @{$c->{r}}, ($edata[1] + 0.);
          }
        }
      }

      push @case, $c;
    }
  }
  close(INPUT);
  if ($prepend) {
    confess "The input contained the hanging hypothesis names: $prepend\n";
  }

  if ($#evname >= 0) {
    if ($#evname != $nev) {
      my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
        $nev+1, $#evname+1);
      confess $msg;
    }
  } else {
    for (my $i = 0; $i <= $nev; $i++) {
      push @evname, ($i+1)."";
    }
  }

  for (my $i = 0; $i <= $#evname; $i++) {
    $evname[$i] =~ s/^\s+//;
    $evname[$i] =~ s/\s+$//;
    $evhash{$evname[$i]} = $i;
  }
}

sub print_table()
{
  # the title row
  printf($pf_tfmt . ",", "!");
  printf($pf_tfmt . ",", "");
  foreach my $e (@evname) {
    printf($pf_tfmt . " " . $pf_rtfmt . ",", $e, "");
  }
  print("\n");
  # the cases
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    # if more than one hypothesis, print each of them on a separate line
    for (my $j = 0; $j < $#{$c->{hyp}}; $j++) {
      printf($pf_tfmt . "+\n", $c->{hyp}[$j]);
    }

    printf($pf_tfmt . ",", $c->{hyp}[ $#{$c->{hyp}} ]);
    printf($pf_nfmt . ",", $c->{wt});
    for (my $j = 0; $j <= $#evname; $j++) {
      printf($pf_nfmt . "/" . $pf_rnfmt . ",", $c->{tc}[$j], $c->{r}[$j]);
    }
    print("\n");
  }
}

# Compute the hypothesis probabilities from weights
sub compute_phyp()
{
  %phyp = ();
  my $h;

  # start by getting the weights
  my $sum = 0.;
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    $sum += $w;

    foreach $h (@{$case[$i]->{hyp}}) {
      $phyp{$h} += $w;
    }
  }

  if ($sum != 0.) { # if 0 then all the weights are 0, leave them alone
    for $h (keys %phyp) {
      $phyp{$h} /= $sum;
    }
  }
}


# Print the probabilities of the hypotheses
sub print_phyp()
{
  printf("--- Probabilities\n");
  for my $h (sort keys %phyp) {
    printf($pf_tfmt . " " . $pf_nfmt . "\n", $h, $phyp{$h});
  }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
  my ($evi, $conf) = @_;

  # update the weights
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    my $r = $case[$i]->{r}[$evi];
    my $tc = $case[$i]->{tc}[$evi];

    $case[$i]->{wt} = $w * (1. - $r)
      + $w*$r*( $tc*$conf + (1. - $tc)*(1. - $conf) );
  }
}


# Apply an input file
sub apply_input($) # (filename)
{
  my $filename = shift;

  confess "Failed to open the input '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    next if (/^\#/ || /^\s*$/); # a comment

    my @s = split(/,/);
    $s[0] =~ s/^\s+//;
    $s[0] =~ s/\s+$//;

    confess ("Unknown event name '" . $s[0] . "' in the input\n")
      unless exists $evhash{$s[0]};
    my $evi = $evhash{$s[0]};

    my $conf = $s[1] + 0.;
    if ($conf < $cap) {
      $conf = $cap;
    } elsif ($conf > 1.-$cap) {
      $conf = 1. - $cap;
    }
    printf("--- Applying event %s, c=%f\n", $s[0], $conf);
    &apply_event($evi, $conf);
    &print_table;
  }
  close(INPUT);
}

# compose the training cases by hypothesis
sub compose_hypotheses()
{
  if ($mhyp_split) {
    # the number of cases will grow, remember the index of last original case
    my $lastcase = $#case;
    for (my $i = 0; $i <= $lastcase; $i++) {
      my $c = $case[$i];
      while ($#{$c->{hyp}} > 0) {
        # split a copy of the case for each resulting hypothesis in it
        my $hname = pop @{$c->{hyp}};
        push @case, {
          hyp => [$hname],
          wt => $c->{wt},
          origwt => $c->{origwt},
          tc => $c->{tc},
          r => $c->{r},
        };
      }
    }
  }

  if ($composite <= 0.) {
    return; # nothing to do
  }

  # newly-generated composite hypotheses
  my %hyps; # keyed by the hypothesis names
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    my $key = join("+", sort @{$c->{hyp}});
    if (!exists $hyps{$key}) {
      $hyps{$key} = {
        hyp => $c->{hyp},
        wt => 0.,
        wtrel => [], # weight of relevant part, by event
        tc => [], # initially will contain the sum
        r => [], # will be filled later
      };
    }
    my $hyp = $hyps{$key};
    my $wt = $c->{wt};

    $hyp->{wt} += $wt;

    for (my $e = 0; $e <= $#evname; $e++) {
      my $r = $c->{r}[$e];
      $hyp->{wtrel}[$e] += $wt * $r;
      $hyp->{tc}[$e] += $wt * $r * $c->{tc}[$e];
    }
  }
  if ($composite >= 1.) {
    # throw away the raw cases, since the hypotheses will replace them
    @case = ();
  } else {
    # 2nd pass: adjust the weight of the raw cases,
    # unless it's the only case of the hypothesis (then throw
    # away that hypothesis)
    for (my $i = 0; $i <= $#case; $i++) {
      my $c = $case[$i];
      my $key = join("+", sort @{$c->{hyp}});
      if ($hyps{$key}->{wt} == $c->{wt}) {
        delete $hyps{$key}; # hypothesis is a copy of the case
      } else {
        $c->{wt} *= (1. - $composite);
      }
    }
  }

  foreach my $h (sort keys %hyps) {
    my $hyp = $hyps{$h};
    my $wt = $hyp->{wt};

    for (my $e = 0; $e <= $#evname; $e++) {
      $hyp->{r}[$e] = $hyp->{wtrel}[$e] / $wt;
      if ($hyp->{wtrel}[$e] == 0.) {
        $hyp->{tc}[$e] = 0.;
      } else {
        $hyp->{tc}[$e] /= $hyp->{wtrel}[$e];
      }
    }

    # scale down the weight
    $hyp->{wt} *= $composite;
    $hyp->{origwt} = $hyp->{wt};
    delete $hyp->{wtrel};
    push @case, $hyp;
  }
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
  if ($1 eq "v") {
    $verbose = 1;
  } elsif ($1 eq "c") {
    shift @ARGV;
    $cap = $ARGV[0]+0.;
  } elsif ($1 eq "b") {
    shift @ARGV;
    $boundary = $ARGV[0]+0.;
  } elsif ($1 eq "compose") {
    shift @ARGV;
    $composite = $ARGV[0]+0.;
  } elsif ($1 eq "msplit") {
    $mhyp_split = 1;
  } else {
    confess "Unknown switch -$1";
  }
  shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
if ($composite > 0. || $mhyp_split) {
  &compose_hypotheses;
  printf "--- Composite ${pf_nfmt}:\n", $composite;
  &print_table;
}
&compute_phyp;
&print_phyp;
if ($#ARGV >= 1) {
  &apply_input($ARGV[1]);
  &compute_phyp;
  &print_phyp;
}

The basic example, loading of a multi-line case and splitting of the multi-hypothesis cases:

#tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,

$ perl ex16_01run.pl -msplit tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

The same but also fully composed by hypotheses:

$ perl ex16_01run.pl -msplit -compose 1 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,2.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,2.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,2.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

The same with only 0.5 of the weights composed:

$ perl ex16_01run.pl -msplit -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,1.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,1.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

It's a combination of the two examples above. The original split cases are left with half of their weight, and the other half of the weight has been composed.
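The weight arithmetic of this run can be verified with a quick sketch (in Python, using the two split hyp1 cases from the table above):

```python
# Checking the -compose 0.5 weight arithmetic for hyp1: the raw cases keep
# (1 - composite) of their weight, the composed case gets the rest.
composite = 0.5
raw_weights = [1.0, 1.0]  # the two split hyp1 cases
merged_wt = sum(raw_weights) * composite               # weight of the composed case
scaled = [w * (1.0 - composite) for w in raw_weights]  # raw cases, scaled down
print(scaled, merged_wt)  # [0.5, 0.5] 1.0
```

The total weight per hypothesis is preserved: 0.5 + 0.5 + 1.0 = 2.0, the same as the two original unit-weight cases combined would give at full composition.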

What if we try the composition without splitting?

$ perl ex16_01run.pl -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.66667
hyp3    0.66667

The table comes out exactly the same as the original. This is because the original had only one case for each hypothesis combination, so there is nothing to merge, and splitting a single case's weight into a raw half and an identical composed half would be pointless (the code detects this situation and drops the redundant composite).
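The probabilities of 0.66667 are also easy to check; here is a sketch (in Python) of the same computation that compute_phyp() performs, adding each case's weight to every hypothesis named in it:

```python
# The same probability computation as compute_phyp(): a multi-hypothesis
# case contributes its full weight to each of its hypotheses.
cases = [(["hyp1", "hyp2"], 1.0), (["hyp2", "hyp3"], 1.0), (["hyp1", "hyp3"], 1.0)]
phyp = {}
total = 0.0
for hyps, wt in cases:
    total += wt
    for h in hyps:
        phyp[h] = phyp.get(h, 0.0) + wt
phyp = {h: w / total for h, w in phyp.items()}
print(phyp)  # each hypothesis appears in 2 of the 3 unit-weight cases: 2/3
```

Since the unsplit cases overlap, the probabilities add up to more than 1.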

Now a demo of the relevance computation:

# tab16_02.txt
!,,evA,evB,evC
hyp1,1,1,0,-
hyp1,1,0,1,-
hyp1,6,-,-,1

$ perl ex16_01run.pl -compose 1 tab16_02.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/0.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/0.00,
hyp1   ,6.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,8.00000,0.50000/0.25,0.50000/0.25,1.00000/0.75,
--- Probabilities
hyp1    1.00000

According to the logic from the previous installment, the relevance of evA and evB gets set to 0.25 because 2/8=1/4 of the total weight has them relevant, and for evC the relevance is set to 0.75 because 6/8=3/4 of the weight has it relevant.
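The same numbers can be reproduced by hand with the formulas from compose_hypotheses(); a sketch in Python, with the case data transcribed from tab16_02.txt:

```python
# Reproducing the composed hyp1 case: for each event,
# R = (sum of w*r) / (sum of w), TC = (sum of w*r*tc) / (sum of w*r).
cases = [  # (weight, [(tc, r) for evA, evB, evC])
    (1.0, [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]),
    (1.0, [(0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]),
    (6.0, [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]),
]
wt = sum(w for w, _ in cases)  # 8.0
composed = []
for e in range(3):
    wtrel = sum(w * ev[e][1] for w, ev in cases)  # weight of the relevant part
    tc = sum(w * ev[e][1] * ev[e][0] for w, ev in cases) / wtrel if wtrel else 0.0
    composed.append((tc, wtrel / wt))
print(composed)  # [(0.5, 0.25), (0.5, 0.25), (1.0, 0.75)]
```

This matches the composed line in the output: 0.50000/0.25, 0.50000/0.25, 1.00000/0.75.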
