Sunday, October 18, 2015

Bayes 6: basic code example

After going through in words, I want to show you the code, with a simple example. The code is fairly independent of the actual knowledge that is contained in the probability tables, so I'll make the probability tables loadable.

The probability tables format will be like this:

# this is a simple table example
!,,evA,evB,evC
hyp1,0.66667,1,0.66667,0.66667
hyp2,0.33333,0,0,1

The comments start with a "#".

The data lines contain the comma-separated values (CSV). The lines, except the one that starts with a "!" describe the hypotheses. The first field is the hypothesis name, the second field is the prior probability P(H), and the rest of the fields are the conditional probabilities P(E|H) for every event and this hypothesis. The lines may contain a trailing comma, it will get ignored:

hyp_name,P(H),P(E1/H),P(E2/H),...

The line that starts with "!" contains the event names. If this line is absent, the events will be assigned the numeric names, starting from 1. The first two fields of the line are ignored, and the rest contain the names of the events, forming the table header, where each table column correcponds to an event and its conditional probabilities.

The same table can also be formatted with the nice alignment:

!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,

When my code runs, it will print its information about the tables in this aligned format. I've arbitrarily limited the width of the columns at 7, so any longer names will be truncated in the print-out. Good enough for the examples. The spaces around the hypothesis and event names will be ignored.

The other data item will be the input data describing a particular case. It consists of the CSV lines in format

event_name,confidence

The event name must match one of the names defined in the table, and the confidence is in the range [0..1], with 0 meaning that the event is definitely false, and 1 meaning that it's definitely true.

Now to the code. Eventually I'll make the code downloadable as files but for now I'll probably do the changes in it for some time, so I'l just copy-paste the current version here to reduce the confusion between the code and text. I'll talk more about how and why it works but first just the pasted code and some examples of its use.

#!/usr/bin/perl
#
# Running of a Bayes expert system.

use strict;
use Carp;

our @evname; # the event names
our %evhash; # hash of event names to indexes
our @hypname; # the hypothesis names
our @phyp; # the hypothesis probabilities
our @peh; # probabilities P(E/H) by hypothesis by event

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor accounting for fuzziness of training data,
  # limits the confidence to the range [$cap..1-$cap]

sub load_table($) # (filename)
{
  my $filename = shift;

  @evname = ();
  %evhash = ();
  @hypname = ();
  @phyp = ();
  @peh = ();

  my $nev = undef; # number of events minus 1

  confess "Failed to open '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    s/,\s*$//; # remove the trailing comma if any
    if (/^\#/ || /^\s*$/) {
      # a comment line
    } elsif (/^\!/) {
      # row with event names
      @evname = split(/,/); # CSV format, the first 2 elements gets skipped
      shift @evname;
      shift @evname;
    } else {
      my @s = split(/,/); # CSV format for hypothesis
      push @hypname, shift @s;
      push @phyp, shift @s;

      if (defined $nev) {
        if ($nev != $#s) {
          close(INPUT);
          my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
            $nev+1, $#s+1, $_);
          confess $msg;
        }
      } else {
        $nev = $#s;
      }
      # the rest of fields are probabilities, just take the whole array by reference
      push @peh, \@s;
    }
  }
  close(INPUT);

  if ($#evname >= 0) {
    if ($#evname != $nev) {
      my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
        $nev+1, $#evname+1);
      confess $msg;
    }
  } else {
    for (my $i = 0; $i <= $nev; $i++) {
      push @evname, ($i+1)."";
    }
  }

  for (my $i = 0; $i <= $#evname; $i++) {
    $evname[$i] =~ s/^\s+//;
    $evname[$i] =~ s/\s+$//;
    $evhash{$evname[$i]} = $i;
  }
  for (my $i = 0; $i <= $#hypname; $i++) {
    $hypname[$i] =~ s/^\s+//;
    $hypname[$i] =~ s/\s+$//;
  }
}

sub print_table()
{
  my $w = 7; # width of every field
  my $tfmt = "%-${w}.${w}s,"; # format of one text field
  my $nw = $w-2; # width after the dot in the numeric fields field
  my $nfmt = "%-${w}.${nw}f,"; # format of one numeric field (does the better rounding)
  
  # the title row
  printf($tfmt, "!");
  printf($tfmt, "");
  foreach my $e (@evname) {
    printf($tfmt, $e);
  }
  print("\n");
  # the hypotheses
  for (my $i = 0; $i <= $#hypname; $i++) {
    printf($tfmt, $hypname[$i]);
    printf($nfmt, $phyp[$i]);
    foreach my $e (@{$peh[$i]}) {
      printf($nfmt, $e);
    }
    print("\n");
  }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
  my ($evi, $conf) = @_;

  # compute P(E)
  my $pe = 0;
  for (my $i = 0; $i <= $#hypname; $i++) {
    $pe += $phyp[$i] * $peh[$i][$evi];
  }

  if ($verbose) {
    printf("P(E)=%f\n", $pe);
  }
  
  #compute the divisor
  my $div = $pe*$conf + (1. - $pe)*(1. - $conf);
  if ($div < 1e-10) {
    confess (sprintf("Impossible events, division by %g\n", $div));
  }

  # update the hypothesis probabilities
  # (it works more efficiently if the array $peh is indexed by the event first
  # but for the simple example who cares)
  for (my $i = 0; $i <= $#hypname; $i++) {
    my $mul = $peh[$i][$evi]*$conf + (1. - $peh[$i][$evi])*(1. - $conf);
    if ($verbose) {
      printf("H %s *%f/%f\n", $hypname[$i], $mul, $div);
    }
    $phyp[$i] *= $mul/$div;
  }
}

# Apply an input file
sub apply_input($) # (filename)
{
  my $filename = shift;

  confess "Failed to open the input '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    next if (/^\#/ || /^\s*$/); # a comment

    my @s = split(/,/);
    $s[0] =~ s/^\s+//;
    $s[0] =~ s/\s+$//;

    confess ("Unknown event name '" . $s[0] . "' in the input\n")
      unless exists $evhash{$s[0]};
    my $evi = $evhash{$s[0]};

    my $conf = $s[1] + 0.;
    if ($conf < $cap) {
      $conf = $cap;
    } elsif ($conf > 1.-$cap) {
      $conf = 1. - $cap;
    }
    printf("--- Applying event %s, c=%f\n", $s[0], $conf);
    &apply_event($evi, $conf);
    &print_table;
  }
  close(INPUT);
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
  if ($1 eq "v") {
    $verbose = 1;
  } elsif ($1 eq "c") {
    shift @ARGV;
    $cap = $ARGV[0]+0.;
  } else {
    confess "Unknown switch -$1";
  }
  shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
if ($#ARGV >= 1) {
  &apply_input($ARGV[1]);
}

How to run it: Suppose the table above is locate din the file tab06_01_01.txt, and the code is in ex06_01run.pl. The first the code can do is just load the table and print it back with the nice formatting:

$ perl ex06_01run.pl tab06_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,

Next, it can apply the input data for computation. Suppose the input data is in in06_01_01_01.txt:

evA,1
evB,0
evC,0

Let's apply it:

$ perl ex06_01run.pl tab06_01_01.txt in06_01_01_01.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

It prints the table with the new probabilities after applying every event, so that we can trace, how the probabilities change. The only thing that changes is really the second column, P(H) of the hypotheses. The conditional probabilities of the events don't change.

In this case the very first event decided the fate: it drove the probability all the way for hyp1, and the rest of the events just stayed consistent with that. If we reorder the events, the intermediate computations will change:

evB,0
evA,1
evC,0

$ perl ex06_01run.pl tab06_01_01.txt in06_01_01_04.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.40000,1.00000,0.66667,0.66667,
hyp2   ,0.60000,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

If evB is applied first, it drives the probabilities toward hyp2, but then the next event brings it back to the full glory of hyp1. Which basically means that if evA is true, there is no way hyp2 could happen, so hyp1 wins big time.

The order in which the events are applied doesn't matter. Either way it will come to the same result. In reality it might not be the best result since the table in this form doesn't represent the cross-correlation between the events quite right. But for a given table the result is the same independently of the order in which the events are applied.

The Bayes formula is applied for one event at a time in the function apply_event(). The formula is exactly as described before.

P(H|C(E)) = P(H) * ( C(E)*(0+P(E|H)) + (1-C(E))*(1-P(E|H)) )
    / ( C(E)*(0+P(E)) + (1-C(E))*(1-P(E)) )

The only catch is that P(E) is not present anywhere in the table. It has to be calculated from the other available data. This is because the probabilities of the events change after the knowledge about the previous events has been taken into the account. It's the biggest effect of the cross-correlation of the events and it absolutely has to be taken into account or you'll be seeing the probability values around 15 or so. There are other subtler effects as well but more about that later.

The event probability is computed from adding up its weighed conditional probabilities:

P(E) = sum(P(H) * P(E/H))

Since the P(H) values change after every events and P(E/H) stay the same, the values of P(E) also change.

If used with the option -v, the program will print the verbose intermediate data as it preforms the computations. With the original event order it will print:

$ perl ex06_01run.pl -v tab06_01_01.txt in06_01_01_01.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.666670
H hyp1 *1.000000/0.666670
H hyp2 *0.000000/0.666670
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *1.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *0.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

With the different order:

$ perl ex06_01run.pl -v tab06_01_01.txt in06_01_01_04.txt 
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evB, c=0.000000
P(E)=0.444449
H hyp1 *0.333330/0.555551
H hyp2 *1.000000/0.555551
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.40000,1.00000,0.66667,0.66667,
hyp2   ,0.60000,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
P(E)=0.400001
H hyp1 *1.000000/0.400001
H hyp2 *0.000000/0.400001
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
P(E)=0.666670
H hyp1 *0.333330/0.333330
H hyp2 *0.000000/0.333330
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

As you can see, the probabilities for evA and evB were wildly different in the first two rounds. But either way they were applied, by the time the evC gets into the fray, it has the same probability in both cases.

No comments:

Post a Comment