Saturday, February 25, 2012

The proper aggregation, part 5: optimized DELETEs

Previously I've mentioned that the deletes in an aggregator do not have to be recalculated every time. Instead the rows can be remembered from the insert time, and simply re-sent with the new opcode. That allows to trade the CPU time for the extra memory. Of course, this works best when there are many rows per aggregation group. How many is "many"? It depends on the particular cases. You'd have to measure. Anyway, here is how it's done:

sub computeAverage # (table, context, aggop, opcode, rh, state, args...)
  my ($table, $context, $aggop, $opcode, $rh, $state, @args) = @_;

  # don't send the NULL record after the group becomes empty
  return if ($context->groupSize()==0
    || $opcode == &Triceps::OP_NOP);
  if ($opcode == &Triceps::OP_DELETE) {
    $context->send($opcode, $$state) or die "$!";

  my $sum = 0;
  my $count = 0;
  for (my $rhi = $context->begin(); !$rhi->isNull(); 
      $rhi = $context->next($rhi)) {
    $sum += $rhi->getRow()->get("price");
  my $rLast = $context->last()->getRow() or die "$!";
  my $avg = $sum/$count;

  my $res = $context->resultType()->makeRowHash(
    symbol => $rLast->get("symbol"), 
    id => $rLast->get("id"), 
    price => $avg
  ) or die "$!";
  ${$state} = $res;
  $context->send($opcode, $res) or die "$!";

sub initRememberLast #  (@args)
  my $refvar;
  return \$refvar;

my $ttWindow = Triceps::TableType->new($rtTrade)
    Triceps::IndexType->newHashed(key => [ "id" ])
    Triceps::IndexType->newHashed(key => [ "symbol" ])
      Triceps::IndexType->newFifo(limit => 2)
        $rtAvgPrice, "aggrAvgPrice", \&initRememberLast, \&computeAverage)
or die "$!";

The rest of the example stays the same, so it's not shown. Even in the part that is shown, very little has changed.

The aggregator type now has an initialization function. (This function is NOT of the same kind as for the sorted index!) This function gets called every time a new aggregation group gets created, before the first row is inserted into it. It initializes the aggregator group's Perl state (the state is per aggregator type, so if there are two parallel index types, each with an aggregator, each aggregator will have its own group state).

The state is stored in the group as a single Perl variable. So it usually is a reference. In this case the value returned is a reference to a variable that would contain a Row reference. (Ironically, the simplest case looks a bit more confusing than if it were a reference to an array or hash). Returning a reference to a "my" variable is a way to create a reference to an anonymous value: each time "my" executes, it creates a new value. Which is then kept in a reference after the initialization function returns. The next time the function executes, "my" would create another new value.

The computation function is passed that state as an argument and now makes use of it. It has two small additions. Before sending a new result row, that row gets remembered in the state reference. And then before doing any computation the function checks, if the required opcode is DELETE, and if so then simply resends the last result with the new opcode. Remember, the rows are not copied but reference-counted, so this is fairly cheap.

The extra level of referencing is used because simply assigning to $state would only change the local variable and not the value kept in the group.

However if you change the argument of the function directly, that would change the value kept in the group (similar to changing the loop variable in a foreach loop). So you can save a bit of overhead by eliminating the extra indirection. The changes will be:

sub computeAverage # (table, context, aggop, opcode, rh, state, args...)
  if ($opcode == &Triceps::OP_DELETE) {
    $context->send($opcode, $state) or die "$!";
  $_[5] = $res;
  $context->send($opcode, $res) or die "$!";

sub initRememberLast #  (@args)
  return undef;

Even though the initialization function returns undef, it still must be present. If it's not present, the state argument of the comparison function will contain a special hardcoded and unmodifiable undef constant, and nothing could be remembered.

And here is an example of its work:

tWindow.aggrAvgPrice OP_INSERT symbol="AAA" id="1" price="10" 
tWindow.aggrAvgPrice OP_INSERT symbol="BBB" id="2" price="100" 
tWindow.aggrAvgPrice OP_DELETE symbol="AAA" id="1" price="10" 
tWindow.aggrAvgPrice OP_INSERT symbol="AAA" id="3" price="15" 
tWindow.aggrAvgPrice OP_DELETE symbol="BBB" id="2" price="100" 
tWindow.aggrAvgPrice OP_INSERT symbol="BBB" id="4" price="150" 
tWindow.aggrAvgPrice OP_DELETE symbol="AAA" id="3" price="15" 
tWindow.aggrAvgPrice OP_INSERT symbol="AAA" id="5" price="25" 
tWindow.aggrAvgPrice OP_DELETE symbol="AAA" id="5" price="25" 
tWindow.aggrAvgPrice OP_INSERT symbol="AAA" id="5" price="30" 
tWindow.aggrAvgPrice OP_DELETE symbol="AAA" id="5" price="30" 

Since the rows are grouped by the symbol, the symbols "AAA" and "BBB" will have separate aggregation states.

No comments:

Post a Comment