This concludes the documentation for the version 1.0. The C++ API hasn't been described yet, but it all has been taking way more time and effort than I've planned. Of course, the result is coming out better than I've planned too, both from the docs and from the features standpoint. Currently Triceps is at the state that I've planned more for the version 1.3 or so. But it keeps dragging, and at some point it's just time to draw the line. The C++ docs will have to wait until after the 1.0 release. I've had more ideas for the Perl examples as well, and they'll have to wait too.
Now I'm going to work on assembling and formatting the documentation nicely, and then producing the official release 1.0.
This started as my thoughts on the field of Complex Event Processing, mostly about my OpenSource project Triceps. But now it's about all kinds of software-related things.
Monday, May 21, 2012
Large deletes in small chunks
If you have worked with Coral8 and similar CEP systems, you should be familiar with the situation when you ask it to delete a million rows from the table and the model goes into self-contemplation for half an hour, not reacting to any requests. It starts responding again only when the deletes are finished. That's because the execution is single-threaded, and deleting a million rows takes time.
Triceps is succeptible to the same issue. So, how to avoid it? Even better, how to make the deletes work "in background", at a low priority, kicking in only when there is no other pending requests?
The solution is do do it in smaller chunks. Delete a few rows (say, a thousand or so) then check if there are any other requests. Keep processing these other request until the model becomes idle. Then continue with deleting the next chunk of rows.
Let's make a small example of it. First, let's make a table.
The data in the table is completely silly, just something to put in there. Even the index is a simple FIFO, just something to keep the table together. And the data will be put in there by the main loop in an equally silly way:
When we send the command like "data,3", the mail loop will insert 3 new rows into the table. The contents is generated with sequential numbers, so the rows can be told apart. As the table gets changed, the updates get printed by the label lbPrintData. Also the contents of the table can be dumped with the main loop command "dump". Now let's add a main loop command to clear the table, initially by going through all the data and deleting it at once.
It's done in a bit round-about way: the main loop will send the clearing notification row to the label lbClear. Which does the clearing, then sends a notification that the clearing has completed to the label lbReportNote. Which eventually gets printed.
In the real world not the whole table would be erased but only the old data, from before a certain date. I've shown it before in the traffic aggregation example, ordering the rows by a date field, and deleting until you see a newer row. Here for simplicity all the data get wiped out.
The part of the main loop responsible for the clearing command is:
With the basic clearing done, time to add the chunking logic. First, add a tray to collect things that need to be done when the model is idle:
Then modify the lbClear code to work with the limited chunks:
Since it's real inconvenient to play with a million rows, we'll play with just a few rows. And so the chunk size limit is also set smaller, to just two rows instead of a thousand. When the limit is reached, the code pushes the command row to the idle tray for later rescheduling and returns. The adoption part is not strictly necessary, and this small example would work fine without it. But it's a safeguard for the more complicated programs that may have the labels chained, with our clearing label being just one link in a chain. If the incoming rowop gets rescheduled as is, the whole chain will get executed again. which might not be desirable. Re-adopting it to our label will cause only our label (okay, and everything chained from it) to be executed.
How would the rowops in the idle tray get executed? In the real world, the main loop logic would be like this pseudocode:
But it's hugely inconvenient for a toy demonstration, getting the timing right would be a major pain. So instead let's just add an extra command "idle" to the main loop, to trigger the idle logic at will:
And while at it, let's make the "dump" command show the contents of the idle tray as well:
All the pieces have been put together, let's run the code. As usual, the input lines are shown in italics:
This is pretty much a dry run: put in one row (less than the chunk size), see it deleted on clearing. And see the completion reported afterwards.
Add more data, which will be enough for three chunks.
Now the clearing does one chunk and stops, waiting for the idle condition.
See what's inside: the remaining 3 rows, and a row in the idle tray saying that the clearing is in progress.
The model goes idle once more, one more chunk of two rows gets deleted.
What will happen if we add more data in between the chunks of clearing? Let's see, let's add one more row. It shows up in the table as usual.
And on the next idle condition the clearing picks up whatever was in the table for the next chunk. Since there were only two rows left, it's the last chunk, and the clearing reports a successful completion. And a dump shows that there is nothing left in the table nor in the idle tray. And the next idle condition does nothing, because the idle tray is empty.
The delete-by-chunks logic can be made into a template, just I'm not sure yet what is the best way to do it. It would have to have a lot of configurable parts.
On another subject, scheduling the things to be done on idle adds an element of unpredictability to the model. It's impossible to predict the exact timing of the incoming requests, and the idle work may get inserted between any of them. Presumably it's OK because the data being deleted should not be participating in any logic at this time any more. For repeatability in the unit tests, make the chunk size adjustable and adjust it to a size larger than the biggest amount of data used in the unit tests.
A similar logic can also be used in querying the data. But it's more difficult. For deletion the continuation is easy: just take the first row in the index, and it will be the place to continue (because the index is ordered correctly, and because the previous rows are getting deleted). For querying you would have to remember the next row handle and continue from it. Which is OK if it can not get deleted in the meantime. But if it can get deleted, you'll have to keep track of that too, and advance to the next row handle when this happens. And if you want to receive a full snapshot with the following subscription to all updates, you'd have to check whether the modified rows are before or after the marked handle, and pass them through if they are before it, letting the user see the updates to the data already received. And since the data is being sent to the user, filling up the output buffer and stopping would stop the whole model too, and not restart until the user reads the buffered data. So there has to be a flow control logic that would stop the query when output buffer fills up, return to the normal operation, and then reschedule the idle job for the query only when the output buffer drains down. I've kind of started on doing an example of the chunked query too, but then because of all these complications decided to leave it for later.
Triceps is succeptible to the same issue. So, how to avoid it? Even better, how to make the deletes work "in background", at a low priority, kicking in only when there is no other pending requests?
The solution is do do it in smaller chunks. Delete a few rows (say, a thousand or so) then check if there are any other requests. Keep processing these other request until the model becomes idle. Then continue with deleting the next chunk of rows.
Let's make a small example of it. First, let's make a table.
our $uChunks = Triceps::Unit->new("uChunks") or confess "$!"; # data is just some dumb easily-generated filler our $rtData = Triceps::RowType->new( s => "string", i => "int32", ) or confess "$!"; # the data is auto-generated by a sequence our $seq = 0; our $ttData = Triceps::TableType->new($rtData) ->addSubIndex("fifo", Triceps::IndexType->newFifo()) or confess "$!"; $ttData->initialize() or confess "$!"; our $tData = $uChunks->makeTable($ttData, &Triceps::EM_CALL, "tJoin1" ) or confess "$!"; makePrintLabel("lbPrintData", $tData->getOutputLabel());
The data in the table is completely silly, just something to put in there. Even the index is a simple FIFO, just something to keep the table together. And the data will be put in there by the main loop in an equally silly way:
while(<STDIN>) { chomp; my @data = split(/,/); # starts with a command, then string opcode my $type = shift @data; if ($type eq "data") { my $count = shift @data; for (; $count > 0; $count--) { ++$seq; $uChunks->makeHashCall($tData->getInputLabel(), "OP_INSERT", s => ("data_" . $seq), i => $seq, ) or confess "$!"; } } elsif ($type eq "dump") { for (my $rhit = $tData->begin(); !$rhit->isNull(); $rhit = $rhit->next()) { print("dump: ", $rhit->getRow()->printP(), "\n"); } } $uChunks->drainFrame(); # just in case, for completeness }
When we send the command like "data,3", the mail loop will insert 3 new rows into the table. The contents is generated with sequential numbers, so the rows can be told apart. As the table gets changed, the updates get printed by the label lbPrintData. Also the contents of the table can be dumped with the main loop command "dump". Now let's add a main loop command to clear the table, initially by going through all the data and deleting it at once.
# notifications about the clearing our $rtNote = Triceps::RowType->new( text => "string", ) or confess "$!"; our $lbReportNote = $uChunks->makeDummyLabel($rtNote, "lbReportNote" ) or confess "$!"; makePrintLabel("lbPrintNote", $lbReportNote); # code that clears the table our $lbClear = $uChunks->makeLabel($rtNote, "lbClear", undef, sub { my $next; for (my $rhit = $tData->begin(); !$rhit->isNull(); $rhit = $next) { $next = $rhit->next(); # advance before removal $tData->remove($rhit); } $uChunks->makeHashCall($lbReportNote, "OP_INSERT", text => "done clearing", ) or confess "$!"; }) or confess "$!";
It's done in a bit round-about way: the main loop will send the clearing notification row to the label lbClear. Which does the clearing, then sends a notification that the clearing has completed to the label lbReportNote. Which eventually gets printed.
In the real world not the whole table would be erased but only the old data, from before a certain date. I've shown it before in the traffic aggregation example, ordering the rows by a date field, and deleting until you see a newer row. Here for simplicity all the data get wiped out.
The part of the main loop responsible for the clearing command is:
elsif ($type eq "clear") { $uChunks->makeHashCall($lbClear, "OP_INSERT", text => "clear", ) or confess "$!"; }
With the basic clearing done, time to add the chunking logic. First, add a tray to collect things that need to be done when the model is idle:
our $trayIdle = $uChunks->makeTray();
Then modify the lbClear code to work with the limited chunks:
# code that clears the table in small chunks our $lbClear = $uChunks->makeLabel($rtNote, "lbClear", undef, sub { my $limit = 2; # no more than 2 rows per run my $next; for (my $rhit = $tData->begin(); !$rhit->isNull(); $rhit = $next) { if ($limit-- <= 0) { # request to be called again when the model becomes idle $trayIdle->push($_[0]->adopt($_[1])); return; } $next = $rhit->next(); # advance before removal $tData->remove($rhit); } $uChunks->makeHashCall($lbReportNote, "OP_INSERT", text => "done clearing", ) or confess "$!"; }) or confess "$!";
Since it's real inconvenient to play with a million rows, we'll play with just a few rows. And so the chunk size limit is also set smaller, to just two rows instead of a thousand. When the limit is reached, the code pushes the command row to the idle tray for later rescheduling and returns. The adoption part is not strictly necessary, and this small example would work fine without it. But it's a safeguard for the more complicated programs that may have the labels chained, with our clearing label being just one link in a chain. If the incoming rowop gets rescheduled as is, the whole chain will get executed again. which might not be desirable. Re-adopting it to our label will cause only our label (okay, and everything chained from it) to be executed.
How would the rowops in the idle tray get executed? In the real world, the main loop logic would be like this pseudocode:
while(1) { if (idle tray empty) timeout = infinity; else timeout = 0; poll(file descriptors, timeout); if (poll timed out) run the idle tray; else process the incoming data; }
But it's hugely inconvenient for a toy demonstration, getting the timing right would be a major pain. So instead let's just add an extra command "idle" to the main loop, to trigger the idle logic at will:
elsif ($type eq "idle") { $uChunks->schedule($trayIdle); $trayIdle->clear(); }
And while at it, let's make the "dump" command show the contents of the idle tray as well:
for my $r ($trayIdle->toArray()) { print("when idle: ", $r->printP(), "\n"); }
All the pieces have been put together, let's run the code. As usual, the input lines are shown in italics:
data,1 tJoin1.out OP_INSERT s="data_1" i="1" clear tJoin1.out OP_DELETE s="data_1" i="1" lbReportNote OP_INSERT text="done clearing"
This is pretty much a dry run: put in one row (less than the chunk size), see it deleted on clearing. And see the completion reported afterwards.
data,5 tJoin1.out OP_INSERT s="data_2" i="2" tJoin1.out OP_INSERT s="data_3" i="3" tJoin1.out OP_INSERT s="data_4" i="4" tJoin1.out OP_INSERT s="data_5" i="5" tJoin1.out OP_INSERT s="data_6" i="6"
Add more data, which will be enough for three chunks.
clear tJoin1.out OP_DELETE s="data_2" i="2" tJoin1.out OP_DELETE s="data_3" i="3"
Now the clearing does one chunk and stops, waiting for the idle condition.
dump dump: s="data_4" i="4" dump: s="data_5" i="5" dump: s="data_6" i="6" when idle: lbClear OP_INSERT text="clear"
See what's inside: the remaining 3 rows, and a row in the idle tray saying that the clearing is in progress.
idle tJoin1.out OP_DELETE s="data_4" i="4" tJoin1.out OP_DELETE s="data_5" i="5"
The model goes idle once more, one more chunk of two rows gets deleted.
data,1 tJoin1.out OP_INSERT s="data_7" i="7" dump dump: s="data_6" i="6" dump: s="data_7" i="7" when idle: lbClear OP_INSERT text="clear"
What will happen if we add more data in between the chunks of clearing? Let's see, let's add one more row. It shows up in the table as usual.
idle tJoin1.out OP_DELETE s="data_6" i="6" tJoin1.out OP_DELETE s="data_7" i="7" lbReportNote OP_INSERT text="done clearing" dump idle
And on the next idle condition the clearing picks up whatever was in the table for the next chunk. Since there were only two rows left, it's the last chunk, and the clearing reports a successful completion. And a dump shows that there is nothing left in the table nor in the idle tray. And the next idle condition does nothing, because the idle tray is empty.
The delete-by-chunks logic can be made into a template, just I'm not sure yet what is the best way to do it. It would have to have a lot of configurable parts.
On another subject, scheduling the things to be done on idle adds an element of unpredictability to the model. It's impossible to predict the exact timing of the incoming requests, and the idle work may get inserted between any of them. Presumably it's OK because the data being deleted should not be participating in any logic at this time any more. For repeatability in the unit tests, make the chunk size adjustable and adjust it to a size larger than the biggest amount of data used in the unit tests.
A similar logic can also be used in querying the data. But it's more difficult. For deletion the continuation is easy: just take the first row in the index, and it will be the place to continue (because the index is ordered correctly, and because the previous rows are getting deleted). For querying you would have to remember the next row handle and continue from it. Which is OK if it can not get deleted in the meantime. But if it can get deleted, you'll have to keep track of that too, and advance to the next row handle when this happens. And if you want to receive a full snapshot with the following subscription to all updates, you'd have to check whether the modified rows are before or after the marked handle, and pass them through if they are before it, letting the user see the updates to the data already received. And since the data is being sent to the user, filling up the output buffer and stopping would stop the whole model too, and not restart until the user reads the buffered data. So there has to be a flow control logic that would stop the query when output buffer fills up, return to the normal operation, and then reschedule the idle job for the query only when the output buffer drains down. I've kind of started on doing an example of the chunked query too, but then because of all these complications decided to leave it for later.
Saturday, May 19, 2012
JoinTwo reference
This is a short reference to JoinTwo, a template that joins two tables.
Create the JoinTwo object. Confesses on any errors. Many of the options are mandatory, unless explicitly said otherwise. The options are:
name - Name of this object (will be used to create the names of internal objects).
leftTable - Table object to join, for the left side (both tables must be of the same unit).
rightTable - Table object to join, for the right side (both tables must be of the same unit).
leftFromLabel (optional) - The label from which to receive to the rows on the left side. (default: leftTable's Output label unless it's a self-join, for a self-join leftTable's Pre label). Can be used to put in a label that would filter out some of the input (THIS IS DANGEROUS! To preserve consistency, always filter by key field(s) only, and the same condition on the left and right).
rightFromLabel (optional) - The label from which to receive to the rows on the right side. (default: rightTable's Output label). Can be used to put in a label that would filter out some of the input (THIS IS DANGEROUS! To preserve consistency, always filter by key field(s) only, and the same condition on the left and right).
leftIdxPath - Array reference containing the path name of index type in the left table used for look-up. The index absolutely must be a Hash (leaf or not), not of any other kind. The number and order of key fields in left and right indexes must match, since indexes define the fields used for the join. The types of fields have to match exactly unless allowed by the option overrideKeyTypes => 1.
rightIdxPath - Array reference containing the path name of index type in the right table used for look-up. The index absolutely must be a Hash (leaf or not), not of any other kind. The number and order of key fields in left and right indexes must match, since indexes define the fields used for the join. The types of fields have to match exactly unless allowed by the option overrideKeyTypes => 1.
leftFields (optional) - Reference to an array of patterns for the left-side fields to pass through to the result rows, with the syntax of Triceps::Fields::filter(). If not defined then pass everything.
rightFields (optional) - Reference to an array of patterns for the right-side fields to pass through to the result rows, with the syntax of Triceps::Fields::filter(). If not defined then pass everything.
fieldsLeftFirst (optional) - Flag: if 1, in the result rows put the fields from the left side first, then from the right side; if 0, then in the opposite order. (default:1)
fieldsUniqKey (optional) - Controls the logic that prevents the duplication of the key fields in the result rows (since by definition their originals are present in both the left and right tables). This is done by setting the option "fieldsMirrorKey" of the underlying LookupJoins to 1 and by manipulating the left/rightFields options: one side is left unchanged, and thus lets the user pass the key fields as usual, while the other side gets "!key" specs prepended to the front of it for each key field, thus removing the duplication.
The values are one of:
by (optional) - Reference to an array containing pairs of field names used for look-up, [ leftFld1, rightFld1, leftFld2, rightFld2, ... ]. The options "by" and "byLeft" are mutually exclusive. If none of them is used, by default the field lists are taken from the index type keys, matched up in the order they appear in the indexes. But if a different order is desired, this option can be used to override it (the fields must still be the same, just the order may change).
byLeft (optional) - Reference to an array containing the patterns in the syntax of Triceps::Fields::filter(). It gets applied to the left-side fields, the fields that pass through become the key fields, and their translations are the names of the matching fields on the right side. The options "by" and "byLeft" are mutually exclusive. If none of them is used, by default the field lists are taken from the index type keys, matched up in the order they appear in the indexes. But if a different order is desired, this option can be used to override it (the fields must still be the same, just the order may change).
type (optional) - The type of join from the inner/outer classification, one of: "inner", "left" for left outer, "right" for right outer, "outer" for full outer. (default: "inner")
leftSaveJoinerTo (optional) - Reference to a scalar where to save a copy of the joiner function source code for the left side.
rightSaveJoinerTo (optional) - Reference to a scalar where to save a copy of the joiner function source code for the right side.
overrideSimpleMinded (optional) - Flag: if 1, do not try to create the correct DELETE-INSERT sequence for the updates, just produce records with the same opcode as the incoming ones. The only possible usage of this option might be to simulate the CEP systems that do not support the opcodes and treat averything as an INSERT. The data produced is outright garbage. It can also be used for the entertainment value, to show, why it's garbage. (default: 0)
overrideKeyTypes (optional) - Flag: if 1, allow the key types to be not exactly the same. (default: 0)
Get the row type of the join result.
Get the output label of the joiner. The results from processing of the input rowops come out here. Note that there is no input label, the join is fed by connecting to the tables (with the possible override with the options "left/rightFromLabel").
Get back the values of the options. If such an option was not set, returns the default value, or the automatically calculated value. Sometimes an automatically calculated value may even override the user-specified value. There is no way to get back "left/rightFromLabel", it is discarded after the JoinTwo is constructed.
$joiner = Triceps::JoinTwo->new(optionName => optionValue, ...);
Create the JoinTwo object. Confesses on any errors. Many of the options are mandatory, unless explicitly said otherwise. The options are:
name - Name of this object (will be used to create the names of internal objects).
leftTable - Table object to join, for the left side (both tables must be of the same unit).
rightTable - Table object to join, for the right side (both tables must be of the same unit).
leftFromLabel (optional) - The label from which to receive to the rows on the left side. (default: leftTable's Output label unless it's a self-join, for a self-join leftTable's Pre label). Can be used to put in a label that would filter out some of the input (THIS IS DANGEROUS! To preserve consistency, always filter by key field(s) only, and the same condition on the left and right).
rightFromLabel (optional) - The label from which to receive to the rows on the right side. (default: rightTable's Output label). Can be used to put in a label that would filter out some of the input (THIS IS DANGEROUS! To preserve consistency, always filter by key field(s) only, and the same condition on the left and right).
leftIdxPath - Array reference containing the path name of index type in the left table used for look-up. The index absolutely must be a Hash (leaf or not), not of any other kind. The number and order of key fields in left and right indexes must match, since indexes define the fields used for the join. The types of fields have to match exactly unless allowed by the option overrideKeyTypes => 1.
rightIdxPath - Array reference containing the path name of index type in the right table used for look-up. The index absolutely must be a Hash (leaf or not), not of any other kind. The number and order of key fields in left and right indexes must match, since indexes define the fields used for the join. The types of fields have to match exactly unless allowed by the option overrideKeyTypes => 1.
leftFields (optional) - Reference to an array of patterns for the left-side fields to pass through to the result rows, with the syntax of Triceps::Fields::filter(). If not defined then pass everything.
rightFields (optional) - Reference to an array of patterns for the right-side fields to pass through to the result rows, with the syntax of Triceps::Fields::filter(). If not defined then pass everything.
fieldsLeftFirst (optional) - Flag: if 1, in the result rows put the fields from the left side first, then from the right side; if 0, then in the opposite order. (default:1)
fieldsUniqKey (optional) - Controls the logic that prevents the duplication of the key fields in the result rows (since by definition their originals are present in both the left and right tables). This is done by setting the option "fieldsMirrorKey" of the underlying LookupJoins to 1 and by manipulating the left/rightFields options: one side is left unchanged, and thus lets the user pass the key fields as usual, while the other side gets "!key" specs prepended to the front of it for each key field, thus removing the duplication.
The values are one of:
- "none" - Do not change either of the left/rightFields, and do not enable the key mirroring at all.
- "manual" - Enable the key mirroring; do not change either of the left/rightFields, leaving the full control to the user.
- "left" - Enable the key mirroring; do not change leftFields (and thus pass the key fields in there), remove the keys from rightFields.
- "right" - Enable the key mirroring; do not change rightFields (and thus pass the key fields in there), remove the keys from leftFields.
- "first" (default) - Enable the key mirroring; do not change whatever side goes first according to the option "fieldsLeftFirst" (and thus pass the key in there), remove the keys from the other side.
by (optional) - Reference to an array containing pairs of field names used for look-up, [ leftFld1, rightFld1, leftFld2, rightFld2, ... ]. The options "by" and "byLeft" are mutually exclusive. If none of them is used, by default the field lists are taken from the index type keys, matched up in the order they appear in the indexes. But if a different order is desired, this option can be used to override it (the fields must still be the same, just the order may change).
byLeft (optional) - Reference to an array containing the patterns in the syntax of Triceps::Fields::filter(). It gets applied to the left-side fields, the fields that pass through become the key fields, and their translations are the names of the matching fields on the right side. The options "by" and "byLeft" are mutually exclusive. If none of them is used, by default the field lists are taken from the index type keys, matched up in the order they appear in the indexes. But if a different order is desired, this option can be used to override it (the fields must still be the same, just the order may change).
type (optional) - The type of join from the inner/outer classification, one of: "inner", "left" for left outer, "right" for right outer, "outer" for full outer. (default: "inner")
leftSaveJoinerTo (optional) - Reference to a scalar where to save a copy of the joiner function source code for the left side.
rightSaveJoinerTo (optional) - Reference to a scalar where to save a copy of the joiner function source code for the right side.
overrideSimpleMinded (optional) - Flag: if 1, do not try to create the correct DELETE-INSERT sequence for the updates, just produce records with the same opcode as the incoming ones. The only possible usage of this option might be to simulate the CEP systems that do not support the opcodes and treat averything as an INSERT. The data produced is outright garbage. It can also be used for the entertainment value, to show, why it's garbage. (default: 0)
overrideKeyTypes (optional) - Flag: if 1, allow the key types to be not exactly the same. (default: 0)
$rt = $joiner->getResultRowType();
Get the row type of the join result.
$lb = $joiner->getOutputLabel();
Get the output label of the joiner. The results from processing of the input rowops come out here. Note that there is no input label, the join is fed by connecting to the tables (with the possible override with the options "left/rightFromLabel").
$res = $joiner->getUnit(); $res = $joiner->getName(); $res = $joiner->getLeftTable(); $res = $joiner->getRightTable(); $res = $joiner->getLeftIdxPath(); $res = $joiner->getRightIdxPath(); $res = $joiner->getLeftFields(); $res = $joiner->getRightFields(); $res = $joiner->getFieldsLeftFirst(); $res = $joiner->getFieldsUniqKey(); $res = $joiner->getBy(); $res = $joiner->getByLeft(); $res = $joiner->getType(); $res = $joiner->getOverrideSimpleMinded(); $res = $joiner->getOverrideKeyTypes();
Get back the values of the options. If such an option was not set, returns the default value, or the automatically calculated value. Sometimes an automatically calculated value may even override the user-specified value. There is no way to get back "left/rightFromLabel", it is discarded after the JoinTwo is constructed.
LookupJoin reference
The features of LookupJoin have been already described and illustrated in great many examples. Here is the short summary of them.
Construct the LookupJoin template. Confesses on any errors. Many of the options are in fact mandatory, the optional ones are marked as such. They are:
unit - Scheduling unit object (may be skipped if leftFromLabel is used) .
name - Name of this LookupJoin object (will be used to create the names of internal objects).
leftRowType (mutually exclusive with leftFromLabel, one must be present) - Type of the rows that will be coming in at the left side of the join, and will be used for lookup.
leftFromLabel (mutually exclusive with leftRowType, one must be present) - Source of rows for the left side of the join; implies their type and the scheduling unit where this object belongs.
rightTable - Table object where to do the look-ups.
rightIdxPath (optional) - Array reference containing the path name of index type in table used for the look-up (default: first top-level Hash index type). The index absolutely must be a Hash (leaf or non-leaf), not of any other kind.
leftFields (optional) - Reference to an array of patterns for the left-side fields to pass through. Syntax as described in Triceps::Fields::filter(). If not defined then pass everything.
rightFields (optional) - Reference to an array of patterns for the right-side fields to pass through. Syntax as described in Triceps::Fields::filter(). If not defined then pass everything (which is probably a bad idea since it would include the second copy of the key fields, so better override it).
fieldsLeftFirst (optional) - Flag: in the resulting rows put the fields from the left side first, then from right side. If 0, then opposite. (default:1)
fieldsMirrorKey (optional) - Flag: even if the join is an outer join and the row on one side is absent when generating the result row, the key fields in it will still be present by mirroring them from the other side. Used by JoinTwo. (default: 0)
by (mutually exclusive with byLeft, one must be present) - Reference to an array containing pairs of field names used for look-up, [ leftFld1, rightFld1, leftFld2, rightFld2, ... ]. The set of right-side fields must match the keys of the index path from the option rightIdxPath, though possibly in a different order.
byLeft (mutually exclusive with byLeft, one must be present) - Reference to an array containing the patterns in the syntax of Triceps::Fields::filter(). It gets applied to the left-side fields, the fields that pass through become the key fields, and their translations are the names of the matching fields on the right side. The set of right-side fields must match the keys of the index path from the option rightIdxPath, though possibly in a different order.
isLeft (optional) - Flag: 1 for left outer join, 0 for inner join. (default: 1)
limitOne (optional) - Flag: 1 to return no more than one row even if multiple rows have been found by the look-up, 0 to return all the found matches. (default: 0 for the non-leaf right index, 1 for leaf right index). If the right index is leaf, this option will be always automatically set to 1, even if the user specified otherwise, since there is no way to look up more than one matching row in it.
automatic (optional) - Flag: 1 means that the lookup() method will never be called manually. This allows to optimize the label handler code and always take the opcode into account when processing the rows. 0 means that lookup() will be used. (default: 1)
oppositeOuter (optional, used only with automatic => 1) - Flag: 1 for the right outer join, 0 for inner join. If both options isLeft and oppositeOuter are set to 1, then this is a full outer join. If set to 1, each update that finds a match in the right table, may produce a DELETE-INSERT sequence that keeps the state of the right or full outer join consistent. The full outer or right outer join logic makes sense only if this LookupJoin is one of a pair in a bigger JoinTwo object. Each of these LookupJoins thinks of itself as "left" and of the other one as "right", while JoinTwo presents a consistent whole picture to the user. (default: 0) Used by JoinTwo.
groupSizeCode (optional, used only with oppositeOuter => 1) - Reference to a function that would compute the group size for this side's table. The group size together with the opcode is then used to decide if a DELETE-INSERT sequence needs to be produced instead of a plain INSERT or DELETE. It is needed when this side's index (not visible here in LookupJoin but visible in the JoinTwo that envelopes it) is non-leaf, so multiple rows on this side may match each row on the other side. The DELETE-INSERT pair needs to be generated only if the current rowop was a deletion of the last matching row or insertion of the first matching row on this side. If groupSizeCode is not defined, the DELETE-INSERT part is always generated (which is appropriate is this side's index is leaf, and every row is the last or first one). If groupSizeCode is defined, it should return the group size in the left table by the left index for the input row. If the operation is INSERT, the size of 1 would mean that the DELETE-INSERT pair needs to be generated. If the operation is DELETE, the size of 0 would mean that the DELETE-INSERT pair needs to be generated. Called as:
The default undefined groupSizeCode is equivalent to
but leaving it undefined is more efficient since allows to hardcode this logic at compile time instead of calling the function for every rowop.
saveJoinerTo (optional) - Reference to a scalar where to save a copy of the joiner function source code.
Look up the matches for the $leftRow and return the array of the result rows. If the option "isLeft" is 0, the array may be empty. If the option "limitOne" is 1, the array will contain no more than one row, and may be assigned directly to a scalar.
Get the row type of the join result.
Get the input label of the joiner. The rowops sent there will be processed as coming on the left side of the join. The result will be produced on the output label.
Get the output label of the joiner. The results from processing of the input rowops come out here. Note that the results of the lookup() calls do not come out at the output label, they are only returned to the caller.
Get back the values of the options. If such an option was not set, returns the default value, or the automatically calculated value. Sometimes an automatically calculated value may even override the user-specified value. There is no way to get back "leftFromLabel", it is discarded after the LookupJoin is constructed.
$joiner = Triceps::LookupJoin->new(optionName => optionValue, ...);
Construct the LookupJoin template. Confesses on any errors. Many of the options are in fact mandatory, the optional ones are marked as such. They are:
unit - Scheduling unit object (may be skipped if leftFromLabel is used) .
name - Name of this LookupJoin object (will be used to create the names of internal objects).
leftRowType (mutually exclusive with leftFromLabel, one must be present) - Type of the rows that will be coming in at the left side of the join, and will be used for lookup.
leftFromLabel (mutually exclusive with leftRowType, one must be present) - Source of rows for the left side of the join; implies their type and the scheduling unit where this object belongs.
rightTable - Table object where to do the look-ups.
rightIdxPath (optional) - Array reference containing the path name of index type in table used for the look-up (default: first top-level Hash index type). The index absolutely must be a Hash (leaf or non-leaf), not of any other kind.
leftFields (optional) - Reference to an array of patterns for the left-side fields to pass through. Syntax as described in Triceps::Fields::filter(). If not defined then pass everything.
rightFields (optional) - Reference to an array of patterns for the right-side fields to pass through. Syntax as described in Triceps::Fields::filter(). If not defined then pass everything (which is probably a bad idea since it would include the second copy of the key fields, so better override it).
fieldsLeftFirst (optional) - Flag: in the resulting rows put the fields from the left side first, then from right side. If 0, then opposite. (default:1)
fieldsMirrorKey (optional) - Flag: even if the join is an outer join and the row on one side is absent when generating the result row, the key fields in it will still be present by mirroring them from the other side. Used by JoinTwo. (default: 0)
by (mutually exclusive with byLeft, one must be present) - Reference to an array containing pairs of field names used for look-up, [ leftFld1, rightFld1, leftFld2, rightFld2, ... ]. The set of right-side fields must match the keys of the index path from the option rightIdxPath, though possibly in a different order.
byLeft (mutually exclusive with byLeft, one must be present) - Reference to an array containing the patterns in the syntax of Triceps::Fields::filter(). It gets applied to the left-side fields, the fields that pass through become the key fields, and their translations are the names of the matching fields on the right side. The set of right-side fields must match the keys of the index path from the option rightIdxPath, though possibly in a different order.
isLeft (optional) - Flag: 1 for left outer join, 0 for inner join. (default: 1)
limitOne (optional) - Flag: 1 to return no more than one row even if multiple rows have been found by the look-up, 0 to return all the found matches. (default: 0 for the non-leaf right index, 1 for leaf right index). If the right index is leaf, this option will be always automatically set to 1, even if the user specified otherwise, since there is no way to look up more than one matching row in it.
automatic (optional) - Flag: 1 means that the lookup() method will never be called manually. This allows to optimize the label handler code and always take the opcode into account when processing the rows. 0 means that lookup() will be used. (default: 1)
oppositeOuter (optional, used only with automatic => 1) - Flag: 1 for the right outer join, 0 for inner join. If both options isLeft and oppositeOuter are set to 1, then this is a full outer join. If set to 1, each update that finds a match in the right table, may produce a DELETE-INSERT sequence that keeps the state of the right or full outer join consistent. The full outer or right outer join logic makes sense only if this LookupJoin is one of a pair in a bigger JoinTwo object. Each of these LookupJoins thinks of itself as "left" and of the other one as "right", while JoinTwo presents a consistent whole picture to the user. (default: 0) Used by JoinTwo.
groupSizeCode (optional, used only with oppositeOuter => 1) - Reference to a function that would compute the group size for this side's table. The group size together with the opcode is then used to decide if a DELETE-INSERT sequence needs to be produced instead of a plain INSERT or DELETE. It is needed when this side's index (not visible here in LookupJoin but visible in the JoinTwo that envelopes it) is non-leaf, so multiple rows on this side may match each row on the other side. The DELETE-INSERT pair needs to be generated only if the current rowop was a deletion of the last matching row or insertion of the first matching row on this side. If groupSizeCode is not defined, the DELETE-INSERT part is always generated (which is appropriate is this side's index is leaf, and every row is the last or first one). If groupSizeCode is defined, it should return the group size in the left table by the left index for the input row. If the operation is INSERT, the size of 1 would mean that the DELETE-INSERT pair needs to be generated. If the operation is DELETE, the size of 0 would mean that the DELETE-INSERT pair needs to be generated. Called as:
&$groupSizeCode($opcode, $leftrow)
The default undefined groupSizeCode is equivalent to
sub { &Triceps::isInsert($_[0]); }
but leaving it undefined is more efficient since allows to hardcode this logic at compile time instead of calling the function for every rowop.
saveJoinerTo (optional) - Reference to a scalar where to save a copy of the joiner function source code.
@rows = $joiner->lookup($leftRow);
Look up the matches for the $leftRow and return the array of the result rows. If the option "isLeft" is 0, the array may be empty. If the option "limitOne" is 1, the array will contain no more than one row, and may be assigned directly to a scalar.
$rt = $joiner->getResultRowType();
Get the row type of the join result.
$lb = $joiner->getInputLabel();
Get the input label of the joiner. The rowops sent there will be processed as coming on the left side of the join. The result will be produced on the output label.
$lb = $joiner->getOutputLabel();
Get the output label of the joiner. The results from processing of the input rowops come out here. Note that the results of the lookup() calls do not come out at the output label, they are only returned to the caller.
$res = $joiner->getUnit(); $res = $joiner->getName(); $res = $joiner->getLeftRowType(); $res = $joiner->getRightTable(); $res = $joiner->getRightIdxPath(); $res = $joiner->getLeftFields(); $res = $joiner->getRightFields(); $res = $joiner->getFieldsLeftFirst(); $res = $joiner->getFieldsMirrorKey(); $res = $joiner->getBy(); $res = $joiner->getByLeft(); $res = $joiner->getIsLeft(); $res = $joiner->getLimitOne(); $res = $joiner->getAutomatic(); $res = $joiner->getOppositeOuter(); $res = $joiner->getGroupSizeCode();
Get back the values of the options. If such an option was not set, returns the default value, or the automatically calculated value. Sometimes an automatically calculated value may even override the user-specified value. There is no way to get back "leftFromLabel", it is discarded after the LookupJoin is constructed.
The field processing helpers
The manipulation on the field lists from the joins is also available for reuse. It's grouped in the package Triceps::Fields.
The function
checks whether the simple type is an array type, from the standpoint of representation it in Perl. The array types are those that end with "[]", except "uint8[]" (because it's represented as a Perl scalar string).
The function
filters a list of fields by a pattern list of the same form as used in the join results. For example:
$callerDescr is the description of the caller used in the error messages, \@incomingFields is the reference to an array with the field names to be filtered, \@patterns is the reference to array of field name patterns. Each pattern is a string in one of the forms:
If a field name doesn't match any of the patterns, it doesn't pass through the filter.
Each field is checked against each pattern in order, and the first successful match determines what happens with the field. For example, when the pattern ['!key', '.*'] is used on the field name "key", the first '!key' matches it and blocks the field from passing through the filter.
In general, quoting the patterns with single quotes is better than with double quotes, because this way the special regexp characters don't need so much escaping with backslashes. Naturally, it's better to keep the field names alphanumeric too, to avoid getting funny effects when they are used in the patterns. Some particularly useful pattern examples:
More examples of the patterns have been shown with the joins.
The result is an array of field names after translation. That array has the same size as @incomingFields, and keeps the passed-through fields in the same order. The fields that don't pass through get replaced with undef.
If a pattern specifies a literal alphanumeric field name without any regexp wildcard (such as 'key' or '!key' or 'key/right_$&/'), this function makes sure that the field is present in the original field list. If it isn't, the function confesses. The reason is for the accidental typos in the field names not to go unnoticed. No such check is done on the general patterns, to allow the reuse of patterns on many different field lists, including those where the pattern doesn't match anything.
The function doesn't check for any duplicates in the resulting field names, nor for any funny characters in them. The reason for not checking the duplicates is that often the result is combined from multiple sets of filtered fields, and the check for duplicates makes sense only after these sets are combined.
The function
is a version of filter() that does the same but returns the result in a different form. This time the result contains a pair of values "oldName", "newName" for each field that passes through (of course, if the field is not renamed, "oldName" and "newName" will be the same). For example, if called
the result in @pairs will be: ("abcd", "abcd", "jkl", "qwe"). The field "abcd" made through as is, the field "jkl" got renamed to "qwe". You can also put the result of filterToPairs directly into a map:
Other than the result format, filterToPairs() works exactly the same as filter(). It's just that sometimes one format of the result is more convenient, sometimes the other.
The function
$result = &Triceps::Fields::isArrayType($simpleType);
checks whether the simple type is an array type, from the standpoint of representation it in Perl. The array types are those that end with "[]", except "uint8[]" (because it's represented as a Perl scalar string).
The function
@fields = &Triceps::Fields:: filter($callerDescr, \@incomingFields, \@patterns);
filters a list of fields by a pattern list of the same form as used in the join results. For example:
my @leftfld = $self->{leftRowType}->getFieldNames(); my @res =&Triceps::Fields::filter( "Triceps::LookupJoin::new: option '$leftFields'", \@leftfld, $self->{"$leftFields"});
$callerDescr is the description of the caller used in the error messages, \@incomingFields is the reference to an array with the field names to be filtered, \@patterns is the reference to array of field name patterns. Each pattern is a string in one of the forms:
- 'regexp' - pass through the field names matching the anchored regexp (i.e. implicitly wrapped as '^regexp$').
- '!regexp' - throw away the field names matching the anchored regexp.
- 'regexp/regsub' - pass through the field names matching the anchored regexp, performing a substitution on it.
If a field name doesn't match any of the patterns, it doesn't pass through the filter.
Each field is checked against each pattern in order, and the first successful match determines what happens with the field. For example, when the pattern ['!key', '.*'] is used on the field name "key", the first '!key' matches it and blocks the field from passing through the filter.
In general, quoting the patterns with single quotes is better than with double quotes, because this way the special regexp characters don't need so much escaping with backslashes. Naturally, it's better to keep the field names alphanumeric too, to avoid getting funny effects when they are used in the patterns. Some particularly useful pattern examples:
- '.*' - pass through everything
- '.*/second_$&/' - pass everything and prepend 'second_' to it
- 'right_(.*)/$1/' - pass the field names starting from 'right_' and remove this prefix from them
More examples of the patterns have been shown with the joins.
The result is an array of field names after translation. That array has the same size as @incomingFields, and keeps the passed-through fields in the same order. The fields that don't pass through get replaced with undef.
If a pattern specifies a literal alphanumeric field name without any regexp wildcard (such as 'key' or '!key' or 'key/right_$&/'), this function makes sure that the field is present in the original field list. If it isn't, the function confesses. The reason is for the accidental typos in the field names not to go unnoticed. No such check is done on the general patterns, to allow the reuse of patterns on many different field lists, including those where the pattern doesn't match anything.
The function doesn't check for any duplicates in the resulting field names, nor for any funny characters in them. The reason for not checking the duplicates is that often the result is combined from multiple sets of filtered fields, and the check for duplicates makes sense only after these sets are combined.
The function
@pairs = &Triceps::Fields::filterToPairs($callerDescr, \@incomingFields, \@patterns);
is a version of filter() that does the same but returns the result in a different form. This time the result contains a pair of values "oldName", "newName" for each field that passes through (of course, if the field is not renamed, "oldName" and "newName" will be the same). For example, if called
@pairs = &Triceps::Fields::filterToPairs("MyTemplate result", ["abc", "abcd", "fgh", "jkl"], ["!abc", "a.*", "jkl/qwe"]);
the result in @pairs will be: ("abcd", "abcd", "jkl", "qwe"). The field "abcd" made through as is, the field "jkl" got renamed to "qwe". You can also put the result of filterToPairs directly into a map:
%resultFields = &Triceps::Fields::filterToPairs(...);
Other than the result format, filterToPairs() works exactly the same as filter(). It's just that sometimes one format of the result is more convenient, sometimes the other.
More option checking
Some motifs in checking the options for the method calls have been coming up repeatedly, to I've added more Triceps::Opt methods that encapsulate them.
The first one deals with the mutually exclusive options. Triceps::Opt::parse() doesn't know how to check the mutual exclusivity correctly. For it the option is either mandatory or optional. And rather than complicate it with some convoluted specification of the option exclusivity groups, I've just added a separate method to check that:
You call parse() and then you call checkMutuallyExclusive(). If it finds an error, it confesses. It returns the name of the only option that has been defined (or undef if none of them were defined). For example, this is what the JoinTwo constructor does:
$callerDescr is some string that describes the caller for the error message. The names of the options are also used in the error messages. $mandatoryFlag is 1 if exactly one option must be defined, or 0 if having none of them defined is also OK. The "defined" here means that the value passed in the arguments is not undef.
The second method is more specialized. It deals with the triangle of (Unit, RowType, Label). It turns out quite convenient to either let a template define its own input label and then manually connect it or just give it another label and let it automatically chain the input to that label. In the first case the template has to be told, what Unit it belongs to, and what is the RowType of the input data. In the second case they can be found from the Label. The method
encapsulates this finding-out and other checks. Its rules are:
The values are passed by reference because they may be computed by this method from the other values.
Here is a usage example:
The label object doesn't strictly have to be a label object. It may be any object that supports the methods getUnit() and getRowType().
Here you might remember that a Label doesn't have the method getRowType(), its method for getting the row type is called getType(). Well, I've added it now. You can use now either of
with the same effect.
The first one deals with the mutually exclusive options. Triceps::Opt::parse() doesn't know how to check the mutual exclusivity correctly. For it the option is either mandatory or optional. And rather than complicate it with some convoluted specification of the option exclusivity groups, I've just added a separate method to check that:
$optName = &Triceps::Opt::checkMutuallyExclusive( $callerDescr, $mandatoryFlag, $optName1 => optValue1, ...);
You call parse() and then you call checkMutuallyExclusive(). If it finds an error, it confesses. It returns the name of the only option that has been defined (or undef if none of them were defined). For example, this is what the JoinTwo constructor does:
&Triceps::Opt::checkMutuallyExclusive("Triceps::JoinTwo::new", 0, by => $self->{by}, byLeft => $self->{byLeft});
$callerDescr is some string that describes the caller for the error message. The names of the options are also used in the error messages. $mandatoryFlag is 1 if exactly one option must be defined, or 0 if having none of them defined is also OK. The "defined" here means that the value passed in the arguments is not undef.
The second method is more specialized. It deals with the triangle of (Unit, RowType, Label). It turns out quite convenient to either let a template define its own input label and then manually connect it or just give it another label and let it automatically chain the input to that label. In the first case the template has to be told, what Unit it belongs to, and what is the RowType of the input data. In the second case they can be found from the Label. The method
&Triceps::Opt::handleUnitTypeLabel($callerDescr, $nameUnit, \$refUnit, $nameRowType, \$refRowType, $nameLabel, \$refLabel);
encapsulates this finding-out and other checks. Its rules are:
- The label option and the row type option are mutually exclusive.
- The unit option may be specified together with the label option, but it must be the same unit as in the label.
- If the label option is used, the unit and row type option values will be populated from the label.
- On any error it confesses, using $callerDescr for the caller description in the error message. The option name arguments are slao used for the error messages.
- It always returns 1.
The values are passed by reference because they may be computed by this method from the other values.
Here is a usage example:
&Triceps::Opt::handleUnitTypeLabel("Triceps::LookupJoin::new", unit => \$self->{unit}, leftRowType => \$self->{leftRowType}, leftFromLabel => \$self->{leftFromLabel});
The label object doesn't strictly have to be a label object. It may be any object that supports the methods getUnit() and getRowType().
Here you might remember that a Label doesn't have the method getRowType(), its method for getting the row type is called getType(). Well, I've added it now. You can use now either of
$lb->getType() $lb->getRowType()
with the same effect.
Finding the nested index types
The joins have introduced a syntax for choosing a nested index type in the table (or, more exactly, in the table type). You can say
to specify the index type "byCcy12" nested in index type "byCcy1". Internally the joins delegate this index finding to the TableType calls, and you can also use these calls directly:
For example:
The findIndexPath() simply finds the index type at the path. If there is no such index, it confesses.
The findIndexKeyPath() finds by path an index type that allows the direct look-up by key fields. It requires that every index type in the path returns a non-empty array of fields in getKey(). In practice it means that every index in the path must be a Hashed index. Otherwise the method confesses. When the Sorted and maybe other index types will support getKey(), they will be usable with this method too.
Besides checking that each index type in the path works by keys, this method builds and returns the list of all the key fields required for a look-up in this index. Note that @keys is an actual array and not a reference to array. The return protocol of this method is a little weird: it returns an array of values, with the first value being the reference to the index type, and the rest of them the names of the key fields. If the table type was defined as
then the look-up of [ "byCcy1", "byCcy12" ] would return ($ixtref, "ccy1", "ccy2"), where $ixtref is the reference to the index type. When assigned to ($ixt, @keys), $ixtref would go into $ixt, and ("ccy1", "ccy2") would go into @keys.
The key field names in the result go in the order they occurred in the definition, from the outermost to the innermost index. The key fields must not duplicate. It's possible to define the index types where the key fields duplicate in the path, say:
And they would even work fine, with just a little extra overhead from duplication. But findIndexKeyPath() will refuse such indexes and confess.
rightIdxPath => [ "byCcy1", "byCcy12" ],
to specify the index type "byCcy12" nested in index type "byCcy1". Internally the joins delegate this index finding to the TableType calls, and you can also use these calls directly:
$idxType = $tableType->findIndexPath(\@path); ($idxType, @keys) = $tableType->findIndexKeyPath(\@path);
For example:
$ixt = $tt->findIndexPath([ "byCcy1", "byCcy12" ]); ($ixt, @keys) = $tt->findIndexKeyPath([ "byCcy1", "byCcy12" ]);
The findIndexPath() simply finds the index type at the path. If there is no such index, it confesses.
The findIndexKeyPath() finds by path an index type that allows the direct look-up by key fields. It requires that every index type in the path returns a non-empty array of fields in getKey(). In practice it means that every index in the path must be a Hashed index. Otherwise the method confesses. When the Sorted and maybe other index types will support getKey(), they will be usable with this method too.
Besides checking that each index type in the path works by keys, this method builds and returns the list of all the key fields required for a look-up in this index. Note that @keys is an actual array and not a reference to array. The return protocol of this method is a little weird: it returns an array of values, with the first value being the reference to the index type, and the rest of them the names of the key fields. If the table type was defined as
$tt = Triceps::TableType->new($rt) ->addSubIndex("byCcy1", Triceps::IndexType->newHashed(key => [ "ccy1" ]) ->addSubIndex("byCcy12", Triceps::IndexType->newHashed(key => [ "ccy2" ]) ) ) ->addSubIndex("byCcy2", Triceps::IndexType->newHashed(key => [ "ccy2" ]) ->addSubIndex("grouping", Triceps::IndexType->newFifo()) ) or die "$!";
then the look-up of [ "byCcy1", "byCcy12" ] would return ($ixtref, "ccy1", "ccy2"), where $ixtref is the reference to the index type. When assigned to ($ixt, @keys), $ixtref would go into $ixt, and ("ccy1", "ccy2") would go into @keys.
The key field names in the result go in the order they occurred in the definition, from the outermost to the innermost index. The key fields must not duplicate. It's possible to define the index types where the key fields duplicate in the path, say:
$tt = Triceps::TableType->new($rt) ->addSubIndex("byCcy1", Triceps::IndexType->newHashed(key => [ "ccy1" ]) ->addSubIndex("byCcy12", Triceps::IndexType->newHashed(key => [ "ccy2", "ccy1" ]) ) ) or die "$!";
And they would even work fine, with just a little extra overhead from duplication. But findIndexKeyPath() will refuse such indexes and confess.
Friday, May 18, 2012
The hidden options of LookupJoin
These options aren't really hidden, just they aren't particularly useful unless you want to use a LookupJoin as a part of a multi-sided join, like JoinTwo does. It's even hard to explain what do they do without explaining the JoinTwo first. If you're not interested in such details, you can as well skip them.
So, setting
tells that this LookupJoin is a part of an outer join, with the opposite side (right side, for this LookupJoin) being an outer one (well, this side might be outer too if "isLeft => 1", but that's a whole separate question). This enables the logic that checks whether the row inserted here is the first one that matches a row in the right-side table, and whether the row deleted here was the last one that matches. If the condition is satisfied, not a simple INSERT or DELETE rowop is produced but a correct DELETE-INSERT pair that replaces the old state with the new one. Well, it's been described in detail for the JoinTwo.
But how does it know if the current row if the first one or last one or neither? After all, LookupJoin doesn't have any access to the left-side table. It has two ways to know.
First, by default it simply assumes that it's an one-to-something (1:1 or 1:M) join. Then there may be no more than one matching row on this side, and every row inserted is the first one, and every row deleted is the last one. Then it does the DELETE-INSERT trick every time.
Second, the option
can be used to compute the current group size for the current row. It provides a function that does the computation and gets called as
Note that it doesn't get the table reference nor the index type reference either, so it has to be a closure with the references compiled into it. JoinTwo does it with the definition
Why not just pass the table and index type references to JoinTwo and let it do the same computation without any mess with the closure references? Because the group size computation may need to be different. When the JoinTwo does a self-join, it feeds the left side from the table's Pre label, and the normal group size computation would be incorrect because the rowop didn't get applied to the table yet. Instead it has to predict what will happen when the rowop will get applied:
If you set the option "groupSizeCode" to undef, that's the default value that triggers the one-to-something behavior.
Setting another option
enables another magic behavior: mirroring the values of key fields to both sides before they are used to produce the result row. This way even if the join is an outer join and one side is not present, the key fields will be available on both sides nevertheless. This is the heavy machinery that underlies the JoinTwo's high-level option "fieldsUniqKey". The mirroring goes both ways: If this is a left join and no matching row is found on the right, the values of the key fields will be copied from the left to the right. If the option "oppositeOuter" is set and causes a row with the empty left side to be produced as a part of DELETE-INSERT pair, the key fields will be copied from the right to the left.
So, setting
oppositeOuter => 1
tells that this LookupJoin is a part of an outer join, with the opposite side (right side, for this LookupJoin) being an outer one (well, this side might be outer too if "isLeft => 1", but that's a whole separate question). This enables the logic that checks whether the row inserted here is the first one that matches a row in the right-side table, and whether the row deleted here was the last one that matches. If the condition is satisfied, not a simple INSERT or DELETE rowop is produced but a correct DELETE-INSERT pair that replaces the old state with the new one. Well, it's been described in detail for the JoinTwo.
But how does it know if the current row if the first one or last one or neither? After all, LookupJoin doesn't have any access to the left-side table. It has two ways to know.
First, by default it simply assumes that it's an one-to-something (1:1 or 1:M) join. Then there may be no more than one matching row on this side, and every row inserted is the first one, and every row deleted is the last one. Then it does the DELETE-INSERT trick every time.
Second, the option
groupSizeCode => \&groupSizeComputation
can be used to compute the current group size for the current row. It provides a function that does the computation and gets called as
$gsz = &{$self->{groupSizeCode}}($opcode, $row);
Note that it doesn't get the table reference nor the index type reference either, so it has to be a closure with the references compiled into it. JoinTwo does it with the definition
sub { # (opcode, row) $table->groupSizeIdx($ixt, $_[1]); }
Why not just pass the table and index type references to JoinTwo and let it do the same computation without any mess with the closure references? Because the group size computation may need to be different. When the JoinTwo does a self-join, it feeds the left side from the table's Pre label, and the normal group size computation would be incorrect because the rowop didn't get applied to the table yet. Instead it has to predict what will happen when the rowop will get applied:
sub { # (opcode, row) if (&Triceps::isInsert($_[0])) { $table->groupSizeIdx($ixt, $_[1])+1; } else { $table->groupSizeIdx($ixt, $_[1])-1; } }
If you set the option "groupSizeCode" to undef, that's the default value that triggers the one-to-something behavior.
Setting another option
fieldsMirrorKey => 1
enables another magic behavior: mirroring the values of key fields to both sides before they are used to produce the result row. This way even if the join is an outer join and one side is not present, the key fields will be available on both sides nevertheless. This is the heavy machinery that underlies the JoinTwo's high-level option "fieldsUniqKey". The mirroring goes both ways: If this is a left join and no matching row is found on the right, the values of the key fields will be copied from the left to the right. If the option "oppositeOuter" is set and causes a row with the empty left side to be produced as a part of DELETE-INSERT pair, the key fields will be copied from the right to the left.
A glimpse inside JoinTwo
For a while JoinTwo was compact and straightforward, and easy to demonstrate. Then it has grown all these extra features, options and error checks, and became quite complicated. So I'll show only the selected portions of the JoinTwo constructor, with the gist of its functionality:
So in the end it boils down to two LookupJoins, with the options computed from the JoinTwo's options. But you might notice that there are a few LookupJoin options that haven't been described before. And a few otehr methods not described before. They'll be described shortly.
... my $selfJoin = $self->{leftTable}->same($self->{rightTable}); if ($selfJoin && !defined $self->{leftFromLabel}) { # one side must be fed from Pre label (but still let the user override) $self->{leftFromLabel} = $self->{leftTable}->getPreLabel(); } ... my ($leftLeft, $rightLeft); if ($self->{type} eq "inner") { $leftLeft = 0; $rightLeft = 0; } elsif ($self->{type} eq "left") { $leftLeft = 1; $rightLeft = 0; } elsif ($self->{type} eq "right") { $leftLeft = 0; $rightLeft = 1; } elsif ($self->{type} eq "outer") { $leftLeft = 1; $rightLeft = 1; } else { Carp::confess("Unknown value '" . $self->{type} . "' of option 'type', must be one of inner|left|right|outer"); } $self->{leftRowType} = $self->{leftTable}->getRowType(); $self->{rightRowType} = $self->{rightTable}->getRowType(); ... for my $side ( ("left", "right") ) { if (defined $self->{"${side}FromLabel"}) { ... } else { $self->{"${side}FromLabel"} = $self->{"${side}Table"}->getOutputLabel(); } my @keys; ($self->{"${side}IdxType"}, @keys) = $self->{"${side}Table"}->getType()->findIndexKeyPath(@{$self->{"${side}IdxPath"}}); # would already confess if the index is not found if (!$self->{overrideSimpleMinded}) { if (!$self->{"${side}IdxType"}->isLeaf() && ($self->{type} ne "inner" && $self->{type} ne $side) ) { my $table = $self->{"${side}Table"}; my $ixt = $self->{"${side}IdxType"}; if ($selfJoin && $side eq "left") { # the special case, reading from the table's Pre label; # must adjust the count for what will happen after the row gets processed $self->{"${side}GroupSizeCode"} = sub { # (opcode, row) if (&Triceps::isInsert($_[0])) { $table->groupSizeIdx($ixt, $_[1])+1; } else { $table->groupSizeIdx($ixt, $_[1])-1; } }; } else { $self->{"${side}GroupSizeCode"} = sub { # (opcode, row) $table->groupSizeIdx($ixt, $_[1]); }; } } } ... my $fieldsMirrorKey = 1; my $uniq = $self->{fieldsUniqKey}; if ($uniq eq "first") { $uniq = $self->{fieldsLeftFirst} ? "left" : "right"; } if ($uniq eq "none") { $fieldsMirrorKey = 0; } elsif ($uniq eq "manual") { # nothing to do } elsif ($uniq =~ /^(left|right)$/) { my($side, @keys); if ($uniq eq "left") { $side = "right"; @keys = @rightkeys; } else { $side = "left"; @keys = @leftkeys; } if (!defined $self->{"${side}Fields"}) { $self->{"${side}Fields"} = [ ".*" ]; # the implicit pass-all } unshift(@{$self->{"${side}Fields"}}, map("!$_", @keys) ); } else { Carp::confess("Unknown value '" . $self->{fieldsUniqKey} . "' of option 'fieldsUniqKey', must be one of none|manual|left|right|first"); } # now create the LookupJoins $self->{leftLookup} = Triceps::LookupJoin->new( unit => $self->{unit}, name => $self->{name} . ".leftLookup", leftRowType => $self->{leftRowType}, rightTable => $self->{rightTable}, rightIdxPath => $self->{rightIdxPath}, leftFields => $self->{leftFields}, rightFields => $self->{rightFields}, fieldsLeftFirst => $self->{fieldsLeftFirst}, fieldsMirrorKey => $fieldsMirrorKey, by => \@leftby, isLeft => $leftLeft, automatic => 1, oppositeOuter => ($rightLeft && !$self->{overrideSimpleMinded}), groupSizeCode => $self->{leftGroupSizeCode}, saveJoinerTo => $self->{leftSaveJoinerTo}, ); $self->{rightLookup} = Triceps::LookupJoin->new( unit => $self->{unit}, name => $self->{name} . ".rightLookup", leftRowType => $self->{rightRowType}, rightTable => $self->{leftTable}, rightIdxPath => $self->{leftIdxPath}, leftFields => $self->{rightFields}, rightFields => $self->{leftFields}, fieldsLeftFirst => !$self->{fieldsLeftFirst}, fieldsMirrorKey => $fieldsMirrorKey, by => \@rightby, isLeft => $rightLeft, automatic => 1, oppositeOuter => ($leftLeft && !$self->{overrideSimpleMinded}), groupSizeCode => $self->{rightGroupSizeCode}, saveJoinerTo => $self->{rightSaveJoinerTo}, ); # create the output label $self->{outputLabel} = $self->{unit}->makeDummyLabel($self->{leftLookup}->getResultRowType(), $self->{name} . ".out"); Carp::confess("$!") unless (ref $self->{outputLabel} eq "Triceps::Label"); # and connect them together $self->{leftFromLabel}->chain($self->{leftLookup}->getInputLabel()); $self->{rightFromLabel}->chain($self->{rightLookup}->getInputLabel()); $self->{leftLookup}->getOutputLabel()->chain($self->{outputLabel}); $self->{rightLookup}->getOutputLabel()->chain($self->{outputLabel});
So in the end it boils down to two LookupJoins, with the options computed from the JoinTwo's options. But you might notice that there are a few LookupJoin options that haven't been described before. And a few otehr methods not described before. They'll be described shortly.
JoinTwo override options
Normally JoinTwo tries to work in a consistent manner, refusing to do the unsafe things that might corrupt the data. But if you really, really want, and are really sure of what you're doing, there are options to override these restrictions.
If you set
overrideSimpleMinded => 1
then the logic that produces the DELETE-INSERT sequences for the outer joins gets disabled. The only reason I can think of to use this option is if you want to simulate a CEP system that has no concept of opcodes. So if your data is INSERT-only and you want to produce the INSERT-only data too, and want the dumbed-down logic, this option is your solution.
The option
overrideKeyTypes => 1
disables the check for the exact match of the key field types. This might come helpful for example if you have an int32 field on one side and an int64 field on the other side, and you know that in reality they would always stay within the int32 range. Or if you have an integer on one side and a string that always contains an integer on the other side. Since you know that the type conversions can always be done with no loss, you can safely override the type check and still get the correct result.
By the way, even though JoinTwo doesn't refuse to have the float64 key fields, using them is a bad idea. The floating-point values are subject to non-obvious rounding. And if you have two floating-point values that print the same, this doesn't mean that they are internally the same down to the last bit (because the printing involves the conversion to decimal that involves rounding). The joining requires that the values are exactly equal. Because of this the joining on the floating-point field is rife with unpleasant surprises. Better don't do it. A possible solution is to round values by converting them to integers (scaled by multiplying by a fixed factor to get essentially a fixed-point value). You can even convert them back from fixed-point to floating-point and still join on these floating-point values, because the same values would always be produced from integers in exactly the same way, and will be exactly the same.
If you set
overrideSimpleMinded => 1
then the logic that produces the DELETE-INSERT sequences for the outer joins gets disabled. The only reason I can think of to use this option is if you want to simulate a CEP system that has no concept of opcodes. So if your data is INSERT-only and you want to produce the INSERT-only data too, and want the dumbed-down logic, this option is your solution.
The option
overrideKeyTypes => 1
disables the check for the exact match of the key field types. This might come helpful for example if you have an int32 field on one side and an int64 field on the other side, and you know that in reality they would always stay within the int32 range. Or if you have an integer on one side and a string that always contains an integer on the other side. Since you know that the type conversions can always be done with no loss, you can safely override the type check and still get the correct result.
By the way, even though JoinTwo doesn't refuse to have the float64 key fields, using them is a bad idea. The floating-point values are subject to non-obvious rounding. And if you have two floating-point values that print the same, this doesn't mean that they are internally the same down to the last bit (because the printing involves the conversion to decimal that involves rounding). The joining requires that the values are exactly equal. Because of this the joining on the floating-point field is rife with unpleasant surprises. Better don't do it. A possible solution is to round values by converting them to integers (scaled by multiplying by a fixed factor to get essentially a fixed-point value). You can even convert them back from fixed-point to floating-point and still join on these floating-point values, because the same values would always be produced from integers in exactly the same way, and will be exactly the same.
Self-join using LookupJoin
The experience with the manual join has made me think about using a similar approach to avoid triplication of the data in the version with join templates. And after some false-starts, I've realized that what that version needs is the LookupJoins. They replace the loops. So, one more version is:
It produces the exact same result as the version with the manual loops, with the only minor difference of the field order in the result rows.
And, in retrospect, I should have probably made a function for the row rotation, so that I would not have to copy that code here.
Well, it works the same as the version with the loops and maybe even looks a little bit neater, but in practice it's much harder to write, debug and understand.
our $join1 = Triceps::LookupJoin->new( name => "join1", leftFromLabel => $tRate->getOutputLabel(), leftFields => [ "ccy1", "ccy2", "rate/rate1" ], rightTable => $tRate, rightIdxPath => [ "byCcy1" ], rightFields => [ "ccy2/ccy3", "rate/rate2" ], byLeft => [ "ccy2/ccy1" ], isLeft => 0, ); # would die by itself on an error our $join2 = Triceps::LookupJoin->new( name => "join2", leftFromLabel => $join1->getOutputLabel(), rightTable => $tRate, rightIdxPath => [ "byCcy1", "byCcy12" ], rightFields => [ "rate/rate3" ], byLeft => [ "ccy3/ccy1", "ccy1/ccy2" ], isLeft => 0, ); # would die by itself on an error # now compute the resulting circular rate and filter the profitable loops our $rtResult = Triceps::RowType->new( $join2->getResultRowType()->getdef(), looprate => "float64", ) or die "$!"; my $lbResult = $uArb->makeDummyLabel($rtResult, "lbResult"); my $lbCompute = $uArb->makeLabel($join2->getResultRowType(), "lbCompute", undef, sub { my ($label, $rowop) = @_; my $row = $rowop->getRow(); my $ccy1 = $row->get("ccy1"); my $ccy2 = $row->get("ccy2"); my $ccy3 = $row->get("ccy3"); my $rate1 = $row->get("rate1"); my $rate2 = $row->get("rate2"); my $rate3 = $row->get("rate3"); my $looprate = $rate1 * $rate2 * $rate3; # now build the row in normalized order of currencies print("____Order before: $ccy1, $ccy2, $ccy3\n"); my $result; if ($ccy2 lt $ccy3) { if ($ccy2 lt $ccy1) { # rotate left $result = $lbResult->makeRowopHash($rowop->getOpcode(), ccy1 => $ccy2, ccy2 => $ccy3, ccy3 => $ccy1, rate1 => $rate2, rate2 => $rate3, rate3 => $rate1, looprate => $looprate, ) or die "$!"; } } else { if ($ccy3 lt $ccy1) { # rotate right $result = $lbResult->makeRowopHash($rowop->getOpcode(), ccy1 => $ccy3, ccy2 => $ccy1, ccy3 => $ccy2, rate1 => $rate3, rate2 => $rate1, rate3 => $rate2, looprate => $looprate, ) or die "$!"; } } if (!defined $result) { # use the straight order $result = $lbResult->makeRowopHash($rowop->getOpcode(), ccy1 => $ccy1, ccy2 => $ccy2, ccy3 => $ccy3, rate1 => $rate1, rate2 => $rate2, rate3 => $rate3, looprate => $looprate, ) or die "$!"; } if ($looprate > 1) { $uArb->call($result); } else { print("__", $result->printP(), "\n"); # for debugging } }) or die "$!"; $join2->getOutputLabel()->chain($lbCompute) or die "$!";
It produces the exact same result as the version with the manual loops, with the only minor difference of the field order in the result rows.
And, in retrospect, I should have probably made a function for the row rotation, so that I would not have to copy that code here.
Well, it works the same as the version with the loops and maybe even looks a little bit neater, but in practice it's much harder to write, debug and understand.
Thursday, May 17, 2012
Self-join done manually
As I've mentioned before, in many cases the self-joins are better suited to be done by the manual looping through the data. This is especially true if the table represents a tree, linked by the parent-child node id and the processing has to navigate through the tree. Indeed, if the tree may be of an arbitrary depth, there is no way to handle if with the common joins, you will need just an arbitrary number of joins (through there are some SQL extensions for the recursive self- joins).
The arbitration example can also be conveniently rewritten through the manual loops. The input row type, table type, table, unit, and the main loop do not change, so I won't copy them the second time. The rest of the code is:
Whenever a new rowop is processed in the table, it goes to the Compute label. The row in this rowop is the first leg of the triangle. The loop then finds all the possible second legs that can be connected to the first leg. And then for each second leg it checks whether it can make the third leg back to the original currency. If it can, good, we've found a candidate for a result row.
The way the loops work, this time there is no triplication. But the same triangle still can be found starting from any of its three currencies. This means that to keep the data consistent, no matter what was the first currency in a particular run, it still must produce the exact some result row. To achieve that, the currencies get rotated as explained in the last post, making sure that the first currency is has the lexically smallest name. These if-else statements do that by selecting the direction of rotation (if any) and build the result record in one of three ways.
Finally it compares the combined rate to 1, and if greater then sends the result. If not, a debugging printout starting with "__" prints the row, so that is can be seen. Another debugging printout prints the original order of the currencies, letting us check that the rotation was performed correctly.
On feeding the same input data this code produces the result:
It's the same result as before, only without the triplicates. And you can see that the rotation logic works right. The manual self-joining has produced the result without triplicates, without an intermediate table, and for me writing and understanding its logic is much easier than with the "proper" joins. I'd say that the manual self-join is a winner in every respect.
An interesting thing is that this manual logic produces the same result independently of whether it's connected to the Output or Pre label of the table. Try changing it, it works the same. This is because the original row is taken directly from the input rowop, and never participates in the join again; it's never read from the table by any of the loops. If it were read again from the table by the loops, the connection order would matter. And the right one would be fairly weird: the INSERT rowops would have to be processed coming from the Output label, the DELETE rowops coming from the Pre label.
This is because the row has to be in the table to be found. And for an INSERT the row gets there only after it goes through the table and comes out on the Output label. But for a DELETE the row would get already deleted from the table by that time. Instead it has to be handled before that, on the Pre label, when the table only prepares to delete it.
If you look at the version with JoinTwo, that's also how an inner self-join works. Since it's an inner join, both rows on both sides must be present to produce a result. An INSERT first arrives from the Pre label on the left side, doesn't find itself in the table, and produces no result (again, we're talking here about the situation when a row has to get joined to itself; it might well find the other pairs for itself and produce a result for them but not for itself joined with itself). Then it arrives the second time from the Output label on the right side. Now it looks in the table, and finds itself, and produces the result (an INSERT coming form the join). A DELETE also first arrives from the Pre label on the left side. It finds its copy in the table and produces the result (a DELETE coming from the join). When the second copy of the row arrives from the Output label on the right side, it doesn't find its copy in the table any more, and produces nothing. In the end it's the same thing, an INSERT comes out of the join triggered by the table Output label, a DELETE comes out of the join triggered by the table Pre label. It's not a whimsy, it's caused by the requirements of the correctness. The manual self-join would have to mimic this order to produce the correct result.
The arbitration example can also be conveniently rewritten through the manual loops. The input row type, table type, table, unit, and the main loop do not change, so I won't copy them the second time. The rest of the code is:
# now compute the resulting circular rate and filter the profitable loops our $rtResult = Triceps::RowType->new( ccy1 => "string", # currency code ccy2 => "string", # currency code ccy3 => "string", # currency code rate1 => "float64", rate2 => "float64", rate3 => "float64", looprate => "float64", ) or die "$!"; my $ixtCcy1 = $ttRate->findSubIndex("byCcy1") or die "$!"; my $ixtCcy12 = $ixtCcy1->findSubIndex("byCcy12") or die "$!"; my $lbResult = $uArb->makeDummyLabel($rtResult, "lbResult"); my $lbCompute = $uArb->makeLabel($rtRate, "lbCompute", undef, sub { my ($label, $rowop) = @_; my $row = $rowop->getRow(); my $ccy1 = $row->get("ccy1"); my $ccy2 = $row->get("ccy2"); my $rate1 = $row->get("rate"); my $rhi = $tRate->findIdxBy($ixtCcy1, ccy1 => $ccy2) or die "$!"; my $rhiEnd = $rhi->nextGroupIdx($ixtCcy12) or die "$!"; for (; !$rhi->same($rhiEnd); $rhi = $rhi->nextIdx($ixtCcy12)) { my $row2 = $rhi->getRow(); my $ccy3 = $row2->get("ccy2"); my $rate2 = $row2->get("rate"); my $rhj = $tRate->findIdxBy($ixtCcy12, ccy1 => $ccy3, ccy2 => $ccy1) or die "$!"; # it's a leaf primary index, so there may be no more than one match next if ($rhj->isNull()); my $row3 = $rhj->getRow(); my $rate3 = $row3->get("rate"); my $looprate = $rate1 * $rate2 * $rate3; # now build the row in normalized order of currencies print("____Order before: $ccy1, $ccy2, $ccy3\n"); my $result; if ($ccy2 lt $ccy3) { if ($ccy2 lt $ccy1) { # rotate left $result = $lbResult->makeRowopHash($rowop->getOpcode(), ccy1 => $ccy2, ccy2 => $ccy3, ccy3 => $ccy1, rate1 => $rate2, rate2 => $rate3, rate3 => $rate1, looprate => $looprate, ) or die "$!"; } } else { if ($ccy3 lt $ccy1) { # rotate right $result = $lbResult->makeRowopHash($rowop->getOpcode(), ccy1 => $ccy3, ccy2 => $ccy1, ccy3 => $ccy2, rate1 => $rate3, rate2 => $rate1, rate3 => $rate2, looprate => $looprate, ) or die "$!"; } } if (!defined $result) { # use the straight order $result = $lbResult->makeRowopHash($rowop->getOpcode(), ccy1 => $ccy1, ccy2 => $ccy2, ccy3 => $ccy3, rate1 => $rate1, rate2 => $rate2, rate3 => $rate3, looprate => $looprate, ) or die "$!"; } if ($looprate > 1) { $uArb->call($result); } else { print("__", $result->printP(), "\n"); # for debugging } } }) or die "$!"; $tRate->getOutputLabel()->chain($lbCompute) or die "$!"; makePrintLabel("lbPrint", $lbResult);
Whenever a new rowop is processed in the table, it goes to the Compute label. The row in this rowop is the first leg of the triangle. The loop then finds all the possible second legs that can be connected to the first leg. And then for each second leg it checks whether it can make the third leg back to the original currency. If it can, good, we've found a candidate for a result row.
The way the loops work, this time there is no triplication. But the same triangle still can be found starting from any of its three currencies. This means that to keep the data consistent, no matter what was the first currency in a particular run, it still must produce the exact some result row. To achieve that, the currencies get rotated as explained in the last post, making sure that the first currency is has the lexically smallest name. These if-else statements do that by selecting the direction of rotation (if any) and build the result record in one of three ways.
Finally it compares the combined rate to 1, and if greater then sends the result. If not, a debugging printout starting with "__" prints the row, so that is can be seen. Another debugging printout prints the original order of the currencies, letting us check that the rotation was performed correctly.
On feeding the same input data this code produces the result:
rate,OP_INSERT,EUR,USD,1.48 rate,OP_INSERT,USD,EUR,0.65 rate,OP_INSERT,GBP,USD,1.98 rate,OP_INSERT,USD,GBP,0.49 rate,OP_INSERT,EUR,GBP,0.74 ____Order before: EUR, GBP, USD __lbResult OP_INSERT ccy1="EUR" ccy2="GBP" ccy3="USD" rate1="0.74" rate2="1.98" rate3="0.65" looprate="0.95238" rate,OP_INSERT,GBP,EUR,1.30 ____Order before: GBP, EUR, USD __lbResult OP_INSERT ccy1="EUR" ccy2="USD" ccy3="GBP" rate1="1.48" rate2="0.49" rate3="1.3" looprate="0.94276" rate,OP_DELETE,EUR,USD,1.48 ____Order before: EUR, USD, GBP __lbResult OP_DELETE ccy1="EUR" ccy2="USD" ccy3="GBP" rate1="1.48" rate2="0.49" rate3="1.3" looprate="0.94276" rate,OP_INSERT,EUR,USD,1.28 ____Order before: EUR, USD, GBP __lbResult OP_INSERT ccy1="EUR" ccy2="USD" ccy3="GBP" rate1="1.28" rate2="0.49" rate3="1.3" looprate="0.81536" rate,OP_DELETE,USD,EUR,0.65 ____Order before: USD, EUR, GBP __lbResult OP_DELETE ccy1="EUR" ccy2="GBP" ccy3="USD" rate1="0.74" rate2="1.98" rate3="0.65" looprate="0.95238" rate,OP_INSERT,USD,EUR,0.78 ____Order before: USD, EUR, GBP lbResult OP_INSERT ccy1="EUR" ccy2="GBP" ccy3="USD" rate1="0.74" rate2="1.98" rate3="0.78" looprate="1.142856" rate,OP_DELETE,EUR,GBP,0.74 ____Order before: EUR, GBP, USD lbResult OP_DELETE ccy1="EUR" ccy2="GBP" ccy3="USD" rate1="0.74" rate2="1.98" rate3="0.78" looprate="1.142856" rate,OP_INSERT,EUR,GBP,0.64 ____Order before: EUR, GBP, USD __lbResult OP_INSERT ccy1="EUR" ccy2="GBP" ccy3="USD" rate1="0.64" rate2="1.98" rate3="0.78" looprate="0.988416"
It's the same result as before, only without the triplicates. And you can see that the rotation logic works right. The manual self-joining has produced the result without triplicates, without an intermediate table, and for me writing and understanding its logic is much easier than with the "proper" joins. I'd say that the manual self-join is a winner in every respect.
An interesting thing is that this manual logic produces the same result independently of whether it's connected to the Output or Pre label of the table. Try changing it, it works the same. This is because the original row is taken directly from the input rowop, and never participates in the join again; it's never read from the table by any of the loops. If it were read again from the table by the loops, the connection order would matter. And the right one would be fairly weird: the INSERT rowops would have to be processed coming from the Output label, the DELETE rowops coming from the Pre label.
This is because the row has to be in the table to be found. And for an INSERT the row gets there only after it goes through the table and comes out on the Output label. But for a DELETE the row would get already deleted from the table by that time. Instead it has to be handled before that, on the Pre label, when the table only prepares to delete it.
If you look at the version with JoinTwo, that's also how an inner self-join works. Since it's an inner join, both rows on both sides must be present to produce a result. An INSERT first arrives from the Pre label on the left side, doesn't find itself in the table, and produces no result (again, we're talking here about the situation when a row has to get joined to itself; it might well find the other pairs for itself and produce a result for them but not for itself joined with itself). Then it arrives the second time from the Output label on the right side. Now it looks in the table, and finds itself, and produces the result (an INSERT coming form the join). A DELETE also first arrives from the Pre label on the left side. It finds its copy in the table and produces the result (a DELETE coming from the join). When the second copy of the row arrives from the Output label on the right side, it doesn't find its copy in the table any more, and produces nothing. In the end it's the same thing, an INSERT comes out of the join triggered by the table Output label, a DELETE comes out of the join triggered by the table Pre label. It's not a whimsy, it's caused by the requirements of the correctness. The manual self-join would have to mimic this order to produce the correct result.
Wednesday, May 16, 2012
Self-joins with JoinTwo
The self-joins happen when a table is joined to itself. Previously I've said that the self-join's aren't supported in Triceps, but that's not true any more. They are now.
For an example of a model with self-joins, let's look at the Forex trading. There people exchange the currencies in every possible direction in multiple markets. There are the quoted rates for exchange of every pair of currencies, in every direction.
Naturally, if you exchange one currency into another and then back into the first one, you normally end up with less money than you've started with. The rest becomes the transaction cost and lines the pockets of the brokers, market makers and exchanges.
However once in a while interesting things happen. If the exchange rates between the different currencies become disbalanced, you may be able to exchange the currency A for currency B for currency C and back for currency A, and end up with more money than you've started with. (You don't have to do it in sequence, you would normally do all three transactions in parallel). However it's a short-lived opportunity: as you perform the transactions, you'll be changing the involved exchange rates towards the balance, and you won't be the only one exploiting this opportunity, so you better act fast. This activity of bringing the market into balance while simultaneously extracting profit is called "arbitration".
So let's make a model that will detect such arbitration opportunities, for the following automated execution. Mind you, it's all grossly simplified, but it shows the gist of it. And most importantly, it uses the self-joins. Here we go:
The rate quotes will be coming into this table. The indexes are provided to both work with the self-joins and to have a primary index as the first leaf.
There are no special options for the self-join: just use the same table for both the left and right side. This join represents two exchange transactions, so it's done by matching the second currency of the first quote to the first currency of the second quote. The result contains three currency names and two rate multiplier.
To find the exchange rate back to the first cuurency, we need to do one more join. But a join needs two tables, so we have to put the result of the first join into a table first. The first index is the primary index, the second one is used for the next join. Note that the order of key fields in the second index is suited for the join.
The result adds one more rate multiplier. Now to learn the effect of the circular conversion we only need to multiply all the multipliers. If it comes out below 1, the cycling transaction would return a loss, if above 1, a profit.
A label with Perl handler performs the multiplication, and if the result is over 1, passes the result to the next label, from which then the data gets printed. I've also added a debugging printout in case if the row doesn't get through. That one starts with "__" and helps seeing what goes on inside when no result is coming out.
Finally, the main loop reads the data and puts it into the rates table.
Now let's take a look at an example of a run.
The rate quotes start coming in. Note that the rates are separate for each direction of exchange. So far nothing happens because there aren't enough quotes to complete a loop of three steps.
Now there are enough currencies in play to complete the loop. None of them get the loop rate over 1 though, so the only printouts are from the debugging logic. There are only two loops, but each of them is printed three times. Why? It's a loop, so you can start from each of its elements and come back to the same element. One row for each starting point. And the joins find all of them.
To find and eliminate the duplicates, the order of currencies in the rows can be rotated to put the alphabetically lowest currency code first. Note that they can't be just sorted because the relative order matters. Trading in the order GBP-USD-EUR will give a different result than GBP-EUR-USD. The relative order has to be preserved. I didn't put any such elimination into the example to keep it smaller.
Someone starts changing lots of euros for dollars, and the rate moves. No good news for us yet though.
The rate for dollars-to-euros follows its opposite. This creates an arbitration opportunity! Step two: trade in the direction USD-EUR-GBP-USD, step three: PROFIT!!!
Our trading (and perhaps other people's trading too) moves the exchange rate of euros to pounds. And with that the balance of currencies is restored, and the arbitration opportunity disappears.
Now let's have a look inside JoinTwo. What is so special about the self-join? Normally the join works on two separate tables. They get updated one at a time. Even if some common reason causes both tables to be updated, the update arrives from one table first. The join sees this incoming update, looks in the unchanged second table, produces an updated result. Then the update from the second table comes to the join, which takes it, looks in the already modified first table, and produces another updated result.
If both inputs are from the same table, this logic breaks. Two copies of the updates will arrive, but by the time the first one arrives, the contents of the table has been already changed. When the join looks in the table, it gets the unexpected results and creates a mess.
But JoinTwo has a fix for this. It makes use of the "pre" label of the table for its left-side update (the right side would have worked just as good, it's just a random choice):
This way when the join sees the first update, the table hasn't changed yet. And then the second copy of that update comes though the normal output label, after the table has been modified. Everything just works out as normal and the self-joins produce the correct result.
Normally you don't need to concern yourself with this, except if you're trying to filter the data coming to the join. Then remember that for leftFromLabel you have to receive the data from the table's getPreLabel(), not getOutputLabel().
For an example of a model with self-joins, let's look at the Forex trading. There people exchange the currencies in every possible direction in multiple markets. There are the quoted rates for exchange of every pair of currencies, in every direction.
Naturally, if you exchange one currency into another and then back into the first one, you normally end up with less money than you've started with. The rest becomes the transaction cost and lines the pockets of the brokers, market makers and exchanges.
However once in a while interesting things happen. If the exchange rates between the different currencies become disbalanced, you may be able to exchange the currency A for currency B for currency C and back for currency A, and end up with more money than you've started with. (You don't have to do it in sequence, you would normally do all three transactions in parallel). However it's a short-lived opportunity: as you perform the transactions, you'll be changing the involved exchange rates towards the balance, and you won't be the only one exploiting this opportunity, so you better act fast. This activity of bringing the market into balance while simultaneously extracting profit is called "arbitration".
So let's make a model that will detect such arbitration opportunities, for the following automated execution. Mind you, it's all grossly simplified, but it shows the gist of it. And most importantly, it uses the self-joins. Here we go:
our $uArb = Triceps::Unit->new("uArb") or die "$!"; our $rtRate = Triceps::RowType->new( # an exchange rate between two currencies ccy1 => "string", # currency code ccy2 => "string", # currency code rate => "float64", # multiplier when exchanging ccy1 to ccy2 ) or die "$!"; # all exchange rates our $ttRate = Triceps::TableType->new($rtRate) ->addSubIndex("byCcy1", Triceps::IndexType->newHashed(key => [ "ccy1" ]) ->addSubIndex("byCcy12", Triceps::IndexType->newHashed(key => [ "ccy2" ]) ) ) ->addSubIndex("byCcy2", Triceps::IndexType->newHashed(key => [ "ccy2" ]) ->addSubIndex("grouping", Triceps::IndexType->newFifo()) ) or die "$!"; $ttRate->initialize() or die "$!"; our $tRate = $uArb->makeTable($ttRate, &Triceps::EM_CALL, "tRate") or die "$!";
The rate quotes will be coming into this table. The indexes are provided to both work with the self-joins and to have a primary index as the first leaf.
our $join1 = Triceps::JoinTwo->new( name => "join1", leftTable => $tRate, leftIdxPath => [ "byCcy2" ], leftFields => [ "ccy1", "ccy2", "rate/rate1" ], rightTable => $tRate, rightIdxPath => [ "byCcy1" ], rightFields => [ "ccy2/ccy3", "rate/rate2" ], ); # would die by itself on an error
There are no special options for the self-join: just use the same table for both the left and right side. This join represents two exchange transactions, so it's done by matching the second currency of the first quote to the first currency of the second quote. The result contains three currency names and two rate multiplier.
our $ttJoin1 = Triceps::TableType->new($join1->getResultRowType()) ->addSubIndex("byCcy123", Triceps::IndexType->newHashed(key => [ "ccy1", "ccy2", "ccy3" ]) ) ->addSubIndex("byCcy31", Triceps::IndexType->newHashed(key => [ "ccy3", "ccy1" ]) ->addSubIndex("grouping", Triceps::IndexType->newFifo()) ) or die "$!"; $ttJoin1->initialize() or die "$!"; our $tJoin1 = $uArb->makeTable($ttJoin1, &Triceps::EM_CALL, "tJoin1") or die "$!"; $join1->getOutputLabel()->chain($tJoin1->getInputLabel()) or die "$!";
To find the exchange rate back to the first cuurency, we need to do one more join. But a join needs two tables, so we have to put the result of the first join into a table first. The first index is the primary index, the second one is used for the next join. Note that the order of key fields in the second index is suited for the join.
our $join2 = Triceps::JoinTwo->new( name => "join2", leftTable => $tJoin1, leftIdxPath => [ "byCcy31" ], rightTable => $tRate, rightIdxPath => [ "byCcy1", "byCcy12" ], rightFields => [ "rate/rate3" ], # the field ordering in the indexes is already right, but # for clarity add an explicit join condition too byLeft => [ "ccy3/ccy1", "ccy1/ccy2" ], ); # would die by itself on an error
The result adds one more rate multiplier. Now to learn the effect of the circular conversion we only need to multiply all the multipliers. If it comes out below 1, the cycling transaction would return a loss, if above 1, a profit.
# now compute the resulting circular rate and filter the profitable loops our $rtResult = Triceps::RowType->new( $join2->getResultRowType()->getdef(), looprate => "float64", ) or die "$!"; my $lbResult = $uArb->makeDummyLabel($rtResult, "lbResult"); my $lbCompute = $uArb->makeLabel($join2->getResultRowType(), "lbCompute", undef, sub { my ($label, $rowop) = @_; my $row = $rowop->getRow(); my $looprate = $row->get("rate1") * $row->get("rate2") * $row->get("rate3"); print("__", $rowop->printP(), "looprate=$looprate \n"); # for debugging if ($looprate > 1) { $uArb->makeHashCall($lbResult, $rowop->getOpcode(), $row->toHash(), looprate => $looprate, ); } else { print("__", $rowop->printP(), "looprate=$looprate \n"); # for debugging } }) or die "$!"; $join2->getOutputLabel()->chain($lbCompute) or die "$!"; # label to print the changes to the detailed stats makePrintLabel("lbPrint", $lbResult);
A label with Perl handler performs the multiplication, and if the result is over 1, passes the result to the next label, from which then the data gets printed. I've also added a debugging printout in case if the row doesn't get through. That one starts with "__" and helps seeing what goes on inside when no result is coming out.
while(<STDIN>) { chomp; my @data = split(/,/); # starts with a command, then string opcode my $type = shift @data; if ($type eq "rate") { $uArb->makeArrayCall($tRate->getInputLabel(), @data) or die "$!"; } $uArb->drainFrame(); # just in case, for completeness }
Finally, the main loop reads the data and puts it into the rates table.
Now let's take a look at an example of a run.
rate,OP_INSERT,EUR,USD,1.48 rate,OP_INSERT,USD,EUR,0.65 rate,OP_INSERT,GBP,USD,1.98 rate,OP_INSERT,USD,GBP,0.49
The rate quotes start coming in. Note that the rates are separate for each direction of exchange. So far nothing happens because there aren't enough quotes to complete a loop of three steps.
rate,OP_INSERT,EUR,GBP,0.74 __join2.leftLookup.out OP_INSERT ccy1="EUR" ccy2="GBP" rate1="0.74" ccy3="USD" rate2="1.98" rate3="0.65" looprate=0.95238 __join2.leftLookup.out OP_INSERT ccy1="USD" ccy2="EUR" rate1="0.65" ccy3="GBP" rate2="0.74" rate3="1.98" looprate=0.95238 __join2.rightLookup.out OP_INSERT ccy1="GBP" ccy2="USD" rate1="1.98" ccy3="EUR" rate2="0.65" rate3="0.74" looprate=0.95238 rate,OP_INSERT,GBP,EUR,1.30 __join2.leftLookup.out OP_INSERT ccy1="GBP" ccy2="EUR" rate1="1.3" ccy3="USD" rate2="1.48" rate3="0.49" looprate=0.94276 __join2.leftLookup.out OP_INSERT ccy1="USD" ccy2="GBP" rate1="0.49" ccy3="EUR" rate2="1.3" rate3="1.48" looprate=0.94276 __join2.rightLookup.out OP_INSERT ccy1="EUR" ccy2="USD" rate1="1.48" ccy3="GBP" rate2="0.49" rate3="1.3" looprate=0.94276
Now there are enough currencies in play to complete the loop. None of them get the loop rate over 1 though, so the only printouts are from the debugging logic. There are only two loops, but each of them is printed three times. Why? It's a loop, so you can start from each of its elements and come back to the same element. One row for each starting point. And the joins find all of them.
To find and eliminate the duplicates, the order of currencies in the rows can be rotated to put the alphabetically lowest currency code first. Note that they can't be just sorted because the relative order matters. Trading in the order GBP-USD-EUR will give a different result than GBP-EUR-USD. The relative order has to be preserved. I didn't put any such elimination into the example to keep it smaller.
rate,OP_DELETE,EUR,USD,1.48 __join2.leftLookup.out OP_DELETE ccy1="EUR" ccy2="USD" rate1="1.48" ccy3="GBP" rate2="0.49" rate3="1.3" looprate=0.94276 __join2.leftLookup.out OP_DELETE ccy1="GBP" ccy2="EUR" rate1="1.3" ccy3="USD" rate2="1.48" rate3="0.49" looprate=0.94276 __join2.rightLookup.out OP_DELETE ccy1="USD" ccy2="GBP" rate1="0.49" ccy3="EUR" rate2="1.3" rate3="1.48" looprate=0.94276 rate,OP_INSERT,EUR,USD,1.28 __join2.leftLookup.out OP_INSERT ccy1="EUR" ccy2="USD" rate1="1.28" ccy3="GBP" rate2="0.49" rate3="1.3" looprate=0.81536 __join2.leftLookup.out OP_INSERT ccy1="GBP" ccy2="EUR" rate1="1.3" ccy3="USD" rate2="1.28" rate3="0.49" looprate=0.81536 __join2.rightLookup.out OP_INSERT ccy1="USD" ccy2="GBP" rate1="0.49" ccy3="EUR" rate2="1.3" rate3="1.28" looprate=0.81536
Someone starts changing lots of euros for dollars, and the rate moves. No good news for us yet though.
rate,OP_DELETE,USD,EUR,0.65 __join2.leftLookup.out OP_DELETE ccy1="USD" ccy2="EUR" rate1="0.65" ccy3="GBP" rate2="0.74" rate3="1.98" looprate=0.95238 __join2.leftLookup.out OP_DELETE ccy1="GBP" ccy2="USD" rate1="1.98" ccy3="EUR" rate2="0.65" rate3="0.74" looprate=0.95238 __join2.rightLookup.out OP_DELETE ccy1="EUR" ccy2="GBP" rate1="0.74" ccy3="USD" rate2="1.98" rate3="0.65" looprate=0.95238 rate,OP_INSERT,USD,EUR,0.78 lbResult OP_INSERT ccy1="USD" ccy2="EUR" rate1="0.78" ccy3="GBP" rate2="0.74" rate3="1.98" looprate="1.142856" lbResult OP_INSERT ccy1="GBP" ccy2="USD" rate1="1.98" ccy3="EUR" rate2="0.78" rate3="0.74" looprate="1.142856" lbResult OP_INSERT ccy1="EUR" ccy2="GBP" rate1="0.74" ccy3="USD" rate2="1.98" rate3="0.78" looprate="1.142856"
The rate for dollars-to-euros follows its opposite. This creates an arbitration opportunity! Step two: trade in the direction USD-EUR-GBP-USD, step three: PROFIT!!!
rate,OP_DELETE,EUR,GBP,0.74 lbResult OP_DELETE ccy1="EUR" ccy2="GBP" rate1="0.74" ccy3="USD" rate2="1.98" rate3="0.78" looprate="1.142856" lbResult OP_DELETE ccy1="USD" ccy2="EUR" rate1="0.78" ccy3="GBP" rate2="0.74" rate3="1.98" looprate="1.142856" lbResult OP_DELETE ccy1="GBP" ccy2="USD" rate1="1.98" ccy3="EUR" rate2="0.78" rate3="0.74" looprate="1.142856" rate,OP_INSERT,EUR,GBP,0.64 __join2.leftLookup.out OP_INSERT ccy1="EUR" ccy2="GBP" rate1="0.64" ccy3="USD" rate2="1.98" rate3="0.78" looprate=0.988416 __join2.leftLookup.out OP_INSERT ccy1="USD" ccy2="EUR" rate1="0.78" ccy3="GBP" rate2="0.64" rate3="1.98" looprate=0.988416 __join2.rightLookup.out OP_INSERT ccy1="GBP" ccy2="USD" rate1="1.98" ccy3="EUR" rate2="0.78" rate3="0.64" looprate=0.988416
Our trading (and perhaps other people's trading too) moves the exchange rate of euros to pounds. And with that the balance of currencies is restored, and the arbitration opportunity disappears.
Now let's have a look inside JoinTwo. What is so special about the self-join? Normally the join works on two separate tables. They get updated one at a time. Even if some common reason causes both tables to be updated, the update arrives from one table first. The join sees this incoming update, looks in the unchanged second table, produces an updated result. Then the update from the second table comes to the join, which takes it, looks in the already modified first table, and produces another updated result.
If both inputs are from the same table, this logic breaks. Two copies of the updates will arrive, but by the time the first one arrives, the contents of the table has been already changed. When the join looks in the table, it gets the unexpected results and creates a mess.
But JoinTwo has a fix for this. It makes use of the "pre" label of the table for its left-side update (the right side would have worked just as good, it's just a random choice):
my $selfJoin = $self->{leftTable}->same($self->{rightTable}); if ($selfJoin && !defined $self->{leftFromLabel}) { # one side must be fed from Pre label (but still let the user override) $self->{leftFromLabel} = $self->{leftTable}->getPreLabel(); }
This way when the join sees the first update, the table hasn't changed yet. And then the second copy of that update comes though the normal output label, after the table has been modified. Everything just works out as normal and the self-joins produce the correct result.
Normally you don't need to concern yourself with this, except if you're trying to filter the data coming to the join. Then remember that for leftFromLabel you have to receive the data from the table's getPreLabel(), not getOutputLabel().
Tuesday, May 15, 2012
JoinTwo: input event filtering
Let's look at how the business day logic interacts with the joins. It's typical for the business applications to keep the full data for the current day, or a few recent days, then clear the data that became old and maybe keep it only in an aggregated form.
So, let's add the business day logic to the left join example. It uses the indexes by date to find the rows that have become old:
The main loop gets extended with some more commands:
The roll-over to the next business day (after the input data previously shown) then looks like this:
Clearing the left-side table before the right-side one is more efficient than the other way around, since this is a left join. If the right-side table were cleared first, it would first update all the result records to change all the right-side fields in them to NULL, and then the clearing of the left-side table would finally delete these rows. Clearing the left side first removes this churn: it deletes all the rows from the result right away, and then when the right side is cleared, it still tries to look up the matching rows but finds nothing and produces no result. For an inner or full outer join the order would not matter: either one would produce the same amount of churn.
If you don't want these deletions to propagate though the rest of your model, you can just put a filtering logic after the join, to throw away all the modifications for the previous days. Through don't forget that you would have then to delete the previous-day data from the rest of the model's tables manually.
If you want to keep only the aggregated data, you may want to pass the join output to the aggregator without filtering and then filter the aggregator's output, thus stopping the updates to the aggregation results. You may even have a special logic in the aggregator, that would ignore the groups of the previous days. Both such approaches to the aggregation filtering have been shown before. And they aren't any less efficient than filtering on the output of the join, because if you filter after the join, you'd still have to remove the rows from the aggregation table, and would still have to filter after the aggregation too.
Now, suppose that you want to be extra optimal and don't want any join look-ups to happen at all when you delete the old data. JoinTwo has a feature that lets you do that. You can make it receive the events not directly from the tables but after filtering, using the options "leftFromLabel" and "rightFromLabel":
The same clearing now looks like this:
No output is coming from the join whatsoever. It all gets cut off before it reaches the join. It's not such a great gain though. Remember that if you want to keep the aggregated data, you would still have to delete the rows manually from the aggregation table afterwards. And the filtering logic will add overhead, not only during the clearing but all the time.
If you're not careful with the filtering conditions, it's also easy to make the results of the join inconsistent. This example filters both input tables on the same key field, with the same condition, so the output will stay always consistent. But if any of these elements were missing, it becomes possible to produce inconsistent output that has the DELETEs of different rows than INSERTs, and deletions of the rows that haven't been inserted in the first place. The reason is that even though the input events are filtered, the look-ups aren't. If some rows comes from the right side and get thrown away by the filter, and then another row comes on the left side, passes the filter, and then finds a match in that thrown-away right-side row, it will use that row in the result. And the join would think that the right-side row has already been in, and would produce an incorrect update.
So these options don't make a whole lot of a win but make a major opportunity for a mess, and probably should never be used. And will probably be deleted in the future, unless someone finds a good use for them. They have been added because they provide a roundabout way to do a self-join. But the recent additions to the Table make the self-joins possible without this kind of perversions.
So, let's add the business day logic to the left join example. It uses the indexes by date to find the rows that have become old:
our $ixtToUsdByDate = $ttToUsd->findSubIndex("byDate") or die "$!"; our $ixtPositionByDate = $ttPosition->findSubIndex("byDate") or die "$!"; sub clearByDate($$$) # ($table, $ixt, $date) { my ($table, $ixt, $date) = @_; my $next; for (my $rhit = $table->beginIdx($ixt); !$rhit->isNull(); $rhit = $next) { last if (($rhit->getRow()->get("date")) >= $date); $next = $rhit->nextIdx($ixt); # advance before removal $table->remove($rhit); } }
The main loop gets extended with some more commands:
our $businessDay = undef; while(<STDIN>) { chomp; my @data = split(/,/); # starts with a command, then string opcode my $type = shift @data; if ($type eq "cur") { $uJoin->makeArrayCall($tToUsd->getInputLabel(), @data) or die "$!"; } elsif ($type eq "pos") { $uJoin->makeArrayCall($tPosition->getInputLabel(), @data) or die "$!"; } elsif ($type eq "day") { # set the business day $businessDay = $data[0] + 0; # convert to an int } elsif ($type eq "clear") { # clear the previous day # flush the left side first, because it's an outer join &clearByDate($tPosition, $ixtPositionByDate, $businessDay); &clearByDate($tToUsd, $ixtToUsdByDate, $businessDay); } $uJoin->drainFrame(); # just in case, for completeness }
The roll-over to the next business day (after the input data previously shown) then looks like this:
day,20120311 clear join.leftLookup.out OP_DELETE date="20120310" customer="two" symbol="AAA" quantity="100" price="8" currency="GBP" toUsd="2.2" join.leftLookup.out OP_DELETE date="20120310" customer="three" symbol="AAA" quantity="100" price="300" currency="RUR" toUsd="0.04" join.leftLookup.out OP_DELETE date="20120310" customer="three" symbol="BBB" quantity="200" price="80" currency="GBP" toUsd="2.2" join.leftLookup.out OP_DELETE date="20120310" customer="one" symbol="AAA" quantity="200" price="16" currency="USD" toUsd="1"
Clearing the left-side table before the right-side one is more efficient than the other way around, since this is a left join. If the right-side table were cleared first, it would first update all the result records to change all the right-side fields in them to NULL, and then the clearing of the left-side table would finally delete these rows. Clearing the left side first removes this churn: it deletes all the rows from the result right away, and then when the right side is cleared, it still tries to look up the matching rows but finds nothing and produces no result. For an inner or full outer join the order would not matter: either one would produce the same amount of churn.
If you don't want these deletions to propagate though the rest of your model, you can just put a filtering logic after the join, to throw away all the modifications for the previous days. Through don't forget that you would have then to delete the previous-day data from the rest of the model's tables manually.
If you want to keep only the aggregated data, you may want to pass the join output to the aggregator without filtering and then filter the aggregator's output, thus stopping the updates to the aggregation results. You may even have a special logic in the aggregator, that would ignore the groups of the previous days. Both such approaches to the aggregation filtering have been shown before. And they aren't any less efficient than filtering on the output of the join, because if you filter after the join, you'd still have to remove the rows from the aggregation table, and would still have to filter after the aggregation too.
Now, suppose that you want to be extra optimal and don't want any join look-ups to happen at all when you delete the old data. JoinTwo has a feature that lets you do that. You can make it receive the events not directly from the tables but after filtering, using the options "leftFromLabel" and "rightFromLabel":
our $lbPositionCurrent = $uJoin->makeDummyLabel( $tPosition->getRowType, "lbPositionCurrent") or die "$!"; our $lbPositionFilter = $uJoin->makeLabel($tPosition->getRowType, "lbPositionFilter", undef, sub { if ($_[1]->getRow()->get("date") >= $businessDay) { $uJoin->call($lbPositionCurrent->adopt($_[1])); } }) or die "$!"; $tPosition->getOutputLabel()->chain($lbPositionFilter) or die "$!"; our $lbToUsdCurrent = $uJoin->makeDummyLabel( $tToUsd->getRowType, "lbToUsdCurrent") or die "$!"; our $lbToUsdFilter = $uJoin->makeLabel($tToUsd->getRowType, "lbToUsdFilter", undef, sub { if ($_[1]->getRow()->get("date") >= $businessDay) { $uJoin->call($lbToUsdCurrent->adopt($_[1])); } }) or die "$!"; $tToUsd->getOutputLabel()->chain($lbToUsdFilter) or die "$!"; our $join = Triceps::JoinTwo->new( name => "join", leftTable => $tPosition, leftFromLabel => $lbPositionCurrent, leftIdxPath => [ "currencyLookup" ], rightTable => $tToUsd, rightFromLabel => $lbToUsdCurrent, rightIdxPath => [ "primary" ], type => "left", ); # would die by itself on an error
The same clearing now looks like this:
day,20120311 clear
No output is coming from the join whatsoever. It all gets cut off before it reaches the join. It's not such a great gain though. Remember that if you want to keep the aggregated data, you would still have to delete the rows manually from the aggregation table afterwards. And the filtering logic will add overhead, not only during the clearing but all the time.
If you're not careful with the filtering conditions, it's also easy to make the results of the join inconsistent. This example filters both input tables on the same key field, with the same condition, so the output will stay always consistent. But if any of these elements were missing, it becomes possible to produce inconsistent output that has the DELETEs of different rows than INSERTs, and deletions of the rows that haven't been inserted in the first place. The reason is that even though the input events are filtered, the look-ups aren't. If some rows comes from the right side and get thrown away by the filter, and then another row comes on the left side, passes the filter, and then finds a match in that thrown-away right-side row, it will use that row in the result. And the join would think that the right-side row has already been in, and would produce an incorrect update.
So these options don't make a whole lot of a win but make a major opportunity for a mess, and probably should never be used. And will probably be deleted in the future, unless someone finds a good use for them. They have been added because they provide a roundabout way to do a self-join. But the recent additions to the Table make the self-joins possible without this kind of perversions.
Monday, May 14, 2012
Label chains and Rowop adoption
Those are actually two separate operations I've added for the benefits of the joins.
To check if there are any labels chained from this one, use:
The same check can be done with
but the hasChained() is more efficient since it doesn't have to construct that intermediate array.
The adoption is a way to pass the row and opcode from one rowop to another new one, with a different label:
Very convenient for building the label handlers that pass the rowops to the other labels unchanged. It also can be done with
But adopt() is easier to call and also more efficient, because less of the intermediate data surfaces from the C++ level to the Perl level. A more extended usage example will be shown momentarily.
To check if there are any labels chained from this one, use:
$result = $label->hasChained();
The same check can be done with
@chain = $label->getChain(); if ($#chain >= 0) { ... }
but the hasChained() is more efficient since it doesn't have to construct that intermediate array.
The adoption is a way to pass the row and opcode from one rowop to another new one, with a different label:
$rowop2 = $label->adopt($rowop1);
Very convenient for building the label handlers that pass the rowops to the other labels unchanged. It also can be done with
$rowop2 = $label->makeRowop($rowop1->getOpcode(), $rowop1->getRow());
But adopt() is easier to call and also more efficient, because less of the intermediate data surfaces from the C++ level to the Perl level. A more extended usage example will be shown momentarily.
JoinTwo and the key fields
The JoinTwo examples shown so far didn't have any of the result field specifications. The defaults were good enough: the non-key fields on the left and right sides have no name collisions, the key fields have the same names, and JoinTwo is smart enough to remove the duplicates in the key fields. It's time to take a closer look at this smartness.
So, what is the key duplication problem? An equi-join is done on the condition of equality of the fields on the left and on the right side. JoinTwo performs the join by requiring the indexes on these fields to be defined on both sides, so I call these fields the join key. This means that if we simply include all the fields from both sides into the result, the join results will include these key fields twice: once sourced from the left side and once sourced from the right side.
Okay, so let's avoid this duplication by removing these fields from one side's copy. Suppose we want to let them through on the left side and remove from the right side. This can be done with the option "rightFields". Assuming that the key fields on the right side are named "date" and "currency", this can be done with:
This works, but it happens to work correctly only for the inner joins. Or more exactly, inner to the side where the key fields are getting dropped, so this particular example would work for the left outer joins too. But try it with a right outer or full outer join, and you'll find that the rows that don't have the match from the left side produce the result with NULLs in the key fields. If we took this approach, the following output snippet from the recent outer join example would have been not
but will be
Why does this happen? Well, the output spec said to not pass these fields through from the right side. Only their left-side values get through. And if it's a right or full outer join, and there is no left-side match for this row, the values passed into the result from the left side will naturally be NULL.
So JoinTwo has a special bit of magic built into it: it knows how to recognize this situation and have the key fields copied into the result from whatever side happens to be present for a particular row. It makes these fields always available on both sides. And along the way it also takes care of modifying the option "rightFields" or "leftFields" to actually pass through only one of the copies.
It is all controlled by the option "fieldsUniqKey". The default value of this option is "first". It means: Enable the magic for copying the fields from the non-NULL side to the NULL side. Look at the option "fieldsLeftFirst" and figure out, which side goes first in the result. Let the key fields pass on that side unchanged (though the user can block them on that side manually too, or possibly rename them, it's his choice). On the other side, automatically generate the blocking specs for the key fields and prepend them to that side's result specification. It's smart enough to know that an undefined leftFields or rightFields means the same as ".*", so an undefined result spec is replaced by the blocking specs followed by ".*". If you later call the methods
then you will actually get back the modified field specs.
If you want the key fields to be present in a different location in the result, you can set "fieldsUniqKey" to "left" or "right". That will make them pass through on the selected side, and the blocking would be automatically added on the other side.
For more control yet, set this option to "manual". The magic for making the key fields available on both sides will still be enabled, but no automatic blocking. You can pick and choose the result fields manually, exactly as you want. Remember though that there can't be multiple fields with the same name in the result, so if both sides have these fields named the same, you've got to block or rename one of the two copies.
The final choice is "none": it simply disables the key field magic.
So, what is the key duplication problem? An equi-join is done on the condition of equality of the fields on the left and on the right side. JoinTwo performs the join by requiring the indexes on these fields to be defined on both sides, so I call these fields the join key. This means that if we simply include all the fields from both sides into the result, the join results will include these key fields twice: once sourced from the left side and once sourced from the right side.
Okay, so let's avoid this duplication by removing these fields from one side's copy. Suppose we want to let them through on the left side and remove from the right side. This can be done with the option "rightFields". Assuming that the key fields on the right side are named "date" and "currency", this can be done with:
rightFields => [ "!date", "!currency", ".*" ],
This works, but it happens to work correctly only for the inner joins. Or more exactly, inner to the side where the key fields are getting dropped, so this particular example would work for the left outer joins too. But try it with a right outer or full outer join, and you'll find that the rows that don't have the match from the left side produce the result with NULLs in the key fields. If we took this approach, the following output snippet from the recent outer join example would have been not
cur,OP_INSERT,20120310,GBP,2 join.rightLookup.out OP_INSERT date="20120310" currency="GBP" toUsd="2"
but will be
cur,OP_INSERT,20120310,GBP,2 join.rightLookup.out OP_INSERT toUsd="2"
Why does this happen? Well, the output spec said to not pass these fields through from the right side. Only their left-side values get through. And if it's a right or full outer join, and there is no left-side match for this row, the values passed into the result from the left side will naturally be NULL.
So JoinTwo has a special bit of magic built into it: it knows how to recognize this situation and have the key fields copied into the result from whatever side happens to be present for a particular row. It makes these fields always available on both sides. And along the way it also takes care of modifying the option "rightFields" or "leftFields" to actually pass through only one of the copies.
It is all controlled by the option "fieldsUniqKey". The default value of this option is "first". It means: Enable the magic for copying the fields from the non-NULL side to the NULL side. Look at the option "fieldsLeftFirst" and figure out, which side goes first in the result. Let the key fields pass on that side unchanged (though the user can block them on that side manually too, or possibly rename them, it's his choice). On the other side, automatically generate the blocking specs for the key fields and prepend them to that side's result specification. It's smart enough to know that an undefined leftFields or rightFields means the same as ".*", so an undefined result spec is replaced by the blocking specs followed by ".*". If you later call the methods
$fspec = $join->getLeftFields(); $fspec = $join->getRightFields();
then you will actually get back the modified field specs.
If you want the key fields to be present in a different location in the result, you can set "fieldsUniqKey" to "left" or "right". That will make them pass through on the selected side, and the blocking would be automatically added on the other side.
For more control yet, set this option to "manual". The magic for making the key fields available on both sides will still be enabled, but no automatic blocking. You can pick and choose the result fields manually, exactly as you want. Remember though that there can't be multiple fields with the same name in the result, so if both sides have these fields named the same, you've got to block or rename one of the two copies.
The final choice is "none": it simply disables the key field magic.
The new error handling
I've been considering the change to the error handling for a while. The repeating checks of the calls for "or confess" are pretty annoying, just dying/confessing by default would be better. And then if anyone wants to catch that death, they can use eval {} around the call.
The need to check for the recursive modification attempts in the tables struck me as something that could particularly benefit from just dying rather than returning an error code that would likely be missed. This pushed me towards starting the shift towards this new error handling scheme.
With the interaction through the Perl and the native C++ code, unrolling the call stack is a pretty tricky proposition but I've got it worked out. If the error is not caught with eval, it keeps unrolling the call sequence through both the Perl call stack and the Triceps unit call stack. If the error is caught with eval, the message and the whole stack trace will be as usual in $@.
The bad news is that so far only a limited subset of the calls use the new scheme. The rest are still setting $! and returning an undef. So overall it's a mix and you need to remember, which calls work which way. In the future versions eventually everything will be converted to the new scheme. For a bit of backwards-compatibility, the error messages from the new-style dying calls are saved in both $@ and $!. This will eventually go away, and only $@ will be used.
The converted calls are:
Note though that right now this works only with the label handlers. The handlers for the tracers, aggregators, sorted indexes etc. still work in the old way, with an error message printed to stderr.
These errors from the label handlers are not to be treated lightly. Usually you can't just catch them be an eval and continue on your way. The reason is that as the Unit scheduling stack gets unrolled, any unprocessed rowops in it get thrown away. By the time you catch the error, the data is probably in an inconsistent state, and you can't just dust off and continue. You would have to reset your model to a good state first. Treat such errors as near-fatal. It could have been possible to keep going through the scheduled rowops, collecting the errors along the way and then returning the whole pile. But it seems more important to report the error as soon as possible. And anyway, if something has died somewhere, it has probably already left some state inconsistent, and continuing to run forward as normal would just pile up crap upon crap. If you want the errors to be handled early and lightly, make sure that your Perl code doesn't die in the first place.
Another added item is an explicit check that the labels are not called recursively. That is, if a label is called, it can not call itself again, directly or through the other labels, until it returns. Such recursive calls don't hurt anything by themselves but they are a bad design practice, and it seems more important to catch the accidental errors of this kind early than to leave the door open for the intentional use of them by design. If you want a label's processing to loop back to itself, the proper way it to arrange it with schedule() or loopAt().
The need to check for the recursive modification attempts in the tables struck me as something that could particularly benefit from just dying rather than returning an error code that would likely be missed. This pushed me towards starting the shift towards this new error handling scheme.
With the interaction through the Perl and the native C++ code, unrolling the call stack is a pretty tricky proposition but I've got it worked out. If the error is not caught with eval, it keeps unrolling the call sequence through both the Perl call stack and the Triceps unit call stack. If the error is caught with eval, the message and the whole stack trace will be as usual in $@.
The bad news is that so far only a limited subset of the calls use the new scheme. The rest are still setting $! and returning an undef. So overall it's a mix and you need to remember, which calls work which way. In the future versions eventually everything will be converted to the new scheme. For a bit of backwards-compatibility, the error messages from the new-style dying calls are saved in both $@ and $!. This will eventually go away, and only $@ will be used.
The converted calls are:
- The Table modification methods:
- insert()
- remove()
- deleteRow()
- The Unit methods that deal with the scheduling:
- schedule()
- fork()
- call()
- enqueue()
- setMark()
- loopAt()
- callNext()
- drainFrame()
- clearLabels()
Note though that right now this works only with the label handlers. The handlers for the tracers, aggregators, sorted indexes etc. still work in the old way, with an error message printed to stderr.
These errors from the label handlers are not to be treated lightly. Usually you can't just catch them be an eval and continue on your way. The reason is that as the Unit scheduling stack gets unrolled, any unprocessed rowops in it get thrown away. By the time you catch the error, the data is probably in an inconsistent state, and you can't just dust off and continue. You would have to reset your model to a good state first. Treat such errors as near-fatal. It could have been possible to keep going through the scheduled rowops, collecting the errors along the way and then returning the whole pile. But it seems more important to report the error as soon as possible. And anyway, if something has died somewhere, it has probably already left some state inconsistent, and continuing to run forward as normal would just pile up crap upon crap. If you want the errors to be handled early and lightly, make sure that your Perl code doesn't die in the first place.
Another added item is an explicit check that the labels are not called recursively. That is, if a label is called, it can not call itself again, directly or through the other labels, until it returns. Such recursive calls don't hurt anything by themselves but they are a bad design practice, and it seems more important to catch the accidental errors of this kind early than to leave the door open for the intentional use of them by design. If you want a label's processing to loop back to itself, the proper way it to arrange it with schedule() or loopAt().