Thursday, May 10, 2012

JoinTwo: left outer join

I've had the JoinTwo all working, but then when I've started writing the docs, I've felt that some of its limitations are a bit weird. These limitations came from the way the tables work, so instead of documenting the weirdness, I've set out to fix these issues in the tables, which pulled other things with it. So, after an interruption, now it all works better, but I'll need to document the changes to the tables and a whole lot more. Oh well, more about this later, and now returning to our regular programming about the joins.

Last time we've seen that the left outer join would be better for the logic. Not a problem, just change the join type option:

our $join = Triceps::JoinTwo->new(
    name => "join",
    leftTable => $tPosition,
    leftIdxPath => [ "currencyLookup" ],
    rightTable => $tToUsd,
    rightIdxPath => [ "primary" ],
    type => "left",
); # would die by itself on an error


Now the positions would pass through even if the currency translation is not available. The same input now produces a different result:

cur,OP_INSERT,20120310,USD,1
cur,OP_INSERT,20120310,GBP,2
cur,OP_INSERT,20120310,EUR,1.5
pos,OP_INSERT,20120310,one,AAA,100,15,USD
join.leftLookup.out OP_INSERT date="20120310" customer="one"
 symbol="AAA" quantity="100" price="15" currency="USD" toUsd="1" 
pos,OP_INSERT,20120310,two,AAA,100,8,GBP
join.leftLookup.out OP_INSERT date="20120310" customer="two"
 symbol="AAA" quantity="100" price="8" currency="GBP" toUsd="2" 

So far things are going the same as for the inner join.

pos,OP_INSERT,20120310,three,AAA,100,300,RUR
join.leftLookup.out OP_INSERT date="20120310" customer="three"
 symbol="AAA" quantity="100" price="300" currency="RUR" 

Here comes the first difference: even though there is no translation for RUR, the row still passes through (with the field toUsd being NULL).

pos,OP_INSERT,20120310,three,BBB,200,80,GBP
join.leftLookup.out OP_INSERT date="20120310" customer="three"
 symbol="BBB" quantity="200" price="80" currency="GBP" toUsd="2" 

This is also unchanged.

cur,OP_INSERT,20120310,RUR,0.04
join.rightLookup.out OP_DELETE date="20120310" customer="three"
 symbol="AAA" quantity="100" price="300" currency="RUR" 
join.rightLookup.out OP_INSERT date="20120310" customer="three"
 symbol="AAA" quantity="100" price="300" currency="RUR" toUsd="0.04" 

Now the second difference: since this row from the left side has already passed through, just sending another INSERT for it would make the data inconsistent. The original result without the translation must be deleted first, and then a new one, with translation, inserted. JoinTwo is smart enough to figure it out all by itself.

cur,OP_DELETE,20120310,GBP,2
join.rightLookup.out OP_DELETE date="20120310" customer="two"
 symbol="AAA" quantity="100" price="8" currency="GBP" toUsd="2" 
join.rightLookup.out OP_INSERT date="20120310" customer="two"
 symbol="AAA" quantity="100" price="8" currency="GBP" 
join.rightLookup.out OP_DELETE date="20120310" customer="three"
 symbol="BBB" quantity="200" price="80" currency="GBP" toUsd="2" 
join.rightLookup.out OP_INSERT date="20120310" customer="three"
 symbol="BBB" quantity="200" price="80" currency="GBP" 

The same logic works for the deletes, only backwards: when the translation for GBP is deleted, the result rows that used it change to the ones without the translation.

cur,OP_INSERT,20120310,GBP,2.2
join.rightLookup.out OP_DELETE date="20120310" customer="two"
 symbol="AAA" quantity="100" price="8" currency="GBP" 
join.rightLookup.out OP_INSERT date="20120310" customer="two"
 symbol="AAA" quantity="100" price="8" currency="GBP" toUsd="2.2" 
join.rightLookup.out OP_DELETE date="20120310" customer="three"
 symbol="BBB" quantity="200" price="80" currency="GBP" 
join.rightLookup.out OP_INSERT date="20120310" customer="three"
 symbol="BBB" quantity="200" price="80" currency="GBP" toUsd="2.2" 

And again, when the new translation for GBP comes in, the DELETE-INSERT sequence is done for each of the rows. As you can see, the update of the GBP translation worked in the not most efficient way. Fundamentally, if we knew that a DELETE of GBP will be immediately followed by an INSERT, we could skip inserting and then deleting the rows with the NULL in toUsd. But we don't know, and in Triceps there is no way to know that.

If you really, really want to avoid the propagation of these intermediate changes, put a Collapse after the join, and flush it only after the whole update has been processed. There will be more overhead in the Collapse itself, but all the logic below it will skip the intermediate changes. If this logic below is heavy-weight, that might be an overall win. A caveat though: a Collapse requires that the data has a primary key, a JoinTwo doesn't require its result (nor its inputs) to have a primary key. Because of this, the collapse might not work right with every possible join, you'd have to limit yourself to the joins that produce the data with a primary key.

pos,OP_DELETE,20120310,one,AAA,100,15,USD
join.leftLookup.out OP_DELETE date="20120310" customer="one"
 symbol="AAA" quantity="100" price="15" currency="USD" toUsd="1" 
pos,OP_INSERT,20120310,one,AAA,200,16,USD
join.leftLookup.out OP_INSERT date="20120310" customer="one"
 symbol="AAA" quantity="200" price="16" currency="USD" toUsd="1"

And the rest is again the same as with an inner join.

No comments:

Post a Comment