Sergey Babkin on CEP and stuff: rowop

Showing posts with label rowop. Show all posts

Tuesday, December 25, 2012

Tray in C++

A Tray in C++, defined in sched/Tray.h, is simply a deque of Rowop references, plus an Starget, so that it can be referenced itself:

class Tray : public Starget, public deque< Autoref<Rowop> >

All it really defines is the constructors:

Tray();
Tray(const Tray &orig);

The operations on the Tray are just the usual deque operations.

Yes, you can copy the trays by constructing a new one from an old one:

Autoref<Tray> t1 = new Tray;
t1->push_back(op1);

Autoref<Tray> t3 = new Tray(*t1);

Afterwards t3 will contain references to the same rowops as t1 (but will be a different Tray than t1!).

The assignments (operator=) happen to just work out of the box because the operator= implementation in Starget does the smart thing and avoids the corruption of the reference counter. So you can do things like

*t3 = *t1;

It's worth noticing once more that unlike Rows and Rowops, the Trays are mutable. If you have multiple references to the same Tray, modifying the Tray will make all the references see its new contents!

An important difference from the Perl API is that in C++ the Tray is not associated with a Unit. It's constructed simply by calling its constructor, and there is no Unit involved. It's possible to create a tray that contains a mix of rowops for different units. If you combine the C++ and Perl code, and then create such mixes in the C++ part, the Perl part of your code won't be happy.

And there is actually a way to create the mixed-unit trays even in the Perl code, in the tray of FnBinding. But this situation would be caught when trying to get the tray into the Perl level, and the workaround is to use the method FnBinding:callTray().

The reason why Perl associates the trays with a unit is to make the check of enqueueing a tray easy: just check that the tray belongs to the right unit, and it's all guaranteed to be right. At the C++ level no such checks are made. If you enqueue the rowops on labels belonging to a wrong unit, they will be enqueued quietly, will attempt to execute, and from there everything will likely to go wrong. So be disciplined. And maybe I'll think of a better way for keeping the unit consistency in the future.

Monday, December 24, 2012

Rowop in C++

I've jumped right into the Unit without showing the objects it operates on. Now let's start catching up and look at the Rowops. The Rowop class is defined in sched/Rowop.h.

The Rowop in C++ consists of all the same parts as in Perl API: a label, a row, and opcode.

It has one more item that's not really visible in the Perl API, the enqueueing mode, but it's semi-hidden in the C++ API as well. The only place where it's used is in Unit::enqueueDelayedTray(). This basically allows to build a tray of rowops, each with its own enqueueing mode, and then enqueue all of them appropriately in one go. This is actually kind of historic and caused by the explicit enqueueing mode specification for the Table labels. It's getting obsolete and will be removed somewhere soon.

The Rowop class inherits from Starget, usable in one thread only. Since it refers to the Labels, that are by definition single-threaded, this makes sense. A consequence is that you can't simply pass the Rowops between the threads. The passing-between-threads requires a separate representation that doesn't refer to the Labels but instead uses something like a numeric index (and of course the Mtarget base class). This is a part of the ongoing work on multithreading, but you can also make your own.

The opcodes are defined in the union Rowop::Opcode, so you normally use them as Rowop::OP_INSERT etc. As described before, the opcodes actually contain a bitmap of individual flags, defined in the union Rowop::OpcodeFlags: Rowop::OCF_INSERT and Rowop::OCF_DELETE. You don't really need to use these flags directly unless you really, really want to.

Besides the 3 already described opcodes (OP_NOP, OP_INSERT and OP_DELETE) there is another one, OP_BAD. It's a special value returned by the string-to-opcode conversion method instead of the -1 returne dby the other similar method. The reason is that OP_BAD is specially formatted to be understood by all the normal opcode type checks as NOP, while -1 would be seen as a combination of INSERT and DELETE. So if you miss checking the result of conversion on a bad string, at least you would get a NOP and not some mysterious operation. The reason why OP_BAD is not exporeted to Perl is that in Perl an undef is used as the indication of the invalid value, and works even better.

There is a pretty wide variety of Rowop constructors:

Rowop(const Label *label, Opcode op, const Row *row);
Rowop(const Label *label, Opcode op, const Rowref &row);

Rowop(const Label *label, Opcode op, const Row *row, int enqMode);
Rowop(const Label *label, Opcode op, const Rowref &row, int enqMode);

Rowop(const Rowop &orig);
Rowop(const Label *label, const Rowop *orig);

The constructors with the explicit enqMode are best not be used outside of the Triceps internals, and will eventually be obsoleted. The last two are the copy constructor, and the adoption constructor which underlies Label::adopt() and can as well be used directly.

Once a rowop is constructed, its components can not be changed any more, only read.

Opcode getOpcode() const;
const Label *getLabel() const;
const Row *getRow() const;
int getEnqMode() const;

Read back the components of the Rowop. Again, the getEnqMode() is on the way to obsolescence. And if you need to check the opcode for being an insert or delete, the better way is to use the explicit test methods, rather than getting the opcode and comparing it for equality:

bool isInsert() const;
bool isDelete() const;
bool isNop() const;

Check whether the opcode requests an insert or delete (or neither).

The same checks are available as static methods that can be used on the opcode values:

static bool isInsert(int op);
static bool isDelete(int op);
static bool isNop(int op);

And the final part is the conversion between the strings and values for the Opcode and OpcodeFlags enums:

static const char *opcodeString(int code);
static int stringOpcode(const char *op);
static const char *ocfString(int flag, const char *def = "???");
static int stringOcf(const char *flag);

As mentioned above, stringOpcode() returns OP_BAD for the unknown strings, not -1.

Monday, May 14, 2012

Label chains and Rowop adoption

Those are actually two separate operations I've added for the benefits of the joins.

To check if there are any labels chained from this one, use:

$result = $label->hasChained();

The same check can be done with

@chain = $label->getChain();

if ($#chain >= 0) { ... }

but the hasChained() is more efficient since it doesn't have to construct that intermediate array.

The adoption is a way to pass the row and opcode from one rowop to another new one, with a different label:

$rowop2 = $label->adopt($rowop1);

Very convenient for building the label handlers that pass the rowops to the other labels unchanged. It also can be done with

$rowop2 = $label->makeRowop($rowop1->getOpcode(), $rowop1->getRow());

But adopt() is easier to call and also more efficient, because less of the intermediate data surfaces from the C++ level to the Perl level. A more extended usage example will be shown momentarily.

Saturday, February 18, 2012

Rowop creation wrappers

When writing the examples, I've got kind of tired of making all these 3-level-nested method calls to pass a set of data to a label. So now I've added a bunch of convenience wrappers for that purpose. The row-sending part from the manual example in computeAverage() now looks much simpler:

  if ($count) {
    $avg = $sum/$count;
    $uTrades->makeHashCall($lbAvgPriceHelper, &Triceps::OP_INSERT,
      symbol => $rhLast->getRow()->get("symbol"),
      id => $rhLast->getRow()->get("id"),
      price => $avg
    );
  } else {
    $uTrades->makeHashCall($lbAvgPriceHelper, &Triceps::OP_DELETE,
      symbol => $rLastMod->get("symbol"),
    );
  }

The full list of the methods added is:

$label->makeRowopHash($opcode, $fieldName => $fieldValue, ...)
$label->makeRowopArray($opcode, $fieldValue, ...)

$unit->makeHashCall($label, $opcode, $fieldName => $fieldValue, ...)
$unit->makeArrayCall($label, $opcode, $fieldValue, ...)

$unit->makeHashSchedule($label, $opcode, $fieldName => $fieldValue, ...)
$unit->makeArraySchedule($label, $opcode, $fieldValue, ...)

$unit->makeHashLoopAt($mark, $label, $opcode, $fieldName => $fieldValue, ...)
$unit->makeArrayLoopAt($mark, $label, $opcode, $fieldValue, ...)

The label methods amount to calling makeRowHash() or makeRowArray() on their row type, and then wrapping the result into makeRowop(). The unit methods call the new label methods to create the rowop and then call, schedule of loop it. There aren't similar wrappers for forking or general enqueueing because those methods are not envisioned to be used often.

In the future the convenience methods will move into the C++ code and will become not only more convenient but also more efficient.

Wednesday, January 18, 2012

Execution Unit

After discussing the principles of scheduling in Triceps, let's get down to the nuts and bolts.

An execution unit represents one logical thread of Triceps. In some special cases more that one unit per actual thread may be useful, but never one unit shared between threads.

A unit is created as:

$myUnit = Triceps::Unit->new("name") or die "$!";

The name argument as usual is used for later debugging, and by convention should be the same as the name of the unite variable ("myUnit" in this case). The name can also be changed later:

$myUnit->setName("newName");

It returns no value. Though in practice there is no good reason for it, and this call will likely be removed in the future. The name can be received back:

$name = $myUnit->getName();

Also, as usual, the variable $myName here contains a reference to the actual unit object, and two references can be compared, whether they refer to the same object:

$result = $unit1->same($unit2);

The rowops are enqueued with the calls:

$unit->call($rowop, ...) or die "$!";
$unit->fork($rowop, ...) or die "$!";
$unit->schedule($rowop, ...) or die "$!";

Also there is a call that selects the enqueueing mode by argument:

$unit->enqueque($mode, $rowop, ...) or die "$!";

Multiple rowops can be specified as arguments. Calling these functions with multiple arguments produces the same result as doing multiple calls with one argument at a time. Not only rowops but also trays (to be discussed later) of rowops can be used as arguments.

The mode for enqueue() is one of either Triceps constants

&Triceps::EM_CALL
&Triceps::EM_FORK
&Triceps::EM_SCHEDULE

or the matching strings "EM_CALL", "EM_FORK", "EM_SCHEDULE". As usual, the constant form is more efficient. There are calls to convert between the constant and string representations:

$string = &Triceps::emString($value);
$value = &Triceps::stringEm($string);

As usual, if the value can not be translated they return undef.

The frame marks for looping are created as their own class:

$mark = Triceps::FrameMark->new("name") or die "$!";

The name can be received back from the mark:

$name = $mark->getName();

Other than that, the frame marks are completely opaque, and can be used only for the loop scheduling. Not even the same() method is supported for them at the moment, though it probably will be in the future. The mark gets set and used as:

$unit->setMark($mark);
$unit->loopAt($mark, $rowop, ...) or die "$!";

The rowop arguments of the loopAt() are the same as for the other enqueueing functions.

The methods for creation of labels have been already discussed. There also are similar methods for creation of tables and trays that will be discussed later:

$label = $unit->makeDummyLabel($rowType, "name") or die "$!";
$label = $unit->makeLabel($rowType, "name",
  $clearSub, $execSub, @args) or die "$!";
$table = $unit->makeTable($tableType, $endMode, "name") or die "$!";
$tray = $unit->makeTray(@rowops) or die "$!";

The unit can be checked for the emptiness of its queues:

$result = $unit->empty();

The functions for execution are:

$unit->callNext();
$unit->drainFrame();

The callNext() takes one label from the top stack frame queue and calls it. If the innermost frame happens to be empty, it does nothing. The drainFrame() calls the rowops from the top stack frame until it becomes empty. This includes any rowops that may be created and enqueued as part of the execution of the previous rowops.

A typical way of processing the incoming rowops in a loop is:

$stop = 0; 
while (!$stop) {
  $rowop = readRowop(); # some user-defined function
  $unit->schedule($rowop);
  $unit->drainFrame();
}

All the unit's labels get cleared with the call

$unit->clearLabels();

To not forget calling it, a separate clearing trigger object can be created:

my $clearUnit = $unit->makeClearingTrigger();

The variable $clearUnit would normally be a global (in a thread) variable. Don't copy the reference to the other variables! Then when the thread completes and the global variables get destroyed, the trigger object will be also destroyed, and will trigger the clearing of the unit's labels, thus breaking up any reference loops and allowing to destroy the bits and pieces.

The only item left is the tracers, and they will be described in a separate post.

Saturday, January 7, 2012

Row operations

A row operation (also known as rowop) in Triceps is an unit of work for a label. It's always destined for a particular label (which could also pass it to its chained labels), and has a row to process and an opcode. The defined opcodes are:

Triceps::OP_NOP
Triceps::OP_INSERT
Triceps::OP_DELETE

A row operation is constructed as:

$rowop = $label->makeRowop($opcode, $row);

The opcode may be specified as a Triceps constant or as one of the strings "OP_NOP", "OP_INSERT", "OP_DELETE". Historically, there is also an optional extra argument for enqueuing mode but it's already obsolete.

Since the labels are single-threaded, the rowops are single-threaded too. The rowops are immutable, just as the rows are.

The references to rowops can be compared as usual

$rowop1->same($rowop2)

returns true if both point to the same rowop object.

The rowop data can be extracted back:

$label = $rowop->getLabel();
$opcode = $rowop->getOpcode();
$row = $rowop->getRow();

A copy of rowop (not just another reference but an honest separate copied object) can be created with

$rowop2 = $rowop1->copy();

However, since the rowops are immutable, a reference is just as good as a copy. This method is historic and will likely be removed or modified.

There also are calls to directly check the meaning of the opcode:

$rowop->isNop()
$rowop->isInsert()
$rowop->isDelete()

The typical idiom for handling a label is:

if ($rowop->isInsert()) {
  # handle the insert logic ...
} elsif($rowop->isDelete()) {
  # handle the delete logic...
}

:$
The NOPs get silently ignored in this idiom, as they should be. Generally there is no point in creation of the rowops with NOP opcode, unless you want to use them for some weird logic.

The main Triceps package also provides functions to check the extracted opcodes:

Triceps::isNop($opcode)
Triceps::isInsert($opcode)
Triceps::isDelete($opcode)

The same-named methods of Rowop are just the more convenient and efficient ways to say

Triceps::isNop($rowop->getOpcode())
Triceps::isInsert($rowop->getOpcode())
Triceps::isDelete($rowop->getOpcode())

The functions to convert between the opcode integer values and strings are

$opcode = &Triceps::stringOpcode($opcodeName);
$opcodeName = &Triceps::opcodeString($opcode);

A Rowop can be printed (usually for debugging purposes) with

$string = $rowop->printP();

Just as with a row, printP() means that it's implemented in Perl. In the future a print() done right in C++ may be added, but for now I try to keep all the interpretation of the data on the Perl side. The following example gives an idea of the format in which the rowops get printed:

$lb = $unit->makeDummyLabel($rt, "lb");
$rowop = $lb->makeRowop(&Triceps::OP_INSERT, $row);
print $rowop->printP(), "\n";

would produce

lb OP_INSERT a="123" b="456" c="3000000000000000" d="3.14" e="text"

The row contents is printed through Row::printP(), so it has the same format.

Friday, December 23, 2011

Hello, world!

Let's finally get to business: write the "Hello, world!" program with Triceps. Since Triceps is an embeddable library, naturally, the smallest "Hello, world!" program would be in the host language without Triceps, but it would not be interesting. So here is the a bit contrived but more interesting Perl program that passes some data through the Triceps machinery:

use Triceps;

$hwunit = Triceps::Unit->new("hwunit") or die "$!";
$hw_rt = Triceps::RowType->new(
  greeting => "string",
  address => "string",
) or die "$!";

my $print_greeting = $hwunit->makeLabel($hw_rt, "print_greeting", undef, 
  sub {
    my ($label, $rowop) = @_;
    printf "%s!\n", join(', ', $rowop->getRow()->toArray());
  } 
) or die "$!";

$hwunit->call($print_greeting->makeRowop(&Triceps::OP_INSERT,
  $hw_rt->makeRowHash(
    greeting => "Hello",
    address => "world",
  ) 
)) or die "$!";

What happens there? First, we import the Triceps module. Then we create a Triceps execution unit. An execution unit keeps the Triceps context and controls the execution for one thread. Nothing really stops you from having multiple execution units in the same thread, however there is not a whole lot of benefit in it either. But a single execution unit must never ever be used in multiple threads. It's single-threaded by design and has no synchronization in it. The argument of the constructor is the name of the unit, that can be used in printing messages about it. It doesn't have to be the same as the name of the variable that keeps the reference to the unit, but it's a convenient convention to make the debugging easier.

If something goes wrong, the constructor will return an undef and set the error message in $!. This actually has turned out to be not such a good idea as it seemed, since writing "or die" at every line quickly turns tedious. And there is usually not much point in catching the errors of this type, since they are essentially the compilation errors and should cause the program to die anyway. So, this will be soon changed throughought the code to just die with the message (and if it needs to be caught, it can be caught with eval).

The next statement creates the type for rows. For the simplest example, one row type is enough. It contains two string fields. A row type does not belong to an execution unit. It may be used in parallel by multiple threads. Once a row type is created, it's immutable, and that's the story for pretty much all the Triceps objects that can be shared between multiple threads: they are created, they become immutable, and then they can be shared. (Of course, the containers that facilitate the passing of data between the threads would have to be an exception to this rule).

Then we create a label. If you look at the "SQLy vs procedural" example a little while back, you'll see that the labels are analogs of streams in Coral8. And that's what they are in Triceps. Of course, now, in the days of the structured programming, we don't create labels for GOTOs all over the place. But we still use labels. The function names are labels, the loops in Perl may have labels. So a Triceps label can often be seen kind of like a function definition, but so far only kind of. It takes a data row as a parameter and does something with it. But unlike a proper function it has no way to return the processed data back to the caller. It has to either pass the processed data to other labels or collect it in some hardcoded data structure, from which the caller can later extract it back. This means that until this gets worked out better, a Triceps label is still much more like a GOTO label or Coral8 stream than a proper function. Just like the unit, a label may be used in only one thread.

A label takes a row type for the rows it accepts, a name (again, purely for the ease of debugging) and a reference to a Perl function that will be processing the data. Extra arguments for the function can be specified as well, but there is no use for them in this example.

Here it's a simple unnamed function. Though of course a reference to a named function can be used instead, and the same function may be reused for multiple labels. Whenever the label gets a row operation to process, its function gets called with the reference to the label object, the row operation object, and whatever extra arguments were specified at the label creation (none in this example). The example just prints a message combined from the data in the row.

Note that a label doesn't just get a row. It gets a row operation ("rowop" as it's called throughout the code). It's an important distinction. A row just stores some data. As the row gets passed around, it gets referenced and unreferenced, but it just stays the same until the last reference to it disappears, and then it gets destroyed. It doesn't know what happens with the data, it just stores them. A row may be shared between multiple threads. On the other hand, a row operation says "take these data and do a such and such thing with them". A row operation is a combination of a row of data, an operation code, and a label that is to execute the operation. It is confined to a single thread. Inside this thread a reference to a row operation may be kept and reused again and again, since the row operation object is also immutable.

Triceps has the explicit operation codes, very much like Aleri (only Aleri doesn't differentiate between a row and row operation, every row there has an opcode in it, and the Sybase CEP R5 does the same). It might be just my background, but let me tell you: the CEP systems without the explicit opcodes are a pain. The visible opcodes make life a lot easier. However unlike Aleri, there is no UPDATE opcode. The available opcodes are INSERT, DELETE and NOP (no-operation). If you want to update something, you send two operations: first DELETE for the old value, then INSERT for the new value. There will be a section later with more details and comparisons, but for now that's enough information.

For this simple example, the opcode doesn't really matter, so the label function quietly ignores it. It gets the row from the row operation and extracts the data from it into the Perl format, then prints them. There are two Perl formats supported: an array and a hash. In the array format, the array contains the values of the fields in the order they are defined in the row type. The hash format consists of name-value pairs, which may be stored either in an actual hash or in an array. The conversion from row to a hash actually returns an array of values which becomes a hash if it gets stored into a hash variable.

As a side note, this also suggests, how the systems without explicit opcodes came to be: they've been initially built on the simple stateless examples. And when the more complex examples have turned up, they've been aready stuck on this path, and could not afford too deep a retrofit.

The final part of the example is the creation of a row operation for our label, with an INSERT opcode and a row created from hash-formatted Perl data, and calling it through the execution unit. The row type provides a method to construct the rows, and the label provides a method to construct the row operations for it. The call() method of the execution unit does exactly what its name implies: it evaluates the label function right now, and returns after all its processing its done.

Sergey Babkin on CEP and stuff