Sergey Babkin on CEP and stuff: row

Showing posts with label row. Show all posts

Monday, June 10, 2013

checking a row for emptiness

I've mentioned that the rowops with the empty rows (i.e. rows with all fields NULL) on the _BEGIN_ and _END_ labels in the facets are treated specially. The same check is also available as the directly callable method, in case if you have some other uses for it.

In Perl it is:

$result = $row->isEmpty();

It returns 1 if all the fields are NULL and 0 otherwise.

In C++ this is done as a method of the RowType class:

virtual bool RowType::isRowEmpty(const Row *row) const;

And is used like this:

bool res = type->isRowEmpty(row);

It's also available as a convenience wrapper method on the Rowref:

Rowref r1(...);
if (r1.isRowEmpty()) { ... }

Friday, April 19, 2013

Object passing between threads, and Perl code snippets

A limitation of the Perl threads is that no variables can be shared between them. When a new thread gets created, it gets a copy of all the variables of the parent. Well, of all the plain Perl variables. With the XS extensions your luck may vary: the variables might get copied, might become undef, or just become broken (if the XS module is not threads-aware). Copying the XS variables requires a quite high overhead at all the other times, so Triceps doesn't do it and all the Triceps object become undefined in the new thread.

However there is a way to pass around certain objects through the Nexuses.

First, obviously, the Nexuses are intended to pass through the Rowops. These Rowops coming out of a nexus are not the same Rowop objects that went in. Rowop is a single-threaded object and can not be shared by two threads. Instead it gets converted to an internal form while in the nexus, and gets re-created, pointing to the same Row object and to the correct Label in the local facet.

Then, again obviously, the Facets get imported through the Nexus, together with their row types.

And two more types of objects can be exported through a Nexus: the RowTypes and TableTypes. They get exported through the options as in this example:

$fa = $owner->makeNexus(
    name => "nx1",
    labels => [
        one => $rt1,
        two => $lb,
    ],
    rowTypes => [
        one => $rt2,
        two => $rt1,
    ],
    tableTypes => [
        one => $tt1,
        two => $tt2,
    ],
    import => "writer",
);

As you can see, the namespaces for the labels, row types and table types are completely independent, and the same names can be reused in each of them for different meaning. All the three sections are optional, so if you want, you can order only the types in the nexus, without any labels.

They can then be extracted from the imported facet as:

$rt1 = $fa->impRowType("one");
$tt1 = $fa->impTableType("one");

Or the whole set of name-value pairs can be obtained with:

@rtset = $fa->impRowTypesHash();
@ttset = $fa->impTableTypesHash();

The exact table types and row types (by themselves or in the table types or labels) in the importing thread will be copied. It's technically possible to share the references to the same row type in the C++ code but it's more efficient to make a separate copy for each thread, and thus the Perl API goes along the more efficient way.

The import is smart in the sense that it preserves the sameness of the row types: if in the exporting thread the same row type was referred from multiple places in the labels, row types and table types sections, in the imported facet that would again be the same row type (even though of course not the one that has been exported but its copy). This again helps with the efficiency when various objects decide if the rows created by this and that type are compatible.

This is all well until you want to export a table type that has an index with a Perl sort condition in it, or an aggregator with the Perl code. The Perl code objects are tricky: they get copied OK when a new thread is created but the attempts to import them through a nexus later causes a terrible memory corruption. So Triceps doesn't allow to export the table types with the function references in it. But it provides an alternative solution: the code snippets can be specified as the source code. It gets compiled when the table type gets initialized. When a table type gets imported through a nexus, it brings the source code with it. The imported table types are always uninitialized, so at initialization time the source code gets compiled in the new thread and works.

It all works transparently: just specify a string instead of a function reference when creating the index, and it will be recognized and processed. For example:

$it= Triceps::IndexType->newPerlSorted("b_c", undef, '
    my $res = ($_[0]->get("b") <=> $_[1]->get("b")
        || $_[0]->get("c") <=> $_[1]->get("c"));
    return $res;
    '
);

Before the code gets compiled, it gets wrapped into a 'sub { ... }', so don't write your own sub in the code string, that would be an error.

There is also the issue of arguments that can be specified for these functions. Triceps is now smart enough to handle the arguments that are one of:

undef
integer
floating-point
string
Triceps::RowType object
Triceps::Row object
reference to an array or hash thereof

It converts the data to an internal C++ representation in the nexus and then converts it back on import. So, if a TableType has all the code in it in the source form, and the arguments for this code within the limits of this format, it can be exported through the nexus. Otherwise an attempt to export it will fail.

I've modified the SimpleOrderedIndex to use the source code format, and it will pass through the nexuses as well.

The Aggregators have a similar problem, and I'm working on converting them to the source code format too.

A little more about the differences between the code references and the source code format:

When you compile a function, it carries with it the lexical context. So you can make the closures that refer to the "my" variables in their lexical scope. With the source code you can't do this. The table type compiles them at initialization time in the context of the main package, and that's all they can see. Remember also that the global variables are not shared between the threads, so if you refer to a global variable in the code snippet and rely on a value in that variable, it won't be present in the other threads (unless the other threads are direct descendants and the value was set before their creation).

While working with the custom sorted indexes, I've also fixed the way the errors are reported in their Perl handlers. The errors used to be just printed on stderr. Now they propagate properly through the table, and the table operations die with the Per handler's error message. Since an error in the sorting function means that things are going very, very wrong, after that the table becomes inoperative and will die on all the subsequent operations as well.

Thursday, August 30, 2012

RowType operations on Rows

As has been mentioned before, a RowType acts as a virtual call table for the rows of that type. The operations are:

bool isFieldNull(const Row *row, int nf) const;

Checks if the field nf in a Row is NULL.

bool getField(const Row *row, int nf, const char *&ptr, intptr_t &len) const;

Returns, where to find a field in a row. nf is as usual the field number, with the data returned in ptr (pointer to the start of the field) and len (length of the field data). The function return value shows whether the field is not NULL. Also, for a NULL field the len will be 0. The returned data pointer type is constant, to remind that the rows are immutable and the data in them must not be changed.

However for the most types you can't refer by this pointer and get the desired value directly, because the data might not be aligned right for that data type. Because of this the returned pointer is a char* and not void *. If you have an int64 field, you can't just do

int64_t *data;
intptr_t len;
if (getField(myrow, myfield, data, len)) {
int64_t val = *data; // WRONG!
}

Fortunately, the type checks will catch this usage attempt right at the call of getField(). But there also are the convenience functions that return the values of particular types. They are implemented on top of getField() and take care of the alignment issue.

uint8_t getUint8(const Row *row, int nf, int pos = 0) const;
int32_t getInt32(const Row *row, int nf, int pos = 0) const;
int64_t getInt64(const Row *row, int nf, int pos = 0) const;
double getFloat64(const Row *row, int nf, int pos = 0) const;
const char *getString(const Row *row, int nf) const;

The extra argument pos is the position of the value in an array field. It's an array index, not a byte offset. For the scalar fields it must be 0. If the field is NULL or pos points beyond the end of the array, the returned value will be 0, which matches the Perl idiom of treating the undefined values as zeroes. If you care whether the field is NULL or not, check it first:

if (!rt1->isFieldNull(r1, nf)) {
int64_t val = rt1->getInt64(r1, nf);
...
}

Since the strings are normally stored 0-terminated (but it's your responsibility to store them 0-terminated!), getString() just returns the pointer directly to a value in the field. If the string field is NULL, a not a NULL pointer but a pointer to an empty string is returned, in the same spirit of treating the undefined values as zeroes or empty strings. If you want to explicitlcy check for NULLs or get the string field length (including \0 at the end), use getField(). Since there are no string arrays, there is no position argument for getString().

For a side note, the arguments of these calls are Row*, not Rowref. It's cheaper, and OK for memory management because it's expected that the row would be held in a Rowref variable anyway while the data is extracted form it. Don't construct an anonymous rowref object and immediately try to extract a value from it!

int64_t val = rt1->getInt64(Rowref(rt1, datavec), nf); // WRONG!

However if you have a data vector, there is no point in constructing a row to extract back the same data in the first place.

Right now when I was writing this, it has impressed me, how ugly are these calls on a Rowref:

Rowref r1(...);
...
int64_t val = r1.getType()->getInt64(r1, 3);

So I've added the matching convenience methods on Rowref, like:

int64_t val = r1.getInt64(3);

They will be available in the version 1.1. Note that they are called with ".", not "->". The "." makes them called directly on the Rowref object, while "->" would have meant that the Rowref is dereferenced to a Row pointer, and then a method be called on the Row object at that pointer.

Continuing with the type methods, the constructor and destructor for the rows are also here:

Row *makeRow(FdataVec &data_) const;
void destroyRow(Row *row) const;

The makeRow() has been already discussed, and normally you never need to call destroyRow() manually, Rowref takes care of that. If you ever do the destruction manually, remember to honor the reference counts and call the destructor only after the reference count went to 0.

Another method compares the rows for absolute equality:

bool equalRows(const Row *row1, const Row *row2) const;

Right now it's defined to work only on the rows of the same type, including the same representation (but since only one CompactRowType representation is available, this is not a problem). When more representations become available, it will likely be extended. The FIFO index uses this method to find the rows by value.

The final method is provided for debugging:

void hexdumpRow(string &dest, const Row *row, const string &indent="") const;

It makes a hex dump of the internal representation of the row and appends it to the dest string. It's a very low-level method that requires the knowledge of the internal layout of a row and useful for investigation of the memory corruptions.

Saturday, August 25, 2012

Row, Rowref and RowType

A row is defined naturally with the class Row. Which is fundamentally an opaque buffer. You can't do anything with it directly other than having a pointer to it. You can't even delete a Row object using that pointer. To do anything with a Row, you have to go through that row's RowType. There are some helper classes, like CompactRow, but you don't need to concern yourself with them: they are helpers for the appropriate row types and are never used directly.

That opaque buffer is internally wired for the reference counting, of the Mtarget multithreaded variety. The rows can be passed and shared freely between the threads. No locks are needed for that (other than in the reference counter), the thread-safety is achieved by the rows being immutable. Once a row is created, it stays the same. If you need to change a row, just create a new row with the new contents. Basically, it's the same rules as in the Perl API.

The tricky part in the C++ API is that you can't simply use an Autoref<Row> for rows. As mentioned before, it won't know, how to destroy the Row when its reference counter goes to zero. Instead you use a special variety of it called Rowref, defined in type/RowType.h, and described in a previous post. To summarize, it holds a reference both to the Row (that keeps the data) and to the RowType (that knows how to work with the Row). The RowType must be correct for the Row. It's possible to combine the completely unrelated Row and RowType, and the result will be at least some garbage data, or at most a program crash. The Perl wrapper goes to great lengths to make sure that this doesn't happen. In the C++ API you're on your own. You gain the efficiency at the price of higher responsibility.

The general rule is that it's safe to combine a Row and RowType if this RowType matches the RowType used to create that row. The matching RowTypes may have different names of the fields but the same substance.

A Row is created similarly to a RowType: build a vector describing the values in the row, call the constructor, you get the row. The vector type is FdataVec, and its element type is Fdata. Both of them are top-level (i.e. Triceps::FdataVec and Triceps::Fdata), not inside some other class, and both are defined in type/RowType.h.

An Fdata describes the data for one field. It tells whether the field is not null, and if so, where to find the data to place into that field. It doesn't know anything about the field types or such. It deals with the raw bytes: the pointer to the first byte of the value, and the number of bytes. As a special case, if you want the field to be filled with zeroes, set the data pointer to NULL. It is possible to specify an incorrect number of bytes, for example create an int64 field of 3 bytes. This data will be garbage, and if it happens to be at the end of the row, might cause a crash. It's your responsibility to store the correct data. The same goes for the string fields: it's your responsibility to make sure that the data is terminated with an '\0', and that '\0' is included into the length of the data. On the other hand, the unit8[] fields don't need a '\0' at the end, all the bytes included into them are a part of the value.

The data vector gets constructed similarly to the field vector: either start with an empty vector and push pack the elements, or allocate one of the right size and set the elements. The relevant Fdata constructors and methods are:

Fdata(bool notNull, const void *data, intptr_t len);
void setPtr(bool notNull, const void *data, intptr_t len);
void setNull();

The setNull() is a shortcut of setPtr() that sets the notNull to false and ignores the other fields. In version 1.0 the default Fdata constructor leaves all the fields uninitialized. I've changed this now for version 1.1 to set notNull to false by default.

For example:

uint8_t v_uint8[10] = "123456789"; // just a convenient representation
int32_t v_int32 = 1234;
int64_t v_int64 = 0xdeadbeefc00c;
double v_float64 = 9.99e99;
char v_string[] = "hello world";

FdataVec fd1;
fd1.push_back(Fdata(true, &v_uint8, sizeof(v_uint8)-1)); // exclude \0
fd1.push_back(Fdata(true, &v_int32, sizeof(v_int32)));
fd1.push_back(Fdata(false, NULL, 0)); // a NULL field
fd1.push_back(Fdata(true, &v_float64, sizeof(v_float64)));
fd1.push_back(Fdata(true, &v_string, sizeof(v_string)));

Rowref r1(rt1, rt1->makeRow(fd1));
Rowref r2(rt1, fd1);

The Rowref constructor from Fdata vector calls the makeRow() implicitly, for convenience, so both forms provide the same result. For another example that allocates a vector and then fills it:

Rowref r2(rt1, fd1);

FdataVec fd2(3);
fd2[0].setPtr(true, &v_uint8, sizeof(v_uint8)-1); // exclude \0
fd2[1].setNull();
fd2[2].setFrom(r1.getType(), r1.get(), 2); // copy from r1 field 2

Rowref r3(rt1, fd2);

The field 2 is set by copying it from a field of another row. It sets the data pointer to the location inside the original row, and the data will be copied when the new row gets created. So make sure to not release the reference to the original row until the new row is created. The prototype is:

void setFrom(const RowType *rtype, const Row *row, int nf);

In fd2 the vector is smaller than the number of fields in the row. The rest of fields are filled with NULLs. They actually are literally filled with NULLs in fd2: if the size of the argument vector for makeRow() is smaller than the number of fields in the row type, the vector gets extended with the NULL values before anything is done with it. It's no accident that the argument of the RowType::makeRow() is not const:

class RowType {
    virtual Row *makeRow(FdataVec &data) const;
};

class Rowref {
    Rowref(const RowType *t, FdataVec &data);
    Rowref &operator=(FdataVec &data);
};

It's also possible to have more elements in the FdataVec than in the row type. In this case the extra arguments are considered the "overlays": the "main" elements set the size of the fields while the "overlays" copy the data fragments over that. It's a convenient way to assemble the array fields from the fragments, for example:

RowType::FieldVec fields4;
fields4.push_back(RowType::Field("a", Type::r_int64, RowType::Field::AR_VARIABLE));

Autoref<RowType> rt4 = new CompactRowType(fields4);
if (rt4->getErrors()->hasError())
    throw Exception(rt4->getErrors(), true);

FdataVec fd4;
Fdata fdtmp;
fd4.push_back(Fdata(true, NULL, sizeof(v_float64)*10)); // allocate space
fd4.push_back(Fdata(0, sizeof(v_int64)*2, &v_int64, sizeof(v_int64)));
// fill a temporary element with setOverride and then insert it
fdtmp.setOverride(0, sizeof(v_int64)*4, &v_int64, sizeof(v_int64));
fd4.push_back(fdtmp);
// manually copy an element from r1
fdtmp.nf_ = 0;
fdtmp.off_ = sizeof(v_int64)*5;
r1.getType()->getField(r1.get(), 2, fdtmp.data_, fdtmp.len_);
fd4.push_back(fdtmp);

Rowref r4(rt4, fd4);

This creates a row type from a single field "a" at index 0, an array of int64. The data vector fd4 has the 0th element define the space for 10 elements in the array, filled by default with zeroes. It doesn't have to zero them, it could copy the data from some location in memory. I've just done the zeroing here to show how it can be done.

The rest of elements are the "overrides" constructed in different ways.

The first one uses the override constructor:

Fdata(int nf, intptr_t off, const void *data, intptr_t len);

Here nf is the number of the field whose contents to overried, off is the byte offset in it, and data and len point to the location to copy from as usual. In this case the 2nd element (counting from 0) of the array gets set with the value from v_int64.

The second override uses the method setOverride() for the same purpose:

void setOverride(int nf, intptr_t off, const void *data, intptr_t len);

It sets a temporary Fdata which then gets appended (copied) to the vector. It sets the element of the vector at index 4 to the same value of v_int64.

The third override copies the value from the row r1. Since there is no ready method for this purpose (perhaps there should be?), it goes about its way manually, setting the fields explicitly. nf_ if the same as nf in the methods, the field number to override. off_ is the offset. And the location and length gets filled into data_ and len_ by getField(), which takes the data from the row r1, field 2.

But wait, the field 2 of r1 has been set to NULL! Should not the NULL indication be set in the copy as well? As it turns out, no. The NULL indication (the field notNull_ being set to false) is ignoredby makeRow() in the override elements. However getField() will set the length to 0, so nothing will get copied. The value at index 5 will be left as it was initially set, which happens to be 0.

So in the end the values in the field "a" array at indexes 2 and 4 will be set to the same as v_int64, and the other indexes 0..10 to 0.

If multiple overrides specify the overlapping ranges, they will just sequentially overwrite each other, and the last one will win.

If an override attempts to specify writing past the end of the originally reserved area of the field, it will be quietly ignored. Just don't do this. If the field was originally set to NULL, its reserved area will be zero bytes, so any overrides attempting to write into it will be silently ignored.

The summary is: the overrides allow to build the array values efficiently from the disjointed areas of memory, but if they are used, they have to be used with care.

Wednesday, August 8, 2012

Row reference

The Row objects are reference-counted too but Autoref can't be used with them. The reason is about the destructors.

It isn't used anywhere yet, but the rows in Triceps are designed to be able to contain references to the other rows and in general other objects. So they can't be destroyed by just freeing the memory. The destructor must know, how to release the references to these nested objects first. And knowing where these references are depends on the type of the row. And rows may be of different types. This calls for a virtual destructor.

But having a virtual destructor requires that every object has a pointer to the table of virtual functions. That adds the overhead of 8 bytes per row, and the rows are likely to be kapt by the million, and that overhead adds up. So I've made the decision to save these 8 bytes and split that knowledge. It might turn out a premature optimization, but since it's something that would be difficult to change later, I've got it in early.

The knowledge of how to destroy a row (and also how to copy the row and to access the elements in it) is kept in the row type object. So the reference to a row needs to know two things: the row and the row's type. It's still an extra 8 bytes of a pointer, but there are only a few row references active at a time (the tables don't use the common row references to keep the rows, instead they are implemented as a special case and have one single row type for all the rows they store).

The special reference class for the rows is Rowref, defined in type/RowType.h. It gets constructed as:

Rowref(const RowType *t, Row *r = NULL);
Rowref(const Rowref &ar);

It can be then assigned from rowref or from a row (keeping the row type the same):

Rowref &operator=(const Rowref &ar);
Rowref &operator=(Row *r);

Also both a new row type and a row can be assigned at the same time:

void assign(const RowType *t, Row *r);

There is also the ways to copy a row and assign a copy to this reference:

Rowref &copyRow(const RowType *rtype, const Row *row);
Rowref &copyRow(const Rowref &ar);

The other functionality is similar to Autoref(): isNull(), get(), comparisons, calling the Row methods through ->, and conversion to a Row pointer.

Friday, July 6, 2012

more updates

Some more stuff has been getting cleaned up:

$unit->setTracer($tracer);

now confesses on errors, so the problem with its error checking is fixed.

The tables now allow to create rowops of rows of all matching types, not only of the equal types. The approach with matching types was not consistent with what the labels did, so I've changed it.

The TableType now has the method

$tt->getRowType()

that has the name consistent with the tables and labels. The old method rowType() also still exists.

The Opt::ck_ref() now also accepts the subclasses of the defined classes.

The new helper Fields::isStringType() has been added.

There also have been some major addition of the examples:

I've added a pretty big example of the main loop that includes the full socket handling in the chapter on Scheduling. It's not exactly production-ready but gives some idea.

Another addition in the chapter on Scheduling is an example of a topological loop that computes the Fibonacci numbers.

Many new examples have been added to the chapter on Templates.

Saturday, February 18, 2012

Rowop creation wrappers

When writing the examples, I've got kind of tired of making all these 3-level-nested method calls to pass a set of data to a label. So now I've added a bunch of convenience wrappers for that purpose. The row-sending part from the manual example in computeAverage() now looks much simpler:

  if ($count) {
    $avg = $sum/$count;
    $uTrades->makeHashCall($lbAvgPriceHelper, &Triceps::OP_INSERT,
      symbol => $rhLast->getRow()->get("symbol"),
      id => $rhLast->getRow()->get("id"),
      price => $avg
    );
  } else {
    $uTrades->makeHashCall($lbAvgPriceHelper, &Triceps::OP_DELETE,
      symbol => $rLastMod->get("symbol"),
    );
  }

The full list of the methods added is:

$label->makeRowopHash($opcode, $fieldName => $fieldValue, ...)
$label->makeRowopArray($opcode, $fieldValue, ...)

$unit->makeHashCall($label, $opcode, $fieldName => $fieldValue, ...)
$unit->makeArrayCall($label, $opcode, $fieldValue, ...)

$unit->makeHashSchedule($label, $opcode, $fieldName => $fieldValue, ...)
$unit->makeArraySchedule($label, $opcode, $fieldValue, ...)

$unit->makeHashLoopAt($mark, $label, $opcode, $fieldName => $fieldValue, ...)
$unit->makeArrayLoopAt($mark, $label, $opcode, $fieldValue, ...)

The label methods amount to calling makeRowHash() or makeRowArray() on their row type, and then wrapping the result into makeRowop(). The unit methods call the new label methods to create the rowop and then call, schedule of loop it. There aren't similar wrappers for forking or general enqueueing because those methods are not envisioned to be used often.

In the future the convenience methods will move into the C++ code and will become not only more convenient but also more efficient.

Wednesday, December 28, 2011

Rows

The rows in Triceps always belong to some row type, and are always immutable. Once a row is created, it can not be changed. This allows it to be referenced from multiple places, instead of copying the whole row value. Naturally, a row may be passed and shared between multiple threads.

The row type provides the constructor methods for the rows:

$row = $rowType->makeRowArray(fieldValue1, ..., fieldValueN);
$row = $rowType->makeRowHash(fieldName => fieldValue, ...);

Here $row is a reference to the resulting row. As usual, in case of error it will be left as undef, with the error message in $! (to be changed later to dying on error).

In the array form, the values for the fields go in the same order as they are specified in the row type (if there are too few values, the rest will be considered NULL, having too many values is an error).

The Perl value of undef is treated as NULL.

In the hash form, the fields are specified as name-value pairs. If the same field is specified multiple times, the last value will overwrite all the previous ones. The unspecified fields will be left as NULL. Again, the arguments of the function actually are an array, but if you pass a hash, its contents will be converted to an array on the calling stack.

If the performance is important, the array form is more efficient, since the hash form has to translate internally the field names to indexes.

The row itself and its type don't know anything about any keys and such. So any fields may be left as NULL.

Some examples:

$row  = $rowType->makeRowArray(@fields) or die "$!";
$row  = $rowType->makeRowArray($a, $b, $c) or die "$!";
$row  = $rowType->makeRowHash(%fields) or die "$!";
$row  = $rowType->makeRowHash(a => $a, b => $b) or die "$!";

The usual Perl conversions are applied to the values. So for example, if you pass an integer 1 for a string field, it will be converted to the string "1". Or if you pass a string "" for an integer field, it will be converted to 0.

If a field is an array (as always, except for uint8[] which is represented as a Perl string), its value is a Perl array reference (or undef). For example:

$rt1 = Triceps::RowType->new(
  a => "uint8[]",
  b => "int32[]",
) or die "$!";
$row = $rt1->makeRowArray("abcd", [1, 2, 3]) or die "$!";

An empty array will become a NULL value. So the following two are equivalent:

$row = $rt1->makeRowArray("abcd", []) or die "$!";
$row = $rt1->makeRowArray("abcd", undef) or die "$!";

Remember that an array field may not contain NULL values. So any undefs in the array fields will be silently converted to zeroes. The following two are equivalent:

$row = $rt1->makeRowArray("abcd", [undef, undef]) or die "$!";
$row = $rt1->makeRowArray("abcd", [0, 0]) or die "$!";

The row also provides a way to copy itself, modifying the values of selected fields:

$row2 = $row1->copymod(fieldName => fieldValue, ...);

The fields that are not explicitly specified will be left unchanged. Since the rows are immutable, this is the closest thing to the field assignment. copymod() is generally more efficient than extracting the row into an array or hash, replacing a few of them with new values and constructing a new row. It bypasses a the binary-to-Perl-to-binary conversions for the unchanged fields.

The row knows its type. It can be obtained as

$row->getType()

Note that this will create a new Perl wrapper to the underlying type object. So if you do:

$rt1 = ...;
$row = $rt1->makeRow...;
$rt2 = $row->getType();

then $rt1 will not be equal to $rt2 by direct Perl comparison ($rt1 != $rt2). However both $rt1 and $rt2 will refer to the same row type object, so $rt1->same($rt2) will be true.

The row references can also be compared for sameness:

$row1->same($row2)

The row contents can be extracted back into Perl representation as

@adata = $row->toArray();
%hdata = $row->toHash();

Again, the NULL fields will become undefs, and the array fields (unless they are NULL) will become Perl array references. Since the empty array fields are equivalent to NULL array fields, on extraction back they will be treated the same as NULL fields, and become undefs.

There is also a convenience function to get one field from a row at a time by name:

$value = $row->get("fieldName");

If you need to access only a few fields from a big row, get() is more efficient (and easier to write) that extracting the whole row with toHash() or even with toArray(). But don't forget that every time you call get(), it creates a new Perl value, which may be pretty involved if the value is an array. So the most efficient way then for the values that get reused is to call get() , remember the result in a Perl variable, and then reuse that variable.

There is also a way to conveniently print a rows contents, usually for the debugging purposes:

$result = $row->printP();

The name printP is an artifact of implementation: it shows that this method is implemented in Perl and uses the default Perl conversions of values to strings. The uint8[] arrays are printed directly as strings. The result is a sequence of name="value" or name=["value", "value", "value"] for all the non-NULL fields. The backslashes and double quotes inside the values are escaped by backslashes in Perl style. For example, reusing the row type above,

$row = $rt1->makeRowArray('ab\ "cd"', [0, 0]) or die "$!";
print $row->printP(), "\n";

will produce

a="ab\\ \"cd\"" b=["0", "0"]

Finally, there is a deep debugging method:

$result = $row->hexdump()

That dumps the raw bytes of the row's binary format, and is useful only to debug the more weird issues.

Friday, December 23, 2011

Hello, world!

Let's finally get to business: write the "Hello, world!" program with Triceps. Since Triceps is an embeddable library, naturally, the smallest "Hello, world!" program would be in the host language without Triceps, but it would not be interesting. So here is the a bit contrived but more interesting Perl program that passes some data through the Triceps machinery:

use Triceps;

$hwunit = Triceps::Unit->new("hwunit") or die "$!";
$hw_rt = Triceps::RowType->new(
  greeting => "string",
  address => "string",
) or die "$!";

my $print_greeting = $hwunit->makeLabel($hw_rt, "print_greeting", undef, 
  sub {
    my ($label, $rowop) = @_;
    printf "%s!\n", join(', ', $rowop->getRow()->toArray());
  } 
) or die "$!";

$hwunit->call($print_greeting->makeRowop(&Triceps::OP_INSERT,
  $hw_rt->makeRowHash(
    greeting => "Hello",
    address => "world",
  ) 
)) or die "$!";

What happens there? First, we import the Triceps module. Then we create a Triceps execution unit. An execution unit keeps the Triceps context and controls the execution for one thread. Nothing really stops you from having multiple execution units in the same thread, however there is not a whole lot of benefit in it either. But a single execution unit must never ever be used in multiple threads. It's single-threaded by design and has no synchronization in it. The argument of the constructor is the name of the unit, that can be used in printing messages about it. It doesn't have to be the same as the name of the variable that keeps the reference to the unit, but it's a convenient convention to make the debugging easier.

If something goes wrong, the constructor will return an undef and set the error message in $!. This actually has turned out to be not such a good idea as it seemed, since writing "or die" at every line quickly turns tedious. And there is usually not much point in catching the errors of this type, since they are essentially the compilation errors and should cause the program to die anyway. So, this will be soon changed throughought the code to just die with the message (and if it needs to be caught, it can be caught with eval).

The next statement creates the type for rows. For the simplest example, one row type is enough. It contains two string fields. A row type does not belong to an execution unit. It may be used in parallel by multiple threads. Once a row type is created, it's immutable, and that's the story for pretty much all the Triceps objects that can be shared between multiple threads: they are created, they become immutable, and then they can be shared. (Of course, the containers that facilitate the passing of data between the threads would have to be an exception to this rule).

Then we create a label. If you look at the "SQLy vs procedural" example a little while back, you'll see that the labels are analogs of streams in Coral8. And that's what they are in Triceps. Of course, now, in the days of the structured programming, we don't create labels for GOTOs all over the place. But we still use labels. The function names are labels, the loops in Perl may have labels. So a Triceps label can often be seen kind of like a function definition, but so far only kind of. It takes a data row as a parameter and does something with it. But unlike a proper function it has no way to return the processed data back to the caller. It has to either pass the processed data to other labels or collect it in some hardcoded data structure, from which the caller can later extract it back. This means that until this gets worked out better, a Triceps label is still much more like a GOTO label or Coral8 stream than a proper function. Just like the unit, a label may be used in only one thread.

A label takes a row type for the rows it accepts, a name (again, purely for the ease of debugging) and a reference to a Perl function that will be processing the data. Extra arguments for the function can be specified as well, but there is no use for them in this example.

Here it's a simple unnamed function. Though of course a reference to a named function can be used instead, and the same function may be reused for multiple labels. Whenever the label gets a row operation to process, its function gets called with the reference to the label object, the row operation object, and whatever extra arguments were specified at the label creation (none in this example). The example just prints a message combined from the data in the row.

Note that a label doesn't just get a row. It gets a row operation ("rowop" as it's called throughout the code). It's an important distinction. A row just stores some data. As the row gets passed around, it gets referenced and unreferenced, but it just stays the same until the last reference to it disappears, and then it gets destroyed. It doesn't know what happens with the data, it just stores them. A row may be shared between multiple threads. On the other hand, a row operation says "take these data and do a such and such thing with them". A row operation is a combination of a row of data, an operation code, and a label that is to execute the operation. It is confined to a single thread. Inside this thread a reference to a row operation may be kept and reused again and again, since the row operation object is also immutable.

Triceps has the explicit operation codes, very much like Aleri (only Aleri doesn't differentiate between a row and row operation, every row there has an opcode in it, and the Sybase CEP R5 does the same). It might be just my background, but let me tell you: the CEP systems without the explicit opcodes are a pain. The visible opcodes make life a lot easier. However unlike Aleri, there is no UPDATE opcode. The available opcodes are INSERT, DELETE and NOP (no-operation). If you want to update something, you send two operations: first DELETE for the old value, then INSERT for the new value. There will be a section later with more details and comparisons, but for now that's enough information.

For this simple example, the opcode doesn't really matter, so the label function quietly ignores it. It gets the row from the row operation and extracts the data from it into the Perl format, then prints them. There are two Perl formats supported: an array and a hash. In the array format, the array contains the values of the fields in the order they are defined in the row type. The hash format consists of name-value pairs, which may be stored either in an actual hash or in an array. The conversion from row to a hash actually returns an array of values which becomes a hash if it gets stored into a hash variable.

As a side note, this also suggests, how the systems without explicit opcodes came to be: they've been initially built on the simple stateless examples. And when the more complex examples have turned up, they've been aready stuck on this path, and could not afford too deep a retrofit.

The final part of the example is the creation of a row operation for our label, with an INSERT opcode and a row created from hash-formatted Perl data, and calling it through the execution unit. The row type provides a method to construct the rows, and the label provides a method to construct the row operations for it. The call() method of the execution unit does exactly what its name implies: it evaluates the label function right now, and returns after all its processing its done.

Sergey Babkin on CEP and stuff