Sergey Babkin on CEP and stuff: row

Saturday, July 9, 2022

constants in Perl

When writing Triceps code, I've been whining that it would be great to have symbolic constants in Perl: this would allow identifying the fields of Triceps records by symbolic names that translate at compile time into integers. Well, it turns out that there are constants in Perl and have been for more than a decade now. They were already there when I started writing Triceps, I just wasn't aware of them. They're used like this:

use constant X => 1;
use constant Y => 2;

$a[X] = $a[Y] + 1;

They're referenced by names without any prefix, and can be scoped in packages and exported from them like everything else. Duh.

Tuesday, July 2, 2013

Facet reference, C++

The general functioning of a facet is the same in C++ as in Perl, so please refer to part 1 of the Perl description for this information.

However the construction of a Facet is different in C++. The import is still the same: you call TrieadOwner::importNexus() method or one of its varieties and it returns a Facet. However the export is different: first you construct a Facet from scratch and then give it to TrieadOwner::exportNexus() to create a Nexus from it.

In the C++ API the Facet has a notion of being imported, very much like say the FnReturn has a notion of being initialized. When a Facet is first constructed, it's not imported. Then exportNexus() creates the nexus from the facet, exports it into the App, and also imports the nexus information back into the facet, marking the facet as imported. It returns back a reference to the exact same Facet object, only now that object becomes imported. Obviously, you can use either the original or returned reference, they point to the same object. Once a Facet has been imported, it can not be modified any more. The Facet object returned by the importNexus() is also marked as imported, so it can not be modified either. And also an imported facet can not be exported again.

However exportNexus() has an exception. If the export is done with the argument import set to false, the facet object will be left unchanged and not marked as imported. A facet that is not marked as imported can not be used to send or receive data. Theoretically, it can be used to export another nexus, but practically this would not work because that would be an attempt to export another nexus with the same name from the same thread. In reality such a Facet can only be thrown away, and there is not much use for it. You can read its components and use them to construct another Facet but that's about it.

It might be more convenient to use the TrieadOwner::makeNexus*() methods to build a Facet object rather than building it directly. In either case, the methods are the same and accept the same arguments, just the Facet methods return a Facet pointer while the nexus maker methods return a NexusMaker pointer.

The Facet class is defined in app/Facet.h. It inherits from Mtarget for an obscure reason that has to do with App topology analysis but it's intended to be used from one thread only.

You don't have to keep your own references to all your Facets. The TrieadOwner will keep a reference to all the imported Facets, and they will not be destroyed while the TrieadOwner exists (and this applies to Perl as well).

    enum {
        DEFAULT_QUEUE_LIMIT = 500,
    };

The default value used for the nexus queue limit (as a count of Xtrays, not rowops). Since the reading from the nexus involves double-buffering, the real queue size might grow up to twice that amount.

static string buildFullName(const string &tname, const string &nxname);

Build the full nexus name from its components.

Facet(Onceref<FnReturn> fret, bool writer);

Create a Facet, initially non-imported (and non-exported). The FnReturn object fret defines the set of labels in the facet (and nexus), and the name of FnReturn also becomes the name of the Facet, and of the Nexus. The writer flag determines whether this facet will become a writer (if true) or a reader (if false) when a nexus is created from it. If the nexus gets created without importing the facet back, the writer flag doesn't matter and can be set either way.

The FnReturn should generally be not initialized yet. The Facet constructor will check if FnReturn already has the labels _BEGIN_ and _END_ defined, and if either is missing, will add it to the FnReturn, then initialize it. If both _BEGIN_ and _END_ are already present, then the FnReturn can be used even if it's already initialized. But if any of them is missing, FnReturn must be not initialized yet, otherwise the Facet constructor will fail to add these labels.

The same FnReturn object may be used to create only one Facet object. And no, you can not import a Facet, get an FnReturn from it, then use it to create another Facet.

If anything goes wrong, the constructor will not throw but will remember the error, and later the exportNexus() will find it and throw an Exception from it.

static Facet *make(Onceref<FnReturn> fret, bool writer);

Same as the constructor, used for the more convenient operator priority for the chained calls.

static Facet *makeReader(Onceref<FnReturn> fret);
static Facet *makeWriter(Onceref<FnReturn> fret);

Syntactic sugar around the constructor, hardcoding the writer flag.

Normally the facets are constructed and exported with the chained calls, like:

Autoref<Facet> myfacet = to->exportNexus(
   Facet::makeWriter(FnReturn::make("My")->...)
   ->setReverse()
   ->exportTableType(Table::make(...)->...)
);

Because of this, the methods that are used for post-construction return the pointer to the original Facet object. They also almost never throw the Exceptions, to prevent the memory leaks through the orphaned Facet objects. The only way an Exception might get thrown is on an attempt to use these methods on an already imported Facet. Any errors get collected, and eventually exportNexus() will find them and properly throw an Exception, making sure that the Facet object gets properly disposed of.
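
The pattern itself (chained setters that remember the errors, and an export step that takes ownership and only then throws) can be shown in isolation. The sketch below is not the Triceps code: ToyFacet and toyExport() are hypothetical stand-ins, and the rule that a queue limit must be positive is an assumption made up just to have something to fail on.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a Facet: the chained setters never throw on
// bad input, they only remember the error for later.
class ToyFacet {
public:
    static ToyFacet *make(const std::string &name) { return new ToyFacet(name); }

    ToyFacet *setQueueLimit(int limit) {
        if (limit <= 0) // assumed validation rule, for illustration only
            errors_.push_back("queue limit must be positive");
        else
            limit_ = limit;
        return this; // returning the same object allows the chaining
    }

    ToyFacet *setReverse(bool on = true) { reverse_ = on; return this; }

    const std::vector<std::string> &errors() const { return errors_; }
    int limit() const { return limit_; }
    bool reverse() const { return reverse_; }

private:
    explicit ToyFacet(const std::string &name) : name_(name) {}

    std::string name_;
    int limit_ = 500;
    bool reverse_ = false;
    std::vector<std::string> errors_;
};

// Hypothetical stand-in for exportNexus(): it takes the ownership first,
// so the object gets freed even when the collected errors cause a throw.
inline std::shared_ptr<ToyFacet> toyExport(ToyFacet *raw) {
    std::shared_ptr<ToyFacet> owned(raw);
    if (!owned->errors().empty())
        throw std::runtime_error(owned->errors().front());
    return owned;
}
```

If a setter in the middle of the chain threw instead, the object created by make() would have no owner yet and would leak, which is the reason for deferring the errors to the export.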

Facet *exportRowType(const string &name, Onceref<RowType> rtype);

Add a row type to the Facet. May throw an Exception if the facet is already imported. On other errors remembers them to be thrown on an export attempt.

Facet *exportTableType(const string &name, Autoref<TableType> tt);

Add a table type to the Facet. May throw an Exception if the facet is already imported. On other errors remembers them to be thrown on an export attempt. The table type must also be deep-copyable and contain no errors. Not sure if I described this before, but if the deep copy can not proceed (say, a table type involves a Perl sort condition with a direct reference to the compiled Perl code) the deepCopy() method must still return a newly created object but remember the error inside it. Later when the table type is initialized, that object's initialization must return this error. The exportTableType() does a deep copy, then initializes the copied table type. If this detects any errors, they get remembered and cause an Exception later in exportNexus().

Facet *setReverse(bool on = true);

Set (or clear) the nexus reverse flag. May throw an Exception if the facet is already imported.

Facet *setQueueLimit(int limit);

Set the nexus queue limit. May throw an Exception if the facet is already imported.

Erref getErrors() const;

Get the collected errors, so that they can be found without an export attempt.

bool isImported() const;

Check whether this facet is imported.

The rest of the methods are the same as in Perl. They can be used even if the facet is not imported.

bool isWriter() const;

Check whether this is a writer facet (or, if it returns false, a reader facet).

bool isReverse() const;

Check whether the underlying nexus is reverse.

int queueLimit() const;

Get the queue size limit of the nexus. Until the facet is exported, this will always return the last value set by setQueueLimit(). However if the nexus is reverse, on import the value will be changed to a very large integer value, currently INT32_MAX, and on all the following calls this value will be returned. Technically speaking, the queue size of the reverse nexuses is not unlimited, it's just very large, but in practice it amounts to the same thing.

FnReturn *getFnReturn() const;

Get the FnReturn object. If you plan to destroy the Facet object soon after this method is called, make sure that you put the FnReturn pointer into an Autoref first.

const string &getShortName() const;

Get the short name, AKA "as-name", which is the same as the FnReturn's name.

const string &getFullName() const;

Get the full name of the nexus imported through this facet. If the facet is not imported, will return an empty string.

typedef map<string, Autoref<RowType> > RowTypeMap;
const RowTypeMap &rowTypes() const;

Get the map of the defined row types. Returns the reference to the Facet's internal map object.

typedef map<string, Autoref<TableType> > TableTypeMap;
const TableTypeMap &tableTypes() const;

Get the map of defined table types. Returns the reference to the Facet's internal map object.

RowType *impRowType(const string &name) const;

Find a single row type by name. If the name is not known, returns NULL.

TableType *impTableType(const string &name) const;

Find a single table type by name. If the name is not known, returns NULL.

Nexus *nexus() const;

Get the nexus of this facet. If the facet is not imported, returns NULL.

int beginIdx() const;
int endIdx() const;

Return the indexes (as in "integer offset") of the _BEGIN_ and _END_ labels in FnReturn.

bool flushWriter();

Flush the collected rowops into the nexus as a single Xtray. If there is no data collected, does nothing. Returns true on a successful flush (even if there was no data collected), false if the Triead was requested to die and thus all the data gets thrown away.

Tuesday, June 11, 2013

options passing through

I've already shown it in the examples, but here is also the official description: you can accept the arbitrary options, typically if your function is a wrapper to another function, and you just want to process a few options and let the others through. The Triead::start() is a good example, passing the options through to the main function of the thread.

You specify the acceptance of the arbitrary options by using "*" in the Opt::parse() arguments. For example:

&Triceps::Opt::parse($myname, $opts, {
    app => [ undef, \&Triceps::Opt::ck_mandatory ],
    thread => [ undef, \&Triceps::Opt::ck_mandatory ],
    fragment => [ "", undef ],
    main => [ undef, sub { &Triceps::Opt::ck_ref(@_, "CODE") } ],
    '*' => [],
}, @_);

The specification array for "*" is empty. The unknown options will be collected in the array referred to from $opts->{'*'}, that is @{$opts->{'*'}}.

From there on your wrapper has the choice of either passing through all the options to the wrapped function, using @_, or explicitly specifying a few options and passing through the rest from @{$opts->{'*'}}.

There is also the third possibility: filter out only some of the incoming options. This can be done with Opt::drop(). For example, Triead::startHere() works like this:

our @startOpts = (
app => [ undef, \&Triceps::Opt::ck_mandatory ],
thread => [ undef, \&Triceps::Opt::ck_mandatory ],
fragment => [ "", undef ],
main => [ undef, sub { &Triceps::Opt::ck_ref(@_, "CODE") } ],
);

sub startHere # (@opts)
{
my $myname = "Triceps::Triead::start";
my $opts = {};
my @myOpts = ( # options that don't propagate through
    harvest => [ 1, undef ],
    makeApp => [ 1, undef ],
);

&Triceps::Opt::parse($myname, $opts, {
    @startOpts,
    @myOpts,
    '*' => [],
}, @_);

my @args = &Triceps::Opt::drop({
    @myOpts
}, \@_);
@_ = (); # workaround for threads leaking objects

# no need to declare the Triead, since all the code executes synchronously anyway
my $app;
if ($opts->{makeApp}) {
    $app = &Triceps::App::make($opts->{app});
} else {
    $app = &Triceps::App::resolve($opts->{app});
}
my $owner = Triceps::TrieadOwner->new(undef, undef, $app, $opts->{thread}, $opts->{fragment});
push(@args, "owner", $owner);
eval { &{$opts->{main}}(@args) };
...

The @startOpts are both used by startHere() and passed through. The @myOpts are only used in startHere() and do not pass through. And the rest of the options pass through without being used in startHere(). So the options from @myOpts get dropped from @_, and the result goes to the main function of the thread.

The Opt::drop() takes the specification of the options to drop as a hash reference, the same as Opt::parse(). The values in the hash are not important in this case, only the keys are used. But it's simpler to store the same specification of the options and reuse it for both parse() and drop() than to write it twice.

There is also an opposite function, Opt::dropExcept(). It passes through only the listed options and drops the rest. It can come in handy if your wrapper wants to pass different subsets of its incoming options to multiple functions.

The functions drop() and dropExcept() can really be used on any name-value arrays, not just the options as such. And the same goes for the Fields::filter() and friends. So you can use them interchangeably: you can use Opt::drop() on the row type specifications and Fields::filter() on the options if you feel that it makes your code simpler.
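
Since the filtering works on any name-value list, the idea is easy to show outside of Perl too. Here is a minimal C++ sketch of the two operations. The names NameValues, dropNamed() and keepNamed() are made up for this illustration and are not a part of Triceps, and the values are simplified to strings.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// A flat list of name-value pairs, standing in for a Perl option array.
using NameValues = std::vector<std::pair<std::string, std::string>>;

// Keep every pair whose name is NOT listed (the idea behind Opt::drop()).
inline NameValues dropNamed(const std::set<std::string> &names, const NameValues &in) {
    NameValues out;
    for (const auto &nv : in)
        if (names.count(nv.first) == 0)
            out.push_back(nv);
    return out;
}

// Keep only the pairs whose name IS listed (the idea behind Opt::dropExcept()).
inline NameValues keepNamed(const std::set<std::string> &names, const NameValues &in) {
    NameValues out;
    for (const auto &nv : in)
        if (names.count(nv.first) != 0)
            out.push_back(nv);
    return out;
}
```

Both functions preserve the order of the surviving pairs, which matters when the result gets passed on as an argument list.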

Monday, June 10, 2013

checking a row for emptiness

I've mentioned that the rowops with the empty rows (i.e. rows with all fields NULL) on the _BEGIN_ and _END_ labels in the facets are treated specially. The same check is also available as a directly callable method, in case you have some other uses for it.

In Perl it is:

$result = $row->isEmpty();

It returns 1 if all the fields are NULL and 0 otherwise.

In C++ this is done as a method of the RowType class:

virtual bool RowType::isRowEmpty(const Row *row) const;

And is used like this:

bool res = type->isRowEmpty(row);

It's also available as a convenience wrapper method on the Rowref:

Rowref r1(...);
if (r1.isRowEmpty()) { ... }
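
The meaning of the check can be spelled out on a toy model. This is only a sketch of the "all fields are NULL" rule, not the real row layout: ToyField and toyIsRowEmpty() are hypothetical names, and a field here is just a NULL flag plus a string.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical field: either NULL or some bytes of data.
struct ToyField {
    bool isNull;
    std::string data;
};

// A row counts as empty when every one of its fields is NULL.
// In this sketch a non-NULL field with empty data makes the row non-empty.
inline bool toyIsRowEmpty(const std::vector<ToyField> &row) {
    return std::all_of(row.begin(), row.end(),
        [](const ToyField &f) { return f.isNull; });
}
```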

Monday, June 3, 2013

the many ways to do a copy

So far the way to copy an index type or table type was with the method copy(), in both Perl and C++. It copies the items, and as needed its components, but tries to share the objects that can be shared (such as the row types). But the multithreading support required more kinds of copying.

The Perl method TableType::copyFundamental() copies a table type with only a limited subset of its index types, and excludes all the aggregators. It is implemented in Perl, and if you look in lib/Triceps/TableType.pm, you can find that it starts by making a new table type with the same row type, and then one by one adds the needed index types to it. It needs index types without any aggregators or nested index types, and thus there is a special method for doing this kind of copies:

$idxtype2 = $idxtype->flatCopy();

The "flat" means exactly what it looks like: copy just the object itself without any connected hierarchies.

On the C++ level there is no such method, instead there is an optional argument to IndexType::copy():

virtual IndexType *copy(bool flat = false) const;

So if you want a flat copy, you call

idxtype2 = idxtype->copy(true);

There is no copyFundamental() on the C++ level, though it probably should be, and should be added in the future. For now, if you really want it, you can make it yourself by copying the logic from Perl.

In the implementation of the index types, this argument "flat" mostly just propagates to the base class, without a whole lot needed from the subclass. For example, this is how it works in the SortedIndexType:

IndexType *SortedIndexType::copy(bool flat) const
{
    return new SortedIndexType(*this, flat);
}

SortedIndexType::SortedIndexType(const SortedIndexType &orig, bool flat) :
    TreeIndexType(orig, flat),
    sc_(orig.sc_->copy())
{ }

The base class TreeIndexType takes care of everything, all the subclass needs to do is carry the flat argument to it.
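
To show what the flag changes, here is a self-contained toy version of the same arrangement. ToyIndexType is a hypothetical class for illustration, not the Triceps IndexType; it has only a name and a list of nested index types.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical index type: a name plus the nested index types.
class ToyIndexType {
public:
    explicit ToyIndexType(const std::string &name) : name_(name) {}

    // Copy constructor with the flat flag: the nested types get copied
    // (recursively, never flat) only when flat is false.
    ToyIndexType(const ToyIndexType &orig, bool flat) : name_(orig.name_) {
        if (!flat)
            for (const auto &sub : orig.nested_)
                nested_.push_back(std::make_shared<ToyIndexType>(*sub, false));
    }

    // The caller owns the returned object.
    ToyIndexType *copy(bool flat = false) const { return new ToyIndexType(*this, flat); }

    ToyIndexType *addNested(std::shared_ptr<ToyIndexType> sub) {
        nested_.push_back(std::move(sub));
        return this;
    }

    std::size_t nestedCount() const { return nested_.size(); }
    const std::string &name() const { return name_; }

private:
    std::string name_;
    std::vector<std::shared_ptr<ToyIndexType>> nested_;
};
```

With one nested index added, copy() produces an object that has its own copy of the nested index, while copy(true) produces an object with the same name and nothing under it.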

The next kind of copying is exactly the opposite: it copies the whole table type or index type, including all the objects involved, including the row types or such. It is used for passing the table types through the nexus.

Remember that all these objects are reference-counted. Whenever a Row object is created or deleted in Perl, it increases or decreases the reference count of the RowType. If the same RowType object is shared between multiple threads, they will have a contention for this atomic counter, or at the very least will be shuttling the cache line with it back and forth between the CPUs. It's more efficient to give each thread its own copy of a RowType, and then it can stay stuck in one CPU's cache.

So wherever a table type is exported into a nexus, it's deep-copied (and when a row type is exported into a nexus, it's simply copied, and so are the row types of the labels in the nexus). Then when a nexus is imported, the types are again deep-copied into the thread's facet.

But there is one more catch. Suppose we have a label and a table type that use the same row type. Whenever a row coming from a label is inserted into the table (in Perl), the row types of the row (i.e. the label's row type in this case) and of the table are checked for the match. If the row type is the same, this check is very quick. But if the same row type gets copied and then different copies are used for the label and for the table, the check will have to go and actually compare the contents of these row types, and will be much slower. To prevent this slowness, the deep copy has to be smart: it must be able to copy a bunch of things while preserving the identity of the underlying row types. If it's given a label and then a table type, both referring to the same row type, it will copy the row type from the label, but then when copying the table type it will realize that it had already seen and copied this row type, and will reuse the first copy. And the same applies even within a table: it may have multiple references to the same row type from the aggregators, and will be smart enough to figure out if they are the same, and copy the same row type once.

This smartness stays mostly undercover in Perl. When you import a facet, it will do all the proper copying, sharing the row types. (Though of course if you export the same row type through two separate nexuses, and then import them both into another thread, these facets will not share the types between them any more). There is a method TableType::deepCopy() but it was mostly intended for testing and it's self-contained: it will copy one table type with the correct row type sharing inside it but it won't do the sharing between two table types.

All the interesting uses of the deepCopy() are at the C++ level. It's used all over the place: for the TableType, IndexType, AggregatorType, SortedIndexCondition and RowSetType (the type of FnReturn and FnBinding). If you decide to create your own subclass of these classes, you need to implement the deepCopy() (and of course the normal copy()) for it.

Its prototype generally looks like this (substitute the correct return type as needed):

virtual IndexType *deepCopy(HoldRowTypes *holder) const;

The HoldRowTypes object is what takes care of sharing the underlying row types. To copy a bunch of objects with sharing, you create a HoldRowTypes, copy the bunch, destroy the HoldRowTypes.

For example, the Facet copies the table types from a Nexus like this:

Autoref<HoldRowTypes> holder = new HoldRowTypes;

for (TableTypeMap::iterator it = nx->tableTypes_.begin(); it != nx->tableTypes_.end(); ++it)
    tableTypes_[it->first] = it->second->deepCopy(holder);

It also uses the same holder to copy the exported row types and the labels. A row type gets copied through a holder like this:

Autoref <RowType> rt2 = holder->copy(rt);

For all the other purposes, HoldRowTypes is an opaque object.

A special feature is that you can pass the holder pointer as NULL, and the deep copy will still work, only it obviously won't be able to share the underlying row types. So if you don't care about sharing, you can always use NULL as an argument. It even works for the direct copying:

HoldRowTypes *holder = NULL;
Autoref <RowType> rt2 = holder->copy(rt);

Its method copy() is smart enough to recognize that this is NULL and do the correct thing.
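
The mechanics of a holder are simple enough to sketch: remember which originals have already been copied, and hand out the same copy every time. The code below is an illustration under that assumption, not the actual HoldRowTypes implementation; ToyRowType, ToyHolder and holdCopy() are hypothetical names. One deliberate difference: calling a method through a NULL pointer is formally undefined behavior in standard C++, so this sketch handles the NULL holder in a free function instead of checking this inside the method.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical row type: just a list of field names.
struct ToyRowType {
    std::vector<std::string> fields;
};

// Remembers which originals were already copied, so that each original
// maps to exactly one copy for the lifetime of the holder.
class ToyHolder {
public:
    std::shared_ptr<ToyRowType> copy(const std::shared_ptr<ToyRowType> &orig) {
        if (!orig)
            return nullptr;
        auto it = seen_.find(orig.get());
        if (it != seen_.end())
            return it->second; // already copied, reuse the first copy
        std::shared_ptr<ToyRowType> dup = std::make_shared<ToyRowType>(*orig);
        seen_[orig.get()] = dup;
        return dup;
    }

private:
    std::map<const ToyRowType *, std::shared_ptr<ToyRowType>> seen_;
};

// Copy through a holder that may be NULL: without a holder every call
// makes a fresh, unshared copy.
inline std::shared_ptr<ToyRowType> holdCopy(ToyHolder *holder,
        const std::shared_ptr<ToyRowType> &orig) {
    if (holder != nullptr)
        return holder->copy(orig);
    if (!orig)
        return nullptr;
    return std::make_shared<ToyRowType>(*orig);
}
```

Copying the same row type twice through one holder returns the same object both times, while two copies made with a NULL holder are two different objects with equal contents.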

In the classes, the deepCopy() is typically implemented like this:

IndexType *SortedIndexType::deepCopy(HoldRowTypes *holder) const
{
    return new SortedIndexType(*this, holder);
}

SortedIndexType::SortedIndexType(const SortedIndexType &orig, HoldRowTypes *holder) :
    TreeIndexType(orig, holder),
    sc_(orig.sc_->deepCopy(holder))
{ }

The wrapper passes the call to the deep-copy constructor with a holder, which in turn propagates the deep-copying to all the components using their constructors with a holder. Of course, if some component doesn't have any RowType references in it, it doesn't need a constructor with a holder, and can be copied without it. But again, the idea of the deepCopy() is to copy as deep as it goes, without sharing any references with the original.

Friday, April 19, 2013

Object passing between threads, and Perl code snippets

A limitation of the Perl threads is that no variables can be shared between them. When a new thread gets created, it gets a copy of all the variables of the parent. Well, of all the plain Perl variables. With the XS extensions your luck may vary: the variables might get copied, might become undef, or just become broken (if the XS module is not threads-aware). Copying the XS variables requires quite a high overhead at all the other times, so Triceps doesn't do it and all the Triceps objects become undefined in the new thread.

However there is a way to pass around certain objects through the Nexuses.

First, obviously, the Nexuses are intended to pass through the Rowops. These Rowops coming out of a nexus are not the same Rowop objects that went in. Rowop is a single-threaded object and can not be shared by two threads. Instead it gets converted to an internal form while in the nexus, and gets re-created, pointing to the same Row object and to the correct Label in the local facet.

Then, again obviously, the Facets get imported through the Nexus, together with their row types.

And two more types of objects can be exported through a Nexus: the RowTypes and TableTypes. They get exported through the options as in this example:

$fa = $owner->makeNexus(
    name => "nx1",
    labels => [
        one => $rt1,
        two => $lb,
    ],
    rowTypes => [
        one => $rt2,
        two => $rt1,
    ],
    tableTypes => [
        one => $tt1,
        two => $tt2,
    ],
    import => "writer",
);

As you can see, the namespaces for the labels, row types and table types are completely independent, and the same names can be reused in each of them with a different meaning. All the three sections are optional, so if you want, you can export only the types in the nexus, without any labels.

They can then be extracted from the imported facet as:

$rt1 = $fa->impRowType("one");
$tt1 = $fa->impTableType("one");

Or the whole set of name-value pairs can be obtained with:

@rtset = $fa->impRowTypesHash();
@ttset = $fa->impTableTypesHash();

The exact table types and row types (by themselves or in the table types or labels) in the importing thread will be copied. It's technically possible to share the references to the same row type in the C++ code but it's more efficient to make a separate copy for each thread, and thus the Perl API goes along the more efficient way.

The import is smart in the sense that it preserves the sameness of the row types: if in the exporting thread the same row type was referred from multiple places in the labels, row types and table types sections, in the imported facet that would again be the same row type (even though of course not the one that has been exported but its copy). This again helps with the efficiency when various objects decide if the rows created by this and that type are compatible.

This is all well until you want to export a table type that has an index with a Perl sort condition in it, or an aggregator with the Perl code. The Perl code objects are tricky: they get copied OK when a new thread is created but the attempts to import them through a nexus later cause a terrible memory corruption. So Triceps doesn't allow exporting the table types with the function references in them. But it provides an alternative solution: the code snippets can be specified as the source code. It gets compiled when the table type gets initialized. When a table type gets imported through a nexus, it brings the source code with it. The imported table types are always uninitialized, so at initialization time the source code gets compiled in the new thread and works.

It all works transparently: just specify a string instead of a function reference when creating the index, and it will be recognized and processed. For example:

$it= Triceps::IndexType->newPerlSorted("b_c", undef, '
    my $res = ($_[0]->get("b") <=> $_[1]->get("b")
        || $_[0]->get("c") <=> $_[1]->get("c"));
    return $res;
    '
);

Before the code gets compiled, it gets wrapped into a 'sub { ... }', so don't write your own sub in the code string, that would be an error.

There is also the issue of arguments that can be specified for these functions. Triceps is now smart enough to handle the arguments that are one of:

undef
integer
floating-point
string
Triceps::RowType object
Triceps::Row object
reference to an array or hash thereof

It converts the data to an internal C++ representation in the nexus and then converts it back on import. So, if a TableType has all the code in it in the source form, and the arguments for this code are within the limits of this format, it can be exported through the nexus. Otherwise an attempt to export it will fail.

I've modified the SimpleOrderedIndex to use the source code format, and it will pass through the nexuses as well.

The Aggregators have a similar problem, and I'm working on converting them to the source code format too.

A little more about the differences between the code references and the source code format:

When you compile a function, it carries with it the lexical context. So you can make the closures that refer to the "my" variables in their lexical scope. With the source code you can't do this. The table type compiles them at initialization time in the context of the main package, and that's all they can see. Remember also that the global variables are not shared between the threads, so if you refer to a global variable in the code snippet and rely on a value in that variable, it won't be present in the other threads (unless the other threads are direct descendants and the value was set before their creation).

While working with the custom sorted indexes, I've also fixed the way the errors are reported in their Perl handlers. The errors used to be just printed on stderr. Now they propagate properly through the table, and the table operations die with the Perl handler's error message. Since an error in the sorting function means that things are going very, very wrong, after that the table becomes inoperative and will die on all the subsequent operations as well.

Thursday, August 30, 2012

RowType operations on Rows

As has been mentioned before, a RowType acts as a virtual call table for the rows of that type. The operations are:

bool isFieldNull(const Row *row, int nf) const;

Checks if the field nf in a Row is NULL.

bool getField(const Row *row, int nf, const char *&ptr, intptr_t &len) const;

Returns where to find a field in a row. nf is as usual the field number, with the data returned in ptr (pointer to the start of the field) and len (length of the field data). The function return value shows whether the field is not NULL. Also, for a NULL field the len will be 0. The returned data pointer type is constant, to remind that the rows are immutable and the data in them must not be changed.

However for most types you can't just dereference this pointer and get the desired value directly, because the data might not be aligned right for that data type. Because of this the returned pointer is a char * and not a void *. If you have an int64 field, you can't just do

int64_t *data;
intptr_t len;
if (getField(myrow, myfield, data, len)) {
    int64_t val = *data; // WRONG!
}

Fortunately, the type checks will catch this usage attempt right at the call of getField(). But there also are the convenience functions that return the values of particular types. They are implemented on top of getField() and take care of the alignment issue.

uint8_t getUint8(const Row *row, int nf, int pos = 0) const;
int32_t getInt32(const Row *row, int nf, int pos = 0) const;
int64_t getInt64(const Row *row, int nf, int pos = 0) const;
double getFloat64(const Row *row, int nf, int pos = 0) const;
const char *getString(const Row *row, int nf) const;

The extra argument pos is the position of the value in an array field. It's an array index, not a byte offset. For the scalar fields it must be 0. If the field is NULL or pos points beyond the end of the array, the returned value will be 0, which matches the Perl idiom of treating the undefined values as zeroes. If you care whether the field is NULL or not, check it first:

if (!rt1->isFieldNull(r1, nf)) {
    int64_t val = rt1->getInt64(r1, nf);
    ...
}
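
The usual way to deal with the alignment is to copy the bytes into a properly aligned variable with memcpy(). The sketch below shows that technique together with the zero-for-NULL convention described above. It's an assumption about how such a getter could be written on top of the pointer and length returned by getField(), not a copy of the Triceps code, and readInt64At() is a made-up name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read the pos-th int64 value from a possibly misaligned byte buffer.
// Returns 0 when the buffer is NULL or when pos points beyond the end of
// the data, mirroring the convention of treating the NULLs as zeroes.
inline std::int64_t readInt64At(const char *ptr, std::intptr_t len, int pos = 0) {
    if (ptr == nullptr || pos < 0 || len < 0)
        return 0;
    const std::size_t offset = static_cast<std::size_t>(pos) * sizeof(std::int64_t);
    if (offset + sizeof(std::int64_t) > static_cast<std::size_t>(len))
        return 0;
    std::int64_t val;
    // memcpy works for any alignment of the source, unlike the
    // dereferencing of a cast pointer.
    std::memcpy(&val, ptr + offset, sizeof(val));
    return val;
}
```

Note that pos here is an index in units of values, not of bytes, the same as the pos argument of the getters.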

Since the strings are normally stored 0-terminated (but it's your responsibility to store them 0-terminated!), getString() just returns the pointer directly to the value in the field. If the string field is NULL, not a NULL pointer but a pointer to an empty string is returned, in the same spirit of treating the undefined values as zeroes or empty strings. If you want to explicitly check for NULLs or get the string field length (including \0 at the end), use getField(). Since there are no string arrays, there is no position argument for getString().

As a side note, the arguments of these calls are Row*, not Rowref. It's cheaper, and OK for memory management because it's expected that the row would be held in a Rowref variable anyway while the data is extracted from it. Don't construct an anonymous Rowref object and immediately try to extract a value from it!

int64_t val = rt1->getInt64(Rowref(rt1, datavec), nf); // WRONG!

However if you have a data vector, there is no point in constructing a row to extract back the same data in the first place.

Right now, as I was writing this, it struck me how ugly these calls are on a Rowref:

Rowref r1(...);
...
int64_t val = r1.getType()->getInt64(r1, 3);

So I've added the matching convenience methods on Rowref, like:

int64_t val = r1.getInt64(3);

They will be available in the version 1.1. Note that they are called with ".", not "->". The "." makes them called directly on the Rowref object, while "->" would have meant that the Rowref is dereferenced to a Row pointer, and then a method be called on the Row object at that pointer.
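
The difference between "." and "->" on a smart reference is easy to see on a toy example. ToyRow and ToyRowref below are hypothetical classes made up for this illustration; the only point is which object ends up receiving the call.

```cpp
#include <string>

// Hypothetical row with a method of its own.
struct ToyRow {
    std::string describe() const { return "Row"; }
};

// Hypothetical reference class, in the spirit of a Rowref.
class ToyRowref {
public:
    explicit ToyRowref(ToyRow *row) : row_(row) {}

    // "->" gets forwarded to the row that the reference points to.
    ToyRow *operator->() const { return row_; }

    // "." calls a method of the reference object itself.
    std::string describe() const { return "Rowref"; }

private:
    ToyRow *row_;
};
```

With a ToyRowref r, the call r.describe() returns "Rowref" and r->describe() returns "Row".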

Continuing with the type methods, the constructor and destructor for the rows are also here:

Row *makeRow(FdataVec &data_) const;
void destroyRow(Row *row) const;

The makeRow() has been already discussed, and normally you never need to call destroyRow() manually, Rowref takes care of that. If you ever do the destruction manually, remember to honor the reference counts and call the destructor only after the reference count went to 0.

Another method compares the rows for absolute equality:

bool equalRows(const Row *row1, const Row *row2) const;

Right now it's defined to work only on the rows of the same type, including the same representation (but since only one CompactRowType representation is available, this is not a problem). When more representations become available, it will likely be extended. The FIFO index uses this method to find the rows by value.

The final method is provided for debugging:

void hexdumpRow(string &dest, const Row *row, const string &indent="") const;

It makes a hex dump of the internal representation of the row and appends it to the dest string. It's a very low-level method that requires the knowledge of the internal layout of a row and is useful for investigating memory corruptions.
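To give an idea of what such a dump involves, here is a minimal self-contained sketch of a hex dumper that appends to a string with an indent. It is not the Triceps implementation: the name hexdumpBytes() and the layout (16 bytes per line, each line starting with the offset) are my assumptions for illustration, and it works on any raw buffer rather than on a Row.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of Triceps: appends a hex dump of
// len bytes at data to dest, 16 bytes per line, each line prefixed
// with indent and the offset of its first byte.
void hexdumpBytes(std::string &dest, const void *data, size_t len,
                  const std::string &indent = "")
{
    const uint8_t *p = static_cast<const uint8_t *>(data);
    char buf[32];
    for (size_t i = 0; i < len; i++) {
        if (i % 16 == 0) {
            if (i != 0)
                dest += "\n";
            std::snprintf(buf, sizeof(buf), "%08zx:", i);
            dest += indent;
            dest += buf;
        }
        std::snprintf(buf, sizeof(buf), " %02x", (unsigned)p[i]);
        dest += buf;
    }
    if (len != 0)
        dest += "\n";
}
```

Like hexdumpRow(), it appends to the destination string instead of replacing it, so the caller can put a header line in front of the dump.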

Saturday, August 25, 2012

Row, Rowref and RowType

A row is defined naturally with the class Row, which is fundamentally an opaque buffer. You can't do anything with it directly other than having a pointer to it. You can't even delete a Row object using that pointer. To do anything with a Row, you have to go through that row's RowType. There are some helper classes, like CompactRow, but you don't need to concern yourself with them: they are helpers for the appropriate row types and are never used directly.

That opaque buffer is internally wired for the reference counting, of the Mtarget multithreaded variety. The rows can be passed and shared freely between the threads. No locks are needed for that (other than in the reference counter), the thread-safety is achieved by the rows being immutable. Once a row is created, it stays the same. If you need to change a row, just create a new row with the new contents. Basically, it's the same rules as in the Perl API.

The tricky part in the C++ API is that you can't simply use an Autoref<Row> for rows. As mentioned before, it won't know how to destroy the Row when its reference counter goes to zero. Instead you use a special variety of it called Rowref, defined in type/RowType.h, and described in a previous post. To summarize, it holds a reference both to the Row (that keeps the data) and to the RowType (that knows how to work with the Row). The RowType must be correct for the Row. It's possible to combine the completely unrelated Row and RowType, and the result will be at best some garbage data, or at worst a program crash. The Perl wrapper goes to great lengths to make sure that this doesn't happen. In the C++ API you're on your own. You gain the efficiency at the price of higher responsibility.

The general rule is that it's safe to combine a Row and RowType if this RowType matches the RowType used to create that row. The matching RowTypes may have different names of the fields but the same substance.

A Row is created similarly to a RowType: build a vector describing the values in the row, call the constructor, you get the row. The vector type is FdataVec, and its element type is Fdata. Both of them are top-level (i.e. Triceps::FdataVec and Triceps::Fdata), not inside some other class, and both are defined in type/RowType.h.

An Fdata describes the data for one field. It tells whether the field is not null, and if so, where to find the data to place into that field. It doesn't know anything about the field types or such. It deals with the raw bytes: the pointer to the first byte of the value, and the number of bytes. As a special case, if you want the field to be filled with zeroes, set the data pointer to NULL. It is possible to specify an incorrect number of bytes, for example create an int64 field of 3 bytes. This data will be garbage, and if it happens to be at the end of the row, might cause a crash. It's your responsibility to store the correct data. The same goes for the string fields: it's your responsibility to make sure that the data is terminated with a '\0', and that '\0' is included into the length of the data. On the other hand, the uint8[] fields don't need a '\0' at the end, all the bytes included into them are a part of the value.

The data vector gets constructed similarly to the field vector: either start with an empty vector and push back the elements, or allocate one of the right size and set the elements. The relevant Fdata constructors and methods are:

Fdata(bool notNull, const void *data, intptr_t len);
void setPtr(bool notNull, const void *data, intptr_t len);
void setNull();

The setNull() is a shortcut of setPtr() that sets the notNull to false and ignores the other fields. In version 1.0 the default Fdata constructor leaves all the fields uninitialized. I've changed this now for version 1.1 to set notNull to false by default.

For example:

uint8_t v_uint8[10] = "123456789"; // just a convenient representation
int32_t v_int32 = 1234;
int64_t v_int64 = 0xdeadbeefc00c;
double v_float64 = 9.99e99;
char v_string[] = "hello world";

FdataVec fd1;
fd1.push_back(Fdata(true, &v_uint8, sizeof(v_uint8)-1)); // exclude \0
fd1.push_back(Fdata(true, &v_int32, sizeof(v_int32)));
fd1.push_back(Fdata(false, NULL, 0)); // a NULL field
fd1.push_back(Fdata(true, &v_float64, sizeof(v_float64)));
fd1.push_back(Fdata(true, &v_string, sizeof(v_string)));

Rowref r1(rt1, rt1->makeRow(fd1));
Rowref r2(rt1, fd1);

The Rowref constructor from Fdata vector calls the makeRow() implicitly, for convenience, so both forms provide the same result. For another example that allocates a vector and then fills it:

FdataVec fd2(3);
fd2[0].setPtr(true, &v_uint8, sizeof(v_uint8)-1); // exclude \0
fd2[1].setNull();
fd2[2].setFrom(r1.getType(), r1.get(), 2); // copy from r1 field 2

Rowref r3(rt1, fd2);

The field 2 is set by copying it from a field of another row. It sets the data pointer to the location inside the original row, and the data will be copied when the new row gets created. So make sure to not release the reference to the original row until the new row is created. The prototype is:

void setFrom(const RowType *rtype, const Row *row, int nf);

In fd2 the vector is smaller than the number of fields in the row. The rest of fields are filled with NULLs. They actually are literally filled with NULLs in fd2: if the size of the argument vector for makeRow() is smaller than the number of fields in the row type, the vector gets extended with the NULL values before anything is done with it. It's no accident that the argument of the RowType::makeRow() is not const:

class RowType {
    virtual Row *makeRow(FdataVec &data) const;
};

class Rowref {
    Rowref(const RowType *t, FdataVec &data);
    Rowref &operator=(FdataVec &data);
};
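To make the reason for the non-const argument concrete, here is a tiny model of that padding step. FieldData and padWithNulls() are hypothetical names for this sketch, not the Triceps classes; I'm only assuming the behavior described above, that a too-short vector gets extended in place with NULL elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Fdata with only the parts needed here.
// A default-constructed element describes a NULL field.
struct FieldData {
    bool notNull_ = false;
    const void *data_ = nullptr;
    std::ptrdiff_t len_ = 0;
};

// Extends data in place so that it has at least nfields elements;
// the added elements are NULL fields. The argument can't be const
// because it's the caller's own vector that grows.
void padWithNulls(std::vector<FieldData> &data, size_t nfields)
{
    if (data.size() < nfields)
        data.resize(nfields); // new elements are default-constructed
}
```

A vector that is already long enough is left untouched, so the extra elements past the field count survive to be treated differently.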

It's also possible to have more elements in the FdataVec than in the row type. In this case the extra arguments are considered the "overrides": the "main" elements set the size of the fields while the "overrides" copy the data fragments over that. It's a convenient way to assemble the array fields from the fragments, for example:

RowType::FieldVec fields4;
fields4.push_back(RowType::Field("a", Type::r_int64, RowType::Field::AR_VARIABLE));

Autoref<RowType> rt4 = new CompactRowType(fields4);
if (rt4->getErrors()->hasError())
    throw Exception(rt4->getErrors(), true);

FdataVec fd4;
Fdata fdtmp;
fd4.push_back(Fdata(true, NULL, sizeof(v_int64)*10)); // allocate space
fd4.push_back(Fdata(0, sizeof(v_int64)*2, &v_int64, sizeof(v_int64)));
// fill a temporary element with setOverride and then insert it
fdtmp.setOverride(0, sizeof(v_int64)*4, &v_int64, sizeof(v_int64));
fd4.push_back(fdtmp);
// manually copy an element from r1
fdtmp.nf_ = 0;
fdtmp.off_ = sizeof(v_int64)*5;
r1.getType()->getField(r1.get(), 2, fdtmp.data_, fdtmp.len_);
fd4.push_back(fdtmp);

Rowref r4(rt4, fd4);

This creates a row type from a single field "a" at index 0, an array of int64. The data vector fd4 has the 0th element define the space for 10 elements in the array, filled by default with zeroes. It doesn't have to zero them, it could copy the data from some location in memory. I've just done the zeroing here to show how it can be done.

The rest of elements are the "overrides" constructed in different ways.

The first one uses the override constructor:

Fdata(int nf, intptr_t off, const void *data, intptr_t len);

Here nf is the number of the field whose contents to override, off is the byte offset in it, and data and len point to the location to copy from as usual. In this case the 2nd element (counting from 0) of the array gets set with the value from v_int64.

The second override uses the method setOverride() for the same purpose:

void setOverride(int nf, intptr_t off, const void *data, intptr_t len);

It sets a temporary Fdata which then gets appended (copied) to the vector. It sets the element of the array at index 4 to the same value of v_int64.

The third override copies the value from the row r1. Since there is no ready method for this purpose (perhaps there should be?), it goes about its way manually, setting the fields explicitly. nf_ is the same as nf in the methods, the field number to override. off_ is the offset. And the location and length get filled into data_ and len_ by getField(), which takes the data from the row r1, field 2.

But wait, the field 2 of r1 has been set to NULL! Should not the NULL indication be set in the copy as well? As it turns out, no. The NULL indication (the field notNull_ being set to false) is ignored by makeRow() in the override elements. However getField() will set the length to 0, so nothing will get copied. The value at index 5 will be left as it was initially set, which happens to be 0.

So in the end the values in the field "a" array at indexes 2 and 4 will be set to the same as v_int64, and the values at the other indexes in the range 0..9 will be 0.

If multiple overrides specify the overlapping ranges, they will just sequentially overwrite each other, and the last one will win.

If an override attempts to specify writing past the end of the originally reserved area of the field, it will be quietly ignored. Just don't do this. If the field was originally set to NULL, its reserved area will be zero bytes, so any overrides attempting to write into it will be silently ignored.

The summary is: the overrides allow building the array values efficiently from the disjoint areas of memory, but if they are used, they have to be used with care.
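The override rules can be modeled in a few lines without Triceps. This is only a sketch of the semantics as described above, not the code of makeRow(): the Override struct and the functions assembleField() and elementAt() are hypothetical, and skipping an out-of-range override as a whole is my reading of "quietly ignored".

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an override element: where to write
// inside the field (off) and what to copy there (data, len).
struct Override {
    size_t off;
    const void *data;
    size_t len;
};

// Builds the bytes of one field: reserves mainLen bytes (zeroed if
// mainData is null, copied from it otherwise), then applies the
// overrides in order, so on an overlap the last one wins. An override
// that would write past the reserved area is skipped entirely.
std::vector<uint8_t> assembleField(const void *mainData, size_t mainLen,
                                   const std::vector<Override> &overrides)
{
    std::vector<uint8_t> field(mainLen, 0);
    if (mainData != nullptr && mainLen != 0)
        std::memcpy(field.data(), mainData, mainLen);
    for (const Override &ov : overrides) {
        if (ov.len == 0 || ov.off > mainLen || ov.len > mainLen - ov.off)
            continue; // nothing to copy, or outside the reserved area
        std::memcpy(field.data() + ov.off, ov.data, ov.len);
    }
    return field;
}

// Reads back the int64 element at index idx of the assembled array.
int64_t elementAt(const std::vector<uint8_t> &field, size_t idx)
{
    int64_t v;
    std::memcpy(&v, field.data() + idx * sizeof(v), sizeof(v));
    return v;
}
```

With 80 zeroed bytes reserved and overrides at the offsets of elements 2 and 4, plus a zero-length one at element 5, this model produces the same array as the fd4 example: the value at indexes 2 and 4 and zeroes everywhere else.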

RowType in C++, part 3

The information about the contents of a RowType can be read back:

int fieldCount() const;
const vector<Field> &fields() const;
int findIdx(const string &fname) const;
const Field *find(const string &fname) const;

fields() has already been shown. fieldCount() returns the count of fields. findIdx() finds the index of the field by name (or returns -1 if there is no such field), so that it can then be looked up in the result of fields(). find() directly returns the pointer to the field by name (or NULL if there is no such field), combining these two actions.
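The relation between the two look-ups can be shown on a simplified model. The Field struct below is a hypothetical cut-down version with only the name, and findFieldIdx() and findField() are free functions written for this sketch, not the RowType methods themselves.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified field description; the real
// RowType::Field also carries the type and the array indication.
struct Field {
    std::string name_;
};

// Returns the index of the field named fname, or -1 if absent.
int findFieldIdx(const std::vector<Field> &fields, const std::string &fname)
{
    for (size_t i = 0; i < fields.size(); i++) {
        if (fields[i].name_ == fname)
            return static_cast<int>(i);
    }
    return -1;
}

// Returns a pointer to the field named fname, or a null pointer if
// absent. It's the index look-up followed by taking the element.
const Field *findField(const std::vector<Field> &fields, const std::string &fname)
{
    int idx = findFieldIdx(fields, fname);
    return idx < 0 ? nullptr : &fields[idx];
}
```

This model does a linear search on each call; if the look-ups by name happen often, it's cheaper to find the index once and keep it.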

The rest of the RowType methods have to do with the manipulation of the rows. Remember, the rows are not virtual, in a micro-optimization to save a little bit of space, so the RowType methods act as virtuals for the rows. They will be described momentarily, after an introduction to the rows.

RowType in C++, part 2

Let's get to constructing the row types. To reiterate the last post, you don't construct the objects of RowType class itself, it's an abstract class. You construct the objects of the concrete subclass(es), specifically CompactRowType. Make a vector describing the fields and do the construction.

You can make the vector by either starting with an empty one and adding the fields to it or allocating a vector of the right size in advance and setting the fields to it.

RowType::FieldVec fields1;
fields1.push_back(RowType::Field("a", Type::r_int64)); // scalar by default
fields1.push_back(RowType::Field("b", Type::r_int32, RowType::Field::AR_SCALAR));
fields1.push_back(RowType::Field("c", Type::r_uint8, RowType::Field::AR_VARIABLE));

RowType::FieldVec fields2(2);
fields2[0].assign("a", Type::r_int64); // scalar by default
fields2[1].assign("b", Type::r_int32, RowType::Field::AR_VARIABLE);

You can also reuse the same vector and clear/resize it as needed to create more types.

If you're used to laying out the C structures placing the larger elements first for the more efficient alignment, know that this is not needed for the Triceps rows. The CompactRowType stores the row data unaligned, so any field order will result in the same size of the rows. And it can't make use of some fields happening to be aligned either.
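For comparison, here is what the field order does to a plain C structure. The struct names are made up for this sketch, and packedSize only adds up the raw value sizes: it ignores any per-field bookkeeping that the real compact format keeps.

```cpp
#include <cstddef>
#include <cstdint>

// A C struct with the small member first: the compiler typically
// inserts padding before b to align it.
struct SmallFirst {
    uint8_t a;
    int64_t b;
    uint8_t c;
};

// The same members with the large one first: less padding.
struct LargeFirst {
    int64_t b;
    uint8_t a;
    uint8_t c;
};

// Size of the same three values stored back to back with no
// alignment; it's the same for any order of the fields.
constexpr size_t packedSize =
    sizeof(uint8_t) + sizeof(int64_t) + sizeof(uint8_t);
```

On a typical x86-64 build sizeof(SmallFirst) comes out larger than sizeof(LargeFirst) because of the padding, while packedSize stays at 10 bytes whatever the order.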

You can also find the simple types by their string names:

fields1.push_back(RowType::Field("d", Type::findSimpleType("uint8"), RowType::Field::AR_VARIABLE));

If the type name is incorrect and the type is not found, findSimpleType() will return NULL, and that NULL will be caught later at the row type creation time. Note that there is no automatic look-up of the array types. You can't simply pass "uint8[]" to findSimpleType(). You have to break it up into the simple type name as such and the array indication, like it's done in perl/Triceps/RowType.xs. This would probably be a good thing to add to RowType::Field in the future.
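The breaking-up could look like the sketch below. The function splitArrayType() is a hypothetical helper written for this post, it's neither part of Triceps nor a copy of what RowType.xs does.

```cpp
#include <string>

// Hypothetical helper, not part of Triceps: splits a type name such
// as "uint8[]" into the simple type name and the array indication.
// Stores the simple name into simpleName and returns true if the
// original name ended with "[]".
bool splitArrayType(const std::string &full, std::string &simpleName)
{
    const std::string suffix = "[]";
    if (full.size() >= suffix.size()
    && full.compare(full.size() - suffix.size(), suffix.size(), suffix) == 0) {
        simpleName = full.substr(0, full.size() - suffix.size());
        return true;
    }
    simpleName = full;
    return false;
}
```

The simple name would then go to Type::findSimpleType(), and the returned flag would select between RowType::Field::AR_VARIABLE and RowType::Field::AR_SCALAR.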

You can't use the type Type::r_void for the fields, it will be reported as an error.

After the fields array is created, create the row type:

Autoref<RowType> rt1 = new CompactRowType(fields1);
if (rt1->getErrors()->hasError())
    throw Exception(rt1->getErrors(), true);

You could also use Autoref<CompactRowType> but there isn't any point to it, since all the methods of CompactRowType are virtuals inherited from RowType.

Don't forget to check that the constructed type has no errors, and bail out if so. Throwing an Exception is a convenient way to abort with a nice error message. I have plans to add a function checkOrThrow() that will replace this "if", but the details are to be worked out yet. A type with errors can't be used for anything, or it will cause the program to crash.

The RowType and its subclasses are immutable after construction, so they can be shared all you want. If you really need to create a copy, you can do it:

Autoref<RowType> rt2 = new CompactRowType(rt1);
if (rt2->getErrors()->hasError())
    throw Exception(rt2->getErrors(), true);

Checking the errors after the copy creation is kind of optional if the original type was correct, but it's better to be safe than sorry.

You can get back the information about the fields:

const RowType::FieldVec &f = rt1->fields();

It's a reference to the vector directly inside the RowType, so const reminds you not to change it (that vector is a copy of the vector used during the construction, so the original vector can be changed afterwards). If you want to extend a type with more fields, make a copy of its fields and extend it:

RowType::FieldVec fields3 = rt1->fields();
fields3.push_back(RowType::Field("z", Type::r_string));
Autoref<RowType> rt3 = new CompactRowType(fields3);
if (rt3->getErrors()->hasError())
    throw Exception(rt3->getErrors(), true);

That's about it for the RowType construction.

Tuesday, August 21, 2012

RowType in C++, part 1

In the Perl API a row type is a collection of fields. Under the hood the things are more complicated. In the C++ API Triceps allows for more flexibility, more ways to represent a row. The row type is represented by the abstract base class RowType that tells the logical structure of a row and by its concrete subclasses that define the concrete layout of data in a row. To create, read or manipulate a row, all you need to know is a reference to RowType. It would refer to a concrete row type object, and the concrete row operations are accessed by the virtual methods. But when you create a row type, you need to know, to which concrete row type it will belong.

Currently the choice is easy: there is only one such concrete subclass CompactRowType. The "compact" means that the data is stored in the rows in a compact form, one field value after another, without alignment. Perhaps some day there will be an AlignedRowType, allowing to read the values more efficiently. Or perhaps some day there will be a ZippedRowType that would store the data in the compressed format.

You would never use the RowType constructor directly, it's called from the subclasses. But every subclass is expected to define a similar constructor:

RowType(const FieldVec &fields);
CompactRowType(const FieldVec &fields);

The FieldVec is the definition of fields in the row type. It's defined simply as:

typedef vector<Field> FieldVec;

An important side note is that the field is defined within the RowType, so it's really RowType::FieldVec and RowType::Field, and you need to refer to them in your code by this qualified name. So, to create a row type, you create a vector of field definitions first and then construct the row type from it. You can throw away or modify that vector afterwards.

As usual, the constructor arguments might not be correct, and any errors will be remembered and returned with getErrors(). Don't use a type with errors (other than to read the error messages from it, and to destroy it), it might cause your program to crash.

A Field consists of the basic information about it: the name, the type, and the array indication (remember, a Triceps field may contain an array). The array indication is either RowType::Field::AR_SCALAR for a scalar value or RowType::Field::AR_VARIABLE for a variable-sized array. The original plan was also to use the integer values for the fixed-sized array fields, but in reality the variable-sized array fields have turned out to be easier to implement and that was it. So don't use the integer values. Most probably they would work like the same variable-sized arrays but they haven't been tested, and something somewhere might crash. Use the symbolic enum AR_*.

The normal Field constructor provides all this information:

Field(const string &name, Autoref<const Type> t, int arsz = AR_SCALAR);

Or you can use the default constructor and later change the fields in the Field (or of course read them) as you please:

string name_;
Autoref<const Type> type_;
int arsz_;

Or you can assign them in one fell swoop:

void assign(const string &name, Autoref<const Type> t, int arsz = -1);

Note that even though theoretically you can define a field of any Type, in practice it has to be of a SimpleType, or the RowType constructor will return an error later. Why isn't it defined as an Autoref<SimpleType> then? The grand plan is to allow some more interesting data structures in the rows, and this keeps the door open. In particular, the rows will be able to hold references to the other rows, just I haven't got to implementing it yet.

Once again, a RowType constructor makes a copy of the FieldVec for its use, so you can modify or destroy the original FieldVec right away. You can get back the information about the fields in RowType:

const vector<Field> &fields() const;

It returns a reference directly to the FieldVec contained in the row type, so you must never modify it! The const-ness gives a reminder about it.

There are more row type constructors (but no default one). First, each subclass variety is supposed to be able to construct its variety by copying any RowType:

CompactRowType(const RowType &proto);
CompactRowType(const RowType *proto);

The version with the pointer argument also works for passing the Autoref<RowType> as the argument which gets automatically converted to a pointer. And it's really used more typically than the reference version.

The resulting type will have the same logical structure but possibly a different representation than the original. By the way, if you care only about the logical structure but not representation, you still can't directly construct a RowType because it's an abstract class. But just construct any concrete subclass, say CompactRowType (since it's the only one available at the moment anyway), and then use its logical structure.

The other constructor variety is a factory method:

virtual RowType *newSameFormat(const FieldVec &fields) const;

It combines the representation format from one row type and the arbitrary logical structure (the fields vector), possibly from another row type. Of course, until more concrete type representations become available, its use is purely theoretical.

Saturday, May 19, 2012

More option checking

Some motifs in checking the options for the method calls have been coming up repeatedly, so I've added more Triceps::Opt methods that encapsulate them.

The first one deals with the mutually exclusive options. Triceps::Opt::parse() doesn't know how to check the mutual exclusivity correctly. For it the option is either mandatory or optional. And rather than complicate it with some convoluted specification of the option exclusivity groups, I've just added a separate method to check that:

$optName = &Triceps::Opt::checkMutuallyExclusive(
  $callerDescr, $mandatoryFlag,
  $optName1 => $optValue1, ...);

You call parse() and then you call checkMutuallyExclusive(). If it finds an error, it confesses. It returns the name of the only option that has been defined (or undef if none of them were defined). For example, this is what the JoinTwo constructor does:

&Triceps::Opt::checkMutuallyExclusive("Triceps::JoinTwo::new", 0,
  by => $self->{by},
  byLeft => $self->{byLeft});

$callerDescr is some string that describes the caller for the error message. The names of the options are also used in the error messages. $mandatoryFlag is 1 if exactly one option must be defined, or 0 if having none of them defined is also OK. The "defined" here means that the value passed in the arguments is not undef.
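The rule itself is simple enough to model outside of Perl. Here is a sketch of the same logic in C++, purely as an illustration: the real method is the Perl one shown above, and this function, its way of passing the options as (name, isDefined) pairs, and the wording of the error messages are all my assumptions. An exception stands in for confess.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical C++ model of the rule (the real method is in Perl).
// Each option is a (name, isDefined) pair. Returns the name of the
// single defined option, or "" if none is defined and that's allowed.
std::string checkMutuallyExclusive(const std::string &caller, bool mandatory,
    const std::vector<std::pair<std::string, bool> > &opts)
{
    std::string used;
    for (size_t i = 0; i < opts.size(); i++) {
        if (!opts[i].second)
            continue; // an undefined option doesn't count
        if (!used.empty())
            throw std::runtime_error(caller + ": options '" + used
                + "' and '" + opts[i].first + "' are mutually exclusive");
        used = opts[i].first;
    }
    if (used.empty() && mandatory)
        throw std::runtime_error(caller
            + ": one of the mutually exclusive options must be specified");
    return used;
}
```

The order of checks matters a little: a conflict between two defined options is reported even when the mandatory flag is off, and the "none defined" case is an error only when the flag is on.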

The second method is more specialized. It deals with the triangle of (Unit, RowType, Label). It turns out quite convenient to either let a template define its own input label and then manually connect it or just give it another label and let it automatically chain the input to that label. In the first case the template has to be told, what Unit it belongs to, and what is the RowType of the input data. In the second case they can be found from the Label. The method

&Triceps::Opt::handleUnitTypeLabel($callerDescr, $nameUnit, \$refUnit,
  $nameRowType, \$refRowType, $nameLabel, \$refLabel);

encapsulates this finding-out and other checks. Its rules are:

The label option and the row type option are mutually exclusive.
The unit option may be specified together with the label option, but it must be the same unit as in the label.
If the label option is used, the unit and row type option values will be populated from the label.
On any error it confesses, using $callerDescr for the caller description in the error message. The option name arguments are also used for the error messages.
It always returns 1.

The values are passed by reference because they may be computed by this method from the other values.

Here is a usage example:

&Triceps::Opt::handleUnitTypeLabel("Triceps::LookupJoin::new",
  unit => \$self->{unit},
  leftRowType => \$self->{leftRowType},
  leftFromLabel => \$self->{leftFromLabel});

The label object doesn't strictly have to be a label object. It may be any object that supports the methods getUnit() and getRowType().

Here you might remember that a Label doesn't have the method getRowType(), its method for getting the row type is called getType(). Well, I've added it now. You can now use either of

$lb->getType()
$lb->getRowType()

with the same effect.

Tuesday, December 27, 2011

printing the object contents

When debugging the programs, it's important to find from the error messages, what is going on, what kinds of objects are getting involved. Because of this, most of the Triceps objects provide a way to print out their contents into a string. This is done with the method print(). The simplest use is as follows:

$message = "Error in object " . $object->print();

Most of the objects tend to have a pretty complicated internal structure and are printed on multiple lines. They look better when the components are appropriately indented. The default call prints as if the basic message is un-indented, and indents every extra level by 2 spaces.

This can be changed with extra arguments. The general format of print() is:

$object->print([indent, [subindent] ])

where indent is the initial indentation, and subindent is the additional indentation for every level. So the default print() is equivalent to print("", "  ").

A special case is

$object->print(undef)

It prints the object in a single line, without line breaks.

The row types support the print() method. Here is an example of how a type would get printed:

$rt1 = Triceps::RowType->new(
  a => "uint8",
  b => "int32",
  c => "int64",
  d => "float64",
  e => "string",
);

Then $rt1->print() produces:

row {
  uint8 a,
  int32 b,
  int64 c,
  float64 d,
  string e,
}

With extra arguments $rt1->print("++", "--"):

row {
++--uint8 a,
++--int32 b,
++--int64 c,
++--float64 d,
++--string e,
++}

And finally with an undef argument $rt1->print(undef):

row { uint8 a, int32 b, int64 c, float64 d, string e, }
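The indentation rules can be reproduced with a small model. This is a sketch in C++ of how the three outputs above come together, not the Triceps code: FieldDefs and printRowType() are hypothetical, and the boolean argument stands in for passing undef.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical field list for this sketch: (name, type) pairs.
typedef std::vector<std::pair<std::string, std::string> > FieldDefs;

// Hypothetical model of the indentation rules. With multiline set
// to false it models print(undef): everything goes on one line.
std::string printRowType(const FieldDefs &fields,
    const std::string &indent = "", const std::string &subindent = "  ",
    bool multiline = true)
{
    // what goes before each field and before the closing brace
    const std::string fieldSep = multiline ? "\n" + indent + subindent : " ";
    const std::string endSep = multiline ? "\n" + indent : " ";
    std::string res = "row {";
    for (size_t i = 0; i < fields.size(); i++)
        res += fieldSep + fields[i].second + " " + fields[i].first + ",";
    res += endSep + "}";
    return res;
}
```

Note that the first line is never indented: the caller is expected to have already printed something on that line, such as the beginning of an error message.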

Sunday, December 25, 2011

Row types equivalence

The Triceps objects are usually strongly typed. A label handles rows of a certain type. A table stores rows of a certain type.

However there may be multiple ways to check whether a row fits for a certain type:

It may be a row of the exact same type, created with the same type object.
It may be a row of another type but one with the exact same definition.
It may be a row of another type that has the same number of fields and field types but different field names. The field names (and everything else in Triceps) are case-sensitive.

The types may be compared for these conditions using the methods:

$rt1->same($rt2)
$rt1->equals($rt2)
$rt1->match($rt2)

The comparisons are hierarchical: if two type references are the same, they would also be equal and matching; two equal types are also matching.
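The hierarchy can be spelled out on a simplified model. The struct and the three functions below are hypothetical and written in C++ only for illustration, they show the relation between the three checks rather than the way Triceps implements them.

```cpp
#include <string>
#include <vector>

// Hypothetical simplified row type: field names and type names,
// kept in two parallel vectors.
struct SimpleRowType {
    std::vector<std::string> names_;
    std::vector<std::string> types_;
};

// Same: the very same type object, a pointer comparison.
bool sameType(const SimpleRowType &a, const SimpleRowType &b)
{
    return &a == &b;
}

// Match: the same field types in the same order; names may differ.
bool matchTypes(const SimpleRowType &a, const SimpleRowType &b)
{
    return sameType(a, b) || a.types_ == b.types_;
}

// Equal: matching, and the field names are identical too
// (compared case-sensitively).
bool equalTypes(const SimpleRowType &a, const SimpleRowType &b)
{
    return sameType(a, b) || (a.types_ == b.types_ && a.names_ == b.names_);
}
```

The pointer comparison goes first in each function, so the common case of the same type object never gets to comparing the definitions.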

Most objects would accept the rows of any matching type (this may change or become adjustable in the future). However if the rows are not of the same type, this check involves a performance penalty. If the types are the same, the comparison is limited to comparing the pointers. But if not, then the whole type definition has to be compared. So every time a row of a different type is passed, it would involve the overhead of type comparison.

For example:

my @schema = (
  a => "int32",
  b => "string"
);

my $rt1 = Triceps::RowType->new(@schema) or die "$!";
# $rt2 is equal to $rt1: same field names and field types
my $rt2 = Triceps::RowType->new(@schema) or die "$!";
# $rt3 matches $rt1 and $rt2: same field types but different names
my $rt3 = Triceps::RowType->new(
  A => "int32",
  B => "string"
) or die "$!";

my $lab = $unit->makeDummyLabel($rt1, "lab") or die "$!";
# same type, efficient
my $rop1 = $lab->makeRowop(&Triceps::OP_INSERT,
  $rt1->makeRowArray(1, "x")) or die "$!";
# different row type, involves a comparison overhead
my $rop2 = $lab->makeRowop(&Triceps::OP_INSERT,
  $rt2->makeRowArray(1, "x")) or die "$!";
# different row type, involves a comparison overhead
my $rop3 = $lab->makeRowop(&Triceps::OP_INSERT,
  $rt3->makeRowArray(1, "x")) or die "$!";

A dummy label used here is a label that does nothing (its usefulness will be explained later).

Row types

In Triceps the relational data is stored and passed around as rows (once in a while I call them records, which is the same thing here). Each row belongs to a certain type, that defines the types of the fields. Each field may belong to one of the simple types:

uint8
int32
int64
float64
string

I like the explicit specification of the data size, so it's not some mysterious "double" but an explicit "float64".

uint8 is the type intended to represent the raw bytes. So, for example, when they are compared, they should be compared as raw bytes, not according to the locale. Since Perl stores the raw bytes in strings, and its pack() and unpack() functions operate on strings, the Perl side of Triceps extracts the uint8 values from records into Perl strings, and the other way around.

The string type is intended to represent a text string in whatever current locale (at some point it may become always UTF-8, this question is open for now).

Perl on the 32-bit machines has an issue with int64: it has no type to represent it directly. Because of that, when the int64 values are passed to Perl on the 32-bit machines, they are converted into the floating-point numbers. This gives only 54 bits (including sign) of precision, but that's close enough. Anyway, the 32-bit machines are obsolete by now, and Triceps is targeted towards the 64-bit machines.
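The effect of that conversion is easy to see in C++, since a C++ double is the same 64-bit floating-point format on the usual platforms. The function roundTrip() is just a name for this sketch.

```cpp
#include <cstdint>

// Converts an int64 to double and back, the way a value would
// travel through a floating-point representation.
int64_t roundTrip(int64_t v)
{
    return static_cast<int64_t>(static_cast<double>(v));
}
```

Any value with the magnitude of up to 2**53 survives the round trip unchanged. The first one that doesn't is 2**53 + 1: it has no exact representation as a double and comes back as a neighboring value.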

On the 64-bit machines both int32 and int64 translate to the Perl 64-bit integers.

Note that there is no special type for timestamps. As of version 1.0 there is no time-based processing inside Triceps, but that does not prevent you from passing around timestamps as data and using them in your logic. Just store the timestamps as integers (or, if you prefer, as floating point numbers). When the time-based processing is added to Triceps, the plan is to still use the int64 to store the number of microseconds since the Unix epoch. My experience with the time types in the other CEP systems is that they cause nothing but confusion.
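Producing such a timestamp takes only a couple of lines. The sketch below is in C++ and uses only the standard library; nowMicros() and splitMicros() are names made up for it. It assumes that std::chrono::system_clock counts from the Unix epoch, which is guaranteed only since C++20 but holds on the common platforms before that too.

```cpp
#include <chrono>
#include <cstdint>

// Current time as microseconds since the Unix epoch, in an int64.
// Assumes that system_clock's epoch is the Unix epoch.
int64_t nowMicros()
{
    using namespace std::chrono;
    return duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
}

// Splits such a timestamp into the whole seconds and the remaining
// microseconds. Intended for non-negative timestamps.
void splitMicros(int64_t ts, int64_t &sec, int32_t &usec)
{
    sec = ts / 1000000;
    usec = static_cast<int32_t>(ts % 1000000);
}
```

A value like this fits into an int64 field as-is, and the plain integer arithmetic on it stays exact, unlike with the floating-point seconds.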

A row type is created from a sequence of (field-name, field-type) string pairs, for example:

$rt1 = Triceps::RowType->new(
  a => "uint8",
  b => "int32",
  c => "int64",
  d => "float64",
  e => "string",
) or die "$!";

Even though the pairs look like a hash, don't use an actual hash to create row types! The order of pairs in a hash is unpredictable, while the order of fields in a row type usually matters.

In an actual row the field may have a value or be NULL. The NULLs are represented in Perl as undef.

The real-world records tend to be pretty wide and contain repetitive data. Hundreds of fields are not unusual, and I know of a case when an Aleri customer wanted to have records of two thousand fields (and succeeded). This just begs for arrays. So the Triceps rows allow the array fields. They are specified by adding "[]" at the end of field type. The arrays may only be made up of fixed-width data, so no arrays of strings.

$rt2 = Triceps::RowType->new(
  a => "uint8[]",
  b => "int32[]",
  c => "int64[]",
  d => "float64[]",
  e => "string", # no arrays of strings!
) or die "$!";

The arrays are of variable length, whatever array data passed when a row is created determines its length. The individual elements in the array may not be NULL (and if undefs are passed in the array used to construct the row, they will be replaced with 0s). The whole array field may be NULL, and this situation is equivalent to an empty array.

The type uint8 is typically used in arrays, "uint8[]" is the Triceps way to define a blob field. In Perl the "uint8[]" is represented as a string value, same as a simple "uint8".

The rest of array values are represented in Perl as references to Perl arrays, containing the actual values.

The row type objects provide a way for introspection:

$rt->getdef()

returns back the array of pairs used to create this type. It can be used among other things for the schema inheritance. For example, the multi-part messages with daily unique ids can be defined as:

$rtMsgKey = Triceps::RowType->new(
  date => "string",
  id => "int32",
) or die "$!";

$rtMsg = Triceps::RowType->new(
  $rtMsgKey->getdef(),
  from => "string",
  to => "string",
  subject => "string",
) or die "$!";

$rtMsgPart = Triceps::RowType->new(
  $rtMsgKey->getdef(),
  type => "string",
  payload => "string",
) or die "$!";

The meaning here is the same as in the CCL example:

create schema rtMsgKey (
  string date,
  integer id
);
create schema rtMsg inherits from rtMsgKey (
  string from,
  string to,
  string subject
);
create schema rtMsgPart inherits from rtMsgKey (
  string type,
  string payload
);
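The mechanics of this kind of inheritance are nothing more than concatenating the ordered lists of field definitions. Here is the same idea as a sketch in C++, with a hypothetical FieldDefs type and extendDef() function that are not a part of the Triceps API.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical ordered field list: (name, type) pairs. A vector
// and not a map, because the order of fields matters.
typedef std::vector<std::pair<std::string, std::string> > FieldDefs;

// Returns the definition of base followed by the extra fields,
// leaving both arguments unchanged.
FieldDefs extendDef(const FieldDefs &base, const FieldDefs &extra)
{
    FieldDefs res = base;
    res.insert(res.end(), extra.begin(), extra.end());
    return res;
}
```

The key fields always end up first and in the same order in every derived definition, which is what lets the derived types share the key. This sketch doesn't check for duplicate field names, so a name conflict between the base and the extra fields would go unnoticed here.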

The grand plan is to provide some better ways of defining the commonality of fields between row types. It should include the ability to rename fields, to avoid conflicts, and to remember this equivalence to be reused in the further joins without the need to write it over and over again. But it has not come to the implementation stage yet.

$rt->getFieldNames()

returns the array of field names only.

$rt->getFieldTypes()

returns the array of field types only.

$rt->getFieldMapping()

returns the array of pairs that map the field names to their indexes in the field definitions. It can be stored into a hash and used for name-to-index translation. It's used mostly in the templates, to generate code that accesses data in the rows by field index (which is more efficient than access by name). For example, for rtMsgKey defined above it would return (date => 0, id => 1).
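The name-to-index translation can be sketched in plain Perl. To keep the sketch runnable without Triceps, the mapping is written out literally as the list that getFieldMapping() would return for rtMsgKey; the variable names and the sample values are made up for the illustration.

```perl
use strict;
use warnings;

# The list that $rtMsgKey->getFieldMapping() would return for the
# rtMsgKey type defined above, written out literally here.
my %idx = (date => 0, id => 1);

# Hypothetical row data in the array format, in the order of fields.
my @data = ("20111223", 42);

# Translate the name to the index once, then access by index.
my $id_pos = $idx{id};
print "id=$data[$id_pos] date=$data[$idx{date}]\n";
```

A template would typically do the translation at the code generation time, so that the generated code contains only the numeric indexes.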

Friday, December 23, 2011

Hello, world!

Let's finally get to business: write the "Hello, world!" program with Triceps. Since Triceps is an embeddable library, naturally, the smallest "Hello, world!" program would be in the host language without Triceps, but it would not be interesting. So here is a somewhat contrived but more interesting Perl program that passes some data through the Triceps machinery:

use Triceps;

$hwunit = Triceps::Unit->new("hwunit") or die "$!";
$hw_rt = Triceps::RowType->new(
  greeting => "string",
  address => "string",
) or die "$!";

my $print_greeting = $hwunit->makeLabel($hw_rt, "print_greeting", undef, 
  sub {
    my ($label, $rowop) = @_;
    printf "%s!\n", join(', ', $rowop->getRow()->toArray());
  } 
) or die "$!";

$hwunit->call($print_greeting->makeRowop(&Triceps::OP_INSERT,
  $hw_rt->makeRowHash(
    greeting => "Hello",
    address => "world",
  ) 
)) or die "$!";

What happens there? First, we import the Triceps module. Then we create a Triceps execution unit. An execution unit keeps the Triceps context and controls the execution for one thread. Nothing really stops you from having multiple execution units in the same thread, however there is not a whole lot of benefit in it either. But a single execution unit must never ever be used in multiple threads. It's single-threaded by design and has no synchronization in it. The argument of the constructor is the name of the unit, which can be used in printing messages about it. It doesn't have to be the same as the name of the variable that keeps the reference to the unit, but it's a convenient convention that makes the debugging easier.

If something goes wrong, the constructor will return an undef and set the error message in $!. This actually has turned out to be not such a good idea as it seemed, since writing "or die" at every line quickly turns tedious. And there is usually not much point in catching the errors of this type, since they are essentially the compilation errors and should cause the program to die anyway. So, this will be soon changed throughout the code to just die with the message (and if it needs to be caught, it can be caught with eval).
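The catching with eval is plain Perl and can be sketched without Triceps. The function make_thing() below is a made-up stand-in for a constructor that dies on error; it's not a part of the Triceps API.

```perl
use strict;
use warnings;

# A hypothetical constructor that dies with a message on error.
sub make_thing {
  my ($name) = @_;
  die "make_thing: the name must not be empty\n" if $name eq "";
  return { name => $name };
}

# Normally the call would just be left to die. To catch the error,
# wrap the call in eval and check $@ afterwards.
my $thing = eval { make_thing("") };
my $err = $@;
if (!defined $thing) {
  print "caught: $err";
}
```

The copy of $@ is saved right away because any later eval would overwrite it.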

The next statement creates the type for rows. For the simplest example, one row type is enough. It contains two string fields. A row type does not belong to an execution unit. It may be used in parallel by multiple threads. Once a row type is created, it's immutable, and that's the story for pretty much all the Triceps objects that can be shared between multiple threads: they are created, they become immutable, and then they can be shared. (Of course, the containers that facilitate the passing of data between the threads would have to be an exception to this rule).

Then we create a label. If you look at the "SQLy vs procedural" example a little while back, you'll see that the labels are analogs of streams in Coral8. And that's what they are in Triceps. Of course, now, in the days of the structured programming, we don't create labels for GOTOs all over the place. But we still use labels. The function names are labels, the loops in Perl may have labels. So a Triceps label can often be seen kind of like a function definition, but so far only kind of. It takes a data row as a parameter and does something with it. But unlike a proper function it has no way to return the processed data back to the caller. It has to either pass the processed data to other labels or collect it in some hardcoded data structure, from which the caller can later extract it back. This means that until this gets worked out better, a Triceps label is still much more like a GOTO label or Coral8 stream than a proper function. Just like the unit, a label may be used in only one thread.

A label takes a row type for the rows it accepts, a name (again, purely for the ease of debugging) and a reference to a Perl function that will be processing the data. Extra arguments for the function can be specified as well, but there is no use for them in this example.

Here it's a simple unnamed function. Though of course a reference to a named function can be used instead, and the same function may be reused for multiple labels. Whenever the label gets a row operation to process, its function gets called with the reference to the label object, the row operation object, and whatever extra arguments were specified at the label creation (none in this example). The example just prints a message combined from the data in the row.
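The calling convention can be imitated in plain Perl to show how one named function serves multiple labels. Everything in this sketch is a made-up stand-in: the "labels" are just arrays of a name, a function reference and the extra arguments, and the "row operation" is a reference to an array of values. The real labels are created with makeLabel() as shown in the example.

```perl
use strict;
use warnings;

# A named handler; $prefix is the extra argument.
sub format_row {
  my ($label, $rowop, $prefix) = @_;
  return "$prefix $label: " . join(', ', @$rowop);
}

# Hypothetical stand-ins for two labels: name, handler, extra args.
my @labels = (
  ["lbA", \&format_row, "first"],
  ["lbB", \&format_row, "second"],
);

my @out;
foreach my $lb (@labels) {
  my ($name, $func, @args) = @$lb;
  # The handler gets the label, the row operation, then the extra args.
  push @out, $func->($name, ["Hello", "world"], @args);
}
print "$_\n" foreach @out;
```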

Note that a label doesn't just get a row. It gets a row operation ("rowop" as it's called throughout the code). It's an important distinction. A row just stores some data. As the row gets passed around, it gets referenced and unreferenced, but it just stays the same until the last reference to it disappears, and then it gets destroyed. It doesn't know what happens with the data, it just stores them. A row may be shared between multiple threads. On the other hand, a row operation says "take these data and do a such and such thing with them". A row operation is a combination of a row of data, an operation code, and a label that is to execute the operation. It is confined to a single thread. Inside this thread a reference to a row operation may be kept and reused again and again, since the row operation object is also immutable.

Triceps has the explicit operation codes, very much like Aleri (only Aleri doesn't differentiate between a row and row operation, every row there has an opcode in it, and the Sybase CEP R5 does the same). It might be just my background, but let me tell you: the CEP systems without the explicit opcodes are a pain. The visible opcodes make life a lot easier. However unlike Aleri, there is no UPDATE opcode. The available opcodes are INSERT, DELETE and NOP (no-operation). If you want to update something, you send two operations: first DELETE for the old value, then INSERT for the new value. There will be a section later with more details and comparisons, but for now that's enough information.
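To illustrate the update as a pair of operations, here is a small sketch in plain Perl, without Triceps. The table is just a hash, and the constants and the function apply_op() are made up for this sketch with arbitrary values; in the real code the opcodes are referenced like &Triceps::OP_INSERT.

```perl
use strict;
use warnings;

# Hypothetical opcode constants, for this sketch only.
use constant { OP_NOP => 0, OP_INSERT => 1, OP_DELETE => 2 };

my %table; # a made-up table, keyed by id

sub apply_op {
  my ($opcode, $id, $value) = @_;
  if ($opcode == OP_INSERT) {
    $table{$id} = $value;
  } elsif ($opcode == OP_DELETE) {
    delete $table{$id};
  } # OP_NOP changes nothing
}

apply_op(OP_INSERT, 1, "old");
# The update is expressed as two operations:
apply_op(OP_DELETE, 1, "old"); # remove the old value
apply_op(OP_INSERT, 1, "new"); # put in the new value
print "id 1 => $table{1}\n";
```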

For this simple example, the opcode doesn't really matter, so the label function quietly ignores it. It gets the row from the row operation and extracts the data from it into the Perl format, then prints them. There are two Perl formats supported: an array and a hash. In the array format, the array contains the values of the fields in the order they are defined in the row type. The hash format consists of name-value pairs, which may be stored either in an actual hash or in an array. The conversion from row to a hash actually returns an array of values which becomes a hash if it gets stored into a hash variable.
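The way a list of name-value pairs becomes either an array or a hash is a feature of Perl itself, so it can be shown without Triceps. The function to_pairs() is a made-up stand-in for a row-to-hash conversion.

```perl
use strict;
use warnings;

# A hypothetical stand-in for a row-to-hash conversion: it returns a
# flat list of name-value pairs in the order of fields.
sub to_pairs {
  return (greeting => "Hello", address => "world");
}

my @as_array = to_pairs(); # keeps the pairs and their order
my %as_hash = to_pairs();  # the same list becomes a hash

print scalar(@as_array), " elements, greeting=$as_hash{greeting}\n";
```

The array keeps the order of the fields, while the hash gives the access by name but loses the order.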

As a side note, the fact that a simple example can ignore the opcode also suggests how the systems without explicit opcodes came to be: they were initially built on the simple stateless examples. And when the more complex examples turned up, they were already stuck on this path, and could not afford too deep a retrofit.

The final part of the example is the creation of a row operation for our label, with an INSERT opcode and a row created from hash-formatted Perl data, and calling it through the execution unit. The row type provides a method to construct the rows, and the label provides a method to construct the row operations for it. The call() method of the execution unit does exactly what its name implies: it evaluates the label function right now, and returns after all its processing is done.
