Sergey Babkin on CEP and stuff: December 2011

Wednesday, December 28, 2011

a little about templates

Since people have started commenting about templates, let me show in a bit more detail what I mean by them, using a simple example.

Coral8 doesn't provide a way to query the windows directly, especially when the CCL is compiled without debugging. So you're expected to make your own. People at DB have developed a nice pattern that goes approximately like this:

// some window that we want to make queryable
create window w_my schema s_my
keep last per key_a per key_b
keep 1 week;

// the stream to send the query requests
// (the schema can be shared by all simple queries) 
create schema s_query (
  qqq_id string // unique id of the query
);
create input stream query_my schema s_query;

// the stream to return the results
// (all result streams will inherit a partial schema)
create schema s_result (
  qqq_id string, // returns back the id received in the query
  qqq_end boolean // will be TRUE in the special end indicator record
);
create output stream result_my schema inherits from s_result, s_my;

// now process the query
insert into result_my
select q.qqq_id, NULL, w.*
from query_my as q, w_my as w;

// the end marker
insert into result_my (qqq_id, qqq_end)
select qqq_id, TRUE
from query_my;

To query the window, a program would select a unique query id, subscribe to result_my with a filter (qqq_id = unique_id) and send a record of (unique_id) into query_my. Then it would sit and collect the result rows. Finally it would get a row with qqq_end = TRUE and disconnect.

This is a fairly large amount of code to be repeated for every window. What I would like to do instead is to just write

create window w_my schema s_my
keep last per key_a per key_b
keep 1 week;
make_queryable(w_my);

and have the template make_queryable expand into the rest of the code (obviously, the schema definitions would not need to be expanded repeatedly, they would go into an include file).

To make things more interesting, it would be nice to have the query filter the results by some field values. Nothing as fancy as SQL, just by equality to some fields. Suppose, s_my includes the fields field_c and field_d, and we want to be able to filter by them. Then the query can be done as:

create input stream query_my schema inherits from s_query (
  field_c integer,
  field_d string
);

// result_my is the same as before...

// query with filtering (in a rather inefficient way) 
insert into result_my
select q.qqq_id, NULL, w.*
from query_my as q, w_my as w
where
  (q.field_c is null or q.field_c = w.field_c)
  and (q.field_d is null or q.field_d = w.field_d);

// the end marker is as before
insert into result_my (qqq_id, qqq_end)
select qqq_id, TRUE
from query_my;

It would be nice then to create this kind of query as a template instantiation:

make_query(w_my, (field_c, field_d));

If there weren't already an entrenched tradition at DB, I would not write directly in CCL at all. I would have made a macro language that would generate CCL. Of course, then the IDE would see only the results of the code generation and could not be used directly to write code in it, but who cares, IDEs are useless for this purpose anyway.
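To make the macro-language idea a bit more concrete, here is a minimal sketch in Perl of a generator that expands one make_queryable call into CCL text. The function name make_queryable comes from the example above, but its Perl signature and the naming convention (window w_X gets the streams query_X and result_X) are assumptions made for this illustration, and the shared schemas s_query and s_result are expected to come from an include file.

```perl
use strict;
use warnings;

# Hypothetical generator: expands one "make_queryable" call into CCL text.
# Assumed naming convention: window w_X gets streams query_X and result_X.
sub make_queryable {
    my ($window, $schema) = @_;
    (my $base = $window) =~ s/^w_//;   # w_my -> my
    return <<"END_CCL";
create input stream query_$base schema s_query;
create output stream result_$base schema inherits from s_result, $schema;

insert into result_$base
select q.qqq_id, NULL, w.*
from query_$base as q, $window as w;

insert into result_$base (qqq_id, qqq_end)
select qqq_id, TRUE
from query_$base;
END_CCL
}

# Print the CCL that would follow the definition of the window w_my.
print make_queryable("w_my", "s_my");
```

The output would be appended to the hand-written CCL file before compilation, which is exactly the point where the IDE stops seeing the real source.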

Interestingly, there already are people who do that kind of thing. Some people actually prefer the Aleri XML format because it's easier for them to generate the code in XML. (I don't exactly see why generating the code in XML would be easier, but there are all kinds of weird XML-based infrastructures out there).

Triceps constants

Triceps has a number of symbolic constants that are grouped into essentially enums. The constants themselves will be introduced with the classes that use them, but here is the general description common to them all.

In Perl they all are placed into the same namespace. Each group of constants (that can be thought of as an enum) gets its name prefix. For example, the operation codes are all prefixed with OP_, the enqueueing modes with EM_, and so on.

The underlying constants are all integer. The way to give symbolic names to constants in Perl is to define a function without arguments that would return the value. Each constant has such a function defined for it. For example, the opcode for the "insert" operation is the result of function Triceps::OP_INSERT.

Most methods that take constants as arguments are also smart enough to recognise the constant names as strings, and automatically convert them to integers. For example, the following calls are equivalent:

$label->makeRowop(&Triceps::OP_INSERT, ...);
$label->makeRowop("OP_INSERT", ...);

However the version with Triceps::OP_INSERT is more efficient. The ampersand (function call designator in Perl) is usually optional, but I tend to use it for clarity.

What if you need to print out a constant in a message? Triceps provides the conversion functions for each group of constants. They generally are named Triceps::somethingString. For example,

print &Triceps::opcodeString(&Triceps::OP_INSERT);

would print "OP_INSERT". If the argument is out of range of the valid enums, it would return undef (but not set any error message in $!).

There also are functions to convert from strings to constant values. They generally are named Triceps::stringSomething. For example,

&Triceps::stringOpcode("OP_INSERT")

would return the integer value of Triceps::OP_INSERT. If the string name is not valid for this kind of constants, it would also return undef.
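The scheme described in this post can be modeled in plain Perl without Triceps. The sketch below is only an illustration: the package MyEnum, the integer values, and the helper normalize() are made up for the example and say nothing about the actual Triceps values or internals.

```perl
use strict;
use warnings;

package MyEnum;   # hypothetical stand-in, not the Triceps module

# Each constant is a function without arguments that returns an integer.
# The values here are invented for the example.
sub OP_NOP    () { 0 }
sub OP_INSERT () { 1 }
sub OP_DELETE () { 2 }

my %name_to_value = (
    OP_NOP    => OP_NOP,
    OP_INSERT => OP_INSERT,
    OP_DELETE => OP_DELETE,
);
my %value_to_name = reverse %name_to_value;

# Integer to name; undef if the value is out of range.
sub opcodeString { my ($v) = @_; return $value_to_name{$v}; }

# Name to integer; undef if the name is not valid.
sub stringOpcode { my ($s) = @_; return $name_to_value{$s}; }

# Accept either the integer or its name, as the methods in the text do.
sub normalize {
    my ($op) = @_;
    return $op =~ /^\d+$/ ? $op : stringOpcode($op);
}

package main;

print MyEnum::opcodeString(&MyEnum::OP_INSERT), "\n";                # OP_INSERT
print MyEnum::stringOpcode("OP_DELETE"), "\n";                       # 2
print defined(MyEnum::opcodeString(99)) ? "defined" : "undef", "\n"; # undef
```

The string form costs a hash lookup on every call, which is the reason the version with the function call is the more efficient one.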

Rows

The rows in Triceps always belong to some row type, and are always immutable. Once a row is created, it can not be changed. This allows it to be referenced from multiple places, instead of copying the whole row value. Naturally, a row may be passed and shared between multiple threads.

The row type provides the constructor methods for the rows:

$row = $rowType->makeRowArray(fieldValue1, ..., fieldValueN);
$row = $rowType->makeRowHash(fieldName => fieldValue, ...);

Here $row is a reference to the resulting row. As usual, in case of error it will be left as undef, with the error message in $! (to be changed later to dying on error).

In the array form, the values for the fields go in the same order as they are specified in the row type (if there are too few values, the rest will be considered NULL, having too many values is an error).

The Perl value of undef is treated as NULL.

In the hash form, the fields are specified as name-value pairs. If the same field is specified multiple times, the last value will overwrite all the previous ones. The unspecified fields will be left as NULL. Again, the arguments of the function actually are an array, but if you pass a hash, its contents will be converted to an array on the calling stack.

If the performance is important, the array form is more efficient, since the hash form has to translate internally the field names to indexes.
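To show where the extra cost of the hash form comes from, here is a rough pure-Perl model of the name-to-index translation. It is a sketch under assumptions: the field list and the function pairs_to_array() are hypothetical and only mimic the behavior described above (unspecified fields stay NULL, a repeated field keeps the last value).

```perl
use strict;
use warnings;

# Hypothetical model of the hash form; this is not Triceps code.
my @field_names = qw(a b c);
my %index_of;
@index_of{@field_names} = (0 .. $#field_names);

sub pairs_to_array {
    my @pairs = @_;
    die "odd number of arguments\n" if @pairs % 2;
    my @values = (undef) x @field_names;   # unspecified fields stay NULL (undef)
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        die "unknown field '$name'\n" unless exists $index_of{$name};
        $values[$index_of{$name}] = $value;   # a later pair overwrites an earlier one
    }
    return @values;
}

my @row = pairs_to_array(a => 1, c => 3, a => 10);
print join(",", map { defined $_ ? $_ : "NULL" } @row), "\n";   # 10,NULL,3
```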

The row itself and its type don't know anything about any keys and such. So any fields may be left as NULL.

Some examples:

$row  = $rowType->makeRowArray(@fields) or die "$!";
$row  = $rowType->makeRowArray($a, $b, $c) or die "$!";
$row  = $rowType->makeRowHash(%fields) or die "$!";
$row  = $rowType->makeRowHash(a => $a, b => $b) or die "$!";

The usual Perl conversions are applied to the values. So for example, if you pass an integer 1 for a string field, it will be converted to the string "1". Or if you pass a string "" for an integer field, it will be converted to 0.

If a field is an array (as always, except for uint8[] which is represented as a Perl string), its value is a Perl array reference (or undef). For example:

$rt1 = Triceps::RowType->new(
  a => "uint8[]",
  b => "int32[]",
) or die "$!";
$row = $rt1->makeRowArray("abcd", [1, 2, 3]) or die "$!";

An empty array will become a NULL value. So the following two are equivalent:

$row = $rt1->makeRowArray("abcd", []) or die "$!";
$row = $rt1->makeRowArray("abcd", undef) or die "$!";

Remember that an array field may not contain NULL values. So any undefs in the array fields will be silently converted to zeroes. The following two are equivalent:

$row = $rt1->makeRowArray("abcd", [undef, undef]) or die "$!";
$row = $rt1->makeRowArray("abcd", [0, 0]) or die "$!";

The row also provides a way to copy itself, modifying the values of selected fields:

$row2 = $row1->copymod(fieldName => fieldValue, ...);

The fields that are not explicitly specified will be left unchanged. Since the rows are immutable, this is the closest thing to the field assignment. copymod() is generally more efficient than extracting the row into an array or hash, replacing a few of the values with new ones and constructing a new row. It bypasses the binary-to-Perl-to-binary conversions for the unchanged fields.

The row knows its type. It can be obtained as

$row->getType()

Note that this will create a new Perl wrapper to the underlying type object. So if you do:

$rt1 = ...;
$row = $rt1->makeRow...;
$rt2 = $row->getType();

then $rt1 will not be equal to $rt2 by direct Perl comparison ($rt1 != $rt2). However both $rt1 and $rt2 will refer to the same row type object, so $rt1->same($rt2) will be true.

The row references can also be compared for sameness:

$row1->same($row2)

The row contents can be extracted back into Perl representation as

@adata = $row->toArray();
%hdata = $row->toHash();

Again, the NULL fields will become undefs, and the array fields (unless they are NULL) will become Perl array references. Since the empty array fields are equivalent to NULL array fields, on extraction back they will be treated the same as NULL fields, and become undefs.

There is also a convenience function to get one field from a row at a time by name:

$value = $row->get("fieldName");

If you need to access only a few fields from a big row, get() is more efficient (and easier to write) than extracting the whole row with toHash() or even with toArray(). But don't forget that every time you call get(), it creates a new Perl value, which may be pretty involved if the value is an array. So the most efficient way for the values that get reused is to call get(), remember the result in a Perl variable, and then reuse that variable.

There is also a way to conveniently print a row's contents, usually for debugging purposes:

$result = $row->printP();

The name printP is an artifact of implementation: it shows that this method is implemented in Perl and uses the default Perl conversions of values to strings. The uint8[] arrays are printed directly as strings. The result is a sequence of name="value" or name=["value", "value", "value"] for all the non-NULL fields. The backslashes and double quotes inside the values are escaped by backslashes in Perl style. For example, reusing the row type above,

$row = $rt1->makeRowArray('ab\ "cd"', [0, 0]) or die "$!";
print $row->printP(), "\n";

will produce

a="ab\\ \"cd\"" b=["0", "0"]

Finally, there is a deep debugging method:

$result = $row->hexdump()

That dumps the raw bytes of the row's binary format, and is useful only to debug the more weird issues.

Tuesday, December 27, 2011

printing the object contents

When debugging the programs, it's important to find out from the error messages what is going on and what kinds of objects are involved. Because of this, most of the Triceps objects provide a way to print out their contents into a string. This is done with the method print(). The simplest use is as follows:

$message = "Error in object " . $object->print();

Most of the objects tend to have a pretty complicated internal structure and are printed on multiple lines. They look better when the components are appropriately indented. The default call prints as if the basic message is un-indented, and indents every extra level by 2 spaces.

This can be changed with extra arguments. The general format of print() is:

$object->print([indent, [subindent] ])

where indent is the initial indentation, and subindent is the additional indentation for every level. So the default print() is equivalent to print("", "  ").

A special case is

$object->print(undef)

It prints the object in a single line, without line breaks.

The row types support the print() method. Here is an example of how a type would get printed:

$rt1 = Triceps::RowType->new(
  a => "uint8",
  b => "int32",
  c => "int64",
  d => "float64",
  e => "string",
);

Then $rt1->print() produces:

row {
  uint8 a,
  int32 b,
  int64 c,
  float64 d,
  string e,
}

With extra arguments $rt1->print("++", "--"):

row {
++--uint8 a,
++--int32 b,
++--int64 c,
++--float64 d,
++--string e,
++}

And finally with an undef argument $rt1->print(undef):

row { uint8 a, int32 b, int64 c, float64 d, string e, }
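The indentation rules shown by the three outputs above can be reproduced with a few lines of plain Perl. This is only a sketch of the rules as they appear in the examples: the function print_rowtype() is hypothetical, works on a plain array of name/type pairs, and is not how Triceps implements print().

```perl
use strict;
use warnings;

# Hypothetical re-implementation of the indentation rules, for illustration only.
# $def is a reference to an array of (name => type) pairs.
sub print_rowtype {
    my ($def, $indent, $subindent) = @_;
    # An explicitly passed undef indent selects the single-line format.
    my $oneline = (@_ >= 2 && !defined $indent);
    $indent    = ""   unless defined $indent;
    $subindent = "  " unless defined $subindent;
    my $sep = $oneline ? " " : "\n" . $indent . $subindent;
    my $end = $oneline ? " " : "\n" . $indent;
    my $out = "row {";
    for (my $i = 0; $i < @$def; $i += 2) {
        $out .= $sep . $def->[$i + 1] . " " . $def->[$i] . ",";
    }
    return $out . $end . "}";
}

my @def = (a => "uint8", b => "int32");
print print_rowtype(\@def), "\n";               # multi-line, 2-space indent
print print_rowtype(\@def, "++", "--"), "\n";   # custom indentation
print print_rowtype(\@def, undef), "\n";        # row { uint8 a, int32 b, }
```

Note that the first line is never indented: the caller has already printed whatever precedes it.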

Sunday, December 25, 2011

Row types equivalence

The Triceps objects are usually strongly typed. A label handles rows of a certain type. A table stores rows of a certain type.

However there may be multiple ways to check whether a row fits for a certain type:

It may be a row of the exact same type, created with the same type object.
It may be a row of another type but one with the exact same definition.
It may be a row of another type that has the same number of fields and field types but different field names. The field names (and everything else in Triceps) are case-sensitive.

The types may be compared for these conditions using the methods:

$rt1->same($rt2)
$rt1->equals($rt2)
$rt1->match($rt2)

The comparisons are hierarchical: if two type references are the same, they would also be equal and matching; two equal types are also matching.

Most objects would accept the rows of any matching type (this may change or become adjustable in the future). However if the rows are not of the same type, this check involves a performance penalty. If the types are the same, the comparison is limited to comparing the pointers. But if not, then the whole type definition has to be compared. So every time a row of a different type is passed, it would involve the overhead of type comparison.

For example:

my @schema = (
  a => "int32",
  b => "string"
);

my $rt1 = Triceps::RowType->new(@schema) or die "$!";
# $rt2 is equal to $rt1: same field names and field types
my $rt2 = Triceps::RowType->new(@schema) or die "$!";
# $rt3  matches $rt1 and $rt2: same field types but different names
my $rt3 = Triceps::RowType->new(
  A => "int32",
  B => "string"
) or die "$!";

my $lab = $unit->makeDummyLabel($rt1, "lab") or die "$!";
# same type, efficient
my $rop1 = $lab->makeRowop(&Triceps::OP_INSERT,
  $rt1->makeRowArray(1, "x")) or die "$!";
# different row type, involves a comparison overhead
my $rop2 = $lab->makeRowop(&Triceps::OP_INSERT,
  $rt2->makeRowArray(1, "x")) or die "$!";
# different row type, involves a comparison overhead
my $rop3 = $lab->makeRowop(&Triceps::OP_INSERT,
  $rt3->makeRowArray(1, "x")) or die "$!";

A dummy label used here is a label that does nothing (its usefulness will be explained later).
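The three levels of comparison, and the reason the same-type case is the cheap one, can be modeled in plain Perl. In this sketch a row type is just a reference to an array of name/type pairs; the functions rt_same(), rt_match(), rt_equals() and accepts() are hypothetical names invented for the illustration, not Triceps calls.

```perl
use strict;
use warnings;

# Hypothetical model: a row type is an array ref of (name => type) pairs.

# Same object: only the references (pointers) are compared.
sub rt_same { my ($t1, $t2) = @_; return $t1 == $t2 ? 1 : 0; }

# Same number of fields and same field types; the names are ignored.
sub rt_match {
    my ($t1, $t2) = @_;
    return 0 unless @$t1 == @$t2;
    for (my $i = 1; $i < @$t1; $i += 2) {
        return 0 unless $t1->[$i] eq $t2->[$i];
    }
    return 1;
}

# Matching types plus identical (case-sensitive) field names.
sub rt_equals {
    my ($t1, $t2) = @_;
    return 0 unless rt_match($t1, $t2);
    for (my $i = 0; $i < @$t1; $i += 2) {
        return 0 unless $t1->[$i] eq $t2->[$i];
    }
    return 1;
}

# Cheap pointer test first, full comparison only when it fails.
sub accepts { my ($t1, $t2) = @_; return rt_same($t1, $t2) || rt_match($t1, $t2); }

my @schema = (a => "int32", b => "string");
my $rt1 = [@schema];
my $rt2 = [@schema];                       # separate object, identical definition
my $rt3 = [A => "int32", B => "string"];   # same field types, different names

printf "rt1/rt2: same=%d equals=%d match=%d\n",
    rt_same($rt1, $rt2), rt_equals($rt1, $rt2), rt_match($rt1, $rt2);   # 0 1 1
printf "rt1/rt3: same=%d equals=%d match=%d\n",
    rt_same($rt1, $rt3), rt_equals($rt1, $rt3), rt_match($rt1, $rt3);   # 0 0 1
```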

Row types

In Triceps the relational data is stored and passed around as rows (once in a while I call them records, which is the same thing here). Each row belongs to a certain type, that defines the types of the fields. Each field may belong to one of the simple types:

uint8
int32
int64
float64
string

I like the explicit specification of the data size, so it's not some mysterious "double" but an explicit "float64".

uint8 is the type intended to represent the raw bytes. So, for example, when they are compared, they should be compared as raw bytes, not according to the locale. Since Perl stores the raw bytes in strings, and its pack() and unpack() functions operate on strings, the Perl side of Triceps extracts the uint8 values from records into Perl strings, and the other way around.

The string type is intended to represent a text string in whatever current locale (at some point it may become always UTF-8, this question is open for now).

Perl on the 32-bit machines has an issue with int64: it has no type to represent it directly. Because of that, when the int64 values are passed to Perl on the 32-bit machines, they are converted into the floating-point numbers. This gives only 54 bits (including sign) of precision, but that's close enough. Anyway, the 32-bit machines are obsolete by now, and Triceps is targeted towards the 64-bit machines.

On the 64-bit machines both int32 and int64 translate to the Perl 64-bit integers.

Note that there is no special type for timestamps. As of version 1.0 there is no time-based processing inside Triceps, but that does not prevent you from passing around timestamps as data and using them in your logic. Just store the timestamps as integers (or, if you prefer, as floating-point numbers). When time-based processing is added to Triceps, the plan is to still use int64 to store the number of microseconds since the Unix epoch. My experience with the time types in the other CEP systems is that they cause nothing but confusion.

A row type is created from a sequence of (field-name, field-type) string pairs, for example:

$rt1 = Triceps::RowType->new(
  a => "uint8",
  b => "int32",
  c => "int64",
  d => "float64",
  e => "string",
) or die "$!";

Even though the pairs look like a hash, don't use an actual hash to create row types! The order of pairs in a hash is unpredictable, while the order of fields in a row type usually matters.

In an actual row the field may have a value or be NULL. The NULLs are represented in Perl as undef.

The real-world records tend to be pretty wide and contain repetitive data. Hundreds of fields are not unusual, and I know of a case when an Aleri customer wanted to have records of two thousand fields (and succeeded). This just begs for arrays. So the Triceps rows allow the array fields. They are specified by adding "[]" at the end of field type. The arrays may only be made up of fixed-width data, so no arrays of strings.

$rt2 = Triceps::RowType->new(
  a => "uint8[]",
  b => "int32[]",
  c => "int64[]",
  d => "float64[]",
  e => "string", # no arrays of strings!
) or die "$!";

The arrays are of variable length: whatever array data is passed when a row is created determines its length. The individual elements in the array may not be NULL (and if undefs are passed in the array used to construct the row, they will be replaced with 0s). The whole array field may be NULL, and this situation is equivalent to an empty array.

The type uint8 is typically used in arrays: "uint8[]" is the Triceps way to define a blob field. In Perl the "uint8[]" is represented as a string value, same as a simple "uint8".

The rest of array values are represented in Perl as references to Perl arrays, containing the actual values.

The row type objects provide a way for introspection:

$rt->getdef()

returns back the array of pairs used to create this type. It can be used among other things for the schema inheritance. For example, the multi-part messages with daily unique ids can be defined as:

$rtMsgKey = Triceps::RowType->new(
  date => "string",
  id => "int32",
) or die "$!";

$rtMsg = Triceps::RowType->new(
  $rtMsgKey->getdef(),
  from => "string",
  to => "string",
  subject => "string",
) or die "$!";

$rtMsgPart = Triceps::RowType->new(
  $rtMsgKey->getdef(),
  type => "string",
  payload => "string",
) or die "$!";

The meaning here is the same as in the CCL example:

create schema rtMsgKey (
  date string,
  id integer
);
create schema rtMsg inherits from rtMsgKey (
  from string,
  to string,
  subject string
);
create schema rtMsgPart inherits from rtMsgKey (
  type string,
  payload string
);

The grand plan is to provide some better ways of defining the commonality of fields between row types. It should include the ability to rename fields, to avoid conflicts, and to remember this equivalence to be reused in the further joins without the need to write it over and over again. But it has not come to the implementation stage yet.

$rt->getFieldNames()

returns the array of field names only.

$rt->getFieldTypes()

returns the array of field types only.

$rt->getFieldMapping()

returns the array of pairs that map the field names to their indexes in the field definitions. It can be stored into a hash and used for name-to-index translation. It's used mostly in the templates, to generate code that accesses data in the rows by field index (which is more efficient than access by name). For example, for rtMsgKey defined above it would return (date => 0, id => 1).
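Here is a small sketch of how such a mapping could be used by a template to generate code that accesses fields by index. It is plain Perl and makes assumptions: the row is modeled as an array reference rather than a Triceps row, the mapping is built by hand from the definition pairs instead of calling getFieldMapping(), and make_getter() is a hypothetical helper.

```perl
use strict;
use warnings;

my @def = (date => "string", id => "int32");   # same fields as rtMsgKey above

# Build the (name => index) pairs in the shape described in the text.
my @mapping  = map { ($def[2 * $_] => $_) } 0 .. (@def / 2 - 1);
my %index_of = @mapping;                       # (date => 0, id => 1)

# Generate an accessor with the field index baked in as a literal,
# so no name lookup happens when the accessor runs.
sub make_getter {
    my ($field) = @_;
    die "unknown field '$field'\n" unless exists $index_of{$field};
    my $code = "sub { return \$_[0]->[$index_of{$field}]; }";
    my $getter = eval $code or die "code generation failed: $@";
    return $getter;
}

my $get_id = make_getter("id");
print $get_id->(["2011-12-25", 42]), "\n";     # 42
```

The name lookup happens once, at generation time, and the generated code only does an indexed access.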

memory management

The memory is managed in Triceps using the reference counters. Each Triceps object has a reference counter in it. There is an Autoref template that produces reference objects. As the references are copied around between these objects, the reference counts in the target objects are adjusted. When the reference count drops to 0, the target object gets destroyed. While there are live references, the object can't get destroyed from under them. All nice and well and simple, however still possible to get wrong.

The major problem with the reference counters is the reference loops. If object A has a reference to object B, and object B has a reference (possibly, indirect) to object A, then neither of them will ever be destroyed. Many of these cases can be resolved by keeping a reference in one direction and a plain pointer in the other. This of course introduces the problem of dangling pointers, so extra care has to be taken to not dereference them. There also are the unpleasant situations when there is absolutely no way around the reference loops. For example, the Triceps label's method may keep a reference to the next label, where to send its processed results. If the labels are connected into a loop (a perfectly normal occurrence), this would cause a reference loop. Here the way around is to know when all the labels are no longer used (before the thread exit), and explicitly tell them to clear their references to the other labels. This breaks up the loop, and then bits and pieces can be collected by the reference count logic.
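Perl itself uses reference counting, so the loop problem and the "explicitly clear the references" cure can be demonstrated in plain Perl. This is a sketch only: the class Node is a hypothetical stand-in for a label that points to the next label, and it has nothing to do with the Triceps Autoref implementation.

```perl
use strict;
use warnings;

package Node;   # hypothetical stand-in for a label that points to the next label

my $destroyed = 0;   # how many Node objects have been destroyed so far

sub new       { my ($class, $name) = @_; return bless { name => $name, next => undef }, $class; }
sub chain     { my ($self, $next) = @_; $self->{next} = $next; return; }
sub clear     { my ($self) = @_; $self->{next} = undef; return; }
sub destroyed { return $destroyed; }
sub DESTROY   { $destroyed++; }

package main;

# Two nodes that reference each other; only the first one is returned.
sub make_loop {
    my $x = Node->new("x");
    my $y = Node->new("y");
    $x->chain($y);
    $y->chain($x);
    return $x;
}

my $leaked = make_loop();
undef $leaked;                     # the loop keeps both nodes alive
print "destroyed after leak: ", Node::destroyed(), "\n";    # 0

my $cleaned = make_loop();
$cleaned->clear();                 # explicitly break the loop first
undef $cleaned;
print "destroyed after clear: ", Node::destroyed(), "\n";   # 2
```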

The reference counting may be single-threaded or multi-threaded. If an object may only be used inside one thread, the references to it use the faster single-threaded counting. In C++ it's really important to not access and not reference the single-threaded objects from multiple threads. In Perl, when a new thread is created, only the multithreaded objects from the parent thread become accessible to it, the rest become undefined, so the issue gets handled automatically (as of version 1.0 even the potentially multithreaded objects are still exported to Perl as single-threaded, with no connection between threads yet).

The C++ objects are exported into Perl through wrappers. The wrappers perform the adaptation between Perl reference counting and Triceps reference counting, and sometimes more helper functions. Perl sees them as blessed objects, from which you can inherit and otherwise treat like normal objects.

When the Perl references are copied between the variables, this increases the Perl reference count to the same wrapper object. However if an object goes into the C++ land, and then is extracted back (such as, create a Rowop from a Row, and then extract the Row from that Rowop), a brand new wrapper gets created. It's the same underlying C++ object but with multiple wrappers. You can't tell that it's the same object by comparing the Perl references, because they may be pointing to the different wrappers. However Triceps provides the method same() that compares the data inside the wrappers. It can be used as

$row1->same($row2)

and if it returns true, then both $row1 and $row2 point to the same underlying row. Note also that if you inherit from the Triceps objects and add some extra data to them, none of that data nor even your derived class identity will be preserved when a new wrapper is created from the underlying C++ object.
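The difference between comparing the Perl references and calling same() can be shown with a toy model. In this sketch the class Wrapper and the plain hash standing in for the C++ object are hypothetical; only the idea (several wrappers, one underlying object) comes from the text.

```perl
use strict;
use warnings;

package Wrapper;   # hypothetical model of a Perl wrapper around a C++ object

sub new  { my ($class, $target) = @_; return bless { target => $target }, $class; }

# Compare what the wrappers point to, not the wrappers themselves.
sub same { my ($self, $other) = @_; return $self->{target} == $other->{target} ? 1 : 0; }

package main;

my $underlying = { data => "row contents" };   # stands in for the C++ object
my $w1 = Wrapper->new($underlying);
my $w2 = Wrapper->new($underlying);            # a brand new wrapper, same object

print "Perl references equal: ", ($w1 == $w2 ? "yes" : "no"), "\n";   # no
print "same(): ", ($w1->same($w2) ? "yes" : "no"), "\n";              # yes
```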

API introduction

As mentioned before, at the moment Triceps provides the APIs in C++ and Perl. They are similar but not quite the same, because the nature of the compiled and scripted languages is different. The C++ API is more direct and expects discipline from the programmer: if some incorrect arguments are passed, everything might crash. The Perl API should never crash. It should detect any incorrect use and report an orderly error. Besides, the idioms of the scripted languages are different from the compiled languages, and different usages become convenient.

The Perl API is implemented in XS. Some people may wonder, why not SWIG? SWIG would automatically export the API into many languages, not just Perl. The problem with SWIG is that it just maps the API one to one. And this doesn't work well: it makes some very ugly APIs with abilities to crash, which then have to be wrapped into more scripting code before they become usable. So then why bother with SWIG, when it's easier to just use the scripting language's native extension methods? Another benefit of the native support is the access to the correct memory management.

In general, I've tried to avoid the premature optimization. The idea is to get it working at all first, and then bother about working fast. The exceptions are the cases when the need for optimization looked obvious, and the logic intertwined with the general design strongly enough that, if done one way, it would be difficult to change in the future. We'll see if these "obvious" cases really turn out to be the obvious wins, or if they become a premature-optimization mess.

Going forward, I'll try to split the subjects into the separate posts, centered around Perl API, C++ API, or guts of the (C++) implementation and tag them as such. Feel free to pick only the ones you're interested in. Some of the posts still will be general, and have all three tags.

Friday, December 23, 2011

Hello, world!

Let's finally get to business: write the "Hello, world!" program with Triceps. Since Triceps is an embeddable library, naturally, the smallest "Hello, world!" program would be in the host language without Triceps, but it would not be interesting. So here is a somewhat contrived but more interesting Perl program that passes some data through the Triceps machinery:

use Triceps;

my $hwunit = Triceps::Unit->new("hwunit") or die "$!";
my $hw_rt = Triceps::RowType->new(
  greeting => "string",
  address => "string",
) or die "$!";

my $print_greeting = $hwunit->makeLabel($hw_rt, "print_greeting", undef, 
  sub {
    my ($label, $rowop) = @_;
    printf "%s!\n", join(', ', $rowop->getRow()->toArray());
  } 
) or die "$!";

$hwunit->call($print_greeting->makeRowop(&Triceps::OP_INSERT,
  $hw_rt->makeRowHash(
    greeting => "Hello",
    address => "world",
  ) 
)) or die "$!";

What happens there? First, we import the Triceps module. Then we create a Triceps execution unit. An execution unit keeps the Triceps context and controls the execution for one thread. Nothing really stops you from having multiple execution units in the same thread, however there is not a whole lot of benefit in it either. But a single execution unit must never ever be used in multiple threads. It's single-threaded by design and has no synchronization in it. The argument of the constructor is the name of the unit, that can be used in printing messages about it. It doesn't have to be the same as the name of the variable that keeps the reference to the unit, but it's a convenient convention to make the debugging easier.

If something goes wrong, the constructor will return an undef and set the error message in $!. This actually has turned out to be not such a good idea as it seemed, since writing "or die" at every line quickly turns tedious. And there is usually not much point in catching the errors of this type, since they are essentially the compilation errors and should cause the program to die anyway. So, this will be soon changed throughout the code to just die with the message (and if it needs to be caught, it can be caught with eval).

The next statement creates the type for rows. For the simplest example, one row type is enough. It contains two string fields. A row type does not belong to an execution unit. It may be used in parallel by multiple threads. Once a row type is created, it's immutable, and that's the story for pretty much all the Triceps objects that can be shared between multiple threads: they are created, they become immutable, and then they can be shared. (Of course, the containers that facilitate the passing of data between the threads would have to be an exception to this rule).

Then we create a label. If you look at the "SQLy vs procedural" example a little while back, you'll see that the labels are analogs of streams in Coral8. And that's what they are in Triceps. Of course, now, in the days of the structured programming, we don't create labels for GOTOs all over the place. But we still use labels. The function names are labels, the loops in Perl may have labels. So a Triceps label can often be seen kind of like a function definition, but so far only kind of. It takes a data row as a parameter and does something with it. But unlike a proper function it has no way to return the processed data back to the caller. It has to either pass the processed data to other labels or collect it in some hardcoded data structure, from which the caller can later extract it back. This means that until this gets worked out better, a Triceps label is still much more like a GOTO label or Coral8 stream than a proper function. Just like the unit, a label may be used in only one thread.

A label takes a row type for the rows it accepts, a name (again, purely for the ease of debugging) and a reference to a Perl function that will be processing the data. Extra arguments for the function can be specified as well, but there is no use for them in this example.

Here it's a simple unnamed function. Though of course a reference to a named function can be used instead, and the same function may be reused for multiple labels. Whenever the label gets a row operation to process, its function gets called with the reference to the label object, the row operation object, and whatever extra arguments were specified at the label creation (none in this example). The example just prints a message combined from the data in the row.

Note that a label doesn't just get a row. It gets a row operation ("rowop" as it's called throughout the code). It's an important distinction. A row just stores some data. As the row gets passed around, it gets referenced and unreferenced, but it just stays the same until the last reference to it disappears, and then it gets destroyed. It doesn't know what happens with the data, it just stores them. A row may be shared between multiple threads. On the other hand, a row operation says "take these data and do a such and such thing with them". A row operation is a combination of a row of data, an operation code, and a label that is to execute the operation. It is confined to a single thread. Inside this thread a reference to a row operation may be kept and reused again and again, since the row operation object is also immutable.

Triceps has the explicit operation codes, very much like Aleri (only Aleri doesn't differentiate between a row and row operation, every row there has an opcode in it, and the Sybase CEP R5 does the same). It might be just my background, but let me tell you: the CEP systems without the explicit opcodes are a pain. The visible opcodes make life a lot easier. However unlike Aleri, there is no UPDATE opcode. The available opcodes are INSERT, DELETE and NOP (no-operation). If you want to update something, you send two operations: first DELETE for the old value, then INSERT for the new value. There will be a section later with more details and comparisons, but for now that's enough information.
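As a rough illustration of how an update decomposes into two operations, here is a toy model in plain Perl. Everything in it is an assumption made for the example: the opcode values, the hash standing in for stateful storage, and the functions apply_op() and update() are hypothetical and are not the Triceps API.

```perl
use strict;
use warnings;

# Hypothetical opcode values, invented for this example.
use constant { OP_NOP => 0, OP_INSERT => 1, OP_DELETE => 2 };

my %table;   # key => value; stands in for some stateful storage

sub apply_op {
    my ($opcode, $key, $value) = @_;
    if    ($opcode == OP_INSERT) { $table{$key} = $value; }
    elsif ($opcode == OP_DELETE) { delete $table{$key}; }
    # OP_NOP: nothing to do
    return;
}

# There is no UPDATE opcode: an update is a DELETE of the old row
# followed by an INSERT of the new one.
sub update {
    my ($key, $old, $new) = @_;
    apply_op(OP_DELETE, $key, $old);
    apply_op(OP_INSERT, $key, $new);
    return;
}

apply_op(OP_INSERT, "AAA", 10);
update("AAA", 10, 15);
print "AAA => $table{AAA}\n";   # AAA => 15
```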

For this simple example, the opcode doesn't really matter, so the label function quietly ignores it. It gets the row from the row operation and extracts the data from it into the Perl format, then prints them. There are two Perl formats supported: an array and a hash. In the array format, the array contains the values of the fields in the order they are defined in the row type. The hash format consists of name-value pairs, which may be stored either in an actual hash or in an array. The conversion from row to a hash actually returns an array of values which becomes a hash if it gets stored into a hash variable.

As a side note, this also suggests how the systems without explicit opcodes came to be: they were initially built on the simple stateless examples. And when the more complex examples turned up, they were already stuck on this path, and could not afford too deep a retrofit.

The final part of the example is the creation of a row operation for our label, with an INSERT opcode and a row created from hash-formatted Perl data, and calling it through the execution unit. The row type provides a method to construct the rows, and the label provides a method to construct the row operations for it. The call() method of the execution unit does exactly what its name implies: it evaluates the label function right now, and returns after all its processing is done.
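The same sequence of steps can be mimicked in Python. This is a rough sketch of the flow only: the classes RowType, Label and Unit, their method names and the fields a and b are my stand-ins for illustration, not the actual Triceps Perl API.

```python
class RowType:
    """Knows the field names and constructs rows (here: plain tuples)."""

    def __init__(self, *field_names):
        self.field_names = field_names

    def make_row(self, **fields):
        unknown = set(fields) - set(self.field_names)
        if unknown:
            raise KeyError(f"unknown fields: {sorted(unknown)}")
        # Fields that are not mentioned are left empty (None).
        return tuple(fields.get(name) for name in self.field_names)


class Label:
    """Holds the function to run and constructs row operations for itself."""

    def __init__(self, name, row_type, func):
        self.name = name
        self.row_type = row_type
        self.func = func

    def make_rowop(self, opcode, row):
        return (self, opcode, row)


class Unit:
    """Execution unit: call() runs the label function right now."""

    def call(self, rowop):
        label, opcode, row = rowop
        label.func(label, opcode, row)  # returns only when processing is done


printed = []


def show(label, opcode, row):
    # Like the label in the example, quietly ignore the opcode.
    printed.append(dict(zip(label.row_type.field_names, row)))


rt = RowType("a", "b")
lb = Label("show", rt, show)
unit = Unit()
unit.call(lb.make_rowop("INSERT", rt.make_row(a=123, b="text")))
```

By the time call() returns, the label function has already run and its effects are visible.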

Wednesday, December 21, 2011

Enter Triceps

The Triceps development has been largely shaped by two considerations:

It has to be different from the Sybase products on which I worked. This is helpful from both the legal and the marketing standpoint: Sybase and StreamBase already have similar products that compete head to head. There is no use getting into the same fray without major resources.

It has to be small. I can't spend the same amount of effort on Triceps as a large company, or even as a small one. Keeping it small not only saves time but also allows the modifications to be easy and fast. The point of Triceps is to experiment with the CEP language to make it easy to use: try out the ideas, make sure that they work well, or replace them with other ideas. The companies with a large established product can't really afford the radical changes: they have invested much effort into the product, and are stuck with supporting it and providing compatibility into the future.

Both of these considerations point in the same direction: an embeddable CEP system. Adapting an integrated system for embedded usage is not easy, so it's a good open niche. Yeah, there is Esper, but from a cursory look, it seems to have the same issues as Coral8/StreamBase.

And an embeddable system saves on a lot of components.

For starters, no IDE. Anyway, I find the IDEs pretty useless for development in general, and especially for the CEP development. Though one does come in handy once in a while for the analysis of the code and debugging.

No new language, no need to develop compilers, virtual machines, function libraries, external callout APIs. Well, the major goal of Triceps is actually the development of a new and better language. But it's one of these paradoxes: Aleri does the relational logic looking like procedural, Coral8 and StreamBase do the procedural logic looking like relational, and Triceps is a design of a language without a language. Eventually there probably will be a language, to be mixed with the parent one. But for now a lot can be done by simply using the Triceps library in an existing scripting language. The existing scripting languages are already powerful, fast, and also allow the dynamic compilation.

No separate server executable, no need to control it, and no custom network protocols: the users can put the code directly into their executables and devise any protocols they please. Well, it's not a really good answer for the protocols, since it means that everyone who wants to communicate the streaming data for Triceps over the network has to implement these protocols from scratch. So eventually Triceps will provide a default implementation. But it doesn't have to be done right away.

No data persistence for now either. It's a nice feature, and I have some ideas about it too, but it requires a large amount of work, and doesn't really affect the API.

The language used to implement Triceps is C++, and the scripting language is Perl. Nothing really prevents embedding Triceps into other languages but it's not going to happen any time soon. The reason is that extra code adds weight and makes the changes more difficult.

The multithreading support has been a major consideration from the start. All the C++ code has been written with the multithreading in mind. However, in the first release the multithreading has not yet propagated into the Perl API.

Even though Triceps is an experimental system, that does not imply that it's of a toy quality. The code is written in production quality to start with, with a full array of unit tests.

Monday, December 19, 2011

we're not in 1950s any more, or are we?

Part of the complexity with CCL programming is that the CCL programs tend to feel very broken-up, with the flow of the logic jumping all over the place.

Consider a simple example: some incoming financial information may identify the securities by either RIC (Reuters identifier) or SEDOL or ISIN, and before processing it further we want to convert them all to ISIN (since the fundamentally same security may be identified in multiple ways when it's traded in multiple countries).

This can be expressed in CCL approximately like this (no guarantees about the correctness of this code, since I don't have a compiler to try it out):

// the incoming data
create schema s_incoming (
  id_type string, // identifier type: RIC, SEDOL or ISIN
  id_value string, // the value of the identifier
  // add another 90 fields of payload...
);

// the normalized data
create schema s_normalized (
  isin string, // the identity is normalized to ISIN
  // add another 90 fields of payload...
);

// schema for the identifier translation tables
create schema s_translation (
  from string, // external id value (RIC or SEDOL)
  isin string, // the translation to ISIN
);

// the windows defining the translations from RIC and SEDOL to ISIN
create window w_trans_ric schema s_translation
  keep last per from;
create window w_trans_sedol schema s_translation
  keep last per from;

create input stream i_incoming schema s_incoming;
create stream incoming_ric  schema s_incoming;
create stream incoming_sedol  schema s_incoming;
create stream incoming_isin  schema s_incoming;
create output stream o_normalized schema s_normalized;

insert
  when id_type = 'RIC' then incoming_ric
  when id_type = 'SEDOL' then incoming_sedol
  when id_type = 'ISIN' then incoming_isin
select *
from i_incoming;

insert into o_normalized
select
  w.isin,
  i. ... // the other 90 fields
from
  incoming_ric as i join w_trans_ric as w
    on i.id_value =  w.from;

insert into o_normalized
select
  w.isin,
  i. ... // the other 90 fields
from
  incoming_sedol as i join w_trans_sedol as w
    on i.id_value =  w.from;

insert into o_normalized
select
  i.id_value,
  i. ... // the other 90 fields
from
  incoming_isin as i;

Not exactly easy, is it, even with the copying of payload data skipped? You may notice that what it does could also be expressed as procedural pseudo-code:

// the incoming data
struct s_incoming (
  string id_type, // identifier type: RIC, SEDOL or ISIN
  string id_value, // the value of the identifier
  // add another 90 fields of payload...
);

// schema for the identifier translation tables
struct s_translation (
  string from, // external id value (RIC or SEDOL)
  string isin, // the translation to ISIN
);

// the windows defining the translations from RIC and SEDOL to ISIN
table s_translation w_trans_ric
  key from;
table s_translation w_trans_sedol
  key from;

s_incoming i_incoming;
string isin;

if (i_incoming.id_type == 'RIC') {
  isin = lookup(w_trans_ric, 
    w_trans_ric.from == i_incoming.id_value
  ).isin;
} elsif (i_incoming.id_type == 'SEDOL') {
  isin = lookup(w_trans_sedol, 
    w_trans_sedol.from == i_incoming.id_value
  ).isin;
} elsif (i_incoming.id_type == 'ISIN') {
  isin = i_incoming.id_value;
}

if (isin != NULL) {
  output o_normalized(isin,
    i_incoming.(* except id_type, id_value)
  );
}
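
For comparison, the pseudo-code above translates almost line by line into a real procedural language. Here is a sketch in Python, with plain dictionaries standing in for the translation windows; the identifier values and the payload field are made-up placeholders, not real securities.

```python
# Translation tables from RIC and SEDOL to ISIN (placeholder values).
W_TRANS_RIC = {"RIC1": "ISIN1"}
W_TRANS_SEDOL = {"SEDOL1": "ISIN1"}


def normalize(incoming):
    """Return the record keyed by ISIN, or None if it can't be translated."""
    id_type = incoming["id_type"]
    id_value = incoming["id_value"]

    if id_type == "RIC":
        isin = W_TRANS_RIC.get(id_value)
    elif id_type == "SEDOL":
        isin = W_TRANS_SEDOL.get(id_value)
    elif id_type == "ISIN":
        isin = id_value
    else:
        isin = None

    if isin is None:
        return None

    # Copy all the payload fields except id_type and id_value.
    payload = {
        name: value
        for name, value in incoming.items()
        if name not in ("id_type", "id_value")
    }
    return {"isin": isin, **payload}


from_ric = normalize({"id_type": "RIC", "id_value": "RIC1", "price": 10})
from_isin = normalize({"id_type": "ISIN", "id_value": "ISIN2", "price": 20})
unknown = normalize({"id_type": "SEDOL", "id_value": "nosuch", "price": 30})
```

The whole payload gets copied in one expression, whether it has one field or ninety.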

Basically, writing in CCL feels like programming in Fortran in the 50s: lots of labels, lots of GOTOs. Each stream is essentially a label, when looking from the procedural standpoint. It's actually worse than Fortran, since all the labels have to be pre-defined (with types!). And there isn't even the normal sequential flow, each statement must be followed by a GOTO, like on those machines with magnetic-drum main memory.

This is very much like the example in my book, in section 6.4, "Queues as the sole synchronization mechanism". You can look at the draft text online. This similarity is not accidental: the CCL streams are queues, and they are the only communication mechanism in CCL.
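To show how this feels from the procedural side, here is a sketch in Python that imitates the CCL structure: every stream is a queue, every statement reads from one queue and can only "jump" by writing into another one. It is a simplified imitation with made-up data and a naive scheduler of my own, handling only the RIC and ISIN branches; it makes no claim about how the real CCL engine schedules the statements.

```python
from collections import deque

# Each stream is a queue; all of them have to be declared in advance.
streams = {
    name: deque()
    for name in ("i_incoming", "incoming_ric", "incoming_isin", "o_normalized")
}
TRANS_RIC = {"RIC1": "ISIN1"}  # placeholder translation table


def split(rec):
    """The 'insert when ...' statement: effectively a computed GOTO."""
    target = {"RIC": "incoming_ric", "ISIN": "incoming_isin"}.get(rec["id_type"])
    if target is not None:
        streams[target].append(rec)


def from_ric(rec):
    isin = TRANS_RIC.get(rec["id_value"])
    if isin is not None:
        streams["o_normalized"].append({"isin": isin, "price": rec["price"]})


def from_isin(rec):
    streams["o_normalized"].append({"isin": rec["id_value"], "price": rec["price"]})


# Which statement consumes which stream.
STATEMENTS = {"i_incoming": split, "incoming_ric": from_ric, "incoming_isin": from_isin}


def run():
    """Keep draining the queues until nothing is left to process."""
    progress = True
    while progress:
        progress = False
        for name, statement in STATEMENTS.items():
            while streams[name]:
                statement(streams[name].popleft())
                progress = True


streams["i_incoming"].append({"id_type": "RIC", "id_value": "RIC1", "price": 10})
streams["i_incoming"].append({"id_type": "ISIN", "id_value": "ISIN2", "price": 20})
run()
result = list(streams["o_normalized"])
```

The logic is the same as in the procedural version, but it's scattered over three functions and four queues, and the path of a record can be found only by tracing the queue names.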

The SQL statement structure also adds to the confusion: each statement has the destination followed by the source of the data, so each statement reads like it flows backwards.

In Triceps I aim to do better. It is not as smooth as the shown pseudo-code yet, but things are moving in this direction. I have a few ideas about improving this pseudo-code too but they would have to wait until another day.

P.S. I don't seem to be able to post comments. I'm not sure, what is wrong with the Blogspot engine. But answering the comment, yeah, I don't know much about Esper. Both Coral8 and Streambase also have the .* syntax, and Aleri has a similar ExtendStream. However that copies all the fields, without dropping any of them (like id_type and id_value here).

Saturday, December 17, 2011

surveying the landscape

What do we have in the CEP area now? The scene is pretty much dominated by Sybase (combining the former Aleri and Coral8) and StreamBase.

There seem to be two major approaches to the execution model. One was used by Aleri, another by Coral8 and StreamBase. I'm not hugely familiar with StreamBase, but that's how it seems to me. Since I'm much more familiar with Coral8, I'll be calling the second model the Coral8 model. If you find StreamBase substantially different, let me know.

The Aleri idea is to collect and keep all the data. The relational operators get applied on the data, producing the derived data ("materialized views") and eventually the results. So, even though the Aleri models were usually expressed in XML (though an SQL compiler was also available), fundamentally it's a very relational and SQLy approach.

This creates a few nice properties. All steps of execution can be pipelined and executed in parallel. For persistence, it's fundamentally enough to keep only the input data (what has been called BaseStreams and then SourceStreams), and all the derived computations can be easily reprocessed on restart (it's funny but it turns out that often it's faster to read a small state from the disk and recalculate the rest from scratch in memory than to load a large state from the disk).
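The persistence idea can be sketched in a few lines of Python. This is only an illustration of the principle, not of the Aleri implementation: the log format, the field names and the rebuild() function are made up, and a JSON string stands in for the disk.

```python
import json


def rebuild(source_log):
    """Recompute the derived view (total quantity per symbol) from the inputs."""
    base = {}  # the input data, keyed by record id
    for opcode, key, row in source_log:
        if opcode == "INSERT":
            base[key] = row
        elif opcode == "DELETE":
            base.pop(key, None)

    totals = {}  # the derived "materialized view"
    for row in base.values():
        totals[row["symbol"]] = totals.get(row["symbol"], 0) + row["qty"]
    return totals


log = [
    ("INSERT", 1, {"symbol": "AAA", "qty": 10}),
    ("INSERT", 2, {"symbol": "AAA", "qty": 5}),
    ("INSERT", 3, {"symbol": "BBB", "qty": 7}),
    ("DELETE", 2, {"symbol": "AAA", "qty": 5}),
]
view_before = rebuild(log)

# Only the input log is persisted; the view is never written out.
saved = json.dumps(log)

# "Restart": read the inputs back and recalculate everything derived.
view_after = rebuild(json.loads(saved))
```

The derived view comes out the same after the restart, even though only the inputs were saved.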

It also has issues. It doesn't allow loops, and the procedural calculations aren't always easy to express. And keeping all the state requires more memory. The issues of loops and procedural computations have been addressed by FlexStreams: modules that would perform the procedural computations instead of relational operations, written in SPLASH - a vaguely C-ish or Java-ish language. However this tends to break the relational properties: once you add a FlexStream, usually you do it for the reasons that prevent the derived calculations from being re-done, creating issues with saving and restoring the state. Mind you, you can write a FlexStream that doesn't break any of them, but then it would probably be doing something that can be expressed without it in the first place.

Coral8 has grown from the opposite direction: the idea has been to process the incoming data while keeping a minimal state in variables and short-term "windows" (limited sliding recordings of the incoming data). The language (CCL) is very SQL-like. It relies on the state of variables and windows being pretty much global (module-wide), and allows the statements to be connected in loops. Which means that the execution order matters a lot. Which means that there are some quite extensive rules, determining this order. The logic ends up being very much procedural, but written in the peculiar way of SQL statements and connecting streams.

The good thing is that all this allows controlling the execution order very closely and writing things that are very difficult to express in pure un-ordered relational operators. This in turn allows aggregating the data early and creatively, keeping less data in memory.
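As an illustration of the minimal-state approach, here is a sketch of a time-limited sliding window with an early aggregation, written in Python. The SlidingSum class and the numbers are hypothetical, not anything from Coral8: the only state kept is the short window and a running total.

```python
from collections import deque


class SlidingSum:
    """Sum of the values seen within the last `span` time units."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, value), oldest first
        self.total = 0

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        self.total += value
        # Expire everything that has slid out of the window.
        while self.items and self.items[0][0] <= timestamp - self.span:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total


window = SlidingSum(span=10)
totals = [
    window.add(0, 1),
    window.add(5, 2),
    window.add(10, 4),  # the record from time 0 expires here
    window.add(20, 1),  # the records from times 5 and 10 expire here
]
```

Note that the result depends on the order in which the records arrive, which is exactly why the execution order matters so much in this model.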

The bad news is that it limits the execution to a single thread. If you want a separate thread, you must explicitly make a separate module, and program the communications between the modules, which is not exactly easy to get right. There are lots of people who do it the easy way and then wonder, why do they get the occasional data corruption. Also, the ordering rules for execution inside a module are quite tricky. Even for fairly simple logic, it requires writing a lot of code, some of which is just bulky (try enumerating 90 fields in each statement), and some of which is tricky to get right.

The summary is that everything is not what it seems: the Aleri models aren't usually written in SQL but are very declarative in their meaning, while the Coral8/StreamBase models are written in an SQL-like language but in reality are totally procedural.

Sybase is also striving for a middle ground, combining the features inherited from Aleri and Coral8 in its CEP R5 and later: use the CCL language but relax the execution order rules to the Aleri level, except for the explicit single-threaded sections where the order is important. Include the SPLASH fragments for where the outright procedural logic is easy to use. Even though it sounds like a hodge-podge, it actually came together pretty nicely. Forgive me for saying so myself, since I did a fair amount of the design and the execution logic implementation for it before I left Sybase.

Still, not everything is perfect in this merged world. The SQLy syntax still requires you to drag around all your 90 fields into nearly every statement. The single-threaded order of execution is still non-obvious. It's possible to write the procedural code directly in SPLASH but the boundary where the data passes between the SQLy and C-ish code still has a whole lot of its own kinks (fewer than in Aleri). And worst of all, there is still no modular programming. Yeah, there are "modules" but they are not really reusable. They are tied too tightly to the schema of the data. What is needed is something more like C++ templates.
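What I mean by a reusable module can be hinted at with a sketch in Python: a "keep last per key" window that doesn't care about the schema, with the key fields passed as parameters. The function make_keep_last and the sample records are made up for illustration.

```python
def make_keep_last(*key_fields):
    """Build a keep-last-per-key window for records of any schema."""
    state = {}

    def insert(record):
        key = tuple(record[field] for field in key_fields)
        state[key] = record  # a newer record replaces the older one

    def query(**filters):
        # Filter by equality on any fields, nothing as fancy as SQL.
        return [
            record
            for record in state.values()
            if all(record.get(name) == value for name, value in filters.items())
        ]

    return insert, query


# The same code serves two unrelated schemas.
insert_trade, query_trade = make_keep_last("symbol", "account")
insert_trade({"symbol": "AAA", "account": 1, "qty": 10})
insert_trade({"symbol": "AAA", "account": 1, "qty": 15})
insert_trade({"symbol": "BBB", "account": 1, "qty": 7})

insert_user, query_user = make_keep_last("login")
insert_user({"login": "alice", "role": "admin"})
```

Nothing in make_keep_last mentions any of the 90 fields; it knows only about the keys it has been told about.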

Monday, December 12, 2011

What this blog is about

It happened that I've worked for a while on and with the Complex Event Processing (CEP) systems. I worked for a few years on the internals of the Aleri CEP engine, then after Aleri acquired Coral8, some on the Coral8 engine, then after Sybase gobbled them both up, I designed and did the early implementation of a fair bit of the Sybase CEP R5. After that I moved on to Deutsche Bank and got the experience from the other side: using the CEP systems, primarily the former Coral8, now known as Sybase CEP R4.

This made me feel that writing the CEP models is unnecessarily difficult. Even the essentially simple things take too much effort. I've had this feeling before as well, but it's one thing to have it in the abstract, and another to grind against it every day.

Which in turn led me to thinking about making my own open-source CEP system, where I could try out the ideas I get, and make the streaming models easier to write. Thus the Triceps project was born. For a while it was called Biceps, until I learned of the existence of a research project called BiCEP. It's spelled differently, and is in a substantially different area of CEP work, but it's easier to avoid confusion, so I went one better and renamed it Triceps.

Since then I've moved on from DB, and I'm currently not using any CEP at work (though you never know what would happen), but Triceps has already gained momentum by itself.

This blog is about Triceps. It's a part of the Triceps release model: first, write and release the code, then write the docs as a blog, then integrate the docs with the code into a proper release. This would describe both the usage and internals of Triceps, and the reasons for why the internals work this way and not the other.
