Sergey Babkin on CEP and stuff: Labels, part 1

In each CEP engine there are two kinds of logic: One is to get some request, look up some state, maybe update some state, and return the result. The other has to do with the maintenance of the state: make sure that when one part of the state is changed, the change propagates consistently through the rest of it. If we take a common RDBMS for an analog, the first kind would be like the ad-hoc queries, the second kind will be like the triggers. The CEP engines are very much like database engines driven by triggers, so the second kind tends to account for a lot of code.

The first kind of logic is often very nicely accommodated by the procedural logic. The second kind often (but not always) can benefit for a more relational, SQLy definition. Also, when every every SQL statement executes, it gets compiled first into the procedural form, and only then executes as the procedural code.

The Triceps approach is tilted toward procedural execution. That is, the procedural definitions come out of the box, and then the high-level relational logic can be defined with templates and code generators.

These bits of code, especially where the first and second kind connect, need some way to pass the data and operations between them. In Triceps these connection points are called Labels.

The streaming data rows enter the procedural logic through a label. Each row causes one call on the label. From the functional standpoint they are the same as Coral8 Streams, as has been shown earlier in the introduction. Except that in Triceps the labels get not just rows but operations on rows, as in Aleri: a combination of a row and an operation code. The name is "labels" because Triceps has been built around the more procedural ideas, and when looked at from that side, the labels are targets of calls and GOTOs.

If the streaming model is defined as a data flow graph, each arrow in the graph is essentially a GOTO operation, and each node is a label.

A Triceps label is not quite a GOTO label, since the actual procedural control always returns back after executing the label's code. It can be thought of as a label of a function or procedure. But if the caller does nothing but immedially return after getting the control back, it works very much like a GOTO label.

Each label accepts operations on rows of a certain type.

Each label belongs to a certain execution unit, so a label can be used only strictly inside one thread and can not be shared between threads.

Each label may have some code to execute when it receives a row operation. The labels without code can be useful too.

A Triceps model contains the straightforward code and the mode complex stateful elements, such as tables, aggregators, joiners (which may be implemented in C++ or in Perl, or created as user templates). These stateful elements would have some input labels, where the actions may be sent to them (and the actions may also be done as direct method calls), and output labels, where they would produce the indications of the changed state and/or responses to the queries. The output labels are typically the ones without code ("dummy labels"). They do nothing by themselves, but can pass the data to the other labels. This passing of data is achieved by chaining the labels: when a label is called, it will first execute its own code (if it has any), and then call the same operation on whatever labels are chained from it. Which may have more labels chained from them in turn. So, to pass the data, chain the input label of the following element to the output label of the previous element.

The execution unit provides methods to construct labels. A dummy label is constructed as:

$label = $unit->makeDummyLabel($rowType, "name");

It takes as arguments the type of rows that the label will accept and the symbolic name of the label. The name can be any but for the ease of debugging it's better to give the same name as the label variable.

The label with Perl code is constructed as follows:

$label = $unit->makeLabel($rowType, "name", \&clearSub,
  \&execSub, args...);

The row type and name arguments are the same as for the dummy label. The following arguments provide the references to the Perl functions that perform the actions. execSub is the function that executes to handle the incoming rows. It gets the arguments:

execSub($label, $rowop, args...)

Here $label is this label, $rowop is the row operation, and args are the same as extra arguments specified at the label creation.

The row operation actually contains the label reference, so why pass it the second time? The reason lies in the chaining. The current label may be chained, possibly through multiple levels, to some original label, and the rowop will refer to that original label. So the extra argument lets the code find the current label.

The clearSub deals with the destruction. Remember that the Triceps memory management uses the reference counting, which does not like the reference loops. The reference loops cause the objects to be never freed. It's no big deal if the data structures exist until the program exit anyway but becomes a memory leak if they are created and deleted dynamically.

If the labels are arranged in a cyclic graph, they refear to each other and create a reference loop. So the execution unit keeps track of all its labels, and when it gets destoryed, clears them up, breaking up the loops.

The clearing of a label drops all the references to execSub, clearSub and arguments, and clears all the chainings. But before anything else is done, clearSub gets a chance to execute and clear any application-level data. It gets as argument the label reference all the args from the label constructor:

clearSub($label, args...)

A typical case is to keep the state of a stateful element in a hash:

package MyElement;

sub new # (class, unit, name...)
{
  my ($class, $unit, $name) = @_; 
  my $self = {};
  ... 
  $self->inLabel = $unit->makeLabel(..., \&clear, \&handle, $self);
  $self->outLabel = $unit->makeDummyLabel(...);
  ...
  bless $self, $class;
  return $self;
}

Then the clearing function can wipe out the whole state of the element by undefining its hash:

sub clear # (label, self)
{
  my ($label, $self) = @_;
  undef %$self;
}

Either of execSub and clearSub can be specified as undef. Though a label with an undefined execSub is essentially a dummy label, only more heavyweight.

Another potential for reference loops is between the execution unit and the labels. A unit keeps a reference to all its labels. So the labels can not keep a reference to the unit. And they don't. Internally they have a plain pointer. Note however that in the example shown the labels have a Perl reference to the object where they belong. If that object is to have a Perl reference to the unit, it would create a reference loop, and the object will never be destroyed and never clear the labels. So generally the objects should never keep references to the unit. The unit also provides another way around this situation: it has a way to force the label clearing when a helper object gets destroyed. It will be described later.

P.S. The original published version of this post had a few paragraphs lost, the updated version from 01/06/12 has them re-added.

Sergey Babkin on CEP and stuff

Thursday, January 5, 2012

Labels, part 1

No comments:

Post a Comment

Links

About Me

Labels

Blog Archive