Sunday, December 30, 2012

RowSetType

RowSetType, defined in types/RowSetType.h, is another item that is not visible in Perl. Maybe it will be in the future but at the moment things look good enough without it. It has been added for 1.1.0 and expresses the type ("return type" if you want to be precise) of a streaming function (FnReturn and FnBinding classes). Naturally, it's a sequence of the row types, and despite the word "set", the order matters.

A RowSetType is one of these objects that gets assembled from many parts and then initialized, like this:

Autoref<RowSetType> rst = initializeOrThrow(RowSetType::make()
    ->addRow("name1", rt1)
    ->addRow("name2", rt2)
);

The function, or actually template, initializeOrThrow() itself is also a new addition, which I'll describe in detail later.

Of course, nothing stops you from adding the row types one by one, in a loop or in some other way, and then calling initialize() manually. And yes, of course you can keep a reference to a row set type as soon as it has been constructed, not waiting for initialization. You could do instead:

Autoref<RowSetType> rst = new RowSetType();
rst->addRow("name1", rt1);
rst->addRow("name2", rt2);
rst->initialize();
if (rst->getErrors()->hasError()) {
  ...
}

You could use the initializeOrThrow() template here as well, it's just that I also wanted to show the manual handling of the errors. And you can use new or make() interchangeably too.


All that the initialization does is fixate the row set, forbidding the addition of further row types to it. That kind of makes sense at the moment, but I'm not so sure about the future: the dynamically expandable row sets might come in useful. We'll see when we get there.

RowSetType();
static RowSetType *make();

Construct a row set type. The method make() is just a wrapper around the constructor that is more convenient to use with the following ->addRow(), because of the way the operator precedence works in C++. Like any other type, RowSetType is unnamed by itself, and takes no constructor arguments. Like any other type, RowSetType is an Mtarget and can be shared between multiple threads after it has been initialized.

RowSetType *addRow(const string &rname, const_Autoref<RowType> rtype);

Add a row type to the set. All the row types are named, and all the names must be unique within the set. The order of the addition matters too. See the further explanation of why it does in the description of FnReturn. If this method detects an error (such as duplicate names), it will append the error to the internal Errors object, which can be read later with getErrors(). A type with errors must not be used.

The row types may not be added after the row set type has been initialized.

void initialize();

Initialize the type. Any detected errors can be read afterwards with getErrors(). The repeated calls of initialize() are ignored.

bool isInitialized() const;

Check whether the type has been initialized.

typedef vector<string> NameVec;
const NameVec &getRowNames() const;
typedef vector<Autoref<RowType> > RowTypeVec;
const RowTypeVec &getRowTypes() const;

Read back the contents of the type. The elements will go in the order they were added.

int size() const;

Read the number of row types in the set.

int findName(const string &name) const;

Translate the row type name to the index (i.e. the order in which it was added, starting from 0). Returns -1 on an invalid name.

RowType *getRowType(const string &name) const;

Find the type by name. Returns NULL on an invalid name.

const string *getRowTypeName(int idx) const;
RowType *getRowType(int idx) const;

Read the data by index. These methods check that the index is in the valid range, and otherwise return NULL.
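
For example, here is a small sketch of enumerating an initialized row set type (using the rst from the example above):

for (int i = 0; i < rst->size(); i++) {
    printf("%d: %s\n", i, rst->getRowTypeName(i)->c_str());
}

int idx = rst->findName("name2"); // 1 in the example above, -1 for an unknown name
RowType *rt2 = rst->getRowType("name2"); // the same look-up by name, NULL for an unknown name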

The usual methods inherited from Type also work: getErrors(), equals(), match(), printTo().

The row set types are considered equal if they contain the equal row types with equal names going in the same order. They are considered matching if they contain matching row types going in the same order, with any names. If the match condition seems surprising to you, think of it as "nothing will break if one type is substituted for another at execution time".

void addError(const string &msg);
Erref appendErrors();

The ways to add extra errors to the type's errors. It's for the convenience of the users of this type, the thinking being that since we already have one Errors object, we might as well use it for everything, and also keep all the errors reported in the order of the fields, rather than first all the errors from the type and then all the errors from its user. The FnReturn and FnBinding use it.
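
For example, a hypothetical user class might report its per-element failures like this, keeping them next to the type's own errors (the failure condition here is just a placeholder):

for (int i = 0; i < rst->size(); i++) {
    if (false /* creation of this element's internals failed */) {
        rst->addError("can not create the internals for the row type '"
            + *rst->getRowTypeName(i) + "'");
    }
}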

FrameMark in C++

The FrameMark (defined in sched/FrameMark.h) marks the unit's frame at the start of a loop, so that the rowops for the next iterations of the loop can be forked there. It's pretty simple:

FrameMark(const string &name);

The constructor that gives the mark a name. A FrameMark is an Starget, so it's reference-counted and may be used only in one thread.

const string &getName() const;

Read back the name.

Unit *getUnit() const;

This method is different from getUnit() on most of the other classes. It returns the pointer to the unit on which this mark has been set. A freshly created FrameMark would return NULL. Internally a FrameMark doesn't keep a reference to the unit, it's just a pointer, and a way for the Unit to check in loopAt() that the mark has indeed been set on this unit. And you can use it for entertainment purposes too. Normally when the frame marked with this mark gets popped from the Unit's stack, the mark becomes unset, and its getUnit() will return NULL.

All the actions on the FrameMark are done by passing it to the appropriate methods of the Unit. When a mark is set on a frame, the frame has a reference to it, so the mark won't be destroyed until the frame is freed.
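
Here is a sketch of the typical loop pattern, using the Unit methods setMark() and loopAt(); the rowop nextIter and the exact places in the loop body are made up for the example:

Autoref<FrameMark> mark = new FrameMark("loopMark");

// at the start of the loop body:
unit->setMark(mark);

// ... the body of the loop executes ...

// to send a rowop to the next iteration of the loop:
unit->loopAt(mark.get(), nextIter);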

Saturday, December 29, 2012

AggregatorGadget

AggregatorGadget is a fairly internal class, but I'll describe it as well while I'm at it. Each aggregator in a table has its own gadget, and that's what it is. It carries some extra information.

The grand plan was that the different aggregator types may define their own subclasses of AggregatorGadget, but in reality there appears to be no need to. So far all the aggregators happily live with the base AggregatorGadget.

AggregatorGadget(const AggregatorType *type, Table *table, IndexType *intype);

The type of the aggregator and the index type on which this particular aggregator is defined will be kept as references in the AggregatorGadget. The table will be remembered as a simple pointer (as usual, to avoid the circular references, since the Table references all its AggregatorGadgets).

Table *getTable() const;
const AggregatorType* getType() const;

Get back the information. By now I'm not sure why there is no method to get back the index type. Looks like nothing needs it, so the index type reference in the gadget is fully superfluous. The potential subclasses may read it from the field indexType_.

The normal way to use the AggregatorGadget is to call its method sendDelayed(). And it's called by other classes, not by its subclasses, so it's exported as public. On the other hand, the method send() must never be used with the AggregatorGadget, so it's made private (yes, I know that if you really want, you can use the superclass method, but just don't; the idea here is to guard against the accidental misuse, not against the malicious one).

Gadget in C++

The Gadget is unique to the C++ API, it has no parallels in Perl. Gadget is a base class defined in sched/Gadget.h, its object being something with an output label. The details of what this something is are determined by the subclass. Presumably, it also has some kind of inputs, but that's up to the subclass; the Gadget itself defines only the output label. To make a concrete example, a table is a gadget, and every aggregator in the table is also a gadget. However the "pre" and "dump" labels of the table are not gadgets, they are just extra labels strapped on the side.

Some of the reasons for the Gadget creation are purely historic by now. At some point it seemed important to have the ability to associate a particular enqueueing mode with each output label. Most tables might be using EM_CALL, but some, the ones in a loop, would use EM_FORK, and those that don't need to produce the streaming output would use EM_IGNORE. This approach didn't work out as well as it seemed at first, and is now outright deprecated: just use EM_CALL everywhere, and there are the newer and better ways to handle the loops. The whole Gadget thing should be redesigned at some point but for now I'll just describe it as it is.

As a result of that history, the enqueueing mode constants are defined in the Gadget class, enum EnqMode: EM_SCHEDULE, EM_FORK, EM_CALL, EM_IGNORE.

static const char *emString(int enval, const char *def = "???");
static int stringEm(const char *str);

Convert from the enqueueing mode constant to string, and back.

Gadget(Unit *unit, EnqMode mode, const string &name = "", const_Onceref<RowType> rt = (const RowType*)NULL);

The Gadget constructor is protected, since Gadget is intended to be used only as a base class, and never instantiated directly. The name and row type can be left undefined if they aren't known yet and initialized later. The output label won't be created until the row type is known, and you'd better set the name by that time too. The enqueueing mode may also be changed later, so initially it can be set to anything. All this is intended only to split the initialization in a more convenient way; once the Gadget components are set, they must not be changed any more.

The output label of the Gadget is a DummyLabel, and it shares the name with the Gadget. So if you want to differentiate that label with a suffix in the name, you have to give the suffixed name to the whole Gadget. For example, the Table constructor does:


  Gadget(unit, emode, name + ".out", rowt),


A Gadget keeps a reference to both its output label and its unit. This means that the unit won't disappear from under a Gadget, but to avoid the circular references, the Unit must not have references to the Gadgets (having references to their output labels is fine).

void setEnqMode(EnqMode mode);
void setName(const string &name);
void setRowType(const_Onceref<RowType> rt);

The protected methods to finish the initialization. Once the values are set, they must not be changed any more. Calling setRowType() creates the output label, and since the name of the output label is taken from the Gadget, you need to set the name before you set the row type.

EnqMode getEnqMode() const;
const string &getName() const;
Unit *getUnit() const;
Label *getLabel() const;

Get back the gadget's information. The label will be returned only after it's initialized (i.e. the row type is known), before then getLabel() would return NULL. And yes, it's getLabel(), NOT getOutputLabel().

The rest of the methods are for the convenience of sending the rows to the output label. They are protected, since they are intended for the Gadget subclasses (which in turn may decide to make them public).

void send(const Row *row, Rowop::Opcode opcode) const;

Construct a Rowop from the given row and opcode, and enqueue it to the output label according to the gadget's enqueueing mode. This is the most typical use.

void sendDelayed(Tray *dest, const Row *row, Rowop::Opcode opcode) const;

Create a Rowop and put it into the dest tray. The rowop will have the enqueueing mode populated according to the Gadget's setting. This method is used when the whole set of rowops needs to be generated before any of them can be enqueued, such as when a Table computes its aggregators. After the delayed tray is fully generated, it can be enqueued with Unit::enqueueDelayedTray(), which will consult each rowop's enqueueing mode and process it accordingly. Again, this stuff exists for the historic reasons, and will likely be removed sometime soon.
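
To put the protected methods together, here is a sketch of a minimal hypothetical subclass that simply exposes send() to its users (the class and names are made up for the example):

class OutputGadget : public Gadget
{
public:
    OutputGadget(Unit *unit, const string &name, const_Onceref<RowType> rt) :
        Gadget(unit, Gadget::EM_CALL, name + ".out", rt)
    { }

    // expose the protected send() to the users of this gadget
    void emit(const Row *row, Rowop::Opcode opcode) const
    {
        send(row, opcode);
    }
};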

Thursday, December 27, 2012

Label in C++

In C++ the custom labels are made by defining your own class that inherits from Label (in sched/Label.h). The subclass needs to define its own execution method:

virtual void execute(Rowop *arg) const;

The base class takes care of all the general execution mechanics, chaining etc. All you need to do in this method is perform your user-defined actions. By the way, this method is protected and should never be called directly. The labels must always be called through a unit, which will then execute them in the correct way.

It may (though doesn't have to) also define the custom clearing method:

virtual void clearSubclass();

Currently this method is called by clear() after the label is marked as cleared but before clearing of the chain, though this order may change in the future.

Now, the rest of the methods:

Label(Unit *unit, const_Onceref<RowType> rtype, const string &name);

The base class constructor. It's always constructed from a subclass, you can not instantiate the base Label class because it contains an abstract execute() method. The name argument used to be optional (and if you really want, you still may use an empty string as an explicit argument) but the unnamed labels are very difficult to make sense of later.

The constructed label keeps a reference to its row type, and a pointer (not reference, to avoid the circular references!) to the Unit.
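
For a simple example, here is a sketch of a label that counts the rowops going through it (the class is made up for the example):

class CountingLabel : public Label
{
public:
    CountingLabel(Unit *unit, const_Onceref<RowType> rtype, const string &name) :
        Label(unit, rtype, name),
        count_(0)
    { }

    long getCount() const
    {
        return count_;
    }

protected:
    virtual void execute(Rowop *arg) const
    {
        count_++; // count_ is mutable because execute() is const
    }

    mutable long count_;
};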

The information from the constructor can be read back:

const string &getName() const;
const RowType *getType() const;
Unit *getUnitPtr() const;

The method getUnitPtr() is named this way and not getUnit() to emphasize that the Label has only a pointer to the Unit, not a reference. After the label gets cleared, getUnitPtr() will return NULL. The reason is that after that the label doesn't know any more whether the unit still exists or has been deleted, and doesn't want to return a pointer to potentially freed memory.

const string &getUnitName() const;

A convenience method for the special case of getting the label's unit name. It's used in many error messages. You can't just say label->getUnitPtr()->getName() because getUnitPtr() might return NULL. getUnitName() takes care of it and returns a special string "[label cleared]" if the label has been cleared.

void clear();

Clears the label. After that the label stops working. Note that clearing a label doesn't dissociate it from its unit. Well, the label won't tell you its unit any more but the unit will still have a reference to the label! Use the unit's method forgetLabel() to dissociate it (but that won't clear the label by itself, so you have to call both unit->forgetLabel() and label->clear()). Of course, if you call unit->clearLabels(), that would take care of everything.

Clearing cleans the chaining list of this label but doesn't call recursively clear() on the formerly chained labels. If you need that, you have to do it yourself.

bool isCleared() const;

Check if the label is cleared.

void setNonReentrant();
bool isNonReentrant() const;

Mark the label as non-reentrant, and check this flag. There is no way to unset this flag. The meaning of it has been described at length before.

Erref chain(Onceref<Label> lab);

Chain another label to this one (so when this label is executed, the chained labels will also be executed in order). This label will keep a reference of the chained label. The circular chainings are forbidden and will throw an Exception.

typedef vector<Autoref<Label> > ChainedVec;
const ChainedVec &getChain() const;

Get back the information about the chained labels. This returns a reference to the internal vector, so if the chainings are changed afterwards, the changes will be visible in the vector.

bool hasChained() const;

A quick check, whether there is anything chained.

void clearChained();

Clear the chaining list of this label. (But doesn't call clear() on these labels!)
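
A quick usage sketch (lb1 and lb2 here are assumed to be already constructed labels with matching row types):

Erref err = lb1->chain(lb2);
// check err for errors as usual

if (lb1->hasChained()) {
    const Label::ChainedVec &cv = lb1->getChain();
    for (size_t i = 0; i < cv.size(); i++)
        printf("chained: %s\n", cv[i]->getName().c_str());
}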

Rowop *adopt(Rowop *from) const;

A convenient factory method for adopting the rowops. Treat it as a constructor: the returned Rowop will be newly constructed and have the reference count of 0; the pointer must be stored in an Autoref (or Onceref). This method by itself doesn't check whether the original Rowop has a matching type, it simply makes a copy with the label reference replaced. It's up to you to make sure that the labels are correct.
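
For example, inside some label's execute(Rowop *rop), the incoming rowop can be forwarded to another label of a matching row type (lbOut here is made up):

Autoref<Rowop> rop2 = lbOut->adopt(rop);
getUnitPtr()->call(rop2);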

A special subclass of the Label is DummyLabel: it's a label that does nothing. Its execute() method is empty. It's constructed very similarly to the Label:

DummyLabel(Unit *unit, const_Onceref<RowType> rtype, const string &name);

The dummy labels are convenient for chaining the other labels to them.

Wednesday, December 26, 2012

SourceForge flux completed

The SourceForge conversion has completed, and I've updated the source code repository links on the web page. I'm not exactly sure why they required this conversion. Okay, the svn+ssh access method to SVN is slightly more convenient, but the SVN browser seems to have become worse, and the other project functionality seems to have become slightly worse too.

SourceForge flux

SourceForge has been insisting on the conversion of the project to their new engine, and I've finally given in. This means that the SVN repository location has changed, and the links there don't work any more. I'll update them shortly. And if you've checked out the code from SVN, you'd need to re-do it from the new location.

Tuesday, December 25, 2012

Tray in C++

A Tray in C++, defined in sched/Tray.h, is simply a deque of Rowop references, plus an Starget, so that it can be referenced itself:

class Tray : public Starget, public deque< Autoref<Rowop> >

All it really defines is the constructors:

Tray();
Tray(const Tray &orig);

The operations on the Tray are just the usual deque operations.

Yes, you can copy the trays by constructing a new one from an old one:

Autoref<Tray> t1 = new Tray;
t1->push_back(op1);

Autoref<Tray> t3 = new Tray(*t1);

Afterwards t3 will contain references to the same rowops as t1 (but will be a different Tray than t1!).

The assignments (operator=) happen to just work out of the box because the operator= implementation in Starget does the smart thing and avoids the corruption of the reference counter. So you can do things like

*t3 = *t1;

It's worth noticing once more that unlike Rows and Rowops, the Trays are mutable. If you have multiple references to the same Tray, modifying the Tray will make all the references see its new contents!

An important difference from the Perl API is that in C++ the Tray is not associated with a Unit. It's constructed simply by calling its constructor, and there is no Unit involved. It's possible to create a tray that contains a mix of rowops for different units. If you combine the C++ and Perl code, and then create such mixes in the C++ part, the Perl part of your code won't be happy.

And there is actually a way to create the mixed-unit trays even in the Perl code, in the tray of FnBinding. But this situation would be caught when trying to get the tray into the Perl level, and the workaround is to use the method FnBinding::callTray().

The reason why Perl associates the trays with a unit is to make the check of enqueueing a tray easy: just check that the tray belongs to the right unit, and it's all guaranteed to be right. At the C++ level no such checks are made. If you enqueue the rowops on labels belonging to a wrong unit, they will be enqueued quietly, will attempt to execute, and from there everything will likely go wrong. So be disciplined. And maybe I'll think of a better way for keeping the unit consistency in the future.
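
For completeness, a small sketch of the typical use: collect a few rowops, then call them in one go (the label lbData, the rows and the unit are assumed to exist and belong together):

Autoref<Tray> tray = new Tray;
tray->push_back(new Rowop(lbData, Rowop::OP_INSERT, row1));
tray->push_back(new Rowop(lbData, Rowop::OP_DELETE, row2));
unit->callTray(tray);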

Monday, December 24, 2012

Rowop in C++

I've jumped right into the Unit without showing the objects it operates on. Now let's start catching up and look at the Rowops. The Rowop class is defined in sched/Rowop.h.

The Rowop in C++ consists of all the same parts as in the Perl API: a label, a row, and an opcode.

It has one more item that's not really visible in the Perl API, the enqueueing mode, but it's semi-hidden in the C++ API as well. The only place where it's used is in Unit::enqueueDelayedTray(). This basically allows you to build a tray of rowops, each with its own enqueueing mode, and then enqueue all of them appropriately in one go. This is actually kind of historic and caused by the explicit enqueueing mode specification for the Table labels. It's getting obsolete and will be removed sometime soon.

The Rowop class inherits from Starget, usable in one thread only. Since it refers to the Labels, that are by definition single-threaded, this makes sense. A consequence is that you can't simply pass the Rowops between the threads. The passing-between-threads requires a separate representation that doesn't refer to the Labels but instead uses something like a numeric index (and of course the Mtarget base class). This is a part of the ongoing work on multithreading, but you can also make your own.

The opcodes are defined in the enum Rowop::Opcode, so you normally use them as Rowop::OP_INSERT etc. As described before, the opcodes actually contain a bitmap of individual flags, defined in the enum Rowop::OpcodeFlags: Rowop::OCF_INSERT and Rowop::OCF_DELETE. You don't really need to use these flags directly unless you really, really want to.

Besides the 3 already described opcodes (OP_NOP, OP_INSERT and OP_DELETE) there is another one, OP_BAD. It's a special value returned by the string-to-opcode conversion method instead of the -1 returned by the other similar method. The reason is that OP_BAD is specially formatted to be understood by all the normal opcode type checks as a NOP, while -1 would be seen as a combination of INSERT and DELETE. So if you miss checking the result of conversion on a bad string, at least you would get a NOP and not some mysterious operation. The reason why OP_BAD is not exported to Perl is that in Perl an undef is used as the indication of the invalid value, and works even better.
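
For example, a sketch of that conversion behavior:

int op = Rowop::stringOpcode("INSRET"); // a misspelled string, so this returns Rowop::OP_BAD
if (Rowop::isNop(op)) {
    // OP_BAD ends up here, same as OP_NOP,
    // rather than looking like an INSERT or DELETE
}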

There is a pretty wide variety of Rowop constructors:

Rowop(const Label *label, Opcode op, const Row *row);
Rowop(const Label *label, Opcode op, const Rowref &row);


Rowop(const Label *label, Opcode op, const Row *row, int enqMode);
Rowop(const Label *label, Opcode op, const Rowref &row, int enqMode);

Rowop(const Rowop &orig);
Rowop(const Label *label, const Rowop *orig);

The constructors with the explicit enqMode are best not used outside of the Triceps internals, and will eventually be obsoleted. The last two are the copy constructor, and the adoption constructor which underlies Label::adopt() and can as well be used directly.
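
For example (lbData is assumed to be an existing Label pointer, and row an existing Row pointer of a matching type):

Autoref<Rowop> rop = new Rowop(lbData, Rowop::OP_INSERT, row);
Autoref<Rowop> rop2 = new Rowop(*rop); // a copy with the same label, opcode and row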

Once a rowop is constructed, its components can not be changed any more, only read.

Opcode getOpcode() const;
const Label *getLabel() const;
const Row *getRow() const;
int getEnqMode() const;

Read back the components of the Rowop. Again, the getEnqMode() is on the way to obsolescence. And if you need to check the opcode for being an insert or delete, the better way is to use the explicit test methods, rather than getting the opcode and comparing it for equality:

bool isInsert() const;
bool isDelete() const;
bool isNop() const;

Check whether the opcode requests an insert or delete (or neither).

The same checks are available as static methods that can be used on the opcode values:

static bool isInsert(int op);
static bool isDelete(int op);
static bool isNop(int op);

And the final part is the conversion between the strings and values for the Opcode and OpcodeFlags enums:

static const char *opcodeString(int code);
static int stringOpcode(const char *op);
static const char *ocfString(int flag, const char *def = "???");
static int stringOcf(const char *flag);

As mentioned above, stringOpcode() returns OP_BAD for the unknown strings, not -1.

Unit tracing in C++

By the way, I forgot to mention that Unit lives in sched/Unit.h. Now, to the tracing.

Unlike Perl, in C++ the tracer is defined by inheriting from the class Unit::Tracer. The base class provides the Mtarget, and in the subclass all you need is define your virtual method:

virtual void execute(Unit *unit, const Label *label, const Label *fromLabel, Rowop *rop, TracerWhen when);

It gets called at exactly the same points as the Perl tracer (the C++ part of the UnitTracerPerl forwards the calls to the Perl level). The arguments are also the same as described in the Perl docs. The only difference is that the argument when is a value of enum Unit::TracerWhen.

For example:

class SampleTracer : public Unit::Tracer
{
public:
    virtual void execute(Unit *unit, const Label *label, const Label *fromLabel, Rowop *rop, Unit::TracerWhen when)
    {
        printf("trace %s label '%s' %c\n", Unit::tracerWhenHumanString(when), label->getName().c_str(), Unit::tracerWhenIsBefore(when)? '{' : '}');
    }
};

This also shows a few Unit methods used for conversion and testing of the constants:

static const char *tracerWhenString(int when, const char *def = "???");
static int stringTracerWhen(const char *when);

Convert between the when enum value and the appropriate name. def is as usual the default placeholder that will be used for an invalid value. And the conversion from string would return a -1 on an invalid value.

static const char *tracerWhenHumanString(int when, const char *def = "???");
static int humanStringTracerWhen(const char *when);

The same conversion, only using a "human-readable" string format that is nicer for the messages. Basically, the same thing, only in lowercase words. For example, TW_BEFORE_CHAINED would become "before-chained".

static bool tracerWhenIsBefore(int when);
static bool tracerWhenIsAfter(int when);

Determines whether a when value is a "before" or "after" kind. This is an addition from 1.1, that was introduced together with the reformed scheduling. As you can see in the example above, it's convenient for printing the braces, or if you prefer indentation, for adjusting the indentation.

The tracer object (not a class but a constructed object!) is set into the Unit:

void setTracer(Onceref<Tracer> tracer);
Onceref<Tracer> getTracer() const;

Theoretically, nothing stops you from using the same tracer object for multiple units, even from multiple threads. But the catch for that is that for the multithreaded calls the tracer must have the internal synchronization. Sharing a tracer between multiple units in the same thread is a more interesting idea. It might be useful in case of the intertwined execution, with the cross-unit calls. But the catch is that the trace will be intertwined all the time.

The SampleTracer above was just printing the trace right away. Usually a better idea is to save the trace in the tracer object and return it on demand. Triceps provides a couple of ready tracers, and they use exactly this approach.

Here is the StringTracer interface:

  class StringTracer : public Tracer
  {
  public:
    // @param verbose - if true, record all the events, otherwise only the BEGIN records
    StringTracer(bool verbose = false);

    // Get back the buffer of messages
    // (it can also be used to add messages to the buffer)
    Erref getBuffer() const
    {  
      return buffer_;
    }  

    // Replace the message buffer with a clean one.
    // The old one gets simply dereferenced, so if you have a reference, you can keep it.
    void clearBuffer();

    // from Tracer
    virtual void execute(Unit *unit, const Label *label, const Label *fromLabel, Rowop *rop, TracerWhen when);

  protected:
    Erref buffer_;
    bool verbose_;
  };

An Erref object is used as a buffer, where the data can be added efficiently line-by-line, and later read. On each call StringTracer::execute() builds the string res, and appends it to the buffer:

buffer_->appendMsg(false, res);

The pattern of reading the buffer contents works like this:

string tlog = trace->getBuffer()->print();
trace->clearBuffer();

The log can then be actually printed, or used in any other way. An interesting point is that clearBuffer() doesn't clear the buffer but replaces it with a fresh one. So if you keep a reference to the buffer, you can keep using it:

Erref buf = trace->getBuffer();
trace->clearBuffer();
string tlog = buf->print();

The two ready tracers provided with Triceps are:


StringTracer: collects the trace in a buffer, identifying the objects as addresses. This is not exactly easy to read normally but may come useful if you want to analyze a core dump.


StringNameTracer: similar but prints the object identification as names. More convenient but prone to the duplicate names used for different objects.
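
A usage sketch, assuming that StringNameTracer is nested in Unit and has the same constructor signature as StringTracer:

Autoref<Unit::StringNameTracer> tracer = new Unit::StringNameTracer(true);
unit->setTracer(tracer.get());

// ... run some rowops through the unit ...

string tlog = tracer->getBuffer()->print();
tracer->clearBuffer();
printf("%s", tlog.c_str());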


Unfortunately, at the C++ level there is currently no nice printout of the rowops, like in Perl. But you can always make your own.


The tracing does not have to be used just for tracing. It can also be used for debugging, as a breakpoint: check in your tracer for an arbitrary condition, and stop if it has been met.


There is only one tracer per unit at a time. However if you want, you can implement the chaining in your own tracer (particularly useful if it's a breakpoint tracer): support a reference to another tracer object, and after doing your own part, call that one's execute() method.
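
For example, a sketch of such a chaining breakpoint tracer (the condition check is left as a stub, and the next tracer is assumed to be always set):

class BreakpointTracer : public Unit::Tracer
{
public:
    BreakpointTracer(Unit::Tracer *next) :
        next_(next)
    { }

    virtual void execute(Unit *unit, const Label *label, const Label *fromLabel,
        Rowop *rop, Unit::TracerWhen when)
    {
        if (false /* replace with your breakpoint condition on the arguments */) {
            // stop here: print a message, call abort(), or hit a debugger breakpoint
        }
        next_->execute(unit, label, fromLabel, rop, when); // forward along the chain
    }

protected:
    Autoref<Unit::Tracer> next_;
};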

Thursday, December 20, 2012

Unit in C++

I've been distracted a bit with the other things, and now I'm working on the multithreaded support. I expect that it will be some time until that jells together. In the meantime, let's continue the description of the C++ API. The next class is the Unit. This class has been modified in 1.1.0, and I will be describing the new version, without going separately into 1.0.

Unit(const string &name);

Constructs the execution unit.

const string &getName() const;
void setName(const string &name);

Get back or modify the name. Modifying the name is probably not a good idea, but the method is still here.

void schedule(Onceref<Rowop> rop);
void scheduleTray(const_Onceref<Tray> tray);
void fork(Onceref<Rowop> rop);
void forkTray(const_Onceref<Tray> tray);
void call(Onceref<Rowop> rop);
void callTray(const_Onceref<Tray> tray);


void enqueue(int em, Onceref<Rowop> rop);
void enqueueTray(int em, const_Onceref<Tray> tray);

Schedule, fork or call a rowop or tray, like in Perl. Unlike Perl, the methods with a tray argument have different names. And the enqueueing mode is always an integer constant. These constants are defined in the enum Gadget::EnqMode (the Gadget class will be described soon), and the value is one of Gadget::EM_SCHEDULE, Gadget::EM_FORK, Gadget::EM_CALL and Gadget::EM_IGNORE. I'm not sure if I've described EM_IGNORE before. I think I did but just in case: it means "do nothing with this rowop", and it's available in Perl too.
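
For a quick sketch of the difference in use (rop is assumed to be an already constructed Rowop reference):

unit->call(rop);                     // execute right away
unit->schedule(rop);                 // put on the outermost frame, to run later
unit->enqueue(Gadget::EM_FORK, rop); // by constant; here the same as unit->fork(rop)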

bool empty() const;

Check whether all the Unit's frames are empty.

void callNext();
void drainFrame();

Execute the next rowop from the current (innermost) frame, or all the rowops on the current frame. The semantics is the same as in the Perl code.

void setMark(Onceref<FrameMark> mark);

Set a mark on the current frame, same as in Perl.

void loopAt(FrameMark *mark, Onceref<Rowop> rop);
void loopTrayAt(FrameMark *mark, const_Onceref<Tray> tray);

Enqueue a rowop or tray at the marked frame.

 void callAsChained(const Label *label, Rowop *rop, const Label *chainedFrom);

This method was introduced in version 1.1, and hasn't propagated to Perl yet. I'm not even sure that I want it visible in Perl, since it's kind of low-level. It executes a label call, assuming that it was chained from another label (before 1.1 the functionality itself had obviously existed but was not visible in the API).

Here the row types of all the arguments must be matching. It asks to call the label with rowop, where the target label was chained from another label. It will do all the correct tracing for the chained calls. This method is used for example in the streaming functions, when an FnReturn calls through an FnBinding. You can use it directly as well, just be careful. And remember that keeping the consistency in the tracing is up to you: if you use the chainedFrom label argument that hasn't actually been called, the trace will look surprising.

void clearLabels();

Clear all the unit's labels, same semantics as when called from Perl.

void rememberLabel(Label *lab);

The method that connects a label to the unit. Normally you don't need to call it manually, the label constructor calls it (and that's why it's not in the Perl API). The only real reason to use this method manually is if you've disconnected a label manually from the unit, and want to reconnect it back (and I'm not sure if anyone would ever want that). Calling this method repeatedly with the same label and unit will have no effect. Remembering the same label in multiple units is not a good idea.

void forgetLabel(Label *lab);

Make the unit forget a label, so on clearLabels() that label won't be cleared. This is another dangerous low-level method, since only the unit will forget about the label, but the label will still keep the pointer to the unit, unless it's cleared. Because of the danger, it's also not in the Perl API. The reason to use it would be if you want to disassemble and discard a part of the unit without disturbing the rest of it. However a safer alternative is to just create multiple units in one thread and discard a whole unit at a time.

RowType *getEmptyRowType() const;

A convenience method to get a reference to a row type with no fields. Such a row type is useful for the creation of pseudo-labels that have user-defined clearing handlers that clear some user data. This has been described in more detail before.

void setMaxStackDepth(int v);
int maxStackDepth() const;
void setMaxRecursionDepth(int v);
int maxRecursionDepth() const;

Set and get the maximal unit stack depth and recursion depth, works the same as in Perl.

That's it, except for the tracing support. I'll describe that in a separate post.

And there is also a class that can be used to trigger the unit clearing on leaving scope:

UnitClearingTrigger(Unit *unit);

The trigger is an Mtarget, so the typical use would be:

{
  Autoref<UnitClearingTrigger> ctrig = new UnitClearingTrigger(myunit);
  ...
}

At the block exit the Autoref will get destroyed, destroy the trigger, which would in turn cause the clearing of the unit. Of course, you can also place the Autoref into another object, and then the destruction of that object would cause the clearing, instead of the end of the block.

Thursday, November 29, 2012

A snapshot pre-release of 1.1.0

Now the Streaming Functions look wrapped up, and before embarking on the next big feature, it looks like a good time to publish a snapshot. This one is named 1.0.91-20121129, and is available for download now.

As usual with the snapshots, the Developer's Guide in the package has not been updated, instead this blog serves as the interim documentation. The posts with the label 1_1_0 up to now describe all the new features.

Monday, November 26, 2012

Streaming functions and unit boundaries (and TQL guts)

Now let's take a look at the insides of the Tql module. I'll be skipping over the code that is less interesting, you can find the full version in the source code of perl/Triceps/lib/Triceps/X/Tql.pm as always. The constructor is one of these things to be skipped. The initialization part is more interesting:

sub initialize # ($self)
{
    my $myname = "Triceps::X::Tql::initialize";
    my $self = shift;

    return if ($self->{initialized});

    my %dispatch;
    my @labels;
    for (my $i = 0; $i <= $#{$self->{tables}}; $i++) {
        my $name = $self->{tableNames}[$i];
        my $table = $self->{tables}[$i];

        confess "$myname: found a duplicate table name '$name', all names are: "
                . join(", ", @{$self->{tableNames}})
            if (exists $dispatch{$name});

        $dispatch{$name} = $table;
        push @labels, $name, $table->getDumpLabel();
    }

    $self->{dispatch} = \%dispatch;
    $self->{fret} = Triceps::FnReturn->new(
        name => $self->{name} . ".fret",
        labels => \@labels,
    );

    $self->{initialized} = 1;
}

It creates a dispatch table of name-to-table and also an FnReturn that contains the dump labels of all the tables.

Each query will be created as its own unit. It will run, and then get cleared and disposed of, very convenient. By the way, that is the answer to the question of why would someone want to use multiple units in the same thread: for modular disposal.

But the labels in the main unit and the query unit can't be directly connected. A direct connection would create the stable references, and the disposal wouldn't work. That's where the streaming function interface comes to the rescue: it provides a temporary connection. Build the query unit, build a binding for it, push the binding onto the FnReturn of the main unit, run the query, pop the binding, dispose of the query unit.

And the special capacity (or if you will, superpower) of the streaming functions that allows all that is that the FnReturn and FnBinding don't have to be of the same unit. They may be of the different units and will still work together fine.

The query() method then handles the creation of the unit and stuff:

sub query # ($self, $argline)
{
    my $myname = "Triceps::X::Tql::query";

    my $self = shift;
    my $argline = shift;

    confess "$myname: may be used only on an initialized object"
        unless ($self->{initialized});

    $argline =~ s/^([^,]*)(,|$)//; # skip the name of the label
    my $q = $1; # the name of the query itself
    #&Triceps::X::SimpleServer::outCurBuf("+DEBUGquery: $argline\n");
    my @cmds = split_braced($argline);
    if ($argline ne '') {
        # Presumably, the argument line should contain no line feeds, so it should be safe to send back.
        &Triceps::X::SimpleServer::outCurBuf("+ERROR,OP_INSERT,$q: mismatched braces in the trailing $argline\n");
        return
    }

    # The context for the commands to build up an execution of a query.
    # Unlike $self, the context is created afresh for every query.
    my $ctx = {};
    # The query will be built in a separate unit
    $ctx->{tables} = $self->{dispatch};
    $ctx->{fretDumps} = $self->{fret};
    $ctx->{u} = Triceps::Unit->new("${q}.unit");
    $ctx->{prev} = undef; # will contain the output of the previous command in the pipeline
    $ctx->{actions} = []; # code that will run the pipeline
    $ctx->{id} = 0; # a unique id for auto-generated objects

    # It's important to place the clearing trigger outside eval {}. Otherwise the
    # clearing will erase any errors in $@ returned from eval.
    my $cleaner = $ctx->{u}->makeClearingTrigger();
    if (! eval {
        foreach my $cmd (@cmds) {
            #&Triceps::X::SimpleServer::outCurBuf("+DEBUGcmd, $cmd\n");
            my @args = split_braced($cmd);
            my $argv0 = bunquote(shift @args);
            # The rest of @args do not get unquoted here!
            die "No such TQL command '$argv0'\n" unless exists $tqlDispatch{$argv0};
            $ctx->{id}++;
            &{$tqlDispatch{$argv0}}($ctx, @args);
            # Each command must set its result label (even if an undef) into
            # $ctx->{next}.
            die "Internal error in the command $argv0: missing result definition\n"
                unless (exists $ctx->{next});
            $ctx->{prev} = $ctx->{next};
            delete $ctx->{next};
        }
        if (defined $ctx->{prev}) {
            # implicitly print the result of the pipeline, no options
            &{$tqlDispatch{"print"}}($ctx);
        }

        # Now run the pipeline
        foreach my $code (@{$ctx->{actions}}) {
            &$code;
        }

        1; # means that everything went OK
    }) {
        # XXX this won't work well with the multi-line errors
        &Triceps::X::SimpleServer::outCurBuf("+ERROR,OP_INSERT,$q: error: $@\n");
        return
    }
}

Each TQL command is defined as its own method, all of them collected in the %tqlDispatch. query() splits the pipeline and then lets each command build its part of the query, connecting them through $ctx. A command may also register an action to be run later. After everything is built, the actions run and produce the result.

The functions split_braced() and bunquote() are imported from the package Triceps::X::Braced that handles the parsing of the braced nested lists.

Another interesting part is the error reporting, done as a special label "+ERROR". It's actually one of the sticky points of why the code is not of production quality: because the errors may be multi-line, and the SimpleServer protocol really expects everything to be single-line. Properly, some quoting would have to be done.

Moving on, here is the "read" command handler:

sub _tqlRead # ($ctx, @args)
{
    my $ctx = shift;
    die "The read command may not be used in the middle of a pipeline.\n"
        if (defined($ctx->{prev}));
    my $opts = {};
    &Triceps::Opt::parse("read", $opts, {
        table => [ undef, \&Triceps::Opt::ck_mandatory ],
    }, @_);

    my $fret = $ctx->{fretDumps};
    my $tabname = bunquote($opts->{table});

    die ("Read found no such table '$tabname'\n")
        unless (exists $ctx->{tables}{$tabname});
    my $unit = $ctx->{u};
    my $table = $ctx->{tables}{$tabname};
    my $lab = $unit->makeDummyLabel($table->getRowType(), "lb" . $ctx->{id} . "read");
    $ctx->{next} = $lab;

    my $code = sub {
        Triceps::FnBinding::call(
            name => "bind" . $ctx->{id} . "read",
            unit => $unit,
            on => $fret,
            labels => [
                $tabname => $lab,
            ],
            code => sub {
                $table->dumpAll();
            },
        );
    };
    push @{$ctx->{actions}}, $code;
}

It's the only command that registers an action, which sends data into the query unit. The rest of the commands just add more handlers to the pipeline in the unit, and get the data that flows from "read". The action sets up a binding and calls the table dump, to send the data into that binding.

The reading of the tables could have also been done without the bindings, and without the need to bind the units at all: just iterate through the table procedurally in the action. But this whole example has been built largely to showcase that the bindings can be used in this way, so naturally it uses bindings.

The bindings come more useful when the query logic has to react to the normal logic of the main unit, such as in the subscriptions: set up the query, read its initial state, and then keep reading as the state gets updated. But guess what, the subscriptions can't be done with the FnReturns as shown because the FnReturn only sends its data to the last binding pushed onto it. This means, if multiple subscriptions get set up, only the last one will be getting the data. There will be a separate mechanism for that.

running the TQL query server

The code that produced the query output examples from the previous post looks like this:

# The basic table type to be used for querying.
# Represents the trades reports.
our $rtTrade = Triceps::RowType->new(
  id => "int32", # trade unique id
  symbol => "string", # symbol traded
  price => "float64",
  size => "float64", # number of shares traded
) or confess "$!";

our $ttWindow = Triceps::TableType->new($rtTrade)
  ->addSubIndex("bySymbol",
    Triceps::SimpleOrderedIndex->new(symbol => "ASC")
      ->addSubIndex("last2",
        Triceps::IndexType->newFifo(limit => 2)
      )    
  )
  or confess "$!";
$ttWindow->initialize() or confess "$!";

# Represents the static information about a company.
our $rtSymbol = Triceps::RowType->new(
  symbol => "string", # symbol name
  name => "string", # the official company name
  eps => "float64", # last quarter earnings per share
) or confess "$!";

our $ttSymbol = Triceps::TableType->new($rtSymbol)
  ->addSubIndex("bySymbol",
    Triceps::IndexType->newHashed(key => [ "symbol" ])
  )
  or confess "$!";
$ttSymbol->initialize() or confess "$!";

my $uTrades = Triceps::Unit->new("uTrades");
my $tWindow = $uTrades->makeTable($ttWindow, "EM_CALL", "tWindow")
  or confess "$!";
my $tSymbol = $uTrades->makeTable($ttSymbol, "EM_CALL", "tSymbol")
  or confess "$!";

# The information about tables, for querying.
my $tql = Triceps::X::Tql->new(
  name => "tql",
  tables => [
    $tWindow,
    $tSymbol,
  ],
);

my %dispatch;
$dispatch{$tWindow->getName()} = $tWindow->getInputLabel();
$dispatch{$tSymbol->getName()} = $tSymbol->getInputLabel();
$dispatch{"query"} = sub { $tql->query(@_); };
$dispatch{"exit"} = \&Triceps::X::SimpleServer::exitFunc;

Triceps::X::DumbClient::run(\%dispatch);

It's very much like the example shown before in the section 7.8 "Main loop with a socket", with a few differences. Obviously, Tql has been added, and we'll get to that part just in a moment. But the other differences are centered around the way the server and client code has been restructured.

The Triceps::X::DumbClient is a module for testing that starts the server, then starts the client that sends the data to it and reads the result back. Its run method is:

sub run # ($labels)
{
    my $labels = shift;

    my ($port, $pid) = Triceps::X::SimpleServer::startServer(0, $labels);
    my $sock = IO::Socket::INET->new(
        Proto => "tcp",
        PeerAddr => "localhost",
        PeerPort => $port,
    ) or confess "socket failed: $!";
    while(<STDIN>) {
        $sock->print($_);
        $sock->flush();
    }
    $sock->print("exit,OP_INSERT\n");
    $sock->flush();
    $sock->shutdown(1); # SHUT_WR
    while(<$sock>) {
        print($_);
    }
    waitpid($pid, 0);
}

It's really intended only for the very small examples that fit into the TCP buffer, since it sends the whole input before it starts reading the output.

The interesting server things happen inside startServer(), which has also stayed almost the same but became a part of a module. The "almost the same" part is about the server loop being able to dispatch not only to the labels but also to the arbitrary Perl functions, citing from the example:

$dispatch{"query"} = sub { $tql->query(@_); };
$dispatch{"exit"} = \&Triceps::X::SimpleServer::exitFunc;


It recognizes automatically whether the entry in the dispatch table is a Label or a function, and handles them appropriately. In the server it's implemented with:


...
        my $label = $labels->{$lname};
        if (defined $label) {
          if (ref($label) eq 'CODE') {
            &$label($line);
          } else { 
            my $unit = $label->getUnit();
            confess "label '$lname' received from client $id has been cleared"
              unless defined $unit;
            eval {     
              $unit->makeArrayCall($label, @data);
              $unit->drainFrame();
            };         
            warn "input data error: $@\nfrom data: $line\n" if $@;
          }        
        } else {
          warn "unknown label '$lname' received from client $id: $line "
        }      
...

And the exitFunc() method is another way to trigger the server exit, instead of makeExitLabel():
sub exitFunc # ($line)
{
    $srv_exit = 1;
}

As you can see, the dispatched functions receive the whole argument line as the client had sent it, including the label name, rather than having it split by commas. The functions can then do the text parsing in their own way, which comes real handy for TQL. It's convenient for the exit function too, as now there is no need to send the opcode with the "exit" (although X::DumbClient::run() still does send the opcode, to be compatible with the exit label approach, and the extra information doesn't hurt the exit function).


And now, the TQL definition. The TQL object gets created with the definition of the tables, and then the TQL handler function shown above calls the method query() on it:

# The information about tables, for querying.
my $tql = Triceps::X::Tql->new(
  name => "tql",
  tables => [
    $tWindow,
    $tSymbol,
  ],
);


There are multiple ways to create the Tql objects. By default the option "tables" lists all the queryable tables, and their "natural" names will be used in the queries. It's possible to specify the names explicitly as well:

my $tql = Triceps::X::Tql->new(
  name => "tql",
  tables => [
    $tWindow,
    $tSymbol,
    $tWindow,
    $tSymbol,
  ],
  tableNames => [
    "window",
    "symbol",
    $tWindow->getName(),
    $tSymbol->getName(),
  ],
);

This version defines each table under two synonymous names. It's also possible to create a Tql object without tables, and add tables to it later as they are created:

my $tql = Triceps::X::Tql->new(name => "tql");
$tql->addNamedTable(
  window => $tWindow,
  symbol => $tSymbol,
);
# add 2nd time, with different names
$tql->addTable(
  $tWindow,
  $tSymbol,
);
$tql->initialize();

The tables can be added with explicit names or with "natural" names. After all the tables are added, the Tql object has to be initialized. The two ways of creation are mutually exclusive: if the option "tables" is used, the object will be initialized right away in the constructor. If it's not used, the explicit initialization has to be done later. The methods addTable() and addNamedTable() can not be used on an initialized Tql object, and query() can not be used on an uninitialized one.

Sunday, November 25, 2012

TQL: the Trivial Query Language

In the Developer's Guide section 7.8. "Main loop with a socket" I've been showing the execution of the simple queries. I've wanted to use the queries to demonstrate a feature of the streaming functions, so I've substantially extended that example.

Now the query example has grown to have its own language, TQL. You can think of it as a Trivial Query Language or Triceps Query Language. It's trivial, and so far it's of only an example quality, but it's extensible and it already can do some interesting things.

Why not SQL, after all, there are multiple parser building tools available in Perl? Partially, because I wanted to keep it trivial and to avoid introducing extra dependencies, especially just for the examples. Partially, because I don't like SQL. I think that the queries can be expressed much more naturally in the form of shell-like pipelines. Back at DB when I wrote a simple toolkit for querying and comparison of the CSV files (yeah, I didn't find the DBD::CSV module), I've used a pipeline semantics and it worked pretty well. It also did things that are quite difficult with SQL, like mass renaming and reordering of fields, and diffing. Although TQL is not a descendant of the language I've used in that query tool, it is a further development of the pipeline idea.

Syntactically, TQL is very simple: its query is a represented as a nested list, similar to Tcl (or if you like Lisp better, you can think that it's similar to Lisp but with different parentheses). A list is surrounded by curly braces "{}". The elements of a list are either other lists or words, consisting of non-space characters.

{word1 {word21 word22} word3}

Unlike Tcl, there are no quotes in the TQL syntax, the quote characters are just the normal word characters. If you want to include spaces into a word, you use the curly braces instead of the quotes.

{   this is a {brace-enquoted} string with spaces and nested braces  }

Note that the spaces inside a list are used as delimiters and thrown away, but within a brace-quoted word-string they are significant. How do you know which way they will be treated in a particular case? It all depends on what is expected in this case. If the command expects a string as an argument, it will treat it as a string. If the command expects a list as an argument, it will treat it as a list.

What if you need to include an unbalanced brace character inside a string? Escape it with a backslash, "\{". The other usual Perl backslash sequences work too (though in the future TQL may get separated from Perl and then only the C sequences will work, that is to be seen). Any non-alphanumeric characters (including spaces) can be prepended with a backslash too. An important point is that when you build the lists, unlike shell, and like Tcl, you do the backslash escaping only once, when accepting a raw string. After that you can include them into lists of any depth without any extra escapes (and you must not add any extra escapes in the lists).

Unlike shell, you can't combine a single string out of the quoted and unquoted parts. Instead the quoting braces work as implicit separators. For example, if you specify a list as {a{b}c d}, you don't get two strings "abc" and "d", you get four strings "a", "b", "c", "d".

A TQL query is a list that represents a pipeline. Each element of the list is a command. The first command reads the data from a table, and the following commands perform transformations on that data. For example:

{read table tWindow} {project fields {symbol price}} {print tokenized 0}

If the print command is missing at the end of the pipeline, it will be added implicitly, with the default arguments: {print}.

The arguments of each TQL command are always in the option name-value format, very much like the Perl constructors of many Triceps objects. There aren't any arguments in TQL that go by themselves without an option name.

So for example the command "read" above has the option "table" with value "tWindow". The command "project" has an option "fields" with a list value of two elements. In this case the elements are simple words and don't need further quoting. But the extra quoting won't hurt. Say, if you wanted to rename the field "price" to "trade_price", you'd use the Triceps::Fields::filter() syntax for it, and even though the format doesn't contain any spaces and can still be used just as a word, it looks nicer with the extra braces:

{project fields {symbol {price/trade_price} }}

I'm sure that the list of commands and their options will expand and change over time. So far the supported commands are:

read
Defines a table to read from and starts the command pipeline.
Options:
table - name of the table to read from.

project
Projects (and possibly renames) a subset of fields in the current pipeline.
Options:
fields - an array of field definitions in the syntax of Triceps::Fields::filter() (same as in the joins).

print
The last command of the pipeline, which prints the results. If not used explicitly, the query adds this command implicitly at the end of the pipeline, with the default options.
Options:
tokenized (optional) - Flag: print in the name-value format, as in Row::printP(). Otherwise prints only the values in the CSV format. (default: 1)

join
Joins the current pipeline with another table. This is functionally similar to LookupJoin, although the options are closer to JoinTwo.
Options:
table - name of the table to join with. The current pipeline is considered the "left side", the table the "right side". The duplicate key fields on the right side are always excluded from the result, like JoinTwo option (fieldsUniqKey => "left").
rightIdxPath - path name of the table's index on which to join. At the moment there is no way to join without knowing the name of the index. (As usual, the path is an array of nested names).
by (semi-optional) - the join equality condition specified as pairs of fields. Similarly to JoinTwo, it's a single-level array with the fields logically paired: {leftFld1 rightFld1 leftFld2 rightFld2 ...}. Options "by" and "byLeft" are mutually exclusive, and one of them must be present.
byLeft (semi-optional) - the join equality condition specified as a transformation on the left-side field set in the syntax of Triceps::Fields::filter(), with an implicit element {!.*} added at the end. Options "by" and "byLeft" are mutually exclusive, and one of them must be present.
leftFields (optional) - the list of patterns for the left-side fields to pass through and possibly rename, in the syntax of Triceps::Fields::filter(). (default: pass all, with the same name)
rightFields (optional) - the list of patterns for the right-side fields to pass through and possibly rename, in the syntax of Triceps::Fields::filter(). The key fields get implicitly removed before. (default: pass all, with the same name)
type (optional) - type of the join, "inner" or "left". (default: "inner")

where
Filters/selects the rows.
Options:
istrue - a Perl expression, the condition for the rows to pass through. The particularly dangerous constructions are not allowed in the expression, including the loops and the general function calls. The fields of the row are referred to as $%field, these references get translated before the expression is compiled.

Here are some examples of the Tql queries, with results produced from the output of the code examples I'll show in a moment.

> query,{read table tSymbol}
lb1read OP_INSERT symbol="AAA" name="Absolute Auto Analytics Inc" eps="0.5"
+EOD,OP_NOP,lb1read

Reads the stock symbol information table and prints it in the default tokenized format. The result format is a bit messy for now, a mix of tokenized and CSV data. In the previous examples in chapter 7 I've been marking the end-of-data either by a row with opcode OP_NOP or not marking it at all. For the TQL queries I've decided to try out a different approach: send a CSV row on the pseudo-label "+EOD" with the value equal to the name of the label that has been completed. The labels with names starting with "+" are special in this convention; they represent some kind of metadata.

The name "lb1read" in the result rows is coming from an auto-generated label name in TQL. It will probably become less random-looking in the future, but for now I haven't yet figured out the best way to to it.

> query,{read table tWindow} {project fields {symbol price}}
lb2project OP_INSERT symbol="AAA" price="20"
lb2project OP_INSERT symbol="AAA" price="30"
+EOD,OP_NOP,lb2project

Reads the trade window rows and projects the fields "symbol" and "price" from them.

> query,{read table tWindow} {project fields {symbol price}} {print tokenized 0}
lb2project,OP_INSERT,AAA,20
lb2project,OP_INSERT,AAA,30
+EOD,OP_NOP,lb2project

The same, only explicitly prints the data in the CSV format.

> query,{read table tWindow} {where istrue {$%price == 20}}
lb2where OP_INSERT id="3" symbol="AAA" price="20" size="20"
+EOD,OP_NOP,lb2where

Selects the trade window row with price equal to 20.

> query,{read table tWindow} {join table tSymbol rightIdxPath bySymbol byLeft {symbol}}
join2.out OP_INSERT id="3" symbol="AAA" price="20" size="20" name="Absolute Auto Analytics Inc" eps="0.5"
join2.out OP_INSERT id="5" symbol="AAA" price="30" size="30" name="Absolute Auto Analytics Inc" eps="0.5"
+EOD,OP_NOP,join2.out

Reads the trade window and enriches it by joining with the symbol information.

A nice feature of TQL is that it allows combining the operations in the pipeline in any order, repeated any number of times. For example, you can read a table, filter it, join with another table, filter again, join with a third table, filter again and so on. SQL in the same situation has to resort to specially named clauses, for example WHERE filters before grouping and HAVING filters after grouping.
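
For example, a chained pipeline might look like this (a hypothetical query against the same tables as in the examples above; the exact output would depend on the data):

> query,{read table tWindow} {where istrue {$%price > 10}} {join table tSymbol rightIdxPath bySymbol byLeft {symbol}} {where istrue {$%eps > 0.1}} {project fields {symbol price eps}}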

Of course, a typical smart SQL compiler would determine the earliest application point for each WHERE sub-expression and build a similar pipeline. But TQL allows keeping the compiler trivial, following the explicit pipelining in the query. And nothing really prevents a smart TQL compiler either; it could just as well analyze, split and reorder the pipeline stages.

Saturday, November 24, 2012

broken blog label

I've found out that the label "c++" doesn't work correctly. Weird thing, it does work on the post list in the blog editing mode, but not in the blog reader. I guess, the handling of the non-word characters got screwed up somewhere in the reader.

To work around this issue, I'll be using the label "cpp" instead from now on.

Tuesday, November 20, 2012

Table dump

Another intermediate step for the example I'm working on is the table dumping. It allows iterating over a table in a functional manner.

A new label "dump" is added to the table and its FnReturn. Whenever the method dumpAll() is called, it sends the whole contents of the table to that label. Then you can set a binding on the table's FnReturn, call dumpAll(), and the binding will iterate through the whole table's contents.

The grand plan is also to add the dumping by a condition that selects a sub-index, but it's not implemented yet.

It's also possible to dump in an alternative order: dumpAllIdx() can send the rows in the order of any index, rather than the default first leaf index.

If you want to get the dump label explicitly, you can do it with

my $dlab = $table->getDumpLabel();

Normally the only reason to do that would be to add it to another FnReturn (besides the table's FnReturn). Chaining anything else directly to this label would not make much sense, because the dump of the table can be called from many places, and the directly chained label will receive data every time the dump is called.
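
For instance, here is a minimal sketch of including the dump label into a custom FnReturn along with the table's output label (the FnReturn name "myFret" is made up for the example):

my $fret = Triceps::FnReturn->new(
    name => "myFret",
    labels => [
        dump => $table->getDumpLabel(),
        out => $table->getOutputLabel(),
    ],
);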

The typical usage looks like this:

    Triceps::FnBinding::call(
        name => "iterate",
        on => $table->fnReturn(),
        unit => $unit,
        labels => [
            dump => sub { ... },
        ],
        code => sub {
            $table->dumpAll();
        },
    );

It's less efficient than the normal iteration but sometimes comes handy.

Normally the rowops are sent with the opcode OP_INSERT. But the opcode can also be specified explicitly:

$table->dumpAll($opcode);

The alternative order can be achieved with:

$table->dumpAllIdx($indexType);
$table->dumpAllIdx($indexType, $opcode);

As usual, the index type must belong to the exact type of this table. For example:

$table->dumpAllIdx($table->getType()->findIndexPath("cb"), "OP_NOP");

And some more interesting examples will be forthcoming later.

Thursday, November 15, 2012

test code restructuring

There is one more feature of the streaming functions I want to show, but it requires a bit of work. The feature itself is small and easy, but a good way to show it in an example is with the user queries on a socket, which is a little involved.

With that goal in mind, so far I've done some restructuring. Not being able to share the code between the example files has been an inconvenience, requiring either putting everything that uses a certain code fragment into one file, or copying that fragment around.

Now there is a place for such code, collected under the namespace Triceps::X. X can be thought of as a mark of eXperimental, eXample, eXtraneous code. This code is not exactly of production quality but is good enough for the examples, and can be used as a starting point for the development of better code. Quite a few fragments of Triceps went this way: the joins were done as an example first and then solidified for the main code base, and so was the aggregation.

The socket-handling examples discussed in the section 7.8. "Main loop with a socket" of the manual have been moved there. The server part became Triceps::X::SimpleServer, and the client part became Triceps::X::DumbClient. More to be added soon.

Another module that got extracted is Triceps::X::TestFeed. It's a small infrastructure to run the examples, pretending that it gets the input from stdin and sends the output to stdout, while actually doing it all in memory. I haven't been discussing it much, but all of the more complicated examples have been written to use it. It also shows up once in a while in the blog when I forget to edit the code to pretend that it uses stdin/stdout: then a &readLine shows up instead of <STDIN>, and a &send instead of print. (For the manual I have a script that does these substitutions automatically when I insert the code examples into it.)
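
Roughly, the correspondence looks like this (a made-up fragment, just to show the shape of the substitution); an example file contains:

my $line = &readLine;
&send("result\n");

and after the substitution the manual shows it as:

my $line = <STDIN>;
print("result\n");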

Saturday, November 10, 2012

Streaming functions and template results

The same way as the FnReturns can be used to get back the direct results of the operations on the tables, they can also be used on the templates in general. Indeed, it's a good idea to have a method that would create an FnReturn in all the templates. So I went ahead and added it to the LookupJoin, JoinTwo and Collapse.

For the joins, the resulting FnReturn has one label "out". It's created similarly to the table's:

my $fret = $join->fnReturn();

And then it can be used as usual. The implementation of this method is fairly simple:

sub fnReturn # (self)
{
    my $self = shift;
    if (!defined $self->{fret}) {
        $self->{fret} = Triceps::FnReturn->new(
            name => $self->{name} . ".fret",
            labels => [
                out => $self->{outputLabel},
            ],
        );
    }
    return $self->{fret};
}

All this kind of makes the method lookup() of LookupJoin redundant, since now pretty much all the same can be done with the streaming function API, and even better: it provides the opcodes on the rowops, can handle the full processing, and delivers the rowops one by one without necessarily creating an array. But it may yet turn out that lookup() has some more convenient uses too, so I haven't removed it.
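
For example, a rough sketch of such a lookup done through the streaming function API might look like this (the unit $unit, the key field "symbol" and the result printing are assumptions made up for the illustration):

Triceps::FnBinding::call(
    name => "lookupCall",
    on => $join->fnReturn(),
    unit => $unit,
    labels => [
        # print each row produced by the lookup
        out => sub { print($_[1]->getRow()->printP(), "\n"); },
    ],
    code => sub {
        # send the lookup request into the join
        $unit->makeHashCall($join->getInputLabel(), "OP_INSERT",
            symbol => "AAA",
        );
    },
);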

For Collapse the interface is a little more complicated: the FnReturn contains a label for each data set, named the same as the data set. The order of labels follows the order of the data set definitions (though right now it's kind of moot, because only one data set is supported). The implementation is:

sub fnReturn # (self)
{
    my $self = shift;
    if (!defined $self->{fret}) {
        my @labels;
        for my $n (@{$self->{dsetnames}}) {
            push @labels, $n, $self->{datasets}{$n}{lbOut};
        }
        $self->{fret} = Triceps::FnReturn->new(
            name => $self->{name} . ".fret",
            labels => \@labels,
        );
    }  
    return $self->{fret};
}   

It uses the new element $self->{dsetnames} that wasn't present in the code shown before. I've added it now to keep the array of data set names in the order they were defined.

Use these examples to write the fnReturn() in your templates.

Wednesday, November 7, 2012

Streaming functions and tables

The Copy Tray used in the tables in the version 1.0 was really a precursor to the streaming functions. Now that the full-blown streaming functions have been worked out, there is no sense in keeping the copy trays any more, so I've removed them.

Instead, I've added a Table method that gets the FnReturn for that table:

$fret = $table->fnReturn();

The return contains the labels "pre", "out", and the named labels for each aggregator. The FnReturn object is created on the first call of this method and is kept in the table. All the following calls return the same object. This has some interesting consequences for the "pre" label: the rowop for the "pre" label doesn't get created at all if there is nothing chained from that label. But when the FnReturn gets created, one of its labels gets chained from the "pre" label. Which means that once you call $table->fnReturn() for the first time, you will see that table's "pre" label called in all the traces. It's not a huge extra overhead, but still something to keep in mind and not be surprised when calling fnReturn() changes all your traces.

The produced FnReturn then gets used as any other one. If you use it with an FnBinding that has withTray => 1, you get an improved equivalent of the Copy Tray. For example:

$fret2 = $t2->fnReturn();
$fbind2 = Triceps::FnBinding->new(
    unit => $u1,
    name => "fbind2",
    on => $fret2,
    withTray => 1,
    labels => [
        out => sub { }, # another way to make a dummy
    ],
);

$fret2->push($fbind2);
$t2->insert($r2);
$fret2->pop($fbind2);

# $ctr is the Copy Tray analog
$ctr = $fbind2->swapTray(); # get the updates on an insert

Of course, most of the time you would not want to make a dummy label and then iterate manually through the copy tray. You would want to create bindings to the actual next logical labels and simply execute them, immediately or delayed with a tray.
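
For example, a quick sketch of that approach (the label $lbNext here is an assumption, standing in for whatever the next logical label happens to be, with a matching row type):

$fbind3 = Triceps::FnBinding->new(
    unit => $u1,
    name => "fbind3",
    on => $t2->fnReturn(),
    labels => [
        out => $lbNext, # forward the table updates directly
    ],
);

{
    my $ab = Triceps::AutoFnBind->new($t2->fnReturn() => $fbind3);
    $t2->insert($r2); # the "out" rowops get forwarded to $lbNext right away
}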

Wednesday, October 24, 2012

Streaming functions and recursion, part 6

The combination of the two previous examples (the one with the trays and the one with the forks) doesn't work. They could be mechanically combined, but the combination just doesn't work right.

The problem is that the example with trays relies on the recursive function being executed before the tray gets called. But if both of them are forked, things break.

Well, if there is only one recursive call, it still works because the execution frame looks like this:

arg1
pop1

The rowop arg1 executes and places the result into the tray (provided that it calls the FnReturn label rather than forking to it!). Then the rowop pop1 executes and calls the tray. So far so good.

Now let's do the recursion with depth two. The first level starts the same:

arg1
pop1

Then arg1 executes and forks the second level of recursion:

pop1
arg2
pop2

Do you see what went wrong? The unit execution frames are FIFO. So the second level of recursion got queued after the popping of the first level. That pop1 executes next, doesn't get any return values, and everything goes downhill from there.

Streaming functions and recursion, part 5

And there is also a way to run the recursive calls without even the need to increase the recursion depth limit. It can be left at the default 1, without setMaxRecursionDepth(). The secret is to fork the argument rowops to the functions instead of calling them.

###
# A streaming function that computes a Fibonacci number.

# Input:
#   $lbFibCompute: request to compute the number.
# Output (by FnReturn labels):
#   "result": the computed value.
# The opcode is preserved through the computation.

my @stackFib; # stack of the function states
my $stateFib; # The current state

my $frFib = Triceps::FnReturn->new(
    name => "Fib",
    unit => $uFib,
    labels => [
        result => $rtFibRes,
    ],
    onPush => sub { push @stackFib, $stateFib; $stateFib = { }; },
    onPop => sub { $stateFib = pop @stackFib; },
);

my $lbFibResult = $frFib->getLabel("result");

# Declare the label & binding variables in advance, to define them sequentially.
my ($lbFibCompute, $fbFibPrev1, $fbFibPrev2);
$lbFibCompute = $uFib->makeLabel($rtFibArg, "FibCompute", undef, sub {
    my $row = $_[1]->getRow();
    my $op = $_[1]->getOpcode();
    my $idx = $row->get("idx");

    if ($idx <= 1) {
        $uFib->fork($frFib->getLabel("result")->makeRowopHash($op,
            idx => $idx,
            fib => $idx < 1 ? 0 : 1,
        ));
    } else {
        $stateFib->{op} = $op;
        $stateFib->{idx} = $idx;

        $frFib->push($fbFibPrev1);
        $uFib->fork($lbFibCompute->makeRowopHash($op,
            idx => $idx - 1,
        ));
    }
}) or confess "$!";
$fbFibPrev1 = Triceps::FnBinding->new(
    unit => $uFib,
    name => "FibPrev1",
    on => $frFib,
    labels => [
        result => sub {
            $frFib->pop($fbFibPrev1);

            $stateFib->{prev1} = $_[1]->getRow()->get("fib");

            # must prepare before pushing new state and with it new $stateFib
            my $rop = $lbFibCompute->makeRowopHash($stateFib->{op},
                idx => $stateFib->{idx} - 2,
            );

            $frFib->push($fbFibPrev2);
            $uFib->fork($rop);
        },
    ],
);
$fbFibPrev2 = Triceps::FnBinding->new(
    unit => $uFib,
    on => $frFib,
    name => "FibPrev2",
    labels => [
        result => sub {
            $frFib->pop($fbFibPrev2);

            $stateFib->{prev2} = $_[1]->getRow()->get("fib");
            $uFib->fork($frFib->getLabel("result")->makeRowopHash($stateFib->{op},
                idx => $stateFib->{idx},
                fib => $stateFib->{prev1} + $stateFib->{prev2},
            ));
        },
    ],
);

# End of streaming function
###

This is a variation of the example before last, with the split push and pop. The split is required for the fork to work: when the forked rowop executes, the calling label has already returned, so obviously the scoped approach won't work.

In this version the unit stack depth required to compute the 6th (and any) Fibonacci number reduces to 2: it's really only one level on top of the outermost frame.

Streaming functions and recursion, part 4

Following up on the previous installment, here is the example that uses the bindings with tray:

###
# A streaming function that computes a Fibonacci number.

# Input:
#   $lbFibCompute: request to compute the number.
# Output (by FnReturn labels):
#   "result": the computed value.
# The opcode is preserved through the computation.

my @stackFib; # stack of the function states
my $stateFib; # The current state

my $frFib = Triceps::FnReturn->new(
    name => "Fib",
    unit => $uFib,
    labels => [
        result => $rtFibRes,
    ],
    onPush => sub { push @stackFib, $stateFib; $stateFib = { }; },
    onPop => sub { $stateFib = pop @stackFib; },
);

my $lbFibResult = $frFib->getLabel("result");

# Declare the label & binding variables in advance, to define them sequentially.
my ($lbFibCompute, $fbFibPrev1, $fbFibPrev2);
$lbFibCompute = $uFib->makeLabel($rtFibArg, "FibCompute", undef, sub {
    my $row = $_[1]->getRow();
    my $op = $_[1]->getOpcode();
    my $idx = $row->get("idx");

    if ($idx <= 1) {
        $uFib->makeHashCall($frFib->getLabel("result"), $op,
            idx => $idx,
            fib => $idx < 1 ? 0 : 1,
        );
    } else {
        $stateFib->{op} = $op;
        $stateFib->{idx} = $idx;

        {
            my $ab = Triceps::AutoFnBind->new(
                $frFib => $fbFibPrev1
            );
            $uFib->makeHashCall($lbFibCompute, $op,
                idx => $idx - 1,
            );
        }
        $fbFibPrev1->callTray();
    }
}) or confess "$!";
$fbFibPrev1 = Triceps::FnBinding->new(
    unit => $uFib,
    name => "FibPrev1",
    on => $frFib,
    withTray => 1,
    labels => [
        result => sub {
            $stateFib->{prev1} = $_[1]->getRow()->get("fib");

            # must prepare before pushing new state and with it new $stateFib
            my $rop = $lbFibCompute->makeRowopHash($stateFib->{op},
                idx => $stateFib->{idx} - 2,
            );

            {
                my $ab = Triceps::AutoFnBind->new(
                    $frFib => $fbFibPrev2
                );
                $uFib->call($rop);
            }
            $fbFibPrev2->callTray();
        },
    ],
);
$fbFibPrev2 = Triceps::FnBinding->new(
    unit => $uFib,
    on => $frFib,
    name => "FibPrev2",
    withTray => 1,
    labels => [
        result => sub {
            $stateFib->{prev2} = $_[1]->getRow()->get("fib");
            $uFib->makeHashCall($frFib->getLabel("result"), $stateFib->{op},
                idx => $stateFib->{idx},
                fib => $stateFib->{prev1} + $stateFib->{prev2},
            );
        },
    ],
);

# End of streaming function
###

The stack depth is now greatly reduced because the unit stack pops the frames before pushing more of them. For the 2nd Fibonacci number the trace is:

unit 'uFib' before label 'FibCompute' op OP_DELETE {
unit 'uFib' before label 'FibCompute' op OP_DELETE {
unit 'uFib' before label 'Fib.result' op OP_DELETE {
unit 'uFib' after label 'Fib.result' op OP_DELETE }
unit 'uFib' after label 'FibCompute' op OP_DELETE }
unit 'uFib' before label 'FibPrev1.result' op OP_DELETE {
unit 'uFib' before label 'FibCompute' op OP_DELETE {
unit 'uFib' before label 'Fib.result' op OP_DELETE {
unit 'uFib' after label 'Fib.result' op OP_DELETE }
unit 'uFib' after label 'FibCompute' op OP_DELETE }
unit 'uFib' before label 'FibPrev2.result' op OP_DELETE {
unit 'uFib' before label 'Fib.result' op OP_DELETE {
unit 'uFib' before label 'FibCall.result' (chain 'Fib.result') op OP_DELETE {
unit 'uFib' after label 'FibCall.result' (chain 'Fib.result') op OP_DELETE }
unit 'uFib' after label 'Fib.result' op OP_DELETE }
unit 'uFib' after label 'FibPrev2.result' op OP_DELETE }
unit 'uFib' after label 'FibPrev1.result' op OP_DELETE }
unit 'uFib' after label 'FibCompute' op OP_DELETE }

For the 6th number the maximal required stack depth now gets reduced to only 9 instead of 51.

Streaming functions and recursion, part 3

FnBinding::call() with closures is easy to use but it creates a closure and an FnBinding object on each run. Can things be rearranged to reuse the same objects? With some effort, they can:

###
# A streaming function that computes a Fibonacci number.

# Input:
#   $lbFibCompute: request to compute the number.
# Output (by FnReturn labels):
#   "result": the computed value.
# The opcode is preserved through the computation.

my @stackFib; # stack of the function states
my $stateFib; # The current state

my $frFib = Triceps::FnReturn->new(
    name => "Fib",
    unit => $uFib,
    labels => [
        result => $rtFibRes,
    ],
    onPush => sub { push @stackFib, $stateFib; $stateFib = { }; },
    onPop => sub { $stateFib = pop @stackFib; },
);

my $lbFibResult = $frFib->getLabel("result");

# Declare the label & binding variables in advance, to define them sequentially.
my ($lbFibCompute, $fbFibPrev1, $fbFibPrev2);
$lbFibCompute = $uFib->makeLabel($rtFibArg, "FibCompute", undef, sub {
    my $row = $_[1]->getRow();
    my $op = $_[1]->getOpcode();
    my $idx = $row->get("idx");

    if ($idx <= 1) {
        $uFib->makeHashCall($frFib->getLabel("result"), $op,
            idx => $idx,
            fib => $idx < 1 ? 0 : 1,
        );
    } else {
        $stateFib->{op} = $op;
        $stateFib->{idx} = $idx;

        $frFib->push($fbFibPrev1);
        $uFib->makeHashCall($lbFibCompute, $op,
            idx => $idx - 1,
        );
    }
}) or confess "$!";
$fbFibPrev1 = Triceps::FnBinding->new(
    unit => $uFib,
    name => "FibPrev1",
    on => $frFib,
    labels => [
        result => sub {
            $frFib->pop($fbFibPrev1);

            $stateFib->{prev1} = $_[1]->getRow()->get("fib");

            # must prepare before pushing new state and with it new $stateFib
            my $rop = $lbFibCompute->makeRowopHash($stateFib->{op},
                idx => $stateFib->{idx} - 2,
            );

            $frFib->push($fbFibPrev2);
            $uFib->call($rop);
        },
    ],
);
$fbFibPrev2 = Triceps::FnBinding->new(
    unit => $uFib,
    on => $frFib,
    name => "FibPrev2",
    labels => [
        result => sub {
            $frFib->pop($fbFibPrev2);

            $stateFib->{prev2} = $_[1]->getRow()->get("fib");
            $uFib->makeHashCall($frFib->getLabel("result"), $stateFib->{op},
                idx => $stateFib->{idx},
                fib => $stateFib->{prev1} + $stateFib->{prev2},
            );
        },
    ],
);

# End of streaming function
###

The rest of the code stays the same, so I won't copy it here.

The computation still needs to keep the intermediate results of two recursive calls. With no closures, these results have to be kept in a global object $stateFib (which is a hash that keeps multiple values).

But it can't just be a single object! The recursive calls would overwrite it. So it has to be built into a stack of objects, a new one pushed for each call and popped after it. This pushing and popping can be tied to the pushing and popping of the bindings on an FnReturn. When the FnReturn is defined, the options onPush and onPop define the custom Perl code to execute, which is used here for the management of the state stack.

The whole logic is then split into the sections around the calls:
  • before the first call
  • between the first and second call
  • after the second call
The first section goes as a normal label and the rest are done as bindings.

A tricky moment is that a simple scoped AutoFnBind can't be used here. The pushing of the binding happens in the calling label (such as FibCompute) but then the result is processed in another label (such as FibPrev1.result). The procedural control won't return to FibCompute until after FibPrev1.result has been completed. But FibPrev1.result needs the state popped before it can do its work! So the pushing and popping of the binding is done explicitly in two split steps: push() called in FibCompute() and pop() called in FibPrev1.result. And of course then after FibPrev1.result saves the result, it pushes the binding again, which then gets popped in FibPrev2.result.

The popping can also be done without arguments, as pop(), but if it's given an argument, it will check that the binding popped is the same as its argument. This is helpful for detecting the call stack corruptions.

Now, can you guess what depth of the unit call stack is required to compute and print the 2nd Fibonacci number? It's 7. If the tracing is enabled, it will produce this trace:

unit 'uFib' before label 'FibCompute' op OP_DELETE {
unit 'uFib' before label 'FibCompute' op OP_DELETE {
unit 'uFib' before label 'Fib.result' op OP_DELETE {
unit 'uFib' before label 'FibPrev1.result' (chain 'Fib.result') op OP_DELETE {
unit 'uFib' before label 'FibCompute' op OP_DELETE {
unit 'uFib' before label 'Fib.result' op OP_DELETE {
unit 'uFib' before label 'FibPrev2.result' (chain 'Fib.result') op OP_DELETE {
unit 'uFib' before label 'Fib.result' op OP_DELETE {
unit 'uFib' before label 'FibCall.result' (chain 'Fib.result') op OP_DELETE {
unit 'uFib' after label 'FibCall.result' (chain 'Fib.result') op OP_DELETE }
unit 'uFib' after label 'Fib.result' op OP_DELETE }
unit 'uFib' after label 'FibPrev2.result' (chain 'Fib.result') op OP_DELETE }
unit 'uFib' after label 'Fib.result' op OP_DELETE }
unit 'uFib' after label 'FibCompute' op OP_DELETE }
unit 'uFib' after label 'FibPrev1.result' (chain 'Fib.result') op OP_DELETE }
unit 'uFib' after label 'Fib.result' op OP_DELETE }
unit 'uFib' after label 'FibCompute' op OP_DELETE }
unit 'uFib' after label 'FibCompute' op OP_DELETE }

9 labels get called in a sequence, all the way from the initial call to the result printing, and only then does the whole sequence unroll back. 3 of them are chained through the bindings, so they don't push new frames onto the stack, and there is always the outermost stack frame, giving the result of 9-3+1 = 7. This number grows fast. For the 6th number the number of labels becomes 75 and the frame count 51.

It happens because all the calls get unrolled into a single sequence, like what I've warned against in the section on the loops. The function return does unroll its FnReturn stack but doesn't unroll the unit call stack; it just goes even deeper by calling the label that processes it.

There are ways to improve it. The simplest one is to use the FnBinding with a tray, and call this tray after the function completely returns. This works out quite conveniently in two other ways too: first, AutoFnBind with its scoped approach can be used again; and second, it allows handling the situations where a function returns not just one row but multiple of them. That will be the next example.