Sergey Babkin on CEP and stuff: index

Showing posts with label index_type. Show all posts

Friday, April 17, 2015

Ordered Index implemented in C++

The ordered index implemented in Perl has been pretty slow, so I've finally got around to implementing one in C++. It became much faster. Here is an excerpt from the performance test results (with an optimized build):

Table insert makeRowArray (single hashed idx, direct) 0.551577 s, 181298.48 per second.
Excluding makeRowArray 0.311936 s, 320578.44 per second.
Table insert makeRowArray (single ordered int idx, direct) 0.598462 s, 167095.09 per second.
Excluding makeRowArray 0.358821 s, 278690.37 per second.
Table insert makeRowArray (single ordered string idx, direct) 1.070565 s, 93408.64 per second.
Excluding makeRowArray 0.830924 s, 120347.91 per second.
Table insert makeRowArray (single perl sorted idx, direct) 21.810374 s, 4584.97 per second.
Excluding makeRowArray 21.570734 s, 4635.91 per second.
Table lookup (single hashed idx) 0.147418 s, 678342.02 per second.
Table lookup (single ordered int idx) 0.174585 s, 572785.77 per second.
Table lookup (single ordered string idx) 0.385963 s, 259092.38 per second.
Table lookup (single perl sorted idx) 6.840266 s, 14619.32 per second.

Here the "ordered" is the new ordered index in C++ (on an integer field and on a string field), and "perl sorted" is the ordering in Perl, on an integer field. Even though the ordered index is a little slower than the hashed index, it's about 40-70 times faster than the ordering implemented in Perl.

Naturally, ordering by a string field is slower than by an integer one, since it not only has to compare more bytes one-by-one but also honors the locale-specific collation order.

In Perl, the ordered index type is created very similarly to the hashed index type:

Triceps::IndexType->newOrdered(key => [ "a", "!b" ])

The single option "key" specifies the array with the names key fields of the index. The "!" in front of the field name says that the order by this field is descending, and otherwise the order is ascending. This is more compact and arguably easier-to-read format than the one used by the SimpleOrderedIndex.

As usual, the NULL values in the key fields are permitted, and are considered less than any non-NULL value.

The array fields may also be used in the ordered indexes.

The ordered index does return the list of key fields with getKey(), so it can be used in joins and can be found by key fields with TableType::findIndexPathForKeys() and friends, just like the hashed index.

The getKey() for an ordered index returns the list of plain field names, without the indications of descending order. Thus the finding of the index by key fields just works out of the box, unchanged.

To get the key fields as they were specified in the index creation, including the possible "!", use the new method:

@fields = $indexType->getKeyExpr();

The idea of this method is that the contents of the array returned by it depends on the index type and is an "expression" that can be used to build another instance of the same index type. For the hashed index it simply returns the same data as getKey(). For the ordered index it returns the list of keys with indications. For the indexes with Perl conditions it still returns nothing, though in the future might be used to store the condition.

In the C++ API the index type gets created with the constructor:

OrderedIndexType(NameSet *key = NULL);
static OrderedIndexType *make(NameSet *key = NULL);

The key can also be set after construction:

OrderedIndexType *setKey(NameSet *key);

Same as in Perl, the field names in the NameSet can be prefixed with a "!" to specify the descending order on that field.

The new method in the IndexType is:

virtual const NameSet *getKeyExpr() const;

And the new constant for the index id is IT_ORDERED, available in both C++ and Perl.

Thursday, August 1, 2013

NameSet update

While editing the docs, I've realized that the NameSet class being a single-threaded Starget while it's used inside the multithreaded Mtarget IndexType, is not a good thing. So I've upgraded it to an Mtarget as well.

The other related item is the method IndexType::getKey(). There is no good reason to return an Autoref<NameSet> from it, a pointer is just as good and even better. The new prototype for it is:

virtual const NameSet *getKey() const;

Thursday, July 11, 2013

no more explicit confessions

It's official: all the code has been converted to the new error handling. Now if anything goes wrong, the Triceps Perl calls just confess right away. No more need for the pattern 'or confess "$!"' that was used throughout the code (though of course you can still use it for handling the other errors).

It also applies to the error checks done by the XS typemaps, these will also confess automatically.

I've also added one more method that doesn't confess: IndexType::getTabtypeSafe(). If the index type is not set into a table type, it will silently return an undef without any error indications.

On a related note, the construction of the Type subclasses has been made nicer in the C++: instead of calling abort() on the major errors, they now throw Exceptions. Mind you, these exceptions are thrown not in the constructors as such but in the chainable methods that set the contents of the types. And they try to be smart enough to preserve the reference count correctness: if the object was not assigned into any reference yet (as is typical for the chained calls), they take care to temporarily increase and decrease the reference count, thus freeing the object, before throwing. Of course, the default reaction to Exceptions is still to dump core, but need be, these exceptions can be caught.

Saturday, July 6, 2013

findSubIndexSafe

The scalar leakage in Carp::confess was causing an unpleasant issue with the functions that were trying to look up the nested indexes and catch when they went missing. So, similarly to the string conversions, I've added the method findSubIndexSafe() to the TableType and IndexType:

$ixt = $tt->findSubIndexSafe($name);
$ixt = $ixt_parent->findSubIndexSafe($name);

If the name is not found, it silently returns an undef, without setting any error codes.

Eventually the issue with the leakage in confess() would have to be fixed, but for now it's a good enough plug for the most typical cases. I'll need to think about other methods that could use the safe treatment.

Monday, June 3, 2013

the many ways to do a copy

So far the way to copy an index type or table type was with the method copy(), in both Perl and C++. It copies the items, and as needed its components, but tries to share the objects that can be shared (such as the row types). But the multithreading support required more kinds of copying.

The Perl method TableType::copyFundamental() copies a table type with only a limited subset of its index types, and excludes all the aggregators. It is implemented in Perl, and if you look in lib/Triceps/TableType.pm, you can find that it starts by making a new table type with the same row type, and then one by one adds the needed index types to it. It needs index types without any aggregators or nested index types, and thus there is a special method for doing this kind of copies:

$idxtype2 = $idxtype->flatCopy();

The "flat" means exactly what it looks like: copy just the object itself without any connected hierarchies.

On the C++ level there is no such method, instead there is an optional argument to IndexType::copy():

virtual IndexType *copy(bool flat = false) const;

So if you want a flat copy, you call

idxtype2 = idxtype->copy(true);

There is no copyFundamental() on the C++ level, though it probably should be, and should be added in the future. For now, if you really want it, you can make it yourself by copying the logic from Perl.

In the implementation of the index types, this argument "flat" mostly just propagates to the base class, without a whole lot needed from the subclass. For example, this is how it works in the SortedIndexType:

IndexType *SortedIndexType::copy(bool flat) const
{
    return new SortedIndexType(*this, flat);
}

SortedIndexType::SortedIndexType(const SortedIndexType &orig, bool flat) :
    TreeIndexType(orig, flat),
    sc_(orig.sc_->copy())
{ }

The base class TreeIndexType takes care of everything, all the subclass needs to do is carry the flat argument to it.

The next kind of copying is exactly the opposite: it copies the whole table type or index type, including all the objects involved, including the row types or such. It is used for passing the table types through the nexus.

Remember that all these objects are reference-counted. Whenever a Row object is created or deleted in Perl, it increase or decreases the reference of the RowType. If the same RowType object is shared between multiple threads, they will have a contention for this atomic counter, or at the very least will be shuttling the cache line with it back and forth between the CPUs. It's more efficient to give each thread its own copy of a RowType, and then it can stay stuck in one CPU's cache.

So wherever a table type is exported into a nexus, it's deep-copied (and when a row type is exported into a nexus, it's simply copied, and so are the row types of the labels in the nexus). Then when a nexus is imported, the types are again deep-copied into the thread's facet.

But there is one more catch. Suppose we have a label and a table type that use the same row type. Whenever a row coming from a label is inserted into the table (in Perl), the row types of the row (i.e. the label's row type in this case) and of the table are checked for the match. If the row type is the same, this check is very quick. But if the same row type gets copied and then different copies are used for the label and for the table, the check will have to go and actually compare the contents of these row types, and will be much slower. To prevent this slowness, the deep copy has to be smart: it must be able to copy a bunch of things while preserving the identity of the underlying row types. If it's given a label and then a table type, both referring to the same row type, it will copy the row type from the label, but then when copying the table type it will realize that it had already seen and copied the table type, and will reuse its first type. And the same applies even within a table: it may have multiple references to the same row type from the aggregators, and will be smart enough to figure out if they are the same, and copy the same row type once.

This smartness stays mostly undercover in Perl. When you import a facet, it will do all the proper copying, sharing the row types. (Though of course if you export the same row type through two separate nexuses, and then import them both into another thread, these facets will not share the types between them any more). There is a method TableType::deepCopy() but it was mostly intended for testing and it's self-contained: it will copy one table type with the correct row type sharing inside it but it won't do the sharing between two table types.

All the interesting uses of the deepCopy() are at the C++ level. It's used all over the place: for the TableType, IndexType, AggregatorType, SortedIndexCondition and RowSetType (the type of FnReturn and FnBinding). If you decide to create your own subclass of these classes, you need to implement the deepCopy() (and of course the normal copy()) for it.

Its prototype generally looks like this (substitute the correct return type as needed):

virtual IndexType *deepCopy(HoldRowTypes *holder) const;

The HoldRowTypes object is what takes care of sharing the underlying row types. To copy a bunch of objects with sharing, you create a HoldRowTypes, copy the bunch, destroy the HoldRowTypes.

For example, the Facet copies the table types from a Nexus like this:

Autoref<HoldRowTypes> holder = new HoldRowTypes;

for (TableTypeMap::iterator it = nx->tableTypes_.begin(); it != nx->tableTypes_.end(); ++it)
    tableTypes_[it->first] = it->second->deepCopy(holder);

It also uses the same holder to copy the exported row types and the labels. A row type gets copied through a holder like this:

Autoref <RowType> rt2 = holder->copy(rt);

For all the other purposes, HoldRowTypes is an opaque object.

A special feature is that you can pass the holder pointer as NULL, and the deep copy will still work, only it obviously won't be able to share the underlying row types. So if you don't care about sharing, you can always use NULL as an argument. It even works for the direct copying:

HoldRowTypes *holder = NULL;
Autoref <RowType> rt2 = holder->copy(rt);

Its method copy() is smart enough to recognize the this being NULL and do the correct thing.

In the classes, the deepCopy() is typically implemented like this:

IndexType *SortedIndexType::deepCopy(HoldRowTypes *holder) const
{
    return new SortedIndexType(*this, holder);
}

SortedIndexType::SortedIndexType(const SortedIndexType &orig, HoldRowTypes *holder) :
    TreeIndexType(orig, holder),
    sc_(orig.sc_->deepCopy(holder))
{ }

The wrapper passes the call to the deep-copy constructor with a holder which in turn propagates the deep-copying to all the components using their constructor with a holder. Of course, if some component doesn't have any RowType references in it, it doesn't need a constructor with a holder, and can be copied without it. But again, the idea of the deepCopy() it to copy as deep as it goes, without sharing any references with the original.

Friday, May 24, 2013

how to find an index

When working on the next example (still in progress but now getting closer to completion), I've solved one more nagging little problem: finding an index type in the table type by key fields. It makes the joins more convenient: if you know, by which fields you want to join, it's nice to find the correct index automatically. My previous solution to this problem has been to go in the opposite direction: specify the indexes for the join, and find the list of fields automatically from there. But that created a problem with deciding, which field on the left to pair with which on the right, and so either the indexes had to be carefully controlled to have the key fields in the same order, or a separate option still had to be used to specify the pairing of the fields.

Now this problem is solved with the method:

@idxPath = $tableType-> findIndexPathForKeys(@keyFields);

It returns the array that represents the path to an index type that matches these key fields (the index type and all the types in the path still have to be of the Hashed variety). If the correct index can not be found, an empty array is returned. If you specify the fields that aren't present in the row type in the first place, this is simply treated the same as being unable to find an index for these fields.

If more that one index would match, the first one found in the direct order of the index tree walk is returned.

This allowed to change the LookupJoin to make the option "rightIdxPath" optional: if undefined, the index that matches the key fields specified in the option "by" or "byLeft" will be found automatically. Of course, if such an index doesn't exist, it's still an error, the join could not create it for you yet.

The same way, in the JoinTwo now the options "leftIdxPath" and "rightIdxPath" have become optional if either "by" or "byLeft" is specified. The old way of specifying "leftIdxPath" and "rightIdxPath" without any of the "by" varieties still works too.

And this propagated into the TQL command "join": no more need for the option "rightIdxPath". You can still use it if you want but there is rarely any point to it, only if there are multiple matching indexes and you strongly prefer one of them.

One more indirect fall-out from these changes is the new option to LookupJoin:

fieldsDropRightKey => 0/1

The default is 0, and if you set it to 1, the logic will automatically exclude from the result row type the key fields from the right side. This is convenient because these fields contain the values that are duplicates of the key fields from the left side anyway. It's what the TQL join was doing all along, and the implementation of this logic becomes simpler when it gets moved directly into LookupJoin.

Saturday, May 11, 2013

on the export of table types

I've made an example for the export of the table types between threads but it didn't come out well. It has turned out to not particularly need this feature, and came out contrived and ugly. I'm working on a better example now, so in the meantime I want to tell more about the subject matter.

As I've said before, the limitation of exporting the table types between the threads is in keeping the involved Perl code snippets as source code. Their support for the sorted and ordered indexes has been already described, and I've also mentioned the aggregators. I've done this support for the basic aggregators too, with a side effect that the fatal errors in both the index and aggregator code snippets are now propagated much more nicely and come out as the table operations confessing. However when it came to the SimpleAggregator, I've found that I can't just do it as-is, the missing piece of the puzzle is the aggregator initialization routine that would run at the table type initialization time. It's a nice feature to have overall but I'm trying to cut a release here and push everything non-essential past it.

Fortunately, some thinking had showed that this feature is not really needed. There usually just isn't any need to export a table type with aggregators. Moreover, there is a need to export the table types with many elements stripped. What is to be stripped and why?

The most central part of the table type is its primary index. It defines how the data gets organized. And then the secondary indexes and aggregators perform the computations from the data in the table. The tables can not be shared between threads, and thus the way to copy a table between the threads is to export the table type and send the data, and let the other thread construct a copy of the table from that. But the table created in another thread really needs only the base data organization. If it does any computations on that data, that would be its own computations, different than the ones in the exporting thread. So all it needs to get is the basic table type with the primary index, very rarely some secondary indexes, and pretty much never the aggregators.

The way to get such a stripped table type with only the fundamentally important parts is:

$tabtype_fundamental = $tabtype->copyFundamental();

That copies the row type and the primary index (the whole path to the first leaf index type) and leaves alone the rest. All the aggregators on all the indexes, even on the primary one, are not included in the copy. In the context of the full nexus making it can look like

$facet = $owner->makeNexus(
    name => "data"
    labels => [ @labels ],
    tableTypes => [
   mytable => $mytable->getType()->copyFundamental(),
    ],
    import => "writer",
);

In case if more index types need to be included, they can be specified by path in the arguments of copyFundamental():

$tabtype_fundamental = $tabtype->copyFundamental(
[ "byDate", "byAddress", "fifo" ],
[ "byDate", "byPriority", "fifo" ],

);

The paths may overlap, as shown here, and the matching subtrees will be copied correctly, still properly overlapping in the result. There is also a special syntax:

$tabtype_fundamental = $tabtype->copyFundamental(
[ "secondary", "+" ],

);

The "+" in the path means "do the path to the first leaf index of that subtree" and saves the necessity to write out the whole path.

Finally, what if you don't want to include the original primary index at all? You can use the string "NO_FIRST_LEAF" as the first argument. That would skip it. You can still include it by using its explicit path, possibly at the other position.

For example, suppose that you have a table type with two top-level indexes, "first" is the primary index and "second" as secondary, and make a copy:

$tabtype_fundamental = $tabtype->copyFundamental(
   "NO_FIRST_LEAF",
[ "second", "+" ],

[ "first", "+" ],

);

In the copied table type the index "second" becomes primary and "first" secondary.

Sunday, February 17, 2013

Aggregator in C++, part 2

Building an aggregator is a fairly complex subject, so next I want to show a working example. It turns out that I've never actually written an aggregator in C++ before, other than some tiny fragments; I was always working through Perl. So I've written some more interesting examples just now. The full text can be found in the unit test file cpp/table/t_Agg.cpp.

First, if your aggregator is truly stateless and fully hardcoded, the easier way to do it as by defining a plain function with the same handler arguments and building a BasicAggregatorType with it. And here is one that sums the values of an int64 field (the test case aggBasicSum):

void sumC(Table *table, AggregatorGadget *gadget, Index *index,
        const IndexType *parentIndexType, GroupHandle *gh, Tray *dest,
    Aggregator::AggOp aggop, Rowop::Opcode opcode, RowHandle *rh)
{
// don't send the NULL record after the group becomes empty
if (opcode == Rowop::OP_NOP || parentIndexType->groupSize(gh) == 0)
    return;

int64_t sum = 0;
for (RowHandle *rhi = index->begin(); rhi != NULL; rhi = index->next(rhi)) {
    sum += table->getRowType()->getInt64(rhi->getRow(), 2, 0); // field c at idx 2
}

// pick the rest of fields from the last row of the group
RowHandle *lastrh = index->last();

// build the result row; relies on the aggregator result being of the
// same type as the rows in the table
FdataVec fields;
table->getRowType()->splitInto(lastrh->getRow(), fields);
fields[2].setPtr(true, &sum, sizeof(sum)); // set the field c from the sum

Rowref res(gadget->getLabel()->getType(), fields);
gadget->sendDelayed(dest, res, opcode);
}

...
Autoref<TableType> tt = TableType::make(rt1)
    ->addSubIndex("Hashed", HashedIndexType::make( // will be the default index
        (new NameSet())->add("e")
      )->addSubIndex("Fifo", FifoIndexType::make()
        ->setAggregator(new BasicAggregatorType("aggr", rt1, sumC))
      )
    );
...

As I've told before, you create the BasicAggregatorType by giving it the aggregator name, aggregator result row type, and the handler function.

In this case the handler function is completely hardcoded. It works on the int64 field at index 2. The row type I used in this example is:

row {
uint8[10] a,
int32[] b,
int64 c,
float64 d,
string e,
}

So the field is actually named "c", and that's why the aggregator function is named "sumC". But since in this case everything is known in advance, to make it more efficient, the look-up of the field by name has been skipped, and the field index has been pre-computed and placed into the function.

The general outline of the aggregator is exactly the same as in Perl: check for an empty group, then iterate through all the rows in the group and compute a the sum, fill the rest of the fields from the last row, and send the result. The difference is that there is no AggregatorContext, and the calls are done directly on the bits and pieces received as arguments.

The group size is computed as

parentIndexType->groupSize(gh)

This is pretty much the reason for parentIndexType and gh to be passed as arguments to the handler. Why it's done like this is a long story. It allows an individual index (as in instance, not type!) to not care about the number of rows in it. Remember, a group may have multiple "parallel" indexes defined on it, with all of them containing the same rows, and thus the same number of rows. The group handle (gh) contains the common information about the group, including the group size. And the parent index type is the one that manages the group handles inside it, and knows how to extract the information from them.

The iteration is done as usual with the begin() and next(), only called directly on the index. It will return NULL after the last row. The last row can also be found directly with index->last(); on an empty group it would return NULL, so be careful with that.

The input row type for reading the rows from the group is found as:

table->getRowType()

The result row is built in this example by copying the last row and replacing one field. The data from the last row is split into FdataVec (the data itself is not copied at this point but the data descriptors in the construction vector are made to point to the data in the original row). Then the descriptor for the field "c" is changed to point to the computed sum. Then the new row is built from the descriptor.

In this particular case the type of the input rows and of the result rows is the same, so either could have been used to construct the result rows. There are two ways to find the result type:

gadget()->getType()->getRowType()
gadget->getLabel()->getType()

They are exactly the same, just there are two paths leading to the same object.

Finally, the constructed row is sent. sendDelayed() takes care of constructing the rowop from the components.

Potentially, you could even call directly

sendDelayed(dest, gadget->getLabel()->getType()->makeRow(fields), opcode);

However the caveat is that if the table's enqueuing mode is set to EM_IGNORE (which is an old idea, nowadays everyone should use EM_CALL and eventually there will be no other way than EM_CALL, but for now it could happen), the rows would leak. The reason for that leak is that sendDelayed() will skip the rowop creation with EM_IGNORE, and thus will skip the reference creation, and the row would never have a reference to it created and would never be freed. Creating an explicit reference makes sure that the row will always be cleaned up when it becomes unused.

Wednesday, September 12, 2012

SortedIndexType, row handle section and sequences

Now we get to an advanced feature that has been mentioned before in the description of the row handles but is not accessible from Perl. A row handle contains a chunk of memory for every index type in the table. It is called a "row handle section". At the very least this chunk of memory contains the iterator in an index of that type, which allows to navigate through the table and to delete the row handles from the table efficiently.

But an index may request more memory (the same fixed amount for each row handle) to store some row-specific information. For example, the Hashed index stores the value of the hash in its section, and uses this value for the efficient comparisons.

A sort condition may request and use memory in this section of a SortedIndexType. It is done by defining a few more virtual methods that handle the row section.

I could have showed an example of the Hashed index re-implementation through the Sorted interface, but it's kind of boring, since you could as well look directly at the source code of the HashedIndexType. Instead I want to show a different kind of index that doesn't use the data in the rows for comparison at all but keeps the rows in the order they were inserted. Like a more expensive variety of FIFO index type. It's also a bit of a preview of a future feature. It assigns a new auto-generated sequence number to each row handle, and uses that sequence number for ordering. Later you can find the row handle quickly if you know its sequence number. If a table may contain multiple copies of a row, the sequence numbers allow you to tell, which copy you are dealing with. It comes handy for such things as results of joins without a natural primary key. Of course, the usefulness of this preview is limited by the fact that there is no place for the sequence numbers in the rowops, and thus there is no way to propagate the sequence numbers in the model. That would have to be addressed before it becomes a full feature.

Now, you might ask, why not just add an extra field and put the sequence number in there? Sure, that would work too, and also solve the issue with the propagation in the rowops. However this means that as a row goes through a table, it gets copied to set the sequence number in it, which is less efficient. So ultimately keeping the sequence numbers "on the side" is more beneficial.

Now, to the implementation:

class SeqSortCondition : public SortedIndexCondition
{
protected:
    class SeqRhSection : public TreeIndexType::BasicRhSection
    {
    public:
        SeqRhSection(int64_t val) :
            seq_(val)
        { }

        int64_t seq_; // the sequence number of this row handle
    };

public:
    SeqSortCondition() :
        seq_(0)
    { }

    virtual void initialize(Erref &errors, TableType *tabtype, SortedIndexType *indtype)
    {
        SortedIndexCondition::initialize(errors, tabtype, indtype);
        seq_ = 0;
    }

    virtual bool equals(const SortedIndexCondition *sc) const
    {
        return true;
    }

    virtual bool match(const SortedIndexCondition *sc) const
    {
        return true;
    }

    virtual void printTo(string &res, const string &indent = "", const string &subindent = " ") const
    {
        res.append("Sequenced");
    }

    virtual SortedIndexCondition *copy() const
    {
        return new SeqSortCondition(*this);
    }

    virtual size_t sizeOfRhSection() const
    {
        return sizeof(SeqRhSection);
    }

    virtual void initRowHandleSection(RowHandle *rh) const
    {
        // initialize the Seq part, the general Sorted index
        // will initialize the iterator
        SeqRhSection *rs = rh->get<SeqRhSection>(rhOffset_);
        new(rs) SeqRhSection(seq_++);
    }

    virtual void clearRowHandleSection(RowHandle *rh) const
    {
        // clear the iterator by calling its destructor
        SeqRhSection *rs = rh->get<SeqRhSection>(rhOffset_);
        rs->~SeqRhSection();
    }

    virtual void copyRowHandleSection(RowHandle *rh, const RowHandle *fromrh) const
    {
        SeqRhSection *rs = rh->get<SeqRhSection>(rhOffset_);
        SeqRhSection *fromrs = fromrh->get<SeqRhSection>(rhOffset_);

        // initialize the iterator by calling its copy constructor inside the placement,
        // the sequence number gets copied too
        new(rs) SeqRhSection(*fromrs);
    }

    // Helper method to read the sequence from the row handle,
    // can also be used by the end-user. The row handle must as usual
    // belong to this table.
    int64_t getSeq(const RowHandle *rh) const
    {
        return rh->get<SeqRhSection>(rhOffset_)->seq_;
    }
    // Helper method to set the sequence in the row handle.
    // May be used only on the rows that are not in table.
    void setSeq(const RowHandle *rh, int64_t val) const
    {
        if (rh->isInTable()) {
            throw Exception("Attempted to change the sequence on a row in table.", true);
        }
        rh->get<SeqRhSection>(rhOffset_)->seq_ = val;
    }

    virtual bool operator() (const RowHandle *rh1, const RowHandle *rh2) const
    {
        return getSeq(rh1) < getSeq(rh2);
    }

    mutable int64_t seq_; // the next sequence number to assign
};

    ...
     Autoref<IndexType> it = new SortedIndexType(new SeqSortCondition());
    ...

The nested class SeqRhSection defines the structure of this index's section. For the sort condition it must always inherit from TreeIndexType::BasicRhSection, to get the iterator part from it. Any extra fields are owned by the sort condition.

The SeqSortCondition contains the sequence number generator seq_ (not to be confused with the same-named field seq_ in SeqRhSection), that gets initialized to 0, and will be incremented from there. Since each sorted index type has its own copy of the condition, and each table type gets its own sorted index type, each of them will be counting independently. However there is a bit of a catch when multiple tables get created from the same table type: they will all share the same copy of the sort condition, and thus the same sequence number generator. In practice it should not be a problem, as long as all of the tables are in the same thread. If they are in different threads, a synchronization would be needed around the sequence generator increment. Or better, make a copy of the table type for each thread and avoid the synchronization issues.

The equals() and match() always return true because there is nothing configurable in this sort condition.

The new features start at sizeOfRhSection(). The size of each row handle in a table type is the same, and is computed by asking every index type in it at initialization time and adding up the totals (plus alignment and some fixed amount of basic data). sizeOfRhSection() does its part by telling the caller the size of SeqRhSection.

Then each row handle section must provide the ways to construct and destruct it. Naturally, to save space, a section must have no virtual table, so like for the rows, a separate method in the index type acts as its virtual destructor. And there is no such thing as a virtual constructor in C++, which gets simulater through more methods in the index type. The SortedIndexType delegates most of this work to the sort condition in it. The basic constructor is initRowHandleSection(), the copy constructor is copyRowHandleSection(), and the destructor is clearRowHandleSection().

Each of them gets the location of this index type's section in the row handle with

SeqRhSection *rs = rh->get<SeqRhSection>(rhOffset_);

The field rhOffset_ gets initialized by the SortedIndexType machinery before either of these methods gets ever called. Here rs points to the raw bytes, on which the placement constructors and the explicit destructor are called.

The methods getSeq() and setSeq() are not virtual, they are unique to this SeqSortCondition. They allow to read the sequence from a row handle or set the sequence in it. Naturally, the sequence number may be changed only when the row handle is not in the table yet, or it would mess up the indexing horribly. It's OK to throw the exceptions from setSeq() and getSeq() since they are called directly from the used code and won't confuse any Triceps code along the way.

If you want to find a row handle in the table by its sequence number, you start with creating a new row handle (which can even use an empty row). That new row handle will have a new sequence number assigned to it, but it doesn't matter, because next you call setSeq() and overwrite it with your desired number. Then you use this row handle to call find() or delete() on the table as usual. Like this:

Rhref rh1(table, r1);
sc->setSeq(rh1, desired_number);

Or to read the number, you do:

int64_t seq = sc->getSeq(rh);

Here sc is the exact initialized sort condition from the actual table type. If you use a wrong or uninitialized one, the rhOffset_ in it will likely be wrong, and will cause all kinds of memory corruption. You can get the sort condition from a table type like this:

    Autoref<SortedIndexType> ixt = dynamic_cast<SortedIndexType *>(tt->findSubIndex("primary"));
    Autoref<SeqSortCondition> sc = dynamic_cast<SeqSortCondition *>(ixt->getCondition());

You don't have to use the dynamic cast but it's safer, and since you'd normally do it once at the model setup time and then just keep using the value, there is no noticeable performance penalty for it.

The full example can be found in svn in cpp/type/test/t_xSortedIndex.cpp, and will be also included in version 1.1.

Monday, September 10, 2012

SortedIndexType example with getKey

The method getKey() is just a little optional addition to the sort condition. If you leave it unimplemented in your subclass, the basic implementation will just return NULL. I wanted to show a bit more techniques, so I've done a fairly big extension of the previous example. Now it can compare multiple fields, and the fields are specified by names. The only part missing from it being a general OrderedIndexType is the support of all the field types, not just int32.

Here it goes, method by method, with commentary.

class MultiInt32SortCondition : public SortedIndexCondition
{
public:
// @param key - the key fields specification
    MultiInt32SortCondition(NameSet *key):
        key_(key)
    { }

The key is specified as a name set. Unlike the HashedIndexType, there is no changing the key later, it must be specified in the constructor, and must not be NULL. The same as with HashedIndexType, the original name set becomes referenced by this sort condition and all its copies. So don't change and don't even use the original condition any more after you've passed it to the sort condition.

    virtual void initialize(Erref &errors, TableType *tabtype, SortedIndexType *indtype)
    {
        SortedIndexCondition::initialize(errors, tabtype, indtype);
        idxs_.clear();

        for (int i = 0; i < key_->size(); i++) {
            const string &s = (*key_)[i];
            int n = rt_->findIdx(s);
            if (n < 0) {
                errors->appendMsg(true, strprintf("No such field '%s'.", s.c_str()));
                continue;
            }
            const RowType::Field &fld = rt_->fields()[n];
            if (fld.type_->getTypeId() != Type::TT_INT32) {
                errors->appendMsg(true, strprintf("The field '%s' must be an int32.", s.c_str()));
                continue;
            }
            if (fld.arsz_ != RowType::Field::AR_SCALAR) {
                errors->appendMsg(true, strprintf("The field '%s' must not be an array.", s.c_str()));
                continue;
            }
            idxs_.push_back(n);
        }
    }

The initialization translates the field names to indexes (um, that's a confusing double usage of the word "index", here it's like "array indexes") in the row type, and checks that the fields are as expected.

    virtual bool equals(const SortedIndexCondition *sc) const
    {
        // the cast is safe to do because the caller has checked the typeid
        MultiInt32SortCondition *other = (MultiInt32SortCondition *)sc;

        // names must be the same
        if (!key_->equals(other->key_))
            return false;

        // and if initialized, the indexs must be the same too
        if (!rt_.isNull()) {
            if (idxs_.size() != other->idxs_.size())
                return false;

            for (int i = 0; i < idxs_.size(); i++) {
                if (idxs_[i] != other->idxs_[i])
                    return false;
            }
        }

        return true;
    }

    virtual bool match(const SortedIndexCondition *sc) const
    {
        MultiInt32SortCondition *other = (MultiInt32SortCondition *)sc;
        if (rt_.isNull()) {
            // not initialized, check by names
            return key_->equals(other->key_);
        } else {
            // initialized, check by indexes
            if (idxs_.size() != other->idxs_.size())
                return false;

            for (int i = 0; i < idxs_.size(); i++) {
                if (idxs_[i] != other->idxs_[i])
                    return false;
            }
            return true;
        }
    }

The equality and match checks follow the fashion of HashedIndexType in 1.1: if not initialized, they rely on field names, if initialized, they take the field indexes into the consideration (for equality both the names and indexes must be equal, for matching, only the indexes need to be equal).

    virtual void printTo(string &res, const string &indent = "", const string &subindent = " ") const
    {
        res.append("MultiInt32Sort(");
        for (NameSet::iterator i = key_->begin(); i != key_->end(); ++i) {
            res.append(*i);
            res.append(", "); // extra comma after last field doesn't hurt
        }
        res.append(")");
    }

    virtual SortedIndexCondition *copy() const
    {
        return new MultiInt32SortCondition(*this);
    }

    virtual const_Onceref<NameSet> getKey() const
    {
        return key_;
    }

The printing and copying is nothing particularly new and fancy. getKey() simply returns back the key. This feels a bit like an anti-climax, a whole big example for this little one-liner, but again, that's not the only thing that this example shows.

     virtual bool operator() (const RowHandle *rh1, const RowHandle *rh2) const
    {
        const Row *row1 = rh1->getRow();
        const Row *row2 = rh2->getRow();

        int sz = idxs_.size();
        for (int i = 0; i < sz; i++) {
            int idx = idxs_[i];
            {
                bool v1 = rt_->isFieldNull(row1, idx);
                bool v2 = rt_->isFieldNull(row2, idx);
                if (v1 > v2) // isNull at true goes first, so the direction is opposite
                    return true;
                if (v1 < v2)
                    return false;
            }
            {
                int32_t v1 = rt_->getInt32(row1, idx);
                int32_t v2 = rt_->getInt32(row2, idx);
                if (v1 < v2)
                    return true;
                if (v1 > v2)
                    return false;
            }
        }
        return false; // falls through on equality, which is not less
    }

    vector<int> idxs_;
    Autoref<NameSet> key_;
};

The "less" comparison function now loops through all the fields in the key. It can't do the shortcuts in the int32 comparison part any more, so that has been expanded to the full condition. If the whole loop falls through without returning, it means that the key fields in both rows are equal, so it returns false.

    ...
    Autoref<IndexType> it = new SortedIndexType(new MultiInt32SortCondition(
        NameSet::make()->add("b")->add("c")
    ));
    ...

The key argument is created very much like to the hashed index.

Sunday, September 9, 2012

SortedIndexType

The SortedIndexType is defined in type/SortedIndexType.h, and provides a way to define any custom sorting criteria. That is done by defining your condition class, derived from SortedIndexCondition, and passing it to the SortedIndexType. Because of this the index type itself is simple, all the complex things are in the condition:

SortedIndexType(Onceref<SortedIndexCondition> sc);
static SortedIndexType *make(Onceref<SortedIndexCondition> sc);
SortedIndexCondition *getCondition() const;

The examples of the sort conditions can be found in type/test/t_TableType.cpp and in perl/Triceps/PerlSortCondition.*. SortedIndexCondition provides a set of virtual methods that can be re-defined in the subclass to create a custom condition. Indeed, some of them must be re-defined, since they are pure virtual.

I've also added now a simple example that will be included in 1.1, but can as well be compiled with 1.0. It's located in type/ test/t_xSortedIndex.cpp. Here is the subclass:

class Int32SortCondition : public SortedIndexCondition
{
public:
    // @param idx - index of field to use for comparison (starting from 0)
    Int32SortCondition(int idx) :
        idx_(idx)
    { }

    virtual void initialize(Erref &errors, TableType *tabtype, SortedIndexType *indtype)
    {
        SortedIndexCondition::initialize(errors, tabtype, indtype);
        if (idx_ < 0)
            errors->appendMsg(true, "The index must not be negative.");
        if (rt_->fieldCount() <= idx_)
            errors->appendMsg(true, strprintf("The row type must contain at least %d fields.", idx_+1));

        if (!errors->hasError()) { // can be checked only if index is within range
            const RowType::Field &fld = rt_->fields()[idx_];
            if (fld.type_->getTypeId() != Type::TT_INT32)
                errors->appendMsg(true, strprintf("The field at index %d must be an int32.", idx_));
            if (fld.arsz_ != RowType::Field::AR_SCALAR)
                errors->appendMsg(true, strprintf("The field at index %d must not be an array.", idx_));
        }
    }

    virtual bool equals(const SortedIndexCondition *sc) const
    {
        // the cast is safe to do because the caller has checked the typeid
        Int32SortCondition *other = (Int32SortCondition *)sc;
        return idx_ == other->idx_;
    }
    virtual bool match(const SortedIndexCondition *sc) const
    {
        return equals(sc);
    }
    virtual void printTo(string &res, const string &indent = "", const string &subindent = " ") const
    {
        res.append(strprintf("Int32Sort(%d)", idx_));
    }
    virtual SortedIndexCondition *copy() const
    {
        return new Int32SortCondition(*this);
    }

    virtual bool operator() (const RowHandle *rh1, const RowHandle *rh2) const
    {
        const Row *row1 = rh1->getRow();
        const Row *row2 = rh2->getRow();
        {
            bool v1 = rt_->isFieldNull(row1, idx_);
            bool v2 = rt_->isFieldNull(row2, idx_);
            if (v1 > v2) // isNull at true goes first, so the direction is opposite
                return true;
            if (v1 < v2)
                return false;
        }
        {
            int32_t v1 = rt_->getInt32(row1, idx_);
            int32_t v2 = rt_->getInt32(row2, idx_);
            return (v1 < v2);
        }
    }

    int idx_;
};

...

Autoref<IndexType> it = new SortedIndexType(new Int32SortCondition(1));

It's the very basic example that defined only the absolute minimum of methods. It sorts by an int32 field, whose index (starting as usual from 0) is specified in the constructor.

The method initialize() is called at the table type initialization time. The argument errors is an already allocated Errors object to return the error indications, tabtype is the table type where the initialization is happening, and indtype is the index type that owns this condition. Also the field rt_ gets magically initialized to the table's row type reference before the sort condition initialization is called. This method is expected to do all the initialization of the internal state, check for all the errors, and return these errors if found.

equals() and match() compare two conditions for equality and match. Before they get called, the caller checks that both conditions are of the same type (i.e. have the same C++ typeid), so it's safe to cast the second condition's pointer to our type. The easiest way to define match() is to make it the same as equals(). These methods may be called on both uninitialized and initialized conditions; if not initialized then the field rt_ will be NULL.

printTo() appends the printout of this index's description to a string. For the simple single-line printouts it just appends to the result string. The multi-line prints have to handle the indenting correctly, as will be shown later.

copy() creates a deep copy of this object. In particular, the SortedIndexType constructor makes a private copy of the condition, and remembers that copy, not the original.

Finally, the operator() implements the comparison for "Less": it gets two row handles, and returns true if the first one contains a row that is "less" (i.e. goes before in the sorting order) than the second one. The reason for why it's done like this is that the SortedIndexCondition is really a Less comparator class for the STL tree that has grown a few extra methods.

This example shows how to compare a value consisting of multiple elements. Even though this sort condition sorts by only one field, it first compares separately for NULL in that field, and then for the actual value. For each element you must:

find the values of the element in both rows
compare if "<", and if so, return true
compare if ">", and if so, return false
otherwise (if they are equal), fall through to the next element

The last element doesn't have to go through the whole procedure, it can just return the result of "<". And in this case the comparison for NULL wants the NULL value go before all the non-NULL values, so the result of true must go before false, and the comparison signs are reversed. It's real important that the second comparison, normally for ">", can not be skipped (except for the last element). If you skip it, you will get a mess of the data and will spend a lot of time trying to figure out, what is going on.

Another important point is that none of the methods, especially operator(), must not throw any exceptions. Print an error message and either call abort() or return false. This might be handled better in the future versions.

That's it for the basics, the minimal subset of the methods that has to be defined.

Saturday, September 8, 2012

HashedIndexType

The HashedIndexType is defined in type/HashedIndexType.h. It also allows to specify its only argument, the selection of the key fields, in the constructor, or set it later with a chainable call:

HashedIndexType(NameSet *key = NULL);
static HashedIndexType *make(NameSet *key = NULL);
HashedIndexType *setKey(NameSet *key);

One way or the other, the key has to be set, or a missing key will be detected as an error at the initialization time. Obviously, all the conditions described in the Perl API apply: the key fields must be present in the table's row type, and so on.

The key can be get back using the parent class method IndexType::getKey(). The value returned there is a const_Onceref<NameSet>, telling you that the key NameSet must not be changed afterward.

The check for equals() and match() of HashedIndexType are the same in 1.0: the list of key fields must be the same (not the exact same NameSet object, but its contents being the same).

However this has a tricky effect: if two table types have matching row types with different field names, and the same Hashed indexes differing only in the key field names (following the difference in the row types), these table types will be considered non-matching because their hashed indexes are non-matching.

So for version 1.1 I've updated the semantics. If the index types have been initialized, they can find their table types, and from there the row types. And then compare if the keys refer to the matching fields or not, even if their names are different. If the index types are not initialized, everything works the old way. This may lead to slightly surprising effects when the two indexes match inside the initialized table types but their uninitialized copies don't. However the comparison of the uninitialized index types is probably not that usable anyway.

And of course this semantics change in 1.1 propagates to Perl as well.

FifoIndexType

The FifoIndexType works the same way as in Perl, except that it provides two ways to set the configuration values: either as the constructor arguments or as chainable methods:

FifoIndexType(size_t limit = 0, bool jumping = false, bool reverse = false);
static FifoIndexType *make(size_t limit = 0, bool jumping = false, bool reverse = false);

FifoIndexType *setLimit(size_t limit);
FifoIndexType *setJumping(bool jumping);
FifoIndexType *setReverse(bool reverse);

So the following are equivalent:

Autoref<IndexType> it1 = new FifoIndexType(100, true, true);
Autoref<IndexType> it2 = FifoIndexType::make()
    ->setLimt(100)
    ->setJumping(true)
    ->setReverse(true);

As usual, the settings can be changed only until the initialization. The settings can be read back at any time with

size_t getLimit() const;
bool isJumping() const;
bool isReverse() const;

Note that the limit is unsigned, and setting it to negative values results in it being set to very large positive values. The limit of 0 means "unlimited".

All the common methods inherited from IndexType and Type work as usual.

The equals() and match() are equivalent for the FifoIndexType, and are true when all the parameters are set to the same values.

Types, aborts and exceptions

And by the way, in version 1.0 the attempts to modify a TableType or IndexType after initialization caused an error message and abort(). For 1.1 I've changed them to throw an Exception instead. There is no more direct abort() anywhere in the code.

Though of course unless you set the Triceps Exception to be interceptable, it will amount to the same error message and abort().

NameSet

NameSet is fundamentally a reference-counted vector of strings that allows to construct them from a sequence of calls. It's used to construct such things as field list for the index key. Previously I've said that the names in the set must be different, and that should normally be the use case, but NameSet itself doesn't check for that. The order of values in it usually matters. So its name is slightly misleading: it's not really a set, it's a vector, but the name has been applied historically. And in the future it might include the set functionality too, by adding a quick look up of index by name.

It's defined in type/NameSet.h as

class NameSet : public Starget, public vector<string> { ... }

All the vector methods are also directly accessible.

NameSet();
NameSet(const NameSet *other);
static NameSet *make();
static NameSet *make(const NameSet &other);

The constructors are also duplicated as make(), for the more convenient operator priority than with new().

NameSet *add(const string &s);

The method for the chained construction, such as:

Autoref<NameSet> ns1 = (new NameSet())->add("b")->add("c");
Autoref<NameSet> ns2 = NameSet::make()->add("b")->add("c");

It's possible to combine the copy constructor and the addition of extra fields:

Autoref<NameSet> ns3 = NameSet::make(*ns2)->add("x");

One more feature of the NameSet is the comparison for equality:

bool equals(const NameSet *other) const;

As I've been writing this, it has struck me that the constructors are not particularly conveniend, so for the version 1.1.0, I've changed the copy constructors:

NameSet(const NameSet *other);
NameSet(const vector<string> &other);
static NameSet *make(const NameSet *other);
static NameSet *make(const vector<string> &other);

Now you can construct from a plain vector too, and since NameSet is its subclass, that replaces the old copy constructor. The constructor from a pointer makes the use of Autorefs more convenient, now you can do:

Autoref<NameSet> ns3 = NameSet::make(ns2)->add("x");

without dereferencing ns2.

IndexType

Very much like the Perl API, the IndexType is an abstract class, in which you can't create the objects directly, you have to create the objects with its concrete sub-classes. It has the methods common for all the index types, and is defined in type/IndexType.h.

The index type id is defined here, as enum IndexType::IndexId. The supported index types are still IT_HASHED, IT_FIFO, IT_SORTED. There is also the semi-hidden type IT_ROOT: you can't create it directly but every table type creates one implicitly, as the root of its index tree. There is also IT_LAST defined past the last actual type, so if you ever need to iterate through the types, you can do it as

for (int i = 0; i < IndexType::IT_LAST; i++) { ... }

The conversion between the index type id and name can be done with the methods:

static const char *indexIdString(int enval, const char *def = "???");
static int stringIndexId(const char *str);

As usual with the contant-and-name conversions, the numeric id is invalid, the string def is returned, by default "???". If the string name is unknown, -1 is returned.

IndexId getIndexId() const;

Returns the id of an index type object. It can't be changed, it gets hardcoded in the subclass constructor.

IndexType *addSubIndex(const string &name, Onceref<IndexType> index);

Adds a sub-index, in exactly the same way as adding an index type to the TableType. It also adds a copy of the argument, not the argument itself. It's also designed for chaining, like:

Autoref<TableType> tt = (new TableType(rt1)
    )->addSubIndex("primary", new HashedIndexType(
        (new NameSet())->add("b")->add("c"))
    )->addSubIndex("limit", (new FifoIndexType())
        ->setLimit(2) // will policy-delete 2 rows
    );

The getting of the key back is done with:

const_Onceref<NameSet> getKey() const;

It will work with any kind of index, but will return a NULL if the index doesn't support the key. The NameSet is an ordered list of unique names, and will be described in detail soon.

IndexType *setAggregator(Onceref<AggregatorType> agg);
const AggregatorType *getAggregator() const;

Sets or gets an aggregator type for this index type. As usual, any setting can be done only until the index type is initialized.

IndexType *copy() const;

Creates an un-initialized deep copy (with all the sub-index and aggregator types also copied) of this index. This method is used by addSubIndex(). By the way, the usual copy constructor could theoretically be used on the index types but usually doesn't make a whole lot of a sense because the sub-types and such will end up shared by reference.

bool isLeaf() const;

Returns true if this index type has no sub-indexes. Of course, if this type is not initialized yet, more sub-types can be added to it to make it non-leaf later.

IndexType *findSubIndex(const string &name) const;
IndexType *findSubIndexById(IndexId it) const;

Find the sub-index by name or id, works in the same way as for TableType. A special feature is that it can be applied on a NULL object reference, like this:

Autoref<IndexType> it; // NULL by default
Autoref<IndexType> itsub = it->findSubIndex("xxx"); // doesn't crash, returns NULL

The idea here is to allow the safe chaining of findSubIndex() for the look-ups of the nested types:

Autoref<IndexType> it = tt->findSubIndex("level1")->findSubIndex("level2");
if (it.isNull()) {
    // not found
}

If any of the elements in the path are missing, the end result will be NULL, conveniently. But it won't tell you, which one was missing, inconveniently.

const IndexTypeVec &getSubIndexes() const;

Returns back the whole set of sub-indexes.

IndexType *getFirstLeaf() const;

Returns the first leaf index type (if a leaf itself, will return itself).

bool isInitialized() const;

Checks whether the index type has been initialized.

TableType *getTabtype() const;

Returns the table type, to which this index type is tied. The tying-together happens at the initialization time, so for an initialized index type this method will return NULL.

There are great many more methods on the IndexType, that are used to maintain the index trees, but you don't need to look at them unless you are interested in the inner workings of the Triceps tables.

Saturday, May 19, 2012

Finding the nested index types

The joins have introduced a syntax for choosing a nested index type in the table (or, more exactly, in the table type). You can say

rightIdxPath => [ "byCcy1", "byCcy12" ],

to specify the index type "byCcy12" nested in index type "byCcy1". Internally the joins delegate this index finding to the TableType calls, and you can also use these calls directly:

$idxType = $tableType->findIndexPath(\@path);
($idxType, @keys) = $tableType->findIndexKeyPath(\@path);

For example:

$ixt = $tt->findIndexPath([ "byCcy1", "byCcy12" ]);
($ixt, @keys) = $tt->findIndexKeyPath([ "byCcy1", "byCcy12" ]);

The findIndexPath() simply finds the index type at the path. If there is no such index, it confesses.

The findIndexKeyPath() finds by path an index type that allows the direct look-up by key fields. It requires that every index type in the path returns a non-empty array of fields in getKey(). In practice it means that every index in the path must be a Hashed index. Otherwise the method confesses. When the Sorted and maybe other index types will support getKey(), they will be usable with this method too.

Besides checking that each index type in the path works by keys, this method builds and returns the list of all the key fields required for a look-up in this index. Note that @keys is an actual array and not a reference to array. The return protocol of this method is a little weird: it returns an array of values, with the first value being the reference to the index type, and the rest of them the names of the key fields. If the table type was defined as

$tt = Triceps::TableType->new($rt)
  ->addSubIndex("byCcy1",
    Triceps::IndexType->newHashed(key => [ "ccy1" ])
    ->addSubIndex("byCcy12",
      Triceps::IndexType->newHashed(key => [ "ccy2" ])
    )
  )
  ->addSubIndex("byCcy2",
    Triceps::IndexType->newHashed(key => [ "ccy2" ])
    ->addSubIndex("grouping", Triceps::IndexType->newFifo())
  )
or die "$!";

then the look-up of [ "byCcy1", "byCcy12" ] would return ($ixtref, "ccy1", "ccy2"), where $ixtref is the reference to the index type. When assigned to ($ixt, @keys), $ixtref would go into $ixt, and ("ccy1", "ccy2") would go into @keys.

The key field names in the result go in the order they occurred in the definition, from the outermost to the innermost index. The key fields must not duplicate. It's possible to define the index types where the key fields duplicate in the path, say:

$tt = Triceps::TableType->new($rt)
  ->addSubIndex("byCcy1",
    Triceps::IndexType->newHashed(key => [ "ccy1" ])
    ->addSubIndex("byCcy12",
      Triceps::IndexType->newHashed(key => [ "ccy2", "ccy1" ])
    )
  )
or die "$!";

And they would even work fine, with just a little extra overhead from duplication. But findIndexKeyPath() will refuse such indexes and confess.

Monday, April 9, 2012

A better index type lookup

I've been recently distracted with other things, and also I've been working on the joins, before writing about them. The joins were actually some of the first templates I wrote, but that's also why they look a bit dated compared to the newer stuff. I probably won't do all I wanted to do with them, but I'm at least cleaning them up a bit and adding some of the more recent features. This rework takes time.

Along that way, I've extracted the logic that looks up the index types by path from SimpleAggregator and put it for reuse into TableType. Now it can get used like this:

$indexType = $tableType->findIndexPath("primary", "fifo");

The arguments form a path of names in the index type tree. If the path is not found, the function would confess (that is die, with a stack trace). An empty path is also illegal and would cause the same result.

Tuesday, March 20, 2012

IndexType reference

The index types in Triceps are available in the following kinds:

Hashed: Provides quick random access based on the key formed from the fields of the row in the table. May be leaf or non-leaf. The order of of rows in the index will be repeatable between the runs of the same program on the same machine architecture, but not easily predictable. Internally the rows are stored in a tree but the comparisons of the rows are accelerated by pre-calculating a hash value from the key fields and keeping it in the row handle.

FIFO: Keeps the rows in the order they were received. There is no efficient way to find a particular row in this index, the search in it works by going through all the rows sequentially and comparing the rows for exact equality. It provides the expiration policies based on the row count. It may only be a leaf.

PerlSorted: Provides random access based on the key field comparison, expressed as a Perl function. This results in a predictable order of rows but the execution of the Perl code makes it slower than the Hashed index. May be leaf or non-leaf. There is also a SimpleOrdered index implementation done in Perl on top of the PerlSorted index, that allows to specify the keys in a more convenient way.

The hashed index is created with:

$it = Triceps::IndexType->newHashed($optionName => $optionValue, ...)
  or die "$!";

The only available option is "key", and it's mandatory. It's argument is the reference to an array of strings that specify the names of the key fields.

The FIFO index is created with:

$it = Triceps::IndexType->newFifo($optionName => $optionValue, ...)
  or die "$!";

The options are:

limit: sets the limit value for the replacement policy. Once the number of rows attempts to grow beyond this value, the older records get removed. Setting it to 0 disables the replacement policy, which is the default. Don't try to set it to negative values, they will be treated as unsigned, and thus become some very large positive ones.

jumping: determines the variation of the replacement policy in effect. If set to 0 (default), implements the sliding window policy, removing the older rows one by one. If non-0, implements the jumping window policy, removing all the older rows when a new row causes the limit overflow.

reverse: if non-0, the iteration on this index goes in the reverse order. However the expiration policy still works in the direct order! The default is 0.

The PerlSorted index is created as:

$it = Triceps::IndexType->newPerlSorted($sortName, $initFunc,
  $compareFunc, @args...) or die "$!";

The $sortName is a string describing the sorting order, used in print() and error messages. $initFunc is a function reference that can be used to generate the comparison function dynamically at the table type initialization time (or use undef if using a fixed comparison function). $compareFunc is the fixed comparison function, if preferred (or use undef if it will be generated dynamically by the init function). The args are the optional extra arguments for the initialization and/or comparison function. See the details in the dedicated post.

The index types are connected in a table type to form a tree. To nest the indexType1 under indexType2, use:

$indexType2-> addSubIndex("indexName", $indexType1) or die "$!";

It returns the reference to the same $indexType2, so these calls can be conveniently chained, to add multiple sub-indexes under it. If $indexType2 can not be non-leaf, the call will fail. The added sub-index is actually a copy of $indexType1, the same as when adding an index type under a table type.

The same introspection methods of the nested index types are available as in the table type:

$itSub = $it-> findSubIndex("indexName") or die "$!";
$itSub = $it-> findSubIndexById($indexTypeId) or die "$!";

@itSubs = $it->getSubIndexes();
$itSub = $it->getFirstLeaf();

If the index type is already a leaf, getFirstLeaf() will return itself.

An aggregator type can be set on an index type. It will create aggregators that run on the rows stored withing the indexes of this type.

$it->setAggregator($aggType) or die "$!";

The value returned is the same index type reference $it, allowing the chaining calls, along with the addSubIndex(). Only one aggregator type is allowed on an index type. Calling setAggregator() repeatedly will replace the aggregator type. The aggregator type can be read back with

$aggType = $it->getAggregator();

The returned value may be undef (but "$!" not set) if no aggregator type has been set.

The index type gets initialized when the table type where it belongs gets initialized. After an index type has been initialized, it can not be changed any more, and any methods that change it will return an error. Whether an index type has been initialized, can be found with

$result = $it->isInitialized();

An index type can be copied with

$itCopy = $it->copy();

The copy reverts to the un-initialized state. It's always a deep copy, with all the nested index and aggregator types copied. All of these copies are un-initialized.

When an index type becomes initialized, it becomes tied to a particular table type. This table type can be read with

$tabType = $it->getTabtype() or die "$!";

If the index type is not initialized yet, this will return an error.

The usual sameness comparisons and print methods are available (and the print method has the usual optional arguments):

$result = $it1->same($it2);
$result = $it1->equals($it2); 
$result = $it1->match($it2);
$result = $it->print();

The matching index types may differ in the names of nested indexes. In the matching PerlSorted index types, the descriptive names of the types may also differ.

An index type can be checked for being a leaf:

$result = $it->isLeaf();

The index type id (see the explanation of that in the TableType reference) can be read with

$id = $it->getIndexId();

A special method that works only on the Hashed index types

@keys = $it->getKey();

returns the array of field names forming the key. On the other index types it returns an empty array, though probably a better support would be available for the PerlSorted indexes in the future.

A special method that works only on the PerlSorted index types

$it->setComparator($compareFunc, @args...) or die "$!";

allows to set an auto-generater comparator function and its optional arguments from an initializer function at the table initialization time. On success it returns 1. For all other index types this method returns an error.

TableType reference

I've shown quite a few of examples that use the tables. And there will be more, since the tables are such a centerpiece of processing. But some more obscure details have been skipped over. So, this starts a series of posts that list out the table-related features.

The TableType gets created from a row type as

$tt = Triceps::TableType->new($rowType) or die "$!";

After that it can be configured by adding the index types. Eventually it has to be initialized and that freezes the table type and makes it immutable. All the steps up to and including the initialization must be done from a single thread, after initialization a table type may be shared between multiple threads.

The index types are added with

$tt->addSubIndex("indexName", $indexType) or die "$!";

The result is the same table type (unless it's an undef signifying an error), so the index type additions can be chained with each other and with the construction:

$tt = Triceps::TableType->new($rowType) 
  ->addSubIndex("indexName1", $indexType1) 
  ->addSubIndex("indexName2", $indexType2)
  or die "$!";

Since the table type initialization freezes not only the table type itself but also all the index types in it, that would make things difficult if the same index type is added to two table types. To avoid these issues, addSubIndex() adds not the actual argument index type but first creates a fresh uninitialized copy of it, and then adds it. The initialization is done with:

$tt->initialize() or die "$!";

The index types check most of their arguments at the initialization time, so that's where most of the errors will be reported. Calling initialize() repeatedly will have no effect and just return the same result again and again. It's possible to check whether the table type has been initialized:

$result = $tt->isInitialized();

The other introspection is the row type:

$rowType = $tt->rowType();

There are multiple ways to get back the index types. To find a index type by name, use:

$indexType = $tt->findSubIndex("indexName") or die "$!";

This is symmetric with addSubIndex(), so it works only for the top-level index types, to get the nested ones, repeat the same call on the found index types.

There is also a way to find the first index type of a particular kind. It's called somewhat confusingly

$indexType = $tt->findSubIndexById($indexTypeId) or die "$!";

where $indexTypeId is one of either strings of Triceps constants

"IT_HASHED" or &Triceps::IT_HASHED
"IT_FIFO" or &Triceps::IT_FIFO
"IT_SORTED" or &Triceps::IT_SORTED

Technically, there is also IT_ROOT but it's of little use for this situation since it's the root of the index type tree hidden inside the table type, and would never be a sub-index type. It's possible to iterate through all the possible index type ids as

for ($i = 0; $i < &Triceps::IT_LAST; $i++) { ... }

The conversion between the strings and constants for index type ids is done with

$intId = &Triceps::stringIndexId($stringId);
$stringId = &Triceps::indexIdString($intId);

If an invalid value is supplied, the conversion functions will return undef.

The first leaf index type (the one used for the default look-ups and iteration) can be found with

$indexType = $tt->getFirstLeaf();

And all the top-level indexes can be found with

@indexTypes = $tt->getSubIndexes();

The resulting array contains the pairs of names and index types. If the order is not important but you want to perform the look-ups by name, the result can be stored directly into a hash:

%indexTypes = $tt->getSubIndexes();

The usual reference comparison methods are:

$result = $tt1->same($tt2);
$result = $tt1->equals($tt2);
$result = $tt1->match($tt2);

And finally the content of a table type can be converted to a human-readable description with

$res = $tt->print();

with the usual optional arguments.

Sergey Babkin on CEP and stuff