Sunday, September 9, 2012

SortedIndexType

The SortedIndexType is defined in type/SortedIndexType.h, and provides a way to define any custom sorting criteria. That is done by defining your condition class, derived from SortedIndexCondition, and passing it to the SortedIndexType. Because of this the index type itself is simple, all the complex things are in the condition:

SortedIndexType(Onceref<SortedIndexCondition> sc);
static SortedIndexType *make(Onceref<SortedIndexCondition> sc);
SortedIndexCondition *getCondition() const;

The examples of the sort conditions can be found in type/test/t_TableType.cpp and in perl/Triceps/PerlSortCondition.*. SortedIndexCondition provides a set of virtual methods that can be re-defined in the subclass to create a custom condition. Indeed, some of them must be re-defined, since they are pure virtual.

I've also added now a simple example that will be included in 1.1, but can as well be compiled with 1.0. It's located in type/ test/t_xSortedIndex.cpp. Here is the subclass:

class Int32SortCondition : public SortedIndexCondition
{
public:
    // @param idx - index of field to use for comparison (starting from 0)
    Int32SortCondition(int idx) :
        idx_(idx)
    { }

    virtual void initialize(Erref &errors, TableType *tabtype, SortedIndexType *indtype)
    {
        SortedIndexCondition::initialize(errors, tabtype, indtype);
        if (idx_ < 0)
            errors->appendMsg(true, "The index must not be negative.");
        if (rt_->fieldCount() <= idx_)
            errors->appendMsg(true, strprintf("The row type must contain at least %d fields.", idx_+1));

        if (!errors->hasError()) { // can be checked only if index is within range
            const RowType::Field &fld = rt_->fields()[idx_];
            if (fld.type_->getTypeId() != Type::TT_INT32)
                errors->appendMsg(true, strprintf("The field at index %d must be an int32.", idx_));
            if (fld.arsz_ != RowType::Field::AR_SCALAR)
                errors->appendMsg(true, strprintf("The field at index %d must not be an array.", idx_));
        }
    }

    virtual bool equals(const SortedIndexCondition *sc) const
    {
        // the cast is safe to do because the caller has checked the typeid
        Int32SortCondition *other = (Int32SortCondition *)sc;
        return idx_ == other->idx_;
    }
    virtual bool match(const SortedIndexCondition *sc) const
    {
        return equals(sc);
    }
    virtual void printTo(string &res, const string &indent = "", const string &subindent = "  ") const
    {
        res.append(strprintf("Int32Sort(%d)", idx_));
    }
    virtual SortedIndexCondition *copy() const
    {
        return new Int32SortCondition(*this);
    }

    virtual bool operator() (const RowHandle *rh1, const RowHandle *rh2) const
    {
        const Row *row1 = rh1->getRow();
        const Row *row2 = rh2->getRow();
        {
            bool v1 = rt_->isFieldNull(row1, idx_);
            bool v2 = rt_->isFieldNull(row2, idx_);
            if (v1 > v2) // isNull at true goes first, so the direction is opposite
                return true;
            if (v1 < v2)
                return false;
        }
        {
            int32_t v1 = rt_->getInt32(row1, idx_);
            int32_t v2 = rt_->getInt32(row2, idx_);
            return (v1 < v2);
        }
    }

    int idx_;
};

...

Autoref<IndexType> it = new SortedIndexType(new Int32SortCondition(1));

It's the very basic example that defined only the absolute minimum of methods. It sorts by an int32 field, whose index (starting as usual from 0) is specified in the constructor.

The method initialize() is called at the table type initialization time. The argument errors is an already allocated Errors object to return the error indications, tabtype is the table type where the initialization is happening, and indtype is the index type that owns this condition. Also the field rt_ gets magically initialized to the table's row type reference before the sort condition initialization is called. This method is expected to do all the initialization of the internal state, check for all the errors, and return these errors if found.

equals() and match() compare two conditions for equality and match. Before they get called, the caller checks that both conditions are of the same type (i.e. have the same C++ typeid), so it's safe to cast the second condition's pointer to our type. The easiest way to define match() is to make it the same as equals(). These methods may be called on both uninitialized and initialized conditions; if not initialized then the field rt_ will be NULL.

printTo() appends the printout of this index's description to a string. For the simple single-line printouts it just appends to the result string. The multi-line prints have to handle the indenting correctly, as will be shown later.

copy() creates a deep copy of this object. In particular, the SortedIndexType constructor makes a private copy of the condition, and remembers that copy, not the original.

Finally, the operator() implements the comparison for "Less": it gets two row handles, and returns true if the first one contains a row that is "less" (i.e. goes before in the sorting order) than the second one. The reason for why it's done like this is that the SortedIndexCondition is really a Less comparator class for the STL tree that has grown a few extra methods.

This example shows how to compare a value consisting of multiple elements. Even though this sort condition sorts by only one field, it first compares separately for NULL in that field, and then for the actual value. For each element you must:

  • find the values of the element in both rows
  • compare if "<", and if so, return true
  • compare if ">", and if so, return false
  • otherwise (if they are equal), fall through to the next element
The last element doesn't have to go through the whole procedure, it can just return the result of "<". And in this case the comparison for NULL wants the NULL value go before all the non-NULL values, so the result of true must go before false, and the comparison signs are reversed. It's real important that the second comparison, normally for ">", can not be skipped (except for the last element). If you skip it, you will get a mess of the data and will spend a lot of time trying to figure out, what is going on.

Another important point is that none of the methods, especially operator(), must not throw any exceptions. Print an error message and either call abort() or return false. This might be handled better in the future versions.

That's it for the basics, the minimal subset of the methods that has to be defined.

No comments:

Post a Comment