Thursday, August 30, 2012

RowType operations on Rows

As has been mentioned before, a RowType acts as a virtual call table for the rows of that type. The operations are:

bool isFieldNull(const Row *row, int nf) const;

Checks if the field nf in a Row is NULL.

bool getField(const Row *row, int nf, const char *&ptr, intptr_t &len) const;

Returns the location of a field in a row. nf is, as usual, the field number; the data is returned in ptr (pointer to the start of the field) and len (length of the field data). The function's return value shows whether the field is not NULL. For a NULL field len will also be 0. The returned data pointer is const, as a reminder that the rows are immutable and the data in them must not be changed.

However, for most types you can't just dereference this pointer to get the desired value, because the data might not be aligned properly for that data type. This is why the returned pointer is a const char * and not a void *. If you have an int64 field, you can't just do

int64_t *data;
intptr_t len;
if (getField(myrow, myfield, data, len)) {
    int64_t val = *data; // WRONG!
}

Fortunately, the type checks will catch this usage attempt right at the call of getField(). But there also are the convenience functions that return the values of particular types. They are implemented on top of getField() and take care of the alignment issue.

uint8_t getUint8(const Row *row, int nf, int pos = 0) const;
int32_t getInt32(const Row *row, int nf, int pos = 0) const;
int64_t getInt64(const Row *row, int nf, int pos = 0) const;
double getFloat64(const Row *row, int nf, int pos = 0) const;
const char *getString(const Row *row, int nf) const;

The extra argument pos is the position of the value in an array field. It's an array index, not a byte offset. For the scalar fields it must be 0. If the field is NULL or pos points beyond the end of the array, the returned value will be 0, which matches the Perl idiom of treating the undefined values as zeroes. If you care whether the field is NULL or not, check it first:

if (!rt1->isFieldNull(r1, nf)) {
    int64_t val = rt1->getInt64(r1, nf);
    ...
}
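How a convenience function can deal with the alignment and with the out-of-range positions can be sketched with a plain memcpy(). This is a stand-alone illustration under assumptions, not the Triceps implementation: the helper readInt64At() is a made-up name, and it works on the ptr and len values in the form that getField() returns them.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Triceps): reads the int64 element at
// array index pos from a possibly unaligned field buffer.
// Returns 0 if the buffer is NULL or pos is beyond the end of the data.
int64_t readInt64At(const char *ptr, intptr_t len, int pos = 0)
{
    if (ptr == nullptr || pos < 0)
        return 0;
    intptr_t off = (intptr_t)pos * (intptr_t)sizeof(int64_t);
    if (len < off + (intptr_t)sizeof(int64_t))
        return 0;
    int64_t val;
    // memcpy() works for any alignment, unlike dereferencing a cast pointer
    std::memcpy(&val, ptr + off, sizeof(val));
    return val;
}
```

The position is an array index here too, so the byte offset is computed inside the helper.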

Since the strings are normally stored 0-terminated (but it's your responsibility to store them 0-terminated!), getString() just returns the pointer directly to the value in the field. If the string field is NULL, it returns not a NULL pointer but a pointer to an empty string, in the same spirit of treating the undefined values as zeroes or empty strings. If you want to explicitly check for NULLs or get the string field length (including the \0 at the end), use getField(). Since there are no string arrays, there is no position argument for getString().

As a side note, the arguments of these calls are Row*, not Rowref. It's cheaper, and OK for memory management because the row is expected to be held in a Rowref variable anyway while the data is extracted from it. Don't construct an anonymous Rowref object and immediately try to extract a value from it!

int64_t val = rt1->getInt64(Rowref(rt1, datavec), nf); // WRONG!

However if you have a data vector, there is no point in constructing a row to extract back the same data in the first place.

While writing this, I was struck by how ugly these calls look on a Rowref:

Rowref r1(...);
...
int64_t val = r1.getType()->getInt64(r1, 3);

So I've added the matching convenience methods on Rowref, like:

int64_t val = r1.getInt64(3);

They will be available in version 1.1. Note that they are called with ".", not "->". The "." makes them called directly on the Rowref object, while "->" would have meant that the Rowref is dereferenced to a Row pointer, and then a method would be called on the Row object at that pointer.
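The difference between "." and "->" on a reference object can be shown with a toy smart reference. The classes DemoRow and DemoRowref below are made up for this sketch and are not the Triceps classes; they only show which object each operator reaches.

```cpp
#include <cstdint>

// Hypothetical stand-ins, for illustration only.
struct DemoRow {
    int64_t payload;
};

class DemoRowref {
public:
    explicit DemoRowref(DemoRow *r) : row_(r) {}
    // Called as ref.getInt64(): a method of the reference object itself.
    int64_t getInt64() const { return row_ ? row_->payload : 0; }
    // Called as ref->payload: dereferences to the row first.
    DemoRow *operator->() const { return row_; }
private:
    DemoRow *row_;
};
```

With a DemoRowref named ref, ref.getInt64() runs the method of DemoRowref, while ref->payload goes through operator->() to the DemoRow.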

Continuing with the type methods, the constructor and destructor for the rows are also here:

Row *makeRow(FdataVec &data_) const;
void destroyRow(Row *row) const;

makeRow() has already been discussed, and normally you never need to call destroyRow() manually: Rowref takes care of that. If you ever do the destruction manually, remember to honor the reference counts and call destroyRow() only after the reference count has dropped to 0.

Another method compares the rows for absolute equality:


bool equalRows(const Row *row1, const Row *row2) const;

Right now it's defined to work only on the rows of the same type, including the same representation (but since only one CompactRowType representation is available, this is not a problem). When more representations become available, it will likely be extended. The FIFO index uses this method to find the rows by value.

The final method is provided for debugging:

void hexdumpRow(string &dest, const Row *row, const string &indent="") const;

It makes a hex dump of the internal representation of the row and appends it to the dest string. It's a very low-level method that requires knowledge of the internal layout of a row, and is useful for investigating memory corruption.

Saturday, August 25, 2012

Row, Rowref and RowType

A row is defined, naturally, with the class Row, which is fundamentally an opaque buffer. You can't do anything with it directly other than having a pointer to it. You can't even delete a Row object using that pointer. To do anything with a Row, you have to go through that row's RowType. There are some helper classes, like CompactRow, but you don't need to concern yourself with them: they are helpers for the appropriate row types and are never used directly.

That opaque buffer is internally wired for the reference counting, of the Mtarget multithreaded variety. The rows can be passed and shared freely between the threads. No locks are needed for that (other than in the reference counter), the thread-safety is achieved by the rows being immutable. Once a row is created, it stays the same. If you need to change a row, just create a new row with the new contents. Basically, it's the same rules as in the Perl API.

The tricky part in the C++ API is that you can't simply use an Autoref<Row> for rows. As mentioned before, it won't know how to destroy the Row when its reference counter goes to zero. Instead you use a special variety of it called Rowref, defined in type/RowType.h, and described in a previous post. To summarize, it holds a reference both to the Row (that keeps the data) and to the RowType (that knows how to work with the Row). The RowType must be correct for the Row. It's possible to combine a completely unrelated Row and RowType, and the result will be at best some garbage data, or at worst a program crash. The Perl wrapper goes to great lengths to make sure that this doesn't happen. In the C++ API you're on your own. You gain the efficiency at the price of higher responsibility.
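The idea of a reference that keeps both the row and its type, and asks the type to destroy the row when the last reference goes away, can be sketched as follows. All the Counted* names are hypothetical stand-ins, and the counter is a plain int for brevity, so unlike the real multithreaded reference counter this sketch is not thread-safe.

```cpp
// Hypothetical stand-ins, for illustration only.
struct CountedRow {
    int refcount;
};

struct CountedRowType {
    mutable int destroyed = 0; // counts the destroyRow() calls, for the demo
    CountedRow *makeRow() const { return new CountedRow{0}; }
    void destroyRow(CountedRow *r) const { ++destroyed; delete r; }
};

class CountedRowref {
public:
    CountedRowref(const CountedRowType *t, CountedRow *r) : type_(t), row_(r) {
        ++row_->refcount;
    }
    CountedRowref(const CountedRowref &o) : type_(o.type_), row_(o.row_) {
        ++row_->refcount;
    }
    CountedRowref &operator=(const CountedRowref &) = delete; // kept minimal
    ~CountedRowref() {
        // Only the type knows how to destroy the row, and it gets called
        // only when the last reference goes away.
        if (--row_->refcount == 0)
            type_->destroyRow(row_);
    }
private:
    const CountedRowType *type_;
    CountedRow *row_;
};
```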

The general rule is that it's safe to combine a Row and RowType if this RowType matches the RowType used to create that row. The matching RowTypes may have different names of the fields but the same substance.

A Row is created similarly to a RowType: build a vector describing the values in the row, call the constructor, you get the row. The vector type is FdataVec, and its element type is Fdata. Both of them are top-level (i.e. Triceps::FdataVec and Triceps::Fdata), not inside some other class, and both are defined in type/RowType.h.

An Fdata describes the data for one field. It tells whether the field is not null, and if so, where to find the data to place into that field. It doesn't know anything about the field types or such. It deals with the raw bytes: the pointer to the first byte of the value, and the number of bytes. As a special case, if you want the field to be filled with zeroes, set the data pointer to NULL. It is possible to specify an incorrect number of bytes, for example create an int64 field of 3 bytes. This data will be garbage, and if it happens to be at the end of the row, might cause a crash. It's your responsibility to store the correct data. The same goes for the string fields: it's your responsibility to make sure that the data is terminated with a '\0', and that '\0' is included into the length of the data. On the other hand, the uint8[] fields don't need a '\0' at the end, all the bytes included into them are a part of the value.

The data vector gets constructed similarly to the field vector: either start with an empty vector and push back the elements, or allocate one of the right size and set the elements. The relevant Fdata constructors and methods are:

Fdata(bool notNull, const void *data, intptr_t len);
void setPtr(bool notNull, const void *data, intptr_t len);
void setNull();

setNull() is a shortcut for setPtr() that sets notNull to false and ignores the other fields. In version 1.0 the default Fdata constructor leaves all the fields uninitialized. I've changed this now for version 1.1 to set notNull to false by default.

For example:

uint8_t v_uint8[10] = "123456789"; // just a convenient representation
int32_t v_int32 = 1234;
int64_t v_int64 = 0xdeadbeefc00c;
double v_float64 = 9.99e99;
char v_string[] = "hello world";

FdataVec fd1;
fd1.push_back(Fdata(true, &v_uint8, sizeof(v_uint8)-1)); // exclude \0
fd1.push_back(Fdata(true, &v_int32, sizeof(v_int32)));
fd1.push_back(Fdata(false, NULL, 0)); // a NULL field
fd1.push_back(Fdata(true, &v_float64, sizeof(v_float64)));
fd1.push_back(Fdata(true, &v_string, sizeof(v_string)));

Rowref r1(rt1,  rt1->makeRow(fd1));
Rowref r2(rt1,  fd1);





The Rowref constructor from Fdata vector calls the makeRow() implicitly, for convenience, so both forms provide the same result. For another example that allocates a vector and then fills it:

FdataVec fd2(3);
fd2[0].setPtr(true, &v_uint8, sizeof(v_uint8)-1); // exclude \0
fd2[1].setNull();
fd2[2].setFrom(r1.getType(), r1.get(), 2); // copy from r1 field 2

Rowref r3(rt1,  fd2);

The field 2 is set by copying it from a field of another row. It sets the data pointer to the location inside the original row, and the data will be copied when the new row gets created. So make sure to not release the reference to the original row until the new row is created. The prototype is:

void setFrom(const RowType *rtype, const Row *row, int nf);

In fd2 the vector is smaller than the number of fields in the row. The rest of fields are filled with NULLs. They actually are literally filled with NULLs in fd2: if the size of the argument vector for makeRow() is smaller than the number of fields in the row type, the vector gets extended with the NULL values before anything is done with it. It's no accident that the argument of the RowType::makeRow() is not const:

class RowType {
    virtual Row *makeRow(FdataVec &data) const;
};

class Rowref {
    Rowref(const RowType *t, FdataVec &data);
    Rowref &operator=(FdataVec &data);
};
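Why the vector argument is not const can be seen in a small stand-alone sketch of the padding step. ToyFdata and padWithNulls() are hypothetical names made up for the illustration; the real makeRow() does more than this.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Fdata, reduced to the null flag.
struct ToyFdata {
    bool notNull = false; // NULL by default, like Fdata in version 1.1
};

// Pads the caller's vector with NULL elements up to the field count.
// The argument is a non-const reference because it gets modified.
// Returns the number of elements appended.
size_t padWithNulls(std::vector<ToyFdata> &data, size_t fieldCount)
{
    if (data.size() >= fieldCount)
        return 0;
    size_t added = fieldCount - data.size();
    data.resize(fieldCount); // the new elements are default-constructed: NULL
    return added;
}
```

After the call the caller's vector has grown, which is the side effect to keep in mind when reusing a data vector.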

It's also possible to have more elements in the FdataVec than in the row type. In this case the extra arguments are considered the "overrides": the "main" elements set the size of the fields while the "overrides" copy the data fragments over that. It's a convenient way to assemble the array fields from the fragments, for example:

RowType::FieldVec fields4;
fields4.push_back(RowType::Field("a", Type::r_int64, RowType::Field::AR_VARIABLE));

Autoref<RowType> rt4 = new CompactRowType(fields4);
if (rt4->getErrors()->hasError())
    throw Exception(rt4->getErrors(), true);

FdataVec fd4;
Fdata fdtmp;
fd4.push_back(Fdata(true, NULL, sizeof(v_int64)*10)); // allocate space for 10 int64 elements
fd4.push_back(Fdata(0, sizeof(v_int64)*2, &v_int64, sizeof(v_int64)));
// fill a temporary element with setOverride and then insert it
fdtmp.setOverride(0, sizeof(v_int64)*4, &v_int64, sizeof(v_int64));
fd4.push_back(fdtmp);
// manually copy an element from r1
fdtmp.nf_ = 0;
fdtmp.off_ = sizeof(v_int64)*5;
r1.getType()->getField(r1.get(), 2, fdtmp.data_, fdtmp.len_);
fd4.push_back(fdtmp);

Rowref r4(rt4,  fd4);

This creates a row type with a single field "a" at index 0, an array of int64. The 0th element of the data vector fd4 defines the space for 10 elements in the array, filled by default with zeroes. It doesn't have to zero them; it could copy the data from some location in memory instead. I've just done the zeroing here to show how it can be done.

The rest of the elements are the "overrides", constructed in different ways.

The first one uses the override constructor:

Fdata(int nf, intptr_t off, const void *data, intptr_t len);

Here nf is the number of the field whose contents to override, off is the byte offset in it, and data and len point to the location to copy from as usual. In this case the element at index 2 (counting from 0) of the array gets set to the value from v_int64.

The second override uses the method setOverride() for the same purpose:

void setOverride(int nf, intptr_t off, const void *data, intptr_t len);

It sets a temporary Fdata which then gets appended (copied) to the vector. This override sets the array element at index 4 to the same value of v_int64.

The third override copies the value from the row r1. Since there is no ready method for this purpose (perhaps there should be?), it goes about its way manually, setting the fields explicitly. nf_ is the same as nf in the methods, the field number to override. off_ is the offset. And the location and length get filled into data_ and len_ by getField(), which takes the data from the row r1, field 2.

But wait, the field 2 of r1 has been set to NULL! Should not the NULL indication be set in the copy as well? As it turns out, no. The NULL indication (the field notNull_ being set to false) is ignored by makeRow() in the override elements. However getField() will set the length to 0, so nothing will get copied. The value at index 5 will be left as it was initially set, which happens to be 0.

So in the end the values in the field "a" array at indexes 2 and 4 will be set to the same as v_int64, and the values at the other indexes in the range 0..9 will be 0.

If multiple overrides specify the overlapping ranges, they will just sequentially overwrite each other, and the last one will win.

If an override attempts to specify writing past the end of the originally reserved area of the field, it will be quietly ignored. Just don't do this. If the field was originally set to NULL, its reserved area will be zero bytes, so any overrides attempting to write into it will be silently ignored.

The summary is: the overrides allow building the array values efficiently from disjoint areas of memory, but if they are used, they have to be used with care.
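The override rules described above can be modeled in a few lines for a single field. This is a sketch under assumptions, not the code of makeRow(): ToyOverride and buildField() are made-up names, and the sketch assumes that an override that doesn't fit into the reserved area is skipped as a whole.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an override element of one field.
struct ToyOverride {
    intptr_t off;     // byte offset inside the field
    const void *data; // bytes to copy from
    intptr_t len;     // number of bytes to copy
};

// Builds the field contents: reserves mainLen bytes (zero-filled when
// mainData is NULL), then applies the overrides in order, so that the
// later ones overwrite the earlier ones.
std::vector<char> buildField(const void *mainData, intptr_t mainLen,
    const std::vector<ToyOverride> &overrides)
{
    if (mainLen < 0)
        mainLen = 0;
    std::vector<char> buf((size_t)mainLen, '\0');
    if (mainData != nullptr && mainLen > 0)
        std::memcpy(buf.data(), mainData, (size_t)mainLen);
    for (const ToyOverride &ov : overrides) {
        // Assumption: an override with nothing to copy or one that goes
        // past the end of the reserved area is skipped entirely.
        if (ov.data == nullptr || ov.off < 0 || ov.len <= 0
        || ov.off + ov.len > mainLen)
            continue;
        std::memcpy(buf.data() + ov.off, ov.data, (size_t)ov.len);
    }
    return buf;
}
```

Feeding it a zero-filled area of 10 int64 elements and overrides at the byte offsets of elements 2 and 4 gives the same picture as the fd4 example: two elements set, the rest left at 0.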

RowType in C++, part 3

The information about the contents of a RowType can be read back:

int fieldCount() const;
const vector<Field> &fields() const;
int findIdx(const string &fname) const;
const Field *find(const string &fname) const;

fields() has already been shown. fieldCount() returns the count of fields. findIdx() finds the index of the field by name, so that it can then be looked up in the result of fields(), or returns -1 if there is no such field. find() combines these two actions and directly returns the pointer to the field by name, or NULL if there is no such field.
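The relation between these look-up methods can be sketched on a toy class. NamedField and LookupRowType are hypothetical stand-ins, and the linear search is an assumption made for simplicity, not a statement about how RowType does it.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins, for illustration only.
struct NamedField {
    std::string name;
};

class LookupRowType {
public:
    explicit LookupRowType(const std::vector<NamedField> &f) : fields_(f) {}
    int fieldCount() const { return (int)fields_.size(); }
    const std::vector<NamedField> &fields() const { return fields_; }
    // Returns the index of the field, or -1 if there is no such field.
    int findIdx(const std::string &fname) const {
        for (int i = 0; i < fieldCount(); i++)
            if (fields_[i].name == fname)
                return i;
        return -1;
    }
    // Combines findIdx() with the look-up in fields(); NULL if not found.
    const NamedField *find(const std::string &fname) const {
        int idx = findIdx(fname);
        return idx < 0 ? nullptr : &fields_[idx];
    }
private:
    std::vector<NamedField> fields_;
};
```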

The rest of the RowType methods have to do with the manipulation of the rows. Remember, the rows are not virtual, in a micro-optimization to save a little bit of space, so the RowType methods act as virtuals for the rows. They will be described momentarily, after an introduction to the rows.

RowType in C++, part 2

Let's get to constructing the row types. To reiterate the last post, you don't construct the objects of RowType class itself, it's an abstract class. You construct the objects of the concrete subclass(es), specifically CompactRowType. Make a vector describing the fields and do the construction.

You can make the vector by either starting with an empty one and adding the fields to it or allocating a vector of the right size in advance and setting the fields in it.

RowType::FieldVec fields1;
fields1.push_back(RowType::Field("a", Type::r_int64)); // scalar by default
fields1.push_back(RowType::Field("b", Type::r_int32, RowType::Field::AR_SCALAR));
fields1.push_back(RowType::Field("c", Type::r_uint8, RowType::Field::AR_VARIABLE));

RowType::FieldVec fields2(2);
fields2[0].assign("a", Type::r_int64); // scalar by default
fields2[1].assign("b", Type::r_int32, RowType::Field::AR_VARIABLE);

You can also reuse the same vector and clean/resize it as needed to create more types.

If you're used to laying out the C structures placing the larger elements first for the more efficient alignment, know that this is not needed for the Triceps rows. The CompactRowType stores the row data unaligned, so any field order will result in the same size of the rows. And it can't make use of some fields happening to be aligned either.
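The point about the field order can be checked with simple arithmetic. The sketch below only adds up the sizes of the field values and ignores whatever bookkeeping the real row format keeps; packedSize() and AlignedExample are names made up for the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum of the field value sizes: the payload of a row stored without any
// alignment padding. The order of the fields doesn't affect it.
size_t packedSize(const std::vector<size_t> &fieldSizes)
{
    size_t total = 0;
    for (size_t sz : fieldSizes)
        total += sz;
    return total;
}

// For comparison: a C struct with the same scalar fields, which the
// compiler typically pads to keep every member aligned.
struct AlignedExample {
    char a;
    int64_t b;
    int32_t c;
};
```

For the sizes 1, 8 and 4 the packed payload is 13 bytes in any order, while sizeof(AlignedExample) is usually larger and depends on the member order and on the platform.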

You can also find the simple types by their string names:

fields1.push_back(RowType::Field("d", Type::findSimpleType("uint8"), RowType::Field::AR_VARIABLE));

If the type name is incorrect and the type is not found, findSimpleType() will return NULL, and this NULL will be caught later at the row type creation time. Note that there is no automatic look-up of the array types. You can't simply pass "uint8[]" to findSimpleType(). You have to break it up into the simple type name as such and the array indication, like it's done in perl/Triceps/RowType.xs. This would probably be a good thing to add to RowType::Field in the future.
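Breaking up a name like "uint8[]" takes only a few lines. The helper splitTypeName() below is hypothetical, it's neither a part of Triceps nor a copy of the code in RowType.xs, and it accepts only an empty pair of brackets at the very end of the name.

```cpp
#include <string>

// Hypothetical helper (not part of Triceps): splits a type name such as
// "uint8[]" into the simple type name and an array flag.
// Returns false if the name is malformed, for example "uint8[" or "[]".
bool splitTypeName(const std::string &full, std::string &simple, bool &isArray)
{
    std::string::size_type pos = full.find('[');
    if (pos == std::string::npos) {
        simple = full;
        isArray = false;
        return !simple.empty();
    }
    // Only "[]" right at the end, after a non-empty name, is accepted.
    if (pos == 0 || full.compare(pos, std::string::npos, "[]") != 0)
        return false;
    simple = full.substr(0, pos);
    isArray = true;
    return true;
}
```

The resulting simple name can then go to Type::findSimpleType(), and the flag can be translated to RowType::Field::AR_VARIABLE or RowType::Field::AR_SCALAR.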

You can't use the type Type::r_void for the fields, it will be reported as an error.

After the fields array is created, create the row type:

Autoref<RowType> rt1 = new CompactRowType(fields1);
if (rt1->getErrors()->hasError())
    throw Exception(rt1->getErrors(), true);


You could also use Autoref<CompactRowType> but there isn't any point to it, since all the methods of CompactRowType are virtuals inherited from RowType.

Don't forget to check that the constructed type has no errors, and bail out if so. Throwing an Exception is a convenient way to abort with a nice error message. I have plans to add a function checkOrThrow() that will replace this "if", but the details are to be worked out yet. A type with errors can't be used for anything, or it will cause the program to crash.

The RowType and its subclasses are immutable after construction, so they can be shared all you want. If you really need to create a copy, you can do it:

Autoref<RowType> rt2 = new CompactRowType(rt1);
if (rt2->getErrors()->hasError())
    throw Exception(rt2->getErrors(), true);


Checking the errors after the copy creation is kind of optional if the original type was correct, but it's better to be safe than sorry.

You can get back the information about the fields:

const RowType::FieldVec &f = rt1->fields();

It's a reference to the vector directly inside the RowType, so const reminds you not to change it (that vector is a copy of the vector used during the construction, so the original vector can be changed afterwards). If you want to extend a type with more fields, make a copy of its fields and extend it:

RowType::FieldVec fields3 = rt1->fields();
fields3.push_back(RowType::Field("z", Type::r_string));
Autoref<RowType> rt3 = new CompactRowType(fields3);
if (rt3->getErrors()->hasError())
    throw Exception(rt3->getErrors(), true);

That's about it for the RowType construction.

Tuesday, August 21, 2012

RowType in C++, part 1

In the Perl API a row type is a collection of fields. Under the hood things are more complicated. In the C++ API Triceps allows for more flexibility, more ways to represent a row. The row type is represented by the abstract base class RowType that tells the logical structure of a row and by its concrete subclasses that define the concrete layout of data in a row. To create, read or manipulate a row, all you need to know is a reference to RowType. It would refer to a concrete row type object, and the concrete row operations are accessed by the virtual methods. But when you create a row type, you need to know which concrete row type it will belong to.

Currently the choice is easy: there is only one such concrete subclass CompactRowType. The "compact" means that the data is stored in the rows in a compact form, one field value after another, without alignment. Perhaps some day there will be an AlignedRowType, allowing to read the values more efficiently. Or perhaps some day there will be a ZippedRowType that would store the data in the compressed format.

You would never use the RowType constructor directly, it's called from the subclasses. But every subclass is expected to define a similar constructor:

RowType(const FieldVec &fields);
CompactRowType(const FieldVec &fields);

The FieldVec is the definition of fields in the row type. It's defined simply as:

typedef vector<Field> FieldVec;

An important side note is that the field is defined within the RowType, so it's really RowType::FieldVec and RowType::Field, and you need to refer to them in your code by this qualified name. So, to create a row type, you create a vector of field definitions first and then construct the row type from it. You can throw away or modify that vector afterwards.

As usual, the constructor arguments might not be correct, and any errors will be remembered and returned with getErrors(). Don't use a type with errors (other than to read the error messages from it, and to destroy it), it might cause your program to crash.

A Field consists of the basic information about it: the name, the type, and the array indication (remember, a Triceps field may contain an array). The array indication is either RowType::Field::AR_SCALAR for a scalar value or RowType::Field::AR_VARIABLE for a variable-sized array. The original plan was also to use the integer values for the fixed-sized array fields, but in reality the variable-sized array fields have turned out to be easier to implement and that was it. So don't use the integer values. Most probably they would work like the same variable-sized arrays but they haven't been tested, and something somewhere might crash. Use the symbolic enum AR_*.

The normal Field constructor provides all this information:

Field(const string &name, Autoref<const Type> t, int arsz = AR_SCALAR);

Or you can use the default constructor and later change the fields in the Field (or of course read them) as you please:

string name_;
Autoref <const Type> type_;
int arsz_;

Or you can assign them in one fell swoop:

void assign(const string &name, Autoref<const Type> t, int arsz = -1);

Note that even though theoretically you can define a field of any Type, in practice it has to be of a SimpleType, or the RowType constructor will return an error later. Why isn't it defined as an Autoref<SimpleType> then? The grand plan is to allow some more interesting data structures in the rows, and this keeps the door open. In particular, the rows will be able to hold references to the other rows, I just haven't gotten around to implementing it yet.

Once again, a RowType constructor makes a copy of the FieldVec for its use, so you can modify or destroy the original FieldVec right away. You can get back the information about the fields in RowType:

const vector<Field> &fields() const;

It returns a reference directly to the FieldVec contained in the row type, so you must never modify it! The const-ness gives a reminder about it.

There are more row type constructors (but no default one). First, each subclass variety is supposed to be able to construct its variety by copying any RowType:

CompactRowType(const RowType &proto);
CompactRowType(const RowType *proto);

The version with the pointer argument also works for passing an Autoref<RowType> as the argument, which gets automatically converted to a pointer. And it's really used more typically than the reference version.

The resulting type will have the same logical structure but possibly a different representation than the original. By the way, if you care only about the logical structure but not representation, you still can't directly construct a RowType because it's an abstract class. But just construct any concrete subclass, say CompactRowType (since it's the only one available at the moment anyway), and then use its logical structure.

The other constructor variety is a factory method:

virtual RowType *newSameFormat(const FieldVec &fields) const;

It combines the representation format from one row type and the arbitrary logical structure (the fields vector), possibly from another row type. Of course, until more concrete type representations become available, its use is purely theoretical.

Simple types

The simple types are defined as instances of the abstract class SimpleType, and have one method in addition to the base Type:

int getSize() const;

It returns the size of the value of this type. For void it's 0, for string 1 (the minimal string size), for the rest of them it's a sizeof. This size is used to extract the values from and copy the values to the compact row format.
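The sizes listed above can be put into a small table-like function. The enum ToyTypeId and the function toyGetSize() are hypothetical stand-ins for the illustration; the size 1 for the string is presumably the terminating '\0' of an empty string.

```cpp
#include <cstdint>

// Hypothetical stand-in for the ids of the simple types.
enum ToyTypeId { TOY_VOID, TOY_UINT8, TOY_INT32, TOY_INT64, TOY_FLOAT64, TOY_STRING };

// Returns the size of one value of the simple type, following the rules
// from the text: 0 for void, 1 for string, sizeof for the rest.
int toyGetSize(ToyTypeId id)
{
    switch (id) {
    case TOY_VOID: return 0;
    case TOY_UINT8: return (int)sizeof(uint8_t);
    case TOY_INT32: return (int)sizeof(int32_t);
    case TOY_INT64: return (int)sizeof(int64_t);
    case TOY_FLOAT64: return (int)sizeof(double);
    case TOY_STRING: return 1; // the minimal string size
    }
    return 0;
}
```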

For now this is the absolute minimum of information that makes the data usable. The list of methods will be extended over time. For example, the methods for value comparisons will eventually go here. And if the rows will ever hold the aligned values, the alignment information too.

The SimpleType is defined in type/SimpleType.h, and all the actual simple types are defined in type/AllSimpleTypes.h:

VoidType
Uint8Type
Int32Type
Int64Type
Float64Type
StringType

Wednesday, August 15, 2012

Types

Fundamentally, Triceps is a language, even though it is piggybacking on the other languages. And like in pretty much any programming language, pretty much anything in it has a type. Only the tip of that type system is exposed in the Perl API, as the RowType and TableType. But the C++ API has the whole depth. The types go all the way down to the simple types of the fields.

The classes for types are generally defined in the subdirectory type/. The class Type, defined in type/Type.h, is the common base class.

First, every kind of type has its entry in the enum TypeId:

        TT_VOID, // no value
        TT_UINT8, // unsigned 8-bit integer (byte)
        TT_INT32, // 32-bit integer
        TT_INT64,
        TT_FLOAT64, // 64-bit floating-point, what C calls "double"
        TT_STRING, // a string: a special kind of byte array
        TT_ROW, // a row of a table
        TT_RH, // row handle: item through which all indexes in the table own a row
        TT_TABLE, // data store of rows (AKA "window")
        TT_INDEX, // a table contains one or more indexes for its rows
        TT_AGGREGATOR, // user piece of code that does aggregation on the indexes
        TT_ROWSET, // an ordered set of rows

TT_ROWSET is something added in version 1.1.0, it will be described later. TT_VOID is pretty much a placeholder, in case a void type is needed later. The TypeId gets hardcoded in the constructor of every Type subclass. It can be gotten back with the method

TypeId getTypeId() const;

Another method finds out if the type is the simple type of a field:

bool isSimple() const;

It would be true for the types with the ids TT_VOID, TT_UINT8, TT_INT32, TT_INT64, TT_FLOAT64, TT_STRING.


Generally, you can check the TypeId and then cast the Type pointer to its subclass. All the simple types have the common base class SimpleType, which will be described in a moment.


There is also a static Type method that finds a simple type object by name (like "int32", "string" etc.):


static Onceref<const SimpleType> findSimpleType(const char *name);


Basically, there is not a whole lot of point in having lots of copies of the simple type objects (though if you want, you can). So there is one common copy of each simple type that can be found by name. If the type is known when you compile your C++ program, you can even avoid the look-up and refer to these objects directly.

    static Autoref<const SimpleType> r_void;
    static Autoref<const SimpleType> r_uint8;
    static Autoref<const SimpleType> r_int32;
    static Autoref<const SimpleType> r_int64;
    static Autoref<const SimpleType> r_float64;
    static Autoref<const SimpleType> r_string;

The type construction may cause errors. It is usually done either by a single constructor with all the needed arguments, or by a simple constructor, then additional methods to add the information in bits and pieces, then an initialization method. In both cases there is a problem of how to report the errors. They're not easy to return from a constructor and a pain to check in the bit-by-bit construction.

Instead the error information gets internally collected in an Errors object, and can be read after the construction and/or initialization is completed:

virtual Erref getErrors() const;

A type with errors may not be used for anything other than reading the errors.
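The pattern of collecting the errors during the construction and reading them afterwards can be sketched like this. ToyType is a hypothetical stand-in that keeps the messages in a plain vector of strings, and the two checks in it (empty and duplicate field names) are invented just to have something to report.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a type that remembers its construction errors.
class ToyType {
public:
    explicit ToyType(const std::vector<std::string> &fieldNames) {
        for (size_t i = 0; i < fieldNames.size(); i++) {
            if (fieldNames[i].empty()) {
                errors_.push_back("field " + std::to_string(i) + " has an empty name");
                continue;
            }
            for (size_t j = 0; j < i; j++)
                if (fieldNames[j] == fieldNames[i])
                    errors_.push_back("duplicate field name '" + fieldNames[i] + "'");
        }
    }
    // The constructor never fails; the caller checks afterwards.
    bool hasError() const { return !errors_.empty(); }
    const std::vector<std::string> &getErrors() const { return errors_; }
private:
    std::vector<std::string> errors_;
};
```

The caller constructs the object, checks hasError(), and reads the messages if the check fails, much like the `if (rt1->getErrors()->hasError())` idiom used with the row types.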

The rest of the common virtual methods has to do with the type comparison and print-outs. The comparison methods essentially check if two type objects are aliases for each other:

virtual bool equals(const Type *t) const;
virtual bool match(const Type *t) const;

The concept has been previously described with the Perl API. The equal types are exactly the same. The matching types are the same except for the names of their elements, so it's generally safe to pass the values between these types.

equals() is also available as operator==.
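The difference between the equal and matching types can be illustrated on simplified field descriptions. ToyField, matchFields() and equalFields() are hypothetical names for this sketch; the real methods are virtual and apply to every kind of type, not only to lists of fields.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified field description: name, type name, array flag.
struct ToyField {
    std::string name;
    std::string type;
    bool isArray;
};
typedef std::vector<ToyField> ToyFieldVec;

// The matching types agree on everything except the field names.
bool matchFields(const ToyFieldVec &a, const ToyFieldVec &b)
{
    if (a.size() != b.size())
        return false;
    for (size_t i = 0; i < a.size(); i++)
        if (a[i].type != b[i].type || a[i].isArray != b[i].isArray)
            return false;
    return true;
}

// The equal types also agree on the field names.
bool equalFields(const ToyFieldVec &a, const ToyFieldVec &b)
{
    if (!matchFields(a, b))
        return false;
    for (size_t i = 0; i < a.size(); i++)
        if (a[i].name != b[i].name)
            return false;
    return true;
}
```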

The print methods create a string representation of a type, used mostly for the error messages. There is no method to parse this string representation back, at least not yet.

virtual void printTo(string &res, const string &indent = "", const string &subindent = "  ") const = 0;
string print(const string &indent = "", const string &subindent = "  ") const;

printTo() appends the information to an existing string. print() returns a new string with the message. print() is a wrapper around printTo() that creates an empty string, does printTo() into it and returns it.
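The wrapper relation between the two methods looks roughly like this. ToyPrintable and ToyInt32 are hypothetical classes, and the indentation arguments are left out to keep the sketch short.

```cpp
#include <string>

// Hypothetical base class, for illustration only.
class ToyPrintable {
public:
    virtual ~ToyPrintable() {}
    // Appends the representation to an existing string.
    virtual void printTo(std::string &res) const = 0;
    // Wrapper: creates an empty string, prints into it and returns it.
    std::string print() const {
        std::string res;
        printTo(res);
        return res;
    }
};

class ToyInt32 : public ToyPrintable {
public:
    void printTo(std::string &res) const override { res += "int32"; }
};
```

Only printTo() needs to be defined in each subclass, and print() comes for free from the base class.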

The printing is normally done in a multi-line format, nicely indented, and the arguments indent and subindent define the initial indent level and the additional indentation for every level.

There is also a way to print everything in one line: pass the special constant NOINDENT (defined in common/StringUtil.h) in the argument indent. This is similar to using an undef for the same purpose in the Perl API.

The definitions of all the types are collected together in type/AllTypes.h.

Saturday, August 11, 2012

Developer's Guide on Kindle

By the way, if you want to read the Developer's Guide on Kindle, you can get it directly from Amazon for their minimal price of $1:

http://www.amazon.com/Complex-Event-Processing-Triceps-ebook/dp/B008T63HKW

Or I think it should be available in the Kindle library too, if you are subscribed to it. Or of course you can just download the official PDF or HTML for free.

The Exception

There are different ways to report the errors. Sometimes a function would return a false value. Sometimes it would return an Erref with an error in it. And there is also a way to throw the exceptions.

In general I don't particularly like the exceptions. They tend to break the logic in the unexpected ways, and if not handled properly, mess up the multithreading. The safe way of working with exceptions is with the scope-based variables. This guarantees that all the allocated memory will be freed and all the locked data will be unlocked when the block exits, naturally or on an exception. However not everything can be done with the scopes, and this results in a whole mess of try-catch blocks, and a missed catch can mess up the program in horrible ways.
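The scope-based approach can be demonstrated with a minimal guard object. ToyGuard and the functions around it are made up for this sketch; in real code the same role is played by the reference holders and lock holders whose destructors do the clean-up.

```cpp
#include <stdexcept>

// Hypothetical scope guard: marks a resource as held for the lifetime
// of the guard object.
class ToyGuard {
public:
    explicit ToyGuard(bool &held) : held_(held) { held_ = true; }
    ~ToyGuard() { held_ = false; }
    ToyGuard(const ToyGuard &) = delete;
    ToyGuard &operator=(const ToyGuard &) = delete;
private:
    bool &held_;
};

// Throws while holding the resource; the guard's destructor still runs
// during the stack unwinding and releases it.
void failWhileHolding(bool &held)
{
    ToyGuard guard(held);
    throw std::runtime_error("something went wrong");
}

// Returns true if the resource is free again after the exception.
bool releasedAfterThrow()
{
    bool held = false;
    try {
        failWhileHolding(held);
    } catch (const std::runtime_error &) {
        // by this point the guard has already been destroyed
    }
    return !held;
}
```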

However sometimes the exceptions come in handy. They have been a late addition to version 1.0. They are definitely here to stay for the communication in the XS code and for the user-defined handlers in C++, but other than that I'm not so sure about whether and how they would be used internally by Triceps. Not all the Triceps code works correctly with the exceptions yet, and the experience of converting it for the exception handling has not been entirely positive. So far the only parts that can deal with the exceptions nicely are the scheduler and the user-defined labels. Not the aggregators nor the user-defined indexes.

But for the user C++ code for the most part it doesn't matter. In Triceps the approach is that the exceptions are used for the substantially fatal events. If the user attempts to do something that can't be executed, this qualifies for an exception. Essentially, use the exceptions for the things that qualify for the classic C abort() or assert(). The idea is that at this point we want to print an error message, print the call stack the best we can, and dump the core for the future analysis.

Why not just use an abort() then? In the C++ code you certainly can if you're not interested in the extra niceties provided by the exceptions. In fact, that's what the Triceps exceptions do by default: when you construct an exception, they print a log message and the stack trace (using a nice feature of glibc) then abort. The error output gives the basic idea of what went wrong and the rest can be found from the core file created by abort().

However remember that Triceps is designed to be embedded into the interpreted (or compiled too) languages. When something goes wrong inside the Triceps program in Perl, you don't want to get a core dump of the Perl interpreter. An interpreted program must never ever crash the interpreter. You want to get the error reported in the Perl die() or its nicer cousin confess(), and possibly get intercepted in eval{}. So the Perl wrapper of Triceps changes the mode of Triceps exceptions to actually throw the C++ exceptions instead of aborting. Since  the Perl code is not really interested in the details at the C++ level, the C++ stack trace is in this case configured to not be included into the text of the exception. However another interesting thing happens: if the exception happened in a label handler, the Triceps scheduler stack gets unwound and the information about it gets included. Eventually the XS interface does an analog of confess(), including the Perl stack trace. When the code goes through multiple layers of Perl and C++ code (Perl code calling the Triceps scheduler, calling the label handlers in Perl, calling the Triceps scheduler again etc.), the whole layered sequence gets nicely unwound and reported. However the state of the scheduler suffers along the way: all the scheduled rowops get freed when their stack frame is unwound, so prepare to repair the state of your model if you catch the exception.

If you are willing to handle the exceptions (for example, if you add elements dynamically by user description and don't want the whole program to abort because of one faulty description), you can do the same in C++. Just disable the abort mode for the exceptions and catch them. Of course, it's even better to catch your exceptions before they reach the Triceps scheduler, since then you won't have to repair the state.

The same feature comes in handy in the unit tests: when you test for the detection of a fatal error, you don't want your test to abort, you want it to throw a nice catchable exception.

After all this introductory talk, to the gritty details. The class is Exception (as usual, in the namespace Triceps or whatever custom namespace you define as TRICEPS_NS), defined in common/Exception.h. Inside it is an Erref with the errors. An Exception can be constructed in multiple ways:

explicit Exception(Onceref<Errors> err, bool trace);

The basic constructor. If trace==true, the C++ stack trace will be added to the messages, if it is otherwise permitted by the exception modes. If trace==false, the stack trace definitely won't be added. Why would you want to not add the stack trace? Typically, when you catch an exception, add some information to it and re-throw a new exception. The information from the original exception will contain the full stack trace, so there is no need to include the partial stack trace again. Also, if you throw an exception with high-level information (in Perl or such), you don't need to put any C++ stack info into it.

The Errors are remembered by reference, so changing them later will change the contents of the exception.

explicit Exception(const string &err, bool trace);

A convenience constructor to make an exception from a simple string with the error. Internally it creates an Errors object with the string in it. The string usually gets created with strprintf().

explicit Exception(Onceref<Errors> err, const string &msg);
explicit Exception(Onceref<Errors> err, const char *msg);
explicit Exception(const Exception &exc, const string &msg);

Wrapping a nested error with a descriptive message and re-throwing it.

virtual const char *what();

The usual, returns the text of the error messages in the exception.

virtual Errors *getErrors() const;

Returns the Errors object from the exception.

The modes I've mentioned before are set with the class static variables:

static bool abort_;

Flag: when attempting to create an exception, instead print the message and abort. This behavior is more convenient for debugging of the C++ programs, and is the default one. Also forces the stack trace in the error reports. The interpreted language wrappers should reset it to get the proper exceptions. Default: true.

static bool enableBacktrace_;

Flag: enable the backtrace if the constructor requests it. The interpreted language wrappers should reset it to remove the confusion of the C stack traces in the error reports. Default: true.
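
To make the two modes more concrete, here is a minimal self-contained sketch. DemoException, parseDescription() and tryDescription() are hypothetical stand-ins made up for illustration, not the real Triceps classes; only the idea of the static abort_ flag and of wrapping a nested exception with a message is taken from the text above. It shows what an embedding layer might do: turn off the abort mode, catch the low-level exception and re-throw it with a higher-level description.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for Triceps::Exception. Only the idea of the
// static abort_ flag comes from the text; the rest is an assumption.
class DemoException
{
public:
    static bool abort_; // true: print the message and abort instead of throwing

    explicit DemoException(const std::string &msg) : msg_(msg)
    {
        checkAbort();
    }

    // Wraps a nested exception with a higher-level description.
    DemoException(const DemoException &exc, const std::string &msg) :
        msg_(msg + "\n  " + exc.msg_)
    {
        checkAbort();
    }

    const char *what() const { return msg_.c_str(); }

private:
    void checkAbort() const
    {
        if (abort_) {
            std::fprintf(stderr, "%s\n", msg_.c_str());
            std::abort();
        }
    }

    std::string msg_;
};

bool DemoException::abort_ = true; // aborting is the default, as in the text

// Pretends to process a user-supplied description, failing on an empty one.
void parseDescription(const std::string &descr)
{
    if (descr.empty())
        throw DemoException("empty description");
}

// What an embedding layer might do: switch to the throwing mode, catch the
// low-level exception, add some context and report the combined text.
std::string tryDescription(const std::string &descr)
{
    DemoException::abort_ = false; // throw instead of aborting
    try {
        try {
            parseDescription(descr);
        } catch (const DemoException &e) {
            throw DemoException(e, "failed to add the element:");
        }
    } catch (const DemoException &e) {
        return e.what();
    }
    return "ok";
}
```

With the flag left at its default of true, the very first DemoException constructed would print its message and abort, which is the behavior convenient for debugging a pure C++ program.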

Error reports

When building some kind of a language, the complicated errors often need to be reported. Often there are many errors at a time, or an error that needs to be tracked through multiple nested layers. And then these error messages need to be nicely printed, with the indentation by nested layers. Think of the errors from a C++ compiler. Triceps is a kind of a language, so it has a class to represent such errors. It hasn't propagated to the Perl layer yet and is available only in the C++ API.

The class is Errors, defined in common/Errors.h, and inheriting from Starget (for single-threaded reference counting). The references to it are used so often, that Autoref<Errors> is typedefed to have its own name Erref (yes, that's 2 "r"s, not 3).

In general it contains messages, not all of which have to be errors. Some might be warnings. But in practice it has turned out that without a special dedicated compile stage it's hard to report the warnings. Even when there is a special compile stage, and the code gets compiled before it runs, as in Aleri, with the warnings written to a log file, people still rarely pay attention to the warnings. You would not believe how many people would call support while the source of their problem is clearly described in the warnings in the log file. Even in C/C++ it's difficult to pay attention to the warnings. I like better the approach of a separate lint tool for this purpose: at least when you run it, you're definitely looking for warnings.

Because of this, the current Triceps approach is to not have warnings. If something looks possibly right but suspicious, report it as an error but provide an option to override that error (and tell about that option in the error message).

In general, the Errors are built of two kinds of information:

  • the error messages
  • the nested Errors reports

More exactly, an Errors contains a sequence of elements, each of which may contain a string, a nested Errors object, or both. When both, the idea is that the string gives a high-level description and the location of the error in the high-level object while the nested Errors dig into the further detail. The string gets printed before the nested Errors. The nested Errors get printed with an indentation. The indentation gets added only when the errors get "printed", i.e. the top-level Errors object gets converted to a string. Until then the elements may be nested every which way without incurring any extra overhead.

Obviously, you must not try to nest an Errors object inside itself, directly or indirectly. Not only will it create a memory reference cycle, but also an endless recursion when printing.

The basic way to create an Errors is:

 Errors(bool e = false);

Where "e" is an indicator that it contains an actual error. It will also be set when an error message is added, or when a nested Errors with an error in it is added.

There also are a number of convenience constructors that make one-off Errors from one element:

Errors(const char *msg);
Errors(const string &msg);
Errors(const string &msg, Autoref<Errors> clde);

In all of them the error flag is always set, and the message is checked for being multi-line (that is, containing '\n' in the middle of it), and if so, it gets broken up into multiple messages, one per line.

When an Errors object is constructed, more elements can be added to it:

void appendMsg(bool e, const string &msg);
void appendMultiline(bool e, const string &msg);
bool append(const string &msg, Autoref<Errors> clde);

The "e" shows whether the message is an error message. In append() the indication of the error presence is taken from the child element clde. The appendMsg() expects a single-line message, don't use a '\n' in it! The appendMultiline() will safely break a multi-line message into multiple single-liners and will ignore the '\n' at the end.
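
The splitting done for the multi-line messages can be pictured with a small stand-alone helper. This is only a sketch: splitLines() is a hypothetical function written for this illustration, not the Triceps implementation, and how the real appendMultiline() treats the empty lines in the middle of a message is not stated here (the sketch keeps them as empty strings).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not the Triceps code): breaks a message into
// single-line pieces, ignoring a '\n' at the very end.
std::vector<std::string> splitLines(const std::string &msg)
{
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    while (start < msg.size()) {
        std::string::size_type nl = msg.find('\n', start);
        if (nl == std::string::npos) {
            lines.push_back(msg.substr(start)); // the last line without '\n'
            break;
        }
        lines.push_back(msg.substr(start, nl - start));
        start = nl + 1; // continue after the line feed
    }
    return lines;
}
```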

In all the cases of adding a nested child element, it's safe to pass a NULL. If it's a NULL or contains no data in it, the parent will ignore it, except for the error indication that would be processed anyway. Moreover, if the clde is empty, append() will also ignore the string part, and will add nothing. The return value of append() will be true if the child element contained any data in it or an error indication flag. This can be used together with another method

void replaceMsg(const string &msg);

to add a complex high-level description if a child element has reported an error:

Erref clde = someThing(...);
if (e.append("", clde)) {
    string msg;
    // ... generate msg in some complicated way
    e.replaceMsg(msg);
}

The replaceMsg() replaces the string portion of the last element, which owns the last added child error.
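
Here is a self-contained sketch of how this pattern plays out. MiniErrors is a hypothetical miniature that follows the behavior described above only loosely (it uses std::shared_ptr in place of Autoref), and checkField() and describeCheck() are made-up names; none of this is the actual Triceps code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical miniature of the Errors class, for illustration only.
struct MiniErrors
{
    struct Elem {
        std::string msg;
        std::shared_ptr<MiniErrors> child;
    };

    bool error = false;
    std::vector<Elem> elems;

    void appendMsg(bool e, const std::string &msg)
    {
        if (e)
            error = true;
        elems.push_back(Elem{msg, nullptr});
    }

    // Nests a child under a description. Returns true if the child
    // carried any data or an error indication.
    bool append(const std::string &msg, std::shared_ptr<MiniErrors> child)
    {
        if (!child)
            return false; // a NULL child is silently ignored
        if (child->error)
            error = true;
        if (child->elems.empty())
            return child->error; // nothing to nest, the message is dropped too
        elems.push_back(Elem{msg, child});
        return true;
    }

    // Replaces the string part of the last added element.
    void replaceMsg(const std::string &msg)
    {
        if (!elems.empty())
            elems.back().msg = msg;
    }
};

// Stand-in for a check that may or may not report a problem.
std::shared_ptr<MiniErrors> checkField(const std::string &name)
{
    auto res = std::make_shared<MiniErrors>();
    if (name.empty())
        res->appendMsg(true, "field name must not be empty");
    return res;
}

// Builds the high-level description only if the child reported something.
// Returns the description of the last element, or "" if there is none.
std::string describeCheck(MiniErrors &e, int idx, const std::string &name)
{
    if (e.append("", checkField(name)))
        e.replaceMsg("error in field " + std::to_string(idx) + ":");
    return e.elems.empty() ? "" : e.elems.back().msg;
}
```

The point of the two-step approach is that the possibly expensive message gets built only after it's known that there is something to report.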

The typical way to create the messages is with strprintf(), which is like sprintf() but returns a C++ std::string. It's defined in common/Strprintf.h, or as a part of the typical collection in common/Common.h.

It's also possible to append the contents of another Errors directly, without nesting it:

bool absorb(Autoref<Errors> clde);

The return value has the same meaning as with append(). Finally, an Errors object can be cleared to its empty state:

void clear();

To get the number of elements in Errors, use

size_t size() const;

However the more typical methods are:

bool isEmpty();
bool hasError();

They check whether there is nothing at all or whether there is an error. The special convenience of these methods is that they can be called on NULL pointers. Quite a few Triceps methods return a NULL Erref if there was no error. Even if er is NULL, calling

er->isEmpty()
er->hasError()
parent_er->append(msg, er)
parent_er->absorb(er)


is still safe and officially supported. But NOT er->size().

The data gets extracted from Erref by converting it to a string, either appending to an existing string, or adding a new one.

void printTo(string &res, const string &indent = "", const string &subindent = "  ");
string print(const string &indent = "", const string &subindent = "  ");

The indent argument specifies the initial indentation, subindent the additional indentation for each level.
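
The way the indentation gets applied only at the printing time can be sketched as follows. Report and sampleReport() are hypothetical names made up for this illustration, and the exact output format of the real Errors class (for example, whether every line ends with a '\n') is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical miniature of a nested error report; not the Triceps class.
struct Report
{
    struct Elem {
        std::string msg;               // may be empty
        std::shared_ptr<Report> child; // may be NULL
    };
    std::vector<Elem> elems;

    void add(const std::string &msg, std::shared_ptr<Report> child = nullptr)
    {
        elems.push_back(Elem{msg, child});
    }

    // Appends the text to res; each nested level gets one more subindent.
    void printTo(std::string &res, const std::string &indent = "",
        const std::string &subindent = "  ") const
    {
        for (std::size_t i = 0; i < elems.size(); i++) {
            if (!elems[i].msg.empty())
                res += indent + elems[i].msg + "\n";
            if (elems[i].child)
                elems[i].child->printTo(res, indent + subindent, subindent);
        }
    }

    std::string print(const std::string &indent = "",
        const std::string &subindent = "  ") const
    {
        std::string res;
        printTo(res, indent, subindent);
        return res;
    }
};

// Builds a two-level report: a description with two nested messages.
Report sampleReport()
{
    auto fields = std::make_shared<Report>();
    fields->add("field 'a': duplicate name");
    fields->add("field 'b': unknown type");
    Report top;
    top.add("invalid row type:", fields);
    return top;
}
```

In this sketch sampleReport().print() produces the description on the first line and the two field messages under it, each indented by two spaces.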

Friday, August 10, 2012

Reference to a RowHandle

The row handles have the requirements very similar to the rows. They get created by the million, so the efficiency is important. They contain data that has to be properly destroyed. For example, when an additive Perl aggregator stores its last state, it's stored in a row handle.

So they are handled similarly to the rows. They don't have a virtual destructor but rely on the Table that owns them to destroy them right. The special reference type for them is Rhref, defined in mem/Rhref.h (the RowHandle itself is defined in table/Table.h).

It follows in the exact same mold as Rowref, only uses the Table instead of a RowType:

Rhref(Table *t, RowHandle *r = NULL);
void assign(Table *t, RowHandle *r);

The rest of comparisons, assignments etc. work the same.

An important point is that a Rhref contains an Autoref to the table, safely holding the table in place while the Rhref is alive. So does the Rowref with the RowType as well, I just forgot to mention it before.

To find out the table of a Rhref, use:

Table *t = rhr.getTable();

Why is the value returned a simple pointer to the table and not an Autoref or Onceref? Basically, because it's the cheapest way and because the row handle is not likely to go anywhere. Nobody is likely to construct an Rhref only to get the table from it and have it immediately destroyed. And even if someone does something of the sort

Autoref<Table> t = Rhref(t_orig, rh).getTable();

then the table itself is likely to not go anywhere, there is still likely to be another reference to the table that will still hold it in place. If there isn't then of course all bets are off, and t will end up with a dead reference to corrupted memory. Just exercise a little care, and everything will be fine. The same reasoning was used for the argument of the Rhref constructor being also a table pointer, not an Autoref or Onceref.

An Rhref may also be conveniently used to construct a RowHandle for a Row:

Rhref(Table *t, Row *row);

In fact, this is the official way to construct a RowHandle. The Rowref has a similar method to construct the rows from raw data but the data is much more complex, so I've left its description until later.

Wednesday, August 8, 2012

Row reference

The Row objects are reference-counted too but Autoref can't be used with them. The reason is about the destructors.

It isn't used anywhere yet, but the rows in Triceps are designed to be able to contain references to the other rows and in general other objects.  So they can't be destroyed by just freeing the memory. The destructor must know, how to release the references to these nested objects first. And knowing where these references are depends on the type of the row. And rows may be of different types. This calls for a virtual destructor.

But having a virtual destructor requires that every object has a pointer to the table of virtual functions. That adds the overhead of 8 bytes per row, and the rows are likely to be kept by the million, and that overhead adds up. So I've made the decision to save these 8 bytes and split that knowledge. It might turn out a premature optimization, but since it's something that would be difficult to change later, I've got it in early.

The knowledge of how to destroy a row (and also how to copy the row and to access the elements in it) is kept in the row type object. So the reference to a row needs to know two things: the  row and the row's type. It's still an extra 8 bytes of a pointer, but there are only a few row references active at a time (the tables don't use the common row references to keep the rows, instead they are implemented as a special case and have one single row type for all the rows they store).

The special reference class for the rows is Rowref, defined in type/RowType.h.  It gets constructed as:

Rowref(const RowType *t, Row *r = NULL);
Rowref(const Rowref &ar);

It can then be assigned from another Rowref or from a row (keeping the row type the same):

Rowref &operator=(const Rowref &ar);
Rowref &operator=(Row *r);

Also both a new row type and a row can be assigned at the same time:

void assign(const RowType *t, Row *r);

There are also ways to copy a row and assign the copy to this reference:

Rowref &copyRow(const RowType *rtype, const Row *row);
Rowref &copyRow(const Rowref &ar);

The other functionality is similar to Autoref(): isNull(), get(), comparisons, calling the Row methods through ->, and conversion to a Row pointer.
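
The idea of splitting the destruction knowledge out of the row and into the row type can be sketched in a few lines. DemoRow, DemoRowType and DemoRowref are hypothetical stand-ins made up for this illustration, and the real classes differ in detail. In particular, the sketch keeps a plain pointer to the type, so the caller has to keep the type object alive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins; the real Row, RowType and Rowref differ in detail.
struct DemoRow
{
    int count;     // reference counter; note: no virtual table pointer
    char data[16]; // the payload
};

// The type object knows how to create and destroy the rows of this type.
class DemoRowType
{
public:
    DemoRowType() : destroyed_(0) {}

    DemoRow *makeRow(const char *text) const
    {
        DemoRow *r = new DemoRow;
        r->count = 0;
        std::strncpy(r->data, text, sizeof(r->data) - 1);
        r->data[sizeof(r->data) - 1] = 0;
        return r;
    }

    void destroyRow(DemoRow *r) const
    {
        ++destroyed_; // counted only to make the sketch observable
        delete r;
    }

    mutable int destroyed_;
};

// The reference carries both pointers: the row and its type.
class DemoRowref
{
public:
    DemoRowref(const DemoRowType *t, DemoRow *r = NULL) : type_(t), row_(r)
    {
        if (row_) ++row_->count;
    }
    DemoRowref(const DemoRowref &ar) : type_(ar.type_), row_(ar.row_)
    {
        if (row_) ++row_->count;
    }
    ~DemoRowref() { drop(); }

    DemoRowref &operator=(const DemoRowref &ar) { assign(ar.type_, ar.row_); return *this; }
    DemoRowref &operator=(DemoRow *r) { assign(type_, r); return *this; }

    void assign(const DemoRowType *t, DemoRow *r)
    {
        if (r) ++r->count; // increment first, so the self-assignment is safe
        drop();
        type_ = t;
        row_ = r;
    }

    DemoRow *get() const { return row_; }
    bool isNull() const { return row_ == NULL; }

private:
    void drop()
    {
        if (row_ && --row_->count == 0)
            type_->destroyRow(row_); // the type knows how to destroy the row
        row_ = NULL;
    }

    const DemoRowType *type_;
    DemoRow *row_;
};

// Returns the number of rows destroyed by the type.
int demoRowref()
{
    DemoRowType rt;
    {
        DemoRowref r1(&rt, rt.makeRow("abc"));
        DemoRowref r2(r1);      // the first row's counter goes to 2
        r1 = rt.makeRow("xyz"); // r1 lets go of the first row, r2 still holds it
        assert(rt.destroyed_ == 0);
    } // both references are gone, both rows get destroyed through the type
    return rt.destroyed_;
}
```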

Autoref summary

Autoref can be constructed with or assigned from another Autoref or a pointer:

T *ptr;
Autoref<T> ref1(ptr);
Autoref<T> ref2(ref1);
ref1 = ref2;
ref1 = ptr;

The assignments work for exactly the same type and also for assignment to any parent in the class hierarchy:

Autoref<Label> ref = new DummyLabel(...);

The automatic conversion to pointers works too:

ptr = ref1;

Or a pointer can get extracted from an Autoref explicitly too:

ptr = ref1.get();

The dereferencing and arrow operations work like on a pointer too:

T val = *ref1;
ref1->method();

The Autorefs can also be compared for equality and inequality:

ref1 == ref2
ref1 != ref2

To compare them to pointers, use get(). Except for one special case: the comparison to NULL happens so often that a special method is provided for it:

ref1.isNull()

And yes, NULL can be assigned to the Autorefs too.

A little about how Autoref works: it can work transparently on both Starget and Mtarget because Autoref doesn't modify the reference counters by itself. Instead the target class is expected to provide the methods 

void incref() const;
int decref() const;

They are defined as const to allow the reference counting of even the const objects, but of course the reference counter must be mutable. decref() returns the resulting counter value. When it goes down to 0, Autoref calls the destructor.
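
A minimal sketch of this contract, assuming nothing beyond the two methods named above: DemoTarget plays the role of a single-threaded target and DemoRef the role of the reference. Both are hypothetical miniatures made up for this illustration, not the code from mem/.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-threaded target: just a counter with the two methods.
class DemoTarget
{
public:
    DemoTarget() : count_(0) {}
    void incref() const { ++count_; }
    int decref() const { return --count_; }
private:
    mutable int count_; // mutable, so even the const objects can be counted
};

// Hypothetical miniature of a reference: it never touches the counter
// directly, only through incref() and decref().
template <typename T>
class DemoRef
{
public:
    DemoRef(T *p = NULL) : ptr_(p) { if (ptr_) ptr_->incref(); }
    DemoRef(const DemoRef &r) : ptr_(r.ptr_) { if (ptr_) ptr_->incref(); }
    ~DemoRef() { drop(); }

    DemoRef &operator=(const DemoRef &r) { reset(r.ptr_); return *this; }
    DemoRef &operator=(T *p) { reset(p); return *this; }

    T *get() const { return ptr_; }
    T *operator->() const { return ptr_; }
    bool isNull() const { return ptr_ == NULL; }

private:
    void reset(T *p)
    {
        if (p) p->incref(); // increment first, so the self-assignment is safe
        drop();
        ptr_ = p;
    }
    void drop()
    {
        if (ptr_ && ptr_->decref() == 0)
            delete ptr_; // the counter went down to 0
        ptr_ = NULL;
    }
    T *ptr_;
};

// A target that counts its live instances, to make the effect visible.
struct Tracked : public DemoTarget
{
    static int alive;
    Tracked() { ++alive; }
    ~Tracked() { --alive; }
};
int Tracked::alive = 0;

// Returns the number of live Tracked objects after the references are gone.
int demoRef()
{
    DemoRef<Tracked> r1(new Tracked);
    {
        DemoRef<Tracked> r2(r1); // the counter goes to 2
        r1 = (Tracked *)NULL;    // back to 1, the object stays alive
        assert(Tracked::alive == 1);
    }                            // r2 is gone: counter 0, object destroyed
    return Tracked::alive;
}
```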

Monday, August 6, 2012

Reference Counting

The code related to the memory management is generally collected under mem/. The reference counting has two parts to it:

  • The objects that can be managed by reference counting.
  • The references that do the counting.

The managed objects come in two varieties: single-threaded and multi-threaded. The single-threaded objects lead their whole life in a single thread, so their reference counts don't need locking. The multi-threaded objects can be shared by multiple threads, so their reference counts are kept thread-safe by using the atomic integers (if the NSPR library is available) or by using a lock (if NSPR is not used). That whole implementation of atomic data with or without NSPR is encapsulated in the class AtomicInt in mem/Atomic.h.

The way a class selects whether it will be single-threaded or multi-threaded is by inheriting from the appropriate class:

Starget (defined in mem/Starget.h) for single-threaded;
Mtarget (defined in mem/Mtarget.h) for multi-threaded.

If you do the multiple inheritance, the [SM]target has to be inherited only once. Also, you can't change the choice along the inheritance chain. Once chosen, you're stuck with it. The only way around it is by encapsulating that inner class's  object instead of inheriting from it.
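
The choice by inheritance can be pictured with the following sketch. PlainTarget and AtomicTarget are hypothetical stand-ins for the two base classes, and std::atomic from C++11 is swapped in here for the NSPR-based atomic integer, so this only shows the shape of the idea and not the actual code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the single-threaded base: a plain counter.
class PlainTarget
{
public:
    PlainTarget() : count_(0) {}
    void incref() const { ++count_; }
    int decref() const { return --count_; }
private:
    mutable int count_;
};

// Hypothetical stand-in for the multi-threaded base: an atomic counter
// (std::atomic is an assumption, used instead of the NSPR atomics).
class AtomicTarget
{
public:
    AtomicTarget() : count_(0) {}
    void incref() const { ++count_; }
    int decref() const { return --count_; } // returns the new value
private:
    mutable std::atomic<int> count_;
};

// A class picks its threading model once, by inheritance.
class ScratchBuffer : public PlainTarget {};
class SharedConfig : public AtomicTarget {};
```

The interface is the same in both cases, which is what lets one reference template work with either kind of target.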

The references are created with the template Autoref<>, defined in mem/Autoref.h. For example, if you have an object of class RowType, the reference to it will be Autoref<RowType>. There are some similar references in the Boost library, but I prefer to avoid the avoidable dependencies (and anyway, I've never used Boost much).

The target objects are created in the constructors with the reference count of 0. The first time the object pointer is assigned to an Autoref, the count goes up to 1. After that it stays above 0 for the whole life of the object. As soon as it goes back to 0 (meaning that the last reference to it has disappeared), the object gets destroyed. No locks are held during the destruction itself. After all the references are gone, nobody should be using it, and destroying it is safe without any extra locks.

An important point is that to do all this, the Autoref must be able to execute the correct destructor when it destroys the object that ran out of references. Starget and Mtarget do not provide the virtual destructors. If you don't use the polymorphism for some class, you don't have to use the virtual destructors. But if you do use it, i.e. create a class B inheriting from A, inheriting from [SM]target, and then assign something like

Autoref<A> ref = new B;

then the class A (and by extension all the classes inheriting from it) must have a virtual destructor to get everything working right.
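
The reason can be shown without any reference counting at all, since in the end the reference does a plain delete through the pointer type it knows. In this sketch A and B follow the naming from the text, while destroyThroughBase() and the counter are made up for illustration.

```cpp
#include <cassert>

// The base class has a virtual destructor, so deleting through A* works.
struct A
{
    virtual ~A() {}
};

struct B : public A
{
    static int destroyed; // counts the destructor calls, for the demonstration
    ~B() { ++destroyed; }
};
int B::destroyed = 0;

// Does what a reference to A would do when the count drops to 0.
void destroyThroughBase(A *p)
{
    delete p; // runs ~B() first because ~A() is virtual
}
```

Without the virtual destructor in A, deleting a B through an A pointer is undefined behavior in C++, and in practice the destructor of B would typically be skipped.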

It's also possible to mess up the destruction with the use of pointers. For example, look at this sequence:

Autoref<RowType> rt = new RowType(...);
RowType *rtp = rt; // copies a reference to a pointer
rt = NULL; // reference cleared, count down to 0, object destroyed
Autoref <RowType> rt2 = rtp; // resurrects the dead pointer, corrupts memory

The lesson here is that even though you can mix the references with pointers to reduce the overhead (the reference assignments change the reference counters, the pointer assignments don't), and I do it in my code, you need to be careful. A pointer may be used only when you know that there is a reference that holds the object in place. Once that reference is gone, the pointer can't be used any more, and especially can't be assigned to another reference. Be careful.

There are more varieties of Autoref<>:

Onceref<>
const_Autoref<>
const_Onceref<>

The Onceref is an attempt at optimization when passing the function arguments and results. It's supposed to work like the standard auto_ptr: you assign a value there once, and then when that value gets assigned to an Autoref or another Onceref, it moves to the new location, leaving the reference count unchanged and the original Onceref as NULL. This way you avoid a spurious extra increase-decrease. However in practice I haven't got around to implementing it yet, so for now it's a placeholder that is defined to be an alias of Autoref.
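
Since the real Onceref is only a placeholder for now, the intended behavior can only be sketched. The following uses the C++11 move semantics to show a reference whose value moves to the new location without touching the counter. Counted, MovingRef and the other names are hypothetical and made up for this illustration; this is not how Triceps implements anything.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// A hypothetical counted object with a public counter, for the demonstration.
struct Counted
{
    int count;
    Counted() : count(0) {}
};

// A hypothetical move-only reference: the value can be handed over once.
class MovingRef
{
public:
    explicit MovingRef(Counted *p = NULL) : ptr_(p)
    {
        if (ptr_) ++ptr_->count;
    }
    // Moving leaves the counter unchanged and the source as NULL.
    MovingRef(MovingRef &&r) : ptr_(r.ptr_) { r.ptr_ = NULL; }
    MovingRef(const MovingRef &) = delete;
    MovingRef &operator=(const MovingRef &) = delete;
    ~MovingRef()
    {
        if (ptr_ && --ptr_->count == 0)
            delete ptr_;
    }

    bool isNull() const { return ptr_ == NULL; }
    Counted *get() const { return ptr_; }

private:
    Counted *ptr_;
};

// Returns a freshly created object as a function result.
MovingRef makeObject()
{
    return MovingRef(new Counted);
}

// Returns the counter value seen after the reference has been moved.
int countAfterMove()
{
    MovingRef a = makeObject(); // the counter is 1
    MovingRef b(std::move(a));  // moved: no increase-decrease pair
    assert(a.isNull());
    return b.get()->count;
}
```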

const_Autoref<> is a reference to a constant object. Essentially, const_Autoref<T> is equivalent to Autoref<const T>, only it handles the automatic type casts much better. The approach is patterned after the const_iterator. The only problem with const_Autoref is that when you try to assign a NULL to it, that blows the compiler's mind. So you have to write an explicit cast of (T*)NULL or (const T*)NULL to help it out.

Finally, const_Onceref is the const version of Onceref.

Sunday, August 5, 2012

The const-ness in C++

I've been using the const keyword for two purposes:

  • To let the compiler optimize a little better the methods that do not change the state of the objects.
  • To mark the fragments of the read-only object internal state returned by the methods. This is much more efficient than making copies of them.

So if you get a const vector<> & returned from a method, this is a gentle reminder that you should not be modifying this vector. Of course, nothing can stop a determined programmer from doing a type cast and modifying it anyway, but be aware that such inconsistent modifications will likely cause the program to crash in the future. And if the vector contains references to other objects, these objects usually should not be modified either, even though they might not be marked with const.

However all this const stuff is not all rainbows and unicorns but also produces a sizable amount of suffering. One consequence is that you can not use the normal iterators on the const vectors, you have to use the const_iterators. Another is that once in a while you get something like a (const RowType *) from one method and need to pass it as an argument to another method that takes a (RowType *). In this case make sure that you know what you are doing and then proceed boldly with using a const_cast. There is just no way to get all the const-ness self-consistent without ripping it out altogether.
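
Both consequences are easy to show on a plain std::vector. The functions below are hypothetical and made up for this illustration (they are not a part of the Triceps API): the first one walks a const vector with a const_iterator, the other two show a const_cast used to call an older-style function that takes a non-const pointer but is known to only read the data.

```cpp
#include <cassert>
#include <vector>

// Reads through a const reference: only the const_iterator can be used here.
int sumAll(const std::vector<int> &v)
{
    int sum = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        sum += *it;
    return sum;
}

// A hypothetical older function that takes a non-const pointer
// although it never modifies the vector.
int firstOrZero(std::vector<int> *v)
{
    return v->empty() ? 0 : (*v)[0];
}

// Bridges the two worlds with a const_cast.
int firstOfConst(const std::vector<int> &v)
{
    // Safe only because firstOrZero() is known to not modify its argument.
    return firstOrZero(const_cast<std::vector<int> *>(&v));
}
```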


Introduction to the C++ API

Fundamentally, the C++ and Perl APIs are shaped similarly. So I won't be making a version of all the examples in C++. Please read the Perl-based documentation first to understand the spirit and usage of Triceps. The C++-based documentation is more of the reference type and concentrates on the low-level specifics and the differences from Perl necessitated by these specifics.

In many cases just reading the descriptions of the methods in the .h files should be enough to understand the details and be able to use the API. However in some cases the class hierarchies differ, with the Perl API covering the complexities exposed in the C++ API.

Writing directly in C++ is significantly harder than in Perl, so I highly recommend sticking with the Perl API unless you have a good reason to do otherwise. Even when the performance is important, it's usually enough to write a few critical elements in C++ and then bind them together in Perl. (If you wonder about Java, and why I didn't use it instead, the answer is that Java successfully combines the drawbacks of both and adds some of its own).

The C++ Triceps API is much more sensitive to the errors. The Perl API checks all the arguments for consistency; its principle is that the interpreter must never crash. The C++ API is geared towards the efficiency of execution. It checks for errors when constructing the major elements but then does almost no checks at run time. The expectation is that the caller knows what he is doing. If the caller sends bad data, mislays the pointers etc., the program will crash. The idea here is that most likely the C++ API will be used from another layer: either an interpreted one (like Perl) or a compiled one (like a possible future custom language). Either way that layer is responsible for detecting the user errors at either interpretation or compile time. By the time the data gets to the C++ code, it's already checked and there is no need to check it again. Of course, if you write the programs manually in C++, that checking is upon you.

The more high-level features are currently available only in Perl. For example, there are no joins in the C++ API. If you want to do the joins in C++, you have to code your own. This will change over time, as these features will solidify and move to a C++ implementation to become more efficient and available through all the APIs. But it's much easier to experiment with the initial implementations in Perl.