Sergey Babkin on CEP and stuff: constants

Showing posts with label constants. Show all posts

Saturday, July 27, 2013

string utilities

As I'm editing the documentation for 2.0, I've found an omission: the string helper functions in the C++ API haven't been documented yet. Some of them have been mentioned but not officially documented.

The first two are declared in common/Strprintf.h:

string strprintf(const char *fmt, ...);
string vstrprintf(const char *fmt, va_list ap);

They are entirely similar to sprintf() and vsprintf() with the difference that they place the result of formatting into a newly constructed string and return that string.

The rest are defined in common/StringUtil.h:

extern const string &NOINDENT;

The special constant that when passed to the printing of the Type, causes it to print without line breaks. Doesn't have any special effect on Errors, there it's simply treated as an empty string.

const string &nextindent(const string &indent, const string &subindent, string &target);

Compute the indentation for the next level when printing a Type. The arguments are:

indent - indent string of the current level
subindent - characters to append for the next indent level
target - buffer to store the extended indent string

The passing of target as an argument allows to reuse the same string object and avoid the extra construction.

The function returns the computed reference: if indent was NOINDENT, then reference to NOINDENT, otherwise reference to target. This particular calling pattern is strongly tied to how things are computed inside the type printing, but you're welcome to look inside it and do the same for any other purpose.

void newlineTo(string &res, const string &indent);

Another helper function for the printing of Type, inserting a line break. The indent argument specifies the indentation, with the special handling of NOINDENT: if indent is NOINDENT, a single space is added, thus printing everything in one line; otherwise a \n and the contents of indent are added. The res argument is the result string, where the line break characters are added.

void hexdump(string &dest, const void *bytes, size_t n, const char *indent = "");

Print a hex dump of a sequence of bytes (at address bytes and of length n), appending the dump to the destination string dest. The data will be nicely broken into lines, with 16 bytes printed per line. The first line is added directly to the end of the dest as-is, but if n is over 16, the other lines will follow after \n. The indent argument allows to add indentation at the start of each following string.

void hexdump(FILE *dest, const void *bytes, size_t n, const char *indent = "");

Another version, sending the dumped data directly into a file descriptor.

The next pair of functions provides a generic mechanism for converting enums between a string and integer representation:

struct Valname
{
    int val_;
    const char *name_;
};

int string2enum(const Valname *reft, const char *name);
const char *enum2string(const Valname *reft, int val, const char *def = "???");

The reference table is defined with an n array of Valnames, with the last element being { -1, NULL }. Then it's passed as the argument reft of the conversion functions which do a sequential look-up by that table. If the argument is not found, string2enum() will return -1, and enum2string() will return the value of the def argument (which may be NULL).

Here is an example of how it's used for the conversion of opcode flags:

Valname opcodeFlags[] = {
    { Rowop::OCF_INSERT, "OCF_INSERT" },
    { Rowop::OCF_DELETE, "OCF_DELETE" },
    { -1, NULL }
};

const char *Rowop::ocfString(int flag, const char *def)
{
    return enum2string(opcodeFlags, flag, def);
}

int Rowop::stringOcf(const char *flag)
{
    return string2enum(opcodeFlags, flag);
}

Thursday, July 4, 2013

string-and-constant conversion errors

As outlined before, I'm going through the the Perl API, changing to the new way of the error reporting by confessing. On of the consequences is that the functions for parsing the string name of a constant into the value got converted too. They now confess on an invalid string. However sometimes it's still convenient to check the string for validity without causing an error, so for that I've added another version of them, with the suffix Safe attached, that still work the old way, returning an undef for an unknown string.

For example, &Triceps::humanStringTracerWhen and Triceps::humanStringTracerWhenSafe.

The same applies the other way around to: the functions converting the integer enumeration value to string will confess when the value is not in the proper enum. The functions with the suffix Safe will still return an undef. Such as &Triceps::tracerWhenHumanString and &Triceps::tracerWhenHumanStringSafe.

The only special case is &Triceps::opcodeString and &Triceps::opcodeStringSafe: neither of them confesses nor returns undef, they print the unknown strings as a combination of bits, but for consistency both names are present anyway, doing the same thing.

Friday, February 15, 2013

Aggregator in C++, part 1

I've described the AggregatorType and AggregatorGadget before. Now, the last piece of the puzzle is the Aggregator class.

An object of subclass of AggregatorType defines the logic of the aggregation, and there is one of them per a TableType. An object of subclass of AggregatorGadget provides the aggregation output of a Table, and there is one of them per table. An object of subclass of Aggregator keeps the state of an aggregation group, and there is one of them per Index that holds a group in the table.

Repeating what I wrote before:

"The purpose of the Aggregator object created by AggregatorType's makeAggregator() is to keep the state of the group. If you're doing an additive aggregation, it allows you to keep the previous results. If you're doing the optimization of the deletes, it allows you to keep the previous sent row.

What if your aggregator keeps no state? You still have to make an Aggregator for every group, and no, you can't just return NULL, and no, they are not reference-countable, so you have to make a new copy of it for every group (i.e. for every call of makeAggergator()). This looks decidedly sub-optimal, and eventually I'll get around to straighten it out. The good news though is that most of the real aggerators keep the state anyway, so it doesn't matter much."

Once again, the Aggregator object doesn't normally do the logic by itself, it calls its AggregatorType for the logic. It only provides storage for one group's state and lets the AggergatorType logic use that storage.

The Aggregator is defined in table/Aggregator.h. It defines two things: the constants and the virtual method.

The constants are in the enum Aggregator::AggOp, same as defined before in Perl:

AO_BEFORE_MOD,
AO_AFTER_DELETE,
AO_AFTER_INSERT,
AO_COLLAPSE,

Their meaning is also the same. They can be converted to and from the string representation with methods

static const char *aggOpString(int code, const char *def = "???");
static int stringAggOp(const char *code);

They work the same way as the other constant conversion methods.

Notably, there is no explicit Aggregator constructor. The base class Aggregator doesn't have any knowledge of the group state, so the default constructor is good enough. You add this knowledge of the state in your subclass, and there you define your own constructor. Whether the constructor takes any arguments or not is completely up to you, since your subclass of AggregatorType will be calling that constructor.

And the actual work then happens in the method

virtual void handle(Table *table, AggregatorGadget *gadget, Index *index,
const IndexType *parentIndexType, GroupHandle *gh, Tray *dest,
AggOp aggop, Rowop::Opcode opcode, RowHandle *rh);

Your subclass absolutely has to define this method because it's abstract in the base class. The arguments provide the way to get back to the table, its components and its type, so you don't have to pass them through your Aggregator constructor and keep them in your instance, this saves memory.

Notably absent is the pointer to the AggregatorType. This is because the AggregatorGadget already has this pointer, so all you need to to is call

gadget->getType();

You might also want to cast it to your aggregator's type in case if you plan to call your custom methods, like:

MyAggregatorType *at = (MyAggregatorType *)gadget->getType();

Some of the argument types, Index and GroupHandle, have not been described yet, and will be described shortly.

But in general the method handle() is expected to work very much like the aggregator handler does in Perl. It's called at the same times (and indeed the Perl handler is called from the C++-level of the handler), with the same aggop and opcode. The rh is still the current modified row.

A bit of a difference is that the Perl way hides the sensitive arguments in an AggregatorContext while in C++ they are passed directly.

The job of the handle() is to do whatever is necessary, possibly iterate through the group, possibly use the previous state, produce the new state (and possibly remember it), then send the rowop to the gadget. This last step usually is:

gadget->sendDelayed(dest, newrow, opcode);

Here gadget is the AggregatorGadget from the arguments, opcode also helpfully comes from the arguments (and if it's OP_NOP, there is no need to send anything, it will be thrown away anyway), and newrow is the newly computed result Row (not RowHandle!). Obviously, the type of the result row must match the output row type of the AggregatorGadget, or all kinds of bad things will happen. dest is also an argument of handle(), and is a tray that will hold the aggregator results until a proper time when they can be sent. The "delayed" part of "sendDelayed" means exactly that: the created rowops are not sent directly to the gadget's output but are collected in a Tray until the Table decides that it's the right time to send them.

You absolutely must not use the Gadget's default method send(), use only sendDelayed(). To help with this discipline, AggregatorGadget hides the send() method and makes only the sendDelayed() visible.

Saturday, December 29, 2012

Gadget in C++

The Gadget is unique to the C++ API, it has no parallels in Perl. Gadget is a base class defined in sched/Gadget.h, its object being a something with an output label. And the details of what this something is, are determined by the subclass. Presumably, it also has some kind of inputs but it's up to the subclass. The Gadget itself defines only the output label. To make a concrete example, a table is a gadget, and every aggregator in the table is also a gadget. However the "pre" and "dump" labels of the table is not a gadget, it's just an extra label strapped on the side.

Some of the reasons for the Gadget creation are purely historic by now. At some point it seemed important to have the ability to associate a particular enqueueing mode with each output label. Most tables might be using EM_CALL but some, ones in a loop, would use EM_FORK, and those that don't need to produce the streaming output would use EM_IGNORE. This approach didn't work out as well as it seemed at first, and now is outright deprecated: just use EM_CALL everywhere, and there are the newer and better ways to handle the loops. The whole Gadget thing should be redesigned at some point but for now I'll just describe it as it is.

As the result of that history, the enqueueing mode constants are defined in the Gadget class, enum EnqMode: EM_SCHEDULE, EM_FORK, EM_CALL, EM_IGNORE.

static const char *emString(int enval, const char *def = "???");
static int stringEm(const char *str);

Convert from the enqueueing mode constant to string, and back.

Gadget(Unit *unit, EnqMode mode, const string &name = "", const_Onceref<RowType> rt = (const RowType*)NULL);

The Gadget constructor is protected, since Gadget is intended to be used only as a base class, and never instantiated directly. The name and row type can be left undefined if they aren't known yet and initialized later. The output label won't be created until the row type is known, and you better also set the name by that time too. The enqueueing mode may also be changed later, so initially it can be set to anything. All this is intended only to split the initialization in a more convenient way, once the Gadget components are set, they must not be changed any more.

The output label of the Gadget is a DummyLabel, and it shares the name with the Gadget. So if you want to differentiate that label with a suffix in the name, you have to give the suffixed name to the whole Gadget. For example, the Table constructor does:

Gadget(unit, emode, name + ".out", rowt),

A Gadget keeps a reference to both its output label and its unit. This means that the unit won't disappears from under a Gadget, but to avoid the circular references, the Unit must not have references to the Gadgets (having references to their output labels is fine).

void setEnqMode(EnqMode mode);void setName(const string &name);
void setRowType(const_Onceref<RowType> rt);

The protected methods to finish the initialization. Once the values are set, they must not be changed any more. Calling setRowType() creates the output label, and since the name of the output label is taken from the Gadget, you need to set the name before you set the row type.

EnqMode getEnqMode() const;
const string &getName() const;
Unit *getUnit() const;
Label *getLabel() const;

Get back the gadget's information. The label will be returned only after it's initialized (i.e. the row type is known), before then getLabel() would return NULL. And yes, it's getLabel(), NOT getOutputLabel().

The rest of the methods are for convenience of sending the rows to the output label. They are protected, since they are intended for the Gadget subclasses (which in turn may decide to make them pubclic).

void send(const Row *row, Rowop::Opcode opcode) const;

Construct a Rowop from the given row and opcode, and enqueue it to the output label according to the gadget's enqueueing method. This is the most typical use.

void sendDelayed(Tray *dest, const Row *row, Rowop::Opcode opcode) const;

Create a Rowop and put it into the dest tray. The rowop will have the enqueueing mode populated according to the Gadget's setting. This method is used when the whole set of the rowops needs to be generated before any of them can be enqueued, such as when a Table computes its aggregators. After the delayed tray is fully generated, it can be enqueued with Unit::enqueueDelayedTray(), which will consult each rowop's enqueueing method and process it accordingly. Again, this stuff exists for the historic reasons, and will likely be removed somewhere soon.

Monday, December 24, 2012

Rowop in C++

I've jumped right into the Unit without showing the objects it operates on. Now let's start catching up and look at the Rowops. The Rowop class is defined in sched/Rowop.h.

The Rowop in C++ consists of all the same parts as in Perl API: a label, a row, and opcode.

It has one more item that's not really visible in the Perl API, the enqueueing mode, but it's semi-hidden in the C++ API as well. The only place where it's used is in Unit::enqueueDelayedTray(). This basically allows to build a tray of rowops, each with its own enqueueing mode, and then enqueue all of them appropriately in one go. This is actually kind of historic and caused by the explicit enqueueing mode specification for the Table labels. It's getting obsolete and will be removed somewhere soon.

The Rowop class inherits from Starget, usable in one thread only. Since it refers to the Labels, that are by definition single-threaded, this makes sense. A consequence is that you can't simply pass the Rowops between the threads. The passing-between-threads requires a separate representation that doesn't refer to the Labels but instead uses something like a numeric index (and of course the Mtarget base class). This is a part of the ongoing work on multithreading, but you can also make your own.

The opcodes are defined in the union Rowop::Opcode, so you normally use them as Rowop::OP_INSERT etc. As described before, the opcodes actually contain a bitmap of individual flags, defined in the union Rowop::OpcodeFlags: Rowop::OCF_INSERT and Rowop::OCF_DELETE. You don't really need to use these flags directly unless you really, really want to.

Besides the 3 already described opcodes (OP_NOP, OP_INSERT and OP_DELETE) there is another one, OP_BAD. It's a special value returned by the string-to-opcode conversion method instead of the -1 returne dby the other similar method. The reason is that OP_BAD is specially formatted to be understood by all the normal opcode type checks as NOP, while -1 would be seen as a combination of INSERT and DELETE. So if you miss checking the result of conversion on a bad string, at least you would get a NOP and not some mysterious operation. The reason why OP_BAD is not exporeted to Perl is that in Perl an undef is used as the indication of the invalid value, and works even better.

There is a pretty wide variety of Rowop constructors:

Rowop(const Label *label, Opcode op, const Row *row);
Rowop(const Label *label, Opcode op, const Rowref &row);

Rowop(const Label *label, Opcode op, const Row *row, int enqMode);
Rowop(const Label *label, Opcode op, const Rowref &row, int enqMode);

Rowop(const Rowop &orig);
Rowop(const Label *label, const Rowop *orig);

The constructors with the explicit enqMode are best not be used outside of the Triceps internals, and will eventually be obsoleted. The last two are the copy constructor, and the adoption constructor which underlies Label::adopt() and can as well be used directly.

Once a rowop is constructed, its components can not be changed any more, only read.

Opcode getOpcode() const;
const Label *getLabel() const;
const Row *getRow() const;
int getEnqMode() const;

Read back the components of the Rowop. Again, the getEnqMode() is on the way to obsolescence. And if you need to check the opcode for being an insert or delete, the better way is to use the explicit test methods, rather than getting the opcode and comparing it for equality:

bool isInsert() const;
bool isDelete() const;
bool isNop() const;

Check whether the opcode requests an insert or delete (or neither).

The same checks are available as static methods that can be used on the opcode values:

static bool isInsert(int op);
static bool isDelete(int op);
static bool isNop(int op);

And the final part is the conversion between the strings and values for the Opcode and OpcodeFlags enums:

static const char *opcodeString(int code);
static int stringOpcode(const char *op);
static const char *ocfString(int flag, const char *def = "???");
static int stringOcf(const char *flag);

As mentioned above, stringOpcode() returns OP_BAD for the unknown strings, not -1.

Wednesday, December 28, 2011

Triceps constants

Triceps has a number of symbolic constants that are grouped into essentially enums. The constants themselves will be introduced with the classes that use them, but here is the general description common to them all.

In Perl they all are placed into the same namespace. Each group of constants (that can be thought of as an enum) gets its name prefix. For example, the operation codes are all prefixed with OP_, the enqueueing modes with EM_, and so on.

The underlying constants are all integer. The way to give symbolic names to constants in Perl is to define a function without arguments that would return the value. Each constant has such a function defined for it. For example, the opcode for the "insert" operation is the result of function Triceps::OP_INSERT.

Most methods that take constants as arguments are also smart enough to recognise the constant names as strings, and automatically convert them to integers. For example, the following calls are equivalent:

$label->makeRowop(&Triceps::OP_INSERT, ...);
$label->makeRowop("OP_INSERT", ...);

However the version with Triceps::OP_INSERT is more efficient. The ampersand (function call designator in Perl) is usually optional, but I tend to use it for clarity.

What if you need to print out a constant in a message? Triceps provides the conversion functions for each group of constants. They generally are named Triceps::somethingString. For example,

print &Triceps::opcodeString(&Triceps::OP_INSERT);

would print "OP_INSERT". If the argument is out of range of the valid enums, it would return undef (but not set any error message in $!).

There also are functions to convert from strings to constant values. They generally are named Triceps::stringSomething. For example,

&Triceps::stringOpcode("OP_INSERT")

would return the integer value of Triceps::OP_INSERT. If the string name is not valid for this kind of constants, it would also return undef.

Sergey Babkin on CEP and stuff