Showing posts with label 2_1.

Saturday, June 3, 2023

the first actual use of Triceps

I've recently got Triceps used in a real application, for the first time ever, after approximately 13 years. Not really a production usage, rather a scientific demo, but it's an outside project, not part of Triceps itself. Embedded in Matlab, of all things. No, not a full Matlab API, but an application written in C++ and embedded as a native function into Matlab. Which is exactly what Triceps was designed for.

But really, Matlab and Triceps with a full Matlab API might be a match made in heaven. Matlab really sucks at two things: data structures where you can append and search by value, and parallelism. And those are exactly the things where Triceps excels. Yes, Matlab has the parfor loops, but the trouble there is that the parallel threads (they're more typically processes, but logically still threads) are stateless. All the input data is packed up and sent to them (at a great overhead), and then the results are packed up and sent back. You can't just preserve state in a thread between two calls; it has to be sent from scratch every time. And no, even a constant doesn't seem to be kept in shared memory and read in parallel by multiple threads; it actually gets copied for every thread. So parfor only works well when you send a small amount of data, process it for some long time, and then send a small result back. It doesn't work well when you want your thread to make queries against a large database. But keeping state is exactly what a Triceps thread does. The Triceps threads are also easy to organize into pipelines. (Yes, Matlab does have some pipelines in one of its extensions, but they seem as cumbersome as parfor.) And the read-only shared memory would work too if queried through Triceps, with only the small results of the queries returned to Matlab. It could all work really, really awesomely together. The trouble, of course, is that I personally don't have much of an incentive to do this work.

That's the good part. What went badly? Well, Triceps in C++ feels rather dated. It's a project that was started in 2010, before C++11, and it feels like that. I didn't even aim for C++ to be the main language, but more of an API for embedding into other languages. But now the same things can be done much more smoothly straight in C++. So if I ever get to it, a modernization of the C++ API is in order. Some libraries are dated too, in particular NSPR. It also proved to be a pain in building: I hadn't thought much about it before, but the definitions used when building the applications that use Triceps have to be the same as when building Triceps itself. So if Triceps is built with NSPR, and the application doesn't include the right definitions for the Triceps headers to use NSPR, it crashes in a surprising way. Fortunately, the standard C++ library now has an API for atomic values, so a transition to that API is in order. On the other hand, shared_ptr is a more complicated question, and keeping the Triceps Autoref might still be better for efficiency.

Tuesday, December 25, 2018

Triceps 2.1.0 released

After a long break, Triceps 2.1.0 is finally released: http://triceps.sourceforge.net/

It includes two important features:

1. The fast compiled Ordered index. The code for it has been sitting in SVN for three years, and I have finally found time to update the documentation to replace the old ordered index implemented in Perl with the new one.

2. Triceps now builds and works with the modern versions of C++ and Perl. This was harder to do than it sounds, since the C++ standard has changed in an incompatible and unpleasant way. But the good news is that the code produced by the new compiler is a good deal faster than by the old one.

Saturday, December 1, 2018

Triceps in flux

I've finally updated my machine to a modern version of Linux (Mint 19), and now I'm going through, so to say, bringing Triceps into the Century of the Fruit Bat. Both the C++ and Perl languages have changed (and, as mentioned in a previous post, Ghostscript had functionality removed).

Most annoyingly, C++ doesn't allow calling even the non-virtual methods on NULL pointers any more. Well, it does allow calling them, but it auto-removes any checks for (this == NULL) inside the methods. And I've been using this trick in multiple places. But now it's all worked around by using special static objects, and the code builds again.

The tests still fail though, but now it's hopefully all trivial: some messages have changed because of the different hashing, and the Perl tests fail because the syntax of regexps changed in Perl 5.26, which doesn't accept a plain "{" in the patterns any more. But I'm working on it.

Wednesday, April 22, 2015

On performance of reading the unaligned data

I've made a template that was supposed to optimize the reading of the unaligned data:

// Requires <cstring> for memcpy() and <cstdint> for intptr_t.
template <typename T>
T getUnaligned(const T *ptr)
{
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        // Misaligned pointer: copy into an aligned local first.
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        // Aligned pointer: read directly.
        return *ptr;
    }
}

And I've been wondering if it really makes a difference. So I've made a test. The test compares 5 cases:

Just read the aligned memory directly;
Read the aligned memory through memcpy;
Read the unaligned memory through memcpy;
Read the aligned memory through the template;
Read the unaligned memory through the template.

All the cases follow this general pattern:

UTESTCASE readAlignedDirect(Utest *utest)
{
    int64_t n = findRunCount();
    int64_t x = 0;

    double tstart = now();
    for (int64_t i = 0; i < n; i++) {
        x += apdata[i % ARRAYSZ];
    }
    double tend = now();
    printf("        [%lld] %lld iterations, %f seconds, %f iter per second\n", (long long) x, (long long)n, (tend-tstart), (double)n / (tend-tstart));
}


The results on an AMD64 machine with -O3 came out like this:

   case 1/6: warmup
        [0] 4294967296 iterations, 4.601818 seconds, 933319757.968194 iter per second
  case 2/6: readAlignedDirect
        [0] 4294967296 iterations, 4.579874 seconds, 937791527.916276 iter per second
  case 3/6: readAlignedMemcpy
        [0] 4294967296 iterations, 4.537628 seconds, 946522579.007429 iter per second
  case 4/6: readUnalignedMemcpy
        [0] 4294967296 iterations, 6.215574 seconds, 691000961.046305 iter per second
  case 5/6: readAlignedTemplate
        [0] 4294967296 iterations, 4.579680 seconds, 937831317.452678 iter per second
  case 6/6: readUnalignedTemplate
        [0] 4294967296 iterations, 5.052706 seconds, 850033049.741168 iter per second

Even more interestingly, the code generated for readAlignedDirect(), readAlignedTemplate() and readUnalignedTemplate() is exactly the same:

.L46:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    addq    (%rcx,%rdx,8), %rbx
    cmpq    %rbp, %rax
    jne .L46

The code generated for readAlignedMemcpy() and readUnalignedMemcpy() is slightly different but very similar:

.L34:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    salq    $3, %rdx
    addq    (%rcx), %rdx
    movq    (%rdx), %rdx
    addq    %rdx, %rbx
    cmpq    %rax, %rbp
    movq    %rdx, (%rsi)
    jg  .L34

Basically, the compiler is smart enough to know that the machine doesn't care about alignment, and generates memcpy() as a plain memory read; and then for the condition in the template it's also smart enough to see that both branches end up with the same code, and eliminates the condition check.

Smart, huh? So now I'm not sure whether keeping this template as it is makes sense, or whether I should just change it to a plain memcpy().

BTW, I must say it took me a few attempts to come up with code that would not get optimized out by the compiler. I've found out that memcpy() is not suitable for reading volatile data nowadays. It defines its second argument as (const void *), so the volatility has to be cast away when passing the pointer to it, and then the compiler just happily hoists the whole memory read out of the loop. It was even smart enough to sometimes replace the whole loop with a multiplication instead of collecting a sum. That's where the modulo operator came in handy to confuse this extreme smartness.

A separate interesting note is that each iteration of the loop executes in about one-billionth of a second. Since this is a 3GHz CPU, every iteration takes only about 3 CPU cycles. That's 6 to 10 instructions executed in 3 CPU cycles. Such are the modern miracles.

ThreadedClient improvement of timeout handling

The work on updating the tests for the new ordered index brought a nice side effect: the handling of timeouts in the Perl class X::ThreadedClient has improved. Now it returns not only the error message but also the data received up to that point. This helps a lot with diagnosing errors in the automated tests of TQL: the error messages get returned in a clear way.

Friday, April 17, 2015

Ordered Index implemented in C++

The ordered index implemented in Perl has been pretty slow, so I've finally got around to implementing one in C++. It became much faster. Here is an excerpt from the performance test results (with an optimized build):

Table insert makeRowArray (single hashed idx, direct) 0.551577 s, 181298.48 per second.
  Excluding makeRowArray 0.311936 s, 320578.44 per second.
Table insert makeRowArray (single ordered int idx, direct) 0.598462 s, 167095.09 per second.
  Excluding makeRowArray 0.358821 s, 278690.37 per second.
Table insert makeRowArray (single ordered string idx, direct) 1.070565 s, 93408.64 per second.
  Excluding makeRowArray 0.830924 s, 120347.91 per second.
Table insert makeRowArray (single perl sorted idx, direct) 21.810374 s, 4584.97 per second.
  Excluding makeRowArray 21.570734 s, 4635.91 per second.
Table lookup (single hashed idx) 0.147418 s, 678342.02 per second.
Table lookup (single ordered int idx) 0.174585 s, 572785.77 per second.
Table lookup (single ordered string idx) 0.385963 s, 259092.38 per second.
Table lookup (single perl sorted idx) 6.840266 s, 14619.32 per second.


Here the "ordered" is the new ordered index in C++ (on an integer field and on a string field), and "perl sorted" is the ordering in Perl, on an integer field. Even though the ordered index is a little slower than the hashed index, it's about 40-70 times faster than the ordering implemented in Perl.

Naturally, ordering by a string field is slower than by an integer one, since it not only has to compare more bytes one-by-one but also honors the locale-specific collation order.

In Perl, the ordered index type is created very similarly to the hashed index type:

Triceps::IndexType->newOrdered(key => [ "a", "!b" ])

The single option "key" specifies the array with the names of the key fields of the index. The "!" in front of a field name says that the order on this field is descending; otherwise the order is ascending. This is a more compact and arguably easier-to-read format than the one used by the SimpleOrderedIndex.

As usual, the NULL values in the key fields are permitted, and are considered less than any non-NULL value.

The array fields may also be used in the ordered indexes.

The ordered index does return the list of key fields with getKey(), so it can be used in joins and can be found by key fields with TableType::findIndexPathForKeys() and friends, just like the hashed index.

The getKey() for an ordered index returns the list of plain field names, without the indications of descending order. Thus the finding of the index by key fields just works out of the box, unchanged.

To get the key fields as they were specified in the index creation, including the possible "!", use the new method:

@fields = $indexType->getKeyExpr();

The idea of this method is that the contents of the array returned by it depends on the index type and is an "expression" that can be used to build another instance of the same index type. For the hashed index it simply returns the same data as getKey(). For the ordered index it returns the list of keys with indications. For the indexes with Perl conditions it still returns nothing, though in the future might be used to store the condition.

In the C++ API the index type gets created with the constructor:

OrderedIndexType(NameSet *key = NULL);
static OrderedIndexType *make(NameSet *key = NULL);

The key can also be set after construction:

OrderedIndexType *setKey(NameSet *key);

Same as in Perl, the field names in the NameSet can be prefixed with a "!" to specify the descending order on that field.

The new method in the IndexType is:

virtual const NameSet *getKeyExpr() const;

And the new constant for the index id is IT_ORDERED, available in both C++ and Perl.

Sunday, March 22, 2015

more of performance numbers, with optimization

I've realized that the previous published  performance numbers were produced by a build without optimization. So I've enabled the optimization with -O3 -fno-strict-aliasing, and the numbers improved, some of them more than twice (that's still an old Core 2 Duo 3GHz laptop):

Performance test, 100000 iterations, real time.
Empty Perl loop 0.005546 s, 18030711.03 per second.
Empty Perl function of 5 args 0.031840 s, 3140671.52 per second.
Empty Perl function of 10 args 0.027833 s, 3592828.57 per second.
Row creation from array and destruction 0.244782 s, 408526.42 per second.
Row creation from hash and destruction 0.342986 s, 291557.00 per second.
Rowop creation and destruction 0.133347 s, 749922.94 per second.
Calling a dummy label 0.047770 s, 2093352.57 per second.
Calling a chained dummy label 0.049562 s, 2017695.16 per second.
  Pure chained call 0.001791 s, 55827286.04 per second.
Calling a Perl label 0.330590 s, 302489.48 per second.
Row handle creation and destruction 0.140406 s, 712218.40 per second.
Repeated table insert (single hashed idx, direct) 0.121556 s, 822665.80 per second.
Repeated table insert (single hashed idx, direct & Perl construct) 0.337209 s, 296551.58 per second.
  RowHandle creation overhead in Perl 0.215653 s, 463707.00 per second.
Repeated table insert (single sorted idx, direct) 1.083628 s, 92282.62 per second.
Repeated table insert (single hashed idx, call) 0.153614 s, 650981.13 per second.
Table insert makeRowArray (single hashed idx, direct) 0.553364 s, 180713.02 per second.
  Excluding makeRowArray 0.308581 s, 324063.65 per second.
Table insert makeRowArray (double hashed idx, direct) 0.638617 s, 156588.31 per second.
  Excluding makeRowArray 0.393835 s, 253913.40 per second.
  Overhead of second index 0.085254 s, 1172969.41 per second.
Table insert makeRowArray (single sorted idx, direct) 22.355793 s, 4473.11 per second.
  Excluding makeRowArray 22.111011 s, 4522.63 per second.
Table lookup (single hashed idx) 0.142762 s, 700466.78 per second.
Table lookup (single sorted idx) 6.929484 s, 14431.09 per second.
Lookup join (single hashed idx) 2.944098 s, 33966.26 per second.
Nexus pass (1 row/flush) 0.398944 s, 250661.96 per second.
Nexus pass (10 rows/flush) 0.847021 s, 1180608.79 per row per second.
  Overhead of each row 0.049786 s, 2008583.51 per second.
  Overhead of flush 0.349157 s, 286403.84 per second.


I've also tried to run the numbers on a newer laptop with a 2GHz Core i7 CPU, in a VM configured with 2 CPUs. On the build without optimization, the numbers came out very similar to the old laptop. On the build with optimization they came out up to 50% better, but not consistently so (perhaps running in a VM added variability).

Sunday, March 1, 2015

Triceps 2.0.1 released

This time I'm trying to update the docs following the code changes, and do the small releases.

The release 2.0.1 is here! It includes:

  • Fixed the version information that was left incorrect (at 0.99).
  • Used a more generic pattern in tests for Perl error messages that have changed in the more recent versions of Perl (per CPAN report #99268).
  • Added the more convenient way to wrap the error reports in Perl, Triceps::nestfess() and Triceps::wrapfess().
  • Added functions for the nicer printing of auto-generated code, Triceps::alignsrc() and Triceps::numalign().
  • In the doc chapter on the templates, fixed the output of the examples: properly interleaved the inputs and outputs.

Monday, December 29, 2014

formatting the source code snippets

I've got to the point of printing snippets of the source code in the error messages when the said code fails to compile, and realized that they could be printed better. So I wrote a couple of helper functions to print them better.

First of all, it could be indented better, to match the indenting of the rest of the error message.

Second (but connected), it could be left-aligned better:  the code snippets tend to have the extra spacing on the left.

And third, it could use the line numbers, to make the error location easier to find.

The magic is done with two functions, Triceps::Code::alignsrc() and Triceps::Code::numalign(). They work in the exact same way, have the same arguments, only numalign() adds the line numbers while alignsrc() doesn't.

Here is an example of use:

        confess "$myname: error in compilation of the generated function:\n  $@function text:\n"
        . Triceps::Code::numalign($gencode, "  ") . "\n";

It can produce an error message like this (with a deliberately introduced syntax error):

Triceps::Fields::makeTranslation: error in compilation of the generated function:
  syntax error at (eval 27) line 13, near "})
"
function text:
     2 sub { # (@rows)
     3   use strict;
     4   use Carp;
     5   confess "template internal error in Triceps::Fields::makeTranslation: result translation expected 1 row args, received " . ($#_+1)
     6     unless ($#_ == 0);
     7   # $result_rt comes at compile time from Triceps::Fields::makeTranslation
     8   return $result_rt->makeRowArray(
     9     $_[0]->get("one"),
    10     $_[0]->get("two"),
    11   );
    12 })
 at /home/babkin/w/triceps/trunk/perl/Triceps/blib/lib/Triceps/Fields.pm line 219
    Triceps::Fields::makeTranslation('rowTypes', 'ARRAY(0x2943cb0)', 'filterPairs', 'ARRAY(0x2943b30)', '_simulateCodeError', 1) called at t/Fields.t line 205
    eval {...} called at t/Fields.t line 204


The first argument of alignsrc() and numalign() is the code snippet string to be aligned. The following will be done to it:

  • The empty lines at the front will be removed. numalign() is smart enough to take the removed lines into account and adjust the numbering. That's why the numbering in the error message shown above starts with 2. You can also get the number of the removed lines afterwards, from the global variable $Triceps::Code::align_removed_lines.
  • The \n at the end of the snippet will be chomped. But only one, the rest of the empty lines at the end will be left alone.
  • Then the "baseline" indenting of the code will be determined by looking at the first three and last two lines. The shortest non-empty indenting will be taken as the baseline. If some lines examined start with spaces and some start with tabs, the lines starting with tabs will be preferred as the baseline indenting.
  • The baseline indenting will be removed from the front of all lines. If some lines in the middle of the code have a shorter indenting, they will be left unchanged.
  • The tabs will be replaced by two spaces each. If you prefer a different replacement, you can specify it as the third argument of the function.
  • In numalign() the line numbers will be prepended to the lines.
  • The indenting from the second argument of the function will be prepended to the lines.

The second argument contains the new indenting that aligns the code nicely with the rest of the error message. Technically, it's optional and defaults to an empty string.

The third argument is optional and allows providing an alternative replacement for the tab characters in the code. If it's an empty string, it reverts to the default two spaces "  ". To keep the tabs unchanged, specify it as "\t".

Friday, December 26, 2014

error reporting in the templates

When writing Triceps templates, it's always good to make them report any usage errors in the terms of the template (though the extra detail doesn't hurt either). That is, if a template builds a construction out of lower-level primitives and one of these primitives fails, the good approach is not to just pass through the error from the primitive but to wrap it into a high-level explanation.

This is easy to do if the primitives report the errors by returning them directly, as Triceps did in the version 1. Just check for the error in the result, and if an error is found, add the high-level explanation and return it further.

It becomes more difficult when the errors are reported as exceptions, which in Perl means by die() or confess(). The basic handling is easy: there is no need to do anything to let the exception propagate up, but adding the extra information becomes difficult. First, you've got to explicitly check for these errors by catching them with eval() (which is more difficult than checking for the errors returned directly), and only then can you add the extra information and re-throw. And then there is the pesky problem of the stack traces: if the re-throw uses confess(), it will likely add a duplicate of at least a part of the stack trace that came with the underlying error; and if it uses die(), the stack trace might be incomplete, since the native XS code includes the stack trace only up to the nearest eval(), to prevent the same duplication problem when unrolling the stacks mixed between Perl and Triceps scheduling.

Because of this, some of the template error reporting got worse in Triceps 2.0.

Well, I've finally come up with the solution. The solution is not even limited to Triceps, it can be used with any kind of Perl programs. Here is a small example of how this solution is used, from Fields::makeTranslation():

    my $result_rt = Triceps::wrapfess
        "$myname: Invalid result row type specification:",
        sub { Triceps::RowType->new(@rowdef); };

The function Triceps::wrapfess() takes care of wrapping the confessions. It's very much like try/catch, only with hardcoded catch logic that adds the extra error information and then re-throws the exception.

Its first argument is the error message that describes the high-level problem. This message will get prepended to the error report when the error propagates up (and the original error message will get a bit of extra indenting, to nest under that high-level explanation).

The second argument is the code that might throw an error, like the try-block. The result from that block gets passed through as the result of wrapfess().

The full error message might look like this:

Triceps::Fields::makeTranslation: Invalid result row type specification:
  Triceps::RowType::new: incorrect specification:
    duplicate field name 'f1' for fields 3 and 2
    duplicate field name 'f2' for fields 4 and 1
  Triceps::RowType::new: The specification was: {
    f2 => int32[]
    f1 => string
    f1 => string
    f2 => float64[]
  } at blib/lib/Triceps/Fields.pm line 209.
    Triceps::Fields::__ANON__ called at blib/lib/Triceps.pm line 192
    Triceps::wrapfess('Triceps::Fields::makeTranslation: Invalid result row type spe...', 'CODE(0x1c531e0)') called at blib/lib/Triceps/Fields.pm line 209
    Triceps::Fields::makeTranslation('rowTypes', 'ARRAY(0x1c533d8)', 'filterPairs', 'ARRAY(0x1c53468)') called at t/Fields.t line 186
    eval {...} called at t/Fields.t line 185

It contains both the high-level and the detailed description of the error, and the stack trace.

The stack trace doesn't get indented, no matter how many times the message gets wrapped. wrapfess() uses a slightly dirty trick for that: it assumes that the error messages are indented with spaces while the stack trace from confess() is indented with a single tab character. So the extra spaces of indenting are added only to the lines that don't start with a tab.

Note also that even though wrapfess() uses eval(), there is no eval above it in the stack trace. That's the other part of the magic: since that eval is not meaningful, it gets cut from the stack trace, and wrapfess() also uses it to find its own place in the stack trace, the point from which a simple re-confession would dump the duplicate of the stack. So it cuts the eval and everything under it in the original stack trace, and then does its own confession, inserting the stack trace again. This works very well for the traces thrown by the XS code, which actually doesn't write anything below that eval; wrapfess() then adds the missing part of the stack.

Wrapfess() can do a bit more. Its first argument may be another code reference that generates the error message on the fly:

    my $result_rt = Triceps::wrapfess sub {
            "$myname: Invalid result row type specification:"
        },
        sub { Triceps::RowType->new(@rowdef); };

In this small example it's silly, but if the error diagnostics are complicated and require some elaborate printing of the data structures, the code reference will be called only if the error actually occurs, and the normal code path will avoid the extra overhead.

It gets even more flexible: the first argument of wrapfess() might also be a reference to a scalar variable that contains either a string or a code reference. I'm not sure yet whether it will be very useful, but it was easy to implement. The idea is that it allows writing only one wrapfess() call and then changing the error messages from inside the second argument, providing different error reports for its different sections. Something like this:

    my $eref;
    return Triceps::wrapfess \$eref,
        sub {
            $eref = "Bad argument foo";
            buildTemplateFromFoo();
            $eref = sub {
                my $bardump = $argbar->dump();
                $bardump =~ s/^/    /mg;
                return "Bad argument bar:\n  bar value is:\n$bardump";
            };
            buildTemplateFromBar();
            ...
        };

It might be too flexible, we'll see how it works.

Internally, wrapfess() uses the function Triceps::nestfess() to re-throw the error. Nestfess() can also be used directly, like this:

eval {
  buildTemplatePart();
};
if ($@) {
  Triceps::nestfess("High-level error message", $@);
}

The first argument is the high-level descriptive message to prepend; the second argument is the original error caught by eval. Nestfess() is responsible for all the smart processing of the indenting and stack traces; wrapfess() is really just a bit of syntactic sugar on top of it.