Wednesday, April 22, 2015

On the performance of reading unaligned data

I've made a template that was supposed to optimize the reading of unaligned data:

#include <stdint.h>
#include <string.h>

template <typename T>
T getUnaligned(const T *ptr)
{
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        // Unaligned: copy the bytes into an aligned local variable first.
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        // Aligned: read directly.
        return *ptr;
    }
}

And I've been wondering if it really makes a difference, so I've made a test. The test compares 5 cases (plus a warm-up pass):

Just read the aligned memory directly;
Read the aligned memory through memcpy();
Read the unaligned memory through memcpy();
Read the aligned memory through the template;
Read the unaligned memory through the template.

All the cases follow this general pattern:

UTESTCASE readAlignedDirect(Utest *utest)
{
    int64_t n = findRunCount();
    int64_t x = 0;

    double tstart = now();
    for (int64_t i = 0; i < n; i++) {
        x += apdata[i % ARRAYSZ];
    }
    double tend = now();
    printf("        [%lld] %lld iterations, %f seconds, %f iter per second\n", (long long) x, (long long)n, (tend-tstart), (double)n / (tend-tstart));
}


The results on an AMD64 machine with -O3 came out like this:

  case 1/6: warmup
        [0] 4294967296 iterations, 4.601818 seconds, 933319757.968194 iter per second
  case 2/6: readAlignedDirect
        [0] 4294967296 iterations, 4.579874 seconds, 937791527.916276 iter per second
  case 3/6: readAlignedMemcpy
        [0] 4294967296 iterations, 4.537628 seconds, 946522579.007429 iter per second
  case 4/6: readUnalignedMemcpy
        [0] 4294967296 iterations, 6.215574 seconds, 691000961.046305 iter per second
  case 5/6: readAlignedTemplate
        [0] 4294967296 iterations, 4.579680 seconds, 937831317.452678 iter per second
  case 6/6: readUnalignedTemplate
        [0] 4294967296 iterations, 5.052706 seconds, 850033049.741168 iter per second

Even more interestingly, the code generated for readAlignedDirect(), readAlignedTemplate() and readUnalignedTemplate() is exactly the same:

.L46:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    addq    (%rcx,%rdx,8), %rbx
    cmpq    %rbp, %rax
    jne .L46

The code generated for readAlignedMemcpy() and readUnalignedMemcpy() is slightly different but very similar:

.L34:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    salq    $3, %rdx
    addq    (%rcx), %rdx
    movq    (%rdx), %rdx
    addq    %rdx, %rbx
    cmpq    %rax, %rbp
    movq    %rdx, (%rsi)
    jg  .L34

Basically, the compiler is smart enough to know that this machine doesn't care about alignment, so it generates memcpy() as a plain memory read. And for the condition in the template, it's also smart enough to see that both branches end up with the same code, and it eliminates the condition check altogether.

Smart, huh? So now I'm not sure whether it makes sense to keep this template as it is, or whether I should just change it to a plain memcpy().

BTW, I must say it took me a few attempts to come up with code that would not get optimized away by the compiler. I've found that memcpy() is not suitable for reading volatile data nowadays: it defines its second argument as (const void *), so the volatility has to be cast away to pass the pointer, and then the compiler just happily hoists the whole memory read out of the loop. It was even smart enough to sometimes replace the whole loop with a multiplication instead of collecting a sum. That's where the modulo operator came in handy to confuse this extreme smartness.

A separate interesting note is that each iteration of the loop executes in about one billionth of a second. Since this is a 3GHz CPU, every iteration takes only about 3 CPU cycles. That's 6 instructions (or 10 for the memcpy version) executed in about 3 CPU cycles. Such are the modern miracles.

ThreadedClient improvement of timeout handling

The work on updating the tests for the new ordered index brought a nice side effect: the handling of timeouts in the Perl class X:ThreadedClient has improved. Now on a timeout it returns not only the error message but also the data received up to that point. This helps a lot with diagnosing errors in the automated tests of TQL: the error messages get reported in a clear way.

Friday, April 17, 2015

Ordered Index implemented in C++

The ordered index implemented in Perl has been pretty slow, so I've finally got around to implementing one in C++, and it came out much faster. Here is an excerpt from the performance test results (with an optimized build):

Table insert makeRowArray (single hashed idx, direct) 0.551577 s, 181298.48 per second.
  Excluding makeRowArray 0.311936 s, 320578.44 per second.
Table insert makeRowArray (single ordered int idx, direct) 0.598462 s, 167095.09 per second.
  Excluding makeRowArray 0.358821 s, 278690.37 per second.
Table insert makeRowArray (single ordered string idx, direct) 1.070565 s, 93408.64 per second.
  Excluding makeRowArray 0.830924 s, 120347.91 per second.
Table insert makeRowArray (single perl sorted idx, direct) 21.810374 s, 4584.97 per second.
  Excluding makeRowArray 21.570734 s, 4635.91 per second.
Table lookup (single hashed idx) 0.147418 s, 678342.02 per second.
Table lookup (single ordered int idx) 0.174585 s, 572785.77 per second.
Table lookup (single ordered string idx) 0.385963 s, 259092.38 per second.
Table lookup (single perl sorted idx) 6.840266 s, 14619.32 per second.


Here the "ordered" is the new ordered index in C++ (on an integer field and on a string field), and "perl sorted" is the ordering in Perl, on an integer field. Even though the ordered index is a little slower than the hashed index, it's about 40-70 times faster than the ordering implemented in Perl.

Naturally, ordering by a string field is slower than by an integer one, since it not only has to compare more bytes one-by-one but also honors the locale-specific collation order.

In Perl, the ordered index type is created very similarly to the hashed index type:

Triceps::IndexType->newOrdered(key => [ "a", "!b" ])

The single option "key" specifies an array with the names of the key fields of the index. The "!" in front of a field name means that the order on this field is descending; otherwise the order is ascending. This format is more compact and arguably easier to read than the one used by the SimpleOrderedIndex.

As usual, the NULL values in the key fields are permitted, and are considered less than any non-NULL value.

The array fields may also be used in the ordered indexes.

The ordered index does return the list of key fields with getKey(), so it can be used in joins and can be found by key fields with TableType::findIndexPathForKeys() and friends, just like the hashed index.

The getKey() for an ordered index returns the list of plain field names, without the indications of descending order. Thus the finding of the index by key fields just works out of the box, unchanged.

To get the key fields as they were specified in the index creation, including the possible "!", use the new method:

@fields = $indexType->getKeyExpr();

The idea of this method is that the contents of the returned array depend on the index type and form an "expression" that can be used to build another instance of the same index type. For the hashed index it simply returns the same data as getKey(). For the ordered index it returns the list of keys with the order indications. For the indexes with Perl conditions it still returns nothing, though in the future it might be used to store the condition.

In the C++ API the index type gets created with the constructor or the static factory method:

OrderedIndexType(NameSet *key = NULL);
static OrderedIndexType *make(NameSet *key = NULL);

The key can also be set after construction:

OrderedIndexType *setKey(NameSet *key);

Same as in Perl, the field names in the NameSet can be prefixed with a "!" to specify the descending order on that field.

The new method in the IndexType is:

virtual const NameSet *getKeyExpr() const;

And the new constant for the index id is IT_ORDERED, available in both C++ and Perl.