
Wednesday, April 22, 2015

On the performance of reading unaligned data

I've made a template that was supposed to optimize the reading of unaligned data:

#include <string.h>   // for memcpy()
#include <stdint.h>   // for intptr_t

template <typename T>
T getUnaligned(const T *ptr)
{
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        // The pointer is not aligned to sizeof(T), copy byte by byte.
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        // The pointer is properly aligned, read directly.
        return *ptr;
    }
}
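
For illustration, here's a minimal usage sketch (assuming the template above is in scope; the buffer contents are made up for the example):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main()
{
    // Place a value at an odd offset, so that the pointer
    // (const int32_t *)(buf + 1) is not aligned to 4 bytes.
    char buf[8] = {0};
    int32_t v = 0x12345678;
    memcpy(buf + 1, &v, sizeof(v));

    int32_t x = getUnaligned((const int32_t *)(buf + 1));
    printf("0x%x\n", x); // prints 0x12345678
    return 0;
}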

And I've been wondering if it really makes a difference. So I've made a test. The test compares 5 cases:

Just read the aligned memory directly;
Read the aligned memory through memcpy();
Read the unaligned memory through memcpy();
Read the aligned memory through the template;
Read the unaligned memory through the template.

All the cases follow this general pattern:

UTESTCASE readAlignedDirect(Utest *utest)
{
    int64_t n = findRunCount(); // the calibrated number of iterations
    int64_t x = 0;

    double tstart = now();
    for (int64_t i = 0; i < n; i++) {
        x += apdata[i % ARRAYSZ]; // apdata is the test data array
    }
    double tend = now();
    printf("        [%lld] %lld iterations, %f seconds, %f iter per second\n",
        (long long)x, (long long)n, (tend-tstart), (double)n / (tend-tstart));
}
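
The helpers findRunCount(), now() and apdata come from the test harness. For reference, now() can be sketched as a wall-clock reading in fractional seconds (this is an assumption about the harness, not its actual code):

#include <time.h>

// A hypothetical equivalent of now(): wall-clock time as a double.
static double now()
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}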


The results on an AMD64 machine with -O3 came out like this:

  case 1/6: warmup
        [0] 4294967296 iterations, 4.601818 seconds, 933319757.968194 iter per second
  case 2/6: readAlignedDirect
        [0] 4294967296 iterations, 4.579874 seconds, 937791527.916276 iter per second
  case 3/6: readAlignedMemcpy
        [0] 4294967296 iterations, 4.537628 seconds, 946522579.007429 iter per second
  case 4/6: readUnalignedMemcpy
        [0] 4294967296 iterations, 6.215574 seconds, 691000961.046305 iter per second
  case 5/6: readAlignedTemplate
        [0] 4294967296 iterations, 4.579680 seconds, 937831317.452678 iter per second
  case 6/6: readUnalignedTemplate
        [0] 4294967296 iterations, 5.052706 seconds, 850033049.741168 iter per second

Even more interestingly, the code generated for readAlignedDirect(), readAlignedTemplate() and readUnalignedTemplate() is exactly the same:

.L46:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    addq    (%rcx,%rdx,8), %rbx
    cmpq    %rbp, %rax
    jne .L46

The code generated for readAlignedMemcpy() and readUnalignedMemcpy() is slightly different but very similar:

.L34:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    salq    $3, %rdx
    addq    (%rcx), %rdx
    movq    (%rdx), %rdx
    addq    %rdx, %rbx
    cmpq    %rax, %rbp
    movq    %rdx, (%rsi)
    jg  .L34

Basically, the compiler is smart enough to know that the machine doesn't care about alignment, and generates memcpy() as a plain memory read; and for the condition in the template it's also smart enough to see that both branches end up with the same code, and eliminates the condition check.

Smart, huh? So now I'm not sure whether keeping this template as it is makes sense, or whether I should just change it to a plain memcpy().

BTW, I must say it took me a few attempts to come up with code that would not get optimized out by the compiler. I've found out that memcpy() is not suitable for reading volatile data nowadays. It defines its second argument as (const void *), so the volatility has to be cast away to pass the pointer, and then the compiler just happily takes the whole memory read out of the loop. It was even smart enough to sometimes replace the whole loop with a multiplication instead of collecting a sum. That's where the modulo operator came in handy to confuse this extreme smartness.
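
To illustrate (this is my reconstruction, not the actual test code, with the array declared volatile as described): once the volatility is cast away for memcpy(), the compiler may hoist the read out of the loop entirely, while indexing by i % ARRAYSZ makes the address depend on the loop counter and forces the read to stay inside:

#include <string.h>
#include <stdint.h>

enum { ARRAYSZ = 16 };
volatile int64_t apdata[ARRAYSZ];

int64_t sumHoisted(int64_t n)
{
    int64_t x = 0, v;
    for (int64_t i = 0; i < n; i++) {
        // The cast discards volatile, so the compiler is free to move
        // this read out of the loop and turn the sum into a multiply.
        memcpy(&v, (const void *)&apdata[0], sizeof(v));
        x += v;
    }
    return x;
}

int64_t sumKept(int64_t n)
{
    int64_t x = 0, v;
    for (int64_t i = 0; i < n; i++) {
        // The address depends on i, so the read stays in the loop.
        memcpy(&v, (const void *)&apdata[i % ARRAYSZ], sizeof(v));
        x += v;
    }
    return x;
}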

A separate interesting note is that each iteration of the loop executes in about one-billionth of a second: 4294967296 iterations in roughly 4.6 seconds comes to about 1.07 nanoseconds per iteration. Since this is a 3GHz CPU, every iteration takes only about 3 CPU cycles. That's 6 or 10 instructions executed in 3 CPU cycles. Such are the modern miracles.

Sunday, March 22, 2015

more performance numbers, with optimization

I've realized that the previously published performance numbers were produced by a build without optimization. So I've enabled the optimization with -O3 -fno-strict-aliasing, and the numbers improved, some of them more than doubling (that's still an old Core 2 Duo 3GHz laptop):

Performance test, 100000 iterations, real time.
Empty Perl loop 0.005546 s, 18030711.03 per second.
Empty Perl function of 5 args 0.031840 s, 3140671.52 per second.
Empty Perl function of 10 args 0.027833 s, 3592828.57 per second.
Row creation from array and destruction 0.244782 s, 408526.42 per second.
Row creation from hash and destruction 0.342986 s, 291557.00 per second.
Rowop creation and destruction 0.133347 s, 749922.94 per second.
Calling a dummy label 0.047770 s, 2093352.57 per second.
Calling a chained dummy label 0.049562 s, 2017695.16 per second.
  Pure chained call 0.001791 s, 55827286.04 per second.
Calling a Perl label 0.330590 s, 302489.48 per second.
Row handle creation and destruction 0.140406 s, 712218.40 per second.
Repeated table insert (single hashed idx, direct) 0.121556 s, 822665.80 per second.
Repeated table insert (single hashed idx, direct & Perl construct) 0.337209 s, 296551.58 per second.
  RowHandle creation overhead in Perl 0.215653 s, 463707.00 per second.
Repeated table insert (single sorted idx, direct) 1.083628 s, 92282.62 per second.
Repeated table insert (single hashed idx, call) 0.153614 s, 650981.13 per second.
Table insert makeRowArray (single hashed idx, direct) 0.553364 s, 180713.02 per second.
  Excluding makeRowArray 0.308581 s, 324063.65 per second.
Table insert makeRowArray (double hashed idx, direct) 0.638617 s, 156588.31 per second.
  Excluding makeRowArray 0.393835 s, 253913.40 per second.
  Overhead of second index 0.085254 s, 1172969.41 per second.
Table insert makeRowArray (single sorted idx, direct) 22.355793 s, 4473.11 per second.
  Excluding makeRowArray 22.111011 s, 4522.63 per second.
Table lookup (single hashed idx) 0.142762 s, 700466.78 per second.
Table lookup (single sorted idx) 6.929484 s, 14431.09 per second.
Lookup join (single hashed idx) 2.944098 s, 33966.26 per second.
Nexus pass (1 row/flush) 0.398944 s, 250661.96 per second.
Nexus pass (10 rows/flush) 0.847021 s, 1180608.79 per row per second.
  Overhead of each row 0.049786 s, 2008583.51 per second.
  Overhead of flush 0.349157 s, 286403.84 per second.


I've also tried to run the numbers on a newer laptop with a 2GHz Core i7 CPU, in a VM configured with 2 CPUs. On the build without optimization, the numbers came out very similar to the old laptop's. On the build with optimization they came out up to 50% better, but not consistently so (perhaps running in a VM added variability).

Saturday, January 18, 2014

performance variations

I've had an opportunity to run the performance tests on a few more laptops, all of the same Core2 generation but with different CPU frequencies. The 2GHz version showed, as expected, about 10% lower performance, except for the inter-thread communication through the nexus: that one went up. On a 3GHz CPU all the performance went up about proportionally. So I'm not sure what was up with the 2.2GHz CPU; maybe the timing just worked out wrong and added more overhead.

Here are the results from a 3GHz CPU:

Performance test, 100000 iterations, real time.
Empty Perl loop 0.006801 s, 14702927.05 per second.
Empty Perl function of 5 args 0.031918 s, 3133046.99 per second.
Empty Perl function of 10 args 0.027418 s, 3647284.30 per second.
Row creation from array and destruction 0.277232 s, 360708.19 per second.
Row creation from hash and destruction 0.498189 s, 200727.04 per second.
Rowop creation and destruction 0.155996 s, 641043.69 per second.
Calling a dummy label 0.098480 s, 1015437.21 per second.
Calling a chained dummy label 0.110546 s, 904601.83 per second.
  Pure chained call 0.012066 s, 8287664.25 per second.
Calling a Perl label 0.512559 s, 195099.61 per second.
Row handle creation and destruction 0.195778 s, 510781.66 per second.
Repeated table insert (single hashed idx, direct) 0.772934 s, 129377.12 per second.
Repeated table insert (single hashed idx, direct & Perl construct) 1.109781 s, 90107.89 per second.
  RowHandle creation overhead in Perl 0.336847 s, 296871.05 per second.
Repeated table insert (single sorted idx, direct) 2.122350 s, 47117.59 per second.
Repeated table insert (single hashed idx, call) 0.867846 s, 115227.88 per second.
Table insert makeRowArray (single hashed idx, direct) 1.224588 s, 81660.14 per second.
  Excluding makeRowArray 0.947355 s, 105557.02 per second.
Table insert makeRowArray (double hashed idx, direct) 1.443053 s, 69297.51 per second.
  Excluding makeRowArray 1.165821 s, 85776.47 per second.
  Overhead of second index 0.218466 s, 457738.04 per second.
Table insert makeRowArray (single sorted idx, direct) 29.880962 s, 3346.61 per second.
  Excluding makeRowArray 29.603730 s, 3377.95 per second.
Table lookup (single hashed idx) 0.287407 s, 347938.44 per second.
Table lookup (single sorted idx) 9.160540 s, 10916.39 per second.
Lookup join (single hashed idx) 3.940388 s, 25378.21 per second.
Nexus pass (1 row/flush) 0.618648 s, 161642.86 per second.
Nexus pass (10 rows/flush) 2.417818 s, 413596.01 per second.


I've added a few more tests: the table look-ups, and the passing of rows through the nexus with 10 rows per batch. With the batching, the inter-thread communication works decently fast.

Sunday, November 10, 2013

Triceps performance

I've finally got interested enough in Triceps performance to write a little test, Perf.t. By default it runs only one thousand iterations, to be fast and not delay the run of the full test suite. But the number can be increased by setting an environment variable, like:

$ TRICEPS_PERF_COUNT=100000 perl t/Perf.t

An important caveat: the test is of the Perl interface, so it includes all the overhead of constructing the Perl objects. I've tried to structure it so that some of the underlying performance can be deduced, but it's still approximate. I haven't done the performance testing of just the underlying C++ implementation yet; its numbers should come out better.

Here are the numbers I've got on my 6-year-old laptop (dual-CPU Intel Core2 T7600 2.33GHz), with explanations. The time in seconds for each value is for the whole test loop. The "per second" number shows how many loop iterations were done per second.

The computations are done with the real elapsed time, so if the machine is not idle, the time of the other processes will still get counted against the tests, and the results will come out slower than they really are.

Performance test, 1000 iterations, real time.

The first thing it prints is the iteration count, to set the expectations for the run length and precision.

Empty Perl loop 0.000083 s, 11983725.71 per second. 

A calibration to see how much overhead is added by the execution of the loop itself. As it turns out, not much.

Row creation from array and destruction 0.003744 s, 267085.07 per second. 

The makeRowArray() for a row of 5 fields. Each created row gets destroyed before the next one gets created.

Row creation from hash and destruction 0.006420 s, 155771.52 per second.

The makeRowHash() for a row of 5 fields.


Rowop creation and destruction 0.002067 s, 483716.30 per second.

The makeRowop() from an existing row. Same thing, each rowop gets destroyed before constructing the next one.

Calling a dummy label 0.001358 s, 736488.85 per second.

Repeated calls of a dummy label with the same rowop object.

Calling a chained dummy label 0.001525 s, 655872.40 per second.
  Pure chained call 0.000167 s, 5991862.86 per second.



Repeated calls of a dummy label that has another dummy label chained to it. The "pure" part is the difference from the previous case, i.e. the cost added by the extra chained dummy label.


Calling a Perl label 0.006669 s, 149946.52 per second.

Repeated calls of a Perl label with the same rowop object. The Perl label has an empty sub but that empty sub still gets executed, along with all the support functionality.

Row handle creation and destruction 0.002603 s, 384234.52 per second.

The creation of a table's row handle from a single row, including the creation of the Perl wrapper for the row handle object.

Repeated table insert (single hashed idx, direct) 0.010403 s, 96126.88 per second.

Insert of the same row into a table. Since the row is the same, it keeps replacing the previous one, and the table size stays at 1 row. Even though the row is the same, a new row handle gets constructed for it every time by the table, the code is $tSingleHashed->insert($row1). "Single hashed idx" means that the table has a single Hashed index, on an int32 field. "Direct" means the direct insert() call, as opposed to using the table's input label.

Repeated table insert (single hashed idx, direct & Perl construct) 0.014809 s, 67524.82 per second.
  RowHandle creation overhead in Perl 0.004406 s, 226939.94 per second.


The same, only the row handles are constructed in Perl before inserting them: $tSingleHashed->insert($tSingleHashed->makeRowHandle($row1)). And the second line shows that the overhead of wrapping the row handles for Perl is pretty noticeable (it's the difference from the previous test case).

Repeated table insert (single sorted idx, direct) 0.028623 s, 34937.39 per second.

The same thing, only for a table that uses a Sorted index that executes a Perl comparison on the same int32 field. As you can see, it gets 3 times slower.

Repeated table insert (single hashed idx, call) 0.011656 s, 85795.90 per second.

The same thing, again the table with a single Hashed index, but this time by sending the rowops to its input label.

Table insert makeRowArray (single hashed idx, direct) 0.015910 s, 62852.02 per second.
  Excluding makeRowArray 0.012166 s, 82194.52 per second.


Now different rows get inserted into the table, each row having a different key. At the end of this test the table contains 1000 rows (or however many were requested through the environment variable). Naturally, this is slower than the repeated insertions of the same row, since the tree of the table's index becomes deeper and requires more comparisons and rebalancing; in the runs with more rows this performance drops further, as the index grows deeper still. Since the rows are all different, they are created on the fly, so this row creation overhead needs to be excluded to get the actual performance of the table itself.

Table insert makeRowArray (double hashed idx, direct) 0.017231 s, 58033.37 per second.
  Excluding makeRowArray 0.013487 s, 74143.61 per second.
  Overhead of second index 0.001321 s, 756957.95 per second.


Similar to the previous case, but on a table that has two Hashed indexes (both on the same int32 field). The details here also compute the overhead contributed by the second index.

Table insert makeRowArray (single sorted idx, direct) 0.226725 s, 4410.64 per second.
  Excluding makeRowArray 0.222980 s, 4484.70 per second.


Similar, but for a table with a Sorted index with a Perl expression. As you can see, it's about 20 times slower (and it gets even worse for larger row sets).

Nexus pass 0.034009 s, 29403.79 per second.

The performance of passing the rows between threads through a Nexus. This is a highly pessimistic case, with only one row per nexus transaction. The time also includes the draining and stopping of the app.

And here are the numbers for a run with 100 thousand iterations, for comparison:

Performance test, 100000 iterations, real time.
Empty Perl loop 0.008354 s, 11970045.66 per second.
Row creation from array and destruction 0.386317 s, 258854.76 per second.
Row creation from hash and destruction 0.640852 s, 156042.16 per second.
Rowop creation and destruction 0.198766 s, 503105.38 per second.
Calling a dummy label 0.130124 s, 768497.20 per second.
Calling a chained dummy label 0.147262 s, 679062.46 per second.
  Pure chained call 0.017138 s, 5835066.29 per second.
Calling a Perl label 0.652551 s, 153244.80 per second.
Row handle creation and destruction 0.252007 s, 396813.99 per second.
Repeated table insert (single hashed idx, direct) 1.053321 s, 94937.81 per second.
Repeated table insert (single hashed idx, direct & Perl construct) 1.465050 s, 68257.07 per second.
  RowHandle creation overhead in Perl 0.411729 s, 242878.43 per second.
Repeated table insert (single sorted idx, direct) 2.797103 s, 35751.28 per second.
Repeated table insert (single hashed idx, call) 1.161150 s, 86121.54 per second.
Table insert makeRowArray (single hashed idx, direct) 1.747032 s, 57239.94 per second.
  Excluding makeRowArray 1.360715 s, 73490.78 per second.
Table insert makeRowArray (double hashed idx, direct) 2.046829 s, 48856.07 per second.
  Excluding makeRowArray 1.660511 s, 60222.41 per second.
  Overhead of second index 0.299797 s, 333559.51 per second.
Table insert makeRowArray (single sorted idx, direct) 38.355396 s, 2607.20 per second.
  Excluding makeRowArray 37.969079 s, 2633.72 per second.
Nexus pass 1.076210 s, 92918.63 per second.

As you can see, the table insert performance got worse due to the added depth of the index trees while the nexus performance got better because the drain overhead got spread over a larger number of rows.

Sunday, February 24, 2013

atomic int vs mutex

I've been wondering how much faster the native implementation of an atomic integer (such as Triceps imports from NSPR) is compared to a simple mutex-based approach (lock the mutex, change the value, unlock the mutex).

I've finally got interested enough to measure it. On my machine, compiled with GCC optimization -O3, the difference is about 2.5-3 times, and it's about the same both with and without contention. So the atomic is faster, but not hugely so.

An interesting thing is that with two threads contending for the same mutex, Linux shows the total user CPU time as about the same as the real time. With two threads contending for the same atomic integer, the user CPU time is about twice the real time. It looks like Linux manages to account for the wait time properly even in these tiny increments, and without adding a huge amount of overhead.

And the real time with two threads contending for the same mutex (that is, threads doing nothing but locking and unlocking the mutex in a loop) is about 3-3.5 times higher than with the same two threads running sequentially. The same goes for the atomic integers.

The program I used was:

#include <stdlib.h>
#include <sched.h>   // for sched_yield()
#include <vector>
#include "pw/ptwrap.h"
#include "mem/Atomic.h"

using namespace Triceps;

const int NTHREADS = 2;

pw::pmutex m;
AtomicInt ai(0);

class TestThread: public pw::pwthread
{
public:
  TestThread():
    p_(NULL)
  { }

  virtual void *execute()
  {
    // Wait until main() unlocks the mutex, starting the run.
    m.lock();
    m.unlock();
    for (int i = 0; i < 10*1000*1000; i++) {
      if (p_)
        p_();
      // Pick what to measure: uncomment ai.inc() and comment out the
      // lock-unlock pair to test the atomic, or leave as is for the mutex.
      // ai.inc();
      m.lock();
      m.unlock();
    }
    return NULL; // the declared return type requires a value
  }

  void (*p_)();
};

std::vector<TestThread *> t;

int main(int argc, char **argv)
{
  m.lock(); // hold the mutex, so that the threads started next block on it
  for (int i = 0; i < NTHREADS; i++) {
    TestThread *job = new TestThread;
    t.push_back(job);
    job->start();
  }
  // Give all the threads a chance to start up and block on the mutex.
  sched_yield();
  sched_yield();
  sched_yield();
  sched_yield();
  sched_yield();
  sched_yield();
  sched_yield();
  sched_yield();
  sched_yield();
  m.unlock(); // the run starts!
  for (int i = 0; i < NTHREADS; i++) {
    t[i]->join();
    delete t[i];
  }
  return 0;
}


Comment, uncomment and adjust the constants as needed. The iteration count of 10 million may be much too low for the newer, faster machines.

The build command for me was:

g++ -DTRICEPS_NSPR=4 -I $TRICEPS/trunk/cpp -O3 -o x x.cpp -lnspr4 -lpthread $TRICEPS/trunk/cpp/build/libtriceps.a

Adjust the library names and the TRICEPS_NSPR value as needed (for Ubuntu it would likely be "-DTRICEPS_NSPR=0 -lnspr"). To try the simple-minded implementation of atomics through a mutex, remove "-DTRICEPS_NSPR".

To measure the time, run it in the time command:

time ./x

When comparing the times, don't forget that when you run 2 threads, they make twice as many iterations!
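
For a quick modern reproduction without the Triceps and NSPR dependencies, a similar comparison can be sketched with the standard C++11 primitives (my own equivalent of the test above, not part of the original):

#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::atomic<long> ai(0);

int main()
{
    const int NTHREADS = 2;
    std::vector<std::thread> t;
    for (int i = 0; i < NTHREADS; i++) {
        t.emplace_back([]() {
            for (int j = 0; j < 10*1000*1000; j++) {
                // As before, comment one variant out and uncomment the other.
                // ai.fetch_add(1);
                m.lock();
                m.unlock();
            }
        });
    }
    for (auto &th : t)
        th.join();
    return 0;
}

Build it with something like "g++ -O3 -o y y.cpp -lpthread" and run it under the time command in the same way.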