Wednesday, April 22, 2015

On performace of reading the unaligned data

I've made a template that was supposed to optimize the reading of the unaligned data:

template <typename T>
T getUnaligned(const T *ptr)
{      
    if ((intptr_t)ptr & (sizeof(T)-1)) {
        T value;
        memcpy(&value, ptr, sizeof(T));
        return value;
    } else {
        return *ptr;
    }
}

And I've been wondering if it really makes a difference. So I've made a test.The test compares 5 cases:

Just read the aligned memory directly;
Read the aligned memory through memcpy;
Read the aligned memory through memcpy;
Read the aligned memory through the macro;
Read the unaligned memory through the macro.

All the cases follow this general pattern:

UTESTCASE readAlignedDirect(Utest *utest)
{
    int64_t n = findRunCount();
    int64_t x = 0;

    double tstart = now();
    for (int64_t i = 0; i < n; i++) {
        x += apdata[i % ARRAYSZ];
    }
    double tend = now();
    printf("        [%lld] %lld iterations, %f seconds, %f iter per second\n", (long long) x, (long long)n, (tend-tstart), (double)n / (tend-tstart));
}


The results on an AMD64 machine with -O3 came out like this:

   case 1/6: warmup
        [0] 4294967296 iterations, 4.601818 seconds, 933319757.968194 iter per second
  case 2/6: readAlignedDirect
        [0] 4294967296 iterations, 4.579874 seconds, 937791527.916276 iter per second
  case 3/6: readAlignedMemcpy
        [0] 4294967296 iterations, 4.537628 seconds, 946522579.007429 iter per second
  case 4/6: readUnalignedMemcpy
        [0] 4294967296 iterations, 6.215574 seconds, 691000961.046305 iter per second
  case 5/6: readAlignedTemplate
        [0] 4294967296 iterations, 4.579680 seconds, 937831317.452678 iter per second
  case 6/6: readUnalignedTemplate
        [0] 4294967296 iterations, 5.052706 seconds, 850033049.741168 iter per second

Even more interestingly, the code generated for readAlignedDirect(), readAlignedTemplate() and readUnalignedTemplate() is exactly the same:

.L46:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    addq    (%rcx,%rdx,8), %rbx
    cmpq    %rbp, %rax
    jne .L46

The code generated for readAlignedMemcpy() and readUnalignedMemcpy() is slightly different but very similar:

.L34:
    movq    %rax, %rdx
    addq    $1, %rax
    andl    $15, %edx
    salq    $3, %rdx
    addq    (%rcx), %rdx
    movq    (%rdx), %rdx
    addq    %rdx, %rbx
    cmpq    %rax, %rbp
    movq    %rdx, (%rsi)
    jg  .L34

Basically, the compiler is smart enough to know that the machine doesnt' care about alignment, and generates memcpy() as a plain memory read, and then for condition in the template it's also smart enough to know that both branches end up with the same code and eliminate the condition check.

Smart, huh? So now I'm not sure if keeping this template as it is makes sense, or should I just change it to a plain memcpy().

BTW,  I must say it took me a few attempts to make up the code that would not get optimized out by the compiler. I've found out that memcpy() is not suitable for reading the volatile data nowadays. It defines its second argument as (const void *), and so the volatility has to be cast away for passing the pointer to it, and then the compiler just happily takes the whole memory read out of the loop. And it was even smart enough to sometimes replace the whole loop with a multiplication instead of collecting a sum. That's where the modulo operator came handy to confuse this extreme smartness.

A separate interesting note is that each iteration of the loop executes in an about one-billionth of a second. Since this is a 3GHz CPU, this means that every iteration takes only about 3 CPU cycles.  That's 6 or 10 instructions executed in 3 CPU cycles. Such are the modern miracles.

No comments:

Post a Comment