I had a play with the vectoriser to see how peeling, unrolling, and alignment affect the performance of simple memory-bound loops.
The short story is:

* For fixed-length loops, don't peel
* Performance is the same for arrays aligned to 8 bytes and up
* Performance is very similar for unaligned arrays
* vld1 is as fast as vldmia
* vld1 with a specified alignment is much faster than plain vld1
The loop is rather ugly and artificial::

    void op(struct aints * __restrict out, const struct aints * __restrict in)
    {
        for (int i = 0; i < COUNT; i++) {
            out->v[i] = (in->v[i] * 173) | in->v[i];
        }
    }
where `struct aints` is an aligned structure. I couldn't figure out how to use an aligned typedef of int without still introducing a runtime check; I assume I was running into some type of runtime alias checking.
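For reference, the structure was along these lines (a sketch assuming GCC's aligned attribute; the COUNT and alignment values shown here are placeholders that were varied between runs)::

    /* Sketch of the test structure.  COUNT and the alignment value
       are assumptions; both were varied between runs. */
    #define COUNT 200

    struct aints {
        int v[COUNT];
    } __attribute__((aligned(32)));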
This compiled into::
        vmov.i32 q10, #173
        add r3, r0, #5
    0:  vldmia r1!, {d16-d17}
        vmul.i32 q9, q8, q10
        vorr q8, q9, q8
        vstmia r0!, {d16-d17}
        cmp r0, r3
        bne 0b
I then lied to the compiler by changing the actual alignment at runtime. See: http://people.linaro.org/~michaelh/incoming/runtime-offset.png
The performance didn't change for actual alignments of 8, 16, or 32 bytes.
I then converted the loop into one using vld1 and fed it smaller alignments. See: http://people.linaro.org/~michaelh/incoming/small-offsets.png
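The converted loop was along these lines (a sketch using NEON intrinsics rather than the exact test source; vld1q_s32/vst1q_s32 compile to vld1.32/vst1.32 with no alignment hint, so they accept the smaller alignments)::

    #include <arm_neon.h>

    /* Sketch of the vld1-based variant.  The intrinsic loads and
       stores carry no alignment hint, so they work for any actual
       alignment; COUNT is assumed to be a multiple of 4. */
    void op_vld1(int32_t * __restrict out, const int32_t * __restrict in)
    {
        int32x4_t k = vdupq_n_s32(173);

        for (int i = 0; i < COUNT; i += 4) {
            int32x4_t x = vld1q_s32(in + i);
            vst1q_s32(out + i, vorrq_s32(vmulq_s32(x, k), x));
        }
    }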
The throughput falls into two camps: one for alignments of 1, 2, or 4 bytes and one for 8, 16, or 32. The throughput is very similar for both camps but has some strange drop-offs at 24 words, around 48 words, and around 96 words. The terminal throughput at 300 words and above is within 0.5 %.
I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
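For reference, the hinted kernel was along these lines (a hypothetical reconstruction using GCC extended asm, not the exact harness source; op_hinted is a made-up name, and the :64 address qualifier is in bits)::

    #include <stdint.h>

    /* Sketch: the same kernel with explicit :64 alignment hints.
       The :64 qualifier promises the addresses are 64-bit aligned;
       count is assumed to be a non-zero multiple of 4. */
    void op_hinted(int32_t *out, const int32_t *in, int count)
    {
        asm volatile(
            "vmov.i32 q10, #173\n"
            "0:\n\t"
            "vld1.i64 {d16-d17}, [%1:64]!\n\t"
            "vmul.i32 q9, q8, q10\n\t"
            "vorr q8, q9, q8\n\t"
            "vst1.i64 {d16-d17}, [%0:64]!\n\t"
            "subs %2, %2, #4\n\t"
            "bne 0b"
            : "+r"(out), "+r"(in), "+r"(count)
            :
            : "d16", "d17", "d18", "d19", "d20", "d21", "cc", "memory");
    }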
I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed-count loop turning into a runtime count because of the unknown alignment.
This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
Raw results and the test cases are available in lp:~linaro-toolchain-dev/linaro-toolchain-benchmarks/private-runs
A graph of all results is at: http://people.linaro.org/~michaelh/incoming/everything.png
The usual caveats apply: this test was all in L1, only on the A9, and very artificial.
-- Michael
On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
> I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
>
> This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1:128], or am I looking for the wrong thing?
I thought that peeling should be useful at least for the hints.
> I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
>
> At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed-count loop turning into a runtime count because of the unknown alignment.
>
> This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
I see register spills starting from COUNT=36.
Ira
On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen ira.rosen@linaro.org wrote:
> On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
>
>> I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
>>
>> This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
>
> So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1:128], or am I looking for the wrong thing?
Yip. We currently use a vldmia r1!, {d16-d17}, which (on the A9 at least) only works for aligned values and takes the same time as the unaligned-friendly vld1.i64 {d16-d17}, [r1]!
> I thought that peeling should be useful at least for the hints.
Peeling and using the vld1.i64 {d16-d17}, [r1:64]! form should be faster for larger loops. For some reason vld1.i64 ..., [r1:128] gives an illegal instruction trap on my board. Note that the :128 is in bits.
>> I then went back to the vectoriser and changed the alignment of the struct to cause peeling to turn on and off. See: http://people.linaro.org/~michaelh/incoming/unroll.png
>>
>> At 200 words, the version without peeling is 2.9 % faster. This is partly due to a fixed-count loop turning into a runtime count because of the unknown alignment.
>>
>> This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
>
> I see register spills starting from COUNT=36.
Ah. Does the vectoriser cost model take register pressure into account? How can I turn this on?
-- Michael
On 30 November 2011 20:28, Michael Hope michael.hope@linaro.org wrote:
> On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen ira.rosen@linaro.org wrote:
>
>> On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
>
> Peeling and using the vld1.i64 {d16-d17}, [r1:64]! form should be faster for larger loops. For some reason vld1.i64 ..., [r1:128] gives an illegal instruction trap on my board. Note that the :128 is in bits.
Are you sure the address is 128-bit aligned? I think the reason for the failure is the behaviour of memalign. Changing the memaligns at the top from 8 to ALIGN appears to fix the problem - or was that deliberate?
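That is, something along these lines in the harness setup (a sketch; alloc_words is a hypothetical helper, though ALIGN is the macro mentioned above)::

    #include <stddef.h>
    #include <stdint.h>
    #include <malloc.h>

    #define ALIGN 16   /* bytes, i.e. 128 bits */

    /* Sketch: the buffers must really have the alignment that the
       :128 hint promises.  memalign(8, ...) only guarantees 8-byte
       alignment, which traps when used with a :128-hinted vld1. */
    static int32_t *alloc_words(size_t count)
    {
        return memalign(ALIGN, count * sizeof(int32_t));
    }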
Regards, Ramana
On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen ira.rosen@linaro.org wrote:
> On 30 November 2011 02:33, Michael Hope michael.hope@linaro.org wrote:
>
>> I then converted the vld1 and vst1 to specify an alignment of 64 bits. See: http://people.linaro.org/~michaelh/incoming/set-alignment.png
>>
>> This improved the throughput in all cases, and by 14 % for loops of more than 50 words. This graph also shows the overhead of the runtime peeling check: the blue line is the vectoriser version, which is slower to pick up due to the greater per-call overhead.
>
> So, the auto-vectorized code doesn't have the alignment hints (peeling or not peeling), right? Is this what a hint is supposed to look like: vld1.i64 {d16-d17}, [r1:128], or am I looking for the wrong thing?
I had a look in the backend and the vld1/vst1 %A operand adds the alignment if known. It correctly adds [r1:64] if I feed in an array of int64s. The checks are based on the MEM_ALIGN and MEM_SIZE of the operand::

    align = MEM_ALIGN (x) >> 3;
    memsize = INTVAL (MEM_SIZE (x));
Not sure why the backend generates a vldmia instead of a vld1 though.
-- Michael
On 30 November 2011 22:28, Michael Hope michael.hope@linaro.org wrote:
>>> This run also showed the effect of loop unrolling. The loop seems to be unrolled for loops of <= 64 words, and performance drops off past around 8 words. When the unrolling finally drops out, performance increases by 101 %.
>>
>> I see register spills starting from COUNT=36.
>
> Ah. Does the vectoriser cost model take register pressure into account? How can I turn this on?
No, but the vectorizer doesn't perform loop unrolling either. The unrolling here is done by the complete_unroll pass after vectorization, and AFAIK it doesn't take register pressure into account.
On 1 December 2011 02:40, Michael Hope michael.hope@linaro.org wrote:
> I had a look in the backend and the vld1/vst1 %A operand adds the alignment if known. It correctly adds [r1:64] if I feed in an array of int64s. The checks are based on the MEM_ALIGN and MEM_SIZE of the operand: align = MEM_ALIGN (x) >> 3; memsize = INTVAL (MEM_SIZE (x));
>
> Not sure why the backend generates a vldmia instead of a vld1 though.
I don't see how the alignment info set by the vectorizer influences MEM_ALIGN. The vectorizer sets the align and misalign fields of struct ptr_info_def. I see it used in expand_expr_real_1, for MEM_REF, only to decide whether a movmisalign is needed (for unaligned accesses).
MEM_ALIGN is determined in set_mem_attributes_minus_bitpos from DECL_ALIGN or TYPE_ALIGN. For the cases where the vectorizer forces alignment this should work, since we then set DECL_ALIGN (in vect_compute_data_ref_alignment). But peeling obviously doesn't change DECL_ALIGN, so I don't understand how we can create an alignment hint in this case with the current code.
Ira