Last week, Ramana pointed me at an upstream bug report about the inefficient code that GCC generates for vzip, vuzp and vtrn:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48941
It was filed not long after the Neon seminar at the summit; I'm not sure whether that was a coincidence or not.
I attached a patch to the bug last week and will test it this week. However, a cut-down version shows up another problem that isn't related specifically to intrinsics. Given:
#include <arm_neon.h>
void foo (float32x4x2_t *__restrict dst, float32x4_t *__restrict src, int n)
{
  while (n--)
    {
      dst[0] = vzipq_f32 (src[0], src[1]);
      dst[1] = vzipq_f32 (src[2], src[3]);
      dst += 2;
      src += 4;
    }
}
GCC produces:
        cmp     r2, #0
        bxeq    lr
.L3:
        vldmia  r1, {d16-d17}
        vldr    d18, [r1, #16]
        vldr    d19, [r1, #24]
        vldr    d20, [r1, #32]
        vldr    d21, [r1, #40]
        vldr    d22, [r1, #48]
        vldr    d23, [r1, #56]
        add     r3, r0, #32
        vzip.32 q8, q9
        vzip.32 q10, q11
        subs    r2, r2, #1
        vstmia  r0, {d16-d19}
        add     r1, r1, #64
        vstmia  r3, {d20-d23}
        add     r0, r0, #64
        bne     .L3
        bx      lr
We're missing many auto-increment opportunities here. I think this is due to limitations in GCC's auto-inc-dec pass rather than to a problem in the ARM port itself. There are two main areas for improvement:
- The pass only tries to use auto-incs in cases where there is a separate addition and memory access. It doesn't try to handle cases where there are two consecutive memory accesses of the form *base and *(base + size), even if the address costs make it clear that post-increments would be a win.
- The pass uses a backward scan rather than a forward scan, which makes it harder to spot chains of more than two accesses.
FWIW, I've got fairly specific ideas about how to do this. Unfortunately, the pass is in need of some TLC before it's easy to make changes. So in terms of work items, how about:
1. Clean up the auto-inc pass so that it's easier to modify.
2. Investigate improvements to the pass.
3. Submit the changes upstream.
4. Backport the changes to the Linaro branches.
I wrote some patches for (1) last week.
I'd estimate it's about 2 weeks' work for (1) and (2). (3) and (4) would hopefully be background tasks. The aim would be for something like:
.L3:
        vldmia  r1!, {d16-d17}
        vldmia  r1!, {d18-d19}
        vldmia  r1!, {d20-d21}
        vldmia  r1!, {d22-d23}
        vzip.32 q8, q9
        vzip.32 q10, q11
        subs    r2, r2, #1
        vstmia  r0!, {d16-d19}
        vstmia  r0!, {d20-d23}
        bne     .L3
        bx      lr
This should help with auto-vectorised code as well as normal core code.
(Combining the vldmias and vstmias is a different topic. The fact that this particular example could be implemented using one load and one store is to some extent coincidental.)
Richard
linaro-toolchain@lists.linaro.org