Last week, Ramana pointed me at an upstream bug report about the inefficient code that GCC generates for vzip, vuzp and vtrn:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48941
It was filed not longer after the Neon seminar at the summit; I'm not sure whether that was a coincidence or not.
I attached a patch to the bug last week and will test it this week. However, a cut-down version shows up another problem that isn't related specifically to intrinsics. Given:
#include <arm_neon.h>
void foo (float32x4x2_t *__restrict dst, float32x4_t *__restrict src, int n) { while (n--) { dst[0] = vzipq_f32 (src[0], src[1]); dst[1] = vzipq_f32 (src[2], src[3]); dst += 2; src += 4; } }
GCC produces:
cmp r2, #0 bxeq lr .L3: vldmia r1, {d16-d17} vldr d18, [r1, #16] vldr d19, [r1, #24] vldr d20, [r1, #32] vldr d21, [r1, #40] vldr d22, [r1, #48] vldr d23, [r1, #56] add r3, r0, #32 vzip.32 q8, q9 vzip.32 q10, q11 subs r2, r2, #1 vstmia r0, {d16-d19} add r1, r1, #64 vstmia r3, {d20-d23} add r0, r0, #64 bne .L3 bx lr
We're missing many auto-increment opportunities here. I think this is due to the limitations of GCC's auto-inc-dec pass rather than to a problem in the ARM port itself. I think there are two main areas for improvement:
- The pass only tries to use auto-incs in cases where there is a separate addition and memory access. It doesn't try to handle cases where there are two consecutive memory accesses of the form *base and *(base + size), even if the address costs make it clear that post-increments would be a win.
- The pass uses a backward scan rather than a forward scan, which makes it harder to spot chains of more than two accesses.
FWIW, I've got fairly specific ideas about how to do this. Unfortunately, the pass is in need of some TLC before it's easy to make changes. So in terms of work items, how about:
1. Clean up the auto-inc pass so that it's easier to modify 2. Investigate improvements to the pass 3. Submit the changes upstream 4. Backport the changes to the Linaro branches
I wrote some patches for (1) last week.
I'd estimate it's about 2 weeks' work for (1) and (2). (3) and (4) would hopefully be background tasks. The aim would be for something like:
.L3: vldmia r1!, {d16-d17} vldmia r1!, {d18-d19} vldmia r1!, {d20-d21} vldmia r1!, {d22-d23} vzip.32 q8, q9 vzip.32 q10, q11 subs r2, r2, #1 vstmia r0!, {d16-d19} vstmia r0!, {d20-d23} bne .L3 bx lr
This should help with auto-vectorised code, as well as normal core code.
(Combining the vldmias and vstmias is a different topic. The fact that this particular example could be implemented using one load and one store is to some extent coincidental.)
Richard