One of the vectorisation discussions from last year was about the poor code GCC generates for vld{2,3,4}_*() and vst{2,3,4}_*(). It forces the result of the loads onto the stack, then loads the individual pieces from there. It does the same thing in reverse for stores.
I think there are two major problems here:
1. The result of the vld*() is a record type such as:
typedef struct int16x4x3_t { int16x4_t val[3]; } int16x4x3_t;
Ideally, we'd like one of these structures to be stored in a pseudo register. However, the ARM port currently limits in-register record types to 64 bits, so something this big is always given BLKmode and stored on the stack.
A simple "fix" for this is to increase MAX_FIXED_MODE_SIZE. That would do the right thing for the structures in arm_neon.h, but wouldn't be safe in general.
2. The vld*() returns values as a single integer (such as EI mode), while uses of the value will typically be in a vector mode such as V4SI. CANNOT_CHANGE_MODE_CLASS doesn't allow direct "mode-punning" between the two in VFP_REGS, so this again forces the punning to be done on the stack.
The code in question is:
/* FPA registers can't do subreg as all values are reformatted to internal precision. VFP registers may only be accessed in the mode they were set. */ #define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS) \ (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO) \ ? reg_classes_intersect_p (FPA_REGS, (CLASS)) \ || reg_classes_intersect_p (VFP_REGS, (CLASS)) \
However, the VFP restriction appears to be specific to VFPv1 -- thanks to Peter for the archaeology -- and isn't a problem for v6+. In that case, removing this restriction is an important optimisation.
I tried the patch below on the following simple testcase:
#include "arm_neon.h"
void foo (uint16_t *a) { uint16x4x3_t x, y;
x = vld3_u16 (a); y = vld3_u16 (a + 12); x.val[0] = vadd_u16 (x.val[0], y.val[0]); x.val[1] = vadd_u16 (x.val[1], y.val[1]); x.val[2] = vadd_u16 (x.val[2], y.val[2]); vst3_u16 (a, x); }
(not necessarily sensible!). Before the patch, -O2 produced:
sub sp, sp, #48 add r3, r0, #24 vld3.16 {d16-d18}, [r3] vld3.16 {d20-d22}, [r0] add r3, sp, #24 vstmia sp, {d20-d22} vstmia r3, {d16-d18} fldd d19, [sp, #8] fldd d16, [sp, #0] fldd d17, [sp, #24] fldd d20, [sp, #32] vadd.i16 d18, d16, d17 vadd.i16 d17, d19, d20 fldd d19, [sp, #16] fldd d20, [sp, #40] vadd.i16 d16, d19, d20 fstd d18, [sp, #0] fstd d17, [sp, #8] fstd d16, [sp, #16] vldmia sp, {d16-d18} vst3.16 {d16-d18}, [r0] add sp, sp, #48 bx lr
After the patch we get:
vld3.16 {d24-d26}, [r0] add r3, r0, #24 vld3.16 {d20-d22}, [r3] vmov q8, q12 @ ti vadd.i16 d17, d17, d21 vadd.i16 d16, d24, d20 vadd.i16 d18, d26, d22 vst3.16 {d16-d18}, [r0] bx lr
The VMOV is a bit disappointing, and needs further investigation.
The first hunk fixes (2), and I think is correct. The second hunk hacks (1), and isn't suitable in itself. I'll next try to make arm_neon.h use built-in record types that are explicitly EImode, which should remove the need to change MAX_FIXED_MODE_SIZE.
Richard
Index: gcc/gcc/config/arm/arm.h =================================================================== --- gcc.orig/gcc/config/arm/arm.h +++ gcc/gcc/config/arm/arm.h @@ -1171,10 +1171,12 @@ enum reg_class /* FPA registers can't do subreg as all values are reformatted to internal precision. VFP registers may only be accessed in the mode they were set. */ -#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS) \ - (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO) \ - ? reg_classes_intersect_p (FPA_REGS, (CLASS)) \ - || reg_classes_intersect_p (VFP_REGS, (CLASS)) \ 2+#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS) \ + (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO) \ + ? (reg_classes_intersect_p (FPA_REGS, (CLASS)) \ + || (TARGET_VFP \ + && reg_classes_intersect_p (VFP_REGS, (CLASS)) \ + && arm_fpu_desc->rev == 1)) \ : 0)
/* The class value for index registers, and the one for base regs. */ @@ -2458,4 +2460,6 @@ enum arm_builtins instruction. */ #define MAX_LDM_STM_OPS 4
+#define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (XImode) + #endif /* ! GCC_ARM_H */