Improving the code generated for vld and vst intrinsics - linaro-toolchain

21 Feb 2011

One of the vectorisation discussions from last year was about the poor
code GCC generates for vld{2,3,4}_*() and vst{2,3,4}_*().  It forces the
result of the loads onto the stack, then loads the individual pieces from
there.  It does the same thing in reverse for stores.
I think there are two major problems here:
1. The result of the vld*() is a record type such as:
typedef struct int16x4x3_t
    {
      int16x4_t val[3];
    } int16x4x3_t;
Ideally, we'd like one of these structures to be stored in a pseudo
   register.  However, the ARM port currently limits in-register
   record types to 64 bits, so something this big is always given
   BLKmode and stored on the stack.
A simple "fix" for this is to increase MAX_FIXED_MODE_SIZE.
   That would do the right thing for the structures in arm_neon.h,
   but wouldn't be safe in general.
2. The vld*() returns values as a single integer (such as EI mode),
   while uses of the value will typically be in a vector mode such
   as V4SI.  CANNOT_CHANGE_MODE_CLASS doesn't allow direct
   "mode-punning" between the two in VFP_REGS, so this again
   forces the punning to be done on the stack.
The code in question is:
/* FPA registers can't do subreg as all values are reformatted to internal
       precision.  VFP registers may only be accessed in the mode they
       were set.  */
    #define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)	\
      (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)		\
       ? reg_classes_intersect_p (FPA_REGS, (CLASS))	\
         || reg_classes_intersect_p (VFP_REGS, (CLASS))	\
However, the VFP restriction appears to be specific to VFPv1 --
   thanks to Peter for the archaeology -- and isn't a problem for v6+.
   In that case, removing this restriction is an important optimisation.
I tried the patch below on the following simple testcase:
#include "arm_neon.h"
void
    foo (uint16_t *a)
    {
      uint16x4x3_t x, y;
x = vld3_u16 (a);
      y = vld3_u16 (a + 12);
      x.val[0] = vadd_u16 (x.val[0], y.val[0]);
      x.val[1] = vadd_u16 (x.val[1], y.val[1]);
      x.val[2] = vadd_u16 (x.val[2], y.val[2]);
      vst3_u16 (a, x);
    }
(not necessarily sensible!).  Before the patch, -O2 produced:
sub     sp, sp, #48
        add     r3, r0, #24
        vld3.16 {d16-d18}, [r3]
        vld3.16 {d20-d22}, [r0]
        add     r3, sp, #24
        vstmia  sp, {d20-d22}
        vstmia  r3, {d16-d18}
        fldd    d19, [sp, #8]
        fldd    d16, [sp, #0]
        fldd    d17, [sp, #24]
        fldd    d20, [sp, #32]
        vadd.i16        d18, d16, d17
        vadd.i16        d17, d19, d20
        fldd    d19, [sp, #16]
        fldd    d20, [sp, #40]
        vadd.i16        d16, d19, d20
        fstd    d18, [sp, #0]
        fstd    d17, [sp, #8]
        fstd    d16, [sp, #16]
        vldmia  sp, {d16-d18}
        vst3.16 {d16-d18}, [r0]
        add     sp, sp, #48
        bx      lr
After the patch we get:
vld3.16 {d24-d26}, [r0]
        add     r3, r0, #24
        vld3.16 {d20-d22}, [r3]
        vmov    q8, q12  @ ti
        vadd.i16        d17, d17, d21
        vadd.i16        d16, d24, d20
        vadd.i16        d18, d26, d22
        vst3.16 {d16-d18}, [r0]
        bx      lr
The VMOV is a bit disappointing, and needs further investigation.
The first hunk fixes (2), and I think is correct.  The second hunk
hacks (1), and isn't suitable in itself.  I'll next try to make
arm_neon.h use built-in record types that are explicitly EImode,
which should remove the need to change MAX_FIXED_MODE_SIZE.
Richard
Index: gcc/gcc/config/arm/arm.h
===================================================================

--- gcc.orig/gcc/config/arm/arm.h
+++ gcc/gcc/config/arm/arm.h
@@ -1171,10 +1171,12 @@ enum reg_class
 /* FPA registers can't do subreg as all values are reformatted to internal
    precision.  VFP registers may only be accessed in the mode they
    were set.  */
-#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)	\
-  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)		\
-   ? reg_classes_intersect_p (FPA_REGS, (CLASS))	\
-     || reg_classes_intersect_p (VFP_REGS, (CLASS))	\
2+#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS)		\
+  (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO)			\
+   ? (reg_classes_intersect_p (FPA_REGS, (CLASS))		\
+      || (TARGET_VFP						\
+	  && reg_classes_intersect_p (VFP_REGS, (CLASS))	\
+	  && arm_fpu_desc->rev == 1))				\
    : 0)
/* The class value for index registers, and the one for base regs.  */
@@ -2458,4 +2460,6 @@ enum arm_builtins
    instruction.  */
 #define MAX_LDM_STM_OPS 4
+#define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (XImode)
+
 #endif /* ! GCC_ARM_H */