I meant to send this to the "external" Linaro toolchain mailing list, not the internal CS one. Apologies to those who receive it twice!
In a follow-up message, Joseph Myers pointed out a post he'd written previously on the same subject:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
In further followups (at the risk of misrepresenting Joseph & Paul Brook's opinions!), there seemed to be general agreement that something like the scheme outlined below, with "permuting" loads/stores and some way of handling multiple in-register layouts for vectors, will be a necessary addition to the vectorizer going forward.
Julian
Begin forwarded message:
Date: Thu, 7 Oct 2010 16:45:17 +0100
From: Julian Brown <julian@codesourcery.com>
To: Ira Rosen <IRAR@il.ibm.com>
Cc: Tejas Belagod <Tejas.Belagod@arm.com>,
    Linaro List <gnu-linaro-tools@codesourcery.com>
Subject: [gnu-linaro-tools] NEON vectorization: use of specialized load/store instructions
Hi,
We're having some system issues, so I thought I'd take the chance to write down some things I've been thinking about re: utilising the NEON load/store instructions more effectively. I've also attempted to summarize the problems with big-endian mode. All unverified as yet, so please take it with a pinch of salt :-). Comments appreciated. It's been a while since I last thought about some of this stuff...
Cheers,
Julian
Use of specialized load instructions
====================================
To provide good support for NEON's element and structure load/store instructions, GCC needs a couple of key features which it currently lacks:
1. A good way of representing a set of two, three or four vector registers (either D- or Q-sized), possibly with non-unit stride.
2. A generalised mapping between memory locations and lane numbers.
To start with point 1: currently the element and structure load/store instructions are only supported via intrinsics. These are specified to load and store as if going via an array embedded in a union, i.e.:
  typedef struct int8x8x2_t
  {
    int8x8_t val[2];
  } int8x8x2_t;
  __extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
  vld2_s8 (const int8_t * __a)
  {
    union { int8x8x2_t __i; __builtin_neon_ti __o; } __rv;
    __rv.__o = __builtin_neon_vld2v8qi ((const __builtin_neon_qi *) __a);
    return __rv.__i;
  }
Even for a trivial test program, e.g.:
#include <arm_neon.h>
  int foo (int8_t *x)
  {
    int8x8x2_t result = vld2_s8 (x);
    return vget_lane_s8 (result.val[0], 1);
  }
We will generate code like so:
        sub     sp, sp, #32
        vld2.8  {d16-d17}, [r0]
        mov     r3, sp
        vstmia  sp, {d16-d17}
        add     ip, sp, #16
        ldmia   r3, {r0, r1, r2, r3}
        stmia   ip, {r0, r1, r2, r3}
        fldd    d16, [sp, #16]
        vmov.s8 r0, d16[1]
        add     sp, sp, #32
        bx      lr
I.e., rather than being used directly, the registers loaded by vld2 will always be spilled to the stack then reloaded. This obviously reduces the usefulness of these intrinsics by a large factor. With some planning, it'd be good to find a powerful enough solution to this problem so that the same representation for multiple registers can be used by the autovectorizer as well as the intrinsic-handling code.
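(For comparison, the code one would hope for here -- hand-written and untested, modulo register choice -- is just the load and the lane extraction, with no stack traffic at all:

        vld2.8  {d16-d17}, [r0]
        vmov.s8 r0, d16[1]
        bx      lr

)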
(One difficulty is that the "foo.val[X]" interface should still be available to user code. There's probably no need for "val" to literally be an array, though other representations would require front-end changes).
Assuming it's hard for the register allocator to deal with highly-constrained situations like requiring four consecutive registers, one (ugly) possibility might be to run a pass before register allocation, looking for "big" multi-register vectors and pre-allocating them to hard registers. Even using a fixed allocation of a single set of registers (e.g. make it so that all multi-reg loads/stores larger than a Q register must use d0-d7, or whatever) would probably give better code than what we produce at present, in most cases.
Now, point 2. To start with, an aside: AIUI, there is currently an assumption in the vectoriser code that increasing element numbers in vector registers correspond to increasing addresses when those registers are loaded from and stored to memory (as if the vector was a short array, or alternatively as if a union of the vector register and an array of element-types had the same numberings for lanes and array indices corresponding to the same elements). Unfortunately that is only true for NEON in little-endian mode: in big-endian mode, the story is more complicated, for reasons I will try to explain.
To remain compliant with the soft-float variant of the ARM EABI, we must pass vector arguments in ARM core registers (or on the stack), not in vector registers. This means that we must be very careful about the ordering of elements for values passed to functions. Consider the trivial function:
  int __attribute__((noinline))
  qux (int16x8_t x)
  {
    x = vaddq_s16 (x, x);
    return vgetq_lane_s16 (x, 1);
  }
This is compiled by GCC to the following (slightly unimpressively):
        vmov    d18, r1, r0  @ v8hi
        vmov    d19, r3, r2
        vmov    d20, r1, r0  @ v8hi
        vmov    d21, r3, r2
        vadd.i16 q8, q9, q10
        vmov.s16 r0, d16[1]
        bx      lr
This might then be called like so:
        ldmia   sp, {r0-r3}
        blx     qux
So: notice that we're careful that when vector values are transferred between NEON registers and core registers, the same result is transferred to/from memory whether we use ldm/stm (core registers) or vldm/vstm (vector registers) -- i.e. we might use "vldm rX, {d18-d19}", which loads d18 and d19 from consecutive increasing addresses, or "ldmia rX, {r0-r3}", again filling consecutive registers from increasing memory locations, and we get the same outcome. The fact that we can use the multiple-register loads/stores is also important for spilling/reloading between vector and core registers, which inevitably happens occasionally.
Notice also that when we call the above function like so:
  typedef union {
    int16x8_t quadvec;
    int16_t half[8];
  } u;

  int foo (int8_t *x)
  {
    u bar;
    int i;

    for (i = 0; i < 8; i++)
      bar.half[i] = i;

    qux (bar.quadvec);
  }
The value returned from "qux" is NOT 2 (1+1), as it would be if we were accessing the value at index 1 in the superimposed array in the union "u". The vgetq_lane_s16 call still interprets the array as if it had been loaded in little-endian element order. But we don't get the result we would have if the vector had been interpreted in purely big-endian order either (i.e. 12, 6+6)! In fact from the perspective of the element numbering used by vgetq_lane_s16, the vector elements we see for each of the (equal) operands of the "vadd" instruction in the qux function are:
               equiv. core register
  lane number  (at function entry)    value
  -----------  --------------------   -----
  [0]          high part of r1          3
  [1]          low part of r1           2
  [2]          high part of r0          1
  [3]          low part of r0           0
  [4]          high part of r3          7
  [5]          low part of r3           6
  [6]          high part of r2          5
  [7]          low part of r2           4
So the value returned will be 2+2 = 4.
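(To make the table concrete, here is a small host-runnable C model of the path described above -- ldmia into core registers, vmov into D registers, NEON lane extraction. It is derived purely from the text rather than run on big-endian hardware, but it reproduces both the lane values in the table and the final result of 4:

  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    uint16_t mem[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };  /* bar.half[] */
    uint32_t r[4];
    uint64_t d18, d19;
    uint16_t lane[8];
    int i;

    /* ldmia: r0-r3 from increasing addresses; big-endian words, so the
       lower-addressed halfword lands in the high bits of each register.  */
    for (i = 0; i < 4; i++)
      r[i] = ((uint32_t) mem[2 * i] << 16) | mem[2 * i + 1];

    /* vmov d18, r1, r0 / vmov d19, r3, r2: the first source register
       goes to the low word of the D register.  */
    d18 = ((uint64_t) r[0] << 32) | r[1];
    d19 = ((uint64_t) r[2] << 32) | r[3];

    /* NEON lane i of a 16-bit vector is bits [16i+15:16i]; lanes 0-3
       live in the even D register, lanes 4-7 in the odd one.  */
    for (i = 0; i < 4; i++)
      {
        lane[i]     = (uint16_t) (d18 >> (16 * i));
        lane[i + 4] = (uint16_t) (d19 >> (16 * i));
      }

    for (i = 0; i < 8; i++)
      printf ("lane [%d] = %d\n", i, lane[i]);    /* 3 2 1 0 7 6 5 4 */

    printf ("result = %d\n", lane[1] + lane[1]);  /* vadd, then lane 1: 4 */
    return 0;
  }

)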
Now, coming back to the vectorizer. Current practice means that increasing element numbers should correspond to increasing memory locations: i.e., that "array ordering" is in effect, just as in the call to vgetq_lane_s16 in the above example. This leads to an anomaly: it means that when the vectorizer asks for a particular element, it will generally get a different one. Most of the time we get away with this, since the vectorizer mostly deals with "opaque" vectors which are operated on element-wise: i.e. we only deal with data at the granularity of whole vectors, so it doesn't matter which order the elements are in. The ARM implementations of reduction operations fortuitously calculate the results across all elements simultaneously, so when one of those elements is extracted, we still get the right answer.
One notable exception to this though is the movmisalign<mode> patterns: these are implemented using the vld1 and vst1 instructions, which load elements in "array" order (increasing elements from increasing memory locations), even in big-endian mode. Since vectors loaded using those instructions are "incompatible" with the above scheme, such misaligned accesses are simply disabled in big-endian mode.
Of course, generally, sticking with the current non-solution in big-endian mode is not sustainable (and is probably already broken in various cases). So it might be worth thinking about whether supporting big-endian mode properly, as well as handling the more complex load and store element/structure instructions, can be done using some generalised solution.
I'm thinking (without having much idea about how feasible such an idea is) of something along the lines of a function (in the mathematical sense) attached to each vector value manipulated by the vectorizer, to map that value's element numberings to and from memory offsets. So then the quad-word vector of 16-bit elements discussed above would look like, in big-endian mode:
foo, {6, 4, 2, 0, 14, 12, 10, 8}
Whereas in little-endian mode (or in big-endian mode, for vectors loaded using vld1), it would look like:
foo, {0, 2, 4, 6, 8, 10, 12, 14}
And then, perhaps more interestingly, a vector loaded using e.g. a "multiple 3-element structures" load,
vld3.16 {d1, d2, d3}, [rN]
might look like this (in either endianness, assuming we can represent a vector of such size in our hypothetical scheme):
foo, {0, 6, 12, 18, 2, 8, 14, 20, 4, 10, 16, 22}
Though it's not clear that such a scheme would be powerful enough to represent the whole range of element/structure loads/stores available (you'd probably need to be able to specify skipped or don't-care elements to do that, at least).
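(Purely to make the idea concrete, here is one possible C encoding of such a per-value map. All names here are made up for illustration; nothing like this exists in GCC:

  #include <stdint.h>

  /* A vector value tagged with, for each lane, the byte offset within
     the original memory access that the lane was loaded from; -1 would
     mark a skipped or don't-care lane, per the caveat above.  */
  #define LANE_DONT_CARE (-1)

  struct mapped_vec
  {
    int nlanes;
    int16_t offset[16];          /* offset[i] = memory offset of lane i */
  };

  /* The quad-word 16-bit examples from the text: */
  static const struct mapped_vec be_vldm_v8hi =
    { 8, { 6, 4, 2, 0, 14, 12, 10, 8 } };
  static const struct mapped_vec le_or_vld1_v8hi =
    { 8, { 0, 2, 4, 6, 8, 10, 12, 14 } };

  /* Two values could be combined element-wise only if their maps agree
     (or one is first permuted to match the other).  */
  static int
  maps_compatible (const struct mapped_vec *a, const struct mapped_vec *b)
  {
    int i;
    if (a->nlanes != b->nlanes)
      return 0;
    for (i = 0; i < a->nlanes; i++)
      if (a->offset[i] != b->offset[i])
        return 0;
    return 1;
  }

)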
Julian Brown julian@codesourcery.com wrote on 11/10/2010 04:29:15 PM:
> In further followups (at the risk of misrepresenting Joseph & Paul
> Brook's opinions!), there seemed to be general agreement that something
> like the scheme outlined below, with "permuting" loads/stores and some
> way of handling multiple in-register layouts for vectors, will be a
> necessary addition to the vectorizer going forward.
Hi,
Let me check that I understand the problem first: the problem is that VLD1 and VST1 instructions in big-endian mode follow the array numbering of elements, while all other memory instructions (VLDR, VLDM, VSTR, VSTM) do not. So, do we have two problems here? The first is that VLD1/VST1 and VLDR etc. can't be mixed in one computation; the second is that access to a single element is incorrect when VLDR etc. are used. Is that correct?

In addition, we need to think about how to represent VLD2/3, so the vectorizer can use them. Right?
> I'm thinking (without having much idea about how feasible such an idea
> is) of something along the lines of a function (in the mathematical
> sense) attached to each vector value manipulated by the vectorizer, to
> map that value's element numberings to and from memory offsets.
Joseph Myers joseph@codesourcery.com wrote on 08/10/2010 02:54:29 AM:
> Make it possible to describe in generic RTL a permuting vector load
> whose alignment requirement is element alignment, describe vld1 that
> way, and teach the vectorizer how to use such loads and stores.
Does that mean that the vectorizer will be aware of specific instructions?
I can see several places where the order of elements is important in the vectorizer's code generation:

- interleave_high/low and widening operations - but I am not sure that
  the current implementation suits NEON best, so maybe those are less
  important

- extraction of the scalar result in reduction

> The ARM implementations of reduction operations fortuitously calculate
> the results across all elements simultaneously, so when one of those
> elements is extracted, we still get the right answer.

  So, does that mean that's not a problem?

- various scalar/invariant vectors, including initializations for
  reduction and induction

- the order of elements in loads and stores should match
Thanks, Ira
On Thu, 14 Oct 2010, Ira Rosen wrote:
> Let me check that I understand the problem first: the problem is that
> VLD1 and VST1 instructions in big-endian mode follow the array numbering
> of elements, while all other memory instructions (VLDR, VLDM, VSTR,
> VSTM) do not. So, do we have two problems here? The first is that
> VLD1/VST1 and VLDR etc. can't be mixed in one computation; the second is
> that access to a single element is incorrect when VLDR etc. are used.
> Is that correct?
In terms of the native lane numbering used in NEON instructions, VLD1 and VST1 respect array ordering and are the instructions that can be used with single-element accesses, while the other instructions do not respect the ordering and cannot be so used without adjusting the element numbers.
In terms of the architecture-independent RTL semantics, VLDR, VLDM, VSTR and VSTM respect array ordering and can be used with single-element accesses, while VLD1 and VST1 do not respect the ordering and cannot be so used without adjusting element numbers.
The VLDR etc. order is the one required to be used for argument passing and return of vectors, and is the only one readily available when vectors are loaded/stored using core registers rather than NEON registers.
Thus, when generic RTL is generated from a NEON intrinsic (defined using native lane numbering) in big-endian mode, the lane number is adjusted to make the generic RTL correct, and when assembly code is generated from generic RTL the reverse adjustment is made.
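(For the v8hi example in the forwarded message, this adjustment works out to reversing the lane index within each 64-bit D register. A sketch of that rule, derived from the table in the forwarded message rather than from GCC's actual sources:

  /* Big-endian lane adjustment implied by the earlier table: lanes are
     reversed within each 64-bit D register.  Illustrative only.  */
  static inline unsigned
  adjust_lane_for_big_endian (unsigned lane, unsigned elem_bits)
  {
    unsigned lanes_per_dreg = 64 / elem_bits;  /* 4 for 16-bit elements */
    return lane ^ (lanes_per_dreg - 1);
  }
  /* 16-bit: 0<->3, 1<->2, 4<->7, 5<->6 -- matching the table.  */

)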
> In addition, we need to think about how to represent VLD2/3, so the
> vectorizer can use them. Right?
Yes. (I think code using arrays of red/green/blue values is the sort of real-world (and benchmark) code expected to be vectorized using VLD3.)
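(An illustrative example, not taken from any particular benchmark -- the kind of scalar loop over interleaved RGB data that VLD3/VST3 would map onto:

  #include <stdint.h>

  void
  brighten (uint8_t *rgb, int n, uint8_t amount)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        rgb[3 * i + 0] += amount;   /* red   */
        rgb[3 * i + 1] += amount;   /* green */
        rgb[3 * i + 2] += amount;   /* blue  */
      }
  }

Each iteration touches three adjacent elements with stride 3, which is exactly the access pattern VLD3/VST3 implement.)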
> Joseph Myers joseph@codesourcery.com wrote on 08/10/2010 02:54:29 AM:
>
> > Make it possible to describe in generic RTL a permuting vector load
> > whose alignment requirement is element alignment, describe vld1 that
> > way, and teach the vectorizer how to use such loads and stores.
>
> Does that mean that the vectorizer will be aware of specific
> instructions?
I would imagine that it would need to know what permutations are available, yes (GIMPLE and RTL would have some form of general permuting load/store operation, which the vectorizer would only generate where relevant instructions exist for the chosen permutation).
Joseph Myers joseph@codesourcery.com wrote on 14/10/2010 05:18:37 PM:

> > Does that mean that the vectorizer will be aware of specific
> > instructions?
>
> I would imagine that it would need to know what permutations are
> available, yes (GIMPLE and RTL would have some form of general permuting
> load/store operation, which the vectorizer would only generate where
> relevant instructions exist for the chosen permutation).
So, there will be a new tree code, e.g. PERM_LOAD_EXPR, and the vectorizer will use it for misaligned loads in big endian (or maybe for little endian as well), and for strided loads. The vectorizer will check whether the instruction is supported, giving the desired stride (1, 2, 3) as input, and will receive a mask. It will use the mask to permute all other relevant vectors (like vectors of constants) if necessary, making all the generic GIMPLE and RTL correct. And later, when assembly code is generated, everything should be permuted again?
Thanks, Ira
On Mon, 18 Oct 2010, Ira Rosen wrote:
> > > Does that mean that the vectorizer will be aware of specific
> > > instructions?
> >
> > I would imagine that it would need to know what permutations are
> > available, yes (GIMPLE and RTL would have some form of general
> > permuting load/store operation, which the vectorizer would only
> > generate where relevant instructions exist for the chosen permutation).
>
> So, there will be a new tree code, e.g. PERM_LOAD_EXPR, and the
> vectorizer will use it for misaligned loads in big endian (or maybe for
> little endian as well), and for strided loads. The vectorizer will check
> whether the instruction is supported, giving the desired stride (1, 2,
> 3) as input, and will receive a mask. It will use the mask to permute
> all other relevant vectors (like vectors of constants) if necessary,
> making all the generic GIMPLE and RTL correct. And later, when assembly
> code is generated, everything should be permuted again?
Yes, something like that.
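(A sketch of those semantics as a C model -- PERM_LOAD_EXPR is Ira's proposed name, not an existing tree code, and the masks below are invented for illustration:

  #include <stdint.h>

  /* Model of a permuting load: the target reports, for a given stride,
     a mask such that out[i] = base[mask[i]].  All generic-level code
     then reasons in terms of the mask.  */
  static void
  perm_load_v8hi (const int16_t *base, const uint8_t mask[8],
                  int16_t out[8])
  {
    int i;
    for (i = 0; i < 8; i++)
      out[i] = base[mask[i]];
  }

  /* E.g. for stride 2, a target with VLD2-like instructions might report
     these masks for the two loaded vectors (even and odd elements): */
  static const uint8_t stride2_even[8] = { 0, 2, 4, 6, 8, 10, 12, 14 };
  static const uint8_t stride2_odd[8]  = { 1, 3, 5, 7, 9, 11, 13, 15 };

Invariant vectors, such as vectors of constants, would then be permuted with the same mask so that lane numbers stay consistent throughout the computation.)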