I meant to send this to the "external" Linaro toolchain mailing list, not the internal CS one. Apologies to those who receive it twice!
In a follow-up message, Joseph Myers pointed out a post he'd written previously on the same subject:
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html
In further followups (at the risk of misrepresenting Joseph & Paul Brook's opinions!), there seemed to be general agreement that something like the scheme outlined below, with "permuting" loads/stores and some way of handling multiple in-register layouts for vectors, will be a necessary addition to the vectorizer going forward.
Julian
Begin forwarded message:
Date: Thu, 7 Oct 2010 16:45:17 +0100
From: Julian Brown <julian@codesourcery.com>
To: Ira Rosen <IRAR@il.ibm.com>
Cc: Tejas Belagod <Tejas.Belagod@arm.com>,
    Linaro List <gnu-linaro-tools@codesourcery.com>
Subject: [gnu-linaro-tools] NEON vectorization: use of specialized load/store instructions
Hi,
We're having some system issues, so I thought I'd take the chance to write down some things I've been thinking about re: utilising the NEON load/store instructions more effectively. I've also attempted to summarize the problems with big-endian mode. All unverified as yet, so please take it with a pinch of salt :-). Comments appreciated. It's been a while since I last thought about some of this stuff...
Cheers,
Julian
Use of specialized load instructions
====================================
To provide good support for NEON's element and structure load/store instructions, GCC needs a couple of key features which it currently lacks:
1. A good way of representing a set of two, three or four vector registers (either D- or Q-sized), possibly with non-unit stride.
2. A generalised mapping between memory locations and lane numbers.
To start with point 1: currently the element and structure load/store instructions are only supported via intrinsics. These are specified to load and store as if going via an array embedded in a union, i.e.:
  typedef struct int8x8x2_t
  {
    int8x8_t val[2];
  } int8x8x2_t;
  __extension__ static __inline int8x8x2_t __attribute__ ((__always_inline__))
  vld2_s8 (const int8_t * __a)
  {
    union { int8x8x2_t __i; __builtin_neon_ti __o; } __rv;
    __rv.__o = __builtin_neon_vld2v8qi ((const __builtin_neon_qi *) __a);
    return __rv.__i;
  }
Even for a trivial test program, e.g.:
#include <arm_neon.h>
  int foo (int8_t *x)
  {
    int8x8x2_t result = vld2_s8 (x);
    return vget_lane_s8 (result.val[0], 1);
  }
We will generate code like so:
        sub     sp, sp, #32
        vld2.8  {d16-d17}, [r0]
        mov     r3, sp
        vstmia  sp, {d16-d17}
        add     ip, sp, #16
        ldmia   r3, {r0, r1, r2, r3}
        stmia   ip, {r0, r1, r2, r3}
        fldd    d16, [sp, #16]
        vmov.s8 r0, d16[1]
        add     sp, sp, #32
        bx      lr
I.e., rather than being used directly, the registers loaded by vld2 will always be spilled to the stack then reloaded. This obviously reduces the usefulness of these intrinsics by a large factor. With some planning, it'd be good to find a powerful enough solution to this problem so that the same representation for multiple registers can be used by the autovectorizer as well as the intrinsic-handling code.
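(For comparison, the code one would hope for here -- hand-written and untested, modulo register choice -- is just the load and the lane extraction, with no stack traffic at all:

        vld2.8  {d16-d17}, [r0]
        vmov.s8 r0, d16[1]
        bx      lr

)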
(One difficulty is that the "foo.val[X]" interface should still be available to user code. There's probably no need for "val" to literally be an array, though other representations would require front-end changes).
Assuming it's hard for the register allocator to deal with highly-constrained situations like requiring four consecutive registers, one (ugly) possibility might be to run a pass before register allocation, looking for "big" multi-register vectors and pre-allocating them to hard registers. Even using a fixed allocation of a single set of registers (e.g. make it so that all multi-reg loads/stores larger than a Q register must use d0-d7, or whatever) would probably give better code than what we produce at present, in most cases.
Now, point 2. To start with, an aside: AIUI, there is currently an assumption in the vectoriser code that increasing element numbers in vector registers correspond to increasing addresses when those registers are loaded from and stored to memory (as if the vector was a short array, or alternatively as if a union of the vector register and an array of element-types had the same numberings for lanes and array indices corresponding to the same elements). Unfortunately that is only true for NEON in little-endian mode: in big-endian mode, the story is more complicated, for reasons I will try to explain.
To remain compliant with the soft-float variant of the ARM EABI, we must pass vector arguments in ARM core registers (or on the stack), not in vector registers. This means that we must be very careful about the ordering of elements for values passed to functions. Consider the trivial function:
  int __attribute__((noinline))
  qux (int16x8_t x)
  {
    x = vaddq_s16 (x, x);
    return vgetq_lane_s16 (x, 1);
  }
This is compiled by GCC to the following (slightly unimpressively):
        vmov    d18, r1, r0  @ v8hi
        vmov    d19, r3, r2
        vmov    d20, r1, r0  @ v8hi
        vmov    d21, r3, r2
        vadd.i16 q8, q9, q10
        vmov.s16 r0, d16[1]
        bx      lr
This might then be called like so:
        ldmia   sp, {r0-r3}
        blx     qux
So: notice that we're careful that when vector values are transferred between NEON registers and core registers, the same result is transferred to/from memory whether we use ldm/stm (core registers) or vldm/vstm (vector registers) -- i.e. we might use "vldm rX, {d18-d19}", which loads d18 and d19 from consecutive increasing addresses, or "ldmia rX, {r0-r3}", again filling consecutive registers from increasing memory locations, and we get the same outcome. The fact that we can use the multiple-register loads/stores is also important for spilling/reloading between vector and core registers, which inevitably happens occasionally.
Notice also that when we call the above function like so:
  typedef union {
    int16x8_t quadvec;
    int16_t half[8];
  } u;

  int foo (int8_t *x)
  {
    u bar;
    int i;

    for (i = 0; i < 8; i++)
      bar.half[i] = i;

    qux (bar.quadvec);
  }
The value returned from "qux" is NOT 2 (1+1), as it would be if we were accessing the value at index 1 in the superimposed array in the union "u". The vgetq_lane_s16 call still interprets the array as if it had been loaded in little-endian element order. But we don't get the result we would have if the vector had been interpreted in purely big-endian order either (i.e. 12, 6+6)! In fact from the perspective of the element numbering used by vgetq_lane_s16, the vector elements we see for each of the (equal) operands of the "vadd" instruction in the qux function are:
               equiv. core register
  lane number  (at function entry)    value
  -----------  --------------------   -----
  [0]          high part of r1          3
  [1]          low part of r1           2
  [2]          high part of r0          1
  [3]          low part of r0           0
  [4]          high part of r3          7
  [5]          low part of r3           6
  [6]          high part of r2          5
  [7]          low part of r2           4
So the value returned will be 2+2 = 4.
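(To make the table concrete, here is a small host-runnable C model of the path described above -- ldmia into core registers, vmov into D registers, NEON lane extraction. It is derived purely from the text rather than run on big-endian hardware, but it reproduces both the lane values in the table and the final result of 4:

  #include <stdint.h>
  #include <stdio.h>

  int main (void)
  {
    uint16_t mem[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };  /* bar.half[] */
    uint32_t r[4];
    uint64_t d18, d19;
    uint16_t lane[8];
    int i;

    /* ldmia: r0-r3 from increasing addresses; big-endian words, so the
       lower-addressed halfword lands in the high bits of each register.  */
    for (i = 0; i < 4; i++)
      r[i] = ((uint32_t) mem[2 * i] << 16) | mem[2 * i + 1];

    /* vmov d18, r1, r0 / vmov d19, r3, r2: the first source register
       goes to the low word of the D register.  */
    d18 = ((uint64_t) r[0] << 32) | r[1];
    d19 = ((uint64_t) r[2] << 32) | r[3];

    /* NEON lane i of a 16-bit vector is bits [16i+15:16i]; lanes 0-3
       live in the even D register, lanes 4-7 in the odd one.  */
    for (i = 0; i < 4; i++)
      {
        lane[i]     = (uint16_t) (d18 >> (16 * i));
        lane[i + 4] = (uint16_t) (d19 >> (16 * i));
      }

    for (i = 0; i < 8; i++)
      printf ("lane [%d] = %d\n", i, lane[i]);    /* 3 2 1 0 7 6 5 4 */

    printf ("result = %d\n", lane[1] + lane[1]);  /* vadd, then lane 1: 4 */
    return 0;
  }

)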
Now, coming back to the vectorizer. Current practice means that increasing element numbers should correspond to increasing memory locations: i.e., that "array ordering" is in effect, just as in the call to vgetq_lane_s16 in the above example. This leads to an anomaly: it means that when the vectorizer asks for a particular element, it will generally get a different one. Most of the time we get away with this, since the vectorizer mostly deals with "opaque" vectors which are operated on element-wise: i.e. we only deal with data at the granularity of whole vectors, so it doesn't matter which order the elements are in. The ARM implementations of reduction operations fortuitously calculate the results across all elements simultaneously, so when one of those elements is extracted, we still get the right answer.
One notable exception to this though is the movmisalign<mode> patterns: these are implemented using the vld1 and vst1 instructions, which load elements in "array" order (increasing elements from increasing memory locations), even in big-endian mode. Since vectors loaded using those instructions are "incompatible" with the above scheme, such misaligned accesses are simply disabled in big-endian mode.
Of course, generally, sticking with the current non-solution in big-endian mode is not sustainable (and is probably already broken in various cases). So it might be worth thinking about whether supporting big-endian mode properly, as well as handling the more complex load and store element/structure instructions, can be done using some generalised solution.
I'm thinking (without having much idea about how feasible such an idea is) of something along the lines of a function (in the mathematical sense) attached to each vector value manipulated by the vectorizer, to map that value's element numberings to and from memory offsets. So then the quad-word vector of 16-bit elements discussed above would look like, in big-endian mode:
foo, {6, 4, 2, 0, 14, 12, 10, 8}
Whereas in little-endian mode (or in big-endian mode, for vectors loaded using vld1), it would look like:
foo, {0, 2, 4, 6, 8, 10, 12, 14}
And then, perhaps more interestingly, a vector loaded using e.g. a "multiple 3-element structures" load,
vld3.16 {d1, d2, d3}, [rN]
might look like this (in either endianness, assuming we can represent a vector of such size in our hypothetical scheme):
foo, {0, 6, 12, 18, 2, 8, 14, 20, 4, 10, 16, 22}
Though it's not clear that such a scheme would be powerful enough to represent the whole range of element/structure loads/stores available (you'd probably need to be able to specify skipped or don't-care elements to do that, at least).
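(Purely to make the idea concrete, here is one possible C encoding of such a per-value map. All names here are made up for illustration; nothing like this exists in GCC:

  #include <stdint.h>

  /* A vector value tagged with, for each lane, the byte offset within
     the original memory access that the lane was loaded from; -1 would
     mark a skipped or don't-care lane, per the caveat above.  */
  #define LANE_DONT_CARE (-1)

  struct mapped_vec
  {
    int nlanes;
    int16_t offset[16];          /* offset[i] = memory offset of lane i */
  };

  /* The quad-word 16-bit examples from the text: */
  static const struct mapped_vec be_vldm_v8hi =
    { 8, { 6, 4, 2, 0, 14, 12, 10, 8 } };
  static const struct mapped_vec le_or_vld1_v8hi =
    { 8, { 0, 2, 4, 6, 8, 10, 12, 14 } };

  /* Two values could be combined element-wise only if their maps agree
     (or one is first permuted to match the other).  */
  static int
  maps_compatible (const struct mapped_vec *a, const struct mapped_vec *b)
  {
    int i;
    if (a->nlanes != b->nlanes)
      return 0;
    for (i = 0; i < a->nlanes; i++)
      if (a->offset[i] != b->offset[i])
        return 0;
    return 1;
  }

)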
Julian Brown julian@codesourcery.com wrote on 11/10/2010 04:29:15 PM:
> In further followups (at the risk of misrepresenting Joseph & Paul
> Brook's opinions!), there seemed to be general agreement that something
> like the scheme outlined below, with "permuting" loads/stores and some
> way of handling multiple in-register layouts for vectors, will be a
> necessary addition to the vectorizer going forward.
Hi,
Let me check that I understand the problem first: the problem is that VLD1 and VST1 instructions in big-endian mode follow the array numbering of elements, while all other memory instructions (VLDR, VLDM, VSTR, VSTM) do not. So, do we have two problems here? The first is that VLD1/VST1 and VLDR etc. can't be mixed in one computation; the second is that access to a single element is incorrect when VLDR etc. are used. Is that correct?

In addition, we need to think about how to represent VLD2/3, so the vectorizer can use them. Right?
> I'm thinking (without having much idea about how feasible such an idea
> is) of something along the lines of a function (in the mathematical
> sense) attached to each vector value manipulated by the vectorizer, to
> map that value's element numberings to and from memory offsets.
Joseph Myers joseph@codesourcery.com wrote on 08/10/2010 02:54:29 AM:
> Make it possible to describe in generic RTL a permuting vector load
> whose alignment requirement is element alignment, describe vld1 that
> way, and teach the vectorizer how to use such loads and stores.
Does that mean that the vectorizer will be aware of specific instructions?
I can see several places where the order of elements is important in the vectorizer's code generation:

- interleave_high/low and widening operations - but I am not sure that
  the current implementation suits NEON best, so maybe those are less
  important

- extraction of the scalar result in reduction

> The ARM implementations of reduction operations fortuitously calculate
> the results across all elements simultaneously, so when one of those
> elements is extracted, we still get the right answer.

  So, does that mean that's not a problem?

- various scalar/invariant vectors, including initializations for
  reduction and induction

- the order of elements in loads and stores should match
Thanks, Ira
On Thu, 14 Oct 2010, Ira Rosen wrote:
> Let me check that I understand the problem first: the problem is that
> VLD1 and VST1 instructions in big-endian mode follow the array numbering
> of elements, while all other memory instructions (VLDR, VLDM, VSTR,
> VSTM) do not. So, do we have two problems here? The first is that
> VLD1/VST1 and VLDR etc. can't be mixed in one computation; the second is
> that access to a single element is incorrect when VLDR etc. are used.
> Is that correct?
In terms of the native lane numbering used in NEON instructions, VLD1 and VST1 respect array ordering and are the instructions that can be used with single-element accesses, while the other instructions do not respect the ordering and cannot be so used without adjusting the element numbers.
In terms of the architecture-independent RTL semantics, VLDR, VLDM, VSTR and VSTM respect array ordering and can be used with single-element accesses, while VLD1 and VST1 do not respect the ordering and cannot be so used without adjusting element numbers.
The VLDR etc. order is the one required to be used for argument passing and return of vectors, and is the only one readily available when vectors are loaded/stored using core registers rather than NEON registers.
Thus, when generic RTL is generated from a NEON intrinsic (defined using native lane numbering) in big-endian mode, the lane number is adjusted to make the generic RTL correct, and when assembly code is generated from generic RTL the reverse adjustment is made.
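(For the v8hi example in the forwarded message, this adjustment works out to reversing the lane index within each 64-bit D register. A sketch of that rule, derived from the table in the forwarded message rather than from GCC's actual sources:

  /* Big-endian lane adjustment implied by the earlier table: lanes are
     reversed within each 64-bit D register.  Illustrative only.  */
  static inline unsigned
  adjust_lane_for_big_endian (unsigned lane, unsigned elem_bits)
  {
    unsigned lanes_per_dreg = 64 / elem_bits;  /* 4 for 16-bit elements */
    return lane ^ (lanes_per_dreg - 1);
  }
  /* 16-bit: 0<->3, 1<->2, 4<->7, 5<->6 -- matching the table.  */

)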
> In addition, we need to think about how to represent VLD2/3, so the
> vectorizer can use them. Right?
Yes. (I think code using arrays of red/green/blue values is the sort of real-world (and benchmark) code expected to be vectorized using VLD3.)
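(An illustrative example, not taken from any particular benchmark -- the kind of scalar loop over interleaved RGB data that VLD3/VST3 would map onto:

  #include <stdint.h>

  void
  brighten (uint8_t *rgb, int n, uint8_t amount)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        rgb[3 * i + 0] += amount;   /* red   */
        rgb[3 * i + 1] += amount;   /* green */
        rgb[3 * i + 2] += amount;   /* blue  */
      }
  }

Each iteration touches three adjacent elements with stride 3, which is exactly the access pattern VLD3/VST3 implement.)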
> Joseph Myers joseph@codesourcery.com wrote on 08/10/2010 02:54:29 AM:
>
> > Make it possible to describe in generic RTL a permuting vector load
> > whose alignment requirement is element alignment, describe vld1 that
> > way, and teach the vectorizer how to use such loads and stores.
>
> Does that mean that the vectorizer will be aware of specific
> instructions?
I would imagine that it would need to know what permutations are available, yes (GIMPLE and RTL would have some form of general permuting load/store operation, which the vectorizer would only generate where relevant instructions exist for the chosen permutation).
Joseph Myers joseph@codesourcery.com wrote on 14/10/2010 05:18:37 PM:

> > Does that mean that the vectorizer will be aware of specific
> > instructions?
>
> I would imagine that it would need to know what permutations are
> available, yes (GIMPLE and RTL would have some form of general permuting
> load/store operation, which the vectorizer would only generate where
> relevant instructions exist for the chosen permutation).
So, there will be a new tree code, e.g. PERM_LOAD_EXPR, and the vectorizer will use it for misaligned loads in big endian (or maybe for little endian as well), and for strided loads. The vectorizer will check whether the instruction is supported, giving the desired stride (1, 2, 3) as input, and will receive a mask. It will use the mask to permute all other relevant vectors (like vectors of constants) if necessary, making all the generic GIMPLE and RTL correct. And later, when assembly code is generated, everything should be permuted again?
Thanks, Ira
On Mon, 18 Oct 2010, Ira Rosen wrote:
> > > Does that mean that the vectorizer will be aware of specific
> > > instructions?
> >
> > I would imagine that it would need to know what permutations are
> > available, yes (GIMPLE and RTL would have some form of general
> > permuting load/store operation, which the vectorizer would only
> > generate where relevant instructions exist for the chosen permutation).
>
> So, there will be a new tree code, e.g. PERM_LOAD_EXPR, and the
> vectorizer will use it for misaligned loads in big endian (or maybe for
> little endian as well), and for strided loads. The vectorizer will check
> whether the instruction is supported, giving the desired stride (1, 2,
> 3) as input, and will receive a mask. It will use the mask to permute
> all other relevant vectors (like vectors of constants) if necessary,
> making all the generic GIMPLE and RTL correct. And later, when assembly
> code is generated, everything should be permuted again?
Yes, something like that.
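(A sketch of those semantics as a C model -- PERM_LOAD_EXPR is Ira's proposed name, not an existing tree code, and the masks below are invented for illustration:

  #include <stdint.h>

  /* Model of a permuting load: the target reports, for a given stride,
     a mask such that out[i] = base[mask[i]].  All generic-level code
     then reasons in terms of the mask.  */
  static void
  perm_load_v8hi (const int16_t *base, const uint8_t mask[8],
                  int16_t out[8])
  {
    int i;
    for (i = 0; i < 8; i++)
      out[i] = base[mask[i]];
  }

  /* E.g. for stride 2, a target with VLD2-like instructions might report
     these masks for the two loaded vectors (even and odd elements): */
  static const uint8_t stride2_even[8] = { 0, 2, 4, 6, 8, 10, 12, 14 };
  static const uint8_t stride2_odd[8]  = { 1, 3, 5, 7, 9, 11, 13, 15 };

Invariant vectors, such as vectors of constants, would then be permuted with the same mask so that lane numbers stay consistent throughout the computation.)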