One of the vectorisation discussions from last year was about the poor code GCC generates for vld{2,3,4}_*() and vst{2,3,4}_*(). It forces the result of the loads onto the stack, then loads the individual pieces from there. It does the same thing in reverse for stores.
I think there are two major problems here:
1. The result of the vld*() is a record type such as:
typedef struct int16x4x3_t { int16x4_t val[3]; } int16x4x3_t;
Ideally, we'd like one of these structures to be stored in a pseudo register. However, the ARM port currently limits in-register record types to 64 bits, so something this big is always given BLKmode and stored on the stack.
A simple "fix" for this is to increase MAX_FIXED_MODE_SIZE. That would do the right thing for the structures in arm_neon.h, but wouldn't be safe in general.
2. The vld*() returns values as a single integer (such as EI mode), while uses of the value will typically be in a vector mode such as V4SI. CANNOT_CHANGE_MODE_CLASS doesn't allow direct "mode-punning" between the two in VFP_REGS, so this again forces the punning to be done on the stack.
The code in question is:
/* FPA registers can't do subreg as all values are reformatted to internal precision. VFP registers may only be accessed in the mode they were set. */ #define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS) \ (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO) \ ? reg_classes_intersect_p (FPA_REGS, (CLASS)) \ || reg_classes_intersect_p (VFP_REGS, (CLASS)) \
However, the VFP restriction appears to be specific to VFPv1 -- thanks to Peter for the archaeology -- and isn't a problem for v6+. In that case, removing this restriction is an important optimisation.
I tried the patch below on the following simple testcase:
#include "arm_neon.h"
void foo (uint16_t *a) { uint16x4x3_t x, y;
x = vld3_u16 (a); y = vld3_u16 (a + 12); x.val[0] = vadd_u16 (x.val[0], y.val[0]); x.val[1] = vadd_u16 (x.val[1], y.val[1]); x.val[2] = vadd_u16 (x.val[2], y.val[2]); vst3_u16 (a, x); }
(not necessarily sensible!). Before the patch, -O2 produced:
sub sp, sp, #48 add r3, r0, #24 vld3.16 {d16-d18}, [r3] vld3.16 {d20-d22}, [r0] add r3, sp, #24 vstmia sp, {d20-d22} vstmia r3, {d16-d18} fldd d19, [sp, #8] fldd d16, [sp, #0] fldd d17, [sp, #24] fldd d20, [sp, #32] vadd.i16 d18, d16, d17 vadd.i16 d17, d19, d20 fldd d19, [sp, #16] fldd d20, [sp, #40] vadd.i16 d16, d19, d20 fstd d18, [sp, #0] fstd d17, [sp, #8] fstd d16, [sp, #16] vldmia sp, {d16-d18} vst3.16 {d16-d18}, [r0] add sp, sp, #48 bx lr
After the patch we get:
vld3.16 {d24-d26}, [r0] add r3, r0, #24 vld3.16 {d20-d22}, [r3] vmov q8, q12 @ ti vadd.i16 d17, d17, d21 vadd.i16 d16, d24, d20 vadd.i16 d18, d26, d22 vst3.16 {d16-d18}, [r0] bx lr
The VMOV is a bit disappointing, and needs further investigation.
The first hunk fixes (2), and I think is correct. The second hunk hacks (1), and isn't suitable in itself. I'll next try to make arm_neon.h use built-in record types that are explicitly EImode, which should remove the need to change MAX_FIXED_MODE_SIZE.
Richard
Index: gcc/gcc/config/arm/arm.h =================================================================== --- gcc.orig/gcc/config/arm/arm.h +++ gcc/gcc/config/arm/arm.h @@ -1171,10 +1171,12 @@ enum reg_class /* FPA registers can't do subreg as all values are reformatted to internal precision. VFP registers may only be accessed in the mode they were set. */ -#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS) \ - (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO) \ - ? reg_classes_intersect_p (FPA_REGS, (CLASS)) \ - || reg_classes_intersect_p (VFP_REGS, (CLASS)) \ 2+#define CANNOT_CHANGE_MODE_CLASS(FROM, TO, CLASS) \ + (GET_MODE_SIZE (FROM) != GET_MODE_SIZE (TO) \ + ? (reg_classes_intersect_p (FPA_REGS, (CLASS)) \ + || (TARGET_VFP \ + && reg_classes_intersect_p (VFP_REGS, (CLASS)) \ + && arm_fpu_desc->rev == 1)) \ : 0)
/* The class value for index registers, and the one for base regs. */ @@ -2458,4 +2460,6 @@ enum arm_builtins instruction. */ #define MAX_LDM_STM_OPS 4
+#define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (XImode) + #endif /* ! GCC_ARM_H */
Hi,
On Mon, 21 Feb 2011 17:20:52 +0000 Richard Sandiford richard.sandiford@linaro.org wrote:
One of the vectorisation discussions from last year was about the poor code GCC generates for vld{2,3,4}_*() and vst{2,3,4}_*(). It forces the result of the loads onto the stack, then loads the individual pieces from there. It does the same thing in reverse for stores.
I think there are two major problems here: [...]
I wanted to finish this off before making it public, but I think that to do so at this point is just going to result in duplicated effort. The attached patch does several things:
1. Struct (tree) types are defined via hard-wired code in the ARM backend rather than in arm_neon.h. The "type mode" of those struct types is overridden to be an extra-wide vector, the width of the whole struct (so int32x2x2_t would be V4SImode, etc.).
2. Builtins (__builtin_neon_*) which previously used "big" integer modes to pass/return values, are initialised such that they directly pass/return the struct types above instead. The intrinsic wrappers in arm_neon.h no longer need to use unions to pun the types of arguments & return values.
3. When those builtins are expanded, they now use the extra-wide vectors. The corresponding instruction patterns in neon.md also use wide vectors (rather than wide integer modes).
4. CANNOT_CHANGE_MODE_CLASS is redefined, somewhat like you describe.
The patch seems to work OK, at least for C. A couple of glaring issues remain:
1. It's only tested very lightly.
2. C++ is totally broken. The creation of "NEON" structs in the backend knows nothing of the subtle nuances of the C++ frontend's representation of structs/classes (one idea is to provide a "promotion" language hook in the C++ frontend to turn a C-like struct into a C++-like struct, but I don't know how acceptable that would be upstream).
3. There may be ABI-related issues regarding passing/returning the new backend-created struct types to/from regular functions.
The patch does not fix:
The VMOV is a bit disappointing, and needs further investigation.
But that should be fairly easy: we just need to expand to subreg operations for vcombine, vget_high/vget_low, etc. -- which I believe will work fine now (it didn't at some point in the distant past, which is why we have hardwired "vmov"s all over the place). Big-endian mode may need some care to be taken, as ever.
I don't think I'm going to have time to work on this in the immediate future: please feel free to use it as a base, or ignore it if your approach is simpler/better :-).
Cheers,
Julian
ChangeLog
gcc/ * config/arm/arm.c (arm_legitimate_address_outer_p) (thumb2_legitimate_address_p, output_move_neon): Use VALID_NEON_VEC_ARRAY_MODE instead of VALID_NEON_STRUCT_MODE. (arm_legitimate_index_p, thumb2_legitimate_index_p): Permit only zero indices for big vector modes. (arm_attr_length_move_neon): Check for explicit vector modes rather than big integer modes (OImode, etc.). (arm_hard_regno_mode_ok): Likewise. (arm_can_change_mode_class): New. Implement ARM_CANNOT_CHANGE_MODE_CLASS macro (negated). (arm_build_neon_struct_type): New. (neon_builtin_type_bits): Add unsigned/polynomial variants. Remove wide integer variants. (*_UP): Adjust accordingly. (T_MAX): Update number of variants. (*_REALMODE, REALMODE): Define macros. (CF2, CF3): New macros. (CF): Adjust to use above. (neon_builtin_data): Adjust entries for VTBL, VTBX and element/structure loads/stores to use explicit unsigned/polynomial variants where appropriate. (vec_array_struct_type_info): New. (arm_init_neon_builtins): Add explicit unsigned/polynomial types. Create structures used for passing/returning multiple vectors, and use to initialise builtins which use them. * config/arm/arm.h (VALID_NEON_STRUCT_MODE): Remove. (VALID_NEON_VEC_ARRAY_MODE): New macro. (CANNOT_CHANGE_MODE_CLASS): Remove comment. Implement using arm_can_change_mode_class. * config/arm/arm-modes.def (VECTOR_MODES (INT, ...)): Add 24-, 32-, 48- and 64-bit integer vector modes. (VECTOR_MODES (FLOAT, ...)): Likewise, for float modes. (EI, OI, CI, XI): Remove opaque integer modes. * config/arm/arm-protos.h (arm_can_change_mode_class): Add prototype. * config/arm/neon.md (VTAB, VTAB_n, V_PAIR, V_pair): Remove unused mode iterators/attributes. (VSTRUCT): Use explicit vector types instead of opaque integers. (VSTRUCT3, VSTRUCT4, VSTRUCT6, VSTRUCT8): New mode iterators. (VSTR3_LO, VSTR3_HI, VSTR_PART): New mode attributes. (V_two_elem, V_three_elem, V_four_elem): Use explicit vector types where appropriate. (V_DOUBLE): Add new wide vector types. (V_TRIPLE, V_TRIPLE_HALF, V_QUAD, V_QUAD_HALF): New. (*neon_mov<mode> splitters): Split using vector types rather than opaque integers. (neon_vtbl2v8qi, neon_vtbl3v8qi, neon_vtbl4v8qi, neon_vtbx2v8qi) (neon_vtbx3v8qi, neon_vtbx4v8qi): Use vector types not opaque integers. (neon_vld2<mode>, neon_vld2_lane<mode>, neon_vld2_dup<mode>) (neon_vst2<mode>, neon_vst2_lane<mode>, neon_vld3<mode>) (neon_vld3qa<mode>, neon_vld3qb<mode>, neon_vld3_lane<mode>) (neon_vld3_dup<mode>, neon_vst3<mode>, neon_vst3qa<mode>) (neon_vst3qb<mode>, neon_vst3_lane<mode>, neon_vld4<mode>) (neon_vld4qa<mode>, neon_vld4qb<mode>, neon_vld4_lane<mode>) (neon_vst4<mode>, neon_vst4qa<mode>, neon_vst4qb<mode>) (neon_vst4_lane<mode>): Use wide vector modes instead of opaque integer ones. * config/arm/neon.ml (features): Add Distinct_types feature. (ops): Use Distinct_types for VTBL, VTBX, element/structure load/store instruction definitions. * config/arm/neon-gen.ml (cast_for_return): Only add cast when needed. (return): When returning a multiple vectors via a struct, just return the value rather than forcing type-conversion through a union. (params): Similar, when passing a struct to a builtin. (type_suffix): New. (print_variant): Support intrinsics which use distinct types for each underlying builtin. (arrtypes): Emit typedefs to builtin struct type names, rather than actually defining structs. * config/arm/arm_neon.h: Regenerate.
Julian Brown julian@codesourcery.com writes:
Richard Sandiford richard.sandiford@linaro.org wrote:
One of the vectorisation discussions from last year was about the poor code GCC generates for vld{2,3,4}_*() and vst{2,3,4}_*(). It forces the result of the loads onto the stack, then loads the individual pieces from there. It does the same thing in reverse for stores.
I think there are two major problems here: [...]
I wanted to finish this off before making it public, but I think that to do so at this point is just going to result in duplicated effort.
Yeah, sorry about that. :-( I started out by looking at the tree-level side, but in the end decided that I needed to tackle the rtl level first.
- Struct (tree) types are defined via hard-wired code in the ARM
backend rather than in arm_neon.h. The "type mode" of those struct types is overridden to be an extra-wide vector, the width of the whole struct (so int32x2x2_t would be V4SImode, etc.).
FWIW, I was going to try to avoid this. I think instead we should automatically use vector modes for structures like those in arm_neon.h, via a target hook. It would improve the code quality for general code that has the same sort of small-array-of-vectors structure. (In other words, this would help when using the generic vector extensions rather than the Neon-specific builtins.)
- Builtins (__builtin_neon_*) which previously used "big" integer
modes to pass/return values, are initialised such that they directly pass/return the struct types above instead. The intrinsic wrappers in arm_neon.h no longer need to use unions to pun the types of arguments & return values.
Yeah, I'd wondered about that too. However, these days, I think we ought to be able to generate good code for this type of union, and we seem to for the cases I've tried. In the end I thought it was better to keep the underlying built-in function close to the rtl pattern. E.g. the fact that the name of the field is "val" seems more like an arm_neon.h detail than something that should be hard-coded into GCC.
- When those builtins are expanded, they now use the extra-wide
vectors. The corresponding instruction patterns in neon.md also use wide vectors (rather than wide integer modes).
I was also going to try defining non-power-of-two vectors. Glad to hear it works! (That was actually the main motivation for doing the rtl side first: to see whether it really would be OK to ask the vectoriser to treat these values as single vectors.)
I think we should keep the integer modes too, though, just like we allow DImode for double registers. (I'm surprised to see we don't allow TImode for quad registers TBH -- might look into that.) Given that there are no architectual restrictions on mode punning, these integer modes are useful neutral ground.
The VMOV is a bit disappointing, and needs further investigation.
But that should be fairly easy: we just need to expand to subreg operations for vcombine, vget_high/vget_low, etc. -- which I believe will work fine now (it didn't at some point in the distant past, which is why we have hardwired "vmov"s all over the place). Big-endian mode may need some care to be taken, as ever.
This VMOV is coming from a plain register-register SET that we fail to optimise away. The pattern is:
(set (reg NEWX) (reg OLDX)) (set (subreg (reg NEWX 0)) (plus (subreg (reg OLDX 0)) ...)) ; OLDX dead (set (subreg (reg NEWX 8)) (plus (subreg (reg NEWX 8)) ...)) ...
where what we really want is for the second instruction to use NEWX (or for NEWX not to exist at all, whichever way to prefer to look at it). I think it's a general subreg optimisation problem.
I don't think I'm going to have time to work on this in the immediate future: please feel free to use it as a base, or ignore it if your approach is simpler/better :-).
Thanks. I might well end up "borrowing" the vector-mode stuff.
Richard
On Tue, 22 Feb 2011 09:42:15 +0000 Richard Sandiford richard.sandiford@linaro.org wrote:
Julian Brown julian@codesourcery.com writes:
Richard Sandiford richard.sandiford@linaro.org wrote:
- Struct (tree) types are defined via hard-wired code in the ARM
backend rather than in arm_neon.h. The "type mode" of those struct types is overridden to be an extra-wide vector, the width of the whole struct (so int32x2x2_t would be V4SImode, etc.).
FWIW, I was going to try to avoid this. I think instead we should automatically use vector modes for structures like those in arm_neon.h, via a target hook. It would improve the code quality for general code that has the same sort of small-array-of-vectors structure. (In other words, this would help when using the generic vector extensions rather than the Neon-specific builtins.)
That sounds like a good plan, I think.
- Builtins (__builtin_neon_*) which previously used "big" integer
modes to pass/return values, are initialised such that they directly pass/return the struct types above instead. The intrinsic wrappers in arm_neon.h no longer need to use unions to pun the types of arguments & return values.
Yeah, I'd wondered about that too. However, these days, I think we ought to be able to generate good code for this type of union, and we seem to for the cases I've tried. In the end I thought it was better to keep the underlying built-in function close to the rtl pattern. E.g. the fact that the name of the field is "val" seems more like an arm_neon.h detail than something that should be hard-coded into GCC.
I still think it's a good idea to get rid of the unions in this case, or at least, replace the wide-integer modes in the unions with wide vectors. But I'm happy with whatever works :-).
- When those builtins are expanded, they now use the extra-wide
vectors. The corresponding instruction patterns in neon.md also use wide vectors (rather than wide integer modes).
I was also going to try defining non-power-of-two vectors. Glad to hear it works! (That was actually the main motivation for doing the rtl side first: to see whether it really would be OK to ask the vectoriser to treat these values as single vectors.)
(Caveat: that's one of the things that isn't well-tested. It works to define those modes, at least.)
I think we should keep the integer modes too, though, just like we allow DImode for double registers. (I'm surprised to see we don't allow TImode for quad registers TBH -- might look into that.) Given that there are no architectual restrictions on mode punning, these integer modes are useful neutral ground.
I'm not convinced by that. Consider that:
1. There are no useful operations on the wide-integer types.
2. There is no way of representing constants in RTL for the wide-integer types (we have an internal bug where a constant-zero OImode value is synthesized when using NEON intrinsics, and it leads to an ICE: see emit-rtl.c:immed_double_const -- OImode, etc. are wider than 2 * HOST_BITS_PER_WIDE_INT, so fail the second assertion).
Unless we can show that the wide-integer modes are really needed, and patch things up so (2) no longer holds, I'd strongly prefer to see them disappear. (As a side note, we obviously want to avoid wide-integer or wide-vector modes *ever* being reloaded via core registers, since there simply aren't enough of them for that to be possible. I think that may be one of the reasons that integer-equivalent modes for each size have traditionally been used by the compiler?).
The VMOV is a bit disappointing, and needs further investigation.
But that should be fairly easy: we just need to expand to subreg operations for vcombine, vget_high/vget_low, etc. -- which I believe will work fine now (it didn't at some point in the distant past, which is why we have hardwired "vmov"s all over the place). Big-endian mode may need some care to be taken, as ever.
This VMOV is coming from a plain register-register SET that we fail to optimise away. The pattern is:
(set (reg NEWX) (reg OLDX)) (set (subreg (reg NEWX 0)) (plus (subreg (reg OLDX 0)) ...)) ;
OLDX dead (set (subreg (reg NEWX 8)) (plus (subreg (reg NEWX 8)) ...)) ...
where what we really want is for the second instruction to use NEWX (or for NEWX not to exist at all, whichever way to prefer to look at it). I think it's a general subreg optimisation problem.
Hmm, not sure about that then.
I don't think I'm going to have time to work on this in the immediate future: please feel free to use it as a base, or ignore it if your approach is simpler/better :-).
Thanks. I might well end up "borrowing" the vector-mode stuff.
Cheers,
Julian
Julian Brown julian@codesourcery.com writes:
- Builtins (__builtin_neon_*) which previously used "big" integer
modes to pass/return values, are initialised such that they directly pass/return the struct types above instead. The intrinsic wrappers in arm_neon.h no longer need to use unions to pun the types of arguments & return values.
Yeah, I'd wondered about that too. However, these days, I think we ought to be able to generate good code for this type of union, and we seem to for the cases I've tried. In the end I thought it was better to keep the underlying built-in function close to the rtl pattern. E.g. the fact that the name of the field is "val" seems more like an arm_neon.h detail than something that should be hard-coded into GCC.
I still think it's a good idea to get rid of the unions in this case, or at least, replace the wide-integer modes in the unions with wide vectors. But I'm happy with whatever works :-).
Yeah, I was wondering about changing the mode to a vector mode (which is what the rtl pattern will use), but keeping the union.
I think we should keep the integer modes too, though, just like we allow DImode for double registers. (I'm surprised to see we don't allow TImode for quad registers TBH -- might look into that.) Given that there are no architectual restrictions on mode punning, these integer modes are useful neutral ground.
I'm not convinced by that. Consider that:
There are no useful operations on the wide-integer types.
There is no way of representing constants in RTL for the
wide-integer types (we have an internal bug where a constant-zero OImode value is synthesized when using NEON intrinsics, and it leads to an ICE: see emit-rtl.c:immed_double_const -- OImode, etc. are wider than 2 * HOST_BITS_PER_WIDE_INT, so fail the second assertion).
Unless we can show that the wide-integer modes are really needed, and patch things up so (2) no longer holds, I'd strongly prefer to see them disappear. (As a side note, we obviously want to avoid wide-integer or wide-vector modes *ever* being reloaded via core registers, since there simply aren't enough of them for that to be possible. I think that may be one of the reasons that integer-equivalent modes for each size have traditionally been used by the compiler?).
I think the main reason is that integer modes simply represent a string of bits, without any particular interpretation (not even signed vs. unsigned). So the restrictions on subregs are much more lax than they are for more specialised modes like floating-point and vector modes.
We can stop large modes from being stored in integer registers regardless of whether the mode is an integer or a vector (and we already do this). There's certainly the potential to mess things up by defining the secondary reload hooks in the wrong way -- see for instance
http://gcc.gnu.org/ml/gcc-patches/2011-01/msg02266.html
-- but the restriction is already there in principle.
I take your point about (2) not being possible at the moment, but that's a flaw that's worth fixing in its own right.
Richard
linaro-toolchain@lists.linaro.org