Hi,
Here's a work-in-progress patch which fixes many execution failures seen in big-endian mode when -mvectorize-with-neon-quad is in effect (which is soon to be the default, if we stick to the current plan). But, it's pretty hairy, and I'm not at all convinced it's not working "for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian mode. This patch works with the assumption that quad-word vectors in big-endian mode are in "vldm" order (i.e. with constituent double-words in little-endian order: see previous discussions). But, that's pretty confusing, leads to less than optimal code, and is bound to cause more problems in the future. So I'm not sure how much effort to expend on making it work right, given that we might be throwing that vector ordering away in the future (at least in some cases: see below).
The "problem" patterns are as follows.
* Full-vector shifts: these don't work with big-endian vldm-order quad vectors. For now, I've disabled them, although they could potentially be implemented using vtbl (at some cost).
* Widening moves (unpacks) & widening multiplies: when widening from D-reg to Q-reg size, we must swap double-words in the result (I've done this with vext). This seems to work fine, but which halves "hi" and "lo" refer to is rather muddled (in my head, at least!). These patterns should also be expanders rather than emitting multiple assembler insns. (See the example loops after this list.)
* Narrowing moves: implemented by "open-coded" permute & vmovn (for 2x D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as suggested by Paul). These seem to work fine.
* Reduction operations: when reducing Q-reg values, GCC currently tries to extract the result from the "wrong half" of the reduced vector. The fix in the attached patch is rather dubious, but seems to work (I'd like to understand better why it does).
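To make the widening-multiply and reduction items above concrete, here are two illustrative C loops (not taken from the patch; the function names are mine) of the sort that should exercise those patterns when vectorized:

    /* Widening multiply: should end up using the vec_widen_smult_lo/hi
       patterns when vectorized.  */
    void widen_mult (int *out, const short *a, const short *b, int n)
    {
      int i;
      for (i = 0; i < n; i++)
        out[i] = (int) a[i] * b[i];
    }

    /* Sum reduction: the final step should use reduc_splus_<mode>.  */
    int sum (const int *a, int n)
    {
      int i, s = 0;
      for (i = 0; i < n; i++)
        s += a[i];
      return s;
    }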
We can sort those bits out, but the question is, do we want to go that route? Vectors are used in three quite distinct ways by GCC:
1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.
For the first of these, I think we can get away with changing the vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so that vector registers will hold elements in memory order -- so, all the contortions in the attached patch will be unnecessary. ABI issues are irrelevant, since vectors are "invisible" at the source code layer generally, including at ABI boundaries.
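(As an intrinsics-level illustration of what "array" loads/stores do -- this sketch is mine, not part of the patch -- vld3/vst3 load and store three D registers with the elements deinterleaved, so each register holds its elements in memory order:)

    #include <arm_neon.h>

    /* Halve the red channel of packed RGB data, 8 pixels at a time.
       Illustrative only: assumes n is a multiple of 8.  */
    void halve_red (uint8_t *rgb, int n)
    {
      int i;
      for (i = 0; i < n; i += 8)
        {
          uint8x8x3_t pix = vld3_u8 (rgb + 3 * i);  /* deinterleaving load */
          pix.val[0] = vshr_n_u8 (pix.val[0], 1);   /* red plane, memory order */
          vst3_u8 (rgb + 3 * i, pix);               /* interleaving store */
        }
    }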
For the second, intrinsics, we should do exactly what the user requests: so, vectors are essentially treated as opaque objects. This isn't a problem as such, but it might mean that instruction patterns written using "canonical" RTL for the vectorizer can't be shared with intrinsics when the order of elements matters. (I'm not sure how many patterns this would affect at present; possibly none.)
The third case would continue to use "vldm" ordering, so if users inadvertently write code such as:
res = vaddq_u32 (*foo, bar);
instead of writing an explicit vld* intrinsic (for the load of *foo), the result might be different from what they expect. It'd be nice to diagnose such code as erroneous, but that's another issue.
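(To make the distinction concrete -- this is a sketch of mine, not from the patch -- compare the generic-vector-style dereference with an explicit vld1 load:)

    #include <arm_neon.h>

    /* The load of *foo goes through the normal vector move patterns,
       i.e. "vldm" order.  */
    uint32x4_t add_deref (uint32x4_t *foo, uint32x4_t bar)
    {
      return vaddq_u32 (*foo, bar);
    }

    /* The explicit vld1 intrinsic defines the element ordering itself.  */
    uint32x4_t add_vld1 (const uint32_t *foo, uint32x4_t bar)
    {
      return vaddq_u32 (vld1q_u32 (foo), bar);
    }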
The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
So, anyway, back to the patch in question. The choices are, I think:
1. Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
2. Apply a version which simply disables all the troublesome patterns until the same support appears.
Apologies if I'm retreading old ground ;-).
(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good code for the quad-word vec_pack_trunc_<mode> pattern. It would eventually be applied as a separate patch.)
Thoughts?
Julian
ChangeLog
gcc/
        * config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
        of vector registers.
        * config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
        big-endian mode.
        (reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
        (reduc_umin_<mode>, reduc_umax_<mode>)
        (neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
        (neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
        (vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
        big-endian mode for quad-word vectors.
Hi Julian,
On 12 November 2010 17:49, Julian Brown julian@codesourcery.com wrote:
For the first of these, I think we can get away with changing the vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so that vector registers will hold elements in memory order -- so, all the contortions in the attached patch will be unnecessary. ABI issues are irrelevant, since vectors are "invisible" at the source code layer generally, including at ABI boundaries.
... The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
From the vectorizer's point of view, target hooks look like the easiest solution (though an ugly one). I am trying to think of something else, but nothing really makes sense.
So, anyway, back to the patch in question. The choices are, I think:
- Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
- Apply a version which simply disables all the troublesome patterns until the same support appears.
I slightly prefer the first one, it's kind of an incremental solution.
Ira
On Mon, 15 Nov 2010 10:12:26 +0200 Ira Rosen ira.rosen@linaro.org wrote:
Hi Julian,
On 12 November 2010 17:49, Julian Brown julian@codesourcery.com wrote:
... The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
Yes, I now think they're unnecessary, at least in theory. In practice if patterns are shared between the vectorizer and generic vector machinery then they may need some attention -- e.g. we wouldn't want to use vec_shl_<mode>/vec_shr_<mode> still for big-endian generic vectors, since they'd permute the result. Maybe we could use a target macro named something like *_GENERIC_VECTORS_IN_ARRAY_ORDER to control whether such (element-ordering dependent) patterns could be used for generic vectors.
From the vectorizer's point of view, target hooks look like the easiest solution (though an ugly one). I am trying to think of something else, but nothing really makes sense.
Yeah, I was thinking something along those lines. So we might have TARGET_VECTORIZER_ARRAY_LOADS or something to use new "standard names" for loading/storing vectors in array order, rather than using plain (mem (foo)).
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously. Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
So, anyway, back to the patch in question. The choices are, I think:
- Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
- Apply a version which simply disables all the troublesome patterns until the same support appears.
I slightly prefer the first one, it's kind of an incremental solution.
OK. I'll try to figure out the wrinkles.
Thanks,
Julian
On 15 November 2010 17:33, Julian Brown julian@codesourcery.com wrote:
On Mon, 15 Nov 2010 10:12:26 +0200 Ira Rosen ira.rosen@linaro.org wrote:
Hi Julian,
On 12 November 2010 17:49, Julian Brown julian@codesourcery.com wrote:
... The important observation is that vectors from case 1 and from cases 2/3 never interact: it's quite safe for them to use different element orderings, without extensive changes to GCC infrastructure (i.e., multiple internal representations). I don't think I quite realised this previously.
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
Yes, I now think they're unnecessary, at least in theory. In practice if patterns are shared between the vectorizer and generic vector machinery then they may need some attention -- e.g. we wouldn't want to use vec_shl_<mode>/vec_shr_<mode> still for big-endian generic vectors, since they'd permute the result. Maybe we could use a target macro named something like *_GENERIC_VECTORS_IN_ARRAY_ORDER to control whether such (element-ordering dependent) patterns could be used for generic vectors.
What do you mean by generic vectors? I usually use "generic vectors" to mean, for example, a vector of 4 chars emulated in a 32-bit int when there is no vector unit, but I guess you are talking about intrinsics.
From the vectorizer's point of view, target hooks look like the easiest solution (though an ugly one). I am trying to think of something else, but nothing really makes sense.
Yeah, I was thinking something along those lines. So we might have TARGET_VECTORIZER_ARRAY_LOADS or something to use new "standard names" for loading/storing vectors in array order, rather than using plain (mem (foo)).
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously. Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
So, anyway, back to the patch in question. The choices are, I think:
- Apply as-is (after I've ironed out the wrinkles), and then remove the "ugly" bits at a later point when vectorizer "array load/store" support is implemented.
- Apply a version which simply disables all the troublesome patterns until the same support appears.
I slightly prefer the first one, it's kind of an incremental solution.
OK. I'll try to figure out the wrinkles.
I hope it's not too much work, because otherwise there is no point, since we are going to remove it later anyway.
Thanks, Ira
On Wed, 17 Nov 2010 12:16:34 +0200 Ira Rosen ira.rosen@linaro.org wrote:
On 15 November 2010 17:33, Julian Brown julian@codesourcery.com wrote:
On Mon, 15 Nov 2010 10:12:26 +0200 Ira Rosen ira.rosen@linaro.org wrote:
Do you think now that the changes in GIMPLE and RTL (a function attached to each vector) are unnecessary?
Yes, I now think they're unnecessary, at least in theory. In practice if patterns are shared between the vectorizer and generic vector machinery then they may need some attention -- e.g. we wouldn't want to use vec_shl_<mode>/vec_shr_<mode> still for big-endian generic vectors, since they'd permute the result. Maybe we could use a target macro named something like *_GENERIC_VECTORS_IN_ARRAY_ORDER to control whether such (element-ordering dependent) patterns could be used for generic vectors.
What do you mean by generic vectors? I usually use "generic vectors" to mean, for example, a vector of 4 chars emulated in a 32-bit int when there is no vector unit, but I guess you are talking about intrinsics.
I mean the user-visible support for directly writing cross-platform vector code, as described in:
http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/Vector-Extensions.html
I.e., declaring a type using the vector_size attribute, then using regular arithmetic operators on it (some work is being done at present to extend the number of available operations a little). AFAIK use of this feature isn't especially widespread, nor particularly well-specified: I mainly brought it up to mention the possibility of inadvertently mixing "generic vector" dereferences with NEON intrinsics.
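(For reference, a minimal example of the generic vector extension -- the type and function names are mine, just to show the sort of code meant here:)

    typedef int v4si __attribute__ ((vector_size (16)));

    v4si add_v4si (v4si a, v4si b)
    {
      return a + b;   /* element-wise addition, no intrinsics involved */
    }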
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously. Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
I slightly prefer the first one, it's kind of an incremental solution.
OK. I'll try to figure out the wrinkles.
I hope it's not too much work, because otherwise there is no point, since we are going to remove it later anyway.
Shouldn't be more than 1-2 days, but I've been distracted by other bugs...
Julian
On 17 November 2010 13:21, Julian Brown julian@codesourcery.com wrote:
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously.
Alignment info is kept in struct ptr_info_def. Is it necessary to represent the stride? Multiple loads/stores seem the most complicated part to me. In neon.md, vld is implemented with output_asm_insn. Is that going to change? Does it ensure consecutive (or stride-two) registers?
Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
and then combined into vld3?
Ira
On 22 November 2010 13:46, Ira Rosen ira.rosen@linaro.org wrote:
On 17 November 2010 13:21, Julian Brown julian@codesourcery.com wrote:
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously.
Alignment info is kept in struct ptr_info_def. Is it necessary to represent the stride? Multiple loads/stores seem the most complicated part to me. In neon.md, vld is implemented with output_asm_insn. Is that going to change? Does it ensure consecutive (or stride-two) registers?
Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Ira
On Tue, 30 Nov 2010 12:25:18 +0200 Ira Rosen ira.rosen@linaro.org wrote:
On 22 November 2010 13:46, Ira Rosen ira.rosen@linaro.org wrote:
On 17 November 2010 13:21, Julian Brown julian@codesourcery.com wrote:
We'd need to figure out what the RTL for such loads/stores should look like, and whether it can represent alignment constraints, or strides, or loads/stores of multiple vector registers simultaneously.
Alignment info is kept in struct ptr_info_def. Is it necessary to represent the stride? Multiple loads/stores seem the most complicated part to me. In neon.md, vld is implemented with output_asm_insn. Is that going to change? Does it ensure consecutive (or stride-two) registers?
Getting it right might be a bit awkward, especially if we want to consider a scope wider than just NEON, i.e. other vector architectures also.
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
(parallel [(set (reg) (mem addr))
           (set (reg+2) (mem (plus addr 8)))
           (set (reg+4) (mem (plus addr 16)))])
That should work fine I think -- but how to do register-allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so aren't directly comparable.
Choices I can think of are:
1. Use "big integer" modes (TImode, OImode, XImode...), as in the present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
2. Use "big vector" modes (representing multiple vector registers -- up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
(parallel [(use (reg:V8SI 0))
           (set (subreg:V2SI (match_dup 0) 0) (mem))
           (set (subreg:V2SI (match_dup 0) 8) (plus (mem) 8))
           (set (subreg:V2SI (match_dup 0) 16) (plus (mem) 16))
           (set (subreg:V2SI (match_dup 0) 24) (plus (mem) 24))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering vld*/vst* intrinsics -- but that's something I'm planning to do anyway, and probably also tweaking the way "foo.val[X]" access (again for intrinsics) is expanded in the front-ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
3. Treat vld* & vst* like (or even as?) libcalls. Regular function calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store,
(set (reg:V2SI d0) (...))
(set (reg:V2SI d1) (...))
(call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
Load,
(parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
          (set (reg:V2SI d1) (...)))
(set (...) (reg:V2SI d0))
(set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
I think I like the second choice best: I might experiment (from the intrinsics side), to see how feasible it looks.
Julian
On 30 November 2010 14:51, Julian Brown julian@codesourcery.com wrote:
I think we need to somehow enhance MEM_REF, or maybe generate a MEM_REF for the first vector and a builtin after it.
Yeah, keeping these things looking like memory references to most of the compiler seems like a good plan.
Is it possible to have a list of MEM_REFs and a builtin after them:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
If there are other uses of the accesses, i.e., MEM_REF (addr) is used somewhere else in the loop, the vectorizer will have to create another v0' = MEM_REF (addr) for it.
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
(parallel [(set (reg) (mem addr))
           (set (reg+2) (mem (plus addr 8)))
           (set (reg+4) (mem (plus addr 16)))])
That should work fine I think -- but how to do register-allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so aren't directly comparable.
PowerPC also has load/store multiple, but I guess they are generated in the same phase as on ARM. Maybe there are other architectures that allocate contiguous registers, but do so earlier?
Also, as you probably know, we have to keep in mind that the registers do not have to be contiguous: d, d+2, d+4 is fine as well - and this case is very important for us since this is the way to work with quadword vectors.
Choices I can think of are:
- Use "big integer" modes (TImode, OImode, XImode...), as in the
present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
- Use "big vector" modes (representing multiple vector registers --
up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
(parallel [(use (reg:V8SI 0))
           (set (subreg:V2SI (match_dup 0) 0) (mem))
           (set (subreg:V2SI (match_dup 0) 8) (plus (mem) 8))
           (set (subreg:V2SI (match_dup 0) 16) (plus (mem) 16))
           (set (subreg:V2SI (match_dup 0) 24) (plus (mem) 24))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering vld*/vst* intrinsics -- but that's something I'm planning to do anyway, and probably also tweaking the way "foo.val[X]" access (again for intrinsics) is expanded in the front-ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
Why does the vectorizer have to know about this?
- Treat vld* & vst* like (or even as?) libcalls. Regular function
calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store,
(set (reg:V2SI d0) (...))
(set (reg:V2SI d1) (...))
(call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
Load,
(parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
          (set (reg:V2SI d1) (...)))
(set (...) (reg:V2SI d0))
(set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
Is having a scan for special instructions before/after/during register allocation not an option?
Thanks, Ira
On Wed, 1 Dec 2010 11:16:16 +0200 Ira Rosen ira.rosen@linaro.org wrote:
v0 = MEM_REF (addr)
v1 = MEM_REF (addr + 8B)
v2 = MEM_REF (addr + 16B)
builtin (v0, v1, v2, stride=3, reg_stride=1, ...)
Would the builtin be changing the semantics of the preceding MEM_REF codes? If so I don't like this much (the potential for the builtin getting "separated" from the MEM_REFS by optimisation passes and causing subtle breakage seems too high). But if eliding the builtin would simply cause the code to degrade into separate loads/stores, I guess that would be OK.
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
I can imagine it might get awkward if the builtin ends up in a different basic block from the MEMs -- but I don't know enough about the tree optimization passes to say if that's likely to ever happen.
If there are other uses of the accesses, i.e., MEM_REF (addr) is used somewhere else in the loop, the vectorizer will have to create another v0' = MEM_REF (addr) for it.
OK, I see. I think this representation sounds quite reasonable to me, FWIW.
to be expanded into:
<regular RTL mem refs> (addr)
NOTE (...)
I guess we can do something similar to load_multiple here (but it probably requires changes in neon.md as well).
Yeah, I like that idea. So we might have something like:
(parallel [(set (reg) (mem addr))
           (set (reg+2) (mem (plus addr 8)))
           (set (reg+4) (mem (plus addr 16)))])
That should work fine I think -- but how to do register-allocation on these values remains an open problem (since GCC has no direct way of saying "allocate this set of registers contiguously"). ARM load & store multiple are only used in a couple of places, where hard regnos are already known, so aren't directly comparable.
PowerPC also has load/store multiple, but I guess they are generated in the same phase as on ARM. Maybe there are other architectures that allocate contiguous registers, but do so earlier?
I don't know about other architectures which do that.
Also, as you probably know, we have to keep in mind that the registers do not have to be contiguous: d, d+2, d+4 is fine as well - and this case is very important for us since this is the way to work with quadword vectors.
Right...
Choices I can think of are:
- Use "big integer" modes (TImode, OImode, XImode...), as in the
present patterns, but with (post-reload?) splitters to create the parallel RTX as above. So, prior to reload, the RTL would look different (like the existing patterns, with an UNSPEC?), so as to allocate the "big" integer to consecutive vector registers. This doesn't really gain anything, and I think it'd really be best if those types could be removed from the NEON backend anyway.
- Use "big vector" modes (representing multiple vector registers --
up to e.g. V16SImode). We'd have to make sure these *never* end up in core registers somehow, since that would certainly lead to reload failure. Then the parallel might be written something like this (for vld1 with four D registers):
(parallel [(use (reg:V8SI 0))
           (set (subreg:V2SI (match_dup 0) 0) (mem))
           (set (subreg:V2SI (match_dup 0) 8) (plus (mem) 8))
           (set (subreg:V2SI (match_dup 0) 16) (plus (mem) 16))
           (set (subreg:V2SI (match_dup 0) 24) (plus (mem) 24))])
Or perhaps the same but with vec_select instead of subreg (the patch I'm working on suggests that subreg on vector types works fine, most of the time). This would require altering vld*/vst* intrinsics -- but that's something I'm planning to do anyway, and probably also tweaking the way "foo.val[X]" access (again for intrinsics) is expanded in the front-ends, as a NEON-specific hack. The main worry is that I'm not sure how well the register allocator & reload will handle these large vectors.
The vectorizer would need a way of extracting elements or vectors from these extra-wide vectors: in terms of RTL, subreg or vec_select should suffice for that.
Why does the vectorizer have to know about this?
- Treat vld* & vst* like (or even as?) libcalls. Regular function
calls have kind-of similar constraints on register usage to these multi-register operations (i.e. arguments must be in consecutive registers), so as a hack we could reuse some of that mechanism (or create a similar mechanism), provided that we can live with vld* & vst* always working on a fixed list of registers. E.g. we'd end up with RTL:
Store,
(set (reg:V2SI d0) (...))
(set (reg:V2SI d1) (...))
(call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
Load,
(parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
          (set (reg:V2SI d1) (...)))
(set (...) (reg:V2SI d0))
(set (...) (reg:V2SI d1))
(Not necessarily with actual call_insn RTL: I just wrote it like that to illustrate the general idea.)
One could envisage further hacks to lift the restriction on the fixed registers used, e.g. by allocating them in a round-robin fashion per function. Doing things this way would also require intrinsic-expansion changes, so isn't necessarily any easier than (2).
Is having a scan for special instructions before/after/during register allocation not an option?
I guess it is -- I think a pass before register allocation is broadly equivalent to choice (3) above -- knowing that the register allocator isn't smart enough to deal with contiguous register sets, so replacing those registers with hard regs (and copying values in/out as appropriate) before allocation proper.
The ideal solution would be to deal with those registers during the register allocation phase, i.e. by adding generalised support for allocating contiguous (or stride-2, ...) registers to the register allocator. I'm kind of assuming this is intractable in practice though, useful as it'd be. I'm far from an expert in the "new" RA though -- maybe others have input on how feasible it might be to add such support?
Julian
On Wed, Dec 01, 2010 at 11:24:01AM +0000, Julian Brown wrote:
PowerPC also has load/store multiple, but I guess they are generated in the same phase as for ARM. Maybe there are other architectures that do that allocate contiguous register but earlier?
I don't know about other architectures which do that.
PowerPC essentially restricts load/store multiple instruction use to the prologue/epilogue. There are some tricks it can play with load-string instructions, but those are restricted to hard registers (see movmemsi_$Nreg patterns in rs6000.md). There's also the *ldmsi$N patterns in rs6000.md, but it's not clear to me what conditions those get generated under (combine?).
-Nathan
On Wed, Dec 01, 2010 at 11:16:16AM +0200, Ira Rosen wrote:
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
This is a problem I've struggled with before. My only caution is that representing the MEM_REF's separately from the deinterleaving in the IR allows all sorts of ways (many we haven't thought of yet) for them to get separated, and there's no instruction to efficiently implement the deinterleaving from registers. For instance, suppose a pseudo gets propagated into the builtin and we can't find the MEM_REFs any more. The resulting code could easily be worse than pre-vectorization.
On 1 December 2010 17:57, Daniel Jacobowitz dan@codesourcery.com wrote:
On Wed, Dec 01, 2010 at 11:16:16AM +0200, Ira Rosen wrote:
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
This is a problem I've struggled with before. My only caution is that representing the MEM_REF's separately from the deinterleaving in the IR allows all sorts of ways (many we haven't thought of yet) for them to get separated, and there's no instruction to efficiently implement the deinterleaving from registers. For instance, suppose a pseudo gets propagated into the builtin and we can't find the MEM_REFs any more. The resulting code could easily be worse than pre-vectorization.
I see. So one builtin for everything, like
vector_load_deinterleave (v0, v1, v2,..., stride,...)
is our only option?
Thanks, Ira
On Thu, Dec 02, 2010 at 10:54:32AM +0200, Ira Rosen wrote:
On 1 December 2010 17:57, Daniel Jacobowitz dan@codesourcery.com wrote:
On Wed, Dec 01, 2010 at 11:16:16AM +0200, Ira Rosen wrote:
The meaning of the builtin (or maybe a new tree code would be better?) is that the elements of v0, v1 and v2 are deinterleaved. I wanted the MEM_REFs, since we actually have three data accesses here, and something (builtin or tree code) to indicate the deinterleaving. Since the vectors are passed to the builtin, I don't think it's a problem if the statements get separated. When the expander sees the builtin, it has to remove the loads it created for the MEM_REFs and create a new "vector load multiple and deinterleave". Is that possible?
This is a problem I've struggled with before. My only caution is that representing the MEM_REF's separately from the deinterleaving in the IR allows all sorts of ways (many we haven't thought of yet) for them to get separated, and there's no instruction to efficiently implement the deinterleaving from registers. For instance, suppose a pseudo gets propagated into the builtin and we can't find the MEM_REFs any more. The resulting code could easily be worse than pre-vectorization.
I see. So one builtin for everything, like
vector_load_deinterleave (v0, v1, v2,..., stride,...)
is our only option?
It's not the only option; the way you've described might work, too.
But yes, it's my opinion that a single builtin is less likely to generate something the compiler can't recover from.
On Wed, 17 Nov 2010 11:21:03 +0000 Julian Brown julian@codesourcery.com wrote:
Shouldn't be more than 1-2 days, but I've been distracted by other bugs...
Well, this took a little longer than I thought. I think it's a good incremental improvement though: it should improve code in a few cases in little-endian mode, as well as fixing some of the bugs in big-endian mode.
I wrote down a few rules regarding element numberings and subregs, which might make the patch easier to understand:
1. For D-sized subregs of Q registers:
  Offset 0: least-significant part (always)
  Offset 8: most-significant part (always)
By my (possibly incorrect) reading of:
http://gcc.gnu.org/onlinedocs/gccint/Regs-and-Memory.html#index-subreg-2013
This is true because the byte offsets follow memory ordering: in either little-endian or big-endian mode, the less-significant D register in a pair comprising a Q register corresponds to a lower memory address.
2. For lane access we should use:
Big-endian mode,
  least significant  ....  most significant
  Dn[halfelts-1]..Dn[0]   D(n+1)[halfelts-1]..D(n+1)[0]
Little-endian mode,
  least significant  ....  most significant
  Dn[0]..Dn[halfelts-1]   D(n+1)[0]..D(n+1)[halfelts-1]
3. GCC's expectation for lane-access is that:
Big-endian mode,
  least significant  ....  most significant
  D[elts-1] or Q[elts-1]  ....  D[0] or Q[0]
Little-endian mode,
  least significant  ....  most significant
  D[0] or Q[0]  ....  D[elts-1] or Q[elts-1]
4. "Lo" refers to the least-significant part of a vector register, and "hi" refers to the most-significant part.
The patch touches quite a number of patterns, updating them to the "new" interpretation of lane numbers outlined above. A couple of things remain a little awkward:
* Two places in generic code depend on the endianness of the target in ways which don't seem to match (4) above (supportable_widening_operation and vect_permute_store_chain): "lo" and "hi" have their meanings exchanged in big-endian mode. I've worked around this by introducing a VECTOR_ELEMENTS_BIG_ENDIAN macro, which is always false on ARM: I'm not very happy with it though. The effect is simply to stop the "swapping" happening for ARM. I vaguely think the middle-end code which does this swapping is probably incorrect to start with.
* I've used subregs to avoid emitting useless "vmov" instructions in many cases, but this causes problems (in combine) with unpack instructions which use vector sign_extend/zero_extend operations: basically I think that something in combine doesn't understand that the zero/sign extensions are happening element-wise, thus considers them to be somehow equivalent to the subreg op. I'm not entirely sure what's causing this, so I've just worked around it using UNSPEC for now (marked with FIXME).
Tested very lightly (vect.exp only, little/big-endian, with and without -mvectorize-with-neon-quad). Currently against CS's trunk branch. Any comments or suggestions? I'm not sure where I should apply this if it's OK and passes more thorough testing...
Cheers,
Julian
ChangeLog
gcc/
        * defaults.h (VECTOR_ELEMENTS_BIG_ENDIAN): Define.
        * tree-vect-data-refs.c (vect_permute_store_chain): Use
        VECTOR_ELEMENTS_BIG_ENDIAN.
        * tree-vect-stmts.c (supportable_widening_operation): Likewise.
        * config/arm/arm.c (arm_can_change_mode_class): New.
        * config/arm/arm.h (VECTOR_ELEMENTS_BIG_ENDIAN): New.
        (CANNOT_CHANGE_MODE_CLASS): Use arm_can_change_mode_class.
        * config/arm/arm-protos.h (arm_can_change_mode_class): Add
        prototype.
        * config/arm/neon.md (SE_magic): New code attribute.
        (vec_extract<mode>): Alter element numbering used for extract
        operations in big-endian mode.
        (vec_shr_<mode>, vec_shl_<mode>): Disable in big-endian mode.
        (neon_move_lo_quad_<mode>, neon_move_hi_quad_<mode>): Remove.
        (move_hi_quad_<mode>, move_lo_quad_<mode>): Use subregs.
        (neon_vec_unpack<US>_lo_move, neon_vec_unpack<US>_hi_mode): Use
        s_register_operand, fix output formatting.
        (vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>): Fix for
        big-endian mode.
        (neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>): Use
        s_register_operand, fix output formatting.
        (vec_widen_<US>mult_lo_<mode>, vec_widen_<US>mult_hi_<mode>): Fix
        for big-endian mode.
        (neon_unpack_<US>_mode): Use s_register_operand.
        (vec_unpack<US>_lo_<mode>, vec_unpack<US>_hi_<mode>): Use subregs
        instead of neon_vget_low/high. Work around combiner breakage.
        (neon_vec_<US>mult_<mode>): (D reg version) use s_register_operand.
        (vec_widen_<US>mult_hi_<mode>, vec_widen_<US>mult_lo_<mode>):
        Similar (D reg versions).
        (vec_pack_trunc_<mode>): (D reg version) Change to expander. Use
        s_register_operand. Use vector subregs.
        (neon_vec_pack_trunc_<mode>): Use s_register_operand.
        (vec_pack_trunc_<mode>): (Q reg version) Use s_register_operand.
        Fix for big-endian mode.
On Tue, 30 Nov 2010, Julian Brown wrote:
* defaults.h (VECTOR_ELEMENTS_BIG_ENDIAN): Define.
Apart from the point that new target macros should be hooks, the *very first* thing to do with any new macro or hook is to write the .texi documentation, which appears to be missing from this patch. This is best done before making any consequent code changes. That documentation needs to make clear:
* What the effects are on the semantics of GNU C code (that uses generic vector extensions - which in 4.6 include subscripting), if any.
* What the effects are on the semantics of GENERIC and GIMPLE using vector types, if any. If the semantics of memory references to such types are affected, or the semantics of any other existing GENERIC or GIMPLE operation (most likely any operations that may exist involving lane numbers) then you need to make clear how, and ensure that any documentation / comments for that operation are updated as well; the patch submission needs to make clear how you have audited all existing code for correctness given such a change - any code mentioning a GENERIC or GIMPLE code may do transformations assuming particular semantics.
* Likewise, for all machine-independent RTL operations.
* For UNSPECs the documentation issue doesn't arise, but the audit is still relevant.
On Tue, 30 Nov 2010 16:06:32 +0000 (UTC) "Joseph S. Myers" joseph@codesourcery.com wrote:
On Tue, 30 Nov 2010, Julian Brown wrote:
* defaults.h (VECTOR_ELEMENTS_BIG_ENDIAN): Define.
Apart from the point that new target macros should be hooks, the *very first* thing to do with any new macro or hook is to write the .texi documentation, which appears to be missing from this patch. This is best done before making any consequent code changes. That documentation needs to make clear:
I did say I didn't like that bit much :-). With this particular patch (intended merely as a stop-gap measure until we get "array order" loads/stores working with the vectorizer), I don't think it's a good idea to introduce changes to GENERIC/GIMPLE semantics, and you're quite right that adding that macro (or an equivalent target hook) implies doing exactly that. That wasn't my intention at all at this stage, though we may eventually end up doing something similar.
I'll try to think of an alternative way of getting the vectorizer to do what I want, for now...
Thanks,
Julian