- Continued looking into NEON special loads and stores.
- Benchmarks: concentrated on EEMBC Telecom:
- autcor gets vectorized
- viterbi, besides strided data accesses, needs to sink conditional
stores to allow if-conversion and make the main loop vectorizable.
Since the potential here is 4x, I think it's worthwhile to work on
this.
- conven, fbital also have control-flow issue, but much more
complicated than viterbi
- fft has a problem with loop count, I would like to investigate
this a bit more
- diffmeasure doesn't seem to have vectorization potential
- Fixed GCC PR 46663 on trunk, testing the fix for 4.3, 4.4, 4.5.
Hi,
Here's a work-in-progress patch which fixes many execution failures
seen in big-endian mode when -mvectorize-with-neon-quad is in effect
(which is soon to be the default, if we stick to the current plan).
But, it's pretty hairy, and I'm not at all convinced it's not working
"for the wrong reason" in a couple of places.
I'm mainly posting to gauge opinions on what we should do in big-endian
mode. This patch works with the assumption that quad-word vectors in
big-endian mode are in "vldm" order (i.e. with constituent double-words
in little-endian order: see previous discussions). But, that's pretty
confusing, leads to less than optimal code, and is bound to cause more
problems in the future. So I'm not sure how much effort to expend on
making it work right, given that we might be throwing that vector
ordering away in the future (at least in some cases: see below).
The "problem" patterns are as follows.
* Full-vector shifts: these don't work with big-endian vldm-order quad
vectors. For now, I've disabled them, although they could
potentially be implemented using vtbl (at some cost).
* Widening moves (unpacks) & widening multiplies: when widening from
D-reg to Q-reg size, we must swap double-words in the result (I've
done this with vext). This seems to work fine, but what "hi" and "lo"
refer to is rather muddled (in my head!). Also they should be
expanders instead of emitting multiple assembler insns.
* Narrowing moves: implemented by "open-coded" permute & vmovn (for 2x
D-reg -> D-reg), or 2x vmovn and vrev64.32 for Q-regs (as
suggested by Paul). These seem to work fine.
* Reduction operations: when reducing Q-reg values, GCC currently
tries to extract the result from the "wrong half" of the reduced
vector. The fix in the attached patch is rather dubious, but seems
to work (I'd like to understand why better).
We can sort those bits out, but the question is, do we want to go that
route? Vectors are used in three quite distinct ways by GCC:
1. By the vectorizer.
2. By the NEON intrinsics.
3. By the "generic vector" support.
For the first of these, I think we can get away with changing the
vectorizer to use explicit "array" loads and stores (i.e. vldN/vstN), so
that vector registers will hold elements in memory order -- so, all the
contortions in the attached patch will be unnecessary. ABI issues are
irrelevant, since vectors are "invisible" at the source code layer
generally, including at ABI boundaries.
For the second, intrinsics, we should do exactly what the user
requests: so, vectors are essentially treated as opaque objects. This
isn't a problem as such, but might mean that instruction patterns
written using "canonical" RTL for the vectorizer can't be shared with
intrinsics when the order of elements matters. (I'm not sure how many
patterns this would refer to at present; possibly none.)
The third case would continue to use "vldm" ordering, so if users
inadvertantly write code such as:
res = vaddq_u32 (*foo, bar);
instead of writing an explicit vld* intrinsic (for the load of *foo),
the result might be different from what they expect. It'd be nice to
diagnose such code as erroneous, but that's another issue.
The important observation is that vectors from case 1 and from cases 2/3
never interact: it's quite safe for them to use different element
orderings, without extensive changes to GCC infrastructure (i.e.,
multiple internal representations). I don't think I quite realised this
previously.
So, anyway, back to the patch in question. The choices are, I think:
1. Apply as-is (after I've ironed out the wrinkles), and then remove
the "ugly" bits at a later point when vectorizer "array load/store"
support is implemented.
2. Apply a version which simply disables all the troublesome
patterns until the same support appears.
Apologies if I'm retreading old ground ;-).
(The CANNOT_CHANGE_MODE_CLASS fragment is necessary to generate good
code for the quad-word vec_pack_trunc_<mode> pattern. It would
eventually be applied as a separate patch.)
Thoughts?
Julian
ChangeLog
gcc/
* config/arm/arm.h (CANNOT_CHANGE_MODE_CLASS): Allow changing mode
of vector registers.
* config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
big-endian mode.
(reduc_splus_<mode>, reduc_smin_<mode>, reduc_smax_<mode>)
(reduc_umin_<mode>, reduc_umax_<mode>)
(neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
(neon_vec_<US>mult_lo_<mode>, neon_vec_<US>mult_hi_<mode>)
(vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Handle
big-endian mode for quad-word vectors.
Hi there. I've had a few questions recently about how to build a
cross compiler, so I took a stab at writing the steps down in a
Makefile. See:
https://code.launchpad.net/~michaelh1/+junk/cross-build
Hopefully it's easy to follow. It uses a binary sysroot and gives you
vanilla binutils 2.20 and Linaro GCC 2010.11 in a good enough way that
you can cross-compile for Maverick. The script is minimal and trades
readability for flexibility.
Note that Marcin's cross compiler packages or the Embedian toolchains
are a better way to go, but if you want to see the steps involved
check out the script.
Marcin or Matthias, would you mind reviewing it?
-- Michael