Julian Brown julian@codesourcery.com wrote on 03/11/2010 11:55:59 AM:
On Mon, 1 Nov 2010 15:57:11 +0200 Ira Rosen IRAR@il.ibm.com wrote:
It looks like it's enough to implement targetm.vectorize. autovectorize_vector_sizes for NEON in order to enable initial auto-detection of vector size. With the attached patch and -mvectorize-with-neon-quad flag, the vectorizer first tries to vectorize for 128 bit, and if this fails, it tries to vectorize for 64 bit. For example, in the attached testcase number of iterations is too small for 128 bit (first 2 iterations have to be peeled in order to align the array accesses), but is sufficient for 64 bit (the accesses are aligned here).
Cool, I hadn't found out about that hook yet (nor that UNITS_PER_SIMD_WORD had disappeared in favour of TARGET_VECTORIZE_PREFERRED_SIMD_MODE!).
It's pretty new.
I'd appreciate your comments on the patch, and I also have a few questions: 1. Why the default vector size is 64?
Ah, well, that was a decision made several years ago, and may no longer be the optimal choice. There are several possible reasons:
- NEON hardware available at the time (Cortex-A8) only processed data
in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
- Initially misaligned operations were not supported, so larger
vectors increase the chances that loop versioning would be required.
- Generally, setup/tear-down code is assumed to be larger for wider
vectors, and ARM hardware generally isn't very well-endowed with cache.
The choice wasn't supported by benchmarking, since IIRC we didn't have real hardware when we implemented initial NEON support, and nobody's revisited the decision since.
- Where is the place of NEON vectorization tests? I found NEON tests
with intrinsics at gcc.target/arm, is that the right place?
Looks like a good place to me. It might be an idea to avoid putting stuff in gcc.target/arm/neon, since most of those tests are auto-generated.
- According to gcc.dg/vect/vect.exp the only flag that is used for
NEON (in addition to target independent flags) is -ffast-math. Is that enough?
I'm not sure what you're asking here: there aren't any other target-specific flags to tweak NEON behaviour. Generally to enable NEON you need to set the float ABI and FPU variant (if they're not configured-in to the compiler), e.g.:
-mfloat-abi=softfp/-mfloat-abi=hard -mfpu=neon* [-march=armv7-a]
- there are several variants for this, e.g. neon, neon-fp16,
neon-vfpv4 ... generally -mfpu=neon should do though, for Cortex-A8 chips at least.
gcc.dg/vect/vect.exp runs all the vectorizer tests in gcc.dg/vect directory. I was wondering why the only flag used for NEON is -ffast-math and why other flags, like -mfpu=neon, are not used.
- config/arm/arm.c (arm_autovectorize_vector_sizes): New
function. (TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Looks fine to me (though I'm not an ARM maintainer, so can't approve it upstream). It's a bit of a pity the hook is defined in such a way that it doesn't support non-power-of-two sized vectors, but never mind...
I am curious what platform has non-power-of-two sized vectors?
I wonder if this allows us to remove -mvectorize-with-neon-quad already (or, perhaps, wire it on but make the option a no-op, for possible backward-compatibility reasons)?
What vector size we want as default? We can make 128 default and fall back to 64 if vectorization fails for some reason (or we can always start with 64 and switch to 128 if necessary). We can also add -mvectorize-with-neon-double, and use it and -mvectorize-with-neon-quad to set vector size and prevent auto-detection.
The best solution would be to evaluate costs for both size options. And it is a reasonable amount of work to do that. But the unknown loop bound case will require versioning between two vector options in addition to possible versioning between vector/scalar loops.
I don't know if we can make a decision without tuning, especially since
- NEON hardware available at the time (Cortex-A8) only processed data
in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
This is why I thought that starting from the option to switch to 64 if 128 fails (with -mvectorize-with-neon-quad flag) is the least intrusive.
Thanks, Ira
Thanks,
Julian