Re: Auto-detection of vector size for NEON

3 Nov 2010

      Julian Brown julian@codesourcery.com wrote on 03/11/2010 11:55:59 AM:
...
On Mon, 1 Nov 2010 15:57:11 +0200
Ira Rosen IRAR@il.ibm.com wrote:
...
It looks like it's enough to implement targetm.vectorize.
autovectorize_vector_sizes for NEON in order to enable initial
auto-detection of vector size. With the attached patch and
-mvectorize-with-neon-quad flag, the vectorizer first tries to
vectorize for 128 bit, and if this fails, it tries to vectorize for
64 bit. For example, in the attached testcase number of iterations is
too small for 128 bit (first 2 iterations have to be peeled in order
to align the array accesses), but is sufficient for 64 bit (the
accesses are aligned here).
Cool, I hadn't found out about that hook yet (nor that
UNITS_PER_SIMD_WORD had disappeared in favour of
TARGET_VECTORIZE_PREFERRED_SIMD_MODE!).
It's pretty new.
...
...
I'd appreciate your comments on the patch, and I also have a few
questions: 1. Why the default vector size is 64?
Ah, well, that was a decision made several years ago, and may no longer
be the optimal choice. There are several possible reasons:

NEON hardware available at the time (Cortex-A8) only processed data

in 64-bit chunks, so Q-reg operations weren't necessarily any faster
than D-reg operations (that may still be true).

Initially misaligned operations were not supported, so larger

vectors increase the chances that loop versioning would be required.

Generally, setup/tear-down code is assumed to be larger for wider

vectors, and ARM hardware generally isn't very well-endowed with cache.
The choice wasn't supported by benchmarking, since IIRC we didn't
have real hardware when we implemented initial NEON support, and
nobody's revisited the decision since.
...

Where is the place of NEON vectorization tests? I found NEON tests

with intrinsics at gcc.target/arm, is that the right place?
Looks like a good place to me. It might be an idea to avoid putting
stuff in gcc.target/arm/neon, since most of those tests are
auto-generated.
...

According to gcc.dg/vect/vect.exp the only flag that is used for

NEON (in addition to target independent flags) is -ffast-math. Is
that enough?
I'm not sure what you're asking here: there aren't any other
target-specific flags to tweak NEON behaviour. Generally to enable NEON
you need to set the float ABI and FPU variant (if they're not
configured-in to the compiler), e.g.:
-mfloat-abi=softfp/-mfloat-abi=hard -mfpu=neon* [-march=armv7-a]

there are several variants for this, e.g. neon, neon-fp16,

neon-vfpv4 ... generally -mfpu=neon should do though, for Cortex-A8
chips at least.
gcc.dg/vect/vect.exp runs all the vectorizer tests in gcc.dg/vect
directory. I was wondering why the only flag used for NEON is -ffast-math
and why other flags, like -mfpu=neon, are not used.
...
...

config/arm/arm.c (arm_autovectorize_vector_sizes): New

function.
   (TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Looks fine to me (though I'm not an ARM maintainer, so can't approve it
upstream). It's a bit of a pity the hook is defined in such a way that
it doesn't support non-power-of-two sized vectors, but never mind...
I am curious what platform has non-power-of-two sized vectors?
...
I wonder if this allows us to remove -mvectorize-with-neon-quad
already (or, perhaps, wire it on but make the option a no-op, for
possible backward-compatibility reasons)?
What vector size we want as default? We can make 128 default and fall back
to 64 if vectorization fails for some reason (or we can always start with
64 and switch to 128 if necessary). We can also add
-mvectorize-with-neon-double, and use it and -mvectorize-with-neon-quad to
set vector size and prevent auto-detection.
The best solution would be to evaluate costs for both size options. And it
is a reasonable amount of work to do that. But the unknown loop bound case
will require versioning between two vector options in addition to possible
versioning between vector/scalar loops.
I don't know if we can make a decision without tuning, especially since
...

NEON hardware available at the time (Cortex-A8) only processed data

in 64-bit chunks, so Q-reg operations weren't necessarily any faster
than D-reg operations (that may still be true).
This is why I thought that starting from the option to switch to 64 if 128
fails (with -mvectorize-with-neon-quad flag) is the least intrusive.
Thanks,
Ira
...
Thanks,
Julian

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: Auto-detection of vector size for NEON