Hi,
It looks like it's enough to implement targetm.vectorize.autovectorize_vector_sizes for NEON in order to enable initial auto-detection of vector size. With the attached patch and the -mvectorize-with-neon-quad flag, the vectorizer first tries to vectorize for 128 bit, and if this fails, it tries to vectorize for 64 bit. For example, in the attached testcase the number of iterations is too small for 128 bit (the first 2 iterations have to be peeled in order to align the array accesses), but is sufficient for 64 bit (the accesses are aligned there).
I'd appreciate your comments on the patch, and I also have a few questions:

1. Why is the default vector size 64?
2. Where is the place for NEON vectorization tests? I found NEON tests with intrinsics in gcc.target/arm; is that the right place?
3. According to gcc.dg/vect/vect.exp, the only flag used for NEON (in addition to the target-independent flags) is -ffast-math. Is that enough?
Thanks, Ira
ChangeLog:
	* config/arm/arm.c (arm_autovectorize_vector_sizes): New function.
	(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 166032)
+++ config/arm/arm.c	(working copy)
@@ -246,6 +246,7 @@ static bool arm_builtin_support_vector_misalignmen
                                                      const_tree type, int misalignment,
                                                      bool is_packed);
+static unsigned int arm_autovectorize_vector_sizes (void);
 
 /* Table of machine attributes.  */
@@ -391,6 +392,9 @@ static const struct default_options arm_option_opt
 #define TARGET_VECTOR_MODE_SUPPORTED_P arm_vector_mode_supported_p
 #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
 #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE arm_preferred_simd_mode
+#undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
+#define TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES \
+  arm_autovectorize_vector_sizes
 
 #undef TARGET_MACHINE_DEPENDENT_REORG
 #define TARGET_MACHINE_DEPENDENT_REORG arm_reorg
@@ -23223,6 +23227,12 @@ arm_expand_sync (enum machine_mode mode,
     }
 }
 
+static unsigned int
+arm_autovectorize_vector_sizes (void)
+{
+  return TARGET_NEON_VECTORIZE_QUAD ? 16 | 8 : 0;
+}
+
 static bool
 arm_vector_alignment_reachable (const_tree type, bool is_packed)
 {
test:
#define N 5

unsigned int ub[N+2] = {1,1,6,39,12,18,14};
unsigned int uc[N+2] = {2,3,4,11,6,7,1};

void main1 ()
{
  int i;
  unsigned int udiff = 2;
  unsigned int umax = 10;

  for (i = 0; i < N; i++)
    {
      /* Summation.  */
      udiff += (ub[i+2] - uc[i]);

      /* Maximum.  */
      umax = umax < uc[i+2] ? uc[i+2] : umax;
    }
}
On Mon, 1 Nov 2010 15:57:11 +0200 Ira Rosen IRAR@il.ibm.com wrote:
It looks like it's enough to implement targetm.vectorize.autovectorize_vector_sizes for NEON in order to enable initial auto-detection of vector size. With the attached patch and the -mvectorize-with-neon-quad flag, the vectorizer first tries to vectorize for 128 bit, and if this fails, it tries to vectorize for 64 bit. For example, in the attached testcase the number of iterations is too small for 128 bit (the first 2 iterations have to be peeled in order to align the array accesses), but is sufficient for 64 bit (the accesses are aligned there).
Cool, I hadn't found out about that hook yet (nor that UNITS_PER_SIMD_WORD had disappeared in favour of TARGET_VECTORIZE_PREFERRED_SIMD_MODE!).
I'd appreciate your comments on the patch, and I also have a few questions: 1. Why is the default vector size 64?
Ah, well, that was a decision made several years ago, and may no longer be the optimal choice. There are several possible reasons:
1. NEON hardware available at the time (Cortex-A8) only processed data in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
2. Initially misaligned operations were not supported, so larger vectors increase the chances that loop versioning would be required.
3. Generally, setup/tear-down code is assumed to be larger for wider vectors, and ARM hardware generally isn't very well-endowed with cache.
The choice wasn't supported by benchmarking, since IIRC we didn't have real hardware when we implemented initial NEON support, and nobody's revisited the decision since.
2. Where is the place for NEON vectorization tests? I found NEON tests with intrinsics in gcc.target/arm; is that the right place?
Looks like a good place to me. It might be an idea to avoid putting stuff in gcc.target/arm/neon, since most of those tests are auto-generated.
3. According to gcc.dg/vect/vect.exp, the only flag used for NEON (in addition to the target-independent flags) is -ffast-math. Is that enough?
I'm not sure what you're asking here: there aren't any other target-specific flags to tweak NEON behaviour. Generally to enable NEON you need to set the float ABI and FPU variant (if they're not configured-in to the compiler), e.g.:
-mfloat-abi=softfp/-mfloat-abi=hard -mfpu=neon* [-march=armv7-a]
* there are several variants for this, e.g. neon, neon-fp16, neon-vfpv4 ... generally -mfpu=neon should do though, for Cortex-A8 chips at least.
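Written out as a full command line, that might look like the following (the toolchain prefix and exact -march/-mfpu values are illustrative; substitute whatever matches the actual target and ABI):

```shell
# Illustrative NEON autovectorization invocation; adjust the triplet,
# float ABI (softfp vs hard) and -march/-mfpu for the actual target.
arm-none-linux-gnueabi-gcc -O2 -ftree-vectorize -ffast-math \
    -mfpu=neon -mfloat-abi=softfp -march=armv7-a -c test.c
```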
	* config/arm/arm.c (arm_autovectorize_vector_sizes): New function.
	(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Looks fine to me (though I'm not an ARM maintainer, so can't approve it upstream). It's a bit of a pity the hook is defined in such a way that it doesn't support non-power-of-two sized vectors, but never mind...
I wonder if this allows us to remove -mvectorize-with-neon-quad already (or, perhaps, wire it on but make the option a no-op, for possible backward-compatibility reasons)?
Thanks,
Julian
Julian Brown julian@codesourcery.com wrote on 03/11/2010 11:55:59 AM:
On Mon, 1 Nov 2010 15:57:11 +0200 Ira Rosen IRAR@il.ibm.com wrote:
It looks like it's enough to implement targetm.vectorize.autovectorize_vector_sizes for NEON in order to enable initial auto-detection of vector size. With the attached patch and the -mvectorize-with-neon-quad flag, the vectorizer first tries to vectorize for 128 bit, and if this fails, it tries to vectorize for 64 bit. For example, in the attached testcase the number of iterations is too small for 128 bit (the first 2 iterations have to be peeled in order to align the array accesses), but is sufficient for 64 bit (the accesses are aligned there).
Cool, I hadn't found out about that hook yet (nor that UNITS_PER_SIMD_WORD had disappeared in favour of TARGET_VECTORIZE_PREFERRED_SIMD_MODE!).
It's pretty new.
I'd appreciate your comments on the patch, and I also have a few questions: 1. Why is the default vector size 64?
Ah, well, that was a decision made several years ago, and may no longer be the optimal choice. There are several possible reasons:
1. NEON hardware available at the time (Cortex-A8) only processed data in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
2. Initially misaligned operations were not supported, so larger vectors increase the chances that loop versioning would be required.
3. Generally, setup/tear-down code is assumed to be larger for wider vectors, and ARM hardware generally isn't very well-endowed with cache.
The choice wasn't supported by benchmarking, since IIRC we didn't have real hardware when we implemented initial NEON support, and nobody's revisited the decision since.
2. Where is the place for NEON vectorization tests? I found NEON tests with intrinsics in gcc.target/arm; is that the right place?
Looks like a good place to me. It might be an idea to avoid putting stuff in gcc.target/arm/neon, since most of those tests are auto-generated.
3. According to gcc.dg/vect/vect.exp, the only flag used for NEON (in addition to the target-independent flags) is -ffast-math. Is that enough?
I'm not sure what you're asking here: there aren't any other target-specific flags to tweak NEON behaviour. Generally to enable NEON you need to set the float ABI and FPU variant (if they're not configured-in to the compiler), e.g.:
-mfloat-abi=softfp/-mfloat-abi=hard -mfpu=neon* [-march=armv7-a]
* there are several variants for this, e.g. neon, neon-fp16, neon-vfpv4 ... generally -mfpu=neon should do though, for Cortex-A8 chips at least.
gcc.dg/vect/vect.exp runs all the vectorizer tests in gcc.dg/vect directory. I was wondering why the only flag used for NEON is -ffast-math and why other flags, like -mfpu=neon, are not used.
	* config/arm/arm.c (arm_autovectorize_vector_sizes): New function.
	(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Looks fine to me (though I'm not an ARM maintainer, so can't approve it upstream). It's a bit of a pity the hook is defined in such a way that it doesn't support non-power-of-two sized vectors, but never mind...
I am curious what platform has non-power-of-two sized vectors?
I wonder if this allows us to remove -mvectorize-with-neon-quad already (or, perhaps, wire it on but make the option a no-op, for possible backward-compatibility reasons)?
What vector size do we want as the default? We can make 128 the default and fall back to 64 if vectorization fails for some reason (or we can always start with 64 and switch to 128 if necessary). We can also add -mvectorize-with-neon-double, and use it and -mvectorize-with-neon-quad to set the vector size and prevent auto-detection.
The best solution would be to evaluate costs for both size options. And it is a reasonable amount of work to do that. But the unknown loop bound case will require versioning between two vector options in addition to possible versioning between vector/scalar loops.
I don't know if we can make a decision without tuning, especially since
1. NEON hardware available at the time (Cortex-A8) only processed data in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
This is why I thought that starting from the option to switch to 64 if 128 fails (with -mvectorize-with-neon-quad flag) is the least intrusive.
Thanks, Ira
Thanks,
Julian
On Wed, 3 Nov 2010 14:57:01 +0200 Ira Rosen IRAR@il.ibm.com wrote:
-mfloat-abi=softfp/-mfloat-abi=hard -mfpu=neon* [-march=armv7-a]
* there are several variants for this, e.g. neon, neon-fp16, neon-vfpv4 ... generally -mfpu=neon should do though, for Cortex-A8 chips at least.
gcc.dg/vect/vect.exp runs all the vectorizer tests in gcc.dg/vect directory. I was wondering why the only flag used for NEON is -ffast-math and why other flags, like -mfpu=neon, are not used.
I'm not sure about this. There might be dejagnu magic somewhere, or we might rely on e.g. multilib options or configured-in defaults to turn NEON on as appropriate.
	* config/arm/arm.c (arm_autovectorize_vector_sizes): New function.
	(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES): Define.
Looks fine to me (though I'm not an ARM maintainer, so can't approve it upstream). It's a bit of a pity the hook is defined in such a way that it doesn't support non-power-of-two sized vectors, but never mind...
I am curious what platform has non-power-of-two sized vectors?
I think I was muddling up non-power-of-two vectors with vector "group" sizes used by e.g. vld3/vst3, but of course NEON doesn't have the former, and the latter are entirely irrelevant in this context. So, apologies for the noise :-).
(I did come across a MIPS variant with length-3 vectors at one point, but of course that's pretty much irrelevant here also. AFAIK it's not even publicly documented.)
I wonder if this allows us to remove -mvectorize-with-neon-quad already (or, perhaps, wire it on but make the option a no-op, for possible backward-compatibility reasons)?
What vector size do we want as the default? We can make 128 the default and fall back to 64 if vectorization fails for some reason (or we can always start with 64 and switch to 128 if necessary). We can also add -mvectorize-with-neon-double, and use it and -mvectorize-with-neon-quad to set the vector size and prevent auto-detection.
I think it's probably fine to default to 128-bit vectors, and fall back to 64-bits when necessary (where access patterns block usage of wider vectors, or similar). AIUI, ARM were quite keen to get rid of -mvectorize-with-neon-quad altogether, so I'm not sure it makes sense to add a new -double option also: particularly since with widening/narrowing operations, both vector sizes are generally needed simultaneously.
The best solution would be to evaluate costs for both size options. And it is a reasonable amount of work to do that. But the unknown loop bound case will require versioning between two vector options in addition to possible versioning between vector/scalar loops.
I don't know if we can make a decision without tuning, especially since
1. NEON hardware available at the time (Cortex-A8) only processed data in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
This is why I thought that starting from the option to switch to 64 if 128 fails (with -mvectorize-with-neon-quad flag) is the least intrusive.
I'm not sure. The best option may well depend on the particular core (A8 vs A9 vs A15), and users will generally want to have the right option (whatever that turns out to be) as the default, without having to grub around in the documentation.
(Maybe if we make -mvectorize-with-neon-quad "wired-on" but otherwise a no-op, we could add e.g. a --param to say "prefer 64-bit vectors" or "prefer 128-bit vectors" (falling back to 64-bit as necessary), for benchmarking purposes and/or intrepid users.)
CC'ing Richard E., in case he has any input.
Cheers,
Julian
Julian Brown julian@codesourcery.com wrote on 05/11/2010 12:58:14 PM:
I think it's probably fine to default to 128-bit vectors, and fall back to 64-bits when necessary (where access patterns block usage of wider vectors, or similar). AIUI, ARM were quite keen to get rid of -mvectorize-with-neon-quad altogether, so I'm not sure it makes sense to add a new -double option also: particularly since with widening/narrowing operations, both vector sizes are generally needed simultaneously.
Right, mixed vector sizes make it irrelevant.
The best solution would be to evaluate costs for both size options. And it is a reasonable amount of work to do that. But the unknown loop bound case will require versioning between two vector options in addition to possible versioning between vector/scalar loops.
I don't know if we can make a decision without tuning, especially since
1. NEON hardware available at the time (Cortex-A8) only processed data in 64-bit chunks, so Q-reg operations weren't necessarily any faster than D-reg operations (that may still be true).
This is why I thought that starting from the option to switch to 64 if 128 fails (with -mvectorize-with-neon-quad flag) is the least intrusive.
I'm not sure. The best option may well depend on the particular core (A8 vs A9 vs A15), and users will generally want to have the right option (whatever that turns out to be) as the default, without having to grub around in the documentation.
(Maybe if we make -mvectorize-with-neon-quad "wired-on" but otherwise a no-op,
Since TARGET_NEON_VECTORIZE_QUAD is only used in arm_preferred_simd_mode and arm_autovectorize_vector_sizes, we can simply remove it, making 128 the default. (I am not sure I fully understand "wired-on but otherwise a no-op"...).
Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 166032)
+++ config/arm/arm.c	(working copy)
@@ -246,6 +246,7 @@ static bool arm_builtin_support_vector_misalignmen
                                                      const_tree type, int misalignment,
                                                      bool is_packed);
+static unsigned int arm_autovectorize_vector_sizes (void);
 
 /* Table of machine attributes.  */
@@ -391,6 +392,9 @@ static const struct default_options arm_option_opt
 #define TARGET_VECTOR_MODE_SUPPORTED_P arm_vector_mode_supported_p
 #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
 #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE arm_preferred_simd_mode
+#undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
+#define TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES \
+  arm_autovectorize_vector_sizes
 
 #undef TARGET_MACHINE_DEPENDENT_REORG
 #define TARGET_MACHINE_DEPENDENT_REORG arm_reorg
@@ -22025,15 +22029,14 @@ arm_preferred_simd_mode (enum machine_mode mode)
   switch (mode)
     {
     case SFmode:
-      return TARGET_NEON_VECTORIZE_QUAD ? V4SFmode : V2SFmode;
+      return V4SFmode;
     case SImode:
-      return TARGET_NEON_VECTORIZE_QUAD ? V4SImode : V2SImode;
+      return V4SImode;
     case HImode:
-      return TARGET_NEON_VECTORIZE_QUAD ? V8HImode : V4HImode;
+      return V8HImode;
     case QImode:
-      return TARGET_NEON_VECTORIZE_QUAD ? V16QImode : V8QImode;
+      return V16QImode;
     case DImode:
-      if (TARGET_NEON_VECTORIZE_QUAD)
-	return V2DImode;
+      return V2DImode;
       break;
@@ -23223,6 +23226,12 @@ arm_expand_sync (enum machine_mode mode,
     }
 }
 
+static unsigned int
+arm_autovectorize_vector_sizes (void)
+{
+  return 16 | 8;
+}
+
we could add e.g. a --param to say "prefer 64-bit vectors" or "prefer 128-bit vectors" (falling back to 64-bit as necessary), for benchmarking purposes and/or intrepid users.)
ARM specific param?
Thanks, Ira
CC'ing Richard E., in case he has any input.
Cheers,
Julian
On Wed, Nov 3, 2010 at 2:55 AM, Julian Brown julian@codesourcery.com wrote:
On Mon, 1 Nov 2010 15:57:11 +0200 Ira Rosen IRAR@il.ibm.com wrote:
3. According to gcc.dg/vect/vect.exp, the only flag used for NEON (in addition to the target-independent flags) is -ffast-math. Is that enough?
Note that the NEON unit doesn't implement the deeper parts of IEEE754 floating point like NaN and exceptions. These are needed for general purpose applications but not, say, radial fills in graphics. Turning on -ffast-math allows the compiler to send floating point operations via NEON instead of VFP.
-- Michael
linaro-toolchain@lists.linaro.org