LLVM ARM NEON VMUL.f32


Renato Golin

19 Mar 2013

9:56 p.m.

Hi folks,

I found an issue while fixing a test using the wrong VMUL.f32, and I'd like to know what should be our choice on this topic that is slightly controversial.

Basically, LLVM chooses to lower single-precision FMUL to NEON's VMUL.f32 instead of VFP's version because, on some cores (A8, A5 and Apple's Swift), the VFP variant is really slow.

This is all cool and dandy, but NEON is not fully IEEE 754 compliant, so the result can be slightly different. So slightly that only one test, which was really pushing the boundaries (i.e. going below FLT_MIN), caught it.
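To make the FLT_MIN remark concrete, here is a minimal sketch of the kind of check that can expose the difference. It is not the test from the LLVM tree; the function names are made up for illustration. It multiplies FLT_MIN by 0.5f at run time and reports whether the result survived as a subnormal, which is what IEEE 754 gradual underflow gives, or was flushed to zero.

```c
#include <float.h>
#include <math.h>

/* Multiply at run time; the volatile qualifiers stop the compiler
   from folding the product into a constant. */
float mul_runtime(float a, float b) {
    volatile float x = a, y = b;
    volatile float r = x * y;
    return r;
}

/* Returns 1 if a * b came out as a non-zero subnormal, 0 otherwise. */
int product_is_subnormal(float a, float b) {
    return fpclassify(mul_runtime(a, b)) == FP_SUBNORMAL;
}

/* FLT_MIN * 0.5f is exactly 2^-127, which IEEE 754 represents as a
   subnormal. Returns 1 when the multiply keeps it, 0 when the value
   is flushed to zero. */
int underflow_is_gradual(void) {
    return product_is_subnormal(FLT_MIN, 0.5f);
}
```

On a target whose scalar multiply follows IEEE 754, underflow_is_gradual() should return 1; with a flush-to-zero multiply the product would be 0.0f and the function would return 0.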

There are two ways we can go here:

1. Strict IEEE compatibility and *only* lower NEON's VMUL if unsafe-math is on. This will make generic single-prec. code slower but you can always turn unsafe-math on if you want more speed.

2. Continue using NEON for f32 by default and put a note somewhere that people should turn this option (FeatureNEONForFP) off on A5/A8 if they *really* care about maximum IEEE compliance.

Apple already said that for Darwin, 2 is still the option of choice. Do we agree and ignore this issue? Or do we want strict conformance by default for GNU/EABI?
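The two options can be sketched as a small decision function. This is only an illustration of the policy question, not LLVM's actual code: the enum, the function and the parameter names are all hypothetical, and it assumes that under option 1 the core must still prefer NEON for FP (the FeatureNEONForFP case) before unsafe-math has any effect.

```c
#include <stdbool.h>

/* Hypothetical names for the two defaults discussed above. */
enum f32_policy {
    STRICT_IEEE,     /* option 1: NEON only when unsafe-math is on */
    NEON_BY_DEFAULT  /* option 2: NEON whenever the core prefers it */
};

/* Decide whether a scalar f32 multiply may be lowered to NEON's VMUL.f32.
   neon_for_fp: the core prefers NEON for FP (e.g. A5/A8/Swift).
   unsafe_math: the user asked for relaxed floating-point semantics. */
bool may_use_neon_for_f32(enum f32_policy policy,
                          bool neon_for_fp, bool unsafe_math) {
    if (!neon_for_fp)
        return false;        /* VFP is used regardless of policy */
    if (policy == STRICT_IEEE)
        return unsafe_math;  /* the user must opt in to the deviation */
    return true;             /* NEON_BY_DEFAULT: the user must opt out */
}
```

The only case where the two policies disagree is a core that prefers NEON with unsafe-math off, which is exactly the default being debated.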

GCC uses fmuls...

cheers, --renato


Kristof Beyls

20 Mar 2013

8:11 a.m.

Hi Renato,

I think to be able to make the best possible judgement here, answers to the following questions would be needed:

* Does this result in non-compliance of IEEE754 regarding denormals? NaN? INFs? Something else?

* Also, does the C/C++ standard say something about IEEE 754 compliance?

* I checked the OpenCL 1.1 spec, and that one says that IEEE 754 compliance regarding treatment of INF and NaNs is a must; signalling NaNs is not required; supporting denormalized numbers is optional. (see section 7.2)

* I'm guessing that default option for Clang is to produce fully compliant IEEE754 code? Is it? Is that the right choice? Or is not handling denormals fully correctly a better default? What about NaNs, INFs, others?

Thanks,

Kristof


Renato Golin

5:15 p.m.

On 20 March 2013 08:11, Kristof Beyls Kristof.Beyls@arm.com wrote:

· Does this result in non-compliance of IEEE754 regarding denormals? NaN? INFs? Something else?

Yes, but only slightly. ;)

I don't want to treat this question as black and white because the penalty is severe, but I also don't want people looking at the generated code and thinking (as I did) that LLVM is doing it wrong.

· Also, does the C/C++ standard say something about IEEE 754 compliance?

This has little to do with Clang, C, C++ or OpenCL. The IR is the same for every language and they all lower to fmul in the end. It depends on what we want to do in the ARM back-end by default.

There are flags that the compiler can turn on and off, but this is a question of what should the *default* be? Clang is not passing any flag because the hidden contract with the LLVM back-end is that LLVM does what Clang expects.

As Mans says, C99 (and C11) do require 754 compatibility (I couldn't find any strict requirement on C++11 but there could be one), and that's up to the Clang folks to make sure they pass the exact flags to the back-end.

From Peter's and Mans's answers, and my own opinion, I think we should be strict and require unsafe-math for NEON f32, at least for *EABI (i.e. option 1).

If no one else objects, I'll make that change next week.

cheers, --renato

Peter Maydell

10:31 a.m.

On 19 March 2013 21:56, Renato Golin renato.golin@linaro.org wrote:


Basically, LLVM chooses to lower single-precision FMUL to NEON's VMUL.f32 instead of VFP's version because, on some cores (A8, A5 and Apple's Swift), the VFP variant is really slow.

This is all cool and dandy, but NEON is not IEEE 754 compliant, so the result is slightly different. So slightly that only one test, that was really pushing the boundaries (ie. going below FLT_MIN) did catch it.

There are two ways we can go here:

1. Strict IEEE compatibility and *only* lower NEON's VMUL if unsafe-math is on. This will make generic single-prec. code slower but you can always turn unsafe-math on if you want more speed.

2. Continue using NEON for f32 by default and put a note somewhere that people should turn this option (FeatureNEONForFP) off on A5/A8 if they *really* care about maximum IEEE compliance.

Apple already said that for Darwin, 2 is still the option of choice. Do we agree and ignore this issue? Or for GNU/EABI we want strict conformance by default?

This seems straightforward to me. You have a user facing flag for controlling whether you can deviate from IEEE754 in the name of performance (unsafe-math), so you should honour it. This has the secondary advantage of following gcc behaviour, and the primary advantage of not being confusing or requiring people to use architecture-specific feature flags just to get standard fp behaviour.

Anybody actually writing code which uses 32 bit floats in performance critical code can apply unsafe-math if it helps them.

-- PMM

Mans Rullgard

2:17 p.m.

On 19 March 2013 21:56, Renato Golin renato.golin@linaro.org wrote:


Hi folks,

I found an issue while fixing a test using the wrong VMUL.f32, and I'd like to know what should be our choice on this topic that is slightly controversial.

Basically, LLVM chooses to lower single-precision FMUL to NEON's VMUL.f32 instead of VFP's version because, on some cores (A8, A5 and Apple's Swift), the VFP variant is really slow.

This is all cool and dandy, but NEON is not IEEE 754 compliant, so the result is slightly different. So slightly that only one test, that was really pushing the boundaries (ie. going below FLT_MIN) did catch it.

There are two ways we can go here:

1. Strict IEEE compatibility and *only* lower NEON's VMUL if unsafe-math is on. This will make generic single-prec. code slower but you can always turn unsafe-math on if you want more speed.

2. Continue using NEON for f32 by default and put a note somewhere that people should turn this option (FeatureNEONForFP) off on A5/A8 if they *really* care about maximum IEEE compliance.

Apple already said that for Darwin, 2 is still the option of choice. Do we agree and ignore this issue? Or for GNU/EABI we want strict conformance by default?

GCC uses fmuls...

The NEON vmul.f32 takes two possibly unexpected shortcuts: it flushes denormals to zero, and it ignores the selected rounding mode. Both of these can result in incorrect operation of code assuming standard behaviour.
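As a rough illustration of the first shortcut, the sketch below models a flush-to-zero multiply in portable C next to a plain multiply. It is a simplified model under stated assumptions, not a description of the hardware: the function names are hypothetical, subnormal inputs and results are replaced by a zero of the same sign, and neither the rounding-mode shortcut nor the exact point at which the hardware detects underflow is modelled.

```c
#include <math.h>

/* Replace a subnormal value with a zero of the same sign;
   every other value passes through unchanged. */
static float flush_subnormal(float v) {
    if (fpclassify(v) == FP_SUBNORMAL)
        return signbit(v) ? -0.0f : 0.0f;
    return v;
}

/* Hypothetical model of a flush-to-zero multiply:
   both inputs and the result are flushed. */
float mul_flush_to_zero(float a, float b) {
    volatile float x = flush_subnormal(a);
    volatile float y = flush_subnormal(b);
    volatile float r = x * y;
    return flush_subnormal(r);
}

/* Plain multiply evaluated at run time, for comparison. */
float mul_ieee(float a, float b) {
    volatile float x = a, y = b;
    volatile float r = x * y;
    return r;
}
```

With a = 0x1p-126f (FLT_MIN) and b = 0.5f, mul_ieee is expected to return the subnormal 2^-127 on an IEEE-conforming target, while mul_flush_to_zero returns 0.0f. Code that divides by such a product, or tests it against zero, would then take a different path.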

C99 requires, and users generally expect, IEEE754 behaviour, so deviating from this by default is in my opinion a bad idea. The fact that well-known flags exist to explicitly request relaxed requirements in favour of speed further reinforces the expectation that the default will be standards compliance.

I am strongly in favour of your option 1.

-- Mans Rullgard / mru

Ramana Radhakrishnan

21 Mar 2013

12:57 p.m.

On 20 March 2013 14:17, Mans Rullgard wrote:


The NEON vmul.f32 takes two possibly unexpected shortcuts: it flushes denormals to zero, and it ignores the selected rounding mode. Both of these can result in incorrect operation of code assuming standard behaviour.

This was the reason GCC disabled vectorization of a lot of fp operations for NEON in strict IEEE754 conformance mode, which is the default for the ARM port. And I suspect you want LLVM to do the same, if it doesn't already :)

http://gcc.gnu.org/PR43703 is the bug report, which has more details, if sourceware is back up and its services are running.

regards Ramana

Renato Golin

1:30 p.m.

On 21 March 2013 12:57, Ramana Radhakrishnan Ramana.Radhakrishnan@arm.com wrote:

This was the reason GCC disabled vectorization for a lot of fp operations for neon when in strict IEEE754 conformance mode for the ARM port which is the default. And I suspect you want LLVM to as well if it already doesn't :)

http://gcc.gnu.org/PR43703 is the bug report for more - if sourceware is back up and services running.

Thanks Ramana,

I've added this to the bug:

http://llvm.org/bugs/show_bug.cgi?id=15546

cheers, --renato


linaro-toolchain@lists.linaro.org


participants (5)

Kristof Beyls
Mans Rullgard
Peter Maydell
Ramana Radhakrishnan
Renato Golin