question on aarch64 libm

Virendra Kumar Pathak

18 Jan 2016

5:54 p.m.

Hi Linaro Toolchain Group,

I have a few questions on glibc and libm with respect to aarch64. If possible, please provide some insight; otherwise, kindly redirect me to the appropriate person or forum.

1. It seems from the community patches that ARM/Linaro is optimizing glibc functions such as memcpy/memmove and the string functions for aarch64. However, it looks like some of these patches (e.g. memcpy/memmove) are still not merged in glibc. Any comment on their availability in glibc? e.g. https://www.sourceware.org/ml/libc-alpha/2015-12/msg00341.html

2. On the same note, is there any plan for optimizing/tuning libm functions (e.g. trigonometric) for aarch64? I could not find any matching patches on the review board. Please correct me if I am wrong.

3. It looks like ARM has released an independent version of libm for certain trigonometric functions: https://github.com/ARM-software/optimized-routines. Is there any plan for these optimizations to go into glibc's libm? Any comment on their performance improvement over GNU libm?

Thanks in advance for your time.

-- with regards, Virendra Kumar Pathak

Adhemerval Zanella

18 Jan

6:36 p.m.

Hi Virendra,

On 18-01-2016 15:54, Virendra Kumar Pathak wrote:

> Hi Linaro Toolchain Group,
>
> I have a few questions on glibc and libm with respect to aarch64. If possible, please provide some insight; otherwise, kindly redirect me to the appropriate person or forum.
>
> 1. It seems from the community patches that ARM/Linaro is optimizing glibc functions such as memcpy/memmove and the string functions for aarch64. However, it looks like some of these patches (e.g. memcpy/memmove) are still not merged in glibc. Any comment on their availability in glibc? e.g. https://www.sourceware.org/ml/libc-alpha/2015-12/msg00341.html

This is mainly due to a lack of review. Usually, for optimization patches, the arch maintainer has the final say. It is now too late for 2.23, but we will focus on making them available for 2.24.

Besides this memcpy patch, there are other string functions (memchr) and some generic ones (strpbrk, etc.) that are stalled, either because they are missing review or because review comments have not been followed up.

> 2. On the same note, is there any plan for optimizing/tuning libm functions (e.g. trigonometric) for aarch64? I could not find any matching patches on the review board. Please correct me if I am wrong.

No one has posted any patches or started a discussion about it. The complex functions in libm are usually coded in C to be platform neutral, with some specific functions being optimized (rounding, etc.). x86_64 also has assembly implementations for some specific routines (exp, log, ...), but I do not have numbers on how fast they are relative to their C counterparts (it might also be the case that the speedup is not high enough to justify the assembly).

The current rule of thumb in GLIBC is to avoid arch-specific assembly routines as far as possible and to work with platform-neutral C implementations, with possible arch hooks on performance-sensitive paths (see Siddhesh's recent sincos performance improvements).
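To illustrate what an arch hook on a performance-sensitive path could look like, here is a minimal sketch in C. It is an assumption for illustration only and not code taken from GLIBC: the function name round_to_nearest_even is hypothetical, the AArch64 branch uses the frintn instruction through inline assembly, and the generic branch assumes the default round-to-nearest mode and |x| below 2^51.

```c
/* Hypothetical helper, not a GLIBC symbol: round a double to the nearest
   integer value, with ties going to the even integer. */
double round_to_nearest_even(double x)
{
#if defined(__aarch64__) && defined(__GNUC__)
    /* Arch hook: a single AArch64 instruction does the whole job. */
    double r;
    __asm__("frintn %d0, %d1" : "=w"(r) : "w"(x));
    return r;
#else
    /* Platform-neutral fallback: adding and subtracting 1.5 * 2^52 pushes
       the fraction bits out of the significand. Assumes round-to-nearest
       mode and |x| < 2^51. */
    const double shift = 0x1.8p52;
    volatile double t = x + shift;  /* volatile guards against excess precision */
    return t - shift;
#endif
}
```

Callers only see round_to_nearest_even; which branch is compiled in is decided per architecture, so the algorithm built on top of it stays in shared C code.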

For very performance-critical paths we also have the option of adding specific builds with more aggressive optimization flags, selected through IFUNC (for instance, one for A57 and another for A72, if that turns out to be worthwhile).
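The dispatch idea can be sketched in portable C as follows. This is only an analogue under stated assumptions: real IFUNC symbols are resolved by the dynamic loader at relocation time, whereas this sketch resolves a function pointer on the first call. All names (vector_sum, sum_generic, sum_unrolled, cpu_prefers_unrolled) are hypothetical, and the CPU check uses the GCC/Clang x86_64 builtins only because they are widely available; an AArch64 variant would need its own detection.

```c
#include <stddef.h>

typedef double (*sum_fn)(const double *, size_t);

/* Baseline variant. */
static double sum_generic(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a variant built with more aggressive tuning. Note that the
   different summation order can change rounding for general inputs. */
static double sum_unrolled(const double *v, size_t n)
{
    double a = 0.0, b = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n)
        a += v[i];
    return a + b;
}

/* Hypothetical CPU check; returns 0 on targets without the builtins. */
static int cpu_prefers_unrolled(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Plays the role of an IFUNC resolver: pick one variant for this CPU. */
static sum_fn resolve_sum(void)
{
    return cpu_prefers_unrolled() ? sum_unrolled : sum_generic;
}

/* Public entry point; resolves once on first use (not thread-safe). */
double vector_sum(const double *v, size_t n)
{
    static sum_fn impl;
    if (impl == NULL)
        impl = resolve_sum();
    return impl(v, n);
}
```

The appeal of this approach is that the variants can share one C source and differ only in compiler flags, so there is no hand-written assembly to maintain.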

If none of these options is enough to improve performance, platform-specific implementations are still a good option (libmvec is currently mostly x86_64 assembly implementations).

> 3. It looks like ARM has released an independent version of libm for certain trigonometric functions: https://github.com/ARM-software/optimized-routines. Is there any plan for these optimizations to go into glibc's libm? Any comment on their performance improvement over GNU libm?

Regarding licensing, I do not foresee any issues: since GLIBC is licensed under LGPL 2.1 or later, it may be combined with code from an LGPL version 3 library, with the combined work as a whole falling under the terms of the GPLv3 [1], and Apache 2.0, the license ARM uses for this project, is compatible with LGPL 3.0. I am far from being a license lawyer, so someone please correct me if I am wrong.

As for the technical side, I think it is feasible; however, it will require a lot of work to adapt these functions to fit the GLIBC project.

The first thing is the requirements: GLIBC currently requires GCC 4.7 as the minimum compiler, whereas the optimized-routines project itself requires 4.8. I noted that mpfr and mpc are used exclusively in its testing framework.

The second thing is to add these implementations for ARM/AArch64 with the correct names and infrastructure. The downside is that ARM/AArch64 would deviate from the other ports, requiring further maintenance because of the different optimizations.

Another thing is to check the implementations against GLIBC's own testcases, which include tests for exceptions, rounding, etc. Any deviation will require a fix and/or a bug report.

Finally, GLIBC developers will certainly ask either for improvements to the benchmark testsuite or for numbers showing that these implementations are better than the current ones. It will also require some precision/speed analysis.
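As a rough idea of what the precision side of such an analysis measures, the sketch below counts how many representable floats lie between two results (their distance in ULPs). It is a minimal illustration, not GLIBC's test infrastructure: the names float_to_ordered and ulp_distance are hypothetical, and NaN inputs are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that grows monotonically with
   the float's value, so that adjacent floats map to adjacent integers. */
static int64_t float_to_ordered(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* well-defined type pun */
    if (bits & 0x80000000u)                   /* negative: mirror the magnitude */
        return -(int64_t)(bits & 0x7fffffffu);
    return (int64_t)bits;
}

/* Number of representable floats between a and b (0 if they are equal,
   including +0.0 versus -0.0). Assumes neither input is NaN. */
int64_t ulp_distance(float a, float b)
{
    int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}
```

A comparison run would call ulp_distance(candidate(x), reference(x)) over many inputs x and report the maximum, where the reference comes from a higher-precision computation.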

[1] http://www.gnu.org/licenses/license-list.en.html

linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain

Siddhesh Poyarekar

19 Jan

5:49 a.m.

On 19 January 2016 at 00:06, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> No one has posted any patches or started a discussion about it. The complex functions in libm are usually coded in C to be platform neutral, with some specific functions being optimized (rounding, etc.). x86_64 also has assembly implementations for some specific routines (exp, log, ...), but I do not have numbers on how fast they are relative to their C counterparts (it might also be the case that the speedup is not high enough to justify the assembly).

A correction here: i686 has a lot of assembly math implementations, x86_64 doesn't. The last x86_64 asm implementation was sincos which was removed because it was not accurate enough for our project goals. The i686 asm versions (and for other archs, I think alpha and m68k) are there because nobody cares enough about their precision. The i686 functions for example are known to not be precise for the entire input domain.

> The current rule of thumb in GLIBC is to avoid arch-specific assembly routines as far as possible and to work with platform-neutral C implementations, with possible arch hooks on performance-sensitive paths (see Siddhesh's recent sincos performance improvements).

The general rule here is to more or less guarantee that the algorithm does not lose precision regardless of the language it is written in. However if you want the community also to support it actively, writing it in C is your best bet.

> For very performance-critical paths we also have the option of adding specific builds with more aggressive optimization flags, selected through IFUNC (for instance, one for A57 and another for A72, if that turns out to be worthwhile).

This is the cheapest way to squeeze out some performance, provided that the compiler is tuned correctly. This is in fact what we do in x86_64 with ifunc implementations for avx, sse2 and fma4.

Siddhesh

Adhemerval Zanella

12:34 p.m.

On 19-01-2016 03:49, Siddhesh Poyarekar wrote:

> On 19 January 2016 at 00:06, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>
>> No one has posted any patch or stirred discussions about it. The complex function in libm are usually coded in C to be platform neutral, with some specific function being optimized (rounding, etc.). x86_64 also have some assembly implementations for some specific routines (exp, log, ...), but I also do not have number about how fast are they related to C counterparts (it also might be the case where the speedup is not that high to validate the assembly existence).
>
> A correction here: i686 has a lot of assembly math implementations, x86_64 doesn't. The last x86_64 asm implementation was sincos which was removed because it was not accurate enough for our project goals. The i686 asm versions (and for other archs, I think alpha and m68k) are there because nobody cares enough about their precision. The i686 functions for example are known to not be precise for the entire input domain.

I do see some x86_64 specialized implementations being used currently (sysdeps/x86_64/fpu/s_{sin,cos}f.S, for instance). The sincos implementation is still used (sysdeps/x86_64/fpu/s_sincosf.S).

What you are referring to as having been dropped by glibc is the use of the fsin/fcos/fsincos Intel instructions, which show a ridiculous error range depending on the inputs [1].

[1] https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-boun...

Siddhesh Poyarekar

12:52 p.m.

On 19 January 2016 at 18:04, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:

> I do see some x86_64 specialized implementations being used currently (sysdeps/x86_64/fpu/s_{sin,cos}f.S, for instance). The sincos implementation is still used (sysdeps/x86_64/fpu/s_sincosf.S).
>
> What you are referring to as having been dropped by glibc is the use of the fsin/fcos/fsincos Intel instructions, which show a ridiculous error range depending on the inputs [1].

The sincos implementation for x86_64 is the generic one; it is the sincosf (single float) that has an assembly implementation. However you're right otherwise; I had overlooked everything but the ieee754 double implementations of the transcendentals.

Siddhesh
