On 19-01-2016 03:49, Siddhesh Poyarekar wrote:
On 19 January 2016 at 00:06, Adhemerval Zanella adhemerval.zanella@linaro.org wrote:
No one has posted any patch or stirred up discussion about it. The complex functions in libm are usually written in C to be platform neutral, with some specific functions being optimized (rounding, etc.). x86_64 also has assembly implementations for some specific routines (exp, log, ...), but I do not have numbers on how fast they are relative to their C counterparts (it might also be the case that the speedup is not high enough to justify keeping the assembly).
A correction here: i686 has a lot of assembly math implementations, x86_64 doesn't. The last x86_64 asm implementation was sincos which was removed because it was not accurate enough for our project goals. The i686 asm versions (and for other archs, I think alpha and m68k) are there because nobody cares enough about their precision. The i686 functions for example are known to not be precise for the entire input domain.
I do see some x86_64 specialized implementations being used currently (sysdeps/x86_64/fpu/s_{sin,cos}f.S for instance). The sincos implementation is still used (sysdeps/x86_64/fpu/s_sincosf.S).
What you are referring to, which glibc has dropped, is the use of the fsin/fcos/fsincos Intel instructions, which show a ridiculous error range depending on the inputs [1].
[1] https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-boun...
The current rule of thumb in glibc is to avoid arch-specific assembly routines as much as possible and to work with platform-neutral C implementations, with possible arch hooks on performance-sensitive paths (check Siddhesh's recent sincos performance improvements).
The general rule here is to more or less guarantee that the algorithm does not lose precision regardless of the language it is written in. However, if you also want the community to support it actively, writing it in C is your best bet.
For very critical performance paths we also have the option of adding specific builds with more aggressive optimization flags along with IFUNC support (for instance one for A57 and another for A72, if that turns out to be worthwhile).
This is the cheapest way to squeeze out some performance, provided that the compiler is tuned correctly. This is in fact what we do in x86_64 with ifunc implementations for avx, sse2 and fma4.
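To illustrate the idea, here is a minimal sketch of IFUNC-style dispatch using the GCC `ifunc` attribute. It is not glibc's actual code: `madd`, `madd_generic`, `madd_fma`, `resolve_madd` and `madd_selfcheck` are hypothetical names, and picking FMA as the feature to test is my own choice for the example. It assumes GCC (or a compatible compiler) on an x86_64 ELF target and falls back to the plain C version everywhere else.

```c
#include <assert.h>

/* Platform-neutral C version; always available. */
static double madd_generic(double a, double b, double c)
{
    return a * b + c;
}

#if defined(__x86_64__) && defined(__GNUC__) && defined(__ELF__)

/* Same source-level operation, but compiled with FMA enabled for this
   one function only, so the rest of the file stays baseline x86_64. */
__attribute__((target("fma")))
static double madd_fma(double a, double b, double c)
{
    return __builtin_fma(a, b, c);
}

typedef double madd_fn(double, double, double);

/* Resolver: called once by the dynamic loader when the symbol is bound.
   It runs very early, so the CPU feature data must be initialised
   explicitly before querying it. */
static madd_fn *resolve_madd(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") ? madd_fma : madd_generic;
}

/* Callers only ever see madd(); the variant is chosen at load time. */
double madd(double a, double b, double c)
    __attribute__((ifunc("resolve_madd")));

#else

/* No IFUNC support assumed here: use the C version directly. */
double madd(double a, double b, double c)
{
    return madd_generic(a, b, c);
}

#endif

/* Sanity check on inputs where both variants are exact. */
int madd_selfcheck(void)
{
    assert(madd(2.0, 3.0, 4.0) == madd_generic(2.0, 3.0, 4.0));
    return 1;
}
```

The fused variant rounds once instead of twice, so the two versions can differ in the last bit for some inputs; any such variant would still have to meet the same precision requirements as the C version.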
Siddhesh