Hi,
I have been comparing the stock gcc 5.2 and the Linaro 5.2 (Linaro GCC 5.2-2015.11-1) and have noticed a difference with the __sync intrinsics.
Here is the simple test case
--- cut here ---
int add_int(int add_value, int *dest)
{
  return __sync_add_and_fetch(dest, add_value);
}
--- cut here ---
Compiling with the stock gcc 5.2 (-S -O3) I get
---------
add_int:
.L2:
	ldaxr	w2, [x1]
	add	w2, w2, w0
	stlxr	w3, w2, [x1]
	cbnz	w3, .L2
	mov	w0, w2
	ret
---------
Whereas with Linaro gcc 5.2 I get

---------
add_int:
.L2:
	ldxr	w2, [x1]
	add	w2, w2, w0
	stlxr	w3, w2, [x1]
	cbnz	w3, .L2
	dmb	ish
	mov	w0, w2
	ret
---------
Why the extra (unnecessary?) memory barrier?
Also, is it worthwhile putting a prfm before the ldaxr? E.g.

add_int:
	prfm	pst1strm, [x1]
.L2:
	ldaxr	w2, [x1]
See the following thread
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html
All the best, Ed
Hi Ed,
On 9 March 2016 at 14:02, Edward Nevill edward.nevill@linaro.org wrote:
Why the extra (unnecessary?) memory barrier?
This is because the Linaro gcc-5-branch is in sync with the FSF gcc-5-branch, which contains a fix for this PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

As explained in the bugzilla entry and the patch submission, the restrictions are stronger on the __sync builtins than on the __atomic ones:

https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01989.html
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
Hi Yvan,
On 9 March 2016 at 13:22, Yvan Roux yvan.roux@linaro.org wrote:
This is because the Linaro gcc-5-branch is in sync with the FSF gcc-5-branch, which contains a fix for this PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

As explained in the bugzilla entry and the patch submission, the restrictions are stronger on the __sync builtins than on the __atomic ones.
Thanks for that. Obviously we should be using the __atomic versions in Java, as I don't believe Java requires a full memory barrier here.
One curiosity: I recompiled for 8.1 LSE using the Linaro gcc 5.2 and got the following

/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c

add_int:
	ldaddal	w0, w0, [x1]
	add	w2, w0, w0
	mov	w0, w2
	ret
Why no memory barrier here? As far as I am aware, ldaddal has only acquire and release semantics; it does not implement a full barrier.
All the best, Ed.
/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c

add_int:
	ldaddal	w0, w0, [x1]
	add	w2, w0, w0
	mov	w0, w2
	ret
Am I going mad, or does this just return the contents of the memory location * 2?
	ldaddal	w0, w0, [x1]

returns the original contents of [x1] in w0,

	add	w2, w0, w0

doubles it, and

	mov	w0, w2

returns it.
I think __sync_add_and_fetch should return the updated contents.
Here is the test C code again.
int add_int(int add_value, int *dest)
{
  return __sync_add_and_fetch(dest, add_value);
}
Regards, Ed.
On 10 March 2016 at 19:42, Edward Nevill edward.nevill@linaro.org wrote:
/usr/local/linare-gcc-5.2/bin/gcc -S -O3 -march=armv8-a+lse test.c

add_int:
	ldaddal	w0, w0, [x1]
	add	w2, w0, w0
	mov	w0, w2
	ret
Hmm, that is not the code I get with our latest release candidate (the release should be out next week); it gives:

add_int:
	ldaddal	w0, w3, [x1]
	add	w2, w3, w0
	mov	w0, w2
	ret
On 10 March 2016 at 19:52, Yvan Roux yvan.roux@linaro.org wrote:
Ed, just for info: it was fixed by Andrew in trunk and backported into our branch at the end of December.
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01962.html
Also, is it worthwhile putting a prfm before the ldaxr? E.g.
There is already a thread upstream about this: https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00508.html
I reject adding this to -mcpu=generic as it will hurt ThunderX more than it will help. Prfm is single-issue on ThunderX, so it costs an extra cycle in each case. For the kernel, the prfm really should be patched out for ThunderX; I will propose a patch for that later on. The way ThunderX implements ldxr/stxr is much simpler than, say, Cortex-A57/A72, because the inner and outer shareability domains are the same, i.e. there will only be one coherence point, the L2. It is also why ThunderX exposes so many race conditions (well, that and the timeout for a write going to the coherence point being around 1024 cycles if there was no flush).
Thanks, Andrew Pinski
-----Original Message-----
From: linaro-toolchain [mailto:linaro-toolchain-bounces@lists.linaro.org] On Behalf Of Edward Nevill
Sent: Wednesday, March 9, 2016 5:02 AM
To: Linaro Toolchain Mailman List linaro-toolchain@lists.linaro.org
Subject: Some questions about the gcc __sync intrinsics