Hi,
I have been comparing the stock gcc 5.2 and the Linaro 5.2 (Linaro GCC
5.2-2015.11-1) and have noticed a difference with the __sync
intrinsics.
Here is a simple test case:
--- cut here ---
int add_int(int add_value, int *dest)
{
    return __sync_add_and_fetch(dest, add_value);
}
--- cut here ---
Compiling with the stock gcc 5.2 (-S -O3) I get
---------
add_int:
.L2:
ldaxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
mov w0, w2
ret
---------
Whereas with Linaro gcc 5.2 I get
---------
add_int:
.L2:
ldxr w2, [x1]
add w2, w2, w0
stlxr w3, w2, [x1]
cbnz w3, .L2
dmb ish
mov w0, w2
ret
---------
Why the extra (unnecessary?) memory barrier?
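For comparison, here is a minimal sketch of the same operation written with the newer __atomic builtins, where the memory order is spelled out by the caller. The GCC manual describes the legacy __sync builtins as being, in most cases, a full barrier, which may be what the trailing dmb ish is meant to enforce; that reading is an assumption and is not confirmed anywhere in this thread. The function names below are hypothetical and only exist to make the variants easy to compare with -S -O3.

```c
/* Legacy builtin, as in the test case above. */
int add_int_sync(int add_value, int *dest)
{
    return __sync_add_and_fetch(dest, add_value);
}

/* __atomic equivalent with sequentially consistent ordering. */
int add_int_seq_cst(int add_value, int *dest)
{
    return __atomic_add_fetch(dest, add_value, __ATOMIC_SEQ_CST);
}

/* Relaxed variant: atomic update only, no ordering requested.
 * Only appropriate when the caller needs no ordering guarantees. */
int add_int_relaxed(int add_value, int *dest)
{
    return __atomic_add_fetch(dest, add_value, __ATOMIC_RELAXED);
}
```

All three return the updated value; they differ only in the ordering they request, so any difference in the generated barriers should show up when comparing the assembly for each.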
Also, is it worthwhile putting a prfm before the ldaxr? E.g.
add_int:
prfm pst1strm, [x1]
.L2:
ldaxr w2, [x1]
See the following thread
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html
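The prefetch idea can also be tried from C without hand-written assembler. The sketch below uses GCC's __builtin_prefetch, where the second argument 1 means "prefetch for write" and the third argument 0 means low temporal locality. Whether a given GCC version emits a prfm for this on AArch64, and which prfm variant, is an assumption to check against the generated assembly; add_int_prefetch is a hypothetical name.

```c
/* Hint that *dest is about to be written, then do the atomic add.
 * The prefetch is only a hint: the result is the same with or without it. */
int add_int_prefetch(int add_value, int *dest)
{
    __builtin_prefetch(dest, 1, 0);
    return __sync_add_and_fetch(dest, add_value);
}
```

Since the builtin is just a hint, any benefit would have to be measured on the target hardware.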
All the best,
Ed
== Progress ==
o GCC dev. (7/10)
* Remote validation sanitizing:
- fixed last issues in dejagnu patch and submitted it upstream
- 2 more cleanup/fix dejagnu patches submitted and merged upstream
- proposed a fix/workaround for the output pattern issues (>400
failures removed with this patch)
o Misc (3/10)
* Various meetings
* internal discussions
== Plan ==
o Try to follow Connect remotely
o Extended validation work
== Progress ==
* GCC bugs:
- #2073 tried to reproduce it with a manually-built toolchain. No luck
* GCC validation:
- added support to choose simulated cpu (different from --with-cpu)
* GCC:
- completing Neon intrinsics tests, to prepare cleanup
* Validation:
- small improvements
* Misc (conf calls, meetings, emails, ...)
== Next ==
Remote Connect
== Progress ==
* Support (5/10)
- Working on PR17193
- Continue review on D17141
* Background (5/10)
- Code review, meetings, discussions, general support, etc.
- Connect preparations
- GCC ABI 5 discussions
- Assessing Swift calling convention impact on the ARM back-end
- Interviews
# Progress #
* TCWG-545, Handle "branch-to-self" instruction in single stepping.
[5/10] Patches are posted upstream for review.
* TCWG-532, one patch is committed and one patch is posted for review.
[2/10]
* Tweak ARM process record. [2/10]
Two patches are pushed in. Many test failures are fixed.
* FSF patches review. [1/10].
# Plan #
* Linaro Connect.
--
Yao
Hi,
I have just switched to gcc 5.2 from 4.9.2 and the code quality does seem to have improved significantly. For example, it now seems much better at using ldp/stp and it seems to have stopped gratuitous use of the SIMD registers.
However, I still have a few whinges:-)
See attached copy.c / copy.s (this is a performance-critical function from OpenJDK).
pd_disjoint_words:
cmp x2, 8 <<< (1)
sub sp, sp, #64 <<< (2)
bhi .L2
cmp w2, 8 <<< (1)
bls .L15
.L2:
add sp, sp, 64 <<< (2)
(1) If count as a 64 bit unsigned is <= 8 then it is probably still <= 8 as a 32 bit unsigned.
(2) Nowhere in the function does it store anything on the stack, so why
drop and restore the stack every time? Also, a minor quibble in the
disassembly: why does sub use #64 whereas add uses just '64' (appreciate
this is probably binutils, not gcc)?
.L15:
adrp x3, .L4
add x3, x3, :lo12:.L4
ldrb w2, [x3,w2,uxtw] <<< (3)
adr x3, .Lrtx4
add x2, x3, w2, sxtb #2
br x2
(3) Why use a byte table? This is not some sort of embedded system. Use
a word table and this becomes:
.L15:
adrp x3, .L4
add x3, x3, :lo12:.L4
ldr x2, [x3, x2, lsl #3]
br x2
An aligned word load takes exactly the same time as a byte load and we
save the faffing about calculating the address.
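For readers without the attachment, here is a rough sketch of the kind of source that typically produces a jump table like .L4: a dense switch over small counts, with a fallback for larger ones. This is an assumption about the general shape only, not the attached copy.c; the name copy_disjoint_words_sketch, the uint64_t word type and the memcpy fallback are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the attached copy.c (not reproduced here).
 * Counts 0..8 are handled by a dense switch, which a compiler will
 * usually lower to a bounds check plus an indexed jump table. */
void copy_disjoint_words_sketch(const uint64_t *from, uint64_t *to, size_t count)
{
    switch (count) {
    case 8: to[7] = from[7]; /* fall through */
    case 7: to[6] = from[6]; /* fall through */
    case 6: to[5] = from[5]; /* fall through */
    case 5: to[4] = from[4]; /* fall through */
    case 4: to[3] = from[3]; /* fall through */
    case 3: to[2] = from[2]; /* fall through */
    case 2: to[1] = from[1]; /* fall through */
    case 1: to[0] = from[0]; /* fall through */
    case 0: break;
    default:
        /* Larger copies go to the library routine in this sketch. */
        memcpy(to, from, count * sizeof(uint64_t));
        break;
    }
}
```

Compiling a function of this shape with -S -O3 should make it easy to see which table element size the compiler picks and how the dispatch address is computed.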
.L10:
ldp x6, x7, [x0]
ldp x4, x5, [x0, 16]
ldp x2, x3, [x0, 32] <<< (4)
stp x2, x3, [x1, 32] <<< (4)
stp x6, x7, [x1]
stp x4, x5, [x1, 16]
(4) Seems to be something wrong with the load scheduler here? Why not
move the stp x2, x3 to the end? It does this repeatedly.
Unfortunately, as this function is performance critical, it means I will
probably end up doing it in inline assembler, which is time consuming,
error prone and non-portable.
* Whinge mode off
Ed
== Progress ==
o GCC dev. (7/10)
* Remote validation sanitizing:
- Implemented and tested a pure dejagnu fix (the current
implementation works fine for GCC but might be an issue in a different
context; a cleaner fix is almost done)
- Found a latent issue in GCC profiling test harness
* ARM and AArch64 backends LRA cleanup:
- Looked at the remaining artifacts, will prepare a patch for GCC 7
o Misc (3/10)
* Various meetings
* internal discussions
== Plan ==
o Finalize and submit dejagnu fix