The minutes of the performance call held on 3 December 2012 can be found at:
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2012-12-03
In summary, the actions from the meeting are:
* ACTION: Yvan will do the trunk merge this week
* ACTION: Yvan to do the GCC release next week
* ACTION: Christophe to do the GDB release branch merge
* ACTION: Matt volunteered Zhenqiang as the QEMU manual tester this month
* ACTION: Matt wants shrink-wrap to go upstream
* ACTION: Matt can blueprint further shrink-wrap improvements after that
* ACTION: Matt to create a blueprint on aarch32 ARMv8 instructions in QEMU
* ACTION: Matt to decide if aarch32 ARMv8 instructions in QEMU needs a card
* ACTION: Christophe to send Michael cross build failures to investigate
Thanks,
-- Michael
== Progress ==
* Turn off 64-bit bitops in Neon: patch proposed upstream after
  positive benchmarking. Re-submitted after a request to add testcases
  and documentation for the new option.
* Disable peeling: running benchmarks with peeling completely disabled
to see the impact.
* PGO/hot-cold partitioning: tested the new patch from Google, which
  solves some of the ICEs but makes new ones appear.
* builtin_bswap16 backport to Linaro-4.7: checking whether the testcase
  that fails in one of our configurations can be kept after rebasing my
  branch.
* Trying to bootstrap gcc-linaro/4.7 on board
* Internal support
== Next ==
* Follow-up on 64-bit bitops in Neon
* Look at benchmarks results with peeling disabled
* Finish builtin_bswap16 backport
Summary:
* Verify shrink-wrap related bugs.
* Validate and release the Linaro toolchain binary 2012.11.
* Collect performance data for different branch costs.
Details:
* Validate and release Linaro toolchain binary 2012.11.
* Test aarch64 toolchain. All the basic tests PASS except gdbserver.
* Collect performance data for branch cost combinations. For eembcv1,
  some combinations show more than a 2% performance improvement on
  PandaBoard in Thumb mode. More test results will come later.
* Verify shrink-wrap related bugs (http://goo.gl/6fGg5). All pass with
  the new patch. Native tests are ongoing. Identify the root cause of
  why the copy, which blocks the shrink-wrap optimization for the
  453.povray benchmark, is not optimized away.
Plans:
* Finalize the aarch64 toolchain binary release plan.
* Collect performance data for branch cost tuning.
* Enhance shrink-wrap to optimize the copy.
Best regards!
-Zhenqiang
== Progress ==
* Boehm GC AArch64 support:
- Read wikis and papers on the memory model
- Reported an issue with the current ARMv7 atomic builtins
- Submitted the fix, which was approved
- Improving libatomic-ops AArch64 support with load-acquire/
store-release usage.
== Next ==
* Continue on the Boehm GC AArch64 support.
[Short week: 3 days]
* looked at (but failed to reproduce) a hang in QEMU reported
by Christoffer when shutting down a KVM ARM guest using TUN/TAP
networking
* investigated LP:1084148 (segfault in qemu usermode) sufficiently
to diagnose it as probably another of qemu's "can't handle
multithreaded guest programs" bugs
* fixed some problems with QEMU's secondary CPU boot code which
were masked by errors in QEMU's GIC model but revealed by
real hardware (ie KVM); fixed the GIC model bugs as well
* investigated LP:955379 (cmake hangs under qemu-arm-static).
Tracked down to a race condition involving signal delivery,
the fix to which would require the significant redesign I
sketched out here a year or so ago:
http://lists.gnu.org/archive/html/qemu-devel/2011-12/msg00384.html
KVM blueprint progress tracker:
http://ex.seabright.co.nz/helpers/backlog?group_by=topic&colour_by=state&pr…
-- PMM
== Blueprints ==
Blueprint                     Initial       Current       Actual
initial-aarch64-backport      31 Oct 2012   7 Dec 2012*
aarch64-baremetal-testing     31 Oct 2012   7 Dec 2012*
fix-gcc-multiarch-testing     31 Dec 2012   31 Dec 2012
backport-fma-intrinsic        31 Dec 2012   31 Dec 2012
fused-multiply-add-support    31 Dec 2012   31 Dec 2012
gcc-investigate-lra-for-arm   31 Dec 2012   31 Dec 2012
== Progress ==
* Admin
* Interviewing
* Preparation for taking over from Michael
* Investigate patches for literal pool layout bug
* Applied
* PINGed triplet backport patches upstream
* Other bug issues
* Including an issue running SPEC2K on x86 with recent trunk
* And an issue only in gcc-linaro 4.6
== Next Week ==
* Start leading Toolchain team
* Run HOT/COLD partitioning benchmarks
* Analyse ARM results
* On x86_64 to see what actual benefit we could get
* initial-aarch64-backport & aarch64-baremetal-testing
* Finish documentation
* gcc-investigate-lra-for-arm
* Analyse benchmarks
* fix-gcc-multiarch-testing
* Come up with a strawman proposal for updating the testsuite to handle
  testing with varying command-line options.
== Future ==
* backport-fma-intrinsic & fused-multiply-add-support
* Backport patches once fix-gcc-multiarch-testing has been done.
== Planned Leave ==
* Monday 24 December - Monday 31 December
--
Matthew Gretton-Dann
Linaro Toolchain Working Group
matthew.gretton-dann(a)linaro.org
Hi,
I think I have identified some issues with the atomic builtins, but I would
like your advice.
For instance:
A: __atomic_store_n (addr, val, __ATOMIC_SEQ_CST);
gives the armv7 code:
DMB sy
STR r1, [r0]
DMB sy
but if I have understood correctly, the DMB instructions only guarantee
that the code is sequentially consistent; they do not provide atomicity,
for which we would have to use the LDREX/STREX instructions. Thus I think
that the code should be:
DMB sy
1: LDREX r2, [r0]       @ read the current value (discarded)
   STREX r2, r1, [r0]   @ try to store val (r1); r2 = status
   TEQ r2, #0
   BNE 1b               @ retry if the exclusive store failed
B: __atomic_load_n (addr, __ATOMIC_ACQUIRE);
gives the armv7 code:
DMB sy
LDR r0, [r0]
but the load-acquire semantics specify that all loads and stores appearing
in program order after the load-acquire will be observed after the
load-acquire, so the DMB should come after the LDR, no?
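In other words, what I would have expected is roughly the following (just a
sketch to illustrate the ordering; I have kept the same DMB sy barrier as in
the generated code, a weaker barrier may well be enough):

LDR r0, [r0]    @ the acquiring load itself
DMB sy          @ later loads/stores cannot be observed before the load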
--
Yvan
Hi,
I'm working on AArch64 support for libatomic-ops (part of the Boehm GC).
I mainly use GCC's __atomic builtins to do this, but in our 4.7 version
they don't use the load-acquire / store-release instructions now available
in the ARMv8 ISA. These instructions are used in mainline GCC
(in atomic.md), but not in their exclusive form; I understand this is
probably due to the performance penalty, but I would like your opinion on
that point, as I don't find the ARMv8 ISA documentation really clear.
If we want to implement an atomic load acquire, is
LDAR x1, [x0]
sufficient, or do we have to write it like this:
L: LDAXR x0, [x3]      // load-acquire exclusive
   STXR w1, x0, [x3]   // store the value back; w1 = status
   CBNZ w1, L          // retry if the exclusive store failed
Thanks
Yvan