Dear List,
I'm new to this list and have some questions.
Looking at the created code of GCC on ARMv8, we noticed some areas where there is room for performance improvements.
I assume that these items might already be noticed by you guys.
For example:
1) We noticed that when writing typical DGEMM like code, GCC includes unnecessary DUP instruction
2) GCC seems unwilling to use LDP loads
3) For optimal FPU performance on some A57 its needed to interleave instruction working on ODD and EVEN registers
GCC seem not properly support this. Here sometimes 100% performance increase could be reached by different instruction interleaving.
4) Some work loops highly benefit of interleaving of FPU instructinons and loads.
GCC seems to likes to re-arrange the code so that most or all loads are put on top of the loop.
This can reduce the performance of a well written workloop significantly.
I have no patches to fix this.
But I can produce C- code and ASM output which will show these performance issues.
Please tell me what the next recommended step will be now.
Are all these items known already, or shall I provide code examples to further explain them?
Kind regards
Gunnar von Boehn
== Progress ==
* Validation
- comparing results and build times between the 2 labs
- tuning jobs scheduling to avoid deadlocks
* GCC trunk monitoring
- lots of validation results to check after 1 week of holidays :-)
- a few regressions/new failures/wrong tests reported
* Backports
- a few reviews
== Next ==
* Infrastructure/Validation
* GCC dev: try to fix vqtbl intrinsics for aarch64_be before e/o stage1
The Linaro Toolchain Working Group (TCWG) is pleased to announce the
2015.10 snapshot of the Linaro GCC 4.9 source package.
This snapshot[1] is based on FSF GCC 4.9.4-pre+svn229467 and includes
performance improvements and bug fixes backported from mainline GCC.
This snapshot contents will be part of the 2015.11 stable [1]
quarterly release.
This snapshot tarball is available on:
http://snapshots.linaro.org/components/toolchain/gcc-linaro/4.9-2015.10/
Interesting changes in this GCC source package snapshot include:
* Updates to GCC 4.9.4-pre+svn229467
* Backport of [Bugfix] PR tree-optimization/65735
* Backport of [Bugfix] PR tree-optimization/65177
* Backport of [Bugfix] PR tree-optimization/65048
Feedback and Support
Subscribe to the important Linaro mailing lists and join our IRC
channels to stay on top of Linaro development.
** Linaro Toolchain Development "mailing list":
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
** Linaro Toolchain IRC channel on irc.freenode.net at @#linaro-tcwg@
* Bug reports should be filed in bugzilla against GCC product:
http://bugs.linaro.org/enter_bug.cgi?product=GCC
* Interested in commercial support? inquire at "Linaro support":
mailto:support@linaro.org
[1]. Stable source package releases are defined as releases where the
full Linaro Toolchain validation plan is executed.
[2]. Source package snapshots are defined when the compiler is only
put through unit-testing and full validation is not performed.
== Progress ==
o Linaro GCC (9/10)
* Backports and reviews for our GCC 5 2015.11 snapshot
* FSF branch merge and needed backports for our GCC 4.9 2015.10 snapshot
o Misc (1/10)
* Various meetings
== Plan ==
o Complete 4.9 snapshot
Noise control experiments - TCWG-358 [3/10]
* Some analysis of data to date
Debian filesystem - TCWG-360 [3/10]
* Got stuck on LAVA interactions
* Now booting-to-LAVA-usability, needs some cleanup and testing with
real benchmark runs
Benchmarking-via-Jenkins - TCWG-348 [1/10]
* Picked back up on understanding that LAVA uinstance is a-coming
* Hacked pbl.py (post-build-lava) to generate suitable JSON
** As a bonus, this can work as a CLI job-submission tool
=Plan=
Holiday Friday (pending approval)
Set up crashdumping on my Juno, try to learn why it crashes
Finish Debian filesystem
Get Jenkins generating and submitting jobs suitable for uinstance
Write up noise control report (probably will get bumped to next week)
== This week ==
* TCWG-317 - Exploit wide add operations when appropriate for Aarch32 (3/10)
- Could not reproduce test failure in qemu
- Waiting on feedback from Maxim who is running patch on Linaro
validation
* TCWG-369 - Exploit wide add operations when appropriate for Aarch64 (4/10)
- Debugging tree vectorizer to determine why loops with no wide add
operations are no longer being vectorized
* TCWG 146 - Debugging regression on arm big endian with Linaro 5 branch
(2/10)
* Misc (1/10)
- Conference calls
== Next week ==
- TCWG-317 - Resolve lto big endian failures
- TCWG-369 - Identify why loops are not being vectorized
- TCWG-146 - Resolve big endian failures
== This Week ==
* TCWG-72 / PR43721 (8/10)
- Completed writing expand_DIVMOD() and submitted patch upstream
- Reworked patch according to Richard Biener's comments.
(http://people.linaro.org/~prathamesh.kulkarni/pr43721-patch-v2.diff)
- Wrote test-cases.
* TCWG-319 (1/10)
- Benchmark run without patch complete for "int" benchmarks on a53, a57
- Benchmark run with patch in progress for "int" benchmarks on a53, a57
* Misc (1/10)
- Meetings
== Next Week ==
- Continue with TCWG-72, TCWG-310li, TCWG-319 benchmarking