Dear List,
I'm new to this list and have some questions. Looking at the code GCC generates for ARMv8, we noticed some areas where there is room for performance improvement. I assume you may already be aware of these items.
For example:
1) We noticed that when writing typical DGEMM-like code, GCC includes an unnecessary DUP instruction.
2) GCC seems unwilling to use LDP loads.
3) For optimal FPU performance on some A57 cores, instructions working on odd and even registers need to be interleaved.
GCC does not seem to support this properly. In some cases a different instruction interleaving could yield a 100% performance increase.
4) Some work loops benefit greatly from interleaving FPU instructions and loads.
GCC seems to like to rearrange the code so that most or all loads are placed at the top of the loop. This can significantly reduce the performance of a well-written work loop.
I have no patches to fix this, but I can produce C code and ASM output that show these performance issues.
Please tell me what the recommended next step is. Are all these items known already, or shall I provide code examples to further explain them?
Kind regards,
Gunnar von Boehn
On Wed, Nov 4, 2015 at 7:07 AM, Gunnar von Boehn <gunnar.von.boehn@huawei.com> wrote:
> Looking at the code GCC generates for ARMv8, we noticed some areas where there is room for performance improvement. I assume you may already be aware of these items.
There is a known problem that the current register allocator doesn't handle partial overlap very well. Both aarch64 and aarch32 use the same register set for FP and SIMD/neon, which results in lots of partial overlaps, which can confuse the register allocator into using unnecessary temporaries. Otherwise, I don't think that we have any major problems, other than the fact that vectorization is a hard problem to solve, and we do have lots of examples showing non-optimal code generation in certain cases.
> We noticed that when writing typical DGEMM-like code, GCC includes
> an unnecessary DUP instruction.
This could be the known register allocation problem. It is hard to say more without a testcase.
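A reduced testcase for this would probably look something like the sketch below. The function name `dgemm_like` and the row-major layout are assumptions made for illustration; this is not code from the original report. The scalar `aik` is multiplied against a whole row of B, which is the point where a vectorized inner loop has to broadcast a scalar into a vector register.

```c
#include <stddef.h>

/* Hypothetical reduced testcase: C += A * B for n x n row-major matrices.
   The innermost loop multiplies one scalar of A with a row of B, so a
   vectorized version needs the scalar broadcast across a vector register. */
void dgemm_like(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];   /* scalar reused over j */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Compiling something like this with optimization and inspecting the assembly output would show whether a DUP ends up inside the inner loop or is hoisted out of it.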
> GCC seems unwilling to use LDP loads.
There is support for generating LDP/STP, but since this usually involves combining unrelated data, it is done in a peephole pass and may not be triggering as often as we like, as the peephole optimization only works well if you get lucky register allocation and instruction scheduling that creates peephole optimization opportunities. Can't say more without a testcase.
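As a minimal sketch of the kind of code where LDP could apply (the struct and function names here are hypothetical, chosen only for illustration): each operand below needs two adjacent 8-byte loads from the same base address, which is the pattern the peephole would have to recognize and merge.

```c
/* Hypothetical example: two adjacent doubles per operand. */
struct pair {
    double lo;
    double hi;
};

/* Each struct needs two loads at offsets 0 and 8 from the same base.
   On AArch64 those could in principle be merged into one load-pair,
   e.g. "ldp d0, d1, [x0]"; whether that happens depends on the
   register allocation and scheduling the peephole pass sees. */
double pair_dot(const struct pair *p, const struct pair *q)
{
    return p->lo * q->lo + p->hi * q->hi;
}
```

Whether GCC actually emits LDP for this depends on the compiler version and options, so the assembly output has to be checked case by case.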
> For optimal FPU performance on some A57 cores, instructions working on
> odd and even registers need to be interleaved. GCC does not seem to support this properly. In some cases a different instruction interleaving could yield a 100% performance increase.
A patch was added to GCC 6 for this. It looks like it has been backported into the Linaro gcc-5.x sources. From git log on the Linaro gcc-5.x tree:
commit 9c9ff2bc6885aa07d55ecef8248c08a8e14ff9b6
Author: Christophe Lyon <christophe.lyon@linaro.org>
Date:   Mon Oct 5 15:17:57 2015 +0200

    gcc/
    Backport from trunk r222512.
    2015-04-28  Thomas Preud'homme  <thomas.preudhomme@arm.com>

    PR target/63503
    * config.gcc: Add cortex-a57-fma-steering.o to extra_objs for
    aarch64-*-*.
    * config/aarch64/t-aarch64: Add a rule for cortex-a57-fma-steering.o.
    * config/aarch64/aarch64.h (AARCH64_FL_USE_FMA_STEERING_PASS): Define.
    (AARCH64_TUNE_FMA_STEERING): Likewise.
    * config/aarch64/aarch64-cores.def: Set
    AARCH64_FL_USE_FMA_STEERING_PASS for cores with dynamic steering of
    FMUL/FMADD instructions.
    * config/aarch64/aarch64.c (aarch64_register_fma_steering): Declare.
    (aarch64_override_options): Include cortex-a57-fma-steering.h.  Call
    aarch64_register_fma_steering () if AARCH64_TUNE_FMA_STEERING is true.
    * config/aarch64/cortex-a57-fma-steering.h: New file.
    * config/aarch64/cortex-a57-fma-steering.c: Likewise.

    Change-Id: I92e0e8d06fc5212e8856d6d5f9c7c6b83a737ca8
There are a number of related changes after this one. I don't know how well this works as I haven't tried using it.
> Some work loops benefit greatly from interleaving FPU instructions
> and loads. GCC seems to like to rearrange the code so that most or all loads are placed at the top of the loop. This can significantly reduce the performance of a well-written work loop.
This isn't an ARM specific problem, and within the ARM family, it is target dependent, as it depends on how the instruction scheduler hooks have been written for the target you are optimizing for. I know for some of the cortex parts, there was an effort to report fewer load/store pipes than exist, so that gcc would not schedule all loads at the start of a loop. I don't know how effective it is though.
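A testcase for this could be as small as the sketch below, which interleaves loads and floating-point work at the source level. The function name `axpy_interleaved` and the unroll factor of two are assumptions made for illustration. The compiler is free to reorder these statements, so the interesting part is comparing the source order against the generated assembly for a given tuning target.

```c
#include <stddef.h>

/* Hypothetical testcase: y += alpha * x, unrolled by two, with each
   load placed right before the arithmetic that consumes it. */
void axpy_interleaved(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        double x0 = x[i];          /* load */
        double y0 = y[i];          /* load */
        y0 += alpha * x0;          /* FP work on the first element */

        double x1 = x[i + 1];      /* load */
        double y1 = y[i + 1];      /* load */
        y1 += alpha * x1;          /* FP work on the second element */

        y[i] = y0;
        y[i + 1] = y1;
    }

    /* Remainder element when n is odd. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

If the assembly shows all four loads grouped at the top of the loop body, that is the scheduling behaviour being described, and the result may differ between tuning targets.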
> Please tell me what the recommended next step is. Are all these items known already, or shall I provide code examples to further explain them?
You can try filing bug reports into the FSF bugzilla at http://gcc.gnu.org/bugzilla or the Linaro bugzilla at http://bugs.linaro.org. Bugs filed into the FSF bugzilla will get better visibility, as all ARM gcc developers will see them. The problems you are reporting are mostly hard problems that may not be fixed for a while, and these kinds of problems are probably better reported into the FSF bugzilla. Issues specific to Linaro should of course go into the Linaro bugzilla. You can try giving us testcases here, but if it isn't something we can fix in a few minutes, then it is better if it goes into bugzilla.
Jim
linaro-toolchain@lists.linaro.org