Dear List,
I'm new to this list and have some questions. Looking at the code GCC generates for ARMv8, we noticed some areas where there is room for performance improvement. I assume that you may already be aware of these items.
For example:
1) We noticed that for typical DGEMM-like code, GCC emits unnecessary DUP instructions.
2) GCC seems unwilling to use LDP loads
3) For optimal FPU performance on some A57 cores, it is necessary to interleave instructions working on odd and even registers.
GCC does not seem to support this properly. In some cases, a performance increase of 100% could be reached with a different instruction interleaving.
4) Some work loops benefit greatly from interleaving FPU instructions with loads.
GCC seems to like rearranging the code so that most or all loads are placed at the top of the loop. This can significantly reduce the performance of a well-written work loop.
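
To make items 1 and 2 more concrete, a reduced test case might look roughly like the sketch below. This is only an illustration: the function name dgemm_kernel_4x4, the 4x4 tile size and the packed data layout are assumptions made for this example, not the actual code we measured. The inner loop multiplies a column of A by a single element of B, which is the pattern where a vectorizing compiler has to broadcast a scalar.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical DGEMM-like micro-kernel: C(4x4) += A(4xk) * B(kx4).
 * Assumed layout (illustrative only):
 *   a: packed, 4 consecutive doubles per step p (one column of A)
 *   b: packed, 4 consecutive doubles per step p (one row of B)
 *   c: column-major 4x4 tile, C(i,j) stored at c[4*j + i]
 */
void dgemm_kernel_4x4(size_t k, const double *restrict a,
                      const double *restrict b, double *restrict c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t p = 0; p < k; ++p) {
        for (size_t j = 0; j < 4; ++j) {
            /* Scalar from B that is multiplied with a whole column of A;
             * this is the value that would need to be broadcast. */
            const double bj = b[4 * p + j];
            for (size_t i = 0; i < 4; ++i)
                c[4 * j + i] += a[4 * p + i] * bj;
        }
    }
}
```

Compiling such a file with something like "gcc -O3 -mcpu=cortex-a57 -S" and inspecting the assembly for DUP and LDP should show whether the behaviour described in items 1 and 2 occurs for a given GCC version.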
I have no patches to fix this, but I can provide C code and the corresponding assembly output that demonstrate these performance issues.
Please tell me what the recommended next step is. Are all these items already known, or shall I provide code examples to explain them further?
Kind regards,
Gunnar von Boehn