Hi Maxim,
> We use Nvidia TK1s (Cortex-A15) for benchmarking on 32-bit ARM.
That's a bit old; I used Cortex-A57 as the closest to that.
> LTO tends to increase functions due to additional inlining, which increases scheduling regions, which increases opportunities for the 1st scheduler for inter-block instruction moves, which increases register pressure.
I don't think this is related to LTO - I see large differences with plain -O2 as well.
> SCHED_PRESSURE_MODEL handles cases with high register pressure well, and switching it off caused a few additional spills in the hot blocks, which caused the slow-down.
> It may be worthwhile to bring SCHED_PRESSURE_MODEL back when LTO is enabled.
A quick run shows that on trunk, --param sched-pressure-algorithm=2 is indeed faster for FP. However, turning off pre-regalloc scheduling is better overall, since it gives a 1% gain on INT and 0.5% on FP as well as significant code size reductions.
So the best way forward for 32-bit Arm is to turn off pre-regalloc scheduling, as it just causes lots of spilling.
Cheers,
Wilco