Re: [CI-NOTIFY]: TCWG Bisect tcwg_bmk_tk1/gnu-release-arm-spec2k6-O3_LTO - Build # 27 - Successful!

15 Jul 2021


Hi Maxim,
...
> We use Nvidia TK1s (Cortex-A15) for benchmarking on 32-bit ARM.

That's a bit old; I used Cortex-A57 as the closest to that.
...
> LTO tends to increase function sizes due to additional inlining, which enlarges scheduling regions,
> which gives the 1st scheduler more opportunities for inter-block instruction moves, which
> increases register pressure.

I don't think this is related to LTO - I see large differences with plain -O2 as well.
...
> SCHED_PRESSURE_MODEL handles cases with high register pressure well, and switching it off
> caused a few additional spills in the hot blocks, which caused the slow-down.
> It may be worthwhile to bring SCHED_PRESSURE_MODEL back when LTO is enabled.

A quick run shows that on trunk, --param sched-pressure-algorithm=2 is indeed faster
for FP. However, turning off pre-regalloc scheduling is better overall, since it gives a 1% gain
on INT and 0.5% on FP, as well as significant code size reductions.
So the best way forward for 32-bit Arm is to turn off pre-regalloc scheduling, as it
just causes lots of spilling.
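
For anyone wanting to repeat this kind of comparison, the three configurations discussed above could be set up roughly as in the sketch below. It is only an illustration: the helper name, the configuration labels, the -O2 base flags, the plain "gcc" compiler name and the "bench.c" file are all hypothetical, and it assumes that -fno-schedule-insns is the option that disables the scheduling pass run before register allocation.

```python
# Sketch only: builds GCC command lines for the three scheduler setups
# compared in this thread. Labels, file names and base flags are assumptions.

BASE_FLAGS = ["-O2"]  # assumed base; the CI job in the subject uses -O3 with LTO

CONFIGS = {
    # Trunk defaults, no extra scheduler options.
    "trunk-default": [],
    # Select the second sched-pressure implementation, as tried in the quick run.
    "pressure-model": ["--param", "sched-pressure-algorithm=2"],
    # Assumed way to switch off the first (pre-register-allocation) scheduler.
    "no-first-sched": ["-fno-schedule-insns"],
}


def gcc_command(config, source, compiler="gcc"):
    """Return the argument list for compiling `source` under `config`."""
    stem = source.rsplit(".", 1)[0]
    output = f"{stem}.{config}.o"
    return [compiler, *BASE_FLAGS, *CONFIGS[config], "-c", source, "-o", output]


if __name__ == "__main__":
    for name in CONFIGS:
        print(" ".join(gcc_command(name, "bench.c")))
```

Each resulting object file can then be checked for code size and spill counts before running the benchmarks themselves.
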
Cheers,
Wilco