(I forgot to add Francesco Dolcini as a recipient on my previous email, so I'm doing that now.)
On 3/13/26 02:37, Ron Economos wrote:
> On 3/13/26 01:05, Barry K. Nathan wrote:
>> On 3/12/26 23:10, Ron Economos wrote:
>>> Probably those sched/fair patches.
>> Yes, after bisecting it turned out to be sched-fair-fix-eevdf-entity-placement-bug-causing-sc.patch
>> Taking 6.12.77-rc1 and reverting both of the sched-fair patches results in a working kernel that boots consistently (which I am using now to send this email).
> Confirmed on RISC-V. Reverting "sched/fair: Fix lag clamp" commit b547745a2c78fd1cc1fdc6a0d1b05c884c05cec2 and "sched/fair: Fix EEVDF entity placement bug causing scheduling lag" commit f9891a33ba67ce40e5a17023d2f3a5e2b7d72ffd resolves the issue.
After looking into it a bit more, I found two upstream commits that should fix this issue without reverting the two sched/fair patches (either of the two commits alone should fix it if I understand the bug and the code correctly):
commit 4423af84b29794a9bd2bd07188d8e71083e54c61 ("sched/fair: optimize the PLACE_LAG when se->vlag is zero")
commit c70fc32f44431bb30f9025ce753ba8be25acbba3 ("sched/fair: Adhere to place_entity() constraints")
I think c70fc32f4443 is theoretically the proper fix, while 4423af84b297 is a performance optimization that just happens to also fix the bug.
4423af84b297 turned out to be the easier backport; the upstream patch applies to 6.12.77-rc1 with an offset but no fuzz or conflicts. So I tried 6.12.77-rc1 + 4423af84b297, and just as with reverting the two sched/fair patches, it eliminates the boot freeze in my testing. It's what I'm running now as I write and send this email.
Next, I think I'll try backporting c70fc32f4443 (it should be easy enough) and testing 6.12.77-rc1 + c70fc32f4443, probably both with and without 4423af84b297. Maybe 4423af84b297 on its own is enough, though.