On 3/13/26 03:53, Barry K. Nathan wrote: [snip]
> On 3/13/26 02:37, Ron Economos wrote:
>> On 3/13/26 01:05, Barry K. Nathan wrote:
>>> On 3/12/26 23:10, Ron Economos wrote:
>>>> Probably those sched/fair patches.
>>> Yes, after bisecting it turned out to be sched-fair-fix-eevdf-entity-placement-bug-causing-sc.patch
>>> Taking 6.12.77-rc1 and reverting both of the sched-fair patches results in a working kernel that boots consistently (which I am using now to send this email).
>> Confirmed on RISC-V. Reverting "sched/fair: Fix lag clamp" commit b547745a2c78fd1cc1fdc6a0d1b05c884c05cec2 and "sched/fair: Fix EEVDF entity placement bug causing scheduling lag" commit f9891a33ba67ce40e5a17023d2f3a5e2b7d72ffd resolves the issue.
> After looking into it a bit more, I found two upstream commits that should fix this issue without reverting the two sched/fair patches (either of the two commits alone should fix it if I understand the bug and the code correctly):
> commit 4423af84b29794a9bd2bd07188d8e71083e54c61 sched/fair: optimize the PLACE_LAG when se->vlag is zero
> commit c70fc32f44431bb30f9025ce753ba8be25acbba3 sched/fair: Adhere to place_entity() constraints
> I think c70fc32f4443 is theoretically the proper fix, while 4423af84b297 is a performance optimization that just happens to also fix the bug.
> 4423af84b297 turned out to be the easier backport; the upstream patch applies to 6.12.77-rc1 with an offset but no fuzz or conflicts. So I tried 6.12.77-rc1 + 4423af84b297, and just as with reverting the two sched/fair patches, it eliminates the boot freeze in my testing. It's what I'm running now as I write and send this email.
> Next, I think I'll try doing a backport of c70fc32f4443 (I think it should be easy enough), and I'll try testing 6.12.77-rc1 + c70fc32f4443 (probably both with and without 4423af84b297). Maybe 4423af84b297 on its own is enough though.
I originally wrote a much longer email, but I'll try to keep this concise.
I was able to backport c70fc32f4443 successfully, and the backport does fix the reboot freezes (with or without 4423af84b297). However, backporting that commit convinced me that it's too risky; I'm particularly worried it could make future sched/fair backports more difficult. And once 4423af84b297 is applied, I think c70fc32f4443 ends up being a fix for a theoretical bug.
So, even though c70fc32f4443 is the commit that was cc'd to stable@, I believe 4423af84b297 is a better (safer, less risky) way to go.
In summary, I believe the two best ways to fix this regression are:
1. Backport 4423af84b297, or
2. Revert the two sched/fair patches.