On Fri, Mar 13, 2026 at 06:38:25AM -0700, Barry K. Nathan wrote:
> On 3/13/26 03:53, Barry K. Nathan wrote: [snip]
>> On 3/13/26 02:37, Ron Economos wrote:
>>> On 3/13/26 01:05, Barry K. Nathan wrote:
>>>> On 3/12/26 23:10, Ron Economos wrote:
>>>>> Probably those sched/fair patches.
>>>> Yes, after bisecting it turned out to be sched-fair-fix-eevdf-entity-placement-bug-causing-sc.patch
>>>> Taking 6.12.77-rc1 and reverting both of the sched-fair patches results in a working kernel that boots consistently (which I am using now to send this email).
>>> Confirmed on RISC-V. Reverting "sched/fair: Fix lag clamp" commit b547745a2c78fd1cc1fdc6a0d1b05c884c05cec2 and "sched/fair: Fix EEVDF entity placement bug causing scheduling lag" commit f9891a33ba67ce40e5a17023d2f3a5e2b7d72ffd resolves the issue.
>> After looking into it a bit more, I found two upstream commits that should fix this issue without reverting the two sched/fair patches (either of the two commits alone should fix it if I understand the bug and the code correctly):
>> commit 4423af84b29794a9bd2bd07188d8e71083e54c61 sched/fair: optimize the PLACE_LAG when se->vlag is zero
>> commit c70fc32f44431bb30f9025ce753ba8be25acbba3 sched/fair: Adhere to place_entity() constraints
>> I think c70fc32f4443 is theoretically the proper fix, while 4423af84b297 is a performance optimization that just happens to also fix the bug.
>> 4423af84b297 turned out to be the easier backport; the upstream patch applies to 6.12.77-rc1 with an offset but no fuzz or conflicts. So I tried 6.12.77-rc1 + 4423af84b297, and just as with reverting the two sched/fair patches, it eliminates the boot freeze in my testing. It's what I'm running now as I write and send this email.
>> Next, I think I'll try doing a backport of c70fc32f4443 (I think it should be easy enough), and I'll try testing 6.12.77-rc1 + c70fc32f4443 (probably both with and without 4423af84b297). Maybe 4423af84b297 on its own is enough though.
> I originally wrote a much longer email, but I'll try to keep this concise.
> I was able to backport c70fc32f4443 successfully, and the backport does fix the reboot freezes (with or without 4423af84b297). However, backporting that commit convinced me that it's too risky; I'm particularly worried it could make future sched/fair backports more difficult. And once 4423af84b297 is applied, I think c70fc32f4443 ends up being a fix for a theoretical bug.
> So, even though c70fc32f4443 is the commit that was cc'd to stable@, I believe 4423af84b297 is a better (safer, less risky) way to go.
> In summary, I believe the two best ways to fix this regression are:
> - Backport 4423af84b297, or
> - Revert the two sched/fair patches.
I'll go drop these for now, and if they should come back in the future, someone can send all of the needed ones at once.
thanks so much for the testing and figuring it all out!
greg k-h