Hello Matt,
On 9/26/2025 9:04 PM, Matt Fleming wrote:
> On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> Hello John, Matt,
>> On 9/26/2025 5:35 AM, John Stultz wrote:
>>> However, there are two spots where we might exit dequeue_entities() early when cfs_rq_throttled(rq), so maybe that's what's catching us here?
>> That could very likely be it.
> That tracks -- we're heavy users of cgroups and this particular issue only appeared on our kubernetes nodes.
>> Matt, if possible can you try the patch attached below to check if the bailout for throttled hierarchy is indeed the root cause. Thanks in advance.
> I've been running our reproducer with this patch for the last few hours without any issues, so the fix looks good to me.
> Tested-by: Matt Fleming <mfleming@cloudflare.com>
Thank you for testing the diff. I see Ingo has already posted the scheduler pull for v6.18 which indirectly solves this by removing those early returns.
Once that lands, I'll attach a formal commit log and send out a patch targeting the stable kernels >= 6.12.