Hello,
This is Chenglong from Google Container Optimized OS. I'm reporting a severe CPU hang regression that occurs after a high volume of file creation and subsequent cgroup cleanup.
Through bisection, the issue appears to be caused by a chain reaction between three commits related to writeback, unbound workqueues, and CPU-hogging detection. The issue is greatly alleviated on the latest mainline kernel but is not fully resolved, still occurring intermittently (~1 in 10 runs).
How to reproduce
Kernel v6.1 is good. The hang is reliably triggered (over 80% of runs) on kernels v6.6 and v6.12, and intermittently on mainline (6.17-rc7), with the following steps:
Environment: A machine with a fast SSD and a high core count (e.g., Google Cloud's N2-standard-128).
Workload: Concurrently generate a large number of files (e.g., 2 million) using multiple services managed by systemd-run. This creates significant I/O and cgroup churn (a sketch of this workload follows the steps below).
Trigger: After the file generation completes, terminate the systemd-run services.
Result: Shortly after the services are killed, the system's CPU load spikes, a massive number of kworker/+inode_switch_wbs threads appear, and the machine enters a system-wide hang/livelock, becoming unresponsive for 20s to 300s.
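For reference, here is a rough, simplified sketch of the workload (not our exact tooling; the unit names, file counts, paths, and sleep interval below are placeholders):

/* repro_sketch.c - simplified stand-in for the reproducer.  Each transient
 * systemd-run service creates many small files and then idles; stopping the
 * services afterwards is what kicks off the cgroup/writeback cleanup that
 * triggers the hang. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NR_SERVICES       64      /* concurrent transient services */
#define FILES_PER_SERVICE 31250   /* ~2 million files in total */

int main(void)
{
	char unit[64], cmd[512];
	int i;

	for (i = 0; i < NR_SERVICES; i++) {
		snprintf(unit, sizeof(unit), "filegen-%d", i);
		snprintf(cmd, sizeof(cmd),
			 "mkdir -p /var/tmp/filegen/%d && "
			 "for j in $(seq %d); do echo x > /var/tmp/filegen/%d/$j; done && "
			 "sleep infinity", i, FILES_PER_SERVICE, i);
		if (fork() == 0) {
			execlp("systemd-run", "systemd-run", "--unit", unit,
			       "/bin/sh", "-c", cmd, (char *)NULL);
			_exit(1);
		}
	}
	while (wait(NULL) > 0)	/* systemd-run returns once each unit is started */
		;

	sleep(300);	/* crude wait for file creation to finish; tune per machine */

	/* The trigger: tear the services (and their cgroups) down. */
	for (i = 0; i < NR_SERVICES; i++) {
		snprintf(cmd, sizeof(cmd), "systemctl stop filegen-%d.service", i);
		system(cmd);
	}
	return 0;
}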
Analysis and Problematic Commits
1. The initial commit: The process begins with a worker that can get stuck busy-waiting on a spinlock.
Commit: ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Effect: This introduced the inode_switch_wbs_work_fn worker to clean up cgroup writeback structures. Under our test load, this worker appears to hit a highly contended wb->list_lock spinlock, causing it to burn 100% CPU without sleeping.
2. The Kworker Explosion: A subsequent change misinterprets the spinning worker from Stage 1, leading to a runaway feedback loop of worker creation.
Commit: 616db8779b1e ("workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE")
Effect: This logic sees the spinning worker, marks it CPU_INTENSIVE, and excludes it from concurrency management. To handle the work backlog, the pool spawns a new kworker, which also gets stuck on the same lock, repeating the cycle. This directly causes the kworker count to explode from <50 to 100-2000+ (a synthetic sketch of this loop follows the list below).
3. The System-Wide Lockdown: The final piece allows this localized worker explosion to saturate the entire system.
Commit: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Effect: This change introduced non-strict affinity as the default. It allows the hundreds of kworkers created in Stage 2 to be spread by the scheduler across all available CPU cores, turning the problem into a system-wide hang.
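To make stages 1 and 2 concrete, here is a small, self-contained kernel module that reproduces the spin-then-spawn pattern in isolation (synthetic illustration only, not the fs-writeback code; the workqueue name, item count, and 20ms delay are arbitrary). Every work item serializes on a single spinlock and burns CPU past the default 10ms wq_cpu_intensive_thresh_us, so each running worker is marked CPU_INTENSIVE and dropped from concurrency management, and the pool keeps spawning new workers for the backlog:

/* wq_contention_sketch.c - synthetic illustration of the feedback loop. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/spinlock.h>
#include <linux/delay.h>

#define NR_WORKS 256

static struct workqueue_struct *test_wq;
static struct work_struct works[NR_WORKS];
static DEFINE_SPINLOCK(contended_lock);

static void contended_work_fn(struct work_struct *work)
{
	/* Spin waiting for the lock, then hog the CPU while holding it, so the
	 * worker never sleeps and exceeds wq_cpu_intensive_thresh_us. */
	spin_lock(&contended_lock);
	mdelay(20);
	spin_unlock(&contended_lock);
}

static int __init wq_contention_init(void)
{
	int i;

	/* per-CPU (concurrency-managed) workqueue; items queue to the local CPU */
	test_wq = alloc_workqueue("wq_contention_sketch", 0, 0);
	if (!test_wq)
		return -ENOMEM;

	for (i = 0; i < NR_WORKS; i++) {
		INIT_WORK(&works[i], contended_work_fn);
		queue_work(test_wq, &works[i]);
	}
	return 0;
}

static void __exit wq_contention_exit(void)
{
	destroy_workqueue(test_wq);	/* waits for the queued items to drain */
}

module_init(wq_contention_init);
module_exit(wq_contention_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Synthetic workqueue contention sketch");

Loading this module should show the kworker count for that pool climbing in the same shape we observe with inode_switch_wbs.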
Current Status and Mitigation
Mainline Status: On the latest mainline kernel, the hang is far less frequent and the kworker counts are reduced back to normal (<50), suggesting other changes have partially mitigated the issue. However, the hang still occurs, and when it does, the kworker count still explodes (e.g., 300+), indicating the underlying feedback loop remains.
Workaround: A reliable mitigation is to revert to the old workqueue behavior by setting affinity_strict to 1. This contains the kworker proliferation to a single CPU pod, preventing the system-wide hang.
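For completeness, this is how we flip that knob from userspace, assuming the affected unbound workqueue is exported with WQ_SYSFS so that it shows up under /sys/devices/virtual/workqueue/ (the workqueue name is taken as an argument because the right one to tune may differ per setup):

/* set_affinity_strict.c - minimal sketch of the mitigation.
 * Usage: ./set_affinity_strict <workqueue-name> */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <workqueue-name>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path),
		 "/sys/devices/virtual/workqueue/%s/affinity_strict", argv[1]);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	/* 1 = workers are confined to their affinity scope (pod) */
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}

The same write can of course be done with a plain shell redirect; the point is only which attribute gets set.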
Questions
Given that the issue is not fully resolved, could you please provide some guidance?
1. Is this a known issue, and are there patches in development that might fully address the underlying spinlock contention or the kworker feedback loop?
2. Is there a better long-term mitigation we can apply other than forcing strict affinity?
Thank you for your time and help.
Best regards,
Chenglong
Just did more testing here. Confirmed that the system hang is still there, but less frequent (6 out of 40 runs), with the patches at http://lkml.kernel.org/r/20250912103522.2935-1-jack@suse.cz applied to v6.17-rc7. In the bad instances, the kworker count climbed to 600+ and the hang lasted over 80 seconds.
So I think the patches didn't fully solve the issue.
cc'ing Jan.
On Fri, Sep 26, 2025 at 12:54:29PM -0700, Chenglong Tang wrote:
> Just did more testing here. Confirmed that the system hang is still there, but less frequent (6 out of 40 runs), with the patches at http://lkml.kernel.org/r/20250912103522.2935-1-jack@suse.cz applied to v6.17-rc7. In the bad instances, the kworker count climbed to 600+ and the hang lasted over 80 seconds.
> So I think the patches didn't fully solve the issue.
I wonder how the number of workers still exploded to 600+. Are there that many cgroups being shut down? Does clamping down @max_active resolve the problem? There's no reason to have really high concurrency for this.
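FWIW, a hypothetical sketch of such a clamp (identifiers are meant to mirror fs/fs-writeback.c; the flags and the limit of 4 are assumptions, not a tested patch):

/* clamp max_active of the workqueue that runs inode_switch_wbs_work_fn */
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *isw_wq;

static int __init cgroup_writeback_init(void)
{
	/* third argument is max_active; 4 is an arbitrary illustration */
	isw_wq = alloc_workqueue("inode_switch_wbs", 0, 4);
	if (!isw_wq)
		return -ENOMEM;
	return 0;
}
fs_initcall(cgroup_writeback_init);

workqueue_set_max_active() could equally be used to adjust an existing workqueue at runtime.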
Thanks.
cc'ing GKE folks
On Fri, Sep 26, 2025 at 12:59 PM Tejun Heo tj@kernel.org wrote:
> I wonder how the number of workers still exploded to 600+. Are there that many cgroups being shut down? Does clamping down @max_active resolve the problem? There's no reason to have really high concurrency for this.