From: Fengnan Chang changfengnan@bytedance.com
[ Upstream commit 9d83e1f05c98bab5de350bef89177e2be8b34db0 ]
After commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker if it can make progress"), in our produce environment, we still observe that part of io_worker threads keeps creating and destroying. After analysis, it was confirmed that this was due to a more complex scenario involving a large number of fsync operations, which can be abstracted as frequent write + fsync operations on multiple files in a single uring instance. Since write is a hash operation while fsync is not, and fsync is likely to be suspended during execution, the action of checking the hash value in io_wqe_dec_running cannot handle such scenarios. Similarly, if hash-based work and non-hash-based work are sent at the same time, similar issues are likely to occur. Returning to the starting point of the issue, when a new work arrives, io_wq_enqueue may wake up free worker A, while io_wq_dec_running may create worker B. Ultimately, only one of A and B can obtain and process the task, leaving the other in an idle state. In the end, the issue is caused by inconsistent logic in the checks performed by io_wq_enqueue and io_wq_dec_running. Therefore, the problem can be resolved by checking for available workers in io_wq_dec_running.
Signed-off-by: Fengnan Chang changfengnan@bytedance.com Reviewed-by: Diangang Li lidiangang@bytedance.com Link: https://lore.kernel.org/r/20250813120214.18729-1-changfengnan@bytedance.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
Now let me analyze the nature of the fix to determine if it's appropriate for stable:
**Backport Status: YES**
This commit should be backported to stable kernel trees for the following reasons:
## Bug Fix Nature The commit fixes a **real performance bug** that occurs in production environments. The issue causes unnecessary creation and destruction of io_worker threads, leading to: 1. CPU overhead from constant thread creation/destruction 2. Resource wastage from redundant workers 3. Performance degradation in production workloads with mixed hash/non- hash operations
## Root Cause Analysis The patch addresses a **race condition** between `io_wq_enqueue()` and `io_wq_dec_running()`: - When new work arrives, `io_wq_enqueue()` may wake up a free worker A - Simultaneously, `io_wq_dec_running()` may create a new worker B - Only one worker can actually process the task, leaving the other idle - This creates a worker "churn" pattern that wastes resources
## Small and Contained Fix The fix is minimal and surgical: 1. Adds a check in `create_worker_cb()` to verify if a free worker is available before creating a new one 2. Uses the existing `io_acct_activate_free_worker()` function (lines 361-365) 3. Only adds 8 lines of code with proper RCU locking 4. Introduces a single goto label for clean error handling
## Production Impact The commit message explicitly states this was observed in **production environments** with real workloads involving: - Frequent write + fsync operations - Mixed hash and non-hash operations - Multiple files in a single io_uring instance
## Regression Potential The fix has **low regression risk** because: 1. It reuses existing, well-tested infrastructure (`io_acct_activate_free_worker()`) 2. The logic is consistent with what `io_wq_enqueue()` already does 3. It adds a defensive check rather than changing core logic 4. The RCU locking is properly scoped and follows existing patterns
## Stable Kernel Rules Compliance The patch follows stable kernel criteria: - ✓ Fixes a real bug affecting users (worker thread churn in production) - ✓ Small change (8 lines added) - ✓ Obviously correct (aligns logic between enqueue and dec_running paths) - ✓ Already tested in production (reported by ByteDance engineers) - ✓ Not a new feature or optimization
## Affected Versions This should be backported to kernels containing commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker if it can make progress") which introduced incomplete handling of the worker creation logic. The issue affects mixed workload scenarios that are common in production environments.
The fact that this was discovered and reported by engineers from a major cloud provider (ByteDance) running production workloads further validates its importance for stable backporting.
io_uring/io-wq.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c index be91edf34f01..17dfaa0395c4 100644 --- a/io_uring/io-wq.c +++ b/io_uring/io-wq.c @@ -357,6 +357,13 @@ static void create_worker_cb(struct callback_head *cb) worker = container_of(cb, struct io_worker, create_work); wq = worker->wq; acct = worker->acct; + + rcu_read_lock(); + do_create = !io_acct_activate_free_worker(acct); + rcu_read_unlock(); + if (!do_create) + goto no_need_create; + raw_spin_lock(&acct->workers_lock);
if (acct->nr_workers < acct->max_workers) { @@ -367,6 +374,7 @@ static void create_worker_cb(struct callback_head *cb) if (do_create) { create_io_worker(wq, acct); } else { +no_need_create: atomic_dec(&acct->nr_running); io_worker_ref_put(wq); }