From: Yu Kuai <yukuai3@huawei.com>
[ Upstream commit 5d726c4dbeeddef612e6bed27edd29733f4d13af ]
The following deadlock warning can be triggered easily under lockdep:
WARNING: possible circular locking dependency detected
6.17.0-rc3-00124-ga12c2658ced0 #1665 Not tainted
------------------------------------------------------
check/1334 is trying to acquire lock:
ff1100011d9d0678 (&q->sysfs_lock){+.+.}-{4:4}, at: blk_unregister_queue+0x53/0x180

but task is already holding lock:
ff1100011d9d00e0 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: del_gendisk+0xba/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
       blk_queue_enter+0x40b/0x470
       blkg_conf_prep+0x7b/0x3c0
       tg_set_limit+0x10a/0x3e0
       cgroup_file_write+0xc6/0x420
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&q->rq_qos_mutex){+.+.}-{4:4}:
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       wbt_init+0x17e/0x280
       wbt_enable_default+0xe9/0x140
       blk_register_queue+0x1da/0x2e0
       __add_disk+0x38c/0x5d0
       add_disk_fwnode+0x89/0x250
       device_add_disk+0x18/0x30
       virtblk_probe+0x13a3/0x1800
       virtio_dev_probe+0x389/0x610
       really_probe+0x136/0x620
       __driver_probe_device+0xb3/0x230
       driver_probe_device+0x2f/0xe0
       __driver_attach+0x158/0x250
       bus_for_each_dev+0xa9/0x130
       driver_attach+0x26/0x40
       bus_add_driver+0x178/0x3d0
       driver_register+0x7d/0x1c0
       __register_virtio_driver+0x2c/0x60
       virtio_blk_init+0x6f/0xe0
       do_one_initcall+0x94/0x540
       kernel_init_freeable+0x56a/0x7b0
       kernel_init+0x2b/0x270
       ret_from_fork+0x268/0x4c0
       ret_from_fork_asm+0x1a/0x30

-> #0 (&q->sysfs_lock){+.+.}-{4:4}:
       __lock_acquire+0x1835/0x2940
       lock_acquire+0xf9/0x450
       __mutex_lock+0xd8/0xf50
       mutex_lock_nested+0x2b/0x40
       blk_unregister_queue+0x53/0x180
       __del_gendisk+0x226/0x690
       del_gendisk+0xba/0x110
       sd_remove+0x49/0xb0 [sd_mod]
       device_remove+0x87/0xb0
       device_release_driver_internal+0x11e/0x230
       device_release_driver+0x1a/0x30
       bus_remove_device+0x14d/0x220
       device_del+0x1e1/0x5a0
       __scsi_remove_device+0x1ff/0x2f0
       scsi_remove_device+0x37/0x60
       sdev_store_delete+0x77/0x100
       dev_attr_store+0x1f/0x40
       sysfs_kf_write+0x65/0x90
       kernfs_fop_write_iter+0x189/0x280
       vfs_write+0x256/0x490
       ksys_write+0x83/0x190
       __x64_sys_write+0x21/0x30
       x64_sys_call+0x4608/0x4630
       do_syscall_64+0xdb/0x6b0
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
  &q->sysfs_lock --> &q->rq_qos_mutex --> &q->q_usage_counter(queue)#3
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&q->q_usage_counter(queue)#3);
                               lock(&q->rq_qos_mutex);
                               lock(&q->q_usage_counter(queue)#3);
  lock(&q->sysfs_lock);
The root cause is that q_usage_counter is grabbed while rq_qos_mutex is held in blkg_conf_prep(), whereas other contexts freeze the queue before taking rq_qos_mutex.
The blk_queue_enter() in blkg_conf_prep() is used to protect against policy deactivation, which is already serialized by blkcg_mutex; hence convert the blk_queue_enter() to blkcg_mutex to fix this problem. Meanwhile, because blkcg_mutex is held after the queue is frozen during policy deactivation, also convert blkg_alloc() to use GFP_NOIO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
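To make the inversion concrete, here is a minimal pseudo-C sketch of the two conflicting lock orders, reconstructed from the splat above. It is illustrative only: the function names are hypothetical, blk_queue_enter()/blk_queue_exit() are block-layer-internal helpers, and the removal path actually pins q_usage_counter through a queue freeze in del_gendisk(), which is modeled here with blk_queue_enter().

#include <linux/blkdev.h>
#include <linux/mutex.h>

/* Removal side (del_gendisk -> blk_unregister_queue):
 * q_usage_counter is pinned first, sysfs_lock is taken second. */
static void removal_order(struct request_queue *q)
{
	if (blk_queue_enter(q, 0))	/* stands in for the queue freeze */
		return;
	mutex_lock(&q->sysfs_lock);	/* blk_unregister_queue() */
	mutex_unlock(&q->sysfs_lock);
	blk_queue_exit(q);
}

/* Config side before the fix (tg_set_limit -> blkg_conf_prep):
 * rq_qos_mutex is taken first, q_usage_counter is grabbed second,
 * i.e. the inverse of the order above. */
static void config_order_before_fix(struct request_queue *q)
{
	mutex_lock(&q->rq_qos_mutex);	/* blkg_conf_open_bdev() */
	if (!blk_queue_enter(q, 0))	/* blkg_conf_prep(), pre-fix */
		blk_queue_exit(q);
	mutex_unlock(&q->rq_qos_mutex);
}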
LLM Generated explanations, may be completely bogus:
YES
Explanation:
- Fixes a real deadlock. The report shows a lockdep-proven circular dependency involving `&q->sysfs_lock`, `&q->rq_qos_mutex`, and `&q->q_usage_counter`, triggered during blkcg policy configuration while devices are being removed. The root cause is explicitly identified in the commit message: `blkg_conf_prep()` took `rq_qos_mutex` (indirectly, via `blkg_conf_open_bdev()`) and then grabbed `q_usage_counter` via `blk_queue_enter()`, but other paths freeze the queue before taking `rq_qos_mutex`, creating an inversion.
- Core change eliminates the lock inversion without changing semantics (see the condensed locking sketch after this list):
  - Replaces `blk_queue_enter(q, 0)` with `mutex_lock(&q->blkcg_mutex)` to prevent concurrent policy deactivation, matching the protection already used by `blkcg_deactivate_policy()`.
  - Before: `blk_queue_enter()` grabbed `q_usage_counter` while `rq_qos_mutex` was already held (bad ordering).
  - After: `q->blkcg_mutex` is taken instead, which is the correct lock to serialize with policy deactivation and does not participate in the problematic chain.
  - Code reference (new behavior at the commit): block/blk-cgroup.c:880 acquires `mutex_lock(&q->blkcg_mutex)`; the fail_exit and success paths unlock it instead of calling `blk_queue_exit()`.
  - Drops the corresponding `blk_queue_exit(q)` calls since `blk_queue_enter()` is no longer used (success/fail paths in block/blk-cgroup.c).
- Allocation context adjusted for freeze safety:
  - Changes `blkg_alloc(..., GFP_KERNEL)` to `GFP_NOIO` while `blkcg_mutex` is held, because policy deactivation holds `blkcg_mutex` after freezing the queue; `GFP_NOIO` avoids potential IO from reclaim that could deadlock while the queue is frozen.
  - Code reference: block/blk-cgroup.c:911 switches to `GFP_NOIO`.
- Locking contract to callers is preserved: `blkg_conf_prep()` still returns with `queue_lock` held as annotated (`__acquires(&bdev->bd_queue->queue_lock)`), and `blkg_conf_exit()` remains responsible for releasing `rq_qos_mutex` and `queue_lock` (block/blk-cgroup.c:992-1007). The new `blkcg_mutex` is released inside `blkg_conf_prep()` before returning, so external APIs don't change.
- Small, contained change in a single subsystem file: only touches `block/blk-cgroup.c`, and only the `blkg_conf_prep()` path. No UAPI or architectural shifts; minimal surface area.
- Aligns with existing design: `blkcg_deactivate_policy()` already uses `q->blkcg_mutex` to serialize policy teardown with queue freeze (e.g., v6.6 has `mutex_lock(&q->blkcg_mutex)` in block/blk-cgroup.c:1625). Using the same mutex in `blkg_conf_prep()` is consistent and resolves the ordering with `rq_qos_mutex` (see the sketch after this list).
- Applicability to stable: affects stable branches that:
  - Hold `rq_qos_mutex` in `blkg_conf_open_bdev()` and use `blk_queue_enter()` in `blkg_conf_prep()` (e.g., v6.6).
  - Have `q->blkcg_mutex` and use it in `blkcg_deactivate_policy()`.
  It is not directly applicable to much older series (e.g., v6.1) where the blkcg prep logic and locks differ (no `blkcg_mutex` field there, and `blkg_conf_prep()` is structured differently); those branches may not exhibit the same lock chain or would need a different adaptation.
- Risk of regression is low:
  - The switch from `q_usage_counter` to `blkcg_mutex` is targeted at the specific policy-update path and matches the existing deactivate-policy synchronization.
  - The `GFP_NOIO` change is safer under queue freeze; allocation failure is handled by the existing error paths (-ENOMEM), and the -EBUSY retry logic remains intact for other sources.
- Process signals quality: Reviewed-by from Ming Lei and Signed-off-by from Jens Axboe; this is a well-scoped block layer fix.
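As referenced in the bullets above, here is a condensed sketch of the post-fix locking shape. The function name is hypothetical and it assumes the in-tree context of block/blk-cgroup.c (where `blkg_alloc()` is a static helper and `struct request_queue` exposes `blkcg_mutex`); lookup/creation logic, radix-tree preloading, and error unwinding are elided, and on success the real function returns with `queue_lock` still held for the caller.

static int conf_prep_locking_sketch(struct request_queue *q,
				    struct blkcg *blkcg, struct gendisk *disk)
{
	struct blkcg_gq *new_blkg;

	/* was: blk_queue_enter(q, 0); serializes with policy deactivation */
	mutex_lock(&q->blkcg_mutex);
	spin_lock_irq(&q->queue_lock);

	/* ... lookup; when a new blkg must be allocated, drop the lock ... */
	spin_unlock_irq(&q->queue_lock);

	/*
	 * GFP_NOIO instead of GFP_KERNEL: blkcg_deactivate_policy()
	 * freezes the queue and then takes blkcg_mutex, so reclaim
	 * under this mutex must not issue IO against the frozen queue.
	 */
	new_blkg = blkg_alloc(blkcg, disk, GFP_NOIO);

	spin_lock_irq(&q->queue_lock);
	/* ... install or free new_blkg under queue_lock ... */
	spin_unlock_irq(&q->queue_lock);

	/* was: blk_queue_exit(q) on both success and failure paths */
	mutex_unlock(&q->blkcg_mutex);
	return new_blkg ? 0 : -ENOMEM;
}

The key property is that the config path no longer touches `q_usage_counter` at all, so the `rq_qos_mutex --> q_usage_counter` edge of the lockdep chain disappears and the cycle is broken.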
Conclusion: This is a clear bug fix that removes a proven deadlock with a minimal, well-aligned locking change. It should be backported to stable trees where the affected locking and function structure exist (e.g., 6.6+). It is not a straightforward backport to older trees lacking `blkcg_mutex` and the newer prep/open split, so applicability should be limited to branches that already have `rq_qos_mutex` in `blkg_conf_open_bdev()` and `blk_queue_enter()` in `blkg_conf_prep()`.
 block/blk-cgroup.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fe9ebd6a2e14d..6e89e64c71be0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -883,14 +883,8 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	disk = ctx->bdev->bd_disk;
 	q = disk->queue;
 
-	/*
-	 * blkcg_deactivate_policy() requires queue to be frozen, we can grab
-	 * q_usage_counter to prevent concurrent with blkcg_deactivate_policy().
-	 */
-	ret = blk_queue_enter(q, 0);
-	if (ret)
-		goto fail;
-
+	/* Prevent concurrent with blkcg_deactivate_policy() */
+	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 
 	if (!blkcg_policy_enabled(q, pol)) {
@@ -920,16 +914,16 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 		/* Drop locks to do new blkg allocation with GFP_KERNEL. */
 		spin_unlock_irq(&q->queue_lock);
 
-		new_blkg = blkg_alloc(pos, disk, GFP_KERNEL);
+		new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto fail_exit_queue;
+			goto fail_exit;
 		}
 
 		if (radix_tree_preload(GFP_KERNEL)) {
 			blkg_free(new_blkg);
 			ret = -ENOMEM;
-			goto fail_exit_queue;
+			goto fail_exit;
 		}
 
 		spin_lock_irq(&q->queue_lock);
@@ -957,7 +951,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 			goto success;
 	}
 success:
-	blk_queue_exit(q);
+	mutex_unlock(&q->blkcg_mutex);
 	ctx->blkg = blkg;
 	return 0;
 
@@ -965,9 +959,8 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 	radix_tree_preload_end();
 fail_unlock:
 	spin_unlock_irq(&q->queue_lock);
-fail_exit_queue:
-	blk_queue_exit(q);
-fail:
+fail_exit:
+	mutex_unlock(&q->blkcg_mutex);
 	/*
 	 * If queue was bypassing, we should retry. Do so after a
 	 * short msleep(). It isn't strictly necessary but queue