Hello,
This series contains upstream BFQ fixes that didn't apply cleanly to the 5.4 stable tree and needed some massaging before they applied. The result passed some cgroup testing with BFQ, and the backport is based on the one we have in our SLES kernel, so I'm reasonably confident things are fine.
Honza
commit c1cee4ab36acef271be9101590756ed0c0c374d9 upstream.
It can happen that the parent of a bfqq changes between the moment we decide two queues are worth merging (and set bic->stable_merge_bfqq) and the moment bfq_setup_merge() is called. This can happen e.g. because the process submitted IO for a different cgroup and thus bfqq got reparented. It can even happen that the bfqq we are merging with has a parent cgroup that is already offline and going to be destroyed, in which case the merge can lead to use-after-free issues such as:
BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50
Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544

CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G E 5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x46/0x5a
 print_address_description.constprop.0+0x1f/0x140
 ? __bfq_deactivate_entity+0x9cb/0xa50
 kasan_report.cold+0x7f/0x11b
 ? __bfq_deactivate_entity+0x9cb/0xa50
 __bfq_deactivate_entity+0x9cb/0xa50
 ? update_curr+0x32f/0x5d0
 bfq_deactivate_entity+0xa0/0x1d0
 bfq_del_bfqq_busy+0x28a/0x420
 ? resched_curr+0x116/0x1d0
 ? bfq_requeue_bfqq+0x70/0x70
 ? check_preempt_wakeup+0x52b/0xbc0
 __bfq_bfqq_expire+0x1a2/0x270
 bfq_bfqq_expire+0xd16/0x2160
 ? try_to_wake_up+0x4ee/0x1260
 ? bfq_end_wr_async_queues+0xe0/0xe0
 ? _raw_write_unlock_bh+0x60/0x60
 ? _raw_spin_lock_irq+0x81/0xe0
 bfq_idle_slice_timer+0x109/0x280
 ? bfq_dispatch_request+0x4870/0x4870
 __hrtimer_run_queues+0x37d/0x700
 ? enqueue_hrtimer+0x1b0/0x1b0
 ? kvm_clock_get_cycles+0xd/0x10
 ? ktime_get_update_offsets_now+0x6f/0x280
 hrtimer_interrupt+0x2c8/0x740
Fix the problem by checking that the parent of the two bfqqs we are merging in bfq_setup_merge() is the same.
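To illustrate the guard outside the kernel, here is a minimal user-space sketch of the idea (the structures and names below are invented for the example, this is not the kernel code): a merge decided earlier is only honoured if both queues still hang off the same parent.

#include <stdio.h>

struct entity { struct entity *parent; };
struct queue  { struct entity entity; int pid; };

/* Refuse to merge queues whose parents diverged since the decision. */
static struct queue *setup_merge(struct queue *bfqq, struct queue *new_bfqq)
{
	if (new_bfqq->entity.parent != bfqq->entity.parent)
		return NULL;	/* e.g. one queue was reparented by a cgroup move */
	return new_bfqq;	/* same parent, merging is still safe */
}

int main(void)
{
	struct entity group_a = { 0 }, group_b = { 0 };
	struct queue q1 = { { &group_a }, 1 }, q2 = { { &group_a }, 2 };

	printf("same parent  -> %p\n", (void *)setup_merge(&q1, &q2));
	q2.entity.parent = &group_b;	/* q2 moved to another group meanwhile */
	printf("moved parent -> %p\n", (void *)setup_merge(&q1, &q2));
	return 0;
}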
Link: https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
CC: stable@vger.kernel.org
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bdfa707ce236..29c15079c4d5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2509,6 +2509,14 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
 	if (process_refs == 0 || new_process_refs == 0)
 		return NULL;
 
+	/*
+	 * Make sure merged queues belong to the same parent. Parents could
+	 * have changed since the time we decided the two queues are suitable
+	 * for merging.
+	 */
+	if (new_bfqq->entity.parent != bfqq->entity.parent)
+		return NULL;
+
 	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
 		new_bfqq->pid);
commit fc84e1f941b91221092da5b3102ec82da24c5673 upstream.
In bfq_insert_request() we unlock bfqd->lock only to call trace_block_rq_insert() and then lock bfqd->lock again. This is really pointless: tracing is disabled if we really care about performance, and even if the tracepoint is enabled, it is a quick call.
CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 3 ---
 1 file changed, 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 29c15079c4d5..ed25bcd1e820 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5529,11 +5529,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		return;
 	}
 
-	spin_unlock_irq(&bfqd->lock);
-
 	blk_mq_sched_request_inserted(rq);
 
-	spin_lock_irq(&bfqd->lock);
 	bfqq = bfq_init_rq(rq);
 	if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
 		if (at_head)
commit 5f550ede5edf846ecc0067be1ba80514e6fe7f8e upstream.
We call bfq_init_rq() from request merging functions where the requests we get should have already gone through bfq_init_rq() during insert, and anyway we want to do anything only if the request is already tracked by BFQ. So replace the calls to bfq_init_rq() with RQ_BFQQ() to simply skip requests untracked by BFQ. We also move the bfq_init_rq() call in bfq_insert_request() a bit earlier so that it covers request merging and thus we can transfer the FIFO position in case of a merge.
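As a rough user-space illustration of the distinction this relies on (names are made up, not the kernel implementation): an RQ_BFQQ()-style accessor only reads back whatever queue insertion has already attached to the request, so merge callbacks can simply bail out when the request is not tracked yet.

#include <stdio.h>
#include <stddef.h>

struct bfq_queue { int pid; };
struct request { struct bfq_queue *elv_priv; };

/* Stand-in for RQ_BFQQ(): NULL means the request is not tracked. */
static struct bfq_queue *rq_bfqq(struct request *rq)
{
	return rq->elv_priv;
}

static void requests_merged(struct request *rq, struct request *next)
{
	struct bfq_queue *bfqq = rq_bfqq(rq);
	struct bfq_queue *next_bfqq = rq_bfqq(next);

	if (!bfqq || !next_bfqq)
		return;		/* untracked request: nothing to transfer */
	printf("merged queues %d and %d\n", bfqq->pid, next_bfqq->pid);
}

int main(void)
{
	struct bfq_queue q = { 42 };
	struct request tracked = { &q }, untracked = { NULL };

	requests_merged(&tracked, &untracked);	/* silently skipped */
	requests_merged(&tracked, &tracked);	/* prints "merged queues 42 and 42" */
	return 0;
}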
CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-iosched.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ed25bcd1e820..f91c9bb687a8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2267,8 +2267,6 @@ static int bfq_request_merge(struct request_queue *q, struct request **req,
 	return ELEVATOR_NO_MERGE;
 }
 
-static struct bfq_queue *bfq_init_rq(struct request *rq);
-
 static void bfq_request_merged(struct request_queue *q, struct request *req,
 			       enum elv_merge type)
 {
@@ -2277,7 +2275,7 @@ static void bfq_request_merged(struct request_queue *q, struct request *req,
 	    blk_rq_pos(req) <
 	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
 				    struct request, rb_node))) {
-		struct bfq_queue *bfqq = bfq_init_rq(req);
+		struct bfq_queue *bfqq = RQ_BFQQ(req);
 		struct bfq_data *bfqd;
 		struct request *prev, *next_rq;
 
@@ -2329,8 +2327,8 @@ static void bfq_request_merged(struct request_queue *q, struct request *req,
 static void bfq_requests_merged(struct request_queue *q, struct request *rq,
 				struct request *next)
 {
-	struct bfq_queue *bfqq = bfq_init_rq(rq),
-		*next_bfqq = bfq_init_rq(next);
+	struct bfq_queue *bfqq = RQ_BFQQ(rq),
+		*next_bfqq = RQ_BFQQ(next);
 
 	if (!bfqq)
 		return;
@@ -5514,6 +5512,8 @@ static inline void bfq_update_insert_stats(struct request_queue *q,
 					   unsigned int cmd_flags) {}
 #endif /* CONFIG_BFQ_CGROUP_DEBUG */
 
+static struct bfq_queue *bfq_init_rq(struct request *rq);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       bool at_head)
 {
@@ -5524,6 +5524,7 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	unsigned int cmd_flags;
 
 	spin_lock_irq(&bfqd->lock);
+	bfqq = bfq_init_rq(rq);
 	if (blk_mq_sched_try_insert_merge(q, rq)) {
 		spin_unlock_irq(&bfqd->lock);
 		return;
@@ -5531,7 +5532,6 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	blk_mq_sched_request_inserted(rq);
 
-	bfqq = bfq_init_rq(rq);
 	if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
 		if (at_head)
 			list_add(&rq->queuelist, &bfqd->dispatch);
commit 4e54a2493e582361adc3bfbf06c7d50d19d18837 upstream.
BFQ's usage of __bio_blkcg() is a relic from the past. Furthermore, if a bio were not associated with any blkcg, the usage of __bio_blkcg() in BFQ would be prone to races with the task being migrated between cgroups, as __bio_blkcg() calls at different places could return different blkcgs.
Convert BFQ to the new situation where bio->bi_blkg is initialized in bio_set_dev() and thus practically always valid. This allows us to save the blkcg_gq lookup and noticeably simplify the code.
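A toy user-space model of the difference (all names invented for the example): a lookup that consults the submitting task on every call can return different answers if the task migrates, while a reference pinned in the bio when it is set up stays stable for the bio's lifetime.

#include <stdio.h>

struct blkcg { const char *name; };

static struct blkcg cg_old = { "old" }, cg_new = { "new" };
static struct blkcg *task_blkcg = &cg_old;	/* the submitting task's cgroup */

/* Old style, in the spirit of __bio_blkcg(): ask the task again each time. */
static struct blkcg *blkcg_from_task(void)
{
	return task_blkcg;
}

int main(void)
{
	/* New style: pin the association once, when the bio is prepared. */
	struct blkcg *pinned = task_blkcg;

	struct blkcg *first = blkcg_from_task();
	task_blkcg = &cg_new;			/* the task migrates to another cgroup */
	struct blkcg *second = blkcg_from_task();

	printf("per-call lookups: %s then %s (can disagree)\n", first->name, second->name);
	printf("pinned reference: %s (stable)\n", pinned->name);
	return 0;
}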
CC: stable@vger.kernel.org
Fixes: 0fe061b9f03c ("blkcg: fix ref count issue with bio_blkcg() using task_css")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-cgroup.c  | 63 +++++++++++++++++----------------------------
 block/bfq-iosched.c | 10 +--------
 block/bfq-iosched.h |  3 +--
 3 files changed, 25 insertions(+), 51 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 095f2f65bc16..3835cd920587 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -565,27 +565,11 @@ static void bfq_group_set_parent(struct bfq_group *bfqg,
 	entity->sched_data = &parent->sched_data;
 }
 
-static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd,
-					 struct blkcg *blkcg)
+static void bfq_link_bfqg(struct bfq_data *bfqd, struct bfq_group *bfqg)
 {
-	struct blkcg_gq *blkg;
-
-	blkg = blkg_lookup(blkcg, bfqd->queue);
-	if (likely(blkg))
-		return blkg_to_bfqg(blkg);
-	return NULL;
-}
-
-struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
-				     struct blkcg *blkcg)
-{
-	struct bfq_group *bfqg, *parent;
+	struct bfq_group *parent;
 	struct bfq_entity *entity;
 
-	bfqg = bfq_lookup_bfqg(bfqd, blkcg);
-	if (unlikely(!bfqg))
-		return NULL;
-
 	/*
 	 * Update chain of bfq_groups as we might be handling a leaf group
 	 * which, along with some of its relatives, has not been hooked yet
@@ -602,8 +586,15 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
 			bfq_group_set_parent(curr_bfqg, parent);
 		}
 	}
+}
 
-	return bfqg;
+struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
+{
+	struct blkcg_gq *blkg = bio->bi_blkg;
+
+	if (!blkg)
+		return bfqd->root_group;
+	return blkg_to_bfqg(blkg);
 }
 
 /**
@@ -679,25 +670,15 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
  * Move bic to blkcg, assuming that bfqd->lock is held; which makes
  * sure that the reference to cgroup is valid across the call (see
  * comments in bfq_bic_update_cgroup on this issue)
- *
- * NOTE: an alternative approach might have been to store the current
- * cgroup in bfqq and getting a reference to it, reducing the lookup
- * time here, at the price of slightly more complex code.
  */
-static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
-						 struct bfq_io_cq *bic,
-						 struct blkcg *blkcg)
+static void *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+				     struct bfq_io_cq *bic,
+				     struct bfq_group *bfqg)
 {
 	struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
 	struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
-	struct bfq_group *bfqg;
 	struct bfq_entity *entity;
 
-	bfqg = bfq_find_set_group(bfqd, blkcg);
-
-	if (unlikely(!bfqg))
-		bfqg = bfqd->root_group;
-
 	if (async_bfqq) {
 		entity = &async_bfqq->entity;
 
@@ -749,20 +730,24 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
 void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
 {
 	struct bfq_data *bfqd = bic_to_bfqd(bic);
-	struct bfq_group *bfqg = NULL;
+	struct bfq_group *bfqg = bfq_bio_bfqg(bfqd, bio);
 	uint64_t serial_nr;
 
-	rcu_read_lock();
-	serial_nr = __bio_blkcg(bio)->css.serial_nr;
+	serial_nr = bfqg_to_blkg(bfqg)->blkcg->css.serial_nr;
 
 	/*
 	 * Check whether blkcg has changed.  The condition may trigger
 	 * spuriously on a newly created cic but there's no harm.
 	 */
	if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
-		goto out;
+		return;
 
-	bfqg = __bfq_bic_change_cgroup(bfqd, bic, __bio_blkcg(bio));
+	/*
+	 * New cgroup for this process. Make sure it is linked to bfq internal
+	 * cgroup hierarchy.
+	 */
+	bfq_link_bfqg(bfqd, bfqg);
+	__bfq_bic_change_cgroup(bfqd, bic, bfqg);
 	/*
 	 * Update blkg_path for bfq_log_* functions. We cache this
 	 * path, and update it here, for the following
@@ -815,8 +800,6 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
 	 */
 	blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path));
 	bic->blkcg_serial_nr = serial_nr;
-out:
-	rcu_read_unlock();
 }
 
 /**
@@ -1433,7 +1416,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
 	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
 }
 
-struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, struct blkcg *blkcg)
+struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 {
 	return bfqd->root_group;
 }
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f91c9bb687a8..962701d3f46b 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5158,14 +5158,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	struct bfq_queue *bfqq;
 	struct bfq_group *bfqg;
 
-	rcu_read_lock();
-
-	bfqg = bfq_find_set_group(bfqd, __bio_blkcg(bio));
-	if (!bfqg) {
-		bfqq = &bfqd->oom_bfqq;
-		goto out;
-	}
-
+	bfqg = bfq_bio_bfqg(bfqd, bio);
 	if (!is_sync) {
 		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
 						  ioprio);
@@ -5209,7 +5202,6 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 out:
 	bfqq->ref++; /* get a process reference to this queue */
 	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, bfqq->ref);
-	rcu_read_unlock();
 	return bfqq;
 }
 
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index be1f4c1febf8..f6cc2b418008 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -978,8 +978,7 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg);
 void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio);
 void bfq_end_wr_async(struct bfq_data *bfqd);
-struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
-				     struct blkcg *blkcg);
+struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio);
 struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node);
commit 075a53b78b815301f8d3dd1ee2cd99554e34f0dd upstream.
Bios queued into the BFQ IO scheduler can be associated with a cgroup that was already offlined. This may then cause insertion of this bfq_group into a service tree. But this bfq_group will get freed as soon as the last bio associated with it completes, leading to use-after-free issues for service tree users. Fix the problem by making sure we always operate on an online bfq_group. If the bfq_group associated with the bio is not online, we pick the first online parent.
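The lookup this introduces can be sketched in plain user-space C (illustrative types only, not the kernel code): climb the parent chain until a still-online group is found, otherwise fall back to the root group.

#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>

struct group {
	struct group *parent;
	bool online;
	const char *name;
};

static struct group root = { NULL, true, "root" };

static struct group *first_online_group(struct group *g)
{
	while (g) {
		if (g->online)
			return g;	/* first still-online ancestor wins */
		g = g->parent;
	}
	return &root;			/* nothing online: fall back to root */
}

int main(void)
{
	struct group mid  = { &root, true,  "mid"  };
	struct group leaf = { &mid,  false, "leaf" };	/* this cgroup went offline */

	printf("picked: %s\n", first_online_group(&leaf)->name);	/* prints "mid" */
	return 0;
}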
CC: stable@vger.kernel.org
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bfq-cgroup.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 3835cd920587..09d721b1f6ac 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -591,10 +591,19 @@ static void bfq_link_bfqg(struct bfq_data *bfqd, struct bfq_group *bfqg)
 struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
 {
 	struct blkcg_gq *blkg = bio->bi_blkg;
+	struct bfq_group *bfqg;
 
-	if (!blkg)
-		return bfqd->root_group;
-	return blkg_to_bfqg(blkg);
+	while (blkg) {
+		bfqg = blkg_to_bfqg(blkg);
+		if (bfqg->online) {
+			bio_associate_blkg_from_css(bio, &blkg->blkcg->css);
+			return bfqg;
+		}
+		blkg = blkg->parent;
+	}
+	bio_associate_blkg_from_css(bio,
+				&bfqg_to_blkg(bfqd->root_group)->blkcg->css);
+	return bfqd->root_group;
 }
 
 /**
commit 22b106e5355d6e7a9c3b5cb5ed4ef22ae585ea94 upstream.
Commit d92c370a16cb ("block: really clone the block cgroup in bio_clone_blkg_association") changed bio_clone_blkg_association() to just clone bio->bi_blkg reference from source to destination bio. This is however wrong if the source and destination bios are against different block devices because struct blkcg_gq is different for each bdev-blkcg pair. This will result in IOs being accounted (and throttled as a result) multiple times against the same device (src bdev) while throttling of the other device (dst bdev) is ignored. In case of BFQ the inconsistency can even result in crashes in bfq_bic_update_cgroup(). Fix the problem by looking up correct blkcg_gq for the cloned bio.
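A simplified user-space model of why the clone must not just copy the pointer (all names are hypothetical): a blkcg_gq conceptually exists per (blkcg, block device) pair, so the destination bio has to look up the gq that matches its own device under the same blkcg.

#include <stdio.h>

struct blkcg { const char *name; };
struct device { const char *name; };
struct blkg { struct blkcg *blkcg; struct device *dev; };

/* Stand-in for a per-device lookup such as bio_associate_blkg_from_css(). */
static struct blkg lookup_blkg(struct blkcg *cg, struct device *dev)
{
	struct blkg g = { cg, dev };
	return g;
}

int main(void)
{
	struct blkcg cg = { "grp" };
	struct device src_dev = { "sda" }, dst_dev = { "sdb" };

	struct blkg src_blkg = lookup_blkg(&cg, &src_dev);

	/* Buggy clone: the destination now accounts its IO against sda. */
	struct blkg wrong = src_blkg;

	/* Fixed clone: same blkcg, but the gq matching the destination's device. */
	struct blkg right = lookup_blkg(src_blkg.blkcg, &dst_dev);

	printf("wrong: %s on %s, right: %s on %s\n",
	       wrong.blkcg->name, wrong.dev->name,
	       right.blkcg->name, right.dev->name);
	return 0;
}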
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d92c370a16cb ("block: really clone the block cgroup in bio_clone_blkg_association")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index 40004a3631a8..08dbdc32ceaa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2179,7 +2179,7 @@ void bio_clone_blkg_association(struct bio *dst, struct bio *src)
 	rcu_read_lock();
 
 	if (src->bi_blkg)
-		__bio_associate_blkg(dst, src->bi_blkg);
+		bio_associate_blkg_from_css(dst, &bio_blkcg(src)->css);
 
 	rcu_read_unlock();
 }
On Tue, Jun 07, 2022 at 11:15:08AM +0200, Jan Kara wrote:
> Hello,
>
> This series contains upstream BFQ fixes that didn't apply cleanly to the
> 5.4 stable tree and needed some massaging before they applied. The result
> passed some cgroup testing with BFQ, and the backport is based on the one
> we have in our SLES kernel, so I'm reasonably confident things are fine.
All now queued up, thanks for the backports!
greg k-h