Hi,
When a request is dispatched to the LLD via dm-rq and the result is BLK_STS_*RESOURCE, dm-rq frees the request. However, the LLD may have allocated private data for this request, so freeing it this way leaks that private data.
Add a .cleanup_rq() callback and implement it in SCSI to fix the issue, since SCSI is the only driver that allocates private request data in the .queue_rq() path.
Another use case for this callback is freeing the request and re-submitting its bios during CPU hotplug when the hctx is dead; see the following link:
https://lore.kernel.org/linux-block/f122e8f2-5ede-2d83-9ca0-bc713ce66d01@hua...
V2:
	- run .cleanup_rq() in blk_mq_free_request(), as suggested by Mike

Ming Lei (2):
  blk-mq: add callback of .cleanup_rq
  scsi: implement .cleanup_rq callback

 block/blk-mq.c          |  3 +++
 drivers/scsi/scsi_lib.c | 28 ++++++++++++++++++++--------
 include/linux/blk-mq.h  |  7 +++++++
 3 files changed, 30 insertions(+), 8 deletions(-)

Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: stable@vger.kernel.org
Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
dm-rq needs to free a request that has been dispatched to, but not completed by, the underlying queue. However, the underlying queue may have allocated private data for this request in .queue_rq(), so that private part is leaked.
Add a new callback, .cleanup_rq(), for freeing the request's private data.
Another use case is freeing the request when the hctx is dead during CPU hotplug.
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: stable@vger.kernel.org
Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 3 +++
 include/linux/blk-mq.h | 7 +++++++
 2 files changed, 10 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b038ec680e84..fc38d95c557f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -502,6 +502,9 @@ void blk_mq_free_request(struct request *rq)
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
+	if (q->mq_ops->cleanup_rq)
+		q->mq_ops->cleanup_rq(rq);
+
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.finish_request)
 			e->type->ops.finish_request(rq);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3fa1fa59f9b2..5882675c1989 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,7 @@ typedef int (poll_fn)(struct blk_mq_hw_ctx *);
 typedef int (map_queues_fn)(struct blk_mq_tag_set *set);
 typedef bool (busy_fn)(struct request_queue *);
 typedef void (complete_fn)(struct request *);
+typedef void (cleanup_rq_fn)(struct request *);
 
 struct blk_mq_ops {
@@ -200,6 +201,12 @@ struct blk_mq_ops {
 	/* Called from inside blk_get_request() */
 	void (*initialize_rq_fn)(struct request *rq);
 
+	/*
+	 * Called before freeing one request which isn't completed yet,
+	 * and usually for freeing the driver private part
+	 */
+	cleanup_rq_fn		*cleanup_rq;
+
 	/*
 	 * If set, returns whether or not this queue currently is busy
 	 */
On 7/19/19 8:06 PM, Ming Lei wrote:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b038ec680e84..fc38d95c557f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -502,6 +502,9 @@ void blk_mq_free_request(struct request *rq)
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
+	if (q->mq_ops->cleanup_rq)
+		q->mq_ops->cleanup_rq(rq);
+
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.finish_request)
 			e->type->ops.finish_request(rq);
I'm concerned about the performance impact of this change. How about not introducing .cleanup_rq() and adding a call to scsi_mq_uninit_cmd() in scsi_queue_rq() just before that function returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE?
Thanks,
Bart.
On Mon, Jul 22, 2019 at 09:51:27AM -0700, Bart Van Assche wrote:
On 7/19/19 8:06 PM, Ming Lei wrote:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b038ec680e84..fc38d95c557f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -502,6 +502,9 @@ void blk_mq_free_request(struct request *rq)
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
+	if (q->mq_ops->cleanup_rq)
+		q->mq_ops->cleanup_rq(rq);
+
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.finish_request)
 			e->type->ops.finish_request(rq);
I'm concerned about the performance impact of this change. How about not
I don't see any performance impact in my tests, and q->mq_ops should be in the data cache at that time.
introducing .cleanup_rq() and adding a call to scsi_mq_uninit_cmd() in scsi_queue_rq() just before that function returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE?
The problem is that only dm-rq needs the request's private data to be freed when BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE is returned. If we did that unconditionally in scsi_queue_rq(), the performance impact might be visible.
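For reference, the suggested alternative would look roughly like the sketch below. It is a hypothetical helper (the name and factoring are mine, not from this series or from scsi_lib.c), shown only to make the trade-off concrete: it undoes command setup directly in scsi_queue_rq()'s out-of-resource path, so it runs on every BLK_STS_*RESOURCE return instead of only in the rare case where the request is finally freed without being completed.

/*
 * Hypothetical alternative, not part of this series: tear the command down
 * in scsi_queue_rq() itself whenever a resource error is returned.  Against
 * the pre-series code, the caller checks and clears RQF_DONTPREP itself.
 */
static blk_status_t scsi_queue_rq_resource_err(struct scsi_cmnd *cmd,
					       blk_status_t ret)
{
	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
		if (cmd->request->rq_flags & RQF_DONTPREP) {
			cmd->request->rq_flags &= ~RQF_DONTPREP;
			scsi_mq_uninit_cmd(cmd);
		}
	}
	return ret;
}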
Thanks, Ming
Implement the .cleanup_rq() callback to free the driver-private part of the request. This avoids leaking the request's private data when the request isn't completed by SCSI and is eventually freed by blk-mq or an upper layer (such as dm-rq).
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: stable@vger.kernel.org
Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e1da8c70a266..52537c145762 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -154,12 +154,9 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 
 static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 {
-	if (cmd->request->rq_flags & RQF_DONTPREP) {
-		cmd->request->rq_flags &= ~RQF_DONTPREP;
-		scsi_mq_uninit_cmd(cmd);
-	} else {
-		WARN_ON_ONCE(true);
-	}
+	WARN_ON_ONCE(!(cmd->request->rq_flags & RQF_DONTPREP));
+
+	scsi_mq_uninit_cmd(cmd);
 	blk_mq_requeue_request(cmd->request, true);
 }
 
@@ -563,9 +560,13 @@ static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
 {
+	if (!(cmd->request->rq_flags & RQF_DONTPREP))
+		return;
+
 	scsi_mq_free_sgtables(cmd);
 	scsi_uninit_cmd(cmd);
 	scsi_del_cmd_from_list(cmd);
+	cmd->request->rq_flags &= ~RQF_DONTPREP;
 }
 
 /* Returns false when no more bytes to process, true if there are more */
@@ -1089,6 +1090,17 @@ static void scsi_initialize_rq(struct request *rq)
 	cmd->retries = 0;
 }
 
+/*
+ * Only called when the request isn't completed by SCSI, and not freed by
+ * SCSI
+ */
+static void scsi_cleanup_rq(struct request *rq)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	scsi_mq_uninit_cmd(cmd);
+}
+
 /* Add a command to the list used by the aacraid and dpt_i2o drivers */
 void scsi_add_cmd_to_list(struct scsi_cmnd *cmd)
 {
@@ -1708,8 +1720,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 		 * we hit an error, as we will never see this command
 		 * again.
 		 */
-		if (req->rq_flags & RQF_DONTPREP)
-			scsi_mq_uninit_cmd(cmd);
+		scsi_mq_uninit_cmd(cmd);
 		break;
 	}
 	return ret;
@@ -1816,6 +1827,7 @@ static const struct blk_mq_ops scsi_mq_ops = {
 	.init_request	= scsi_mq_init_request,
 	.exit_request	= scsi_mq_exit_request,
 	.initialize_rq_fn = scsi_initialize_rq,
+	.cleanup_rq	= scsi_cleanup_rq,
 	.busy		= scsi_mq_lld_busy,
 	.map_queues	= scsi_map_queues,
 };
On 7/19/19 8:06 PM, Ming Lei wrote:
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e1da8c70a266..52537c145762 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -154,12 +154,9 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 
 static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 {
-	if (cmd->request->rq_flags & RQF_DONTPREP) {
-		cmd->request->rq_flags &= ~RQF_DONTPREP;
-		scsi_mq_uninit_cmd(cmd);
-	} else {
-		WARN_ON_ONCE(true);
-	}
+	WARN_ON_ONCE(!(cmd->request->rq_flags & RQF_DONTPREP));
+
+	scsi_mq_uninit_cmd(cmd);
 	blk_mq_requeue_request(cmd->request, true);
 }
The above changes are independent of this patch series. Have you considered moving them into a separate patch?
+/*
+ * Only called when the request isn't completed by SCSI, and not freed by
+ * SCSI
+ */
+static void scsi_cleanup_rq(struct request *rq)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	scsi_mq_uninit_cmd(cmd);
+}
Is the comment above this function correct? The previous patch adds an unconditional call to mq_ops->cleanup_rq() in blk_mq_free_request().
Thanks,
Bart.
On Mon, Jul 22, 2019 at 08:40:23AM -0700, Bart Van Assche wrote:
On 7/19/19 8:06 PM, Ming Lei wrote:
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e1da8c70a266..52537c145762 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -154,12 +154,9 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 
 static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 {
-	if (cmd->request->rq_flags & RQF_DONTPREP) {
-		cmd->request->rq_flags &= ~RQF_DONTPREP;
-		scsi_mq_uninit_cmd(cmd);
-	} else {
-		WARN_ON_ONCE(true);
-	}
+	WARN_ON_ONCE(!(cmd->request->rq_flags & RQF_DONTPREP));
+
+	scsi_mq_uninit_cmd(cmd);
 	blk_mq_requeue_request(cmd->request, true);
 }
The above changes are independent of this patch series. Have you considered moving them into a separate patch?
OK.
+/*
+ * Only called when the request isn't completed by SCSI, and not freed by
+ * SCSI
+ */
+static void scsi_cleanup_rq(struct request *rq)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	scsi_mq_uninit_cmd(cmd);
+}
Is the comment above this function correct? The previous patch adds an unconditional call to mq_ops->cleanup_rq() in blk_mq_free_request().
You are right, will fix it in V3.
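For readers following the thread, the fix is only about the wording of that comment; the function body stays as in the V2 patch above. One possible rewording (an illustration only, not the actual V3 patch) could be:

/*
 * Called from blk_mq_free_request() whenever a request is freed, so that a
 * request which was prepared in .queue_rq() but never completed does not
 * leak its driver-private state.
 */
static void scsi_cleanup_rq(struct request *rq)
{
	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

	scsi_mq_uninit_cmd(cmd);
}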
Thanks, Ming
FYI, we noticed the following commit (built with gcc-7):
commit: ae86a1c5530b52dc44a280e78feb0c4eb2dd8595 ("[PATCH V2 2/2] scsi: implement .cleanup_rq callback")
url: https://github.com/0day-ci/linux/commits/Ming-Lei/blk-mq-add-callback-of-cle...
in testcase: blktests with following parameters:
disk: 1SSD
test: block-group1
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+----------------------------------------------------------------------------------------------------------------+------------+------------+
|                                                                                                                  | bd222ca85f | ae86a1c553 |
+----------------------------------------------------------------------------------------------------------------+------------+------------+
| boot_successes                                                                                                   | 0          | 0          |
| boot_failures                                                                                                    | 11         | 14         |
| BUG:kernel_reboot-without-warning_in_test_stage                                                                  | 11         | 1          |
| BUG:kernel_NULL_pointer_dereference,address                                                                      | 0          | 4          |
| Oops:#[##]                                                                                                       | 0          | 4          |
| RIP:scsi_queue_rq                                                                                                | 0          | 4          |
| Kernel_panic-not_syncing:Fatal_exception                                                                         | 0          | 4          |
| invoked_oom-killer:gfp_mask=0x                                                                                   | 0          | 9          |
| Mem-Info                                                                                                         | 0          | 9          |
| page_allocation_failure:order:#,mode:#(GFP_KERNEL|__GFP_RETRY_MAYFAIL),nodemask=(null),cpuset=/,mems_allowed=    | 0          | 2          |
+----------------------------------------------------------------------------------------------------------------+------------+------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <rong.a.chen@intel.com>
[ 140.974865] BUG: kernel NULL pointer dereference, address: 000000000000001c
[ 141.013422] #PF: supervisor read access in kernel mode
[ 141.034814] #PF: error_code(0x0000) - not-present page
[ 141.042285] sd 6:0:0:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 141.049589] PGD 0 P4D 0
[ 141.049616] Oops: 0000 [#1] SMP PTI
[ 141.049621] CPU: 1 PID: 384 Comm: kworker/1:1H Not tainted 5.2.0-gae86a1c5530b52 #1
[ 141.049623] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 141.049654] Workqueue: kblockd blk_mq_run_work_fn
[ 141.049684] RIP: 0010:scsi_queue_rq+0x267/0xa80
[ 141.049688] Code: b0 07 00 00 8d 4a fa 83 f9 01 0f 86 80 05 00 00 83 fa 04 0f 84 77 05 00 00 c7 83 2c 01 00 00 00 00 07 00 48 8b 93 30 02 00 00 <f6> 42 1c 80 0f 84 1d ff ff ff 48 8b 3c 24 88 44 24 08 e8 22 f5 ff
[ 141.049690] RSP: 0000:ffffab02c078fd00 EFLAGS: 00010297
[ 141.049713] RAX: 000000000000000a RBX: ffff92fe394b1200 RCX: 00000000fffffffd
[ 141.145130] sd 6:0:0:0: [sdo] Optimal transfer size 524288 bytes
[ 141.161078] RDX: 0000000000000000 RSI: ffffab02c078fd90 RDI: ffff92fe3971f400
[ 141.161080] RBP: ffff92fe39721000 R08: 00000000000003c0 R09: 8080808080808080
[ 141.161082] R10: ffffab02c0373858 R11: fefefefefefefeff R12: ffff92fe3971f400
[ 141.161084] R13: 0000000000000000 R14: ffff92fe3ef93918 R15: ffff92fe3946a800
[ 141.161108] FS: 0000000000000000(0000) GS:ffff92feffd00000(0000) knlGS:0000000000000000
[ 141.161110] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 141.161111] CR2: 000000000000001c CR3: 000000007f2ec000 CR4: 00000000000006e0
[ 141.161143] Call Trace:
[ 141.548583]  blk_mq_dispatch_rq_list+0x3c3/0x5c0
[ 141.563298]  ? syscall_return_via_sysret+0x10/0x7f
[ 141.577641]  ? __switch_to_asm+0x40/0x70
[ 141.603426]  ? elv_rb_del+0x1f/0x30
[ 141.615379]  ? deadline_remove_request+0x55/0xc0
[ 141.631520]  blk_mq_do_dispatch_sched+0x76/0x120
[ 141.647829]  blk_mq_sched_dispatch_requests+0x100/0x170
[ 141.667954]  __blk_mq_run_hw_queue+0x60/0x130
[ 141.697797]  process_one_work+0x19c/0x3b0
[ 141.714051]  worker_thread+0x3c/0x3b0
[ 141.735158]  ? process_one_work+0x3b0/0x3b0
[ 141.748670]  kthread+0x11e/0x140
[ 141.761030]  ? kthread_park+0xa0/0xa0
[ 141.784865]  ret_from_fork+0x35/0x40
[ 141.804669] Modules linked in: scsi_debug loop sr_mod cdrom sd_mod sg intel_rapl_msr ppdev bochs_drm drm_vram_helper ata_generic ttm pata_acpi drm_kms_helper syscopyarea sysfillrect sysimgblt snd_pcm fb_sys_fops snd_timer snd intel_rapl_common crc32c_intel soundcore joydev ata_piix pcspkr drm serio_raw virtio_scsi libata i2c_piix4 floppy parport_pc parport ip_tables
[ 141.928554] CR2: 000000000000001c
[ 141.944180] ---[ end trace 9f0c0bf804097727 ]---
To reproduce:
# build kernel
cd linux
cp config-5.2.0-gae86a1c5530b52 .config
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks, Rong Chen
Hi,
Thanks for your report!
On Thu, Jul 25, 2019 at 06:46:29PM +0800, kernel test robot wrote:
FYI, we noticed the following commit (built with gcc-7):
commit: ae86a1c5530b52dc44a280e78feb0c4eb2dd8595 ("[PATCH V2 2/2] scsi: implement .cleanup_rq callback")
url: https://github.com/0day-ci/linux/commits/Ming-Lei/blk-mq-add-callback-of-cle...
in testcase: blktests with following parameters:
disk: 1SSD
test: block-group1
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+----------------------------------------------------------------------------------------------------------------+------------+------------+
|                                                                                                                  | bd222ca85f | ae86a1c553 |
+----------------------------------------------------------------------------------------------------------------+------------+------------+
| boot_successes                                                                                                   | 0          | 0          |
| boot_failures                                                                                                    | 11         | 14         |
| BUG:kernel_reboot-without-warning_in_test_stage                                                                  | 11         | 1          |
| BUG:kernel_NULL_pointer_dereference,address                                                                      | 0          | 4          |
| Oops:#[##]                                                                                                       | 0          | 4          |
| RIP:scsi_queue_rq                                                                                                | 0          | 4          |
| Kernel_panic-not_syncing:Fatal_exception                                                                         | 0          | 4          |
| invoked_oom-killer:gfp_mask=0x                                                                                   | 0          | 9          |
| Mem-Info                                                                                                         | 0          | 9          |
| page_allocation_failure:order:#,mode:#(GFP_KERNEL|__GFP_RETRY_MAYFAIL),nodemask=(null),cpuset=/,mems_allowed=    | 0          | 2          |
+----------------------------------------------------------------------------------------------------------------+------------+------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <rong.a.chen@intel.com>
[ 140.974865] BUG: kernel NULL pointer dereference, address: 000000000000001c
Yeah, I know this issue, and it has been fixed in either V3 or V4.
Thanks, Ming
Hey Ming Lei,
On Sat, Jul 20, 2019 at 11:06:35AM +0800, Ming Lei wrote:
Hi,
When a request is dispatched to the LLD via dm-rq and the result is BLK_STS_*RESOURCE, dm-rq frees the request. However, the LLD may have allocated private data for this request, so freeing it this way leaks that private data.
I am confused about this. Probably because I am not up-to-date with all of blk-mq. But if you free the LLD private data before the request is finished, what is the LLD doing if the request finishes afterwards? Would that not be an automatic use-after-free?
Add a .cleanup_rq() callback and implement it in SCSI to fix the issue, since SCSI is the only driver that allocates private request data in the .queue_rq() path.
Another use case for this callback is freeing the request and re-submitting its bios during CPU hotplug when the hctx is dead; see the following link:
https://lore.kernel.org/linux-block/f122e8f2-5ede-2d83-9ca0-bc713ce66d01@hua...
V2:
- run .cleanup_rq() in blk_mq_free_request(), as suggested by Mike
On Fri, Jul 26, 2019 at 06:20:46PM +0200, Benjamin Block wrote:
Hey Ming Lei,
On Sat, Jul 20, 2019 at 11:06:35AM +0800, Ming Lei wrote:
Hi,
When a request is dispatched to the LLD via dm-rq and the result is BLK_STS_*RESOURCE, dm-rq frees the request. However, the LLD may have allocated private data for this request, so freeing it this way leaks that private data.
I am confused about this. Probably because I am not up-to-date with all of blk-mq. But if you free the LLD private data before the request is finished, what is the LLD doing if the request finishes afterwards? Would that not be an automatic use-after-free?
Wrt. this special use case: the underlying request is entirely owned by dm-rq after .queue_rq() returns BLK_STS_*RESOURCE, so the request won't be re-dispatched by blk-mq at all.
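To make the ownership transfer concrete, the dm-rq side looks roughly like the sketch below. It is simplified and uses an abbreviated function name (the real code lives in drivers/md/dm-rq.c and frees the clone through the target's release_clone_rq()): once the underlying queue returns BLK_STS_*RESOURCE, dm-rq takes the clone back, frees it, and requeues the original request, so the LLD never sees that clone again and there is no completion left to race with.

/*
 * Simplified sketch of the dm-rq dispatch path (not verbatim kernel code).
 * Freeing the clone ends up in blk_mq_free_request(), which is where the new
 * .cleanup_rq() hook lets the LLD release what it allocated in .queue_rq().
 */
static int dm_dispatch_clone_sketch(struct request *clone)
{
	blk_status_t ret = blk_insert_cloned_request(clone->q, clone);

	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
		/* dm-rq owns the clone again; it is not in flight anywhere */
		blk_rq_unprep_clone(clone);
		blk_put_request(clone);
		return DM_MAPIO_REQUEUE;	/* the original request is requeued */
	}

	return DM_MAPIO_REMAPPED;
}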
thanks, Ming