----------------------------------------------------------------- This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that. -----------------------------------------------------------------
Commit 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") introduced a NULL pointer dereference in generic_make_request(). The patch sets q to NULL and enter_succeeded to false; right after, there's an 'if (enter_succeeded)' which is not taken, and then the 'else' will dereference q in blk_queue_dying(q).
This patch just moves the 'q = NULL' to a point in which it won't trigger the oops, although the semantics of this NULLification remains untouched.
A simple test case/reproducer is as follows: a) Build kernel v4.19.56-stable with CONFIG_BLK_CGROUP=n.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member)
This will trigger the following oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:generic_make_request+0x32b/0x400 Call Trace: submit_bio+0x73/0x140 ext4_io_submit+0x4d/0x60 ext4_writepages+0x626/0xe90 do_writepages+0x4b/0xe0 [...]
This patch has no functional changes and preserves the md/raid0 behavior when a member is removed before kernel v4.17.
---------------------------- Why this is not on mainline? ----------------------------
The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[0] observed that there was a clean-up series ready to be accepted from Ming Lei[1] that fixed the same issue.
The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving:
1) Putting back blk_exit_queue() logic into __blk_release_queue(); that path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue().
2) Removing the guard/protection in generic_make_request_checks() with blk_queue_enter().
The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit.
[0] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [1] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
Cc: Christoph Hellwig hch@lst.de Cc: Jens Axboe axboe@kernel.dk Cc: Song Liu songliubraving@fb.com Reviewed-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Ming Lei ming.lei@redhat.com Tested-by: Eric Ren renzhengeek@gmail.com Fixes: 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") Signed-off-by: Guilherme G. Piccoli gpiccoli@canonical.com --- block/blk-core.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 6eed5d84c2ef..682bc561b77b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2445,10 +2445,8 @@ blk_qc_t generic_make_request(struct bio *bio) flags = 0; if (bio->bi_opf & REQ_NOWAIT) flags = BLK_MQ_REQ_NOWAIT; - if (blk_queue_enter(q, flags) < 0) { + if (blk_queue_enter(q, flags) < 0) enter_succeeded = false; - q = NULL; - } }
if (enter_succeeded) { @@ -2479,6 +2477,7 @@ blk_qc_t generic_make_request(struct bio *bio) bio_wouldblock_error(bio); else bio_io_error(bio); + q = NULL; } bio = bio_list_pop(&bio_list_on_stack[0]); } while (bio);
----------------------------------------------------------------- This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that. -----------------------------------------------------------------
Commit cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") introduced the flag BIO_QUEUE_ENTERED in order split bios bypass the blocking queue entering routine and use the live non-blocking version. It was a result of an extensive discussion in a linux-block thread[0], and the purpose of this change was to prevent a hung task waiting on a reference to drop.
Happens that md raid0 split bios all the time, and more important, it changes their underlying device to the raid member. After the change introduced by this flag's usage, we experience various crashes if a raid0 member is removed during a large write. This happens because the bio reaches the live queue entering function when the queue of the raid0 member is dying.
A simple reproducer of this behavior is presented below: a) Build kernel v4.19.56-stable with CONFIG_BLK_DEV_THROTTLING=y.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member)
This will trigger the following warning/oops:
------------[ cut here ]------------ BUG: unable to handle kernel NULL pointer dereference at 0000000000000155 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:blk_throtl_bio+0x45/0x970 [...] Call Trace: generic_make_request_checks+0x1bf/0x690 generic_make_request+0x64/0x3f0 raid0_make_request+0x184/0x620 [raid0] ? raid0_make_request+0x184/0x620 [raid0] md_handle_request+0x126/0x1a0 md_make_request+0x7b/0x180 generic_make_request+0x19e/0x3f0 submit_bio+0x73/0x140 [...]
This patch changes raid0 driver to fallback to the "old" blocking queue entering procedure, by clearing the BIO_QUEUE_ENTERED from raid0 bios. This prevents the crashes and restores the regular behavior of raid0 arrays when a member is removed during a large write.
[0] lore.kernel.org/linux-block/343bbbf6-64eb-879e-d19e-96aebb037d47@I-love.SAKURA.ne.jp
---------------------------- Why this is not on mainline? ----------------------------
The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[1] observed that there was a clean-up series ready to be accepted from Ming Lei[2] that fixed the same issue.
The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving:
1) Putting back blk_exit_queue() logic into __blk_release_queue(); that path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue().
2) Removing the guard/protection in generic_make_request_checks() with blk_queue_enter().
The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit.
[1] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [2] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
Cc: Christoph Hellwig hch@lst.de Cc: Jens Axboe axboe@kernel.dk Cc: Song Liu songliubraving@fb.com Cc: Ming Lei ming.lei@redhat.com Cc: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Fixes: cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") Signed-off-by: Guilherme G. Piccoli gpiccoli@canonical.com --- drivers/md/raid0.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index ac1cffd2a09b..f4daa56d204d 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -547,6 +547,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) trace_block_bio_remap(bdev_get_queue(rdev->bdev), discard_bio, disk_devt(mddev->gendisk), bio->bi_iter.bi_sector); + bio_clear_flag(bio, BIO_QUEUE_ENTERED); generic_make_request(discard_bio); } bio_endio(bio); @@ -602,6 +603,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, bio); mddev_check_write_zeroes(mddev, bio); + bio_clear_flag(bio, BIO_QUEUE_ENTERED); generic_make_request(bio); return true; }
On Fri, Jun 28, 2019 at 3:18 PM Guilherme G. Piccoli gpiccoli@canonical.com wrote:
This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that.
Commit cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") introduced the flag BIO_QUEUE_ENTERED in order split bios bypass the blocking queue entering routine and use the live non-blocking version. It was a result of an extensive discussion in a linux-block thread[0], and the purpose of this change was to prevent a hung task waiting on a reference to drop.
Happens that md raid0 split bios all the time, and more important, it changes their underlying device to the raid member. After the change introduced by this flag's usage, we experience various crashes if a raid0 member is removed during a large write. This happens because the bio reaches the live queue entering function when the queue of the raid0 member is dying.
A simple reproducer of this behavior is presented below: a) Build kernel v4.19.56-stable with CONFIG_BLK_DEV_THROTTLING=y.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member)
This will trigger the following warning/oops:
------------[ cut here ]------------ BUG: unable to handle kernel NULL pointer dereference at 0000000000000155 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:blk_throtl_bio+0x45/0x970 [...] Call Trace: generic_make_request_checks+0x1bf/0x690 generic_make_request+0x64/0x3f0 raid0_make_request+0x184/0x620 [raid0] ? raid0_make_request+0x184/0x620 [raid0] md_handle_request+0x126/0x1a0 md_make_request+0x7b/0x180 generic_make_request+0x19e/0x3f0 submit_bio+0x73/0x140 [...]
This patch changes raid0 driver to fallback to the "old" blocking queue entering procedure, by clearing the BIO_QUEUE_ENTERED from raid0 bios. This prevents the crashes and restores the regular behavior of raid0 arrays when a member is removed during a large write.
[0] lore.kernel.org/linux-block/343bbbf6-64eb-879e-d19e-96aebb037d47@I-love.SAKURA.ne.jp
Why this is not on mainline?
The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[1] observed that there was a clean-up series ready to be accepted from Ming Lei[2] that fixed the same issue.
The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving:
- Putting back blk_exit_queue() logic into __blk_release_queue(); that
path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue().
- Removing the guard/protection in generic_make_request_checks() with
blk_queue_enter().
The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit.
[1] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [2] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
Cc: Christoph Hellwig hch@lst.de Cc: Jens Axboe axboe@kernel.dk Cc: Song Liu songliubraving@fb.com Cc: Ming Lei ming.lei@redhat.com Cc: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Fixes: cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") Signed-off-by: Guilherme G. Piccoli gpiccoli@canonical.com
Acked-by: Song Liu songliubraving@fb.com
Thanks for the fix!
drivers/md/raid0.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index ac1cffd2a09b..f4daa56d204d 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -547,6 +547,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) trace_block_bio_remap(bdev_get_queue(rdev->bdev), discard_bio, disk_devt(mddev->gendisk), bio->bi_iter.bi_sector);
bio_clear_flag(bio, BIO_QUEUE_ENTERED); generic_make_request(discard_bio); } bio_endio(bio);
@@ -602,6 +603,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, bio); mddev_check_write_zeroes(mddev, bio);
bio_clear_flag(bio, BIO_QUEUE_ENTERED); generic_make_request(bio); return true;
}
2.22.0
On Fri, Jun 28, 2019 at 3:22 PM Guilherme G. Piccoli gpiccoli@canonical.com wrote:
This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that.
Commit 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") introduced a NULL pointer dereference in generic_make_request(). The patch sets q to NULL and enter_succeeded to false; right after, there's an 'if (enter_succeeded)' which is not taken, and then the 'else' will dereference q in blk_queue_dying(q).
This patch just moves the 'q = NULL' to a point in which it won't trigger the oops, although the semantics of this NULLification remains untouched.
A simple test case/reproducer is as follows: a) Build kernel v4.19.56-stable with CONFIG_BLK_CGROUP=n.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member)
This will trigger the following oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:generic_make_request+0x32b/0x400 Call Trace: submit_bio+0x73/0x140 ext4_io_submit+0x4d/0x60 ext4_writepages+0x626/0xe90 do_writepages+0x4b/0xe0 [...]
This patch has no functional changes and preserves the md/raid0 behavior when a member is removed before kernel v4.17.
Why this is not on mainline?
The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[0] observed that there was a clean-up series ready to be accepted from Ming Lei[1] that fixed the same issue.
The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving:
- Putting back blk_exit_queue() logic into __blk_release_queue(); that
path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue().
- Removing the guard/protection in generic_make_request_checks() with
blk_queue_enter().
The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit.
[0] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [1] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
Cc: Christoph Hellwig hch@lst.de Cc: Jens Axboe axboe@kernel.dk Cc: Song Liu songliubraving@fb.com Reviewed-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Ming Lei ming.lei@redhat.com Tested-by: Eric Ren renzhengeek@gmail.com Fixes: 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") Signed-off-by: Guilherme G. Piccoli gpiccoli@canonical.com
Acked-by: Song Liu songliubraving@fb.com
Thanks for the fix!
block/blk-core.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 6eed5d84c2ef..682bc561b77b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2445,10 +2445,8 @@ blk_qc_t generic_make_request(struct bio *bio) flags = 0; if (bio->bi_opf & REQ_NOWAIT) flags = BLK_MQ_REQ_NOWAIT;
if (blk_queue_enter(q, flags) < 0) {
if (blk_queue_enter(q, flags) < 0) enter_succeeded = false;
q = NULL;
} } if (enter_succeeded) {
@@ -2479,6 +2477,7 @@ blk_qc_t generic_make_request(struct bio *bio) bio_wouldblock_error(bio); else bio_io_error(bio);
q = NULL; } bio = bio_list_pop(&bio_list_on_stack[0]); } while (bio);
-- 2.22.0
On Fri, Jun 28, 2019 at 07:17:58PM -0300, Guilherme G. Piccoli wrote:
This patch is not on mainline and is meant to 4.19 stable *only*. After the patch description there's a reasoning about that.
Commit 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") introduced a NULL pointer dereference in generic_make_request(). The patch sets q to NULL and enter_succeeded to false; right after, there's an 'if (enter_succeeded)' which is not taken, and then the 'else' will dereference q in blk_queue_dying(q).
This patch just moves the 'q = NULL' to a point in which it won't trigger the oops, although the semantics of this NULLification remains untouched.
A simple test case/reproducer is as follows: a) Build kernel v4.19.56-stable with CONFIG_BLK_CGROUP=n.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme1n1/device/device/remove (whereas nvme1n1 is the 2nd array member)
This will trigger the following oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:generic_make_request+0x32b/0x400 Call Trace: submit_bio+0x73/0x140 ext4_io_submit+0x4d/0x60 ext4_writepages+0x626/0xe90 do_writepages+0x4b/0xe0 [...]
This patch has no functional changes and preserves the md/raid0 behavior when a member is removed before kernel v4.17.
Why this is not on mainline?
The patch was originally submitted upstream in linux-raid and linux-block mailing-lists - it was initially accepted by Song Liu, but Christoph Hellwig[0] observed that there was a clean-up series ready to be accepted from Ming Lei[1] that fixed the same issue.
The accepted patches from Ming's series in upstream are: commit 47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and commit fe2008640ae3 ("block: don't protect generic_make_request_checks with blk_queue_enter"). Those patches basically do a clean-up in the block layer involving:
- Putting back blk_exit_queue() logic into __blk_release_queue(); that
path was changed in the past and the logic from blk_exit_queue() was added to blk_cleanup_queue().
- Removing the guard/protection in generic_make_request_checks() with
blk_queue_enter().
The problem with Ming's series for -stable is that it relies in the legacy request IO path removal. So it's "backport-able" to v5.0+, but doing that for early versions (like 4.19) would incur in complex code changes. Hence, it was suggested by Christoph and Song Liu that this patch was submitted to stable only; otherwise merging it upstream would add code to fix a path removed in a subsequent commit.
[0] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org [1] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
Cc: Christoph Hellwig hch@lst.de Cc: Jens Axboe axboe@kernel.dk Reviewed-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Ming Lei ming.lei@redhat.com Tested-by: Eric Ren renzhengeek@gmail.com Fixes: 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") Signed-off-by: Guilherme G. Piccoli gpiccoli@canonical.com Acked-by: Song Liu songliubraving@fb.com
block/blk-core.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
Both patches now queued up, thanks.
greg k-h
On Wed, Jul 3, 2019 at 9:17 AM Greg KH gregkh@linuxfoundation.org wrote:
[...] Both patches now queued up, thanks.
Thanks a lot Song and Greg! Cheers,
Guilherme
linux-stable-mirror@lists.linaro.org