On Fri, Mar 10, 2023 at 07:33:37PM +0000, Mike Cloaked wrote:
With kernel 6.2.3, if I simply plug in a USB external drive, mount it and umount it, then the journal has a kernel Oops. I have submitted a bug report, which includes the journal output, at https://bugzilla.kernel.org/show_bug.cgi?id=217174
As soon as the usb drive is unmounted, the kernel Oops occurs, and the machine hangs on shutdown and needs a hard reboot.
I have reproduced the same issue on three different machines, and in each case downgrading back to kernel 6.2.2 resolves the issue and it no longer occurs.
This would seem to be a regression in kernel 6.2.3
Mike C
Thanks for reporting this! If this is reliably reproducible and is known to be a regression between v6.2.2 and v6.2.3, any chance you could bisect it to find out the exact commit that introduced the bug?
For reference I'm also copying the stack trace from bugzilla below:
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 1118 Comm: lvcreate Tainted: G T 6.2.3-1>
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Ex>
RIP: 0010:blk_throtl_update_limit_valid+0x1f/0x110
Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 49 89 fc>
RSP: 0018:ffffb5fd01b47bb8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff9d09040d8000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffffff97b2f648 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d090fce2c00
R13: ffff9d090aedf060 R14: ffff9d090aedf1c8 R15: ffff9d090aedf0d8
FS: 00007f3896fc7240(0000) GS:ffff9d109f040000(0000) knlGS:00000000>
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000111ce4003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 throtl_pd_offline+0x40/0x70
 blkcg_deactivate_policy+0xab/0x140
 ? __pfx_dev_remove+0x10/0x10 [dm_mod]
 blk_throtl_exit+0x45/0x80
 disk_release+0x4a/0xf0
 device_release+0x34/0x90
 kobject_put+0x97/0x1d0
 cleanup_mapped_device+0xe0/0x170 [dm_mod]
 __dm_destroy+0x120/0x1e0 [dm_mod]
 dev_remove+0x11b/0x190 [dm_mod]
 ctl_ioctl+0x302/0x5b0 [dm_mod]
 dm_ctl_ioctl+0xe/0x20 [dm_mod]
 __x64_sys_ioctl+0x9c/0xe0
 do_syscall_64+0x5c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f389745653f
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48>
RSP: 002b:00007ffe5499e4f0 EFLAGS: 00000246 ORIG_RAX: 00000000000000>
RAX: ffffffffffffffda RBX: 000055d198c3bec0 RCX: 00007f389745653f
RDX: 000055d1994501b0 RSI: 00000000c138fd04 RDI: 0000000000000003
RBP: 0000000000000006 R08: 000055d197547088 R09: 00007ffe5499e3a0
R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1974d10d6
R13: 000055d199450260 R14: 000055d1974d10c7 R15: 000055d197545bbb
 </TASK>
Modules linked in: dm_cache_smq dm_cache dm_persistent_data dm_bio_p>
 soundcore pcspkr intel_wmi_thunderbolt i2c_smbus mei sysimgblt inpu>
CR2: 0000000000000028
---[ end trace 0000000000000000 ]---
RIP: 0010:blk_throtl_update_limit_valid+0x1f/0x110
Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 49 89 fc>
On Fri, Mar 10, 2023 at 12:14:10PM -0800, Eric Biggers wrote:
On Fri, Mar 10, 2023 at 07:33:37PM +0000, Mike Cloaked wrote:
With kernel 6.2.3, if I simply plug in a USB external drive, mount it and umount it, then the journal has a kernel Oops. I have submitted a bug report, which includes the journal output, at https://bugzilla.kernel.org/show_bug.cgi?id=217174
As soon as the usb drive is unmounted, the kernel Oops occurs, and the machine hangs on shutdown and needs a hard reboot.
I have reproduced the same issue on three different machines, and in each case downgrading back to kernel 6.2.2 resolves the issue and it no longer occurs.
This would seem to be a regression in kernel 6.2.3
Mike C
Thanks for reporting this! If this is reliably reproducible and is known to be a regression between v6.2.2 and v6.2.3, any chance you could bisect it to find out the exact commit that introduced the bug?
For reference I'm also copying the stack trace from bugzilla below:
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 1118 Comm: lvcreate Tainted: G T 6.2.3-1>
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Ex>
RIP: 0010:blk_throtl_update_limit_valid+0x1f/0x110
BTW, the block/ commits between v6.2.2 and v6.2.3 were:
  blk-mq: avoid sleep in blk_mq_alloc_request_hctx
  blk-mq: remove stale comment for blk_mq_sched_mark_restart_hctx
  blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait
  blk-mq: Fix potential io hung for shared sbitmap per tagset
  blk-mq: correct stale comment of .get_budget
  block: sync mixed merged request's failfast with 1st bio's
  block: Fix io statistics for cgroup in throttle path
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: use proper return value from bio_failfast()
  blk-iocost: fix divide by 0 error in calc_lcoefs()
  blk-cgroup: dropping parent refcount after pd_free_fn() is done
  blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()
  block: don't allow multiple bios for IOCB_NOWAIT issue
  block: clear bio->bi_bdev when putting a bio back in the cache
  block: be a bit more careful in checking for NULL bdev while polling
Without having any in-depth knowledge here, I think "blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()" looks the most suspicious here... I see that AUTOSEL selected it from a 3-patch series without backporting patch 2, maybe that could be it? Anyway, just a hunch.
- Eric
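[Editor's sketch, not part of the original mail.] The bisection requested earlier in the thread could be started roughly as follows. It assumes a linux-stable checkout containing the v6.2.2 and v6.2.3 tags, and it limits the search to block/ purely because of the hunch above; drop the path if the culprit might live elsewhere. The function name and the GIT override variable are hypothetical helpers for illustration.

```shell
#!/bin/sh
# Start a bisection between a known-good and a known-bad tag, limited to
# commits that touch block/. GIT can be overridden (e.g. GIT=echo) to
# preview the command without running it.
start_block_bisect() {
    good=$1   # last release without the Oops, e.g. v6.2.2
    bad=$2    # first release with the Oops, e.g. v6.2.3
    "${GIT:-git}" bisect start "$bad" "$good" -- block/
}
```

After `start_block_bisect v6.2.2 v6.2.3`, each step would be: build and boot the checked-out kernel, try to trigger the Oops, then run `git bisect good` or `git bisect bad`. `git bisect reset` returns to the original checkout when done.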
On 3/10/23 1:16 PM, Eric Biggers wrote:
On Fri, Mar 10, 2023 at 12:14:10PM -0800, Eric Biggers wrote:
On Fri, Mar 10, 2023 at 07:33:37PM +0000, Mike Cloaked wrote:
With kernel 6.2.3, if I simply plug in a USB external drive, mount it and umount it, then the journal has a kernel Oops. I have submitted a bug report, which includes the journal output, at https://bugzilla.kernel.org/show_bug.cgi?id=217174
As soon as the usb drive is unmounted, the kernel Oops occurs, and the machine hangs on shutdown and needs a hard reboot.
I have reproduced the same issue on three different machines, and in each case downgrading back to kernel 6.2.2 resolves the issue and it no longer occurs.
This would seem to be a regression in kernel 6.2.3
Mike C
Thanks for reporting this! If this is reliably reproducible and is known to be a regression between v6.2.2 and v6.2.3, any chance you could bisect it to find out the exact commit that introduced the bug?
For reference I'm also copying the stack trace from bugzilla below:
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 1118 Comm: lvcreate Tainted: G T 6.2.3-1>
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Ex>
RIP: 0010:blk_throtl_update_limit_valid+0x1f/0x110
BTW, the block/ commits between v6.2.2 and v6.2.3 were:
  blk-mq: avoid sleep in blk_mq_alloc_request_hctx
  blk-mq: remove stale comment for blk_mq_sched_mark_restart_hctx
  blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait
  blk-mq: Fix potential io hung for shared sbitmap per tagset
  blk-mq: correct stale comment of .get_budget
  block: sync mixed merged request's failfast with 1st bio's
  block: Fix io statistics for cgroup in throttle path
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: use proper return value from bio_failfast()
  blk-iocost: fix divide by 0 error in calc_lcoefs()
  blk-cgroup: dropping parent refcount after pd_free_fn() is done
  blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()
  block: don't allow multiple bios for IOCB_NOWAIT issue
  block: clear bio->bi_bdev when putting a bio back in the cache
  block: be a bit more careful in checking for NULL bdev while polling
Without having any in-depth knowledge here, I think "blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()" looks the most suspicious here... I see that AUTOSEL selected it from a 3-patch series without backporting patch 2, maybe that could be it? Anyway, just a hunch.
Was just looking at this too, the primary suspects would indeed be those two blk-cgroup changes. And yes, they ended up in stable due to auto selection, and very odd how it picked 2 and not the 3rd?!
But I would revert:
  bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a
  57a425badc05c2e87e9f25713e5c3c0298e4202c
in that order from 6.2.3 and see if that helps. Adding Yu.
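[Editor's sketch, not part of the original mail.] One way to apply the two reverts in the stated order, assuming a linux-stable tree checked out at v6.2.3. The function name and the GIT override variable are hypothetical; the commit IDs are the ones given above. Each commit is reverted by a separate `git revert` call so the order is explicit.

```shell
#!/bin/sh
# Revert the given commits one at a time, in the order they are passed.
# GIT can be overridden (e.g. GIT=echo) to preview the commands.
revert_in_order() {
    for commit in "$@"; do
        "${GIT:-git}" revert --no-edit "$commit" || return 1
    done
}
```

Usage would be `revert_in_order bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a 57a425badc05c2e87e9f25713e5c3c0298e4202c`, followed by a rebuild and a retest.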
On 3/10/23 15:23, Jens Axboe wrote:
On 3/10/23 1:16 PM, Eric Biggers wrote:
...
But I would revert:
  bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a
  57a425badc05c2e87e9f25713e5c3c0298e4202c
in that order from 6.2.3 and see if that helps. Adding Yu.
Confirm the 2 Reverts fixed in my tests as well (nvme + sata drives). Nasty crash - some needed to be power cycled as they hung on shutdown.
Thank you!
gene
On Fri, Mar 10, 2023 at 04:08:21PM -0500, Genes Lists wrote:
On 3/10/23 15:23, Jens Axboe wrote:
On 3/10/23 1:16 PM, Eric Biggers wrote:
...
But I would revert:
  bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a
  57a425badc05c2e87e9f25713e5c3c0298e4202c
in that order from 6.2.3 and see if that helps. Adding Yu.
Confirm the 2 Reverts fixed in my tests as well (nvme + sata drives). Nasty crash - some needed to be power cycled as they hung on shutdown.
Thank you!
gene
Great, thanks. BTW, 6.1 is also affected. A simple reproducer is to run:
  dmsetup create dev --table "0 128 zero"
  dmsetup remove dev
The following kconfigs are needed for the bug to be hit:
  CONFIG_BLK_CGROUP=y
  CONFIG_BLK_DEV_THROTTLING=y
  CONFIG_BLK_DEV_THROTTLING_LOW=y
Sasha or Greg, can you please revert the indicated commits from 6.1 and 6.2?
- Eric
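[Editor's sketch, not part of the original mail.] Before running the reproducer above, it can help to confirm that the running kernel has the three listed options enabled. The helper below is a hypothetical illustration: it takes the path to a plain-text kernel config, whose location varies by distribution (for example /boot/config-$(uname -r) on some systems), and succeeds only if all three options are set to y. The `reproduce` function wraps the two dmsetup commands from the mail; it needs root and, on an affected kernel, is expected to Oops and possibly hang the machine on shutdown.

```shell
#!/bin/sh
# Return 0 only if every kconfig option named in the thread is "=y"
# in the given plain-text kernel config file.
has_required_kconfig() {
    config=$1
    for opt in CONFIG_BLK_CGROUP CONFIG_BLK_DEV_THROTTLING \
               CONFIG_BLK_DEV_THROTTLING_LOW; do
        grep -qx "${opt}=y" "$config" || return 1
    done
}

# The reproducer from the mail: create a tiny "zero" device-mapper
# target, then remove it. Requires root; run on a test machine only.
reproduce() {
    dmsetup create dev --table "0 128 zero" && dmsetup remove dev
}
```

A typical invocation would be `has_required_kconfig /path/to/config && reproduce`, with the config path adjusted to the local system.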
On Fri, Mar 10, 2023 at 02:53:19PM -0800, Eric Biggers wrote:
On Fri, Mar 10, 2023 at 04:08:21PM -0500, Genes Lists wrote:
On 3/10/23 15:23, Jens Axboe wrote:
On 3/10/23 1:16 PM, Eric Biggers wrote:
...
But I would revert:
  bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a
  57a425badc05c2e87e9f25713e5c3c0298e4202c
in that order from 6.2.3 and see if that helps. Adding Yu.
Confirm the 2 Reverts fixed in my tests as well (nvme + sata drives). Nasty crash - some needed to be power cycled as they hung on shutdown.
Thank you!
gene
Great, thanks. BTW, 6.1 is also affected. A simple reproducer is to run:
  dmsetup create dev --table "0 128 zero"
  dmsetup remove dev
The following kconfigs are needed for the bug to be hit:
  CONFIG_BLK_CGROUP=y
  CONFIG_BLK_DEV_THROTTLING=y
  CONFIG_BLK_DEV_THROTTLING_LOW=y
Sasha or Greg, can you please revert the indicated commits from 6.1 and 6.2?
Yes, will go do that right now, thanks for debugging this so quickly!
greg k-h
On 3/11/23 2:32 AM, Greg Kroah-Hartman wrote:
On Fri, Mar 10, 2023 at 02:53:19PM -0800, Eric Biggers wrote:
On Fri, Mar 10, 2023 at 04:08:21PM -0500, Genes Lists wrote:
On 3/10/23 15:23, Jens Axboe wrote:
On 3/10/23 1:16 PM, Eric Biggers wrote:
...
But I would revert:
  bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a
  57a425badc05c2e87e9f25713e5c3c0298e4202c
in that order from 6.2.3 and see if that helps. Adding Yu.
Confirm the 2 Reverts fixed in my tests as well (nvme + sata drives). Nasty crash - some needed to be power cycled as they hung on shutdown.
Thank you!
gene
Great, thanks. BTW, 6.1 is also affected. A simple reproducer is to run:
  dmsetup create dev --table "0 128 zero"
  dmsetup remove dev
The following kconfigs are needed for the bug to be hit:
  CONFIG_BLK_CGROUP=y
  CONFIG_BLK_DEV_THROTTLING=y
  CONFIG_BLK_DEV_THROTTLING_LOW=y
Sasha or Greg, can you please revert the indicated commits from 6.1 and 6.2?
Yes, will go do that right now, thanks for debugging this so quickly!
The issue here is that only parts of a series were auto-selected. That seems like a bad idea to do for stable. Just because something applies without other parts of the series doesn't mean it's sane to backport by itself.
How do we prevent that from happening? Maybe we just need to default to "if whole series doesn't pick cleanly, don't grab any parts of it in auto-selection"? Exception being if it's explicitly marked for stable, not uncommon to have a series that starts with a fix or two which should go to stable, then the feature bits.
Hi,
On 2023/03/11 5:08, Genes Lists wrote:
On 3/10/23 15:23, Jens Axboe wrote:
On 3/10/23 1:16 PM, Eric Biggers wrote:
...
But I would revert:
  bfe46d2efe46c5c952f982e2ca94fe2ec5e58e2a
  57a425badc05c2e87e9f25713e5c3c0298e4202c
in that order from 6.2.3 and see if that helps. Adding Yu.
So the reason is that patch dfd6200a0954 ("blk-cgroup: support to track if policy is online") was missed in the backport; this will definitely cause problems.
Thanks, Kuai
Confirm the 2 Reverts fixed in my tests as well (nvme + sata drives). Nasty crash - some needed to be power cycled as they hung on shutdown.
Thank you!
gene
linux-stable-mirror@lists.linaro.org