Hi,
Since updating from 6.4.13 to 6.5.5, I occasionally hit a storage subsystem freeze - all I/O ends up stuck. I'm not sure what exactly triggers the issue, but it often happens when doing some LVM operations (lvremove, lvrename, etc.) on a dm-thin volume together with a bulk data copy to/from another LVM thin volume with an ext4 filesystem.
The storage stack I use is: nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
All of this runs in a (PV) dom0 under Xen - Qubes OS 4.2, to be specific.
I can reproduce the issue on at least 3 different machines. I also tried 6.5.6 and the issue is still there. I haven't checked newer versions, but I briefly reviewed the git log and didn't find anything suggesting a fix for a similar issue.
I managed to bisect it down to this commit:
commit 5054e778fcd9cd29ddaa8109077cd235527e4f94
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Mon May 1 09:19:17 2023 -0400

    dm crypt: allocate compound pages if possible

    It was reported that allocating pages for the write buffer in dm-crypt
    causes measurable overhead [1].

    Change dm-crypt to allocate compound pages if they are available. If
    not, fall back to the mempool.

    [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@kernel.org>
TBH, I'm not sure if the bug is in this commit, or maybe in some functions it uses (I don't see dm-crypt functions directly involved in the stack traces I collected). But reverting this commit on top of 6.5.6 seems to fix the issue.
I also tried CONFIG_PROVE_LOCKING, but it didn't report any issue.
I managed to collect a "blocked tasks" dump via sysrq, included below. A few more can be found at https://github.com/QubesOS/qubes-issues/issues/8575
[ 4246.558313] sysrq: Show Blocked State
[ 4246.558388] task:journal-offline state:D stack:0 pid:8098 ppid:1 flags:0x00000002
[ 4246.558407] Call Trace:
[ 4246.558414] <TASK>
[ 4246.558422] __schedule+0x23d/0x670
[ 4246.558440] schedule+0x5e/0xd0
[ 4246.558450] io_schedule+0x46/0x70
[ 4246.558461] folio_wait_bit_common+0x13d/0x350
[ 4246.558475] ? __pfx_wake_page_function+0x10/0x10
[ 4246.558488] folio_wait_writeback+0x2c/0x90
[ 4246.558498] mpage_prepare_extent_to_map+0x15c/0x4d0
[ 4246.558512] ext4_do_writepages+0x25f/0x770
[ 4246.558523] ext4_writepages+0xad/0x180
[ 4246.558533] do_writepages+0xcf/0x1e0
[ 4246.558543] ? __seccomp_filter+0x32a/0x4f0
[ 4246.558554] filemap_fdatawrite_wbc+0x63/0x90
[ 4246.558567] __filemap_fdatawrite_range+0x5c/0x80
[ 4246.558578] file_write_and_wait_range+0x4a/0xb0
[ 4246.558588] ext4_sync_file+0x88/0x380
[ 4246.558598] __x64_sys_fsync+0x3b/0x70
[ 4246.558609] do_syscall_64+0x5c/0x90
[ 4246.558621] ? exit_to_user_mode_prepare+0xb2/0xd0
[ 4246.558632] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 4246.558644] RIP: 0033:0x7710cf124d0a
[ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a
[ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028
[ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0
[ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940
[ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000
[ 4246.558728] </TASK>
[ 4246.558836] task:lvm state:D stack:0 pid:7835 ppid:5665 flags:0x00004006
[ 4246.558852] Call Trace:
[ 4246.558857] <TASK>
[ 4246.558863] __schedule+0x23d/0x670
[ 4246.558874] schedule+0x5e/0xd0
[ 4246.558884] io_schedule+0x46/0x70
[ 4246.558894] dm_wait_for_bios_completion+0xfc/0x110
[ 4246.558909] ? __pfx_autoremove_wake_function+0x10/0x10
[ 4246.558922] __dm_suspend+0x7e/0x1b0
[ 4246.558932] dm_internal_suspend_noflush+0x5c/0x80
[ 4246.558946] pool_presuspend+0xcc/0x130 [dm_thin_pool]
[ 4246.558968] dm_table_presuspend_targets+0x3f/0x60
[ 4246.558980] __dm_suspend+0x41/0x1b0
[ 4246.558991] dm_suspend+0xc0/0xe0
[ 4246.559001] dev_suspend+0xa5/0xd0
[ 4246.559011] ctl_ioctl+0x26e/0x350
[ 4246.559020] ? __pfx_dev_suspend+0x10/0x10
[ 4246.559032] dm_ctl_ioctl+0xe/0x20
[ 4246.559041] __x64_sys_ioctl+0x94/0xd0
[ 4246.559052] do_syscall_64+0x5c/0x90
[ 4246.559062] ? do_syscall_64+0x6b/0x90
[ 4246.559072] ? do_syscall_64+0x6b/0x90
[ 4246.559081] ? xen_pv_evtchn_do_upcall+0x54/0xb0
[ 4246.559093] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 4246.559104] RIP: 0033:0x7f1cb77cfe0f
[ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f
[ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003
[ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420
[ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
[ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0
[ 4246.559199] </TASK>
[ 4246.559207] task:kworker/u8:3 state:D stack:0 pid:8033 ppid:2 flags:0x00004000
[ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4)
[ 4246.559238] Call Trace:
[ 4246.559244] <TASK>
[ 4246.559249] __schedule+0x23d/0x670
[ 4246.559260] schedule+0x5e/0xd0
[ 4246.559270] io_schedule+0x46/0x70
[ 4246.559280] folio_wait_bit_common+0x13d/0x350
[ 4246.559290] ? __pfx_wake_page_function+0x10/0x10
[ 4246.559302] mpage_prepare_extent_to_map+0x309/0x4d0
[ 4246.559314] ext4_do_writepages+0x25f/0x770
[ 4246.559324] ext4_writepages+0xad/0x180
[ 4246.559334] do_writepages+0xcf/0x1e0
[ 4246.559344] ? find_busiest_group+0x42/0x1a0
[ 4246.559354] __writeback_single_inode+0x3d/0x280
[ 4246.559368] writeback_sb_inodes+0x1ed/0x4a0
[ 4246.559381] __writeback_inodes_wb+0x4c/0xf0
[ 4246.559393] wb_writeback+0x298/0x310
[ 4246.559403] wb_do_writeback+0x230/0x2b0
[ 4246.559414] wb_workfn+0x5f/0x260
[ 4246.559424] ? _raw_spin_unlock+0xe/0x30
[ 4246.559434] ? finish_task_switch.isra.0+0x95/0x2b0
[ 4246.559447] ? __schedule+0x245/0x670
[ 4246.559457] process_one_work+0x1df/0x3e0
[ 4246.559466] worker_thread+0x51/0x390
[ 4246.559475] ? __pfx_worker_thread+0x10/0x10
[ 4246.559484] kthread+0xe5/0x120
[ 4246.559495] ? __pfx_kthread+0x10/0x10
[ 4246.559504] ret_from_fork+0x31/0x50
[ 4246.559514] ? __pfx_kthread+0x10/0x10
[ 4246.559523] ret_from_fork_asm+0x1b/0x30
[ 4246.559536] </TASK>
On Sat, Oct 21, 2023 at 05:38:58AM +0200, Marek Marczykowski-Górecki wrote:
[...]
Thanks for the regression report. I'm adding it to regzbot:
#regzbot ^introduced: 5054e778fcd9cd
[TLDR: This mail is primarily relevant for Linux regression tracking. A change or fix related to the regression discussed in this thread was posted or applied, but it did not use a Closes: tag to point to the report, as Linus and the documentation call for. Things happen, no worries -- but now the regression tracking bot needs to be told manually about the fix. See link in footer if these mails annoy you.]
On 21.10.23 09:48, Bagas Sanjaya wrote:
On Sat, Oct 21, 2023 at 05:38:58AM +0200, Marek Marczykowski-Górecki wrote:
Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage subsystem freeze - any I/O ends up frozen. I'm not sure what exactly triggers the issue, but often it happens when doing some LVM operations (lvremove, lvrename etc) on a dm-thin volume together with bulk data copy to/from another LVM thin volume with ext4 fs.
[...] #regzbot ^introduced: 5054e778fcd9cd
#regzbot fix: dm crypt: don't allocate large compound pages
#regzbot ignore-activity
(fix currently found in https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/c... )
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
Hi
It's hard to say what causes this. Perhaps dm-crypt exhausts all the higher-order pages and some subsystem stalls because of it.
In drivers/md/dm-crypt.c, in the function crypt_alloc_buffer, there is "unsigned int order = MAX_ORDER - 1".
What happens if you set the "order" variable to some small value, like 1, 2 or 3? Does the problem go away? Could you find a threshold value (something between 0 and MAX_ORDER-1) where the bug starts to appear?
What happens if you replace "pages = alloc_pages(gfp_mask" with "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)" ? Does the bug go away?
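[Editor's note: for readers following along, here is a simplified sketch of the allocation loop in crypt_alloc_buffer() as of 6.5 - paraphrased, not verbatim kernel source, with bio setup and error handling omitted. The two experiments above correspond to the marked lines.]

	unsigned int order = MAX_ORDER - 1;	/* experiment 1: try 1, 2 or 3 here */
	unsigned int remaining_size = size;

	while (remaining_size) {
		struct page *pages;
		unsigned int size_to_add;
		unsigned int remaining_order =
			__fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);

		/* never ask for more than is still needed */
		order = min(order, remaining_order);

		/* opportunistic high-order attempt, stepping down on failure */
		while (order > 0) {
			pages = alloc_pages(gfp_mask	/* experiment 2: gfp_mask & ~__GFP_KSWAPD_RECLAIM */
					    | __GFP_NOMEMALLOC | __GFP_NORETRY
					    | __GFP_NOWARN | __GFP_COMP, order);
			if (pages)
				goto have_pages;
			order--;
		}

		/* order-0 last resort: the mempool guarantees forward progress */
		pages = mempool_alloc(&cc->page_pool, gfp_mask);

have_pages:
		size_to_add = min((unsigned int)(PAGE_SIZE << order), remaining_size);
		__bio_add_page(clone, pages, size_to_add, 0);
		remaining_size -= size_to_add;
	}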
Mikulas
On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:
[...]
On Mon, Oct 23, 2023 at 10:59:40PM +0200, Mikulas Patocka wrote:
Hi
It's hard to say what causes this. Perhaps dm-crypt exhausts all the higher-order pages and some subsystem stalls because of it.
In drivers/md/dm-crypt.c, in the function crypt_alloc_buffer, there is "unsigned int order = MAX_ORDER - 1".
What happens if you set the "order" variable to some small value, like 1, 2 or 3? Does the problem go away? Could you find a threshold value (something between 0 and MAX_ORDER-1) where the bug starts to appear?
With 3 and lower it seems to work, with 4 it freezes. My test is not 100% reproducible, but I've repeated it a few times and got rather consistent results on this system (i3-1315U, NVMe).
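[Editor's note: the threshold reported here lines up with what is later in the thread identified as PAGE_ALLOC_COSTLY_ORDER (3 in mainline). A tiny illustrative program, not from the thread and assuming 4 KiB pages, showing how order maps to contiguous allocation size:]

	#include <stdio.h>

	/* Illustration only: allocation order vs. contiguous size, assuming
	 * 4 KiB pages. Order 3 is 32 KiB; the freezes reported above start
	 * at order 4, i.e. 64 KiB and larger requests. */
	int main(void)
	{
		const unsigned long page_size = 4096;

		for (unsigned int order = 0; order <= 4; order++)
			printf("order %u -> %lu KiB\n",
			       order, (page_size << order) / 1024);
		return 0;
	}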
BTW, when trying on a much older system (Thinkpad P52 with i7-8750H) I couldn't reproduce the issue at all. But OTOH, I hit it once on a system with i7-7600U and otherwise similar specs but much different workload (several more VMs accessing the data). I'm not sure if that info helps at all...
What happens if you replace "pages = alloc_pages(gfp_mask" with "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)" ? Does the bug go away?
In a limited test, with order restored to MAX_ORDER - 1, no, still got the issue, and got this via sysrq (just one task listed):
[ 516.375254] sysrq: Show Blocked State
[ 516.375461] task:dd state:D stack:13072 pid:4385 ppid:4371 flags:0x00004002
[ 516.375496] Call Trace:
[ 516.375507] <TASK>
[ 516.375518] __schedule+0x30e/0x8b0
[ 516.375549] schedule+0x59/0xb0
[ 516.375566] io_schedule+0x41/0x70
[ 516.375582] folio_wait_bit_common+0x12c/0x300
[ 516.375603] ? __pfx_wake_page_function+0x10/0x10
[ 516.375623] folio_wait_writeback+0x27/0x80
[ 516.375639] __filemap_fdatawait_range+0x7b/0xe0
[ 516.375656] file_write_and_wait_range+0x86/0xb0
[ 516.375672] blkdev_fsync+0x33/0x60
[ 516.375693] __x64_sys_fdatasync+0x4a/0x90
[ 516.375713] do_syscall_64+0x3c/0x90
[ 516.375729] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 516.375751] RIP: 0033:0x7fd718562da4
[ 516.375764] RSP: 002b:00007ffea3815278 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
[ 516.375784] RAX: ffffffffffffffda RBX: 0000000000015200 RCX: 00007fd718562da4
[ 516.375801] RDX: 0000000000028000 RSI: 0000000000000000 RDI: 0000000000000001
[ 516.375817] RBP: 00007fd71845e6c0 R08: 00000000ffffffff R09: 0000000000000000
[ 516.375833] R10: 0000000000000022 R11: 0000000000000202 R12: 0000000000000000
[ 516.375848] R13: 0000000000000000 R14: 0000000000028000 R15: 0000000000000000
[ 516.375865] </TASK>
(it is dd writing to an LVM thin volume, whose PV is on dm-crypt)
Mikulas
[...]
On Wed, Oct 25, 2023 at 05:10:56AM +0200, Marek Marczykowski-Górecki wrote:
[...]
In this bad state, I extracted some more info on memory:
/proc/meminfo:
MemTotal: 3983088 kB
MemFree: 349264 kB
MemAvailable: 3474744 kB
Buffers: 1511732 kB
Cached: 1608132 kB
SwapCached: 12 kB
Active: 203136 kB
Inactive: 3147696 kB
Active(anon): 1288 kB
Inactive(anon): 252128 kB
Active(file): 201848 kB
Inactive(file): 2895568 kB
Unevictable: 46064 kB
Mlocked: 39920 kB
SwapTotal: 4112380 kB
SwapFree: 4112124 kB
Dirty: 0 kB
Writeback: 8556 kB
AnonPages: 277020 kB
Mapped: 137424 kB
Shmem: 13792 kB
KReclaimable: 91728 kB
Slab: 182428 kB
SReclaimable: 91728 kB
SUnreclaim: 90700 kB
KernelStack: 5776 kB
PageTables: 7480 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6103924 kB
Committed_AS: 990924 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 11560 kB
VmallocChunk: 0 kB
Percpu: 2528 kB
DirectMap4k: 4325764 kB
DirectMap2M: 0 kB
and also:

[ 1168.537096] sysrq: Show Memory
[ 1168.537192] Mem-Info:
[ 1168.537206] active_anon:322 inactive_anon:63032 isolated_anon:0
[ 1168.537206]  active_file:50462 inactive_file:723892 isolated_file:0
[ 1168.537206]  unevictable:11516 dirty:1 writeback:2139
[ 1168.537206]  slab_reclaimable:22932 slab_unreclaimable:22675
[ 1168.537206]  mapped:34357 shmem:3448 pagetables:1870
[ 1168.537206]  sec_pagetables:0 bounce:0
[ 1168.537206]  kernel_misc_reclaimable:0
[ 1168.537206]  free:87499 free_pcp:1642 free_cma:0
[ 1168.537279] Node 0 active_anon:1288kB inactive_anon:252128kB active_file:201848kB inactive_file:2895568kB unevictable:46064kB isolated(anon):0kB isolated(file):0kB mapped:137428kB dirty:4kB writeback:8556kB shmem:13792kB writeback_tmp:0kB kernel_stack:5776kB pagetables:7480kB sec_pagetables:0kB all_unreclaimable? no
[ 1168.537332] Node 0 DMA free:15488kB boost:0kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:352kB unevictable:0kB writepending:84kB present:15996kB managed:15936kB mlocked:0kB bounce:0kB free_pcp:12kB local_pcp:0kB free_cma:0kB
[ 1168.537387] lowmem_reserve[]: 0 1728 3857 3857
[ 1168.537409] Node 0 DMA32 free:155300kB boost:0kB min:3552kB low:5320kB high:7088kB reserved_highatomic:0KB active_anon:16kB inactive_anon:3368kB active_file:5660kB inactive_file:1585120kB unevictable:0kB writepending:7576kB present:1924360kB managed:1803376kB mlocked:0kB bounce:0kB free_pcp:3184kB local_pcp:864kB free_cma:0kB
[ 1168.537464] lowmem_reserve[]: 0 0 2129 2129
[ 1168.537483] Node 0 Normal free:179208kB boost:0kB min:4376kB low:6556kB high:8736kB reserved_highatomic:2048KB active_anon:1272kB inactive_anon:248760kB active_file:196104kB inactive_file:1310096kB unevictable:46064kB writepending:900kB present:2253948kB managed:2163776kB mlocked:39920kB bounce:0kB free_pcp:3372kB local_pcp:632kB free_cma:0kB
[ 1168.537540] lowmem_reserve[]: 0 0 0 0
[ 1168.537557] Node 0 DMA: 4*4kB (UM) 4*8kB (UM) 3*16kB (UM) 3*32kB (U) 3*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 15488kB
[ 1168.537624] Node 0 DMA32: 1945*4kB (UME) 1446*8kB (UME) 953*16kB (UME) 1410*32kB (UM) 297*64kB (UM) 162*128kB (UM) 60*256kB (UM) 22*512kB (UM) 7*1024kB (UM) 1*2048kB (U) 0*4096kB = 155300kB
[ 1168.537695] Node 0 Normal: 3146*4kB (UMEH) 2412*8kB (UMEH) 1444*16kB (UMEH) 1138*32kB (UMEH) 486*64kB (UME) 97*128kB (UM) 13*256kB (UM) 8*512kB (M) 2*1024kB (M) 5*2048kB (M) 6*4096kB (UM) = 179208kB
[ 1168.537767] 779973 total pagecache pages
[ 1168.537778] 3 pages in swap cache
[ 1168.537788] Free swap = 4112124kB
[ 1168.537798] Total swap = 4112380kB
[ 1168.537807] 1048576 pages RAM
[ 1168.537817] 0 pages HighMem/MovableOnly
[ 1168.537827] 52804 pages reserved
[...]
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
So, I'm forwarding this to the memory management maintainers.
What do you think? The problem is that if dm-crypt allocates pages with order > 3 (PAGE_ALLOC_COSTLY_ORDER), the system occasionally freezes waiting in writeback.
dm-crypt allocates the pages with GFP_NOWAIT | __GFP_HIGHMEM | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, so it shouldn't put any pressure on the system. If the allocation fails, it falls back to a smaller-order allocation or to a mempool as a last resort.
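[Editor's note: a short annotated sketch of the mask described above, with flag meanings paraphrased from the kernel's gfp documentation; illustrative only, not a quote of dm-crypt.c.]

	/* GFP_NOWAIT       - never sleep / no direct reclaim (it may still wake
	 *                    kswapd, which is what the ~__GFP_KSWAPD_RECLAIM
	 *                    experiment earlier in the thread removed)
	 * __GFP_HIGHMEM    - highmem pages are acceptable
	 * __GFP_NOMEMALLOC - never dip into emergency reserves
	 * __GFP_NORETRY    - give up quickly instead of retrying hard
	 * __GFP_NOWARN     - no allocation-failure warnings in the log
	 * __GFP_COMP       - hand back the pages as one compound page
	 */
	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM |
			 __GFP_NOMEMALLOC | __GFP_NORETRY |
			 __GFP_NOWARN | __GFP_COMP;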
When the freeze happens, there is "349264kB" free memory - so the system is not short on memory.
Should we restrict the dm-crypt allocation size to PAGE_ALLOC_COSTLY_ORDER? Or is it a bug somewhere in the memory management system that needs to be fixed there?
Mikulas
On Wed, 25 Oct 2023, Marek Marczykowski-Górecki wrote:
On Wed, Oct 25, 2023 at 05:10:56AM +0200, Marek Marczykowski-Górecki wrote:
On Mon, Oct 23, 2023 at 10:59:40PM +0200, Mikulas Patocka wrote:
Hi
It's hard to say what causes this. Perhaps dm-crypt exhausts all the higher-order pages and some subsystem stalls because of it.
In drivers/md/dm-crypt.c in function crypt_alloc_buffer there is "unsigned int order = MAX_ORDER - 1"
What happens if you set the "order" variable to some small value, like 1, 2 or 3. Does the problem go away? Could you find a threshold value (something between 0 and MAX_ORDER-1) where the bug starts to appear?
With 3 and lower it seems to work, with 4 it freezes. My test is not 100% reproducible, but I've repeated it a few times and got rather consistent results on this system (i3-1315U, NVMe).
BTW, when trying on a much older system (Thinkpad P52 with i7-8750H) I couldn't reproduce the issue at all. But OTOH, I hit it once on a system with i7-7600U and otherwise similar specs but much different workload (several more VMs accessing the data). I'm not sure if that info helps at all...
What happens if you replace "pages = alloc_pages(gfp_mask" with "pages = alloc_pages((gfp_mask & ~__GFP_KSWAPD_RECLAIM)" ? Does the bug go away?
In a limited test, with order restored to MAX_ORDER - 1, no, still got the issue, and got this via sysrq (just one task listed):
[ 516.375254] sysrq: Show Blocked State [ 516.375461] task:dd state:D stack:13072 pid:4385 ppid:4371 flags:0x00004002 [ 516.375496] Call Trace: [ 516.375507] <TASK> [ 516.375518] __schedule+0x30e/0x8b0 [ 516.375549] schedule+0x59/0xb0 [ 516.375566] io_schedule+0x41/0x70 [ 516.375582] folio_wait_bit_common+0x12c/0x300 [ 516.375603] ? __pfx_wake_page_function+0x10/0x10 [ 516.375623] folio_wait_writeback+0x27/0x80 [ 516.375639] __filemap_fdatawait_range+0x7b/0xe0 [ 516.375656] file_write_and_wait_range+0x86/0xb0 [ 516.375672] blkdev_fsync+0x33/0x60 [ 516.375693] __x64_sys_fdatasync+0x4a/0x90 [ 516.375713] do_syscall_64+0x3c/0x90 [ 516.375729] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 516.375751] RIP: 0033:0x7fd718562da4 [ 516.375764] RSP: 002b:00007ffea3815278 EFLAGS: 00000202 ORIG_RAX: 000000000000004b [ 516.375784] RAX: ffffffffffffffda RBX: 0000000000015200 RCX: 00007fd718562da4 [ 516.375801] RDX: 0000000000028000 RSI: 0000000000000000 RDI: 0000000000000001 [ 516.375817] RBP: 00007fd71845e6c0 R08: 00000000ffffffff R09: 0000000000000000 [ 516.375833] R10: 0000000000000022 R11: 0000000000000202 R12: 0000000000000000 [ 516.375848] R13: 0000000000000000 R14: 0000000000028000 R15: 0000000000000000 [ 516.375865] </TASK>
(it is dd writing to LVM thin volume, where PV is on dm-crypt)
In this bad state, I extracted some more info on memory:
/proc/meminfo:
MemTotal: 3983088 kB MemFree: 349264 kB MemAvailable: 3474744 kB Buffers: 1511732 kB Cached: 1608132 kB SwapCached: 12 kB Active: 203136 kB Inactive: 3147696 kB Active(anon): 1288 kB Inactive(anon): 252128 kB Active(file): 201848 kB Inactive(file): 2895568 kB Unevictable: 46064 kB Mlocked: 39920 kB SwapTotal: 4112380 kB SwapFree: 4112124 kB Dirty: 0 kB Writeback: 8556 kB AnonPages: 277020 kB Mapped: 137424 kB Shmem: 13792 kB KReclaimable: 91728 kB Slab: 182428 kB SReclaimable: 91728 kB SUnreclaim: 90700 kB KernelStack: 5776 kB PageTables: 7480 kB SecPageTables: 0 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 6103924 kB Committed_AS: 990924 kB VmallocTotal: 34359738367 kB VmallocUsed: 11560 kB VmallocChunk: 0 kB Percpu: 2528 kB DirectMap4k: 4325764 kB DirectMap2M: 0 kB
and also: [ 1168.537096] sysrq: Show Memory [ 1168.537192] Mem-Info: [ 1168.537206] active_anon:322 inactive_anon:63032 isolated_anon:0 [ 1168.537206] active_file:50462 inactive_file:723892 isolated_file:0 [ 1168.537206] unevictable:11516 dirty:1 writeback:2139 [ 1168.537206] slab_reclaimable:22932 slab_unreclaimable:22675 [ 1168.537206] mapped:34357 shmem:3448 pagetables:1870 [ 1168.537206] sec_pagetables:0 bounce:0 [ 1168.537206] kernel_misc_reclaimable:0 [ 1168.537206] free:87499 free_pcp:1642 free_cma:0 [ 1168.537279] Node 0 active_anon:1288kB inactive_anon:252128kB active_file:201848kB inactive_file:2895568kB unevictable:46064kB isolated(anon):0kB isolated(file):0kB mapped:137428kB dirty:4kB writeback:8556kB shmem:13792kB writeback_tmp:0kB kernel_stack:5776kB pagetables:7480kB sec_pagetables:0kB all_unreclaimable? no [ 1168.537332] Node 0 DMA free:15488kB boost:0kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:352kB unevictable:0kB writepending:84kB present:15996kB managed:15936kB mlocked:0kB bounce:0kB free_pcp:12kB local_pcp:0kB free_cma:0kB [ 1168.537387] lowmem_reserve[]: 0 1728 3857 3857 [ 1168.537409] Node 0 DMA32 free:155300kB boost:0kB min:3552kB low:5320kB high:7088kB reserved_highatomic:0KB active_anon:16kB inactive_anon:3368kB active_file:5660kB inactive_file:1585120kB unevictable:0kB writepending:7576kB present:1924360kB managed:1803376kB mlocked:0kB bounce:0kB free_pcp:3184kB local_pcp:864kB free_cma:0kB [ 1168.537464] lowmem_reserve[]: 0 0 2129 2129 [ 1168.537483] Node 0 Normal free:179208kB boost:0kB min:4376kB low:6556kB high:8736kB reserved_highatomic:2048KB active_anon:1272kB inactive_anon:248760kB active_file:196104kB inactive_file:1310096kB unevictable:46064kB writepending:900kB present:2253948kB managed:2163776kB mlocked:39920kB bounce:0kB free_pcp:3372kB local_pcp:632kB free_cma:0kB [ 1168.537540] lowmem_reserve[]: 0 0 0 0 [ 1168.537557] Node 0 DMA: 4*4kB (UM) 4*8kB (UM) 3*16kB (UM) 3*32kB (U) 3*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 15488kB [ 1168.537624] Node 0 DMA32: 1945*4kB (UME) 1446*8kB (UME) 953*16kB (UME) 1410*32kB (UM) 297*64kB (UM) 162*128kB (UM) 60*256kB (UM) 22*512kB (UM) 7*1024kB (UM) 1*2048kB (U) 0*4096kB = 155300kB [ 1168.537695] Node 0 Normal: 3146*4kB (UMEH) 2412*8kB (UMEH) 1444*16kB (UMEH) 1138*32kB (UMEH) 486*64kB (UME) 97*128kB (UM) 13*256kB (UM) 8*512kB (M) 2*1024kB (M) 5*2048kB (M) 6*4096kB (UM) = 179208kB [ 1168.537767] 779973 total pagecache pages [ 1168.537778] 3 pages in swap cache [ 1168.537788] Free swap = 4112124kB [ 1168.537798] Total swap = 4112380kB [ 1168.537807] 1048576 pages RAM [ 1168.537817] 0 pages HighMem/MovableOnly [ 1168.537827] 52804 pages reserved
Mikulas
On Sat, 21 Oct 2023, Marek Marczykowski-Górecki wrote:
Hi,
Since updating from 6.4.13 to 6.5.5 occasionally I hit a storage subsystem freeze - any I/O ends up frozen. I'm not sure what exactly triggers the issue, but often it happens when doing some LVM operations (lvremove, lvrename etc) on a dm-thin volume together with bulk data copy to/from another LVM thin volume with ext4 fs.
The storage stack I use is: nvme -> dm-crypt (LUKS) -> dm-thin (LVM thin pool) -> ext4
And this whole thing running in a (PV) dom0 under Xen, on Qubes OS 4.2 to be specific.
I can reproduce the issue on at least 3 different machines. I did tried also 6.5.6 and the issue is still there. I haven't checked newer versions, but briefly reviewed git log and haven't found anything suggesting a fix to similar issue.
I managed to bisect it down to this commit:
commit 5054e778fcd9cd29ddaa8109077cd235527e4f94 Author: Mikulas Patocka <mpatocka@redhat.com> Date: Mon May 1 09:19:17 2023 -0400 dm crypt: allocate compound pages if possible
It was reported that allocating pages for the write buffer in dm-crypt causes measurable overhead [1]. Change dm-crypt to allocate compound pages if they are available. If not, fall back to the mempool. [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html Suggested-by: Matthew Wilcox willy@infradead.org Signed-off-by: Mikulas Patocka mpatocka@redhat.com Signed-off-by: Mike Snitzer snitzer@kernel.org
TBH, I'm not sure if the bug is in this commit, or maybe in some functions it uses (I don't see dm-crypt functions directly involved in the stack traces I collected). But reverting this commit on top of 6.5.6 seems to fix the issue.
I tried also CONFIG_PROVE_LOCKING, but it didn't show any issue.
I managed to collect "blocked tasks" dump via sysrq below. Few more can be found at https://github.com/QubesOS/qubes-issues/issues/8575
[ 4246.558313] sysrq: Show Blocked State [ 4246.558388] task:journal-offline state:D stack:0 pid:8098 ppid:1 flags:0x00000002 [ 4246.558407] Call Trace: [ 4246.558414] <TASK> [ 4246.558422] __schedule+0x23d/0x670 [ 4246.558440] schedule+0x5e/0xd0 [ 4246.558450] io_schedule+0x46/0x70 [ 4246.558461] folio_wait_bit_common+0x13d/0x350 [ 4246.558475] ? __pfx_wake_page_function+0x10/0x10 [ 4246.558488] folio_wait_writeback+0x2c/0x90 [ 4246.558498] mpage_prepare_extent_to_map+0x15c/0x4d0 [ 4246.558512] ext4_do_writepages+0x25f/0x770 [ 4246.558523] ext4_writepages+0xad/0x180 [ 4246.558533] do_writepages+0xcf/0x1e0 [ 4246.558543] ? __seccomp_filter+0x32a/0x4f0 [ 4246.558554] filemap_fdatawrite_wbc+0x63/0x90 [ 4246.558567] __filemap_fdatawrite_range+0x5c/0x80 [ 4246.558578] file_write_and_wait_range+0x4a/0xb0 [ 4246.558588] ext4_sync_file+0x88/0x380 [ 4246.558598] __x64_sys_fsync+0x3b/0x70 [ 4246.558609] do_syscall_64+0x5c/0x90 [ 4246.558621] ? exit_to_user_mode_prepare+0xb2/0xd0 [ 4246.558632] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 4246.558644] RIP: 0033:0x7710cf124d0a [ 4246.558654] RSP: 002b:00007710ccdfda40 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [ 4246.558668] RAX: ffffffffffffffda RBX: 000064bb92f67e60 RCX: 00007710cf124d0a [ 4246.558679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028 [ 4246.558691] RBP: 000064bb92f72670 R08: 0000000000000000 R09: 00007710ccdfe6c0 [ 4246.558702] R10: 00007710cf0adfee R11: 0000000000000293 R12: 000064bb92505940 [ 4246.558713] R13: 0000000000000002 R14: 00007ffc05649500 R15: 00007710cc5fe000 [ 4246.558728] </TASK> [ 4246.558836] task:lvm state:D stack:0 pid:7835 ppid:5665 flags:0x00004006 [ 4246.558852] Call Trace: [ 4246.558857] <TASK> [ 4246.558863] __schedule+0x23d/0x670 [ 4246.558874] schedule+0x5e/0xd0 [ 4246.558884] io_schedule+0x46/0x70 [ 4246.558894] dm_wait_for_bios_completion+0xfc/0x110 [ 4246.558909] ? __pfx_autoremove_wake_function+0x10/0x10 [ 4246.558922] __dm_suspend+0x7e/0x1b0 [ 4246.558932] dm_internal_suspend_noflush+0x5c/0x80 [ 4246.558946] pool_presuspend+0xcc/0x130 [dm_thin_pool] [ 4246.558968] dm_table_presuspend_targets+0x3f/0x60 [ 4246.558980] __dm_suspend+0x41/0x1b0 [ 4246.558991] dm_suspend+0xc0/0xe0 [ 4246.559001] dev_suspend+0xa5/0xd0 [ 4246.559011] ctl_ioctl+0x26e/0x350 [ 4246.559020] ? __pfx_dev_suspend+0x10/0x10 [ 4246.559032] dm_ctl_ioctl+0xe/0x20 [ 4246.559041] __x64_sys_ioctl+0x94/0xd0 [ 4246.559052] do_syscall_64+0x5c/0x90 [ 4246.559062] ? do_syscall_64+0x6b/0x90 [ 4246.559072] ? do_syscall_64+0x6b/0x90 [ 4246.559081] ? 
xen_pv_evtchn_do_upcall+0x54/0xb0 [ 4246.559093] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 4246.559104] RIP: 0033:0x7f1cb77cfe0f [ 4246.559112] RSP: 002b:00007fff870f2560 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 4246.559141] RAX: ffffffffffffffda RBX: 00005b8d13c16580 RCX: 00007f1cb77cfe0f [ 4246.559152] RDX: 00005b8d144a2180 RSI: 00000000c138fd06 RDI: 0000000000000003 [ 4246.559164] RBP: 00005b8d144a2180 R08: 00005b8d132b1190 R09: 00007fff870f2420 [ 4246.559175] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c [ 4246.559186] R13: 00005b8d132aacf0 R14: 00005b8d1324414d R15: 00005b8d144a21b0 [ 4246.559199] </TASK> [ 4246.559207] task:kworker/u8:3 state:D stack:0 pid:8033 ppid:2 flags:0x00004000 [ 4246.559222] Workqueue: writeback wb_workfn (flush-253:4) [ 4246.559238] Call Trace: [ 4246.559244] <TASK> [ 4246.559249] __schedule+0x23d/0x670 [ 4246.559260] schedule+0x5e/0xd0 [ 4246.559270] io_schedule+0x46/0x70 [ 4246.559280] folio_wait_bit_common+0x13d/0x350 [ 4246.559290] ? __pfx_wake_page_function+0x10/0x10 [ 4246.559302] mpage_prepare_extent_to_map+0x309/0x4d0 [ 4246.559314] ext4_do_writepages+0x25f/0x770 [ 4246.559324] ext4_writepages+0xad/0x180 [ 4246.559334] do_writepages+0xcf/0x1e0 [ 4246.559344] ? find_busiest_group+0x42/0x1a0 [ 4246.559354] __writeback_single_inode+0x3d/0x280 [ 4246.559368] writeback_sb_inodes+0x1ed/0x4a0 [ 4246.559381] __writeback_inodes_wb+0x4c/0xf0 [ 4246.559393] wb_writeback+0x298/0x310 [ 4246.559403] wb_do_writeback+0x230/0x2b0 [ 4246.559414] wb_workfn+0x5f/0x260 [ 4246.559424] ? _raw_spin_unlock+0xe/0x30 [ 4246.559434] ? finish_task_switch.isra.0+0x95/0x2b0 [ 4246.559447] ? __schedule+0x245/0x670 [ 4246.559457] process_one_work+0x1df/0x3e0 [ 4246.559466] worker_thread+0x51/0x390 [ 4246.559475] ? __pfx_worker_thread+0x10/0x10 [ 4246.559484] kthread+0xe5/0x120 [ 4246.559495] ? __pfx_kthread+0x10/0x10 [ 4246.559504] ret_from_fork+0x31/0x50 [ 4246.559514] ? __pfx_kthread+0x10/0x10 [ 4246.559523] ret_from_fork_asm+0x1b/0x30 [ 4246.559536] </TASK>
-- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab
So, we got no response from the MM maintainers. Marek, please try this patch on all the machines where you hit the bug, and if you still hit the bug with this patch, report it.
Mikulas
From: Mikulas Patocka mpatocka@redhat.com
Subject: [PATCH] dm-crypt: don't allocate large compound pages
It was reported that the patch 5054e778fcd9cd29ddaa8109077cd235527e4f94 ("dm crypt: allocate compound pages if possible") causes intermittent freezes [1].
So far, it is not clear what the root cause is. It was reported that with an allocation order of 3 or lower it works [1], so we restrict the order to 3 (that is, PAGE_ALLOC_COSTLY_ORDER).
[1] https://www.spinics.net/lists/dm-devel/msg56048.html
Signed-off-by: Mikulas Patocka mpatocka@redhat.com
Reported-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Tested-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Cc: stable@vger.kernel.org # v6.5+
Fixes: 5054e778fcd9 ("dm crypt: allocate compound pages if possible")
---
 drivers/md/dm-crypt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -1679,7 +1679,7 @@ static struct bio *crypt_alloc_buffer(st
 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
 	unsigned int remaining_size;
-	unsigned int order = MAX_ORDER - 1;
+	unsigned int order = PAGE_ALLOC_COSTLY_ORDER;
 
 retry:
 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
On Fri, Oct 27, 2023 at 07:32:54PM +0200, Mikulas Patocka wrote:
So, we got no response from the MM maintainers. Marek, please try this patch on all the machines where you hit the bug, and if you still hit the bug with this patch, report it.
yes, how dare i go on holiday. i'll look at this when i'm back. probably.
On Sat, Oct 28 2023 at 5:23P -0400, Matthew Wilcox willy@infradead.org wrote:
On Fri, Oct 27, 2023 at 07:32:54PM +0200, Mikulas Patocka wrote:
So, we got no response from the MM maintainers. Marek - please try this
yes, how dare i go on holiday. i'll look at this when i'm back. probably.
There was no malice or disrespect until you emerged from holiday. Hopefully you're at the start of it; if not, you should extend it. Probably.
I've staged this workaround until you or others can look closer: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/c...
On Fri, Oct 27, 2023 at 07:32:54PM +0200, Mikulas Patocka wrote:
So, we got no response from the MM maintainers. Marek - please try this patch on all the machines where you hit the bug and if you still hit the bug with this patch, report it.
I've tested it for a bit and did not hit the issue again.
On 10/25/23 12:13, Mikulas Patocka wrote:
So, I'm forwarding this to the memory management maintainers.
Hi,
What do you think? We have a problem where, if dm-crypt allocates pages with order > 3 (PAGE_ALLOC_COSTLY_ORDER), the system occasionally freezes waiting in writeback.
Hmm, so I've checked the backtraces provided and none seems to show the actual allocating task doing the crypt_alloc_buffer(). Do we know what it's doing? Is it blocked, or perhaps spinning in some infinite loop?
dm-crypt allocates the pages with GFP_NOWAIT | __GFP_HIGHMEM | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, so it shouldn't put any pressure on the system. If the allocation fails, it falls back to a smaller-order allocation or to the mempool as a last resort.
Right. I noticed it may also fall back to __GFP_DIRECT_RECLAIM, but then it's only order-0.
When the freeze happens, there is "349264kB" free memory - so the system is not short on memory.
Yeah, it may still be fragmented, although in [1] the sysrq show-memory report suggests it's not; pages do exist up to MAX_ORDER. Weird.
[1] https://lore.kernel.org/all/ZTiJ3CO8w0jauOzW@mail-itl/
Should we restrict the dm-crypt allocation size to PAGE_ALLOC_COSTLY_ORDER? Or is it a bug somewhere in the memory management system that needs to be fixed there?
Haven't found any. However I'd like to point out some things I noticed in crypt_alloc_buffer(), although they are probably not related.
static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
{
        struct crypt_config *cc = io->cc;
        struct bio *clone;
        unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
        gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
        unsigned int remaining_size;
        unsigned int order = MAX_ORDER - 1;
retry:
        if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
                mutex_lock(&cc->bio_alloc_lock);
What if we end up in "goto retry" more than once? I don't see a matching unlock. Yeah, it's very unlikely that an order-0 allocation in the page allocator which includes __GFP_DIRECT_RECLAIM would fail, but it's not impossible, and I also see that crypt_page_alloc() for the mempool can fail for another reason, due to a counter being too high. Looks dangerous.
        clone = bio_alloc_bioset(cc->dev->bdev, nr_iovecs, io->base_bio->bi_opf,
                                 GFP_NOIO, &cc->bs);
        clone->bi_private = io;
        clone->bi_end_io = crypt_endio;
        remaining_size = size;
        while (remaining_size) {
                struct page *pages;
                unsigned size_to_add;
                unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
Tip: you can use get_order(remaining_size) here.
                order = min(order, remaining_order);

                while (order > 0) {
Is this intentionally > 0 and not >= 0? We could still succeed avoiding mempool with order-0...
                        pages = alloc_pages(gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY |
                                            __GFP_NOWARN | __GFP_COMP, order);
                        if (likely(pages != NULL))
                                goto have_pages;
                        order--;
                }

                pages = mempool_alloc(&cc->page_pool, gfp_mask);
                if (!pages) {
                        crypt_free_buffer_pages(cc, clone);
                        bio_put(clone);
                        gfp_mask |= __GFP_DIRECT_RECLAIM;
                        order = 0;
                        goto retry;
                }
have_pages:
                size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
                __bio_add_page(clone, pages, size_to_add, 0);
                remaining_size -= size_to_add;
        }
        /* Allocate space for integrity tags */
        if (dm_crypt_integrity_io_alloc(io, clone)) {
                crypt_free_buffer_pages(cc, clone);
                bio_put(clone);
                clone = NULL;
        }
        if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
                mutex_unlock(&cc->bio_alloc_lock);
        return clone;
}
On Sun, 29 Oct 2023, Vlastimil Babka wrote:
Haven't found any. However I'd like to point out some things I noticed in crypt_alloc_buffer(), although they are probably not related.
static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size) { struct crypt_config *cc = io->cc; struct bio *clone; unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM; unsigned int remaining_size; unsigned int order = MAX_ORDER - 1;
retry: if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM)) mutex_lock(&cc->bio_alloc_lock);
What if we end up in "goto retry" more than once? I don't see a matching
It is impossible. Before we jump to the retry label, we set __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if __GFP_DIRECT_RECLAIM is present (it will just wait until some other task frees some objects into the mempool).
unlock. Yeah, very unlikely to happen that order-0 in page allocator which includes __GFP_DIRECT_RECLAIM would fail, but not impossible, and also I see crypt_page_alloc() for the mempool can fail for another reason, due to a counter being too high. Looks dangerous.
If crypt_page_alloc fails, mempool falls back to allocating from a pre-allocated list.
But now I see that there is a bug: compound pages don't contribute to the "cc->n_allocated_pages" counter. I'll have to fix it.
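For illustration only (a sketch, not a patch from this thread, and the eventual fix may look different): the accounting would presumably have to charge all sub-pages of a compound allocation in crypt_alloc_buffer() and uncharge the same amount when the buffer is freed, along these lines:

        /* hypothetical sketch of the missing accounting */
        pages = alloc_pages(gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY |
                            __GFP_NOWARN | __GFP_COMP, order);
        if (likely(pages != NULL)) {
                /* a compound allocation is 1 << order pages, not one page */
                percpu_counter_add(&cc->n_allocated_pages, 1 << order);
                goto have_pages;
        }

with a matching percpu_counter_sub(&cc->n_allocated_pages, 1 << order) on the crypt_free_buffer_pages() side.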
clone = bio_alloc_bioset(cc->dev->bdev, nr_iovecs, io->base_bio->bi_opf, GFP_NOIO, &cc->bs); clone->bi_private = io; clone->bi_end_io = crypt_endio;
remaining_size = size;
while (remaining_size) { struct page *pages; unsigned size_to_add; unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
Tip: you can use get_order(remaining_size) here.
get_order rounds the size up and we need to round it down here (rounding it up would waste memory).
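To make the rounding difference concrete (illustrative numbers only, assuming 4 KiB pages):

        /* remaining_size = 5 * PAGE_SIZE + 512, i.e. just over 5 pages of data */
        get_order(remaining_size);
                /* rounds up: order 3, an 8-page allocation */
        __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
                /* rounds down: order 2, a 4-page allocation; the remainder is
                   picked up by later iterations of the surrounding while loop */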
order = min(order, remaining_order); while (order > 0) {
Is this intentionally > 0 and not >= 0? We could still succeed avoiding mempool with order-0...
Yes, it is intentional. mempool_alloc will try to allocate the page using alloc_page, so there is no need to go through the "pages = alloc_pages" branch before it.
Mikulas
On 10/30/23 08:37, Mikulas Patocka wrote:
On Sun, 29 Oct 2023, Vlastimil Babka wrote:
It is impossible. Before we jump to the retry label, we set __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if __GFP_DIRECT_RECLAIM is present (it will just wait until some other task frees some objects into the mempool).
Ah, missed that. And the traces don't show that we would be waiting for that. I'm starting to think the allocation itself is really not the issue here. Also I don't think it deprives something else of large order pages, as per the sysrq listing they still existed.
What I rather suspect is what happens next to the allocated bio such that it works well with order-0 or up to costly_order pages, but there's some problem causing a deadlock if the bio contains larger pages than that?
Cc Honza. The thread starts here: https://lore.kernel.org/all/ZTNH0qtmint%2FzLJZ@mail-itl/
The linked Qubes issue has a number of blocked-task listings that can be expanded: https://github.com/QubesOS/qubes-issues/issues/8575
On Mon, 30 Oct 2023, Vlastimil Babka wrote:
What I rather suspect is what happens next to the allocated bio such that it works well with order-0 or up to costly_order pages, but there's some problem causing a deadlock if the bio contains larger pages than that?
Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the memory allocation code and I suppose that one of them does something bad and triggers this bug. But I don't know which one.
Mikulas
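For reference, one example of the kind of order-gated branch being referred to is the check that decides whether pages of a given order are handled via the per-CPU page lists; roughly, in 6.5-era mm/page_alloc.c (quoted from memory and shown only as an illustration of such a branch, not as a diagnosis):

        static inline bool pcp_allowed_order(unsigned int order)
        {
                if (order <= PAGE_ALLOC_COSTLY_ORDER)
                        return true;
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                if (order == pageblock_order)
                        return true;
        #endif
                return false;
        }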
On 10/30/23 12:22, Mikulas Patocka wrote:
On Mon, 30 Oct 2023, Vlastimil Babka wrote:
Yes. There are many "if (order > PAGE_ALLOC_COSTLY_ORDER)" branches in the memory allocation code and I suppose that one of them does something bad and triggers this bug. But I don't know which one.
It's not what I meant. All the interesting branches for costly order in the page allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be hitting those here. The traces I've seen suggest the allocation of the bio succeeded, and problems arose only after it was submitted.
I wouldn't even be surprised if the threshold for hitting the bug was not exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER + 1 or + 2 (has that been tested?) or rather that there's no exact threshold, but probability increases with order.
Mikulas
On Mon, 30 Oct 2023, Vlastimil Babka wrote:
I wouldn't even be surprised if the threshold for hitting the bug was not exactly order > PAGE_ALLOC_COSTLY_ORDER but order > PAGE_ALLOC_COSTLY_ORDER + 1 or + 2 (has that been tested?) or rather that there's no exact threshold, but probability increases with order.
It has been tested - Marek tested it with initial order == 4 and he hit the bug. With 3 and lower he didn't.
Mikulas
On Mon 30-10-23 12:30:23, Vlastimil Babka wrote:
It's not what I meant. All the interesting branches for costly order in page allocator/compaction only apply with __GFP_DIRECT_RECLAIM, so we can't be hitting those here. The traces I've seen suggest the allocation of the bio suceeded, and problems arised only after it was submitted.
Well, it would be possible that larger pages in a bio would trip e.g. bio splitting due to maximum segment size the disk supports (which can be e.g. 0xffff) and that upsets something somewhere. But this is pure speculation. We definitely need more debug data to be able to tell more.
Honza
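One way to probe that hypothesis (a sketch, nothing that was posted in this thread; the helper name and the idea of calling it right before submission are assumptions) would be to log the underlying queue's limits next to the largest bvec of an outgoing bio:

        /* hypothetical debug aid: compare queue limits with the largest
         * segment of a bio that dm-crypt is about to submit */
        static void debug_dump_bio_limits(struct block_device *bdev, struct bio *bio)
        {
                struct request_queue *q = bdev_get_queue(bdev);
                struct bio_vec bv;
                struct bvec_iter iter;
                unsigned int largest = 0;

                bio_for_each_bvec(bv, bio, iter)
                        largest = max(largest, bv.bv_len);

                pr_info("max_segment_size=%u max_sectors=%u largest_bvec=%u\n",
                        queue_max_segment_size(q), queue_max_sectors(q), largest);
        }

If largest_bvec starts exceeding max_segment_size only once compound pages are in use, that would point at the splitting path.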
On Mon, Oct 30, 2023 at 01:25:13PM +0100, Jan Kara wrote:
Well, it would be possible that larger pages in a bio would trip e.g. bio splitting due to maximum segment size the disk supports (which can be e.g. 0xffff) and that upsets something somewhere. But this is pure speculation. We definitely need more debug data to be able to tell more.
I can collect more info, but I need some guidance how :) Some patch adding extra debug messages? Note I collect those via serial console (writing to disk doesn't work when it freezes), and that has some limits in the amount of data I can extract, especially when printed quickly. For example sysrq-t is too much. Or maybe there is some trick to it, like increasing log_buf_len?
On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
If you can do more tests, I would suggest this:
We already know that it works with order 3 and doesn't work with order 4.
So, in the file include/linux/mmzone.h, change PAGE_ALLOC_COSTLY_ORDER from 3 to 4 and in the file drivers/md/dm-crypt.c leave "unsigned int order = PAGE_ALLOC_COSTLY_ORDER" there.
Does it deadlock or not?
So that we can see whether the deadlock depends on PAGE_ALLOC_COSTLY_ORDER or whether it is just a coincidence.
Mikulas
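Concretely, the experiment described above amounts to a one-line change on top of the earlier dm-crypt patch (a sketch; the exact context in include/linux/mmzone.h may differ between kernel versions):

        --- a/include/linux/mmzone.h
        +++ b/include/linux/mmzone.h
        @@
        -#define PAGE_ALLOC_COSTLY_ORDER 3
        +#define PAGE_ALLOC_COSTLY_ORDER 4

while drivers/md/dm-crypt.c keeps "unsigned int order = PAGE_ALLOC_COSTLY_ORDER;".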
On Mon 30-10-23 15:08:56, Mikulas Patocka wrote:
Good idea. Also if the kernel hangs, please find kcryptd processes. In what state are they? If they are sleeping, please send what's in /proc/<kcryptd-pid>/stack. Thanks!
Honza
On Mon, Oct 30, 2023 at 04:56:03PM +0100, Jan Kara wrote:
Good idea. Also if the kernel hangs, please find kcryptd processes. In what state are they? If they are sleeping, please send what's in /proc/<kcryptd-pid>/stack. Thanks!
Will do.
In the meantime, while testing a version with PAGE_ALLOC_COSTLY_ORDER=4 and order=PAGE_ALLOC_COSTLY_ORDER, I'm getting a crash like this (see the important note below both traces):
[ 92.668486] BUG: unable to handle page fault for address: ffff8880c7b64098 [ 92.668558] #PF: supervisor read access in kernel mode [ 92.668574] #PF: error_code(0x0000) - not-present page [ 92.668590] PGD 2a32067 P4D 2a32067 PUD 12868a067 PMD 0 [ 92.668617] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 92.668637] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 6.5.6-dirty #354 [ 92.668658] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023 [ 92.668675] RIP: e030:__free_one_page+0x301/0x3e0 [ 92.668704] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10 [ 92.668738] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006 [ 92.668754] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0 [ 92.668773] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000 [ 92.668789] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac [ 92.668805] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8 [ 92.668821] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00 [ 92.668854] FS: 0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000 [ 92.668874] CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033 [ 92.668889] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660 [ 92.668912] Call Trace: [ 92.668924] <IRQ> [ 92.668934] ? __die+0x1e/0x60 [ 92.668955] ? page_fault_oops+0x178/0x4a0 [ 92.668975] ? exc_page_fault+0x14e/0x160 [ 92.668994] ? asm_exc_page_fault+0x26/0x30 [ 92.669014] ? __free_one_page+0x301/0x3e0 [ 92.669027] free_pcppages_bulk+0x11c/0x2b0 [ 92.669042] free_unref_page+0x10d/0x170 [ 92.669058] crypt_free_buffer_pages+0x1f4/0x250 [ 92.669079] crypt_endio+0x48/0x70 [ 92.669094] blk_mq_end_request_batch+0xd0/0x400 [ 92.669114] nvme_irq+0x6d/0x80 [ 92.669132] ? __pfx_nvme_pci_complete_batch+0x10/0x10 [ 92.669148] __handle_irq_event_percpu+0x42/0x1a0 [ 92.669166] handle_irq_event+0x33/0x70 [ 92.669179] handle_edge_irq+0x9e/0x240 [ 92.669197] handle_irq_desc+0x36/0x50 [ 92.669215] __evtchn_fifo_handle_events+0x1af/0x1d0 [ 92.669236] __xen_evtchn_do_upcall+0x70/0xd0 [ 92.669255] __xen_pv_evtchn_do_upcall+0x3d/0x70 [ 92.669275] xen_pv_evtchn_do_upcall+0xd9/0x110 [ 92.669294] </IRQ> [ 92.669302] <TASK> [ 92.669310] exc_xen_hypervisor_callback+0x8/0x20 [ 92.669330] RIP: e030:xen_hypercall_xen_version+0xa/0x20 [ 92.669348] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc [ 92.669380] RSP: e02b:ffffc900400c7e08 EFLAGS: 00000246 [ 92.669394] RAX: 0000000000040011 RBX: 0000000000000000 RCX: ffffffff81fc222a [ 92.669411] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000 [ 92.669426] RBP: ffffc900400c7eb0 R08: 0000000000000000 R09: 00000000000001ba [ 92.669442] R10: 0000000000007ff0 R11: 0000000000000246 R12: ffff88818976c540 [ 92.669458] R13: ffff888101e10000 R14: 0000000000000000 R15: 0000000000000402 [ 92.669475] ? xen_hypercall_xen_version+0xa/0x20 [ 92.669493] ? pmu_msr_read+0x3c/0xd0 [ 92.669510] ? xen_force_evtchn_callback+0xd/0x20 [ 92.669526] ? check_events+0x16/0x30 [ 92.669540] ? xen_irq_enable_direct+0x1d/0x30 [ 92.669556] ? finish_task_switch.isra.0+0x8e/0x270 [ 92.669576] ? __switch_to+0x165/0x3b0 [ 92.669590] ? __schedule+0x316/0x8b0 [ 92.669609] ? schedule_idle+0x25/0x40 [ 92.669626] ? 
cpu_startup_entry+0x25/0x30 [ 92.669641] ? cpu_bringup_and_idle+0x89/0xa0 [ 92.669660] ? asm_cpu_bringup_and_idle+0x9/0x10 [ 92.669680] </TASK> [ 92.669688] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep hid_multitouch snd_hda_core snd_seq snd_seq_device idma64 snd_pcm i2c_designware_platform iwlwifi i2c_designware_core snd_timer snd soundcore i2c_i801 i2c_smbus efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput [ 92.669901] CR2: ffff8880c7b64098 [ 92.669915] ---[ end trace 0000000000000000 ]--- [ 92.669930] RIP: e030:__free_one_page+0x301/0x3e0 [ 92.669946] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10 [ 92.669977] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006 [ 92.669992] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0 [ 92.670008] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000 [ 92.670024] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac [ 92.670040] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8 [ 92.670055] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00 [ 92.670083] FS: 0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000 [ 92.670101] CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033 [ 92.670116] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660 [ 92.670148] Kernel panic - not syncing: Fatal exception in interrupt [ 92.670177] ------------[ cut here ]------------ [ 92.670188] WARNING: CPU: 5 PID: 0 at kernel/smp.c:766 smp_call_function_many_cond+0x4db/0x560 [ 92.670219] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep hid_multitouch snd_hda_core snd_seq snd_seq_device idma64 snd_pcm i2c_designware_platform iwlwifi i2c_designware_core snd_timer snd soundcore i2c_i801 i2c_smbus efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput [ 92.670400] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G D W 6.5.6-dirty #354 [ 92.670419] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023 [ 92.670434] RIP: e030:smp_call_function_many_cond+0x4db/0x560 [ 92.670455] Code: f0 52 1b 81 48 89 74 24 08 e8 d1 c8 f6 ff 48 8b 74 24 08 65 ff 0d 5d 70 e7 7e 0f 85 bb fd ff ff 0f 1f 44 00 00 e9 b1 fd ff ff <0f> 0b e9 69 fb ff ff 8b 7c 24 38 e8 d5 4f f7 ff 84 c0 0f 84 a8 fd [ 92.670486] RSP: e02b:ffffc900401549d0 EFLAGS: 00010006 [ 92.670500] RAX: 0000000080010007 RBX: 
0000000000000000 RCX: 0000000000000000 [ 92.670516] RDX: 0000000000000000 RSI: ffffffff826f3c61 RDI: ffffffff82cbc1d0 [ 92.670531] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000001 [ 92.670546] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: 0000000000000005 [ 92.670562] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 92.670588] FS: 0000000000000000(0000) GS:ffff888189740000(0000) knlGS:0000000000000000 [ 92.670606] CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033 [ 92.670620] CR2: ffff8880c7b64098 CR3: 0000000133cf8000 CR4: 0000000000050660 [ 92.670642] Call Trace: [ 92.670650] <IRQ> [ 92.670659] ? smp_call_function_many_cond+0x4db/0x560 [ 92.670677] ? __warn+0x7c/0x130 [ 92.670693] ? smp_call_function_many_cond+0x4db/0x560 [ 92.670711] ? report_bug+0x191/0x1c0 [ 92.670726] ? handle_bug+0x3c/0x80 [ 92.670742] ? exc_invalid_op+0x17/0x70 [ 92.670758] ? asm_exc_invalid_op+0x1a/0x20 [ 92.670775] ? smp_call_function_many_cond+0x4db/0x560 [ 92.670793] ? __pfx_stop_self+0x10/0x10 [ 92.670811] ? _printk+0x5f/0x80 [ 92.670823] ? __pfx_stop_self+0x10/0x10 [ 92.670840] smp_call_function+0x38/0x70 [ 92.670857] panic+0x19c/0x320 [ 92.670873] oops_end+0xd8/0xe0 [ 92.670887] page_fault_oops+0x19c/0x4a0 [ 92.670905] exc_page_fault+0x14e/0x160 [ 92.670921] asm_exc_page_fault+0x26/0x30 [ 92.670937] RIP: e030:__free_one_page+0x301/0x3e0 [ 92.670952] Code: 02 0f 85 c1 fe ff ff 49 c1 e6 04 49 8d 4c 24 08 4a 8d 94 36 c0 00 00 00 48 8d 34 80 48 8d 04 70 4c 01 fa 48 c1 e0 03 49 01 c6 <4b> 8b b4 37 c0 00 00 00 48 89 4e 08 49 89 74 24 08 49 89 54 24 10 [ 92.670983] RSP: e02b:ffffc90040154c60 EFLAGS: 00010006 [ 92.670997] RAX: 0000000000000058 RBX: 0000000000000001 RCX: ffff888075f66bc0 [ 92.671013] RDX: ffff8880c7b64098 RSI: 0000000000000005 RDI: 0000000000000000 [ 92.671028] RBP: fffffe7a01d7d9ae R08: ffff888075f66b38 R09: fffffe7a01d7d9ac [ 92.671044] R10: ffffea0004b35bc8 R11: 0000000000000001 R12: ffff888075f66bb8 [ 92.671059] R13: 0000000000000000 R14: 0000000051bfd4d8 R15: ffff888075f66b00 [ 92.671077] free_pcppages_bulk+0x11c/0x2b0 [ 92.671091] free_unref_page+0x10d/0x170 [ 92.671106] crypt_free_buffer_pages+0x1f4/0x250 [ 92.671122] crypt_endio+0x48/0x70 [ 92.671136] blk_mq_end_request_batch+0xd0/0x400 [ 92.671152] nvme_irq+0x6d/0x80 [ 92.671166] ? __pfx_nvme_pci_complete_batch+0x10/0x10 [ 92.671181] __handle_irq_event_percpu+0x42/0x1a0 [ 92.671196] handle_irq_event+0x33/0x70 [ 92.671208] handle_edge_irq+0x9e/0x240 [ 92.671224] handle_irq_desc+0x36/0x50 [ 92.671241] __evtchn_fifo_handle_events+0x1af/0x1d0 [ 92.671258] __xen_evtchn_do_upcall+0x70/0xd0 [ 92.671275] __xen_pv_evtchn_do_upcall+0x3d/0x70 [ 92.671292] xen_pv_evtchn_do_upcall+0xd9/0x110 [ 92.671308] </IRQ> [ 92.671316] <TASK> [ 92.671324] exc_xen_hypervisor_callback+0x8/0x20 [ 92.671342] RIP: e030:xen_hypercall_xen_version+0xa/0x20 [ 92.671360] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc [ 92.671390] RSP: e02b:ffffc900400c7e08 EFLAGS: 00000246 [ 92.671403] RAX: 0000000000040011 RBX: 0000000000000000 RCX: ffffffff81fc222a [ 92.671419] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000 [ 92.671434] RBP: ffffc900400c7eb0 R08: 0000000000000000 R09: 00000000000001ba [ 92.671449] R10: 0000000000007ff0 R11: 0000000000000246 R12: ffff88818976c540 [ 92.671465] R13: ffff888101e10000 R14: 0000000000000000 R15: 0000000000000402 [ 92.671481] ? 
xen_hypercall_xen_version+0xa/0x20 [ 92.671499] ? pmu_msr_read+0x3c/0xd0 [ 92.671514] ? xen_force_evtchn_callback+0xd/0x20 [ 92.671529] ? check_events+0x16/0x30 [ 92.671543] ? xen_irq_enable_direct+0x1d/0x30 [ 92.671560] ? finish_task_switch.isra.0+0x8e/0x270 [ 92.671579] ? __switch_to+0x165/0x3b0 [ 92.671593] ? __schedule+0x316/0x8b0 [ 92.671611] ? schedule_idle+0x25/0x40 [ 92.671631] ? cpu_startup_entry+0x25/0x30 [ 92.671644] ? cpu_bringup_and_idle+0x89/0xa0 [ 92.671663] ? asm_cpu_bringup_and_idle+0x9/0x10 [ 92.671683] </TASK> [ 92.671691] ---[ end trace 0000000000000000 ]--- [ 92.671782] Kernel Offset: disabled (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
If I change order=PAGE_ALLOC_COSTLY_ORDER+1, then this:
[ 2205.112802] BUG: unable to handle page fault for address: ffffffff89630301 [ 2205.112866] #PF: supervisor write access in kernel mode [ 2205.112882] #PF: error_code(0x0002) - not-present page [ 2205.112899] PGD 2a35067 P4D 2a35067 PUD 2a36067 PMD 0 [ 2205.112921] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 2205.112946] CPU: 0 PID: 12609 Comm: kworker/u12:9 Tainted: G W 6.5.6-dirty #355 [ 2205.112979] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023 [ 2205.112997] Workqueue: kcryptd/252:0 kcryptd_crypt [ 2205.113022] RIP: e030:get_page_from_freelist+0x281/0x10c0 [ 2205.113044] Code: 6c 05 18 49 89 df 8b 5c 24 34 49 8b 47 18 48 39 c5 0f 84 85 02 00 00 49 8b 47 18 48 8b 48 08 48 8b 30 48 8d 50 f8 48 89 4e 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 83 c1 22 48 89 [ 2205.113079] RSP: e02b:ffffc90041ea7c48 EFLAGS: 00010283 [ 2205.113096] RAX: ffffea00059b9248 RBX: 0000000000000001 RCX: ffffffff89630301 [ 2205.113117] RDX: ffffea00059b9240 RSI: ffffea00059b9201 RDI: 0000000000000001 [ 2205.113134] RBP: ffff888189631718 R08: 000000000000c000 R09: 000000000003cb07 [ 2205.113153] R10: 0000000000000001 R11: fefefefefefefeff R12: ffff888075f66b00 [ 2205.113171] R13: ffff888189631700 R14: 0000000000000000 R15: ffff888189631700 [ 2205.113208] FS: 0000000000000000(0000) GS:ffff888189600000(0000) knlGS:0000000000000000 [ 2205.113229] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2205.113245] CR2: ffffffff89630301 CR3: 00000001049d0000 CR4: 0000000000050660 [ 2205.113272] Call Trace: [ 2205.113285] <TASK> [ 2205.113295] ? __die+0x1e/0x60 [ 2205.113312] ? page_fault_oops+0x178/0x4a0 [ 2205.113330] ? exc_page_fault+0x14e/0x160 [ 2205.113346] ? asm_exc_page_fault+0x26/0x30 [ 2205.113363] ? get_page_from_freelist+0x281/0x10c0 [ 2205.113379] ? get_page_from_freelist+0x227/0x10c0 [ 2205.113396] __alloc_pages+0x1dd/0x300 [ 2205.113412] crypt_page_alloc+0x29/0x60 [ 2205.113425] mempool_alloc+0x81/0x1b0 [ 2205.113443] kcryptd_crypt+0x293/0x4b0 [ 2205.113458] process_one_work+0x1e0/0x3e0 [ 2205.113476] worker_thread+0x49/0x3b0 [ 2205.113490] ? _raw_spin_lock_irqsave+0x22/0x50 [ 2205.113510] ? __pfx_worker_thread+0x10/0x10 [ 2205.113527] kthread+0xef/0x120 [ 2205.113541] ? __pfx_kthread+0x10/0x10 [ 2205.113554] ret_from_fork+0x2c/0x50 [ 2205.113568] ? 
__pfx_kthread+0x10/0x10 [ 2205.113581] ret_from_fork_asm+0x1b/0x30 [ 2205.113598] </TASK> [ 2205.113606] Modules linked in: snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_hda_codec_generic snd_sof ledtrig_audio snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device snd_pcm hid_multitouch snd_timer snd i2c_designware_platform i2c_designware_core i2c_i801 soundcore iwlwifi idma64 i2c_smbus i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput efivarfs [ 2205.113795] CR2: ffffffff89630301 [ 2205.113808] ---[ end trace 0000000000000000 ]--- [ 2205.113822] RIP: e030:get_page_from_freelist+0x281/0x10c0 [ 2205.113839] Code: 6c 05 18 49 89 df 8b 5c 24 34 49 8b 47 18 48 39 c5 0f 84 85 02 00 00 49 8b 47 18 48 8b 48 08 48 8b 30 48 8d 50 f8 48 89 4e 08 <48> 89 31 48 b9 00 01 00 00 00 00 ad de 48 89 08 48 83 c1 22 48 89 [ 2205.113873] RSP: e02b:ffffc90041ea7c48 EFLAGS: 00010283 [ 2205.113887] RAX: ffffea00059b9248 RBX: 0000000000000001 RCX: ffffffff89630301 [ 2205.113903] RDX: ffffea00059b9240 RSI: ffffea00059b9201 RDI: 0000000000000001 [ 2205.113919] RBP: ffff888189631718 R08: 000000000000c000 R09: 000000000003cb07 [ 2205.113935] R10: 0000000000000001 R11: fefefefefefefeff R12: ffff888075f66b00 [ 2205.113950] R13: ffff888189631700 R14: 0000000000000000 R15: ffff888189631700 [ 2205.113980] FS: 0000000000000000(0000) GS:ffff888189600000(0000) knlGS:0000000000000000 [ 2205.113998] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2205.114012] CR2: ffffffff89630301 CR3: 00000001049d0000 CR4: 0000000000050660 [ 2205.114038] note: kworker/u12:9[12609] exited with irqs disabled [ 2205.114100] note: kworker/u12:9[12609] exited with preempt_count 2
Then retried with order=PAGE_ALLOC_COSTLY_ORDER and PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.
Both happened only after logging into an X session (lightdm -> Xfce) and starting xfce4-terminal. If I use a text console/ssh/serial and leave lightdm at the login screen, then it does not crash (this is in fact how I did most previous tests). I have no idea if it's related to some other bug somewhere else (graphics driver?), or simply to higher memory usage due to the whole Xfce session running. It could also be a coincidence, the sample size is rather small... But it could also be some memory corruption that, depending on memory layout, sometimes results in a crash and sometimes in "just" a storage freeze.
Note this all is still on top of 6.5.6 with the changes we discuss here. If you believe it's another issue that got fixed in the meantime, I can switch to another version, but otherwise I'd like to limit changes.
On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
In the meantime, while testing a version with PAGE_ALLOC_COSTLY_ORDER=4 and order=PAGE_ALLOC_COSTLY_ORDER, I'm getting a crash like this (see the important note below both traces):
Perhaps the kernel uses a hardcoded value "3" instead of PAGE_ALLOC_COSTLY_ORDER somewhere...
Then I retried with order=PAGE_ALLOC_COSTLY_ORDER and PAGE_ALLOC_COSTLY_ORDER back at 3, and also got a similar crash.
So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER isn't safe?
Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try to provoke a similar crash. Let's see if it crashes on one of the VM_BUG_ON statements.
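(For reference, one way to flip those options on an existing .config before rebuilding - this assumes a standard kernel source tree with scripts/config available:)

scripts/config --enable DEBUG_KERNEL --enable DEBUG_VM
make olddefconfig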
Mikulas
Both happened only after logging into an X session (lightdm -> Xfce) and starting xfce4-terminal. If I use a text console/ssh/serial and leave lightdm at the login screen, then it does not crash (this is in fact how I did most of the previous tests). I have no idea whether it's related to some other bug somewhere else (graphics driver?), or simply to the higher memory usage of the whole Xfce session running. It could also be a coincidence, the sample size is rather small... But it could also be some memory corruption that, depending on memory layout, sometimes results in a crash and sometimes in "just" a storage freeze.
Note this is all still on top of 6.5.6 with the changes we discuss here. If you believe it's another issue that got fixed in the meantime, I can switch to another version, but otherwise I'd like to limit the changes.
-- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab
On Mon, Oct 30, 2023 at 06:50:35PM +0100, Mikulas Patocka wrote:
On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
Then retried with order=PAGE_ALLOC_COSTLY_ORDER and PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.
So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER isn't safe?
That seems to be another bug, see below.
Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try to provoke a similar crash. Let's see if it crashes on one of the VM_BUG_ON statements.
This was a very interesting idea. With this, immediately after login I get a crash like the one below. Which makes sense, as this is when pulseaudio starts and opens /dev/snd/*. I then tried with the dm-crypt commit reverted and still got the crash! But after blacklisting snd_pcm there is no BUG splat, yet the storage freeze still happens on vanilla 6.5.6.
The snd_pcm BUG splat:
[ 51.082877] page:00000000d8fdb7f1 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11b7d9 [ 51.082919] flags: 0x200000000000000(node=0|zone=2) [ 51.082924] page_type: 0xffffffff() [ 51.082929] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000 [ 51.082934] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 51.082938] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u)) [ 51.082969] ------------[ cut here ]------------ [ 51.082972] kernel BUG at include/linux/mm.h:1406! [ 51.082980] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 51.082986] CPU: 5 PID: 3893 Comm: alsa-sink-Gener Tainted: G W 6.5.6-dirty #359 [ 51.082992] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023 [ 51.082997] RIP: e030:snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm] [ 51.083015] Code: 48 2b 05 8e 7b 67 c2 48 01 f0 48 c1 e8 0c 48 c1 e0 06 48 03 05 6c 7b 67 c2 e9 4c ff ff ff 48 c7 c6 d8 71 1c c0 e8 93 1e 0e c1 <0f> 0b 48 83 ef 01 e9 4d ff ff ff 48 8b 05 51 47 89 c2 eb c9 66 66 [ 51.083023] RSP: e02b:ffffc90041be7e00 EFLAGS: 00010246 [ 51.083028] RAX: 000000000000005c RBX: ffffc90041be7e28 RCX: 0000000000000000 [ 51.083033] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff [ 51.083038] RBP: ffff888102e75f18 R08: 00000000ffffdfff R09: 0000000000000001 [ 51.083042] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: ffff888102e75f18 [ 51.083047] R13: 0000000000000255 R14: ffff888100955e80 R15: ffff888102e75f18 [ 51.083056] FS: 00007f51d354f6c0(0000) GS:ffff888189740000(0000) knlGS:0000000000000000 [ 51.083061] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 51.083065] CR2: 00007f51d36f6000 CR3: 000000011b53e000 CR4: 0000000000050660 [ 51.083072] Call Trace: [ 51.083076] <TASK> [ 51.083078] ? die+0x31/0x80 [ 51.083085] ? do_trap+0xd5/0x100 [ 51.083089] ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm] [ 51.083103] ? do_error_trap+0x65/0x90 [ 51.083107] ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm] [ 51.083120] ? exc_invalid_op+0x50/0x70 [ 51.083127] ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm] [ 51.083140] ? asm_exc_invalid_op+0x1a/0x20 [ 51.083146] ? snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm] [ 51.083159] __do_fault+0x29/0x110 [ 51.083165] __handle_mm_fault+0x5fb/0xc40 [ 51.083170] handle_mm_fault+0x91/0x1e0 [ 51.083173] do_user_addr_fault+0x216/0x5d0 [ 51.083179] ? 
check_preemption_disabled+0x31/0xf0 [ 51.083185] exc_page_fault+0x71/0x160 [ 51.083189] asm_exc_page_fault+0x26/0x30 [ 51.083195] RIP: 0033:0x7f51e56793ca [ 51.083198] Code: c5 fe 7f 07 c5 fe 7f 47 20 c5 fe 7f 47 40 c5 fe 7f 47 60 c5 f8 77 c3 66 0f 1f 84 00 00 00 00 00 40 0f b6 c6 48 89 d1 48 89 fa <f3> aa 48 89 d0 c5 f8 77 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 [ 51.083207] RSP: 002b:00007f51d354c528 EFLAGS: 00010202 [ 51.083211] RAX: 0000000000000000 RBX: 00007f51d354ec80 RCX: 00000000000034e0 [ 51.083216] RDX: 00007f51d36f5000 RSI: 0000000000000000 RDI: 00007f51d36f6000 [ 51.083220] RBP: 000055fec98b2f60 R08: 00007f51cc0031c0 R09: 0000000000000000 [ 51.083224] R10: 0000000000000000 R11: 0000000000000101 R12: 000055fec98b2f60 [ 51.083228] R13: 00007f51d354c630 R14: 0000000000000000 R15: 000055fec78ba680 [ 51.083233] </TASK> [ 51.083235] Modules linked in: snd_hda_codec_hdmi snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_hda_codec_generic snd_sof_pci ledtrig_audio snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device hid_multitouch snd_pcm i2c_i801 idma64 iwlwifi i2c_smbus i2c_designware_platform i2c_designware_core snd_timer snd soundcore efivarfs i2c_hid_acpi i2c_hid pinctrl_tigerlake pinctrl_intel xen_acpi_processor xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput [ 51.083293] ---[ end trace 0000000000000000 ]--- [ 51.083296] RIP: e030:snd_pcm_mmap_data_fault+0x11d/0x140 [snd_pcm] [ 51.083310] Code: 48 2b 05 8e 7b 67 c2 48 01 f0 48 c1 e8 0c 48 c1 e0 06 48 03 05 6c 7b 67 c2 e9 4c ff ff ff 48 c7 c6 d8 71 1c c0 e8 93 1e 0e c1 <0f> 0b 48 83 ef 01 e9 4d ff ff ff 48 8b 05 51 47 89 c2 eb c9 66 66 [ 51.083318] RSP: e02b:ffffc90041be7e00 EFLAGS: 00010246 [ 51.083323] RAX: 000000000000005c RBX: ffffc90041be7e28 RCX: 0000000000000000 [ 51.083327] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff [ 51.083331] RBP: ffff888102e75f18 R08: 00000000ffffdfff R09: 0000000000000001 [ 51.083335] R10: 00000000ffffdfff R11: ffffffff82a5ddc0 R12: ffff888102e75f18 [ 51.083340] R13: 0000000000000255 R14: ffff888100955e80 R15: ffff888102e75f18 [ 51.083347] FS: 00007f51d354f6c0(0000) GS:ffff888189740000(0000) knlGS:0000000000000000 [ 51.083353] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 51.083356] CR2: 00007f51d36f6000 CR3: 000000011b53e000 CR4: 0000000000050660
Having discovered that, I'm redoing the recent tests with snd_pcm blacklisted. I'll get back to debugging the snd_pcm issue separately.
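(By blacklisting I mean the usual modprobe.d mechanism, roughly as below - the file name is arbitrary:)

echo "blacklist snd_pcm" > /etc/modprobe.d/blacklist-snd-pcm.conf
# plus an initramfs rebuild if the module would otherwise be loaded early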
Plain 6.5.6 (so order = MAX_ORDER - 1, and PAGE_ALLOC_COSTLY_ORDER=3), in frozen state:
[ 143.195348] sysrq: Show Blocked State [ 143.195471] task:lvm state:D stack:13312 pid:4882 ppid:2025 flags:0x00004002 [ 143.195504] Call Trace: [ 143.195514] <TASK> [ 143.195526] __schedule+0x30e/0x8b0 [ 143.195550] ? __pfx_dev_suspend+0x10/0x10 [ 143.195569] schedule+0x59/0xb0 [ 143.195582] io_schedule+0x41/0x70 [ 143.195595] dm_wait_for_completion+0x19d/0x1b0 [ 143.195671] ? __pfx_autoremove_wake_function+0x10/0x10 [ 143.195693] __dm_suspend+0x79/0x190 [ 143.195707] ? __pfx_dev_suspend+0x10/0x10 [ 143.195723] dm_internal_suspend_noflush+0x57/0x80 [ 143.195740] pool_presuspend+0xc7/0x130 [ 143.195759] dm_table_presuspend_targets+0x38/0x60 [ 143.195774] __dm_suspend+0x34/0x190 [ 143.195788] ? preempt_count_add+0x69/0xa0 [ 143.195805] ? __pfx_dev_suspend+0x10/0x10 [ 143.195819] dm_suspend+0xbb/0xe0 [ 143.195835] ? preempt_count_add+0x46/0xa0 [ 143.195851] dev_suspend+0x18e/0x2d0 [ 143.195867] ? __pfx_dev_suspend+0x10/0x10 [ 143.195882] ctl_ioctl+0x329/0x640 [ 143.195901] dm_ctl_ioctl+0x9/0x10 [ 143.195917] __x64_sys_ioctl+0x8f/0xd0 [ 143.195938] do_syscall_64+0x3c/0x90 [ 143.195954] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 143.195975] RIP: 0033:0x7f2e0ab1fe0f [ 143.195989] RSP: 002b:00007ffd59a16e60 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 143.196011] RAX: ffffffffffffffda RBX: 000056289d130840 RCX: 00007f2e0ab1fe0f [ 143.196029] RDX: 000056289d120b80 RSI: 00000000c138fd06 RDI: 0000000000000003 [ 143.196046] RBP: 000056289d120b80 R08: 000056289a7eb190 R09: 00007ffd59a16d20 [ 143.196063] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c [ 143.196080] R13: 000056289a7e4cf0 R14: 000056289a77e14d R15: 000056289d120bb0 [ 143.196098] </TASK> [ 143.196106] task:blkdiscard state:D stack:13672 pid:4884 ppid:2025 flags:0x00000002 [ 143.196130] Call Trace: [ 143.196139] <TASK> [ 143.196147] __schedule+0x30e/0x8b0 [ 143.196162] schedule+0x59/0xb0 [ 143.196175] schedule_timeout+0x14c/0x160 [ 143.196193] io_schedule_timeout+0x4b/0x70 [ 143.196207] wait_for_completion_io+0x81/0x130 [ 143.196226] submit_bio_wait+0x5c/0x90 [ 143.196241] blkdev_issue_discard+0x94/0xe0 [ 143.196260] blkdev_common_ioctl+0x79e/0x9c0 [ 143.196279] blkdev_ioctl+0xc7/0x270 [ 143.196293] __x64_sys_ioctl+0x8f/0xd0 [ 143.196310] do_syscall_64+0x3c/0x90 [ 143.196324] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 143.196343] RIP: 0033:0x7fa6cebcee0f [ 143.196354] RSP: 002b:00007ffe6700fa80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 143.196374] RAX: ffffffffffffffda RBX: 0000000280000000 RCX: 00007fa6cebcee0f [ 143.196391] RDX: 00007ffe6700fb50 RSI: 0000000000001277 RDI: 0000000000000003 [ 143.196408] RBP: 0000000000000003 R08: 0000000000000071 R09: 0000000000000004 [ 143.196424] R10: 00007ffe67064170 R11: 0000000000000246 R12: 0000000040000000 [ 143.196441] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 143.196460] </TASK>
for f in $(grep -l crypt /proc/*/comm); do head $f ${f/comm/stack}; done
==> /proc/3761/comm <==
kworker/u12:7-kcryptd/252:0

==> /proc/3761/stack <==
[<0>] worker_thread+0xab/0x3b0
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/51/comm <==
cryptd

==> /proc/51/stack <==
[<0>] rescuer_thread+0x2d5/0x390
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/556/comm <==
kcryptd_io/252:

==> /proc/556/stack <==
[<0>] rescuer_thread+0x2d5/0x390
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/557/comm <==
kcryptd/252:0

==> /proc/557/stack <==
[<0>] rescuer_thread+0x2d5/0x390
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/558/comm <==
dmcrypt_write/252:0

==> /proc/558/stack <==
[<0>] dmcrypt_write+0x6a/0x140
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
==> /proc/717/comm <==
kworker/u12:6-kcryptd/252:0

==> /proc/717/stack <==
[<0>] worker_thread+0xab/0x3b0
[<0>] kthread+0xef/0x120
[<0>] ret_from_fork+0x2c/0x50
[<0>] ret_from_fork_asm+0x1b/0x30
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
On Mon, Oct 30, 2023 at 06:50:35PM +0100, Mikulas Patocka wrote:
On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
Then retried with order=PAGE_ALLOC_COSTLY_ORDER and PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.
So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER isn't safe?
That seems to be another bug, see below.
Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try to provoke a similar crash. Let's see if it crashes on one of the VM_BUG_ON statements.
This was very interesting idea. With this, immediately after login I get the crash like below. Which makes sense, as this is when pulseaudio starts and opens /dev/snd/*. I then tried with the dm-crypt commit reverted and still got the crash! But, after blacklisting snd_pcm, there is no BUG splat, but the storage freeze still happens on vanilla 6.5.6.
OK, great. Thanks for testing.
<snip snd_pcm bug>
Plain 6.5.6 (so order = MAX_ORDER - 1, and PAGE_ALLOC_COSTLY_ORDER=3), in frozen state: [ 143.196106] task:blkdiscard state:D stack:13672 pid:4884 ppid:2025 flags:0x00000002 [ 143.196130] Call Trace: [ 143.196139] <TASK> [ 143.196147] __schedule+0x30e/0x8b0 [ 143.196162] schedule+0x59/0xb0 [ 143.196175] schedule_timeout+0x14c/0x160 [ 143.196193] io_schedule_timeout+0x4b/0x70 [ 143.196207] wait_for_completion_io+0x81/0x130 [ 143.196226] submit_bio_wait+0x5c/0x90 [ 143.196241] blkdev_issue_discard+0x94/0xe0 [ 143.196260] blkdev_common_ioctl+0x79e/0x9c0 [ 143.196279] blkdev_ioctl+0xc7/0x270 [ 143.196293] __x64_sys_ioctl+0x8f/0xd0 [ 143.196310] do_syscall_64+0x3c/0x90
So this shows that a bio was submitted and never ran to completion.
for f in $(grep -l crypt /proc/*/comm); do head $f ${f/comm/stack}; done
<snip some backtraces>
So this shows the dm-crypt layer isn't stuck anywhere. So the allocation path itself doesn't seem to be locking up, looping, or anything like that.
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather, something either in the block layer or in the storage driver has problems handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug, I'm afraid. How long does it take for you to trigger the hang? I'm asking to get a rough estimate of how heavy tracing we can afford so that we don't overwhelm the system...
Honza
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in the worst case it takes me about 10 minutes of more or less `tar xz` + `dd`.
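(Roughly this kind of load - the archive name, mount point and sizes below are just placeholders for what I actually run:)

# bulk copy onto the ext4-on-thin volume while LVM operations run elsewhere
tar xzf some-big-archive.tar.gz -C /mnt/thin-ext4 &
dd if=/dev/zero of=/mnt/thin-ext4/fill bs=1M count=4096 conv=fsync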
On Tue, 31 Oct 2023, Marek Marczykowski-Górecki wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
Hi
I would like to ask you to try this patch. Revert the changes to "order" and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a clean upstream kernel.
Does it deadlock?
There is a bug in dm-crypt: it doesn't account for large pages in cc->n_allocated_pages. This patch fixes the bug.
Mikulas
--- drivers/md/dm-crypt.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)
Index: linux-stable/drivers/md/dm-crypt.c
===================================================================
--- linux-stable.orig/drivers/md/dm-crypt.c	2023-10-31 16:25:09.000000000 +0100
+++ linux-stable/drivers/md/dm-crypt.c	2023-10-31 16:53:14.000000000 +0100
@@ -1700,11 +1700,16 @@ retry:
 	order = min(order, remaining_order);
 	while (order > 0) {
+		if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) + (1 << order) > dm_crypt_pages_per_client))
+			goto decrease_order;
 		pages = alloc_pages(gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, order);
-		if (likely(pages != NULL))
+		if (likely(pages != NULL)) {
+			percpu_counter_add(&cc->n_allocated_pages, 1 << order);
 			goto have_pages;
+		}
+decrease_order:
 		order--;
 	}
@@ -1742,10 +1747,12 @@ static void crypt_free_buffer_pages(stru
 	if (clone->bi_vcnt > 0) { /* bio_for_each_folio_all crashes with an empty bio */
 		bio_for_each_folio_all(fi, clone) {
-			if (folio_test_large(fi.folio))
+			if (folio_test_large(fi.folio)) {
+				percpu_counter_sub(&cc->n_allocated_pages, 1 << folio_order(fi.folio));
 				folio_put(fi.folio);
-			else
+			} else {
 				mempool_free(&fi.folio->page, &cc->page_pool);
+			}
 		}
 	}
 }
On Tue, 31 Oct 2023, Mikulas Patocka wrote:
On Tue, 31 Oct 2023, Marek Marczykowski-Górecki wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
Hi
I would like to ask you to try this patch. Revert the changes to "order" and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a clean upstream kernel.
Does it deadlock?
There is a bug in dm-crypt that it doesn't account large pages in cc->n_allocated_pages, this patch fixes the bug.
Mikulas
If the previous patch didn't fix it, try this patch (on a clean upstream kernel).
This patch allocates large pages, but it breaks them up into single-page entries when adding them to the bio.
If this patch deadlocks, it is a sign that allocating large pages causes the problem.
If this patch doesn't deadlock, it is a sign that processing a bio with large pages is the problem.
Mikulas
--- drivers/md/dm-crypt.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-)
Index: linux-stable/drivers/md/dm-crypt.c
===================================================================
--- linux-stable.orig/drivers/md/dm-crypt.c	2023-10-31 17:16:03.000000000 +0100
+++ linux-stable/drivers/md/dm-crypt.c	2023-10-31 17:21:14.000000000 +0100
@@ -1700,11 +1700,16 @@ retry:
 	order = min(order, remaining_order);
 	while (order > 0) {
+		if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) + (1 << order) > dm_crypt_pages_per_client))
+			goto decrease_order;
 		pages = alloc_pages(gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, order);
-		if (likely(pages != NULL))
+		if (likely(pages != NULL)) {
+			percpu_counter_add(&cc->n_allocated_pages, 1 << order);
 			goto have_pages;
+		}
+decrease_order:
 		order--;
 	}
@@ -1719,8 +1724,13 @@ retry:
 have_pages:
 	size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
-	__bio_add_page(clone, pages, size_to_add, 0);
 	remaining_size -= size_to_add;
+	while (size_to_add) {
+		unsigned this_step = min(size_to_add, (unsigned)PAGE_SIZE);
+		__bio_add_page(clone, pages, this_step, 0);
+		size_to_add -= this_step;
+		pages++;
+	}
 }
 	/* Allocate space for integrity tags */
@@ -1739,13 +1749,21 @@ have_pages:
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
 	struct folio_iter fi;
+	unsigned skip = 0;
 	if (clone->bi_vcnt > 0) { /* bio_for_each_folio_all crashes with an empty bio */
 		bio_for_each_folio_all(fi, clone) {
-			if (folio_test_large(fi.folio))
+			if (skip) {
+				skip--;
+				continue;
+			}
+			if (folio_test_large(fi.folio)) {
+				skip = (1 << folio_order(fi.folio)) - 1;
+				percpu_counter_sub(&cc->n_allocated_pages, 1 << folio_order(fi.folio));
 				folio_put(fi.folio);
-			else
+			} else {
 				mempool_free(&fi.folio->page, &cc->page_pool);
+			}
 		}
 	}
 }
On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
On Tue, 31 Oct 2023, Mikulas Patocka wrote:
On Tue, 31 Oct 2023, Marek Marczykowski-Górecki wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
Hi
I would like to ask you to try this patch. Revert the changes to "order" and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a clean upstream kernel.
Does it deadlock?
There is a bug in dm-crypt that it doesn't account large pages in cc->n_allocated_pages, this patch fixes the bug.
This patch did not help.
If the previous patch didn't fix it, try this patch (on a clean upstream kernel).
This patch allocates large pages, but it breaks them up into single-page entries when adding them to the bio.
But this does help.
If this patch deadlocks, it is a sign that allocating large pages causes the problem.
If this patch doesn't deadlock, it is a sign that processing a bio with large pages is the problem.
Mikulas
drivers/md/dm-crypt.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-)
Index: linux-stable/drivers/md/dm-crypt.c
--- linux-stable.orig/drivers/md/dm-crypt.c	2023-10-31 17:16:03.000000000 +0100
+++ linux-stable/drivers/md/dm-crypt.c	2023-10-31 17:21:14.000000000 +0100
@@ -1700,11 +1700,16 @@ retry:
 	order = min(order, remaining_order);
 	while (order > 0) {
+		if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) + (1 << order) > dm_crypt_pages_per_client))
+			goto decrease_order;
 		pages = alloc_pages(gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP, order);
-		if (likely(pages != NULL))
+		if (likely(pages != NULL)) {
+			percpu_counter_add(&cc->n_allocated_pages, 1 << order);
 			goto have_pages;
+		}
+decrease_order:
 		order--;
 	}
@@ -1719,8 +1724,13 @@ retry:
 have_pages:
 	size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
-	__bio_add_page(clone, pages, size_to_add, 0);
 	remaining_size -= size_to_add;
+	while (size_to_add) {
+		unsigned this_step = min(size_to_add, (unsigned)PAGE_SIZE);
+		__bio_add_page(clone, pages, this_step, 0);
+		size_to_add -= this_step;
+		pages++;
+	}
 }
 	/* Allocate space for integrity tags */
@@ -1739,13 +1749,21 @@ have_pages:
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
 	struct folio_iter fi;
+	unsigned skip = 0;
 	if (clone->bi_vcnt > 0) { /* bio_for_each_folio_all crashes with an empty bio */
 		bio_for_each_folio_all(fi, clone) {
-			if (folio_test_large(fi.folio))
+			if (skip) {
+				skip--;
+				continue;
+			}
+			if (folio_test_large(fi.folio)) {
+				skip = (1 << folio_order(fi.folio)) - 1;
+				percpu_counter_sub(&cc->n_allocated_pages, 1 << folio_order(fi.folio));
 				folio_put(fi.folio);
-			else
+			} else {
 				mempool_free(&fi.folio->page, &cc->page_pool);
+			}
 		}
 	}
 }
On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
Hi
I would like to ask you to try this patch. Revert the changes to "order" and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a clean upstream kernel.
Does it deadlock?
There is a bug in dm-crypt that it doesn't account large pages in cc->n_allocated_pages, this patch fixes the bug.
This patch did not help.
If the previous patch didn't fix it, try this patch (on a clean upstream kernel).
This patch allocates large pages, but it breaks them up into single-page entries when adding them to the bio.
But this does help.
Thanks. So we can stop blaming the memory allocator and start blaming the NVMe subsystem.
I added NVMe maintainers to this thread - the summary of the problem is: in dm-crypt, we allocate a large compound page and add this compound page to the bio as a single big vector entry. Marek reports that on his system this causes deadlocks; the deadlocks look like a lost bio that was never completed. When I chop the large compound page into individual pages in dm-crypt and add a bio vector for each of them, Marek reports that there are no longer any deadlocks. So, we have a problem (either hardware or software) where the NVMe subsystem doesn't like bio vectors with large bv_len. This is the original bug report: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
Marek, what NVMe devices do you use? Do you use the same device on all 3 machines where you hit this bug?
In the directory /sys/block/nvme0n1/queue: what is the value of dma_alignment, max_hw_sectors_kb, max_sectors_kb, max_segment_size, max_segments, virt_boundary_mask?
Try lowering /sys/block/nvme0n1/queue/max_sectors_kb to some small value (for example 64) and test if it helps.
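(I.e. something like this, with nvme0n1 being the device in question:)

grep . /sys/block/nvme0n1/queue/{dma_alignment,max_hw_sectors_kb,max_sectors_kb,max_segment_size,max_segments,virt_boundary_mask}
echo 64 > /sys/block/nvme0n1/queue/max_sectors_kb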
Mikulas
On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
Hi
I would like to ask you to try this patch. Revert the changes to "order" and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a clean upstream kernel.
Does it deadlock?
There is a bug in dm-crypt that it doesn't account large pages in cc->n_allocated_pages, this patch fixes the bug.
This patch did not help.
If the previous patch didn't fix it, try this patch (on a clean upstream kernel).
This patch allocates large pages, but it breaks them up into single-page entries when adding them to the bio.
But this does help.
Thanks. So we can stop blaming the memory allocator and start blaming the NVMe subsystem.
I added NVMe maintainers to this thread - the summary of the problem is: In dm-crypt, we allocate a large compound page and add this compound page to the bio as a single big vector entry. Marek reports that on his system it causes deadlocks, the deadlocks look like a lost bio that was never completed. When I chop the large compound page to individual pages in dm-crypt and add bio vector for each of them, Marek reports that there are no longer any deadlocks. So, we have a problem (either hardware or software) that the NVMe subsystem doesn't like bio vectors with large bv_len. This is the original bug report: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
Marek, what NVMe devices do you use? Do you use the same device on all 3 machines where you hit this bug?
This one is "Star Drive PCIe SSD", another one is "Samsung SSD 970 EVO Plus 1TB", I can't check the third one right now.
In the directory /sys/block/nvme0n1/queue: what is the value of dma_alignment, max_hw_sectors_kb, max_sectors_kb, max_segment_size, max_segments, virt_boundary_mask?
/sys/block/nvme0n1/queue/dma_alignment:3
/sys/block/nvme0n1/queue/max_hw_sectors_kb:2048
/sys/block/nvme0n1/queue/max_sectors_kb:1280
/sys/block/nvme0n1/queue/max_segment_size:4294967295
/sys/block/nvme0n1/queue/max_segments:128
/sys/block/nvme0n1/queue/virt_boundary_mask:4095
Try lowring /sys/block/nvme0n1/queue/max_sectors_kb to some small value (for example 64) and test if it helps.
Yes, this helps too.
On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
Try lowring /sys/block/nvme0n1/queue/max_sectors_kb to some small value (for example 64) and test if it helps.
Yes, this helps too.
On a plain upstream kernel with no other modifications (and with the default max_sectors_kb), set /sys/module/nvme/parameters/sgl_threshold to "0" and test whether it deadlocks. Then, set this value to "1" and test again.
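(I.e., assuming the nvme module is already loaded, something like:)

echo 0 > /sys/module/nvme/parameters/sgl_threshold
# reproduce, then:
echo 1 > /sys/module/nvme/parameters/sgl_threshold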
Revert sgl_threshold back to the default (32768). Boot the kernel with the option "iommu=panic". Reproduce the deadlock and if you get a kernel panic, send us the panic log.
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
Mikulas
--- drivers/nvme/host/core.c | 8 ++++++-- drivers/nvme/host/pci.c | 27 ++++++++++++++++++++++----- 2 files changed, 28 insertions(+), 7 deletions(-)
Index: linux-stable/drivers/nvme/host/pci.c
===================================================================
--- linux-stable.orig/drivers/nvme/host/pci.c	2023-10-31 15:35:38.000000000 +0100
+++ linux-stable/drivers/nvme/host/pci.c	2023-11-02 17:38:20.000000000 +0100
@@ -622,6 +622,10 @@ static blk_status_t nvme_pci_setup_prps(
 	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 	if (!prp_list) {
 		iod->nr_allocations = -1;
+		if (nprps <= (256 / 8))
+			printk("allocation failure at %d\n", __LINE__);
+		else
+			printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
 	}
 	iod->list[0].prp_list = prp_list;
@@ -631,8 +635,10 @@ static blk_status_t nvme_pci_setup_prps(
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			__le64 *old_prp_list = prp_list;
 			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
-			if (!prp_list)
+			if (!prp_list) {
+				printk("allocation failure at %d\n", __LINE__);
 				goto free_prps;
+			}
 			iod->list[iod->nr_allocations++].prp_list = prp_list;
 			prp_list[0] = old_prp_list[i - 1];
 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
@@ -712,6 +718,7 @@ static blk_status_t nvme_pci_setup_sgls(
 	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list) {
 		iod->nr_allocations = -1;
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
 	}
@@ -736,8 +743,10 @@ static blk_status_t nvme_setup_prp_simpl
 		unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
+	if (dma_mapping_error(dev->dev, iod->first_dma)) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	iod->dma_len = bv->bv_len;
 	cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
@@ -755,8 +764,10 @@ static blk_status_t nvme_setup_sgl_simpl
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
+	if (dma_mapping_error(dev->dev, iod->first_dma)) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	iod->dma_len = bv->bv_len;
 	cmnd->flags = NVME_CMD_SGL_METABUF;
@@ -791,8 +802,10 @@ static blk_status_t nvme_map_data(struct
 	iod->dma_len = 0;
 	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
-	if (!iod->sgt.sgl)
+	if (!iod->sgt.sgl) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
 	iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
 	if (!iod->sgt.orig_nents)
@@ -801,8 +814,12 @@ static blk_status_t nvme_map_data(struct
 	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), DMA_ATTR_NO_WARN);
 	if (rc) {
-		if (rc == -EREMOTEIO)
+		if (rc == -EREMOTEIO) {
+			printk("allocation failure at %d\n", __LINE__);
 			ret = BLK_STS_TARGET;
+		} else {
+			printk("allocation failure at %d\n", __LINE__);
+		}
 		goto out_free_sg;
 	}
Index: linux-stable/drivers/nvme/host/core.c
===================================================================
--- linux-stable.orig/drivers/nvme/host/core.c	2023-10-31 15:35:38.000000000 +0100
+++ linux-stable/drivers/nvme/host/core.c	2023-11-02 17:12:39.000000000 +0100
@@ -708,8 +708,10 @@ blk_status_t nvme_fail_nonready_command(
 	    ctrl->state != NVME_CTRL_DELETING &&
 	    ctrl->state != NVME_CTRL_DEAD &&
 	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
-	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
+	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH)) {
+		printk("allocation failure at %d\n", __LINE__);
 		return BLK_STS_RESOURCE;
+	}
 	return nvme_host_path_error(rq);
 }
 EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
@@ -784,8 +786,10 @@ static blk_status_t nvme_setup_discard(s
 		 * discard page. If that's also busy, it's safe to return
 		 * busy, as we know we can make progress once that's freed.
 		 */
-		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy))
+		if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy)) {
+			printk("allocation failure at %d\n", __LINE__);
 			return BLK_STS_RESOURCE;
+		}
 		range = page_address(ns->ctrl->discard_page);
 	}
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
Try lowring /sys/block/nvme0n1/queue/max_sectors_kb to some small value (for example 64) and test if it helps.
Yes, this helps too.
On a plain upstream kernel with no other modifications (and with default max_sectors_kb), set the value /sys/module/nvme/parameters/sgl_threshold to "0" and test it if it deadlocks. Then, set this value to "1" and test it again.
Got a deadlock with both values.
Revert sgl_threshold back to the default (32768). Boot the kernel with the option "iommu=panic". Reproduce the deadlock and if you get a kernel panic, send us the panic log.
This is a Xen PV domain, so Linux is not in charge of the IOMMU here. And there is SWIOTLB involved (64 MB of it); I'm not sure if for every DMA, but definitely for some.
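(The swiotlb size shows up in the boot log - the exact wording varies by kernel version, but roughly:)

dmesg | grep -i "software io tlb"
# e.g. "software IO TLB: mapped [mem 0x...-0x...] (64MB)"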
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
I'll try this next.
On Fri, Nov 03, 2023 at 04:01:19PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 10:28:57AM +0100, Mikulas Patocka wrote:
Try lowring /sys/block/nvme0n1/queue/max_sectors_kb to some small value (for example 64) and test if it helps.
Yes, this helps too.
On a plain upstream kernel with no other modifications (and with default max_sectors_kb), set the value /sys/module/nvme/parameters/sgl_threshold to "0" and test it if it deadlocks. Then, set this value to "1" and test it again.
Got deadlock wit both values.
Revert sgl_threshold back to the default (32768). Boot the kernel with the option "iommu=panic". Reproduce the deadlock and if you get a kernel panic, send us the panic log.
This is a Xen PV, so Linux is not in charge of IOMMU here. And there is SWIOTLB involved (64MB of it), I'm not sure if for every DMA, but definitely for some.
So it's using xen_swiotlb_dma_ops, right? That doesn't implement .opt_mapping_size, and I'm guessing it should be equal to swiotlb_max_mapping_size().
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
I'll try this next.
Placing my bet now: you'll see a DMA mapping error.
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
The "821" one - see below.
Mikulas
drivers/nvme/host/core.c | 8 ++++++-- drivers/nvme/host/pci.c | 27 ++++++++++++++++++++++----- 2 files changed, 28 insertions(+), 7 deletions(-)
Index: linux-stable/drivers/nvme/host/pci.c
--- linux-stable.orig/drivers/nvme/host/pci.c 2023-10-31 15:35:38.000000000 +0100 +++ linux-stable/drivers/nvme/host/pci.c 2023-11-02 17:38:20.000000000 +0100 @@ -622,6 +622,10 @@ static blk_status_t nvme_pci_setup_prps( prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma); if (!prp_list) { iod->nr_allocations = -1;
if (nprps <= (256 / 8))
printk("allocation failure at %d\n", __LINE__);
else
return BLK_STS_RESOURCE; } iod->list[0].prp_list = prp_list;printk("allocation failure at %d\n", __LINE__);
@@ -631,8 +635,10 @@ static blk_status_t nvme_pci_setup_prps( if (i == NVME_CTRL_PAGE_SIZE >> 3) { __le64 *old_prp_list = prp_list; prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
if (!prp_list)
if (!prp_list) {
printk("allocation failure at %d\n", __LINE__); goto free_prps;
} iod->list[iod->nr_allocations++].prp_list = prp_list; prp_list[0] = old_prp_list[i - 1]; old_prp_list[i - 1] = cpu_to_le64(prp_dma);
@@ -712,6 +718,7 @@ static blk_status_t nvme_pci_setup_sgls( sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma); if (!sg_list) { iod->nr_allocations = -1;
return BLK_STS_RESOURCE; }printk("allocation failure at %d\n", __LINE__);
@@ -736,8 +743,10 @@ static blk_status_t nvme_setup_prp_simpl unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset; iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->first_dma))
- if (dma_mapping_error(dev->dev, iod->first_dma)) {
return BLK_STS_RESOURCE;printk("allocation failure at %d\n", __LINE__);
- } iod->dma_len = bv->bv_len;
cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma); @@ -755,8 +764,10 @@ static blk_status_t nvme_setup_sgl_simpl struct nvme_iod *iod = blk_mq_rq_to_pdu(req); iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->first_dma))
- if (dma_mapping_error(dev->dev, iod->first_dma)) {
return BLK_STS_RESOURCE;printk("allocation failure at %d\n", __LINE__);
- } iod->dma_len = bv->bv_len;
cmnd->flags = NVME_CMD_SGL_METABUF; @@ -791,8 +802,10 @@ static blk_status_t nvme_map_data(struct iod->dma_len = 0; iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sgt.sgl)
- if (!iod->sgt.sgl) {
return BLK_STS_RESOURCE;printk("allocation failure at %d\n", __LINE__);
- } sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req)); iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl); if (!iod->sgt.orig_nents)
@@ -801,8 +814,12 @@ static blk_status_t nvme_map_data(struct rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), DMA_ATTR_NO_WARN); if (rc) {
if (rc == -EREMOTEIO)
if (rc == -EREMOTEIO) {
printk("allocation failure at %d\n", __LINE__); ret = BLK_STS_TARGET;
} else {
printk("allocation failure at %d\n", __LINE__);
I get a lot of this one (this is the "821" printk - the else branch after dma_map_sgtable() fails).
goto out_free_sg; }}
Index: linux-stable/drivers/nvme/host/core.c =================================================================== --- linux-stable.orig/drivers/nvme/host/core.c 2023-10-31 15:35:38.000000000 +0100 +++ linux-stable/drivers/nvme/host/core.c 2023-11-02 17:12:39.000000000 +0100 @@ -708,8 +708,10 @@ blk_status_t nvme_fail_nonready_command( ctrl->state != NVME_CTRL_DELETING && ctrl->state != NVME_CTRL_DEAD && !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
!blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
!blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH)) {
return BLK_STS_RESOURCE;printk("allocation failure at %d\n", __LINE__);
- } return nvme_host_path_error(rq);
} EXPORT_SYMBOL_GPL(nvme_fail_nonready_command); @@ -784,8 +786,10 @@ static blk_status_t nvme_setup_discard(s * discard page. If that's also busy, it's safe to return * busy, as we know we can make progress once that's freed. */
if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy))
if (test_and_set_bit_lock(0, &ns->ctrl->discard_page_busy)) {
printk("allocation failure at %d\n", __LINE__); return BLK_STS_RESOURCE;
}
range = page_address(ns->ctrl->discard_page); }
On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
The "821" one - see below.
Thanks for confirming!
Could you try this patch?
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 60a08dfe8d75f..348fd6c6702a5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -41,7 +41,7 @@
  * These can be higher, but we need to ensure that any command doesn't
  * require an sg allocation that needs more than a page of data.
  */
-#define NVME_MAX_KB_SZ	8192
+#define NVME_MAX_KB_SZ	8192u
 #define NVME_MAX_SEGS	128
 #define NVME_MAX_NR_ALLOCATIONS	5
@@ -2957,8 +2957,9 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 * Limit the max command size to prevent iod->sg allocations going
 	 * over a single page.
 	 */
-	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
+	dev->ctrl.max_hw_sectors = min3(NVME_MAX_KB_SZ << 1,
+		dma_opt_mapping_size(&pdev->dev) >> 9,
+		dma_max_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
 	/*
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 946bd56f0ac53..0e6c6c25d154f 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };
--
On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
The "821" one - see below.
Thanks for confirming!
Could you try this patch?
Besides min3() being unhappy about types, it works.
On Fri, Nov 03, 2023 at 09:30:07PM +0100, Marek Marczykowski-Górecki wrote:
On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
The "821" one - see below.
Thanks for confirming!
Could you try this patch?
Besides min3() being unhappy about types, it works.
Oops, should have changed the constant to "ul" instead of just "u".
Anyway, the overall idea makes sense to me, but I don't know the swiotlb stuff well.
Christoph, does that patch make sense? For reference:
https://lore.kernel.org/linux-mm/ZUOr-vp0yRkLyvyi@kbusch-mbp.dhcp.thefaceboo...
On Fri, 3 Nov 2023, Keith Busch wrote:
On Fri, Nov 03, 2023 at 09:30:07PM +0100, Marek Marczykowski-Górecki wrote:
On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
The "821" one - see below.
Thanks for confirming!
Could you try this patch?
Besides min3() being unhappy about types, it works.
Oops, should have changed the constant to "ul" instead of just "u".
Anyway, the overall idea makes sense to me, but I don't know the swiotlb stuff well.
Christoph, does that patch make sense? For reference:
https://lore.kernel.org/linux-mm/ZUOr-vp0yRkLyvyi@kbusch-mbp.dhcp.thefaceboo...
dma_opt_mapping_size returns "min(dma_max_mapping_size(dev), size)". So you don't have to call dma_max_mapping_size explicitly, you can leave the file drivers/nvme/host/pci.c as it is.
What about the other block device drivers (AHCI, SCSI...)? Should there be some generic framework that restricts max_hw_sectors according to dma_opt_mapping_size?
Mikulas
On Sat, Nov 04, 2023 at 10:27:32AM +0100, Mikulas Patocka wrote:
On Fri, 3 Nov 2023, Keith Busch wrote:
On Fri, Nov 03, 2023 at 09:30:07PM +0100, Marek Marczykowski-Górecki wrote:
On Fri, Nov 03, 2023 at 10:54:00AM -0600, Keith Busch wrote:
On Fri, Nov 03, 2023 at 05:15:49PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Nov 02, 2023 at 06:06:33PM +0100, Mikulas Patocka wrote:
Then, try this patch (without "iommu=panic"), reproduce the deadlock and tell us which one of the "printk" statements is triggered during the deadlock.
The "821" one - see below.
Thanks for confirming!
Could you try this patch?
Besides min3() being unhappy about types, it works.
Oops, should have changed the constant to "ul" instead of just "u".
Anyway, the overall idea makes sense to me, but I don't know the swiotlb stuff well.
Christoph, does that patch make sense? For reference:
https://lore.kernel.org/linux-mm/ZUOr-vp0yRkLyvyi@kbusch-mbp.dhcp.thefaceboo...
dma_opt_mapping_size returns "min(dma_max_mapping_size(dev), size)". So you don't have to call dma_max_mapping_size explicitly, you can leave the file drivers/nvme/host/pci.c as it is.
Indeed.
What about the other block device drivers (AHCI, SCSI...)? Should there be some generic framework that restricts max_hw_sectors according to dma_opt_mapping_size?
I think it's just like any other request_queue limits and the individual drivers have to set up these.
Thinking on this again, this feels more like a max_segment_size limit rather than a max_hw_sectors.
On Sat, Nov 04, 2023 at 07:59:06AM -0600, Keith Busch wrote:
dma_opt_mapping_size returns "min(dma_max_mapping_size(dev), size)". So you don't have to call dma_max_mapping_size explicitly, you can leave the file drivers/nvme/host/pci.c as it is.
Indeed.
What about the other block device drivers (AHCI, SCSI...)? Should there be some generic framework that restricts max_hw_sectors according to dma_opt_mapping_size?
I think it's just like any other request_queue limits and the individual drivers have to set up these.
Yes, or in case of SCSI the scsi midlayer, which already does it. The block layer itself doesn't even know that we're using DMA to transfer data.
Thinking on this again, this feels more like a max_segment_size limit rather than a max_hw_sectors.
No, it's an overall limit. If we had a lot of segments using the max size, we'd still starve the swiotlb pool.
There's a bug where, when using the XEN hypervisor with dm-crypt on NVMe, the kernel deadlocks [1].
The deadlocks are caused by the inability to map a large bio vector - dma_map_sgtable always returns an error; this gets propagated to the block layer as BLK_STS_RESOURCE and the block layer retries the request indefinitely.
XEN uses the swiotlb framework to map discontiguous pages into contiguous runs that are submitted to the PCIe device. The swiotlb framework has a limitation on the length of a mapping - this needs to be announced with the max_mapping_size method to make sure that the hardware drivers do not create larger mappings.
Without max_mapping_size, the NVMe block driver would create large mappings that overrun the maximum mapping size.
[1] https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
Signed-off-by: Mikulas Patocka mpatocka@redhat.com
Reported-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Tested-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Suggested-by: Keith Busch kbusch@kernel.org
Suggested-by: Christoph Hellwig hch@lst.de
Cc: stable@vger.kernel.org
--- drivers/xen/swiotlb-xen.c | 1 + 1 file changed, 1 insertion(+)
Index: linux-stable/drivers/xen/swiotlb-xen.c
===================================================================
--- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
+++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };
On Mon, Nov 06, 2023 at 03:59:40PM +0100, Mikulas Patocka wrote:
There's a bug that when using the XEN hypervisor with dm-crypt on NVMe, the kernel deadlocks [1].
The deadlocks are caused by inability to map a large bio vector - dma_map_sgtable always returns an error, this gets propagated to the block layer as BLK_STS_RESOURCE and the block layer retries the request indefinitely.
XEN uses the swiotlb framework to map discontiguous pages into contiguous runs that are submitted to the PCIe device. The swiotlb framework has a limitation on the length of a mapping - this needs to be announced with the max_mapping_size method to make sure that the hardware drivers do not create larger mappings.
Without max_mapping_size, the NVMe block driver would create large mappings that overrun the maximum mapping size.
[1] https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
This should be a "Link:" tag.
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Reported-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com Tested-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com Suggested-by: Keith Busch kbusch@kernel.org
I was about to send the same thing. I did a little more than suggest this: it's the very patch I wrote for testing, minus the redundant nvme bits! But since you already have a commit message for it...
Acked-by: Keith Busch kbusch@kernel.org
Suggested-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org
drivers/xen/swiotlb-xen.c | 1 + 1 file changed, 1 insertion(+)
Index: linux-stable/drivers/xen/swiotlb-xen.c
--- linux-stable.orig/drivers/xen/swiotlb-xen.c 2023-11-03 17:57:18.000000000 +0100 +++ linux-stable/drivers/xen/swiotlb-xen.c 2023-11-06 15:30:59.000000000 +0100 @@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma .get_sgtable = dma_common_get_sgtable, .alloc_pages = dma_common_alloc_pages, .free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
};
On Mon, Nov 06 2023 at 10:16P -0500, Keith Busch kbusch@kernel.org wrote:
On Mon, Nov 06, 2023 at 03:59:40PM +0100, Mikulas Patocka wrote:
There's a bug that when using the XEN hypervisor with dm-crypt on NVMe, the kernel deadlocks [1].
The deadlocks are caused by inability to map a large bio vector - dma_map_sgtable always returns an error, this gets propagated to the block layer as BLK_STS_RESOURCE and the block layer retries the request indefinitely.
XEN uses the swiotlb framework to map discontiguous pages into contiguous runs that are submitted to the PCIe device. The swiotlb framework has a limitation on the length of a mapping - this needs to be announced with the max_mapping_size method to make sure that the hardware drivers do not create larger mappings.
Without max_mapping_size, the NVMe block driver would create large mappings that overrun the maximum mapping size.
[1] https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
This should be a "Link:" tag.
Signed-off-by: Mikulas Patocka mpatocka@redhat.com Reported-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com Tested-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com Suggested-by: Keith Busch kbusch@kernel.org
I was about to send the same thing. I did a little more than suggest this: it's is the very patch I wrote for testing, minus the redundant nvme bits! But since you already have a commit message for it...
Acked-by: Keith Busch kbusch@kernel.org
No, this patch should be attributed to you Keith.
Mikulas, I like that you ran with getting a fix prepared but please update the patch so Keith is the author and use Link: as suggested for the v2. Note: you'll still use your Signed-off-by since you had a role in getting this patch together (but please move yours to the end of the header).
Mike
Suggested-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org
drivers/xen/swiotlb-xen.c | 1 + 1 file changed, 1 insertion(+)
Index: linux-stable/drivers/xen/swiotlb-xen.c
--- linux-stable.orig/drivers/xen/swiotlb-xen.c 2023-11-03 17:57:18.000000000 +0100 +++ linux-stable/drivers/xen/swiotlb-xen.c 2023-11-06 15:30:59.000000000 +0100 @@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma .get_sgtable = dma_common_get_sgtable, .alloc_pages = dma_common_alloc_pages, .free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
};
From: Keith Busch kbusch@kernel.org
There's a bug where, when using the XEN hypervisor with bios that have large multi-page bio vectors on NVMe, the kernel deadlocks [1].
The deadlocks are caused by the inability to map a large bio vector - dma_map_sgtable always returns an error; this gets propagated to the block layer as BLK_STS_RESOURCE and the block layer retries the request indefinitely.
XEN uses the swiotlb framework to map discontiguous pages into contiguous runs that are submitted to the PCIe device. The swiotlb framework has a limitation on the length of a mapping - this needs to be announced with the max_mapping_size method to make sure that the hardware drivers do not create larger mappings.
Without max_mapping_size, the NVMe block driver would create large mappings that overrun the maximum mapping size.
Reported-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Link: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/ [1]
Tested-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Suggested-by: Christoph Hellwig hch@lst.de
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch kbusch@kernel.org
Signed-off-by: Mikulas Patocka mpatocka@redhat.com
---
 drivers/xen/swiotlb-xen.c | 1 +
 1 file changed, 1 insertion(+)

Index: linux-stable/drivers/xen/swiotlb-xen.c
===================================================================
--- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
+++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };
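For context, a minimal illustrative sketch of how this one-liner takes effect end-to-end. The helper below is made up; dma_max_mapping_size() and SECTOR_SHIFT are the real kernel interfaces, and the point is only that once xen_swiotlb_dma_ops reports .max_mapping_size, a block driver capping its transfers this way can no longer build requests that swiotlb cannot map:

#include <linux/dma-mapping.h>
#include <linux/blkdev.h>

/* Hypothetical helper, not actual driver code. */
static u32 example_cap_hw_sectors(struct device *dma_dev, u32 max_hw_sectors)
{
	/*
	 * With the patch applied, dma_max_mapping_size() returns
	 * swiotlb_max_mapping_size() under Xen instead of SIZE_MAX.
	 */
	size_t limit = dma_max_mapping_size(dma_dev);

	if (limit < SIZE_MAX)
		max_hw_sectors = min_t(u32, max_hw_sectors,
				       limit >> SECTOR_SHIFT);
	return max_hw_sectors;
}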
On Mon, 6 Nov 2023, Mikulas Patocka wrote:
From: Keith Busch kbusch@kernel.org
There's a bug that, when using the XEN hypervisor with bios containing large multi-page bio vectors on NVMe, the kernel deadlocks [1].
The deadlocks are caused by inability to map a large bio vector - dma_map_sgtable always returns an error, this gets propagated to the block layer as BLK_STS_RESOURCE and the block layer retries the request indefinitely.
XEN uses the swiotlb framework to map discontiguous pages into contiguous runs that are submitted to the PCIe device. The swiotlb framework has a limitation on the length of a mapping - this needs to be announced with the max_mapping_size method to make sure that the hardware drivers do not create larger mappings.
Without max_mapping_size, the NVMe block driver would create large mappings that overrun the maximum mapping size.
Reported-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Link: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/ [1]
Tested-by: Marek Marczykowski-Górecki marmarek@invisiblethingslab.com
Suggested-by: Christoph Hellwig hch@lst.de
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch kbusch@kernel.org
Signed-off-by: Mikulas Patocka mpatocka@redhat.com
Acked-by: Stefano Stabellini sstabellini@kernel.org
 drivers/xen/swiotlb-xen.c | 1 +
 1 file changed, 1 insertion(+)
Index: linux-stable/drivers/xen/swiotlb-xen.c
--- linux-stable.orig/drivers/xen/swiotlb-xen.c	2023-11-03 17:57:18.000000000 +0100
+++ linux-stable/drivers/xen/swiotlb-xen.c	2023-11-06 15:30:59.000000000 +0100
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };
Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
-	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
+	dev->ctrl.max_hw_sectors = min3(NVME_MAX_KB_SZ << 1,
+			dma_opt_mapping_size(&pdev->dev) >> 9,
+			dma_max_mapping_size(&pdev->dev) >> 9);
dma_opt_mapping_size is already capped by dma_max_mapping_size, so no need for this hunk.
 	dev->ctrl.max_segments = NVME_MAX_SEGS;

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 946bd56f0ac53..0e6c6c25d154f 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -405,4 +405,5 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
 	.get_sgtable = dma_common_get_sgtable,
 	.alloc_pages = dma_common_alloc_pages,
 	.free_pages = dma_common_free_pages,
+	.max_mapping_size = swiotlb_max_mapping_size,
 };
And this is the right thing to do. I'm pretty sure I wrote this myself a while ago, but I must not have sent it out in the end.
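For reference, a trimmed rendering of the relationship being pointed at here, close to what kernel/dma/mapping.c does in the kernels discussed in this thread (simplified; consult the tree for the authoritative version):

size_t dma_opt_mapping_size(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);
	size_t size = SIZE_MAX;

	if (ops && ops->opt_mapping_size)
		size = ops->opt_mapping_size();

	/* The "optimal" size is never allowed to exceed the hard limit,
	 * so a driver using dma_opt_mapping_size() inherits the
	 * swiotlb-xen cap once .max_mapping_size is provided. */
	return min(dma_max_mapping_size(dev), size);
}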
On Thu 02-11-23 10:28:57, Mikulas Patocka wrote:
On Thu, 2 Nov 2023, Marek Marczykowski-Górecki wrote:
On Tue, Oct 31, 2023 at 06:24:19PM +0100, Mikulas Patocka wrote:
Hi
I would like to ask you to try this patch. Revert the changes to "order" and "PAGE_ALLOC_COSTLY_ORDER" back to normal and apply this patch on a clean upstream kernel.
Does it deadlock?
There is a bug in dm-crypt that it doesn't account large pages in cc->n_allocated_pages, this patch fixes the bug.
This patch did not help.
If the previous patch didn't fix it, try this patch (on a clean upstream kernel).
This patch allocates large pages, but it breaks them up into single-page entries when adding them to the bio.
But this does help.
Thanks. So we can stop blaming the memory allocator and start blaming the NVMe subsystem.
;-)
I added NVMe maintainers to this thread - the summary of the problem is: In dm-crypt, we allocate a large compound page and add this compound page to the bio as a single big vector entry. Marek reports that on his system it causes deadlocks, the deadlocks look like a lost bio that was never completed. When I chop the large compound page to individual pages in dm-crypt and add bio vector for each of them, Marek reports that there are no longer any deadlocks. So, we have a problem (either hardware or software) that the NVMe subsystem doesn't like bio vectors with large bv_len. This is the original bug report: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/
Actually, Ming Lei has already identified [1] that we are apparently stuck in an endless retry loop in nvme_queue_rq(), with each attempt ending in BLK_STS_RESOURCE. Honza
[1] https://lore.kernel.org/all/ZUHE52SznRaZQxnG@fedora
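To make the summary above concrete, a hedged sketch of the two ways of attaching the same buffer that are being compared in this thread. The function names are made up; only bio_add_page() is the real API, and return values are ignored to keep the sketch short:

#include <linux/bio.h>

/* Variant 1: the whole compound page as a single bio_vec with a large
 * bv_len - the case that ends up stuck in the NVMe layer. */
static void sketch_add_as_one_bvec(struct bio *bio, struct page *page,
				   unsigned int len)
{
	bio_add_page(bio, page, len, 0);
}

/* Variant 2: the same memory chopped into PAGE_SIZE bio_vecs - the
 * experiment that made the deadlock disappear. */
static void sketch_add_page_by_page(struct bio *bio, struct page *page,
				    unsigned int len)
{
	unsigned int off;

	for (off = 0; off < len; off += PAGE_SIZE)
		bio_add_page(bio, page + (off >> PAGE_SHIFT),
			     min_t(unsigned int, PAGE_SIZE, len - off), 0);
}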
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
Thanks, Ming Lei
On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
dm-crypt lives on the nvme disk, this is what I collected when it hanged:
nvme0n1/hctx5/type:default nvme0n1/hctx5/dispatch_busy:0 nvme0n1/hctx5/active:0 nvme0n1/hctx5/run:273 nvme0n1/hctx5/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx5/tags:nr_tags=1023 nvme0n1/hctx5/tags:nr_reserved_tags=0 nvme0n1/hctx5/tags:active_queues=0 nvme0n1/hctx5/tags:bitmap_tags: nvme0n1/hctx5/tags:depth=1023 nvme0n1/hctx5/tags:busy=0 nvme0n1/hctx5/tags:cleared=7 nvme0n1/hctx5/tags:bits_per_word=64 nvme0n1/hctx5/tags:map_nr=16 nvme0n1/hctx5/tags:alloc_hint={633, 450, 354, 913, 651, 645} nvme0n1/hctx5/tags:wake_batch=8 nvme0n1/hctx5/tags:wake_index=0 nvme0n1/hctx5/tags:ws_active=0 nvme0n1/hctx5/tags:ws={ nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags: {.wait=inactive}, nvme0n1/hctx5/tags:} nvme0n1/hctx5/tags:round_robin=0 nvme0n1/hctx5/tags:min_shallow_depth=4294967295 nvme0n1/hctx5/ctx_map:00000000: 00 nvme0n1/hctx5/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1} nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1} nvme0n1/hctx4/type:default nvme0n1/hctx4/dispatch_busy:9 nvme0n1/hctx4/active:0 nvme0n1/hctx4/run:20290468 nvme0n1/hctx4/tags_bitmap:00000000: 0000 0000 0000 0000 4240 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx4/tags:nr_tags=1023 nvme0n1/hctx4/tags:nr_reserved_tags=0 nvme0n1/hctx4/tags:active_queues=0 nvme0n1/hctx4/tags:bitmap_tags: nvme0n1/hctx4/tags:depth=1023 nvme0n1/hctx4/tags:busy=3 nvme0n1/hctx4/tags:cleared=7 nvme0n1/hctx4/tags:bits_per_word=64 nvme0n1/hctx4/tags:map_nr=16 nvme0n1/hctx4/tags:alloc_hint={899, 846, 390, 472, 73, 439} nvme0n1/hctx4/tags:wake_batch=8 nvme0n1/hctx4/tags:wake_index=0 nvme0n1/hctx4/tags:ws_active=0 nvme0n1/hctx4/tags:ws={ nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags: {.wait=inactive}, nvme0n1/hctx4/tags:} nvme0n1/hctx4/tags:round_robin=0 nvme0n1/hctx4/tags:min_shallow_depth=4294967295 
nvme0n1/hctx4/ctx_map:00000000: 01 nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1} nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx4/state:SCHED_RESTART nvme0n1/hctx3/type:default nvme0n1/hctx3/dispatch_busy:0 nvme0n1/hctx3/active:0 nvme0n1/hctx3/run:296 nvme0n1/hctx3/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx3/tags:nr_tags=1023 nvme0n1/hctx3/tags:nr_reserved_tags=0 nvme0n1/hctx3/tags:active_queues=0 nvme0n1/hctx3/tags:bitmap_tags: nvme0n1/hctx3/tags:depth=1023 nvme0n1/hctx3/tags:busy=0 nvme0n1/hctx3/tags:cleared=23 nvme0n1/hctx3/tags:bits_per_word=64 nvme0n1/hctx3/tags:map_nr=16 nvme0n1/hctx3/tags:alloc_hint={862, 557, 480, 24, 841, 23} nvme0n1/hctx3/tags:wake_batch=8 nvme0n1/hctx3/tags:wake_index=0 nvme0n1/hctx3/tags:ws_active=0 nvme0n1/hctx3/tags:ws={ nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags: {.wait=inactive}, nvme0n1/hctx3/tags:} nvme0n1/hctx3/tags:round_robin=0 nvme0n1/hctx3/tags:min_shallow_depth=4294967295 nvme0n1/hctx3/ctx_map:00000000: 00 nvme0n1/hctx3/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx2/type:default nvme0n1/hctx2/dispatch_busy:0 nvme0n1/hctx2/active:0 nvme0n1/hctx2/run:279 nvme0n1/hctx2/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx2/tags:nr_tags=1023 nvme0n1/hctx2/tags:nr_reserved_tags=0 nvme0n1/hctx2/tags:active_queues=0 nvme0n1/hctx2/tags:bitmap_tags: nvme0n1/hctx2/tags:depth=1023 nvme0n1/hctx2/tags:busy=0 nvme0n1/hctx2/tags:cleared=16 nvme0n1/hctx2/tags:bits_per_word=64 nvme0n1/hctx2/tags:map_nr=16 nvme0n1/hctx2/tags:alloc_hint={960, 528, 145, 730, 447, 1002} nvme0n1/hctx2/tags:wake_batch=8 nvme0n1/hctx2/tags:wake_index=0 nvme0n1/hctx2/tags:ws_active=0 nvme0n1/hctx2/tags:ws={ nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags: {.wait=inactive}, nvme0n1/hctx2/tags:} nvme0n1/hctx2/tags:round_robin=0 nvme0n1/hctx2/tags:min_shallow_depth=4294967295 nvme0n1/hctx2/ctx_map:00000000: 00 
nvme0n1/hctx2/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx1/type:default nvme0n1/hctx1/dispatch_busy:0 nvme0n1/hctx1/active:0 nvme0n1/hctx1/run:458 nvme0n1/hctx1/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx1/tags:nr_tags=1023 nvme0n1/hctx1/tags:nr_reserved_tags=0 nvme0n1/hctx1/tags:active_queues=0 nvme0n1/hctx1/tags:bitmap_tags: nvme0n1/hctx1/tags:depth=1023 nvme0n1/hctx1/tags:busy=0 nvme0n1/hctx1/tags:cleared=31 nvme0n1/hctx1/tags:bits_per_word=64 nvme0n1/hctx1/tags:map_nr=16 nvme0n1/hctx1/tags:alloc_hint={689, 284, 498, 188, 808, 610} nvme0n1/hctx1/tags:wake_batch=8 nvme0n1/hctx1/tags:wake_index=0 nvme0n1/hctx1/tags:ws_active=0 nvme0n1/hctx1/tags:ws={ nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags: {.wait=inactive}, nvme0n1/hctx1/tags:} nvme0n1/hctx1/tags:round_robin=0 nvme0n1/hctx1/tags:min_shallow_depth=4294967295 nvme0n1/hctx1/ctx_map:00000000: 00 nvme0n1/hctx1/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx0/type:default nvme0n1/hctx0/dispatch_busy:0 nvme0n1/hctx0/active:0 nvme0n1/hctx0/run:375 nvme0n1/hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000 nvme0n1/hctx0/tags:nr_tags=1023 nvme0n1/hctx0/tags:nr_reserved_tags=0 nvme0n1/hctx0/tags:active_queues=0 nvme0n1/hctx0/tags:bitmap_tags: nvme0n1/hctx0/tags:depth=1023 nvme0n1/hctx0/tags:busy=0 nvme0n1/hctx0/tags:cleared=6 nvme0n1/hctx0/tags:bits_per_word=64 nvme0n1/hctx0/tags:map_nr=16 nvme0n1/hctx0/tags:alloc_hint={5, 18, 320, 448, 728, 9} nvme0n1/hctx0/tags:wake_batch=8 nvme0n1/hctx0/tags:wake_index=0 nvme0n1/hctx0/tags:ws_active=0 nvme0n1/hctx0/tags:ws={ nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags: {.wait=inactive}, nvme0n1/hctx0/tags:} nvme0n1/hctx0/tags:round_robin=0 nvme0n1/hctx0/tags:min_shallow_depth=4294967295 nvme0n1/hctx0/ctx_map:00000000: 00 nvme0n1/hctx0/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/state:SAME_COMP|NONROT|IO_STAT|INIT_DONE|WC|FUA|REGISTERED|NOWAIT nvme0n1/pm_only:0
This is without any of the patches Mikulas proposed.
On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
dm-crypt lives on the nvme disk, this is what I collected when it hanged:
...
This is without any of the patches Mikulas proposed.
And BTW, an attempt to access another partition on the same disk (not covered with dm-crypt) hangs too.
On Wed, Nov 01, 2023 at 03:15:50AM +0100, Marek Marczykowski-Górecki wrote:
And BTW, an attempt to access another partition on the same disk (not covered with dm-crypt) hangs too.
No, not sure about that part, it was rather reading /bin/ls hanging...
On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
dm-crypt lives on the nvme disk, this is what I collected when it hanged:
...
nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1} nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
Two requests stay in the sw queue, but they are not related to this issue.
nvme0n1/hctx4/type:default nvme0n1/hctx4/dispatch_busy:9
A non-zero dispatch_busy means that nvme_queue_rq() has recently been returning BLK_STS_RESOURCE most of the time.
nvme0n1/hctx4/active:0 nvme0n1/hctx4/run:20290468
...
nvme0n1/hctx4/tags:nr_tags=1023 nvme0n1/hctx4/tags:nr_reserved_tags=0 nvme0n1/hctx4/tags:active_queues=0 nvme0n1/hctx4/tags:bitmap_tags: nvme0n1/hctx4/tags:depth=1023 nvme0n1/hctx4/tags:busy=3
Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
...
nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1} nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx4/state:SCHED_RESTART
The request staying in hctx->dispatch can't move on, and nvme_queue_rq() returns -BLK_STS_RESOURCE constantly, and you can verify with the following bpftrace when the hang is triggered:
bpftrace -e 'kretfunc:nvme_queue_rq { @[retval, kstack]=count() }'
It is very likely that a memory allocation inside nvme_queue_rq() can't be done successfully, so blk-mq just has to retry by calling nvme_queue_rq() on the above request.
Thanks, Ming
On 11/1/23 04:24, Ming Lei wrote:
On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
dm-crypt lives on the nvme disk, this is what I collected when it hanged:
...
nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1} nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
Two requests stay in the sw queue, but they are not related to this issue.
nvme0n1/hctx4/type:default nvme0n1/hctx4/dispatch_busy:9
A non-zero dispatch_busy means that nvme_queue_rq() has recently been returning BLK_STS_RESOURCE most of the time.
nvme0n1/hctx4/active:0 nvme0n1/hctx4/run:20290468
...
nvme0n1/hctx4/tags:nr_tags=1023 nvme0n1/hctx4/tags:nr_reserved_tags=0 nvme0n1/hctx4/tags:active_queues=0 nvme0n1/hctx4/tags:bitmap_tags: nvme0n1/hctx4/tags:depth=1023 nvme0n1/hctx4/tags:busy=3
Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
...
nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1} nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx4/state:SCHED_RESTART
The request staying in hctx->dispatch can't move on, and nvme_queue_rq() returns -BLK_STS_RESOURCE constantly, and you can verify with the following bpftrace when the hang is triggered:
bpftrace -e 'kretfunc:nvme_queue_rq { @[retval, kstack]=count() }'
It is very likely that a memory allocation inside nvme_queue_rq() can't be done successfully, so blk-mq just has to retry by calling nvme_queue_rq() on the above request.
And that is something I've been wondering (for quite some time now): What _is_ the appropriate error handling for -ENOMEM? At this time, we assume it to be a retryable error and re-run the queue in the hope that things will sort itself out. But if they don't we're stuck. Can we somehow figure out if we make progress during submission, and (at least) issue a warning once we detect a stall?
Cheers,
Hannes
On Wed 01-11-23 11:15:02, Hannes Reinecke wrote:
On 11/1/23 04:24, Ming Lei wrote:
On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote: > Then tried: > - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce, > - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce, > - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times > and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
dm-crypt lives on the nvme disk, this is what I collected when it hanged:
...
nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1} nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
Two requests stay in the sw queue, but they are not related to this issue.
nvme0n1/hctx4/type:default nvme0n1/hctx4/dispatch_busy:9
A non-zero dispatch_busy means that nvme_queue_rq() has recently been returning BLK_STS_RESOURCE most of the time.
nvme0n1/hctx4/active:0 nvme0n1/hctx4/run:20290468
...
nvme0n1/hctx4/tags:nr_tags=1023 nvme0n1/hctx4/tags:nr_reserved_tags=0 nvme0n1/hctx4/tags:active_queues=0 nvme0n1/hctx4/tags:bitmap_tags: nvme0n1/hctx4/tags:depth=1023 nvme0n1/hctx4/tags:busy=3
Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
...
nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1} nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx4/state:SCHED_RESTART
The request staying in hctx->dispatch can't move on, and nvme_queue_rq() returns -BLK_STS_RESOURCE constantly, and you can verify with the following bpftrace when the hang is triggered:
bpftrace -e 'kretfunc:nvme_queue_rq { @[retval, kstack]=count() }'
It is very likely that a memory allocation inside nvme_queue_rq() can't be done successfully, so blk-mq just has to retry by calling nvme_queue_rq() on the above request.
And that is something I've been wondering (for quite some time now): What _is_ the appropriate error handling for -ENOMEM? At this time, we assume it to be a retryable error and re-run the queue in the hope that things will sort itself out. But if they don't we're stuck. Can we somehow figure out if we make progress during submission, and (at least) issue a warning once we detect a stall?
Well, but Marek has show [1] the machine is pretty far from being OOM when it is stuck. So it doesn't seem like a simple OOM situation...
Honza
[1] https://lore.kernel.org/all/ZTiJ3CO8w0jauOzW@mail-itl/
On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
On 11/1/23 04:24, Ming Lei wrote:
On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote: > Then tried: > - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce, > - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce, > - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times > and I can't reproduce the issue there. I'm confused...
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather something either in the block layer or in the storage driver has problems with handling bios with sufficiently high order pages attached. This is going to be a bit painful to debug I'm afraid. How long does it take for you trigger the hang? I'm asking to get rough estimate how heavy tracing we can afford so that we don't overwhelm the system...
Sometimes it freezes just after logging in, but in worst case it takes me about 10min of more or less `tar xz` + `dd`.
blk-mq debugfs is usually helpful for hang issue in block layer or underlying drivers:
(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
BTW, you can just collect logs of the exact disks if you know what are behind dm-crypt, which can be figured out by `lsblk`, and it has to be collected after the hang is triggered.
dm-crypt lives on the nvme disk, this is what I collected when it hanged:
...
nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1} nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
Two requests stay in the sw queue, but they are not related to this issue.
nvme0n1/hctx4/type:default nvme0n1/hctx4/dispatch_busy:9
A non-zero dispatch_busy means that nvme_queue_rq() has recently been returning BLK_STS_RESOURCE most of the time.
nvme0n1/hctx4/active:0 nvme0n1/hctx4/run:20290468
...
nvme0n1/hctx4/tags:nr_tags=1023 nvme0n1/hctx4/tags:nr_reserved_tags=0 nvme0n1/hctx4/tags:active_queues=0 nvme0n1/hctx4/tags:bitmap_tags: nvme0n1/hctx4/tags:depth=1023 nvme0n1/hctx4/tags:busy=3
Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
...
nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1} nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE nvme0n1/hctx4/state:SCHED_RESTART
The request staying in hctx->dispatch can't move on, and nvme_queue_rq() returns -BLK_STS_RESOURCE constantly, and you can verify with the following bpftrace when the hang is triggered:
bpftrace -e 'kretfunc:nvme_queue_rq { @[retval, kstack]=count() }'
It is very likely that a memory allocation inside nvme_queue_rq() can't be done successfully, so blk-mq just has to retry by calling nvme_queue_rq() on the above request.
And that is something I've been wondering (for quite some time now): What _is_ the appropriate error handling for -ENOMEM?
It is just my guess.
Actually it shouldn't fail since the sgl allocation is backed by a memory pool, but there are also the dma pool allocation and the dma mapping.
At this time, we assume it to be a retryable error and re-run the queue in the hope that things will sort itself out.
It should not be hard to figure out why nvme_queue_rq() can't move on.
But if they don't we're stuck. Can we somehow figure out if we make progress during submission, and (at least) issue a warning once we detect a stall?
That would need a retry counter on the request, and people often hate to add anything to the request or bio in the fast path. Also, this kind of issue is easy to see in blk-mq debugfs or with bpftrace.
Thanks, Ming
On Wed, Nov 01, 2023 at 07:23:05PM +0800, Ming Lei wrote:
On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
nvme_queue_rq() on the above request.
And that is something I've been wondering (for quite some time now): What _is_ the appropriate error handling for -ENOMEM?
It is just my guess.
Actually it shouldn't fail since the sgl allocation is backed by a memory pool, but there are also the dma pool allocation and the dma mapping.
At this time, we assume it to be a retryable error and re-run the queue in the hope that things will sort itself out.
It should not be hard to figure out why nvme_queue_rq() can't move on.
There are only a few reasons nvme_queue_rq() would return BLK_STS_RESOURCE for a typical read/write command:
- DMA mapping error
- Can't allocate an SGL from the mempool
- Can't allocate a PRP from the dma_pool
- Controller stuck in the resetting state
We should always be able to get at least one allocation from the memory pools, so I think the only cases where the driver has no way to guarantee eventual forward progress are the DMA mapping error conditions. Is there some other limit that the driver needs to consider when configuring its largest supported transfers?
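A hedged sketch of the failure mode the thread keeps coming back to (illustrative only, not the actual nvme-pci code; dma_map_sgtable() and the blk_status_t values are real):

#include <linux/dma-mapping.h>
#include <linux/blk_types.h>

/* Hypothetical helper, not actual driver code. */
static blk_status_t sketch_map_data(struct device *dma_dev,
				    struct sg_table *sgt)
{
	/* Under Xen this goes through swiotlb; if the scatterlist is larger
	 * than swiotlb can ever map, the call fails every single time. */
	if (dma_map_sgtable(dma_dev, sgt, DMA_TO_DEVICE, 0))
		return BLK_STS_RESOURCE;	/* blk-mq retries indefinitely */

	return BLK_STS_OK;
}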
On Wed, 1 Nov 2023, Hannes Reinecke wrote:
And that is something I've been wondering (for quite some time now): What _is_ the appropriate error handling for -ENOMEM? At this time, we assume it to be a retryable error and re-run the queue in the hope that things will sort itself out. But if they don't we're stuck. Can we somehow figure out if we make progress during submission, and (at least) issue a warning once we detect a stall?
The appropriate way is to use mempools. mempool_alloc (with __GFP_DIRECT_RECLAIM) can't ever fail.
But some kernel code does GFP_NOIO allocations in the I/O path and the authors hope that they get away with it.
Mikulas
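As an aside, a minimal sketch of the mempool pattern being described; the structure and pool size are made up, while mempool_create_kmalloc_pool(), mempool_alloc() and mempool_free() are the real API:

#include <linux/mempool.h>
#include <linux/slab.h>

struct io_ctx { char buf[256]; };	/* illustrative payload */

static mempool_t *ctx_pool;

static int example_init(void)
{
	/* Pre-reserve 16 elements that survive memory pressure. */
	ctx_pool = mempool_create_kmalloc_pool(16, sizeof(struct io_ctx));
	return ctx_pool ? 0 : -ENOMEM;
}

static struct io_ctx *example_get(void)
{
	/* GFP_NOIO includes __GFP_DIRECT_RECLAIM, so this call sleeps until
	 * an element is freed back instead of returning NULL. */
	return mempool_alloc(ctx_pool, GFP_NOIO);
}

static void example_put(struct io_ctx *ctx)
{
	mempool_free(ctx, ctx_pool);
}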
On Mon 30-10-23 09:37:10, Vlastimil Babka wrote:
On 10/30/23 08:37, Mikulas Patocka wrote:
On Sun, 29 Oct 2023, Vlastimil Babka wrote:
Haven't found any. However I'd like to point out some things I noticed in crypt_alloc_buffer(), although they are probably not related.
static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
{
	struct crypt_config *cc = io->cc;
	struct bio *clone;
	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
	unsigned int remaining_size;
	unsigned int order = MAX_ORDER - 1;

retry:
	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
		mutex_lock(&cc->bio_alloc_lock);
What if we end up in "goto retry" more than once? I don't see a matching
It is impossible. Before we jump to the retry label, we set __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if __GFP_DIRECT_RECLAIM is present (it will just wait until some other task frees some objects into the mempool).
Ah, missed that. And the traces don't show that we would be waiting for that. I'm starting to think the allocation itself is really not the issue here. Also I don't think it deprives something else of large order pages, as per the sysrq listing they still existed.
What I rather suspect is what happens next to the allocated bio such that it works well with order-0 or up to costly_order pages, but there's some problem causing a deadlock if the bio contains larger pages than that?
Hum, so in all the backtraces presented we see that we are waiting for page writeback to complete but I don't see anything that would be preventing the bios from completing. Page writeback can submit quite large bios so it kind of makes sense that it trips up some odd behavior. Looking at the code I can see one possible problem in crypt_alloc_buffer() but it doesn't explain why reducing initial page order would help. Anyway: Are we guaranteed mempool has enough pages for arbitrarily large bio that can enter crypt_alloc_buffer()? I can see crypt_page_alloc() does limit the number of pages in the mempool to dm_crypt_pages_per_client plus I assume the percpu counter bias in cc->n_allocated_pages can limit the really available number of pages even further. So if a single bio is large enough to trip percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client condition in crypt_page_alloc(), we can loop forever? But maybe this cannot happen for some reason...
If this is not it, I think we need to find out why the writeback bios are not completing. Probably I'd start with checking what kcryptd, presumably responsible for processing these bios, is doing.
Honza
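For reference, the check Jan is referring to looks roughly like this - a hedged reconstruction of crypt_page_alloc(), the mempool's allocation callback; consult drivers/md/dm-crypt.c for the authoritative version:

static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
{
	struct crypt_config *cc = pool_data;
	struct page *page;

	/* Refuse to allocate fresh pages above the per-client limit; the
	 * mempool then falls back to its pre-reserved elements. */
	if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >=
		     dm_crypt_pages_per_client) &&
	    likely(gfp_mask & __GFP_NORETRY))
		return NULL;

	page = alloc_page(gfp_mask);
	if (likely(page != NULL))
		percpu_counter_add(&cc->n_allocated_pages, 1);

	return page;
}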
On Mon, 30 Oct 2023, Jan Kara wrote:
What if we end up in "goto retry" more than once? I don't see a matching
It is impossible. Before we jump to the retry label, we set __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if __GFP_DIRECT_RECLAIM is present (it will just wait until some other task frees some objects into the mempool).
Ah, missed that. And the traces don't show that we would be waiting for that. I'm starting to think the allocation itself is really not the issue here. Also I don't think it deprives something else of large order pages, as per the sysrq listing they still existed.
What I rather suspect is what happens next to the allocated bio such that it works well with order-0 or up to costly_order pages, but there's some problem causing a deadlock if the bio contains larger pages than that?
Hum, so in all the backtraces presented we see that we are waiting for page writeback to complete but I don't see anything that would be preventing the bios from completing. Page writeback can submit quite large bios so it kind of makes sense that it trips up some odd behavior. Looking at the code I can see one possible problem in crypt_alloc_buffer() but it doesn't explain why reducing initial page order would help. Anyway: Are we guaranteed mempool has enough pages for arbitrarily large bio that can enter crypt_alloc_buffer()? I can see crypt_page_alloc() does limit the number of pages in the mempool to dm_crypt_pages_per_client plus I assume the percpu counter bias in cc->n_allocated_pages can limit the really available number of pages even further. So if a single bio is large enough to trip percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client condition in crypt_page_alloc(), we can loop forever? But maybe this cannot happen for some reason...
If this is not it, I think we need to find out why the writeback bios are not completing. Probably I'd start with checking what kcryptd, presumably responsible for processing these bios, is doing.
Honza
cc->page_pool is initialized to hold BIO_MAX_VECS pages. crypt_map will restrict the bio size to BIO_MAX_VECS (see dm_accept_partial_bio being called from crypt_map).
When we allocate a buffer in crypt_alloc_buffer, we try first allocation without waiting, then we grab the mutex and we try allocation with waiting.
The mutex should prevent a deadlock when two processes allocate 128 pages concurrently and wait for each other to free some pages.
The limit of dm_crypt_pages_per_client only applies to pages allocated from the kernel - when this limit is reached, we can still allocate from the mempool, so it shouldn't cause deadlocks.
Mikulas
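Put together, the allocation strategy described above can be sketched like this - heavily simplified and not the verbatim dm-crypt code, with "allocation_failed" standing in for the real mid-loop failure checks; the mutex is what keeps two waiters from deadlocking on each other's pages:

	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;

retry:
	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
		mutex_lock(&cc->bio_alloc_lock);
	/* ... allocate the clone bio and up to BIO_MAX_VECS pages from
	 * cc->page_pool using gfp_mask ... */
	if (allocation_failed) {		/* illustrative placeholder */
		/* free whatever was obtained so far, then retry in waiting
		 * mode, serialized by the mutex */
		gfp_mask |= __GFP_DIRECT_RECLAIM;
		goto retry;
	}
	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
		mutex_unlock(&cc->bio_alloc_lock);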
On Mon 30-10-23 12:49:01, Mikulas Patocka wrote:
On Mon, 30 Oct 2023, Jan Kara wrote:
What if we end up in "goto retry" more than once? I don't see a matching
It is impossible. Before we jump to the retry label, we set __GFP_DIRECT_RECLAIM. mempool_alloc can't ever fail if __GFP_DIRECT_RECLAIM is present (it will just wait until some other task frees some objects into the mempool).
Ah, missed that. And the traces don't show that we would be waiting for that. I'm starting to think the allocation itself is really not the issue here. Also I don't think it deprives something else of large order pages, as per the sysrq listing they still existed.
What I rather suspect is what happens next to the allocated bio such that it works well with order-0 or up to costly_order pages, but there's some problem causing a deadlock if the bio contains larger pages than that?
Hum, so in all the backtraces presented we see that we are waiting for page writeback to complete but I don't see anything that would be preventing the bios from completing. Page writeback can submit quite large bios so it kind of makes sense that it trips up some odd behavior. Looking at the code I can see one possible problem in crypt_alloc_buffer() but it doesn't explain why reducing initial page order would help. Anyway: Are we guaranteed mempool has enough pages for arbitrarily large bio that can enter crypt_alloc_buffer()? I can see crypt_page_alloc() does limit the number of pages in the mempool to dm_crypt_pages_per_client plus I assume the percpu counter bias in cc->n_allocated_pages can limit the really available number of pages even further. So if a single bio is large enough to trip percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client condition in crypt_page_alloc(), we can loop forever? But maybe this cannot happen for some reason...
If this is not it, I think we need to find out why the writeback bios are not completing. Probably I'd start with checking what kcryptd, presumably responsible for processing these bios, is doing.
Honza
cc->page_pool is initialized to hold BIO_MAX_VECS pages. crypt_map will restrict the bio size to BIO_MAX_VECS (see dm_accept_partial_bio being called from crypt_map).
When we allocate a buffer in crypt_alloc_buffer, we try first allocation without waiting, then we grab the mutex and we try allocation with waiting.
The mutex should prevent a deadlock when two processes allocate 128 pages concurrently and wait for each other to free some pages.
The limit of dm_crypt_pages_per_client only applies to pages allocated from the kernel - when this limit is reached, we can still allocate from the mempool, so it shouldn't cause deadlocks.
Ah, ok, I missed the limitation of the bio size in crypt_map(). Thanks for the explanation! So really the only advice I have now is to check what kcryptd is doing when the system is stuck. Because we didn't see it in any of the stacktrace dumps.
Honza