On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
On Mon, Oct 30, 2023 at 06:50:35PM +0100, Mikulas Patocka wrote:
On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
Then retried with order=PAGE_ALLOC_COSTLY_ORDER and PAGE_ALLOC_COSTLY_ORDER back at 3, and also got a similar crash.
So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER isn't safe?
That seems to be another bug, see below.
Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try to provoke a similar crash. Let's see if it crashes on one of the VM_BUG_ON statements.
This was a very interesting idea. With this, immediately after login I get a crash like the one below. Which makes sense, as this is when pulseaudio starts and opens /dev/snd/*. I then tried with the dm-crypt commit reverted and still got the crash! But, after blacklisting snd_pcm, there is no BUG splat, but the storage freeze still happens on vanilla 6.5.6.
OK, great. Thanks for testing.
<snip snd_pcm bug>
Plain 6.5.6 (so order = MAX_ORDER - 1, and PAGE_ALLOC_COSTLY_ORDER=3), in frozen state:
[ 143.196106] task:blkdiscard state:D stack:13672 pid:4884 ppid:2025 flags:0x00000002
[ 143.196130] Call Trace:
[ 143.196139] <TASK>
[ 143.196147] __schedule+0x30e/0x8b0
[ 143.196162] schedule+0x59/0xb0
[ 143.196175] schedule_timeout+0x14c/0x160
[ 143.196193] io_schedule_timeout+0x4b/0x70
[ 143.196207] wait_for_completion_io+0x81/0x130
[ 143.196226] submit_bio_wait+0x5c/0x90
[ 143.196241] blkdev_issue_discard+0x94/0xe0
[ 143.196260] blkdev_common_ioctl+0x79e/0x9c0
[ 143.196279] blkdev_ioctl+0xc7/0x270
[ 143.196293] __x64_sys_ioctl+0x8f/0xd0
[ 143.196310] do_syscall_64+0x3c/0x90
So this shows there was a bio submitted and it never ran to completion.
for f in $(grep -l crypt /proc/*/comm); do head $f ${f/comm/stack}; done
<snip some backtraces>
So this shows dm-crypt layer isn't stuck anywhere. So the allocation path itself doesn't seem to be locking up, looping or anything.
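For reference, a rough sketch of how one might list every task stuck in uninterruptible sleep (state D, like the blkdiscard task above) rather than only the crypt threads. The helper names stat_state and dump_blocked_tasks are made up for illustration; the sketch assumes the usual /proc/<pid>/stat layout, needs root to read /proc/<pid>/stack, and only walks thread-group leaders, not /proc/<pid>/task/*.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Print the state letter from the contents of a /proc/<pid>/stat line.
# comm may contain spaces or ')' itself, so strip up to the LAST ") ".
stat_state() {
    rest=${1##*") "}
    printf '%s\n' "${rest%% *}"
}

# Print pid, comm and kernel stack of every task currently in state D.
dump_blocked_tasks() {
    for d in /proc/[0-9]*; do
        line=$(cat "$d/stat" 2>/dev/null) || continue   # task may have exited
        [ "$(stat_state "$line")" = "D" ] || continue
        printf '== pid %s (%s)\n' "${d#/proc/}" "$(cat "$d/comm" 2>/dev/null)"
        cat "$d/stack" 2>/dev/null                      # needs root
    done
}
```

Running dump_blocked_tasks as root while the system is in the frozen state would show whether anything besides the submitter is blocked.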
Then tried:
- PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
- PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times and I can't reproduce the issue there. I'm confused...
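To put the tested orders in perspective: an order-n allocation is 2^n physically contiguous pages. A small sketch of the arithmetic, assuming 4096-byte pages (typical for x86-64, but an assumption about this machine); order_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# Size in bytes of a single order-N page allocation.
# $1 = order, $2 = page size in bytes (defaults to 4096, an assumption).
order_bytes() {
    order=$1
    page_size=${2:-4096}
    echo $(( page_size << order ))
}

# Orders exercised in the experiments above.
for o in 3 4 5 6; do
    printf 'order %s = %s KiB\n' "$o" "$(( $(order_bytes "$o") / 1024 ))"
done
```

With that page size, the step from "cannot reproduce" (order 5) to "freeze rather quickly" (order 6) is the step from 128 KiB to 256 KiB contiguous buffers.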
And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER causing hangs is most likely just a coincidence. Rather, something either in the block layer or in the storage driver has problems handling bios with sufficiently high-order pages attached. This is going to be a bit painful to debug, I'm afraid. How long does it take for you to trigger the hang? I'm asking to get a rough estimate of how heavy tracing we can afford so that we don't overwhelm the system...
Honza