Dominique Martinet wrote on Tue, Jan 07, 2025 at 11:40:30AM +0900:
> I'll look a bit more into it today and reply to this mail with anything I've found. (didn't test on master or anything else either)
So:
- master has no problem
- 5.10.233-rc1, 5.15.176-rc1, 6.1.124-rc1 have the same issue
- 6.6.70-rc1 is also fine
- My previous mail lacked stacktrace decoding, here's a new backtrace with proper decoding on 6.1.124-rc1, produced by virtme-ng + decode_stacktrace.sh (end of the mail)
    vng -e 'dmesg -C; echo 1 | sudo tee /sys/class/block/zram0/reset || dmesg'
- looking at said backtrace, a likely difference would be the multi-stream rework, in particular commit 7ac07a26dea7 ("zram: preparation for multi-zcomp support") that changed how freeing works.
  ... and cherry-picking it on 6.1 does fix the issue.
  Unfortunately it doesn't cherry-pick cleanly to 5.15 and 5.10, so for these two I'm not sure if it's better to just drop this "zram: fix uninitialized ZRAM not releasing backing device" commit, or if we should try harder to backport prerequisites (e55e1b483156 ("block: move from strlcpy with unused retval to strscpy") is an obvious one that is not too hard to pick even if not clean, but that wasn't enough and I didn't try further).
I've at least checked that dropping this patch is enough, and will not do anything else on this for now.
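As a reading aid for the trace below: the trapping instruction is `mov -0x10(%rsi),%rbx` with RSI = 0x10 and CR2 = 0, which is what one would expect if the hotplug callback was handed the node of a NULL zcomp. The following is only a userspace sketch of that pointer arithmetic. `struct zcomp_sketch`, `struct hlist_node_sketch` and both helper functions are hypothetical names, and the field order is an assumption picked to match the 0x10 offset seen in the disassembly, not a copy of the kernel definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's hlist_node (two pointers). */
struct hlist_node_sketch {
	struct hlist_node_sketch *next;
	struct hlist_node_sketch **pprev;
};

/* Hypothetical layout; only the offsets matter for this illustration. */
struct zcomp_sketch {
	void *stream;                  /* assumed at offset 0x00 */
	const char *name;              /* assumed at offset 0x08 */
	struct hlist_node_sketch node; /* assumed at offset 0x10 */
};

/* The sketch only matches the trace on a 64-bit target. */
static_assert(offsetof(struct zcomp_sketch, node) == 0x10,
	      "node is expected 0x10 bytes into the struct");

/* Address that &comp->node evaluates to, done on integers to avoid UB. */
static uintptr_t node_addr(uintptr_t comp)
{
	return comp + offsetof(struct zcomp_sketch, node);
}

/* Address read first by a callback doing container_of(node)->stream. */
static uintptr_t first_load_addr(uintptr_t node)
{
	uintptr_t comp = node - offsetof(struct zcomp_sketch, node);

	return comp + offsetof(struct zcomp_sketch, stream);
}
```

With a NULL comp, node_addr(0) comes out as 0x10 (the RSI value in the dump) and first_load_addr(0x10) comes out as 0 (the CR2 value), so a teardown path that skips a NULL comp would never reach this load.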
(nothing aside from the symbolized backtrace below)
-----------------
[ 2.184091] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2.184094] #PF: supervisor read access in kernel mode
[ 2.184096] #PF: error_code(0x0000) - not-present page
[ 2.184098] PGD 0 P4D 0
[ 2.184101] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 2.184104] CPU: 2 PID: 650 Comm: tee Not tainted 6.1.124-rc1+ #5
[ 2.184107] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 2.184109] RIP: 0010:zcomp_cpu_dead (drivers/block/zram/zcomp.c:171)
[ 2.184115] Code: ff 31 c0 48 c7 c7 c8 4c 9b a1 48 89 43 08 48 89 03 e8 ac d1 2c 00 ba f4 ff ff ff eb bb 0f 1f 44 00 00 0f 1f 44 00 00 89 ff 53 <48> 8b 5e f0 48 03 1c fd 20 ea 9e a1 48 8b 7b 08 48 85 ff 74 11 48
All code
========
   0:  ff 31                   push (%rcx)
   2:  c0 48 c7 c7             rorb $0xc7,-0x39(%rax)
   6:  c8 4c 9b a1             enter $0x9b4c,$0xa1
   a:  48 89 43 08             mov %rax,0x8(%rbx)
   e:  48 89 03                mov %rax,(%rbx)
  11:  e8 ac d1 2c 00          call 0x2cd1c2
  16:  ba f4 ff ff ff          mov $0xfffffff4,%edx
  1b:  eb bb                   jmp 0xffffffffffffffd8
  1d:  0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
  22:  0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
  27:  89 ff                   mov %edi,%edi
  29:  53                      push %rbx
  2a:* 48 8b 5e f0             mov -0x10(%rsi),%rbx <-- trapping instruction
  2e:  48 03 1c fd 20 ea 9e    add -0x5e6115e0(,%rdi,8),%rbx
  35:  a1
  36:  48 8b 7b 08             mov 0x8(%rbx),%rdi
  3a:  48 85 ff                test %rdi,%rdi
  3d:  74 11                   je 0x50
  3f:  48                      rex.W

Code starting with the faulting instruction
===========================================
   0:  48 8b 5e f0             mov -0x10(%rsi),%rbx
   4:  48 03 1c fd 20 ea 9e    add -0x5e6115e0(,%rdi,8),%rbx
   b:  a1
   c:  48 8b 7b 08             mov 0x8(%rbx),%rdi
  10:  48 85 ff                test %rdi,%rdi
  13:  74 11                   je 0x26
  15:  48                      rex.W
[ 2.184117] RSP: 0018:ffffb556409ffd20 EFLAGS: 00010246
[ 2.184120] RAX: ffffffffa0e09620 RBX: 0000000000000000 RCX: ffffffffa1c604c0
[ 2.184122] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000000
[ 2.184124] RBP: 0000000000000044 R08: 0000000000000000 R09: 000000000000000a
[ 2.184125] R10: ffff9e20fe61b360 R11: 0fffffffffffffff R12: 0000000000000010
[ 2.184127] R13: ffff9e20fe61b360 R14: ffffffffa0e09620 R15: 0000000000000000
[ 2.184130] FS: 00007f4ffda3c740(0000) GS:ffff9e20fe680000(0000) knlGS:0000000000000000
[ 2.184134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.184136] CR2: 0000000000000000 CR3: 0000000005362000 CR4: 0000000000750ee0
[ 2.184138] PKRU: 55555554
[ 2.184139] Call Trace:
[ 2.184141] <TASK>
[ 2.184144] ? __die_body.cold (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:465 arch/x86/kernel/dumpstack.c:420)
[ 2.184149] ? page_fault_oops (arch/x86/mm/fault.c:727)
[ 2.184153] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:131)
[ 2.184158] ? kernfs_iop_setattr (fs/kernfs/inode.c:137)
[ 2.184163] ? exc_page_fault (arch/x86/include/asm/irqflags.h:40 arch/x86/include/asm/irqflags.h:75 arch/x86/mm/fault.c:1439 arch/x86/mm/fault.c:1487)
[ 2.184168] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:608)
[ 2.184172] ? zcomp_cpu_up_prepare (drivers/block/zram/zcomp.c:167)
[ 2.184176] ? zcomp_cpu_up_prepare (drivers/block/zram/zcomp.c:167)
[ 2.184179] ? zcomp_cpu_dead (drivers/block/zram/zcomp.c:171)
[ 2.184182] cpuhp_invoke_callback (kernel/cpu.c:202)
[ 2.184187] cpuhp_issue_call (kernel/cpu.c:2016)
[ 2.184190] __cpuhp_state_remove_instance (kernel/cpu.c:2224)
[ 2.184193] zcomp_destroy (drivers/block/zram/zcomp.c:197)
[ 2.184196] zram_reset_device (drivers/block/zram/zram_drv.c:1737)
[ 2.184199] reset_store (drivers/pci/pci-sysfs.c:1387)
[ 2.184204] kernfs_fop_write_iter (fs/kernfs/file.c:338)
[ 2.184208] vfs_write (include/linux/fs.h:2265 fs/read_write.c:491 fs/read_write.c:584)
[ 2.184214] ksys_write (fs/read_write.c:637)
[ 2.184217] do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:81)
[ 2.184220] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
[ 2.184224] RIP: 0033:0x7f4ffdb372c0
[ 2.184227] Code: 40 00 48 8b 15 41 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d 21 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
All code
========
   0:  40 00 48 8b             rex add %cl,-0x75(%rax)
   4:  15 41 9b 0d 00          adc $0xd9b41,%eax
   9:  f7 d8                   neg %eax
   b:  64 89 02                mov %eax,%fs:(%rdx)
   e:  48 c7 c0 ff ff ff ff    mov $0xffffffffffffffff,%rax
  15:  eb b7                   jmp 0xffffffffffffffce
  17:  0f 1f 00                nopl (%rax)
  1a:  80 3d 21 23 0e 00 00    cmpb $0x0,0xe2321(%rip) # 0xe2342
  21:  74 17                   je 0x3a
  23:  b8 01 00 00 00          mov $0x1,%eax
  28:  0f 05                   syscall
  2a:* 48 3d 00 f0 ff ff       cmp $0xfffffffffffff000,%rax <-- trapping instruction
  30:  77 58                   ja 0x8a
  32:  c3                      ret
  33:  0f 1f 80 00 00 00 00    nopl 0x0(%rax)
  3a:  48 83 ec 28             sub $0x28,%rsp
  3e:  48                      rex.W
  3f:  89                      .byte 0x89

Code starting with the faulting instruction
===========================================
   0:  48 3d 00 f0 ff ff       cmp $0xfffffffffffff000,%rax
   6:  77 58                   ja 0x60
   8:  c3                      ret
   9:  0f 1f 80 00 00 00 00    nopl 0x0(%rax)
  10:  48 83 ec 28             sub $0x28,%rsp
  14:  48                      rex.W
  15:  89                      .byte 0x89
[ 2.184229] RSP: 002b:00007ffd10d35a78 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2.184232] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f4ffdb372c0
[ 2.184233] RDX: 0000000000000002 RSI: 00007ffd10d35b90 RDI: 0000000000000003
[ 2.184235] RBP: 00007ffd10d35b90 R08: 0000000000000004 R09: 0000000000000001
[ 2.184236] R10: 00007f4ffda53f18 R11: 0000000000000202 R12: 0000000000000002
[ 2.184238] R13: 0000559c792ae310 R14: 0000000000000002 R15: 00007f4ffdc0d9e0
[ 2.184243] </TASK>
[ 2.184244] Modules linked in:
[ 2.184247] CR2: 0000000000000000
[ 2.184248] ---[ end trace 0000000000000000 ]---
[ 2.184250] RIP: 0010:zcomp_cpu_dead (drivers/block/zram/zcomp.c:171)
[ 2.184253] Code: ff 31 c0 48 c7 c7 c8 4c 9b a1 48 89 43 08 48 89 03 e8 ac d1 2c 00 ba f4 ff ff ff eb bb 0f 1f 44 00 00 0f 1f 44 00 00 89 ff 53 <48> 8b 5e f0 48 03 1c fd 20 ea 9e a1 48 8b 7b 08 48 85 ff 74 11 48
All code
========
   0:  ff 31                   push (%rcx)
   2:  c0 48 c7 c7             rorb $0xc7,-0x39(%rax)
   6:  c8 4c 9b a1             enter $0x9b4c,$0xa1
   a:  48 89 43 08             mov %rax,0x8(%rbx)
   e:  48 89 03                mov %rax,(%rbx)
  11:  e8 ac d1 2c 00          call 0x2cd1c2
  16:  ba f4 ff ff ff          mov $0xfffffff4,%edx
  1b:  eb bb                   jmp 0xffffffffffffffd8
  1d:  0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
  22:  0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
  27:  89 ff                   mov %edi,%edi
  29:  53                      push %rbx
  2a:* 48 8b 5e f0             mov -0x10(%rsi),%rbx <-- trapping instruction
  2e:  48 03 1c fd 20 ea 9e    add -0x5e6115e0(,%rdi,8),%rbx
  35:  a1
  36:  48 8b 7b 08             mov 0x8(%rbx),%rdi
  3a:  48 85 ff                test %rdi,%rdi
  3d:  74 11                   je 0x50
  3f:  48                      rex.W

Code starting with the faulting instruction
===========================================
   0:  48 8b 5e f0             mov -0x10(%rsi),%rbx
   4:  48 03 1c fd 20 ea 9e    add -0x5e6115e0(,%rdi,8),%rbx
   b:  a1
   c:  48 8b 7b 08             mov 0x8(%rbx),%rdi
  10:  48 85 ff                test %rdi,%rdi
  13:  74 11                   je 0x26
  15:  48                      rex.W
[ 2.184254] RSP: 0018:ffffb556409ffd20 EFLAGS: 00010246
[ 2.184256] RAX: ffffffffa0e09620 RBX: 0000000000000000 RCX: ffffffffa1c604c0
[ 2.184258] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000000
[ 2.184259] RBP: 0000000000000044 R08: 0000000000000000 R09: 000000000000000a
[ 2.184261] R10: ffff9e20fe61b360 R11: 0fffffffffffffff R12: 0000000000000010
[ 2.184262] R13: ffff9e20fe61b360 R14: ffffffffa0e09620 R15: 0000000000000000
[ 2.184266] FS: 00007f4ffda3c740(0000) GS:ffff9e20fe680000(0000) knlGS:0000000000000000
[ 2.184269] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.184271] CR2: 0000000000000000 CR3: 0000000005362000 CR4: 0000000000750ee0
[ 2.184273] PKRU: 55555554
[ 2.184274] note: tee[650] exited with irqs disabled
Thanks,