Hi Ronald,
+ Richard, my dear jffs2 expert ;)
ronald.wahl@raritan.com wrote on Mon, 27 Nov 2023 18:54:40 +0100:
On 27.11.23 16:10, Ronald Wahl wrote:
On 27.11.23 10:58, Miquel Raynal wrote:
The intended move from wait_for_completion_*() to wait_for_completion_interruptible_*() was to allow (very) long spi memor
y
transfers to be stopped upon user request instead of freezing the machine forever as the timeout value could now be significantly bigger.
However, depending on the user logic, applications can receive many signals for their own "internal" purpose and have nothing to do with the requested kernel operations, hence interrupting spi transfers upon any signal is probably not a wise choice. Instead, let's switch to wait_for_completion_killable_*() to only catch the "important" signals. This was likely the intended behavior anyway.
Actually this seems to work. But aborting a process that has a SPI transfer running causes ugly messages from kernel. This is somehow unexpected:
# dd if=/dev/urandom of=/flashdisk/testfile bs=1024 count=512 ^C[ 380.726760] spi-nor spi0.0: spi transfer canceled [ 380.731688] spi-nor spi0.0: SPI transfer failed: -512 [ 380.737141] spi_master spi0: failed to transfer one message from queue [ 380.746495] spi-nor spi0.0: spi transfer canceled [ 380.751549] spi-nor spi0.0: SPI transfer failed: -512 [ 380.756844] spi_master spi0: failed to transfer one message from queue
JFFS2 also logs an informational message which is less visible but also may rise eyebrows: [ 380.743904] jffs2: Write of 4164 bytes at 0x0016a47c failed. retu
rned
-512, retlen 68
Killing a process is something to expect in certain cases and it should not cause such messages which may create some anxiety that something bad had happened. So maybe the "kill" case should be silent (e.g. level "debug") but without out hiding real errors. But even when hiding the message in t
he
SPI framework it may cause additional messages in upper layers like JFFS2
.
I'm not sure whether all of this is a good idea. This is something others have to decide.
... and now I just got a crash when unmounting and remounting jffs2:
unmount: [ 8245.821105] spi-nor spi0.0: spi transfer canceled [ 8245.826288] spi-nor spi0.0: SPI transfer failed: -512 [ 8245.831508] spi_master spi0: failed to transfer one message from queue [ 8245.838484] jffs2: Write of 1092 bytes at 0x00181458 failed. returned -5 12, retlen 68 [ 8245.839786] spi-nor spi0.0: spi transfer canceled [ 8245.844759] spi-nor spi0.0: SPI transfer failed: -512 [ 8245.850145] spi_master spi0: failed to transfer one message from queue [ 8245.856909] jffs2: Write of 1092 bytes at 0x0018189c failed. returned -5 12, retlen 0 [ 8245.856942] jffs2: Not marking the space at 0x0018189c as dirty because the flash driver returned retlen zero
mount: [ 8831.213456] jffs2: error: (1142) jffs2_link_node_ref: Adding new ref 28b d9da7 at (0x000ad578-0x000ae5bc) not immediately after previous (0x000ad578 -0x000ad578) [ 8831.228212] Internal error: Oops - undefined instruction: 0 [#1] THUMB2 [ 8831.234996] CPU: 0 PID: 1142 Comm: mount Not tainted 6.6.2-sama5 #1 [ 8831.241587] Hardware name: Atmel SAMA5 [ 8831.245478] PC is at jffs2_link_node_ref+0xe/0xe2 [ 8831.250360] LR is at jffs2_link_node_ref+0xb9/0xe2 [ 8831.255473] pc : [<c020969c>] lr : [<c0209747>] psr: 00000033 [ 8831.261893] sp : c974dd78 ip : 00000000 fp : c09c6a5c [ 8831.267428] r10: c1572f18 r9 : 0000e002 r8 : c2849964 [ 8831.272801] r7 : 00001a44 r6 : 000ae5bc r5 : c14df4a8 r4 : c1620208 [ 8831.279647] r3 : 00000001 r2 : 40000000 r1 : c090ce3c r0 : 00000093 [ 8831.286340] Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA Thumb Segme nt none [ 8831.293974] Control: 50c53c7d Table: 21470059 DAC: 00000051 [ 8831.300025] Register r0 information: non-paged memory [ 8831.305239] Register r1 information: non-slab/vmalloc memory [ 8831.311220] Register r2 information: non-paged memory [ 8831.316427] Register r3 information: non-paged memory [ 8831.321630] Register r4 information: slab kmalloc-2k start c1620000 poin ter offset 520 size 2048 [ 8831.330796] Register r5 information: slab jffs2_refblock start c14df3e0 pointer offset 200 size 248 [ 8831.340219] Register r6 information: non-paged memory [ 8831.345422] Register r7 information: non-paged memory [ 8831.350784] Register r8 information: slab kmalloc-4k start c2849000 poin ter offset 2404 size 4096 [ 8831.360032] Register r9 information: non-paged memory [ 8831.365243] Register r10 information: slab pde_opener start c1572f18 poi nter offset 0 size 24 [ 8831.374137] Register r11 information: non-slab/vmalloc memory [ 8831.380041] Register r12 information: NULL pointer [ 8831.385140] Process mount (pid: 1142, stack limit = 0xb9d2bb39) [ 8831.391215] Stack: (0xc974dd78 to 0xc974e000) [ 8831.395884] dd60: 000ad578 000ae5bc [ 8831.404242] dd80: 000ad578 000ad578 c2849dec c2805e00 c1620208 c2849dec 00000000 c021130d [ 8831.412762] dda0: c1572d50 00000694 0000069c 00000036 00010000 c2937755 00000000 20061985 [ 8831.421280] ddc0: 0000069c 00000000 c2849964 c2805e00 c1620208 c2849964 0000069c 00000008 [ 8831.429799] dde0: 00000000 c2849ff8 c2849000 c020c40b c09c6a5c c974de34 c974de30 00000000 [ 8831.438154] de00: 00000004 0000069c 00001000 c2937bc0 00000208 c1620000 00000000 000a0000 [ 8831.446665] de20: 00000000 00000000 0000000a 00070000 c2849000 00000000 00001000 00000000 [ 8831.455184] de40: 00000000 00000000 00010000 c2805e00 00000000 00000000 00200000 c2937e00 [ 8831.463704] de60: 00000000 00008000 c0975a98 c020e02d c1001300 00000200 c020f3eb 00000dc0 [ 8831.472062] de80: c2937e00 00000000 c020f3eb c2805e00 c156fc80 c29b9400 00200000 c2937e00 [ 8831.480584] dea0: 00000000 c020f42d c29b9400 c156fc80 c09c6a64 c29b9400 c1298c00 c156fc80 [ 8831.489100] dec0: c020f5c9 c02a853d 0000003a 00000000 c156fc80 c020f5c9 c974df58 c02a85ff [ 8831.497458] dee0: c156fc80 c020f5c9 00000000 c020f5bf c156fc80 c156fc80 00000020 c016fdc3 [ 8831.505976] df00: c156fc80 00000000 c1374780 c0115c75 00000000 00000000 c156fc80 c0182209 [ 8831.514497] df20: c2937e00 c974df58 c2937f80 00000000 c2937e00 c2937f80 00008000 00000000 [ 8831.523015] df40: 0006d71a 00008000 00068991 c0182307 00000000 c015048b c1360d90 c18b3a18 [ 8831.531370] df60: 00001000 c2937f80 c2937e00 00000000 00000015 c0182539 00000000 0006d71a [ 8831.539881] df80: 0006d725 00000000 0006d70c 0006d71a 00000015 c010023c c1374780 00000015 [ 8831.548399] dfa0: 00068991 c0100041 00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000 [ 8831.556908] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991 [ 8831.565263] dfe0: b6ea606c beb82b9c 00039818 b6ea607c a0000010 0006d70c 00000000 00000000 [ 8831.573779] jffs2_link_node_ref from jffs2_sum_scan_sumnode+0x1a5/0x320 [ 8831.580860] jffs2_sum_scan_sumnode from jffs2_scan_medium+0x2d7/0xab0 [ 8831.587595] jffs2_scan_medium from jffs2_do_mount_fs+0xeb/0x356 [ 8831.593958] jffs2_do_mount_fs from jffs2_do_fill_super+0xf7/0x182 [ 8831.600332] jffs2_do_fill_super from mtd_get_sb+0x61/0x98 [ 8831.606176] mtd_get_sb from get_tree_mtd+0x4f/0xe8 [ 8831.611241] get_tree_mtd from vfs_get_tree+0x13/0x7c [ 8831.616648] vfs_get_tree from path_mount+0x409/0x4d4 [ 8831.621891] path_mount from do_mount+0x33/0x40 [ 8831.626593] do_mount from sys_mount+0xeb/0xfe [ 8831.631367] sys_mount from ret_fast_syscall+0x1/0x5c [ 8831.636591] Exception stack(0xc974dfa8 to 0xc974dff0) [ 8831.641961] dfa0: 00000000 0006d70c 0006d70c 0006d71a 0006d725 00008000 [ 8831.650314] dfc0: 00000000 0006d70c 0006d71a 00000015 00008000 00000000 0006d938 00068991 [ 8831.658818] dfe0: b6ea606c beb82b9c 00039818 b6ea607c [ 8831.664195] Code: 6a63 b085 990a b903 (de02) 6ae5 [ 8831.669135] ---[ end trace 0000000000000000 ]--- [ 8831.673902] Kernel panic - not syncing: Fatal exception
The previous aborted I/Os from a dd may also play a role here but the crash is clearly a cause of the interrupted transfers. It's a bit odd that it is an undefined instruction but probably it was a stack overflow.
Looking back in the SPI driver history I see some occasions where interrupt ible transfers were changed to non-interruptible transfers because filesystems l ike jffs2 used SIGKILL during unmount or maybe other reasons:
775c4c0032c4 "spi: stm32-qspi: remove signal sensitive on completion" 26cfc0dbe43a "spi-zynq-qspi: use wait_for_completion_timeout to make zynq_q spi_exec_mem_op not interruptible" 7f3ac71ac3b0 "spi: davinci: fix spurious i/o error"
Having them in spi-atmel now seems a bit off and even a source of misbehavi our and even crashes.
It's not just spi-atmel, any spi-mem controller might be tempted to use interruptible^Wkillable transfers just because the timeout values can be really big as the memory sizes increase.
One solution is to change the completion helpers back to something non-killable/non-interruptible, but the user experience will be slightly degraded. The other would be to look into jffs2 (if it's the only filesystem playing with signals during unmount, tbh I don't know). But maybe this signaling mechanism can't be hacked for compatibility reasons. Handling this at the spi level would be a mix of layers, I'm not ready for that.
Richard, Mark, what's your opinion here?
Thanks, Miquèl