I was benchmarking some compressors, piping to and from a network share on a NAS, and some consistently wrote corrupted data.
First, apologies in advance:
* if I'm not in the right place. I tried to follow the directions from the Regressions guide - https://www.kernel.org/doc/html/latest/admin-guide/reporting-regressions.htm...
* I know there's a ton of context I don't know
* I’m trying a different mail app, because the first one looked concussed with plain text. This might be worse.
The detailed description: I was benchmarking some compressors on Debian on a Raspberry Pi, piping to and from a network share on a NAS, and found that some consistently had issues writing to my NAS. Specifically:
* lzop
* pigz - parallel gzip
* pbzip2 - parallel bzip2
This is dependent on kernel version. I've done a survey, below.
While I tripped over the issue on a Debian port (Debian 12, bookworm, kernel v6.6), I compiled my own vanilla / mainline kernels for testing and reporting this.
Even more details: The Pi and the Synology NAS are directly connected by Gigabit Ethernet. Both sides are using self-assigned IP addresses. I'll note that at boot, getting the Pi to see the NAS requires some nudging of avahi-autoipd; while I think it's stable before testing, I'm not positive, and reconnection issues might be in play.
The files in question are tars of sparse file systems, about 270 gig, compressing down to 10-30 gig.
Compression seems to work, without complaint; decompression crashes the process, usually within the first gig of the compressed file. The output of the stream doesn't match what ends up written to disk.
Trying decompression during compression gets further along than it does after compression finishes; this might point toward something with writes and caches.
A previous attempt involved rpi-update, which:
* good: let me install kernels without building myself
* bad: updated the bootloader and firmware, to bleeding edge, with possible regressions; it definitely muddied the results of my tests
I started over with a fresh install, and no results involving rpi-update are included in this email.
A survey of major branches:
* 5.15.167, LTS - good
* 6.1.109, LTS - good
* 6.2.16 - good
* 6.3.13 - bad
* 6.4.16 - bad
* 6.5.13 - bad
* 6.6.50, LTS - bad
* 6.7.12 - bad
* 6.8.12 - bad
* 6.9.12 - bad
* 6.10.9 - good
* 6.11.0 - good
I tried, but couldn't fully build 4.19.322 or 6.0.19, due to issues with modules.
Important commits: It looked like both the breakage and the fix came in during rc1 releases.
Breakage, v6.3-rc1: I manually bisected commits in fs/smb* and fs/cifs.
3d78fe73fa12 cifs: Build the RDMA SGE list directly from an iterator
lzop and pigz worked. last working. test in progress: pbzip2
607aea3cc2a8 cifs: Remove unused code
lzop didn't work. first broken
Fix, v6.10-rc1: I manually bisected commits in fs/smb.
69c3c023af25 cifs: Implement netfslib hooks
lzop didn't work. last broken one
3ee1a1fc3981 cifs: Cut over to using netfslib
lzop, pigz, pbzip2, all worked. first fixed one
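For anyone retracing the bisection, the path-limited search can be sketched as below. This is only a sketch, not the commands actually used: `candidate_commits` is a hypothetical helper name, and the repository path, tags, and directories passed to it are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the commits between a known-good and a known-bad ref
# that touch the given paths, oldest first.
# candidate_commits is a hypothetical helper name, not from the original report.
candidate_commits() {
    local repo="$1" good="$2" bad="$3"
    shift 3
    # Remaining arguments are treated as paths, e.g. fs/cifs
    git -C "$repo" log --oneline --reverse "$good..$bad" -- "$@"
}
```

For example, `candidate_commits ~/linux v6.2 v6.3-rc1 fs/cifs` (with `~/linux` standing in for a kernel checkout) would list the commits to step through by hand, and `git bisect start v6.3-rc1 v6.2 -- fs/cifs` restricts an assisted bisection to the same paths.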
To test / reproduce: It looks like this, on a mounted network share, with extra pv for progress meters:
cat 1tb-rust-ext4.img.tar.gz | \
  gzip -d | \
  lzop -1 > \
  1tb-rust-ext4.img.tar.lzop
# wait 40 minutes
cat 1tb-rust-ext4.img.tar.lzop | \
  lzop -d | \
  sha1sum
# either it works, and shows the right checksum
# or it crashes early, due to a corrupt file, and shows an incorrect checksum
As I re-read this, I realize it might look like the compressor behaves differently. I added a "tee $output | sha1sum; sha1sum $output" and ran it on a broken version. The checksums from the pipe and for the file on disk are different.
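A minimal sketch of that tee-based check, assuming bash and GNU coreutils. `check_stream_vs_file` is a hypothetical helper name, not the exact command that was run.

```shell
#!/usr/bin/env bash
# Sketch: hash a stream while it is written to a file, then hash the file itself.
# check_stream_vs_file is a hypothetical helper name, not from the original report.
check_stream_vs_file() {
    local output="$1"
    local stream_sum file_sum
    # tee writes stdin to "$output" and forwards an identical copy to sha1sum
    stream_sum=$(tee "$output" | sha1sum | cut -d' ' -f1)
    # Second pass: read the file back and hash what that read returns
    file_sum=$(sha1sum < "$output" | cut -d' ' -f1)
    echo "stream: $stream_sum"
    echo "file:   $file_sum"
    # Exit status is 0 only when both checksums agree
    [ "$stream_sum" = "$file_sum" ]
}
```

It would slot in at the end of the compression pipeline, for example `gzip -d < 1tb-rust-ext4.img.tar.gz | lzop -1 | check_stream_vs_file 1tb-rust-ext4.img.tar.lzop`. A non-zero exit status means the file read back from the share differs from the stream.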
Assorted info: This is a Raspberry Pi 4, with 4 GiB RAM, running Debian 12, bookworm, or a port.
mount.cifs version: 7.0
# cat /proc/sys/kernel/tainted
1024
# cat /proc/version
Linux version 6.2.0-3d78fe73f-v8-pronoiac+ (pronoiac@bisect) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #21 SMP PREEMPT Thu Sep 19 16:51:22 PDT 2024
DebugData: /proc/fs/cifs/DebugData
Display Internal CIFS Data Structures for Debugging
---------------------------------------------------
CIFS Version 2.41
Features: DFS,FSCACHE,STATS2,DEBUG,ALLOW_INSECURE_LEGACY,CIFS_POSIX,UPCALL(SPNEGO),XATTR,ACL
CIFSMaxBufSize: 16384
Active VFS Requests: 1
Servers:
1) ConnectionId: 0x1 Hostname: drums.local
Number of credits: 8062 Dialect 0x300
TCP status: 1 Instance: 1
Local Users To Server: 1 SecMode: 0x1 Req On Wire: 2
In Send: 1 In MaxReq Wait: 0
Sessions:
1) Address: 169.254.132.219 Uses: 1 Capability: 0x300047 Session Status: 1
Security type: RawNTLMSSP SessionId: 0x4969841e
User: 1000 Cred User: 0
Shares:
0) IPC: \\drums.local\IPC$ Mounts: 1 DevInfo: 0x0 Attributes: 0x0
PathComponentMax: 0 Status: 1 type: 0 Serial Number: 0x0
Share Capabilities: None Share Flags: 0x0
tid: 0xeb093f0b Maximal Access: 0x1f00a9
1) \\drums.local\billions Mounts: 1 DevInfo: 0x20 Attributes: 0x5007f
PathComponentMax: 255 Status: 1 type: DISK Serial Number: 0x735a9af5
Share Capabilities: None Aligned, Partition Aligned, Share Flags: 0x0
tid: 0x5e6832e6 Optimal sector size: 0x200 Maximal Access: 0x1f01ff
MIDs:
State: 2 com: 9 pid: 3117 cbdata: 00000000e003293e mid 962892
State: 2 com: 9 pid: 3117 cbdata: 000000002610602a mid 962956
--
Let me know how I can help. The process of iterating can take hours, and it's not automated, so my resources are limited.
#regzbot introduced: 607aea3cc2a8
#regzbot fix: 3ee1a1fc3981
-James
Hi,
I checked 607aea3cc2a8: it only removes code inside #if 0 ... #endif, so this regression was not introduced by 607aea3cc2a8; only the reproduction frequency changed there.
Another issue in 6.6.y may be related: https://lore.kernel.org/linux-fsdevel/9e8f8872-f51b-4a09-a92c-49218748dd62@m...
Does this regression still happen after the following patches are applied?
a60cc288a1a2 :Luis Chamberlain: test_xarray: add tests for advanced multi-index use
a08c7193e4f1 :Sidhartha Kumar: mm/filemap: remove hugetlb special casing in filemap.c
6212eb4d7a63 :Hongbo Li: mm/filemap: avoid type conversion
de60fd8ddeda :Kairui Song: mm/filemap: return early if failed to allocate memory for split
b2ebcf9d3d5a :Kairui Song: mm/filemap: clean up hugetlb exclusion code
a4864671ca0b :Kairui Song: lib/xarray: introduce a new helper xas_get_order
6758c1128ceb :Kairui Song: mm/filemap: optimize filemap folio adding
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2024/09/23
Hey there -
On Sun, Sep 22, 2024 at 4:55 PM Wang Yugui wangyugui@e16-tech.com wrote:
> Hi,
>> I was benchmarking some compressors, piping to and from a network share on a NAS, and some consistently wrote corrupted data.
>> Important commits: It looked like both the breakage and the fix came in during rc1 releases.
>> Breakage, v6.3-rc1: I manually bisected commits in fs/smb* and fs/cifs.
>> 3d78fe73fa12 cifs: Build the RDMA SGE list directly from an iterator
>> lzop and pigz worked. last working. test in progress: pbzip2
This is a first for me: lzop was fine, but pbzip2 still had issues, roughly a clock hour into compression. (When lzop has issues, it's usually within a minute or two.)
>> 607aea3cc2a8 cifs: Remove unused code
>> lzop didn't work. first broken
>> Fix, v6.10-rc1: I manually bisected commits in fs/smb.
>> 69c3c023af25 cifs: Implement netfslib hooks
>> lzop didn't work. last broken one
>> 3ee1a1fc3981 cifs: Cut over to using netfslib
>> lzop, pigz, pbzip2, all worked. first fixed one
> I checked 607aea3cc2a8: it only removes code inside #if 0 ... #endif, so this regression was not introduced by 607aea3cc2a8; only the reproduction frequency changed there.
I agree. Going by the pbzip2 results above, the commit my bisection landed on marks when the issue became more frequent, not when it started.
I could re-run tests and dig into possible false negatives. It'll be slower going, though.
> Another issue in 6.6.y may be related: https://lore.kernel.org/linux-fsdevel/9e8f8872-f51b-4a09-a92c-49218748dd62@m...
In comparison: I'm relieved that my issue is something that can be tested within hours, on one device.
> Does this regression still happen after the following patches are applied?
> a60cc288a1a2 :Luis Chamberlain: test_xarray: add tests for advanced multi-index use
> a08c7193e4f1 :Sidhartha Kumar: mm/filemap: remove hugetlb special casing in filemap.c
> 6212eb4d7a63 :Hongbo Li: mm/filemap: avoid type conversion
> de60fd8ddeda :Kairui Song: mm/filemap: return early if failed to allocate memory for split
> b2ebcf9d3d5a :Kairui Song: mm/filemap: clean up hugetlb exclusion code
> a4864671ca0b :Kairui Song: lib/xarray: introduce a new helper xas_get_order
> 6758c1128ceb :Kairui Song: mm/filemap: optimize filemap folio adding
No luck: I cherry-picked those commits into 6.6.52, and upon testing lzop, the file didn't match the stream, and decompression failed.
Thank you for investigating, and giving me something to try!
-James
On request:
* adding another cc for Steven
* I tested 6.6.52, without any extra commits: it was bad.
-James
I retraced my steps:
* looking for the breaking commit, between 6.2 and 6.3-rc1
* I switched to checksumming the stream and the written file; this can save time, compared to decompression
* I checked for lzop, pigz, and pbzip2
So, breakage. I landed on different commits:
last working commit. ok: lzop, pigz, pbzip2.
16541195c6d9 cifs: Add a function to read into an iter from a socket
first broken commit. lzop failed.
d08089f649a0 cifs: Change the I/O paths to use an iterator rather than a page list
That broken commit is right before my previous "last good" and "break".
I'm seeing some inconsistencies. I'd *thought* I was careful with dtb files and .config; I might have dropped the ball occasionally, or there's something else, I don't know what, that I'm stumbling over.
To check for marginal hardware, I tried another Raspberry Pi 4. I verified baseline 6.6.52 didn't work there, and stopped there. It doesn't have any cooling; it *almost certainly* would throttle for thermal reasons, but I didn't want to push it.
-James
linux-stable-mirror@lists.linaro.org