erofs readahead can fail with ENOMEM under memory pressure because it allocates pages with GFP_NOWAIT | __GFP_NORETRY, while a regular read uses GFP_KERNEL. If readahead fails (leaving folios non-uptodate), the original request falls back to a synchronous read, and `.read_folio()` should then return an appropriate errno.

However, when readahead and read operations race, the read operation can return an unintended EIO because of incorrect error propagation.

To resolve this, change the behavior so that, when a PCL serves a read (i.e. pcl->besteffort is true), the actual decompression is attempted instead of propagating the previous PCL's error, except for an initial EIO.
- Page size: 4K
- The original size of FileA: 16K
- Compression ratio per PCL: 50% (uncompressed 8K -> compressed 4K)

  [page0, page1] [page2, page3]
  [PCL0]---------[PCL1]

- Function declarations:
  . pread(fd, buf, count, offset)
  . readahead(fd, offset, count)
- Thread A tries to read the last 4K
- Thread B tries to readahead 8K starting at offset 4K
- RA: besteffort == false
- R:  besteffort == true
<process A>                             <process B>

pread(FileA, buf, 4K, 12K)
do readahead(page3)
// failed with ENOMEM
wait_lock(page3)
if (!uptodate(page3))
        goto do_read
                                        readahead(FileA, 4K, 8K)
                                        // Here, create a PCL-chain like below:
                                        // [null, page1] [page2, null]
                                        // [PCL0:RA]-----[PCL1:RA]
...
do read(page3)
// found [PCL1:RA] and add page3 into it,
// and then change PCL1 from RA to R
...
// Now, the PCL-chain is as below:
// [null, page1] [page2, page3]
// [PCL0:RA]-----[PCL1:R]
                                        // try to decompress the PCL-chain...
                                        z_erofs_decompress_queue
                                          err = 0;
                                          // failed with ENOMEM, so page1,
                                          // which is only for RA, will not
                                          // be uptodate; that is okay.
                                          err = decompress([PCL0:RA], err)
                                          // However, ENOMEM is propagated to
                                          // the next PCL, even though that
                                          // PCL is not only for RA but also
                                          // for R. As a result, it fails
                                          // with ENOMEM without attempting
                                          // any decompression, so page2 and
                                          // page3 will not be uptodate.
                        ** BUG HERE ** -> err = decompress([PCL1:R], err)
                                          return err as ENOMEM
...
wait_lock(page3)
if (!uptodate(page3))
        return EIO  <-- Returns an unexpected EIO!
...
Fixes: 2349d2fa02db ("erofs: sunset unneeded NOFAILs")
Cc: stable@vger.kernel.org
Reviewed-by: Jaewook Kim <jw5454.kim@samsung.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Junbeom Yeom <junbeom.yeom@samsung.com>
---
 fs/erofs/zdata.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 27b1f44d10ce..86bf6e087d34 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1414,11 +1414,15 @@ static int z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
 	};
 	struct z_erofs_pcluster *next;
 	int err = io->eio ? -EIO : 0;
+	int io_err = err;
 
 	for (; be.pcl != Z_EROFS_PCLUSTER_TAIL; be.pcl = next) {
+		int propagate_err;
+
 		DBG_BUGON(!be.pcl);
 		next = READ_ONCE(be.pcl->next);
-		err = z_erofs_decompress_pcluster(&be, err) ?: err;
+		propagate_err = READ_ONCE(be.pcl->besteffort) ? io_err : err;
+		err = z_erofs_decompress_pcluster(&be, propagate_err) ?: err;
 	}
 	return err;
 }
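To make the propagation logic easy to poke at in isolation, below is a minimal userspace model of the chain walk. It is only a sketch: struct pcl, decompress(), and the two-entry chain are simplified stand-ins for the kernel's pcluster machinery, not the real code. It builds with gcc as-is (the `?:` shorthand is the same GNU C extension the kernel loop uses):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct z_erofs_pcluster. */
struct pcl {
	const char *name;
	bool besteffort;	/* true: also serves a synchronous read */
	int own_err;		/* error this pcluster itself would hit */
};

/* Stand-in for z_erofs_decompress_pcluster(): skips the work entirely
 * when an error is propagated in, otherwise reports its own status. */
static int decompress(const struct pcl *p, int prev_err)
{
	if (prev_err) {
		printf("%s: skipped (inherited %d)\n", p->name, prev_err);
		return prev_err;
	}
	printf("%s: decompressed (%d)\n", p->name, p->own_err);
	return p->own_err;
}

int main(void)
{
	const struct pcl chain[] = {
		{ "PCL0 (RA)", false, -ENOMEM },  /* readahead alloc failed */
		{ "PCL1 (R)",  true,  0 },        /* sync read, would succeed */
	};
	int io_err = 0;	/* io->eio was false: the device I/O succeeded */
	int err, i;

	puts("old logic:");
	for (err = 0, i = 0; i < 2; i++)
		err = decompress(&chain[i], err) ?: err;

	puts("patched logic:");
	for (err = 0, i = 0; i < 2; i++) {
		int propagate = chain[i].besteffort ? io_err : err;

		err = decompress(&chain[i], propagate) ?: err;
	}
	return 0;
}

With the old loop, PCL1 inherits PCL0's -ENOMEM and is skipped, which corresponds to the non-uptodate page3 above; with the patched loop, PCL1 only inherits the initial I/O status and gets decompressed, while -ENOMEM is still accumulated for the caller.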
Hi Junbeom,
On 2025/12/19 15:10, Junbeom Yeom wrote:
> [...]
Many thanks for the report! It's indeed a new issue to me.
> [...]
>
>  	for (; be.pcl != Z_EROFS_PCLUSTER_TAIL; be.pcl = next) {
> +		int propagate_err;
> +
>  		DBG_BUGON(!be.pcl);
>  		next = READ_ONCE(be.pcl->next);
> -		err = z_erofs_decompress_pcluster(&be, err) ?: err;
> +		propagate_err = READ_ONCE(be.pcl->besteffort) ? io_err : err;
> +		err = z_erofs_decompress_pcluster(&be, propagate_err) ?: err;
I wonder if it's just possible to decompress each pcluster according to
the I/O status only (without bothering with the previous pcluster
status), like:

	err = z_erofs_decompress_pcluster(&be, io->eio) ?: err;

and change the second argument of z_erofs_decompress_pcluster() to bool,
so that we could leverage the successful I/O as much as possible.
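If it helps, here is that variant on the toy model from the sketch after the patch above (it reuses that sketch's struct pcl and decompress(); the bool parameter mirrors the suggested signature change and is not the current kernel API):

/* Attempt every pcluster based only on the device I/O status, so
 * earlier pclusters' software failures no longer gate later ones. */
static int decompress_queue(const struct pcl *chain, int n, bool eio)
{
	int err = eio ? -EIO : 0;
	int i;

	for (i = 0; i < n; i++)
		err = decompress(&chain[i], eio ? -EIO : 0) ?: err;
	return err;
}

On the two-entry chain from that sketch, this attempts both pclusters and still returns -ENOMEM to the caller.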
Thanks,
Gao Xiang
>  	}
>  	return err;
>  }
Hi Xiang,
> Hi Junbeom,
>
> On 2025/12/19 15:10, Junbeom Yeom wrote:
>> [...]
> Many thanks for the report! It's indeed a new issue to me.
>> [...]
>> -		err = z_erofs_decompress_pcluster(&be, err) ?: err;
>> +		propagate_err = READ_ONCE(be.pcl->besteffort) ? io_err : err;
>> +		err = z_erofs_decompress_pcluster(&be, propagate_err) ?: err;
>
> I wonder if it's just possible to decompress each pcluster according to
> the I/O status only (without bothering with the previous pcluster
> status), like:
>
> 	err = z_erofs_decompress_pcluster(&be, io->eio) ?: err;
>
> and change the second argument of z_erofs_decompress_pcluster() to bool,
> so that we could leverage the successful I/O as much as possible.
Oh, I thought you were intending to address error propagation. If that's not the case, I also believe the approach you're suggesting is better. I'll send the next version.
Thanks,
Junbeom Yeom

> Thanks,
> Gao Xiang
>
>>  	}
>>  	return err;
>>  }
On 2025/12/19 17:47, Junbeom Yeom wrote:
> Hi Xiang,
>
>> Hi Junbeom,
>>
>> On 2025/12/19 15:10, Junbeom Yeom wrote:
>>> [...]
>> Many thanks for the report! It's indeed a new issue to me.
>>> [...]
>>> -		err = z_erofs_decompress_pcluster(&be, err) ?: err;
>>> +		propagate_err = READ_ONCE(be.pcl->besteffort) ? io_err : err;
>>> +		err = z_erofs_decompress_pcluster(&be, propagate_err) ?: err;
>>
>> I wonder if it's just possible to decompress each pcluster according to
>> the I/O status only (without bothering with the previous pcluster
>> status), like:
>>
>> 	err = z_erofs_decompress_pcluster(&be, io->eio) ?: err;
>>
>> and change the second argument of z_erofs_decompress_pcluster() to bool,
>> so that we could leverage the successful I/O as much as possible.
> Oh, I thought you were intending to address error propagation.
We could still propagate errors (-ENOMEM) to the callers, but for the
case you mentioned, I still think it's useful to handle the following
pclusters if the disk I/Os are successful. It still addresses the issue
you mentioned, and I think it's also cleaner.
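(As an aside, the `a ?: b` accumulation this relies on is GNU C's conditional with an omitted middle operand: it evaluates to a when a is nonzero, otherwise to b. A tiny standalone check, with the per-pcluster outcomes below made up purely for illustration:)

#include <errno.h>
#include <stdio.h>

int main(void)
{
	int results[] = { -ENOMEM, 0, 0 };	/* made-up pcluster outcomes */
	int err = 0, i;

	for (i = 0; i < 3; i++)
		err = results[i] ?: err;	/* a nonzero result replaces err */

	printf("caller sees %d\n", err);	/* -ENOMEM, yet every entry ran */
	return 0;
}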
> If that's not the case, I also believe the approach you're suggesting
> is better. I'll send the next version.
Thank you for the effort!
Thanks,
Gao Xiang

> Thanks,
> Junbeom Yeom
>
>> Thanks,
>> Gao Xiang
>>
>>>  	}
>>>  	return err;
>>>  }