When fscrypt is enabled, move_dirty_folio_in_page_array() may fail because it needs to allocate bounce buffers to store the encrypted versions of each folio. Each folio beyond the first allocates its bounce buffer with GFP_NOWAIT. Failures are common (and expected) under this allocation mode; they should flush (not abort) the batch.
However, ceph_process_folio_batch() uses the same `rc` variable for its own return code and for capturing the return codes of its routine calls; failing to reset `rc` back to 0 results in the error being propagated out to the main writeback loop, which cannot actually tolerate any errors here: once `ceph_wbc.pages` is allocated, it must be passed to ceph_submit_write() to be freed. If it survives until the next iteration (e.g. due to the goto being followed), ceph_allocate_page_array()'s BUG_ON() will oops the worker. (Subsequent patches in this series make the loop more robust.)
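For clarity, a paraphrased sketch of the relevant control flow (loop bounds and local names are approximated here, not copied from fs/ceph/addr.c; only the `rc` handling matters):

	int rc = 0;

	for (i = 0; i < nr_folios; i++) {		/* simplified: walk the locked folio batch */
		struct folio *folio = folios[i];	/* approximated; the real code pulls from a folio batch */

		rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc, folio);
		if (rc) {
			/* Expected GFP_NOWAIT failure: flush what we already have... */
			folio_redirty_for_writepage(wbc, folio);
			folio_unlock(folio);
			break;				/* ...but rc still holds the error */
		}
	}

	return rc;	/* without the reset, the error escapes to the writeback loop,
			 * which then skips ceph_submit_write() and can later trip
			 * ceph_allocate_page_array()'s BUG_ON() */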
Note that this failure mode is currently masked due to another bug (addressed later in this series) that prevents multiple encrypted folios from being selected for the same write.
For now, just reset `rc` when redirtying the folio and prevent the error from propagating. After this change, ceph_process_folio_batch() no longer returns errors; its only remaining failure indicator is `locked_pages == 0`, which the caller already handles correctly. The next patch in this series addresses this.
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc,
 						    ceph_wbc, folio);
 		if (rc) {
+			rc = 0;
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
When fscrypt is enabled, move_dirty_folio_in_page_array() may fail because it needs to allocate bounce buffers to store the encrypted versions of each folio. Each folio beyond the first allocates its bounce buffer with GFP_NOWAIT. Failures are common (and expected) under this allocation mode; they should flush (not abort) the batch.
However, ceph_process_folio_batch() uses the same `rc` variable for its own return code and for capturing the return codes of its routine calls; failing to reset `rc` back to 0 results in the error being propagated out to the main writeback loop, which cannot actually tolerate any errors here: once `ceph_wbc.pages` is allocated, it must be passed to ceph_submit_write() to be freed. If it survives until the next iteration (e.g. due to the goto being followed), ceph_allocate_page_array()'s BUG_ON() will oops the worker. (Subsequent patches in this series make the loop more robust.)
I think you are right with the fix. We have the loop here, and if we have already moved some dirty folios, then we should flush them. But what if we fail on the very first folio? Should we return no error code in that case as well?
Note that this failure mode is currently masked due to another bug (addressed later in this series) that prevents multiple encrypted folios from being selected for the same write.
So, maybe this patch is not placed correctly in the series order? It would be good to see a reproduction of the issue and the symptoms it produces. Do you have a reproduction script and a call trace for the issue?
For now, just reset `rc` when redirtying the folio and prevent the error from propagating. After this change, ceph_process_folio_batch() no longer returns errors; its only remaining failure indicator is `locked_pages == 0`, which the caller already handles correctly. The next patch in this series addresses this.
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc,
 						    ceph_wbc, folio);
 		if (rc) {
+			rc = 0;
I like the fix but I would like to clarify the above questions at first.
Thanks, Slava.
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
On Mon, Jan 5, 2026 at 12:24 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
When fscrypt is enabled, move_dirty_folio_in_page_array() may fail because it needs to allocate bounce buffers to store the encrypted versions of each folio. Each folio beyond the first allocates its bounce buffer with GFP_NOWAIT. Failures are common (and expected) under this allocation mode; they should flush (not abort) the batch.
However, ceph_process_folio_batch() uses the same `rc` variable for its own return code and for capturing the return codes of its routine calls; failing to reset `rc` back to 0 results in the error being propagated out to the main writeback loop, which cannot actually tolerate any errors here: once `ceph_wbc.pages` is allocated, it must be passed to ceph_submit_write() to be freed. If it survives until the next iteration (e.g. due to the goto being followed), ceph_allocate_page_array()'s BUG_ON() will oops the worker. (Subsequent patches in this series make the loop more robust.)
Hi Slava,
I think you are right with the fix. We have the loop here, and if we have already moved some dirty folios, then we should flush them. But what if we fail on the very first folio? Should we return no error code in that case as well?
The case you ask about, where move_dirty_folio_in_page_array() returns an error for the first folio, is currently not possible:

1) The only error code that move_dirty_folio_in_page_array() can propagate is from fscrypt_encrypt_pagecache_blocks(), which it calls with GFP_NOFS for the first folio. The latter function's doc comment outright states:

 * The bounce page allocation is mempool-backed, so it will always succeed when
 * @gfp_flags includes __GFP_DIRECT_RECLAIM, e.g. when it's GFP_NOFS.

2) The error return isn't even reachable for the first folio because of the BUG_ON(ceph_wbc->locked_pages == 0); line.
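To make those two points concrete, here is a paraphrased sketch of the allocation-mode choice inside move_dirty_folio_in_page_array(); the fscrypt_encrypt_pagecache_blocks() argument list and return type are approximated from its documentation rather than copied from the tree:

	/* First folio: GFP_NOFS (includes __GFP_DIRECT_RECLAIM, so the mempool-backed
	 * bounce page allocation cannot fail). Later folios: GFP_NOWAIT, which is
	 * allowed to fail and should only flush the batch. */
	gfp_t gfp_flags = ceph_wbc->locked_pages ? GFP_NOWAIT : GFP_NOFS;
	struct page *bounce_page;

	bounce_page = fscrypt_encrypt_pagecache_blocks(folio, len, 0, gfp_flags);
	if (IS_ERR(bounce_page)) {
		/* "better not fail on first page!" */
		BUG_ON(ceph_wbc->locked_pages == 0);
		return PTR_ERR(bounce_page);	/* only reachable for GFP_NOWAIT folios */
	}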
Note that this failure mode is currently masked due to another bug (addressed later in this series) that prevents multiple encrypted folios from being selected for the same write.
So, maybe this patch is not placed correctly in the series order?
This crash is unmasked by patch 5 of this series. (It allows multiple folios to be batched when fscrypt is enabled.) Patch 5 has no hard dependency on anything else in this series, so it could -- in principle -- be ordered first as you suggest. However, that ordering would deliberately cause a regression in kernel stability, even if only briefly. That's not considered good practice in my view, as it may affect people who are trying to bisect and regression test. So the ordering of this series is: fix the crash in the unused code first, then fix the bug that makes it unused.
It would be good to see a reproduction of the issue and the symptoms it produces. Do you have a reproduction script and a call trace for the issue?
Fair point!
Function inlining makes the call trace not very interesting:

Call trace:
 ceph_writepages_start+0x16ec/0x18e0 [ceph] ()
 do_writepages+0xb0/0x1c0
 __writeback_single_inode+0x4c/0x4d8
 writeback_sb_inodes+0x238/0x4c8
 __writeback_inodes_wb+0x64/0x120
 wb_writeback+0x320/0x3e8
 wb_workfn+0x42c/0x518
 process_one_work+0x17c/0x428
 worker_thread+0x260/0x390
 kthread+0x148/0x240
 ret_from_fork+0x10/0x20
Code: 34ffdee0 52800020 3903e7e0 17fffef4 (d4210000)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception
ceph_writepages_start+0x16ec corresponds to linux-6.18.2/fs/ceph/addr.c:1222
However, these repro steps should work:

1) Apply patch 5 from this series (and no other patches)
2) Mount CephFS and activate fscrypt
3) Copy a large directory into the CephFS mount
4) After dozens of GBs transferred, you should observe the above kernel oops
Warm regards, Sam
For now, just reset `rc` when redirtying the folio and prevent the error from propagating. After this change, ceph_process_folio_batch() no longer returns errors; its only remaining failure indicator is `locked_pages == 0`, which the caller already handles correctly. The next patch in this series addresses this.
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc,
 						    ceph_wbc, folio);
 		if (rc) {
+			rc = 0;

I like the fix but I would like to clarify the above questions at first.
Thanks, Slava.
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
On Mon, 2026-01-05 at 22:52 -0800, Sam Edwards wrote:
On Mon, Jan 5, 2026 at 12:24 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
When fscrypt is enabled, move_dirty_folio_in_page_array() may fail because it needs to allocate bounce buffers to store the encrypted versions of each folio. Each folio beyond the first allocates its bounce buffer with GFP_NOWAIT. Failures are common (and expected) under this allocation mode; they should flush (not abort) the batch.
However, ceph_process_folio_batch() uses the same `rc` variable for its own return code and for capturing the return codes of its routine calls; failing to reset `rc` back to 0 results in the error being propagated out to the main writeback loop, which cannot actually tolerate any errors here: once `ceph_wbc.pages` is allocated, it must be passed to ceph_submit_write() to be freed. If it survives until the next iteration (e.g. due to the goto being followed), ceph_allocate_page_array()'s BUG_ON() will oops the worker. (Subsequent patches in this series make the loop more robust.)
Hi Slava,
I think you are right with the fix. We have the loop here, and if we have already moved some dirty folios, then we should flush them. But what if we fail on the very first folio? Should we return no error code in that case as well?
The case you ask about, where move_dirty_folio_in_page_array() returns an error for the first folio, is currently not possible:

1) The only error code that move_dirty_folio_in_page_array() can propagate is from fscrypt_encrypt_pagecache_blocks(), which it calls with GFP_NOFS for the first folio. The latter function's doc comment outright states:

 * The bounce page allocation is mempool-backed, so it will always succeed when
 * @gfp_flags includes __GFP_DIRECT_RECLAIM, e.g. when it's GFP_NOFS.

2) The error return isn't even reachable for the first folio because of the BUG_ON(ceph_wbc->locked_pages == 0); line.
Unfortunately, the kernel code is not completely stable; we cannot rely on a particular state of the code. The code should be stable, robust enough, and ready for different situations. The mentioned BUG_ON() could be removed during refactoring, because we already have the comment there: "better not fail on first page!". Also, the behavior of fscrypt_encrypt_pagecache_blocks() could change too. So we need to expect any bad situation, and this is why I prefer to manage such potential (and maybe not so potential) erroneous situations.
Note that this failure mode is currently masked due to another bug (addressed later in this series) that prevents multiple encrypted folios from being selected for the same write.
So, maybe this patch is not placed correctly in the series order?
This crash is unmasked by patch 5 of this series. (It allows multiple folios to be batched when fscrypt is enabled.) Patch 5 has no hard dependency on anything else in this series, so it could -- in principle -- be ordered first as you suggest. However, that ordering would deliberately cause a regression in kernel stability, even if only briefly. That's not considered good practice in my view, as it may affect people who are trying to bisect and regression test. So the ordering of this series is: fix the crash in the unused code first, then fix the bug that makes it unused.
OK, your point sounds confusing, frankly speaking. If we cannot reproduce the issue because another bug hides it, then no such issue exists, and we don't need to fix anything. So, from a logical point of view, we need to fix the first bug, then we can reproduce the hidden issue, and, finally, the fix makes sense.
I didn't suggest making patch 5 the first one. But I believe that this patch should follow patch 5.
It would be good to see a reproduction of the issue and the symptoms it produces. Do you have a reproduction script and a call trace for the issue?
Fair point!
Function inlining makes the call trace not very interesting:

Call trace:
 ceph_writepages_start+0x16ec/0x18e0 [ceph] ()
 do_writepages+0xb0/0x1c0
 __writeback_single_inode+0x4c/0x4d8
 writeback_sb_inodes+0x238/0x4c8
 __writeback_inodes_wb+0x64/0x120
 wb_writeback+0x320/0x3e8
 wb_workfn+0x42c/0x518
 process_one_work+0x17c/0x428
 worker_thread+0x260/0x390
 kthread+0x148/0x240
 ret_from_fork+0x10/0x20
Code: 34ffdee0 52800020 3903e7e0 17fffef4 (d4210000)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception
ceph_writepages_start+0x16ec corresponds to linux-6.18.2/fs/ceph/addr.c:1222
However, these repro steps should work:
1) Apply patch 5 from this series (and no other patches)
2) Mount CephFS and activate fscrypt
3) Copy a large directory into the CephFS mount
4) After dozens of GBs transferred, you should observe the above kernel oops
Could we have all of these details in the commit message?
Thanks, Slava.
Warm regards, Sam
For now, just reset `rc` when redirtying the folio and prevent the error from propagating. After this change, ceph_process_folio_batch() no longer returns errors; its only remaining failure indicator is `locked_pages == 0`, which the caller already handles correctly. The next patch in this series addresses this.
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc,
 						    ceph_wbc, folio);
 		if (rc) {
+			rc = 0;

I like the fix but I would like to clarify the above questions at first.
Thanks, Slava.
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;
On Tue, Jan 6, 2026 at 1:08 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Mon, 2026-01-05 at 22:52 -0800, Sam Edwards wrote:
On Mon, Jan 5, 2026 at 12:24 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
When fscrypt is enabled, move_dirty_folio_in_page_array() may fail because it needs to allocate bounce buffers to store the encrypted versions of each folio. Each folio beyond the first allocates its bounce buffer with GFP_NOWAIT. Failures are common (and expected) under this allocation mode; they should flush (not abort) the batch.
However, ceph_process_folio_batch() uses the same `rc` variable for its own return code and for capturing the return codes of its routine calls; failing to reset `rc` back to 0 results in the error being propagated out to the main writeback loop, which cannot actually tolerate any errors here: once `ceph_wbc.pages` is allocated, it must be passed to ceph_submit_write() to be freed. If it survives until the next iteration (e.g. due to the goto being followed), ceph_allocate_page_array()'s BUG_ON() will oops the worker. (Subsequent patches in this series make the loop more robust.)
Hi Slava,
I think you are right with the fix. We have the loop here, and if we have already moved some dirty folios, then we should flush them. But what if we fail on the very first folio? Should we return no error code in that case as well?
The case you ask about, where move_dirty_folio_in_page_array() returns an error for the first folio, is currently not possible:

1) The only error code that move_dirty_folio_in_page_array() can propagate is from fscrypt_encrypt_pagecache_blocks(), which it calls with GFP_NOFS for the first folio. The latter function's doc comment outright states:

 * The bounce page allocation is mempool-backed, so it will always succeed when
 * @gfp_flags includes __GFP_DIRECT_RECLAIM, e.g. when it's GFP_NOFS.

2) The error return isn't even reachable for the first folio because of the BUG_ON(ceph_wbc->locked_pages == 0); line.
Good day Slava,
Unfortunately, the kernel code is not completely stable; we cannot rely on a particular state of the code. The code should be stable, robust enough, and ready for different situations.
You describe "defensive programming." I fully agree and am a strong advocate for it, but each defensive measure comes with a complexity cost. A skilled defensive programmer evaluates the likelihood of each failure and invests that cost only where it's most warranted.
The mentioned BUG_ON() could be removed during refactoring, because we already have the comment there: "better not fail on first page!".
If the question is "What happens if the first folio fails when the BUG_ON is removed?" then my answer is: that is the responsibility of the person removing it. I am leaving the BUG_ON() in place.
Also, the behavior of fscrypt_encrypt_pagecache_blocks() could change too.
Changing that would alter the contract of fscrypt_encrypt_pagecache_blocks(). Contracts can evolve, but anyone making such a change must audit all call sites to ensure nothing breaks. Today, this is purely hypothetical; the function is not being changed. Speculating about behavior under a different, unimplemented contract is not a basis for complicating the current code.
So we need to expect any bad situation, and this is why I prefer to manage such potential (and maybe not so potential) erroneous situations.
This point is moot. Even if move_dirty_folio_in_page_array() somehow returned nonzero on the first folio, ceph_process_folio_batch() would simply lock zero folios, which ceph_writepages_start() already handles gracefully.
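For reference, the caller-side handling I mean is roughly the following; this is a sketch assumed from the facts in this thread (the ceph_process_folio_batch() argument list and the exact loop-control statement are approximations, not the literal ceph_writepages_start() code):

	rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
	if (ceph_wbc.locked_pages == 0) {
		/* Nothing was moved into the page array: there is nothing to
		 * submit and nothing to free, so the writeback loop simply
		 * moves on to the next batch. */
		continue;
	}
	/* Otherwise ceph_wbc.pages must reach ceph_submit_write(), which frees it. */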
Note that this failure mode is currently masked due to another bug (addressed later in this series) that prevents multiple encrypted folios from being selected for the same write.
So, maybe this patch is not placed correctly in the series order?
This crash is unmasked by patch 5 of this series. (It allows multiple folios to be batched when fscrypt is enabled.) Patch 5 has no hard dependency on anything else in this series, so it could -- in principle -- be ordered first as you suggest. However, that ordering would deliberately cause a regression in kernel stability, even if only briefly. That's not considered good practice in my view, as it may affect people who are trying to bisect and regression test. So the ordering of this series is: fix the crash in the unused code first, then fix the bug that makes it unused.
OK, your point sounds confusing, frankly speaking. If we cannot reproduce the issue because another bug hides it, then no such issue exists, and we don't need to fix anything. So, from a logical point of view, we need to fix the first bug, then we can reproduce the hidden issue, and, finally, the fix makes sense.
With respect, that reasoning is flawed and not appropriate for a technical discussion. The crash in question cannot currently occur, but that does *not* make the fix unnecessary. Patch #5 in this series will re-enable the code path, at which point the crash becomes possible. Addressing it now ensures correctness and avoids introducing a regression. Attempting to "see it happen in the wild" before fixing it is neither required nor acceptable practice.
We are not uncertain about the crash: I have provided steps to reproduce it. You can apply patch #5 before #1 *in your own tree* to observe the crash if that helps you evaluate the patches. *But under no circumstances should this be done in mainline!* Introducing a crash upstream, even transiently, is strictly disallowed, and suggesting otherwise is not appropriate behavior for a Linux kernel developer.
I didn't suggest making patch 5 the first one. But I believe that this patch should follow patch 5.
As I explained, putting patch #5 before this one would deliberately introduce a regression -- a crash. Triggering this in mainline is not allowed by kernel development policy [1]; there is no exception for "transient regressions that are fixed immediately afterward." A regression is a regression.
It would be good to see a reproduction of the issue and the symptoms it produces. Do you have a reproduction script and a call trace for the issue?
Fair point!
Function inlining makes the call trace not very interesting:

Call trace:
 ceph_writepages_start+0x16ec/0x18e0 [ceph] ()
 do_writepages+0xb0/0x1c0
 __writeback_single_inode+0x4c/0x4d8
 writeback_sb_inodes+0x238/0x4c8
 __writeback_inodes_wb+0x64/0x120
 wb_writeback+0x320/0x3e8
 wb_workfn+0x42c/0x518
 process_one_work+0x17c/0x428
 worker_thread+0x260/0x390
 kthread+0x148/0x240
 ret_from_fork+0x10/0x20
Code: 34ffdee0 52800020 3903e7e0 17fffef4 (d4210000)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception
ceph_writepages_start+0x16ec corresponds to linux-6.18.2/fs/ceph/addr.c:1222
However, these repro steps should work:
1) Apply patch 5 from this series (and no other patches)
2) Mount CephFS and activate fscrypt
3) Copy a large directory into the CephFS mount
4) After dozens of GBs transferred, you should observe the above kernel oops
Could we have all of these details in the commit message?
Would this truly help future readers, or just create noise? The commit message already explains the exact execution path to the BUG_ON()/oops, which is what really matters; call traces are secondary. I did not want to imply that readers cannot understand the seriousness of the issue without a crash log. I will include these details if the group consensus prefers it, but I am otherwise opposed.
Hope you and yours are well, Sam
[1] See the "no regressions" rule: https://docs.kernel.org/admin-guide/reporting-regressions.html
Thanks, Slava.
Warm regards, Sam
For now, just reset `rc` when redirtying the folio and prevent the error from propagating. After this change, ceph_process_folio_batch() no longer returns errors; its only remaining failure indicator is `locked_pages == 0`, which the caller already handles correctly. The next patch in this series addresses this.
Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 63b75d214210..3462df35d245 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1369,6 +1369,7 @@ int ceph_process_folio_batch(struct address_space *mapping,
 		rc = move_dirty_folio_in_page_array(mapping, wbc,
 						    ceph_wbc, folio);
 		if (rc) {
+			rc = 0;

I like the fix but I would like to clarify the above questions at first.
Thanks, Slava.
 			folio_redirty_for_writepage(wbc, folio);
 			folio_unlock(folio);
 			break;