CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size.
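To illustrate the layout, here is a simplified userspace sketch (the struct and the map_file_offset() helper are illustrative stand-ins, not the kernel's striping code) of how a file offset maps to an object and to a stripe unit within that object:

#include <stdint.h>
#include <stdio.h>

/* Simplified layout parameters, analogous to the fields of ceph_file_layout. */
struct layout {
	uint64_t stripe_unit;	/* bytes per stripe unit                          */
	uint64_t stripe_count;	/* objects spanned by one stripe; 1 = unstriped   */
	uint64_t object_size;	/* bytes per RADOS object                         */
};

/*
 * Hypothetical helper: given a file offset, compute which object the byte
 * lands in and which stripe unit within that object holds it.
 */
static void map_file_offset(const struct layout *l, uint64_t off,
			    uint64_t *obj, uint64_t *su_in_obj)
{
	uint64_t su_per_obj = l->object_size / l->stripe_unit;
	uint64_t su_index   = off / l->stripe_unit;       /* global stripe unit #  */
	uint64_t stripe_no  = su_index / l->stripe_count; /* which stripe (row)    */
	uint64_t stripe_pos = su_index % l->stripe_count; /* which object in row   */
	uint64_t obj_set    = stripe_no / su_per_obj;     /* which set of objects  */

	*obj = obj_set * l->stripe_count + stripe_pos;
	*su_in_obj = stripe_no % su_per_obj;
}

int main(void)
{
	/* Unstriped case: the stripe unit equals the 4 MiB object size. */
	struct layout l = { 1 << 22, 1, 1 << 22 };
	uint64_t obj, su;

	map_file_offset(&l, 10ULL << 20, &obj, &su);
	printf("offset 10 MiB -> object %llu, stripe unit %llu within it\n",
	       (unsigned long long)obj, (unsigned long long)su);
	return 0;
}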
Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and grossly degraded performance.
Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted.
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;

 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
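To put rough numbers on the change, here is a standalone sketch of the wsize selection before and after the patch, assuming a 4 MiB stripe unit, a 4 KiB fscrypt block size, and a 64 MiB wsize mount option (the constants and helpers are illustrative, not kernel code):

#include <stdio.h>

#define STRIPE_UNIT	(4u << 20)	/* assumed stripe unit / object size: 4 MiB */
#define CRYPTO_BLOCK	(4u << 10)	/* assumed fscrypt block size: 4 KiB        */
#define MOUNT_WSIZE	(64u << 20)	/* assumed wsize= mount option: 64 MiB      */

/* Old behaviour: wsize starts from the inode block size, which the fscrypt
 * guardrails shrank to the crypto block size for encrypted inodes. */
static unsigned int wsize_before(void)
{
	unsigned int wsize = CRYPTO_BLOCK;

	if (MOUNT_WSIZE < wsize)
		wsize = MOUNT_WSIZE;
	return wsize;
}

/* Patched behaviour: wsize starts from the layout's stripe unit. */
static unsigned int wsize_after(void)
{
	unsigned int wsize = STRIPE_UNIT;

	if (MOUNT_WSIZE < wsize)
		wsize = MOUNT_WSIZE;
	return wsize;
}

int main(void)
{
	unsigned int before = wsize_before();
	unsigned int after = wsize_after();

	printf("write size: %u -> %u bytes\n", before, after);
	printf("OSD writes per 4 MiB object: %u -> %u\n",
	       STRIPE_UNIT / before, STRIPE_UNIT / after);
	return 0;
}

With those assumed sizes, the per-object write count drops from 1024 to 1, which is where the ~1024x figure in the commit message comes from.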
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size.
Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and grossly degraded performance.
Do you have any benchmarking results that prove your point?
Thanks, Slava.
Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted.
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;

 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size.
Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and grossly degraded performance.
Hi Slava,
Do you have any benchmarking results that prove your point?
I haven't done any "real" benchmarking for this change. On my setup (closer to a home server than a typical Ceph deployment), sequential write throughput increased from ~1.7 to ~66 MB/s with this patch applied. I don't consider this single datapoint representative, so rather than presenting it as a general benchmark in the commit message, I chose the qualitative wording "grossly degraded performance." Actual impact will vary depending on workload, disk type, OSD count, etc.
Those curious about the bug's performance impact in their environment can find out without enabling fscrypt, using: mount -o wsize=4096
However, the core rationale for my claim is based on principles, not on measurements: batching writes into fewer operations necessarily spreads per-operation overhead across more bytes. So this change removes an artificial per-op bottleneck on sequential write performance. The exact impact varies, but the patch does improve (fscrypt-enabled) write throughput in nearly every case.
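As a toy model of that amortization argument (the 1 ms per-request overhead and 200 MB/s data bandwidth below are assumptions chosen for illustration, not measurements from my setup):

#include <stdio.h>

int main(void)
{
	const double per_op_s = 0.001;		/* assumed fixed cost per OSD write: 1 ms */
	const double bw_bytes_s = 200e6;	/* assumed raw data bandwidth: 200 MB/s   */
	const double total = 256.0 * 1024 * 1024;		/* 256 MiB sequential write */
	const double sizes[] = { 4096.0, 4.0 * 1024 * 1024 };	/* 4 KiB vs 4 MiB wsize     */

	for (int i = 0; i < 2; i++) {
		double nops = total / sizes[i];
		double secs = nops * per_op_s + total / bw_bytes_s;

		printf("wsize %8.0f B: %6.0f ops, ~%.1f MB/s\n",
		       sizes[i], nops, total / secs / 1e6);
	}
	return 0;
}

At a 4 KiB write size the per-operation term dominates completely; at 4 MiB it is negligible, so throughput approaches the underlying bandwidth.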
Warm regards, Sam
Thanks, Slava.
Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted.
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;

 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size.
Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and grossly degraded performance.
Hi Slava,
Do you have any benchmarking results that prove your point?
I haven't done any "real" benchmarking for this change. On my setup (closer to a home server than a typical Ceph deployment), sequential write throughput increased from ~1.7 to ~66 MB/s with this patch applied. I don't consider this single datapoint representative, so rather than presenting it as a general benchmark in the commit message, I chose the qualitative wording "grossly degraded performance." Actual impact will vary depending on workload, disk type, OSD count, etc.
Those curious about the bug's performance impact in their environment can find out without enabling fscrypt, using: mount -o wsize=4096
However, the core rationale for my claim is based on principles, not on measurements: batching writes into fewer operations necessarily spreads per-operation overhead across more bytes. So this change removes an artificial per-op bottleneck on sequential write performance. The exact impact varies, but the patch does improve (fscrypt-enabled) write throughput in nearly every case.
If you claim in the commit message that "this patch fixes performance degradation", then you MUST share the evidence (benchmarking results). Otherwise, you cannot make such statements in the commit message. Yes, it could be a good fix, but please don't promise a performance improvement: end users have very different environments and workloads, and what works on your side may not work for other use cases and environments. Potentially, you could describe your environment and workload and share your estimate of the potential performance improvement.
Thanks, Slava.
Warm regards, Sam
Thanks, Slava.
Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted.
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;

 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
On Tue, Jan 6, 2026 at 3:11 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request.
CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size.
Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and grossly degraded performance.
Hi Slava,
Do you have any benchmarking results that prove your point?
I haven't done any "real" benchmarking for this change. On my setup (closer to a home server than a typical Ceph deployment), sequential write throughput increased from ~1.7 to ~66 MB/s with this patch applied. I don't consider this single datapoint representative, so rather than presenting it as a general benchmark in the commit message, I chose the qualitative wording "grossly degraded performance." Actual impact will vary depending on workload, disk type, OSD count, etc.
Those curious about the bug's performance impact in their environment can find out without enabling fscrypt, using: mount -o wsize=4096
However, the core rationale for my claim is based on principles, not on measurements: batching writes into fewer operations necessarily spreads per-operation overhead across more bytes. So this change removes an artificial per-op bottleneck on sequential write performance. The exact impact varies, but the patch does improve (fscrypt-enabled) write throughput in nearly every case.
Hi Slava,
If you claim in the commit message that "this patch fixes performance degradation", then you MUST share the evidence (benchmarking results). Otherwise, you cannot make such statements in the commit message. Yes, it could be a good fix, but please don't promise a performance improvement: end users have very different environments and workloads, and what works on your side may not work for other use cases and environments.
I agree with the underlying concern: I do not have a representative environment, and it would be irresponsible to promise or quantify a specific speedup. For that reason, the commit message does not claim any particular improvement factor.
What it does say is that this patch fixes a known performance degradation caused by artificially limiting the write batch size. But that statement is about correctness relative to previous behavior, not about guaranteeing any specific performance outcome for end users.
Potentially, you could describe your environment and workload and share your estimate of the potential performance improvement.
I don’t think that would be useful here. As you pointed out, any such numbers would be highly workload- and environment-specific and would not be representative or actionable. The purpose of this patch is simply to remove an unintentional limit, not to advertise or promise measurable gains.
Best, Sam
Thanks, Slava.
Warm regards, Sam
Thanks, Slava.
Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted.
Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>

 fs/ceph/addr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3569d44d510..cb1da8e27c2b 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
-	unsigned int wsize = i_blocksize(inode);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	unsigned int wsize = ci->i_layout.stripe_unit;

 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;