David Howells dhowells@redhat.com wrote:
Here are some miscellaneous fixes and changes for netfslib and cifs, if you could consider pulling them. All the bugs fixed were observed in cifs, so they should probably go through the cifs tree unless Christian would much prefer for them to go through the VFS tree.
Hi David,
your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used") has given me serious headaches; it has caused outages in our web hosting clusters (yet again - all Linux versions since 6.9 had serious netfs regressions). Your patch was backported to 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore the bugs it has caused will be "available" to all Linux stable users.
The problem we had is that writing to certain files never finishes. It looks like it has to do with the cachefiles subrequest never reporting completion. (We use Ceph with cachefiles)
I have tried applying the fixes in this pull request, which sounded promising, but the problem is still there. The only thing that helps is reverting 2b1424cd131c completely - everything is fine with 6.15.5 plus the revert.
What do you need from me in order to analyze the bug?
Max
Max Kellermann max.kellermann@ionos.com wrote:
> your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used") has given me serious headaches; it has caused outages in our web hosting clusters (yet again - all Linux versions since 6.9 had serious netfs regressions). Your patch was backported to 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore the bugs it has caused will be "available" to all Linux stable users.
>
> The problem we had is that writing to certain files never finishes. It looks like it has to do with the cachefiles subrequest never reporting completion. (We use Ceph with cachefiles.)
>
> I have tried applying the fixes in this pull request, which sounded promising, but the problem is still there. The only thing that helps is reverting 2b1424cd131c completely - everything is fine with 6.15.5 plus the revert.
>
> What do you need from me in order to analyze the bug?
As a start, can you turn on:
echo 65536 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_failure/enable
If you keep an eye on /proc/fs/netfs/requests you should be able to see any tasks in there that get stuck. If one gets stuck, then:
echo 0 > /sys/kernel/debug/tracing/events/enable
to stop further tracing.
Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of the stuck request. If you can try grepping the trace log for that:
grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace
that should hopefully let me see how things progressed on that call.
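To illustrate the last step, the debug ID can be pulled out of a /proc/fs/netfs/requests line and turned into the grep pattern mechanically. This is a minimal sketch, assuming the first column of the requests file is the 8-digit hex debug ID (as in the format shown later in this thread); the sample line is hypothetical:

```shell
# Hypothetical stuck-request line in /proc/fs/netfs/requests format
# (REQUEST OR REF FL ERR OPS COVERAGE); the first column is the debug ID.
line="00000065 2C  2 80002020    0   0 @0000 0/0"

# Extract the 8-digit hex debug ID and build the trace grep pattern.
id=$(printf '%s\n' "$line" | awk '{print $1}')
pattern="R=$id"
echo "$pattern"    # -> R=00000065

# On a live system one would then run (not executed here):
# grep "$pattern" /sys/kernel/debug/tracing/trace
```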
David
On Wed, Jul 9, 2025 at 3:01 PM David Howells dhowells@redhat.com wrote:
> If you keep an eye on /proc/fs/netfs/requests you should be able to see any tasks in there that get stuck. If one gets stuck, then:
After one got stuck, this is what I see in /proc/fs/netfs/requests:
REQUEST  OR REF FL       ERR  OPS COVERAGE
======== == === ==       ==== === =========
00000065 2C   2 80002020    0   0 @0000 0/0
> Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of the stuck request. If you can try grepping the trace log for that:
>
> grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace
kworker/u96:4-455 [008] ...1. 107.145222: netfs_sreq: R=00000065[1] WRIT PREP  f=00  s=0 0/0 s=0 e=0
kworker/u96:4-455 [008] ...1. 107.145292: netfs_sreq: R=00000065[1] WRIT SUBMT f=100 s=0 0/29e1 s=0 e=0
kworker/u96:4-455 [008] ...1. 107.145311: netfs_sreq: R=00000065[1] WRIT CA-PR f=100 s=0 0/3000 s=0 e=0
kworker/u96:4-455 [008] ...1. 107.145457: netfs_sreq: R=00000065[1] WRIT CA-WR f=100 s=0 0/3000 s=0 e=0
kworker/8:1-437   [008] ...1. 107.149530: netfs_sreq: R=00000065[1] WRIT TERM  f=100 s=0 3000/3000 s=2 e=0
kworker/8:1-437   [008] ...1. 107.149531: netfs_rreq: R=00000065 2C WAKE-Q f=80002020
I can reproduce this easily - it happens every time I log out of that machine when bash tries to write the bash_history file - the write() always gets stuck.
(The above was 6.15.5 plus all patches in this PR.)
Max Kellermann max.kellermann@ionos.com wrote:
> kworker/8:1-437   [008] ...1. 107.149531: netfs_rreq: R=00000065 2C WAKE-Q f=80002020
> ...
> (The above was 6.15.5 plus all patches in this PR.)
Can you check that, please? If you have patch 12 applied, the flags should be renumbered and there shouldn't be a NETFS_RREQ_* flag numbered 13, but f=80002020 would seem to have 0x2000 (ie. bit 13) set in it.
If you do have it applied, then this might be an indicator of the issue.
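The arithmetic here can be checked directly: 0x2000 is 1 << 13, and masking the reported flag word with it is non-zero. A quick shell sanity check, using the flag value from the trace above:

```shell
# Flag word reported in the netfs_rreq trace line above.
flags=$(( 0x80002020 ))

# 0x2000 is bit 13; AND it with the flag word to see whether it is set.
bit13=$(( 1 << 13 ))
printf 'bit 13 mask = 0x%x\n' "$bit13"    # -> bit 13 mask = 0x2000
if [ $(( flags & bit13 )) -ne 0 ]; then
    echo "bit 13 is set in 0x80002020"
fi
```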
David
On Wed, Jul 9, 2025 at 10:22 PM David Howells dhowells@redhat.com wrote:
> > (The above was 6.15.5 plus all patches in this PR.)
>
> Can you check that, please? If you have patch 12 applied, the flags should be renumbered and there shouldn't be a NETFS_RREQ_* flag numbered 13, but f=80002020 would seem to have 0x2000 (ie. bit 13) set in it.
Oh, I was slightly wrong: I merged only 12 patches, omitting the renumbering patch because it conflicted with my branch and seemed to be only a cosmetic change, not relevant to the bug. Sorry for the confusion!
Hi Max,
I managed to reproduce it on my test machine with ceph + fscache.
Does this fix the problem for you?
David
---
netfs: Fix copy-to-cache so that it performs collection with ceph+fscache
The netfs copy-to-cache code that is used by Ceph with local caching sets up a new request to write the data just read to the cache. The request is started and then left to look after itself whilst the app continues. The request gets notified by the backing fs upon completion of the async DIO write, but then, because NETFS_RREQ_OFFLOAD_COLLECTION isn't set, it tries to wake up the app - but the app isn't waiting there, and so the request just hangs.
Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the notification from the backing filesystem to put the collection onto a work queue instead.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann max.kellermann@ionos.com
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3...
Signed-off-by: David Howells dhowells@redhat.com
cc: Paulo Alcantara pc@manguebit.org
cc: Viacheslav Dubeyko slava@dubeyko.com
cc: Alex Markuze amarkuze@redhat.com
cc: Ilya Dryomov idryomov@gmail.com
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
---
 fs/netfs/read_pgpriv2.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index 5bbe906a551d..080d2a6a51d9 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -110,6 +110,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 	if (!creq->io_streams[1].avail)
 		goto cancel_put;
 
+	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
 	netfs_stat(&netfs_n_wh_copy_to_cache);
 	rreq->copy_to_cache = creq;
David Howells dhowells@redhat.com wrote:
> I managed to reproduce it on my test machine with ceph + fscache.
>
> Does this fix the problem for you?
There's at least one more bug in there, so this won't fix all of the cases.
David
On Thu, Jul 10, 2025 at 12:47 PM David Howells dhowells@redhat.com wrote:
> I managed to reproduce it on my test machine with ceph + fscache.
>
> Does this fix the problem for you?
Yes! I can no longer reproduce my problem. Thanks, David.
Hi Max,
Depending on what you're doing on ceph, you might need the attached patch as well. I managed to reproduce it by doing a git clone and kernel build on a ceph mount with cachefiles active.
David
----
commit 6c24e7124642846b628274c845cd6de252f1dfbf
Author: David Howells dhowells@redhat.com
Date:   Thu Jul 10 15:02:57 2025 +0100
netfs: Fix race between cache write completion and ALL_QUEUED being set
When netfslib is issuing subrequests, the subrequests start processing immediately and may complete before we reach the end of the issuing function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED to indicate to the collector that we aren't going to issue any more subreqs and that it can do the final notifications and cleanup.
Now, this isn't a problem if the request is synchronous (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be done in-thread and we're guaranteed an opportunity to run the collector.
However, if the request is asynchronous, collection is primarily triggered by the termination of subrequests queuing it on a workqueue. Now, a race can occur here if the app thread sets ALL_QUEUED after the last subrequest terminates.
This can happen most easily with the copy2cache code (as used by Ceph) where, in the collection routine of a read request, an asynchronous write request is spawned to copy data to the cache. Folios are added to the write request as they're unlocked, but there may be a delay before ALL_QUEUED is set as the write subrequests may complete before we get there.
If all the write subreqs have finished by the ALL_QUEUED point, no further events happen and the collection never happens, leaving the request hanging.
Fix this by queuing the collector after setting ALL_QUEUED. This is a bit heavy-handed and it may be sufficient to do it only if there are no extant subreqs.
Also add a tracepoint to cross-reference both requests in a copy-to-request operation and add a trace to the netfs_rreq tracepoint to indicate the setting of ALL_QUEUED.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann max.kellermann@ionos.com
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3...
Signed-off-by: David Howells dhowells@redhat.com
cc: Paulo Alcantara pc@manguebit.org
cc: Viacheslav Dubeyko slava@dubeyko.com
cc: Alex Markuze amarkuze@redhat.com
cc: Ilya Dryomov idryomov@gmail.com
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index 080d2a6a51d9..889ff7954f8c 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -111,6 +111,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 		goto cancel_put;
 
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
+	trace_netfs_copy2cache(rreq, creq);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
 	netfs_stat(&netfs_n_wh_copy_to_cache);
 	rreq->copy_to_cache = creq;
@@ -155,6 +156,8 @@ void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
 	netfs_issue_write(creq, &creq->io_streams[1]);
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_end_copy_to_cache);
+	netfs_wake_collector(creq);
 
 	netfs_put_request(creq, netfs_rreq_trace_put_return);
 	creq->copy_to_cache = NULL;
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 73e96ccbe830..64a382fbc31a 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -55,6 +55,7 @@
 	EM(netfs_rreq_trace_copy,		"COPY   ")	\
 	EM(netfs_rreq_trace_dirty,		"DIRTY  ")	\
 	EM(netfs_rreq_trace_done,		"DONE   ")	\
+	EM(netfs_rreq_trace_end_copy_to_cache,	"END-C2C")	\
 	EM(netfs_rreq_trace_free,		"FREE   ")	\
 	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
 	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
@@ -559,6 +560,35 @@ TRACE_EVENT(netfs_write,
 		  __entry->start, __entry->start + __entry->len - 1)
 	    );
 
+TRACE_EVENT(netfs_copy2cache,
+	    TP_PROTO(const struct netfs_io_request *rreq,
+		     const struct netfs_io_request *creq),
+
+	    TP_ARGS(rreq, creq),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int, rreq)
+		    __field(unsigned int, creq)
+		    __field(unsigned int, cookie)
+		    __field(unsigned int, ino)
+		    ),
+
+	    TP_fast_assign(
+		    struct netfs_inode *__ctx = netfs_inode(rreq->inode);
+		    struct fscache_cookie *__cookie = netfs_i_cookie(__ctx);
+		    __entry->rreq = rreq->debug_id;
+		    __entry->creq = creq->debug_id;
+		    __entry->cookie = __cookie ? __cookie->debug_id : 0;
+		    __entry->ino = rreq->inode->i_ino;
+		    ),
+
+	    TP_printk("R=%08x CR=%08x c=%08x i=%x ",
+		      __entry->rreq,
+		      __entry->creq,
+		      __entry->cookie,
+		      __entry->ino)
+	    );
+
 TRACE_EVENT(netfs_collect,
 	    TP_PROTO(const struct netfs_io_request *wreq),
David Howells dhowells@redhat.com wrote:
> Depending on what you're doing on ceph, you might need the attached patch as well. I managed to reproduce it by doing a git clone and kernel build on a ceph mount with cachefiles active.
Here's a version of the patch that conditionally does the needed wakeup. I don't want to force processing if there's no need.
David
---
commit 1fe42a9a7f0b2f51107574f0b8e151d13dc766cc
Author: David Howells dhowells@redhat.com
Date:   Thu Jul 10 15:02:57 2025 +0100
netfs: Fix race between cache write completion and ALL_QUEUED being set
When netfslib is issuing subrequests, the subrequests start processing immediately and may complete before we reach the end of the issuing function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED to indicate to the collector that we aren't going to issue any more subreqs and that it can do the final notifications and cleanup.
Now, this isn't a problem if the request is synchronous (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be done in-thread and we're guaranteed an opportunity to run the collector.
However, if the request is asynchronous, collection is primarily triggered by the termination of subrequests queuing it on a workqueue. Now, a race can occur here if the app thread sets ALL_QUEUED after the last subrequest terminates.
This can happen most easily with the copy2cache code (as used by Ceph) where, in the collection routine of a read request, an asynchronous write request is spawned to copy data to the cache. Folios are added to the write request as they're unlocked, but there may be a delay before ALL_QUEUED is set as the write subrequests may complete before we get there.
If all the write subreqs have finished by the ALL_QUEUED point, no further events happen and the collection never happens, leaving the request hanging.
Fix this by queuing the collector after setting ALL_QUEUED. This is a bit heavy-handed and it may be sufficient to do it only if there are no extant subreqs.
Also add a tracepoint to cross-reference both requests in a copy-to-request operation and add a trace to the netfs_rreq tracepoint to indicate the setting of ALL_QUEUED.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann max.kellermann@ionos.com
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3...
Signed-off-by: David Howells dhowells@redhat.com
cc: Paulo Alcantara pc@manguebit.org
cc: Viacheslav Dubeyko slava@dubeyko.com
cc: Alex Markuze amarkuze@redhat.com
cc: Ilya Dryomov idryomov@gmail.com
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index 080d2a6a51d9..8097bc069c1d 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -111,6 +111,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 		goto cancel_put;
 
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
+	trace_netfs_copy2cache(rreq, creq);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
 	netfs_stat(&netfs_n_wh_copy_to_cache);
 	rreq->copy_to_cache = creq;
@@ -155,6 +156,9 @@ void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
 	netfs_issue_write(creq, &creq->io_streams[1]);
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_end_copy_to_cache);
+	if (list_empty_careful(&creq->io_streams[1].subrequests))
+		netfs_wake_collector(creq);
 
 	netfs_put_request(creq, netfs_rreq_trace_put_return);
 	creq->copy_to_cache = NULL;
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 73e96ccbe830..64a382fbc31a 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -55,6 +55,7 @@
 	EM(netfs_rreq_trace_copy,		"COPY   ")	\
 	EM(netfs_rreq_trace_dirty,		"DIRTY  ")	\
 	EM(netfs_rreq_trace_done,		"DONE   ")	\
+	EM(netfs_rreq_trace_end_copy_to_cache,	"END-C2C")	\
 	EM(netfs_rreq_trace_free,		"FREE   ")	\
 	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
 	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
@@ -559,6 +560,35 @@ TRACE_EVENT(netfs_write,
 		  __entry->start, __entry->start + __entry->len - 1)
 	    );
 
+TRACE_EVENT(netfs_copy2cache,
+	    TP_PROTO(const struct netfs_io_request *rreq,
+		     const struct netfs_io_request *creq),
+
+	    TP_ARGS(rreq, creq),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int, rreq)
+		    __field(unsigned int, creq)
+		    __field(unsigned int, cookie)
+		    __field(unsigned int, ino)
+		    ),
+
+	    TP_fast_assign(
+		    struct netfs_inode *__ctx = netfs_inode(rreq->inode);
+		    struct fscache_cookie *__cookie = netfs_i_cookie(__ctx);
+		    __entry->rreq = rreq->debug_id;
+		    __entry->creq = creq->debug_id;
+		    __entry->cookie = __cookie ? __cookie->debug_id : 0;
+		    __entry->ino = rreq->inode->i_ino;
+		    ),
+
+	    TP_printk("R=%08x CR=%08x c=%08x i=%x ",
+		      __entry->rreq,
+		      __entry->creq,
+		      __entry->cookie,
+		      __entry->ino)
+	    );
+
 TRACE_EVENT(netfs_collect,
 	    TP_PROTO(const struct netfs_io_request *wreq),
linux-stable-mirror@lists.linaro.org