On Tue, Sep 30, 2025 at 11:31:26PM -0700, Chris Leech wrote:
On Mon, Sep 29, 2025 at 02:19:51PM +0300, Dmitry Bogdanov wrote:
nvme uses a page_frag_cache to preallocate a PDU for each preallocated request of a block device. Block devices are created in parallel threads, so the page_frag_cache is used in a non-thread-safe manner. That leads to incorrect refcounting of the backing pages and a premature free.
That can be caught by the !sendpage_ok check inside the network stack:
  WARNING: CPU: 7 PID: 467 at ../net/core/skbuff.c:6931 skb_splice_from_iter+0xfa/0x310.
  tcp_sendmsg_locked+0x782/0xce0
  tcp_sendmsg+0x27/0x40
  sock_sendmsg+0x8b/0xa0
  nvme_tcp_try_send_cmd_pdu+0x149/0x2a0

Then a random panic may occur.
Fix that by serializing the usage of page_frag_cache.
Thank you for reporting this. I think we can fix it without blocking the async namespace scanning with a mutex, by switching from a per-queue page_frag_cache to per-cpu. There shouldn't be a need to keep the page_frag allocations isolated by queue anyway.
It would be great if you could test the patch which I'll send after this.
As I commented on your patch, a naive per-cpu cache solution is error-prone, and a complete solution would be unnecessarily difficult. Block device creation is not data plane, it is control plane, so there is no sense in using lockless algorithms there.
My patch is already simple and error-proof, so I insist on this solution.
BR, Dmitry
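
For readers following the thread, the serialization argument can be pictured with a small user-space analogy. This is only a sketch and not the patch under discussion: struct frag_cache, frag_alloc(), run_demo() and FRAG_SIZE are hypothetical names, and pthreads stand in for the parallel namespace scan threads. It does not use the kernel page_frag_cache API. The point it tries to show is that a plain read-modify-write on shared allocator state stays consistent once every caller takes the same mutex.

```c
/* User-space analogy only; NOT the kernel page_frag_cache API. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define FRAG_SIZE   64	/* hypothetical PDU size in bytes */
#define MAX_THREADS 16

/* Hypothetical stand-in for a shared fragment cache: a bump offset
 * plus a count of handed-out references. Neither field is atomic. */
struct frag_cache {
	size_t offset;
	long refs;
	pthread_mutex_t lock;
};

struct worker_arg {
	struct frag_cache *cache;
	int nr_allocs;
};

/* Not safe by itself: plain read-modify-write of shared state. */
static size_t frag_alloc_unlocked(struct frag_cache *c)
{
	size_t off = c->offset;

	c->offset = off + FRAG_SIZE;
	c->refs++;
	return off;
}

/* Serialized variant: every caller takes the same mutex. */
static size_t frag_alloc(struct frag_cache *c)
{
	size_t off;

	pthread_mutex_lock(&c->lock);
	off = frag_alloc_unlocked(c);
	pthread_mutex_unlock(&c->lock);
	return off;
}

static void *worker(void *p)
{
	struct worker_arg *a = p;

	for (int i = 0; i < a->nr_allocs; i++)
		(void)frag_alloc(a->cache);
	return NULL;
}

/* Runs nr_threads workers (capped at MAX_THREADS) against one cache
 * and returns the final reference count. */
long run_demo(int nr_threads, int nr_allocs)
{
	struct frag_cache cache = { .offset = 0, .refs = 0 };
	struct worker_arg arg = { .cache = &cache, .nr_allocs = nr_allocs };
	pthread_t tids[MAX_THREADS];
	int started = 0;

	if (nr_threads > MAX_THREADS)
		nr_threads = MAX_THREADS;

	pthread_mutex_init(&cache.lock, NULL);
	for (int i = 0; i < nr_threads; i++) {
		if (pthread_create(&tids[i], NULL, worker, &arg) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	pthread_mutex_destroy(&cache.lock);

	/* With the lock held on every update, offset and refs agree. */
	assert(cache.offset == (size_t)cache.refs * FRAG_SIZE);
	return cache.refs;
}
```

Calling run_demo(4, 1000) from a main() should return 4000 when all four threads start. If the workers called frag_alloc_unlocked() directly, updates could be lost and the count could come up short, which is the user-space counterpart of the refcount imbalance described above. The lock is only taken on a setup path here, which is the same reasoning as treating block device creation as control plane.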