In commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()"), the check for fragsz is moved earlier. So when the cache is used up, and if the fragsz > PAGE_SIZE, it won't try to refill, and just return NULL. I tested it with fragsz:8192, cache-size:32768. After the initial four successful allocations, it failed, even there is plenty of free memory in the system. To fix, revert the refill logic like before: the refill is attempted before the check & return NULL.
Cc: linyunsheng@huawei.com Cc: stable@vger.kernel.org Fixes: 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()") Signed-off-by: Haiyang Zhang haiyangz@microsoft.com --- mm/page_frag_cache.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c index d2423f30577e..82935d7e53de 100644 --- a/mm/page_frag_cache.c +++ b/mm/page_frag_cache.c @@ -119,19 +119,6 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc, size = PAGE_SIZE << encoded_page_decode_order(encoded_page); offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask); if (unlikely(offset + fragsz > size)) { - if (unlikely(fragsz > PAGE_SIZE)) { - /* - * The caller is trying to allocate a fragment - * with fragsz > PAGE_SIZE but the cache isn't big - * enough to satisfy the request, this may - * happen in low memory conditions. - * We don't release the cache page because - * it could make memory pressure worse - * so we simply return NULL here. - */ - return NULL; - } - page = encoded_page_decode_page(encoded_page);
if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) @@ -149,6 +136,19 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc, /* reset page count bias and offset to start of new frag */ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; offset = 0; + + if (unlikely(fragsz > size)) { + /* + * The caller is trying to allocate a fragment + * with fragsz > size but the cache isn't big + * enough to satisfy the request, this may + * happen in low memory conditions. + * We don't release the cache page because + * it could make memory pressure worse + * so we simply return NULL here. + */ + return NULL; + } }
nc->pagecnt_bias--;
+cc netdev ML & Alexander
On 3/1/2025 10:03 AM, Haiyang Zhang wrote:
In commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()"), the check for fragsz is moved earlier. So when the cache is used up, and if the fragsz > PAGE_SIZE, it won't try to refill, and just return NULL. I tested it with fragsz:8192, cache-size:32768. After the initial four successful allocations, it failed, even there is plenty of free memory in the system.
Hi, Haiyang It seems the PAGE_SIZE is 4K for the tested system? Which drivers or subsystems are passing the fragsz being bigger than PAGE_SIZE to page_frag_alloc_align() related API?
To fix, revert the refill logic like before: the refill is attempted before the check & return NULL.
page_frag API is not really for allocating memory being bigger than PAGE_SIZE as __page_frag_cache_refill() will not try hard enough to allocate order 3 compound page when calling __alloc_pages() and will fail back to allocate base page as the discussed in below: https://lore.kernel.org/all/ead00fb7-8538-45b3-8322-8a41386e7381@huawei.com/
Cc: linyunsheng@huawei.com Cc: stable@vger.kernel.org Fixes: 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()") Signed-off-by: Haiyang Zhang haiyangz@microsoft.com
mm/page_frag_cache.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c index d2423f30577e..82935d7e53de 100644 --- a/mm/page_frag_cache.c +++ b/mm/page_frag_cache.c @@ -119,19 +119,6 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc, size = PAGE_SIZE << encoded_page_decode_order(encoded_page); offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask); if (unlikely(offset + fragsz > size)) {
if (unlikely(fragsz > PAGE_SIZE)) {
/*
* The caller is trying to allocate a fragment
* with fragsz > PAGE_SIZE but the cache isn't big
* enough to satisfy the request, this may
* happen in low memory conditions.
* We don't release the cache page because
* it could make memory pressure worse
* so we simply return NULL here.
*/
return NULL;
}
- page = encoded_page_decode_page(encoded_page);
if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) @@ -149,6 +136,19 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc, /* reset page count bias and offset to start of new frag */ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; offset = 0;
if (unlikely(fragsz > size)) {
/*
* The caller is trying to allocate a fragment
* with fragsz > size but the cache isn't big
* enough to satisfy the request, this may
* happen in low memory conditions.
* We don't release the cache page because
* it could make memory pressure worse
* so we simply return NULL here.
*/
return NULL;
}}
nc->pagecnt_bias--;
-----Original Message----- From: Yunsheng Lin yunshenglin0825@gmail.com Sent: Saturday, March 1, 2025 8:50 AM To: Haiyang Zhang haiyangz@microsoft.com; linux-hyperv@vger.kernel.org; akpm@linux-foundation.org; linux-mm@kvack.org Cc: Dexuan Cui decui@microsoft.com; KY Srinivasan kys@microsoft.com; Paul Rosswurm paulros@microsoft.com; olaf@aepfle.de; vkuznets vkuznets@redhat.com; davem@davemloft.net; wei.liu@kernel.org; Long Li longli@microsoft.com; linux-kernel@vger.kernel.org; linyunsheng@huawei.com; stable@vger.kernel.org; netdev@vger.kernel.org; Alexander Duyck alexander.duyck@gmail.com Subject: [EXTERNAL] Re: [PATCH] mm: page_frag: Fix refill handling in __page_frag_alloc_align()
+cc netdev ML & Alexander
On 3/1/2025 10:03 AM, Haiyang Zhang wrote:
In commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()"), the check for fragsz is moved earlier. So when the cache is used up, and if the fragsz > PAGE_SIZE, it won't try to refill, and just return NULL. I tested it with fragsz:8192, cache-size:32768. After the initial four successful allocations, it failed, even there is plenty of free memory in the system.
Hi, Haiyang It seems the PAGE_SIZE is 4K for the tested system?
Yes.
Which drivers or subsystems are passing the fragsz being bigger than PAGE_SIZE to page_frag_alloc_align() related API?
For example, our MANA driver when using jumbo frame. https://web.git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree...
To fix, revert the refill logic like before: the refill is attempted before the check & return NULL.
page_frag API is not really for allocating memory being bigger than PAGE_SIZE as __page_frag_cache_refill() will not try hard enough to allocate order 3 compound page when calling __alloc_pages() and will fail back to allocate base page as the discussed in below: https://lore.ker/ nel.org%2Fall%2Fead00fb7-8538-45b3-8322- 8a41386e7381%40huawei.com%2F&data=05%7C02%7Chaiyangz%40microsoft.com%7Cd73 d6a0ae65b4a42681c08dd58c8087b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7 C638764338396356411%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOi IwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7 C&sdata=FJ7Ggrxxxv6QzKepUiHmtns1GZC2G2oJMcWSzOuFbsE%3D&reserved=0
We are already aware of this, and have error checking in place for the failover case to "base page".
From the discussion thread above, there are other drivers using page_frag_alloc_align() for over PAGE_SIZE too. If making the page_frag API support only fragsz <= PAGE_SIZE is desired, can we create another API? One keeps the existing API semantics (allowing > PAGE_SIZE), the other uses your new code. By the way, it should add an explicit check and fail ALL requests for fragsz > PAGE_SIZE. Currently your code successfully allocates big frags for a few times, then fail. This is not a desired behavior. It's also a breaking change for our MANA driver, which can no longer run Jumbo frames.
@Andrew Morton akpm@linux-foundation.org And other maintainers, could you please also evaluate the idea above?
Thanks, - Haiyang
-----Original Message----- From: Haiyang Zhang haiyangz@microsoft.com Sent: Saturday, March 1, 2025 12:00 PM To: Yunsheng Lin yunshenglin0825@gmail.com; linux- hyperv@vger.kernel.org; akpm@linux-foundation.org; linux-mm@kvack.org Cc: Dexuan Cui decui@microsoft.com; KY Srinivasan kys@microsoft.com; Paul Rosswurm paulros@microsoft.com; olaf@aepfle.de; vkuznets vkuznets@redhat.com; davem@davemloft.net; wei.liu@kernel.org; Long Li longli@microsoft.com; linux-kernel@vger.kernel.org; linyunsheng@huawei.com; stable@vger.kernel.org; netdev@vger.kernel.org; Alexander Duyck alexander.duyck@gmail.com Subject: RE: [EXTERNAL] Re: [PATCH] mm: page_frag: Fix refill handling in __page_frag_alloc_align()
-----Original Message----- From: Yunsheng Lin yunshenglin0825@gmail.com Sent: Saturday, March 1, 2025 8:50 AM To: Haiyang Zhang haiyangz@microsoft.com; linux-
hyperv@vger.kernel.org;
akpm@linux-foundation.org; linux-mm@kvack.org Cc: Dexuan Cui decui@microsoft.com; KY Srinivasan kys@microsoft.com; Paul Rosswurm paulros@microsoft.com; olaf@aepfle.de; vkuznets vkuznets@redhat.com; davem@davemloft.net; wei.liu@kernel.org; Long Li longli@microsoft.com; linux-kernel@vger.kernel.org; linyunsheng@huawei.com; stable@vger.kernel.org; netdev@vger.kernel.org; Alexander Duyck alexander.duyck@gmail.com Subject: [EXTERNAL] Re: [PATCH] mm: page_frag: Fix refill handling in __page_frag_alloc_align()
+cc netdev ML & Alexander
On 3/1/2025 10:03 AM, Haiyang Zhang wrote:
In commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()"), the check for fragsz is moved earlier. So when the cache is used up, and if the fragsz > PAGE_SIZE, it won't try to refill, and just return NULL. I tested it with fragsz:8192, cache-size:32768. After the initial four successful allocations, it failed, even there is plenty of free memory in the system.
Hi, Haiyang It seems the PAGE_SIZE is 4K for the tested system?
Yes.
Which drivers or subsystems are passing the fragsz being bigger than PAGE_SIZE to page_frag_alloc_align() related API?
For example, our MANA driver when using jumbo frame. https://web.git/. kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Fnetdev%2Fnet- next.git%2Ftree%2Fdrivers%2Fnet%2Fethernet%2Fmicrosoft%2Fmana&data=05%7C02 %7Chaiyangz%40microsoft.com%7Cea9cc3de8c904a5c720408dd58e2913a%7C72f988bf8 6f141af91ab2d7cd011db47%7C1%7C0%7C638764452327076527%7CUnknown%7CTWFpbGZsb 3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWF pbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=7%2BZ9hBYGudGZUeeF0i4UEa3zjx4ZLFd q5E3qcZxnIWE%3D&reserved=0
To fix, revert the refill logic like before: the refill is attempted before the check & return NULL.
page_frag API is not really for allocating memory being bigger than PAGE_SIZE as __page_frag_cache_refill() will not try hard enough to allocate order 3 compound page when calling __alloc_pages() and will fail back to allocate base page as the discussed in below:
https://lore.ker/ %2F&data=05%7C02%7Chaiyangz%40microsoft.com%7Cea9cc3de8c904a5c720408dd58e2 913a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638764452327105287%7CUnk nown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4 zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=g4VAI8DbzUD95qgth vzFV0PYgOIA3%2F%2FI3gmQHzuLwbo%3D&reserved=0
nel.org%2Fall%2Fead00fb7-8538-45b3-8322-
8a41386e7381%40huawei.com%2F&data=05%7C02%7Chaiyangz%40microsoft.com%7Cd73
d6a0ae65b4a42681c08dd58c8087b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7
C638764338396356411%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOi
IwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7
C&sdata=FJ7Ggrxxxv6QzKepUiHmtns1GZC2G2oJMcWSzOuFbsE%3D&reserved=0
We are already aware of this, and have error checking in place for the failover case to "base page".
From the discussion thread above, there are other drivers using page_frag_alloc_align() for over PAGE_SIZE too. If making the page_frag API support only fragsz <= PAGE_SIZE is desired, can we create another API? One keeps the existing API semantics (allowing > PAGE_SIZE), the other uses your new code. By the way, it should add an explicit check and fail ALL requests for fragsz > PAGE_SIZE. Currently your code successfully allocates big frags for a few times, then fail. This is not a desired behavior. It's also a breaking change for our MANA driver, which can no longer run Jumbo frames.
@Andrew Morton akpm@linux-foundation.org And other maintainers, could you please also evaluate the idea above?
And, quote from current doc 6.14.0-rc4: "A page fragment is an arbitrary-length arbitrary-offset area of memory which resides within a 0 or higher order compound page." https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree...
So, it is designed to be *arbitrary-length* within a 0 or higher order compound page.
If the commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()") intended to change the existing API semantics to be Page Frag Length <= PAGE_SIZE, the document and all breaking drivers need to be updated.
Thanks, - Haiyang
On 3/2/2025 7:16 AM, Haiyang Zhang wrote:
...
We are already aware of this, and have error checking in place for the failover case to "base page".
From the discussion thread above, there are other drivers using page_frag_alloc_align() for over PAGE_SIZE too. If making the page_frag API support only fragsz <= PAGE_SIZE is desired, can we create another API? One keeps the existing API semantics (allowing > PAGE_SIZE), the other uses your new code. By the way, it should add an explicit check and fail ALL requests for fragsz > PAGE_SIZE. Currently your code successfully allocates big frags for a few times, then fail. This is not a desired behavior. It's also a breaking change for our MANA driver, which can no longer run Jumbo frames.
It seems there was some memory corruption problem that may caused by reuse of the previously allocated frag cache memory by following a LARGER allocations before 'using initial zero offset'.
https://lore.kernel.org/netdev/b711ca5f-4180-d658-a330-89bd5dcb0acb@gmail.co...
'using initial zero offset' seems to just make the API misuse problem more obvious.
@Andrew Morton akpm@linux-foundation.org And other maintainers, could you please also evaluate the idea above?
And, quote from current doc 6.14.0-rc4: "A page fragment is an arbitrary-length arbitrary-offset area of memory which resides within a 0 or higher order compound page." https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree...
So, it is designed to be *arbitrary-length* within a 0 or higher order compound page.
If the commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()") intended to change the existing API semantics to be Page Frag Length <= PAGE_SIZE, the document and all breaking drivers need to be updated.
Yes, updating the Documemntation to make it more obvious seems like the right thing to do.
Thanks,
- Haiyang
linux-stable-mirror@lists.linaro.org