From: Ge Yang <yangge1116@126.com>
A kernel crash was observed when replacing free hugetlb folios:
BUG: kernel NULL pointer dereference, address: 0000000000000028
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 28 UID: 0 PID: 29639 Comm: test_cma.sh Tainted 6.15.0-rc6-zp #41 PREEMPT(voluntary)
RIP: 0010:alloc_and_dissolve_hugetlb_folio+0x1d/0x1f0
RSP: 0018:ffffc9000b30fa90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000342cca RCX: ffffea0043000000
RDX: ffffc9000b30fb08 RSI: ffffea0043000000 RDI: 0000000000000000
RBP: ffffc9000b30fb20 R08: 0000000000001000 R09: 0000000000000000
R10: ffff88886f92eb00 R11: 0000000000000000 R12: ffffea0043000000
R13: 0000000000000000 R14: 00000000010c0200 R15: 0000000000000004
FS:  00007fcda5f14740(0000) GS:ffff8888ec1d8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000391402000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 replace_free_hugepage_folios+0xb6/0x100
 alloc_contig_range_noprof+0x18a/0x590
 ? srso_return_thunk+0x5/0x5f
 ? down_read+0x12/0xa0
 ? srso_return_thunk+0x5/0x5f
 cma_range_alloc.constprop.0+0x131/0x290
 __cma_alloc+0xcf/0x2c0
 cma_alloc_write+0x43/0xb0
 simple_attr_write_xsigned.constprop.0.isra.0+0xb2/0x110
 debugfs_attr_write+0x46/0x70
 full_proxy_write+0x62/0xa0
 vfs_write+0xf8/0x420
 ? srso_return_thunk+0x5/0x5f
 ? filp_flush+0x86/0xa0
 ? srso_return_thunk+0x5/0x5f
 ? filp_close+0x1f/0x30
 ? srso_return_thunk+0x5/0x5f
 ? do_dup2+0xaf/0x160
 ? srso_return_thunk+0x5/0x5f
 ksys_write+0x65/0xe0
 do_syscall_64+0x64/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
There is a potential race between __update_and_free_hugetlb_folio() and replace_free_hugepage_folios():
CPU1                                  CPU2
__update_and_free_hugetlb_folio       replace_free_hugepage_folios
                                        folio_test_hugetlb(folio)
                                        -- It's still a hugetlb folio.
  __folio_clear_hugetlb(folio)
  hugetlb_free_folio(folio)
                                        h = folio_hstate(folio)
                                        -- Here, h is a NULL pointer.
When this race occurs, folio_hstate(folio) returns NULL, and the subsequent dereference of that NULL pointer crashes the system. To resolve this issue, call folio_hstate(folio) under the protection of hugetlb_lock, ensuring that folio_hstate(folio) does not return NULL.
Fixes: 04f13d241b8b ("mm: replace free hugepage folios after migration")
Signed-off-by: Ge Yang <yangge1116@126.com>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d3ca6b..6c2e007 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,12 +2924,20 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
+
+		/*
+		 * The folio might have been dissolved from under our feet, so make sure
+		 * to carefully check the state under the lock.
+		 */
+		spin_lock_irq(&hugetlb_lock);
 		if (folio_test_hugetlb(folio)) {
 			h = folio_hstate(folio);
 		} else {
+			spin_unlock_irq(&hugetlb_lock);
 			start_pfn++;
 			continue;
 		}
+		spin_unlock_irq(&hugetlb_lock);

 		if (!folio_ref_count(folio)) {
 			ret = alloc_and_dissolve_hugetlb_folio(h, folio,
On May 22, 2025, at 11:22, yangge1116@126.com wrote:
[...]
Thanks for fixing this problem. BTW, in order to catch similar problems in the future, it would be better to add a WARN_ON to folio_hstate() to assert that hugetlb_lock is held when the folio's reference count is zero. For this fix, LGTM.
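A minimal sketch of the kind of assertion I have in mind (hypothetical, not a tested patch; it assumes the current shape of folio_hstate() in include/linux/hugetlb.h, and the lockdep form is just one possible way to express it):

static inline struct hstate *folio_hstate(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio);
	/*
	 * A caller holding no reference on the folio must hold
	 * hugetlb_lock, otherwise the folio can be dissolved under it
	 * and the returned hstate would be stale (or NULL).
	 */
	if (!folio_ref_count(folio))
		lockdep_assert_held(&hugetlb_lock);
	return size_to_hstate(folio_size(folio));
}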
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks.
On Thu, May 22, 2025 at 11:47:05AM +0800, Muchun Song wrote:
Thanks for fixing this problem. BTW, in order to catch similar problems in the future, it would be better to add a WARN_ON to folio_hstate() to assert that hugetlb_lock is held when the folio's reference count is zero. For this fix, LGTM.
Why can't we put all the burden on alloc_and_dissolve_hugetlb_folio(), which will check things again under the lock? I mean, I would be OK with checking upfront in replace_free_hugepage_folios() to save cycles, but the latter has only one user, alloc_contig_range(), which is not really a function expected to be optimized.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd8971388236..b4d937732256 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,13 +2924,6 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
-		if (folio_test_hugetlb(folio)) {
-			h = folio_hstate(folio);
-		} else {
-			start_pfn++;
-			continue;
-		}
-
 		if (!folio_ref_count(folio)) {
 			ret = alloc_and_dissolve_hugetlb_folio(h, folio,
 							       &isolate_list);
On May 22, 2025, at 13:34, Oscar Salvador osalvador@suse.de wrote:
[...]
Why can't we put all the burden on alloc_and_dissolve_hugetlb_folio(), which will check things again under the lock?
I've also considered this choice, because there is another similar case in isolate_or_dissolve_huge_page() that could benefit from this change. I am fine with both approaches. Anyway, adding an assertion to folio_hstate() is an improvement for catching invalid users in the future, because any user of folio_hstate() should hold a reference to the folio or hold the hugetlb_lock to make sure it returns a valid hstate for a hugetlb folio.
Muchun,
Thanks.
[...]
On Thu, May 22, 2025 at 03:13:31PM +0800, Muchun Song wrote:
[...]
I've also considered this choice, because there is another similar case in isolate_or_dissolve_huge_page() that could benefit from this change. I am fine with both approaches. Anyway, adding an assertion to folio_hstate() is an improvement for catching invalid users in the future, because any user of folio_hstate() should hold a reference to the folio or hold the hugetlb_lock to make sure it returns a valid hstate for a hugetlb folio.
Yes, I am not arguing about that, it makes perfect sense to me, but I am just kinda against these micro-optimizations that take the lock to check things when we are calling into a function that will take the lock again to check things.
Actually, I think the folio_test_hugetlb() check in replace_free_hugepage_folios() was put there to try to be smart and save cycles in case we were not dealing with a hugetlb page (so it was freed under us). Now that we have learnt that we cannot do that without 1) taking a refcount or 2) holding the lock, it becomes superfluous, so I would just wipe it out.
On 2025/5/22 13:34, Oscar Salvador wrote:
[...]
Why can't we put all the burden on alloc_and_dissolve_hugetlb_folio(), which will check things again under the lock? I mean, I would be OK with checking upfront in replace_free_hugepage_folios() to save cycles, but the latter has only one user, alloc_contig_range(), which is not really a function expected to be optimized.
[...]
It seems that we cannot simply remove the folio_test_hugetlb() check. The reasons are as follows:
1) If we remove it, we will be unable to obtain the hstate corresponding to the folio, and consequently we won't be able to call alloc_and_dissolve_hugetlb_folio().
2) The alloc_and_dissolve_hugetlb_folio() function is also called from isolate_or_dissolve_huge_folio(). However, the folio_test_hugetlb() check in isolate_or_dissolve_huge_folio() cannot be removed.
On Thu, May 22, 2025 at 07:34:56PM +0800, Ge Yang wrote:
It seems that we cannot simply remove the folio_test_hugetlb() check. The reasons are as follows:
Yeah, my thought was whether we could move the folio_hstate() call into alloc_and_dissolve_hugetlb_folio(), since the latter really needs to take the lock. But isolate_or_dissolve_huge_page() also needs the hstate, not only to pass it on to alloc_and_dissolve_hugetlb_folio() but also to check whether the hstate is gigantic.
Umh, I kinda hate sprinkling the locks all around.
On May 22, 2025, at 19:49, Oscar Salvador osalvador@suse.de wrote:
[...]
Yeah, my thought was whether we could move the folio_hstate() call into alloc_and_dissolve_hugetlb_folio(), since the latter really needs to take the lock. But isolate_or_dissolve_huge_page() also needs the hstate, not only to pass it on to alloc_and_dissolve_hugetlb_folio() but also to check whether the hstate is gigantic.
But I think we could use "folio_order(folio) > MAX_PAGE_ORDER" to replace the hstate_is_gigantic() check, right? Then we could remove the hstate parameter from alloc_and_dissolve_hugetlb_folio() and obtain the hstate inside it.
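For reference, hstate_is_gigantic() is already defined in terms of the order, so for a hugetlb folio (whose order matches its hstate's order) the two checks should be equivalent — a sketch, assuming the current definitions:

/* include/linux/hugetlb.h (existing helper, shown for reference): */
static inline bool hstate_is_gigantic(struct hstate *h)
{
	return huge_page_order(h) > MAX_PAGE_ORDER;
}

/* Equivalent check on the folio itself, without needing the hstate: */
if (folio_order(folio) > MAX_PAGE_ORDER)
	return -ENOMEM;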
[...]
On Thu, May 22, 2025 at 08:39:39PM +0800, Muchun Song wrote:
But I think we could use "folio_order(folio) > MAX_PAGE_ORDER" to replace the hstate_is_gigantic() check, right? Then we could remove the hstate parameter from alloc_and_dissolve_hugetlb_folio() and obtain the hstate inside it.
Yes, I think we can do that. So something like the following (compile-tested only), maybe?
From d7199339e905f83b54d22849e8f21f631916ce94 Mon Sep 17 00:00:00 2001
From: Oscar Salvador <osalvador@suse.de>
Date: Thu, 22 May 2025 19:51:04 +0200
Subject: [PATCH] TMP
---
 mm/hugetlb.c | 38 +++++++++-----------------------------
 1 file changed, 9 insertions(+), 29 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd8971388236..20f08de9e37d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2787,15 +2787,13 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 /*
  * alloc_and_dissolve_hugetlb_folio - Allocate a new folio and dissolve
  * the old one
- * @h: struct hstate old page belongs to
  * @old_folio: Old folio to dissolve
  * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
-static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
-		struct folio *old_folio, struct list_head *list)
+static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio, struct list_head *list)
 {
-	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	struct hstate *h;
 	int nid = folio_nid(old_folio);
 	struct folio *new_folio = NULL;
 	int ret = 0;
@@ -2829,7 +2827,11 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 			cond_resched();
 			goto retry;
 		} else {
+			h = folio_hstate(old_folio);
+
 			if (!new_folio) {
+				gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
 				spin_unlock_irq(&hugetlb_lock);
 				new_folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
 								      nid, NULL, NULL);
@@ -2874,35 +2876,20 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,

 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list)
 {
-	struct hstate *h;
 	int ret = -EBUSY;

-	/*
-	 * The page might have been dissolved from under our feet, so make sure
-	 * to carefully check the state under the lock.
-	 * Return success when racing as if we dissolved the page ourselves.
-	 */
-	spin_lock_irq(&hugetlb_lock);
-	if (folio_test_hugetlb(folio)) {
-		h = folio_hstate(folio);
-	} else {
-		spin_unlock_irq(&hugetlb_lock);
-		return 0;
-	}
-	spin_unlock_irq(&hugetlb_lock);
-
 	/*
 	 * Fence off gigantic pages as there is a cyclic dependency between
 	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
 	 * of bailing out right away without further retrying.
 	 */
-	if (hstate_is_gigantic(h))
+	if (folio_order(folio) > MAX_PAGE_ORDER)
 		return -ENOMEM;

 	if (folio_ref_count(folio) && folio_isolate_hugetlb(folio, list))
 		ret = 0;
 	else if (!folio_ref_count(folio))
-		ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
+		ret = alloc_and_dissolve_hugetlb_folio(folio, list);

 	return ret;
 }
@@ -2916,7 +2903,6 @@ int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list)
  */
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
 {
-	struct hstate *h;
 	struct folio *folio;
 	int ret = 0;

@@ -2924,15 +2910,9 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
-		if (folio_test_hugetlb(folio)) {
-			h = folio_hstate(folio);
-		} else {
-			start_pfn++;
-			continue;
-		}

 		if (!folio_ref_count(folio)) {
-			ret = alloc_and_dissolve_hugetlb_folio(h, folio,
+			ret = alloc_and_dissolve_hugetlb_folio(folio,
 							       &isolate_list);
 			if (ret)
 				break;
--
2.49.0
On May 23, 2025, at 03:32, Oscar Salvador osalvador@suse.de wrote:
[...]
Pretty simple. The code LGTM.
Thanks.
On 2025/5/23 11:27, Muchun Song wrote:
[...]
Thanks.
The implementation of alloc_and_dissolve_hugetlb_folio() differs between kernel 6.6 and kernel 6.15. To facilitate backporting, I'm planning to submit another patch based on Oscar Salvador's suggestion.
On May 23, 2025, at 11:46, Ge Yang yangge1116@126.com wrote:
[...]
The implementation of alloc_and_dissolve_hugetlb_folio() differs between kernel 6.6 and kernel 6.15. To facilitate backporting, I'm planning to submit another patch based on Oscar Salvador's suggestion.
A separate improvement patch LGTM.
On Fri, May 23, 2025 at 11:46:51AM +0800, Ge Yang wrote:
The implementation of alloc_and_dissolve_hugetlb_folio() differs between kernel 6.6 and kernel 6.15. To facilitate backporting, I'm planning to submit another patch based on Oscar Salvador's suggestion.
If the code differs a lot between a stable version and upstream, the way to go is to submit a patch specific to the stable tree that applies there, e.g.: [PATCH 6.6]...
On 2025/5/23 13:30, Oscar Salvador wrote:
[...]
Ok, thanks.
On Thu, May 22, 2025 at 11:22:17AM +0800, yangge1116@126.com wrote:
[...]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
On 22.05.25 05:22, yangge1116@126.com wrote:
[...]
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d3ca6b..6c2e007 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,12 +2924,20 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
+
+		/*
+		 * The folio might have been dissolved from under our feet, so make sure
+		 * to carefully check the state under the lock.
+		 */
+		spin_lock_irq(&hugetlb_lock);
 		if (folio_test_hugetlb(folio)) {
 			h = folio_hstate(folio);
 		} else {
+			spin_unlock_irq(&hugetlb_lock);
 			start_pfn++;
 			continue;
 		}
+		spin_unlock_irq(&hugetlb_lock);
As mentioned elsewhere, this will grab the hugetlb_lock for each and every pfn in the range if there are no hugetlb folios (common case).
That should certainly *not* be done.
In case we see !folio_test_hugetlb(), we should just move on.
On 2025/5/26 20:41, David Hildenbrand wrote:
[...]
As mentioned elsewhere, this will grab the hugetlb_lock for each and every pfn in the range if there are no hugetlb folios (common case).
That should certainly *not* be done.
In case we see !folio_test_hugetlb(), we should just move on.
The main reason for acquiring the hugetlb_lock here is to obtain a valid hstate, as alloc_and_dissolve_hugetlb_folio() requires the hstate as a parameter. This approach is indeed not performance-friendly. However, in the patch available at https://lore.kernel.org/lkml/1747987559-23082-1-git-send-email-yangge1116@12..., all these operations will be removed.
On 26.05.25 14:57, Ge Yang wrote:
[...]
The main reason for acquiring the hugetlb_lock here is to obtain a valid hstate, as alloc_and_dissolve_hugetlb_folio() requires the hstate as a parameter. This approach is indeed not performance-friendly. However, in the patch available at https://lore.kernel.org/lkml/1747987559-23082-1-git-send-email-yangge1116@12..., all these operations will be removed.
No, you still take locks on anything that is !folio_ref_count(folio).
We should really have an early folio_test_hugetlb(folio) check.
hugetlb is on most systems out there the absolute *corner case*.
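Something along these lines (just a sketch, assuming the hstate-less alloc_and_dissolve_hugetlb_folio() from Oscar's draft, and glossing over how far to advance start_pfn):

	while (start_pfn < end_pfn) {
		folio = pfn_folio(start_pfn);

		/*
		 * Lockless fast path: most pfns are not hugetlb. The racy
		 * check is fine, because alloc_and_dissolve_hugetlb_folio()
		 * revalidates the folio state under hugetlb_lock anyway.
		 */
		if (!folio_test_hugetlb(folio)) {
			start_pfn++;
			continue;
		}

		if (!folio_ref_count(folio)) {
			ret = alloc_and_dissolve_hugetlb_folio(folio,
							       &isolate_list);
			if (ret)
				break;
		}
		start_pfn++;
	}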