From: Ge Yang <yangge1116@126.com>
A kernel crash was observed when replacing free hugetlb folios:
BUG: kernel NULL pointer dereference, address: 0000000000000028
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 28 UID: 0 PID: 29639 Comm: test_cma.sh Tainted 6.15.0-rc6-zp #41 PREEMPT(voluntary)
RIP: 0010:alloc_and_dissolve_hugetlb_folio+0x1d/0x1f0
RSP: 0018:ffffc9000b30fa90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000342cca RCX: ffffea0043000000
RDX: ffffc9000b30fb08 RSI: ffffea0043000000 RDI: 0000000000000000
RBP: ffffc9000b30fb20 R08: 0000000000001000 R09: 0000000000000000
R10: ffff88886f92eb00 R11: 0000000000000000 R12: ffffea0043000000
R13: 0000000000000000 R14: 00000000010c0200 R15: 0000000000000004
FS:  00007fcda5f14740(0000) GS:ffff8888ec1d8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000391402000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 replace_free_hugepage_folios+0xb6/0x100
 alloc_contig_range_noprof+0x18a/0x590
 ? srso_return_thunk+0x5/0x5f
 ? down_read+0x12/0xa0
 ? srso_return_thunk+0x5/0x5f
 cma_range_alloc.constprop.0+0x131/0x290
 __cma_alloc+0xcf/0x2c0
 cma_alloc_write+0x43/0xb0
 simple_attr_write_xsigned.constprop.0.isra.0+0xb2/0x110
 debugfs_attr_write+0x46/0x70
 full_proxy_write+0x62/0xa0
 vfs_write+0xf8/0x420
 ? srso_return_thunk+0x5/0x5f
 ? filp_flush+0x86/0xa0
 ? srso_return_thunk+0x5/0x5f
 ? filp_close+0x1f/0x30
 ? srso_return_thunk+0x5/0x5f
 ? do_dup2+0xaf/0x160
 ? srso_return_thunk+0x5/0x5f
 ksys_write+0x65/0xe0
 do_syscall_64+0x64/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
There is a potential race between __update_and_free_hugetlb_folio() and replace_free_hugepage_folios():
CPU1                                  CPU2
__update_and_free_hugetlb_folio       replace_free_hugepage_folios
                                        folio_test_hugetlb(folio)
                                        -- It's still a hugetlb folio.
  __folio_clear_hugetlb(folio)
  hugetlb_free_folio(folio)
                                        h = folio_hstate(folio)
                                        -- Here, h is a NULL pointer.
When this race occurs, folio_hstate(folio) returns NULL, and the subsequent dereference of that NULL pointer crashes the system. To resolve this issue, call folio_hstate(folio) under the protection of hugetlb_lock, ensuring that folio_hstate(folio) does not return NULL.
Fixes: 04f13d241b8b ("mm: replace free hugepage folios after migration")
Signed-off-by: Ge Yang <yangge1116@126.com>
Cc: stable@vger.kernel.org
---
 mm/hugetlb.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d3ca6b..6c2e007 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,12 +2924,20 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
+
+		/*
+		 * The folio might have been dissolved from under our feet, so make sure
+		 * to carefully check the state under the lock.
+		 */
+		spin_lock_irq(&hugetlb_lock);
 		if (folio_test_hugetlb(folio)) {
 			h = folio_hstate(folio);
 		} else {
+			spin_unlock_irq(&hugetlb_lock);
 			start_pfn++;
 			continue;
 		}
+		spin_unlock_irq(&hugetlb_lock);

 		if (!folio_ref_count(folio)) {
 			ret = alloc_and_dissolve_hugetlb_folio(h, folio,
On May 22, 2025, at 11:22, yangge1116@126.com wrote:
[...]
Thanks for fixing this problem. BTW, in order to catch similar problems in the future, it would be better to add a WARN_ON to folio_hstate() to assert that hugetlb_lock is held when the folio's reference count is zero. For this fix, LGTM.
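A minimal sketch of the kind of assertion I have in mind (hypothetical, not a tested patch; it assumes the current shape of folio_hstate() in include/linux/hugetlb.h, and the lockdep form is just one possible way to express it):

static inline struct hstate *folio_hstate(struct folio *folio)
{
	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio);
	/*
	 * A caller holding no reference on the folio must hold
	 * hugetlb_lock, otherwise the folio can be dissolved under it
	 * and the returned hstate would be stale (or NULL).
	 */
	if (!folio_ref_count(folio))
		lockdep_assert_held(&hugetlb_lock);
	return size_to_hstate(folio_size(folio));
}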
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks.
On Thu, May 22, 2025 at 11:47:05AM +0800, Muchun Song wrote:
Thanks for fixing this problem. BTW, in order to catch similar problems in the future, it would be better to add a WARN_ON to folio_hstate() to assert that hugetlb_lock is held when the folio's reference count is zero. For this fix, LGTM.
Why can't we put all the burden on alloc_and_dissolve_hugetlb_folio(), which will check things again under the lock? I mean, I would be OK with checking upfront in replace_free_hugepage_folios() to save cycles, but the latter has only one user, alloc_contig_range(), which is not really a function expected to be optimized.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd8971388236..b4d937732256 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,13 +2924,6 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
-		if (folio_test_hugetlb(folio)) {
-			h = folio_hstate(folio);
-		} else {
-			start_pfn++;
-			continue;
-		}
-
 		if (!folio_ref_count(folio)) {
 			ret = alloc_and_dissolve_hugetlb_folio(h, folio,
 							       &isolate_list);
On May 22, 2025, at 13:34, Oscar Salvador osalvador@suse.de wrote:
[...]
Why can't we put all the burden on alloc_and_dissolve_hugetlb_folio(), which will check things again under the lock?
I've also considered this choice, because there is another similar case in isolate_or_dissolve_huge_page() that could benefit from this change. I am fine with both approaches. Anyway, adding an assertion to folio_hstate() is an improvement for catching invalid users in the future, because any user of folio_hstate() should hold a reference to the folio or hold the hugetlb_lock to make sure it returns a valid hstate for a hugetlb folio.
Muchun,
Thanks.
[...]
On Thu, May 22, 2025 at 03:13:31PM +0800, Muchun Song wrote:
[...]
I've also considered this choice, because there is another similar case in isolate_or_dissolve_huge_page() that could benefit from this change. I am fine with both approaches. Anyway, adding an assertion to folio_hstate() is an improvement for catching invalid users in the future, because any user of folio_hstate() should hold a reference to the folio or hold the hugetlb_lock to make sure it returns a valid hstate for a hugetlb folio.
Yes, I am not arguing about that, it makes perfect sense to me, but I am just kinda against these micro-optimizations that take the lock to check things when we are calling into a function that will take the lock again to check things.
Actually, I think the folio_test_hugetlb() check in replace_free_hugepage_folios() was put there to try to be smart and save cycles in case we were not dealing with a hugetlb page (so it was freed under us). Now that we have learnt that we cannot do that without 1) taking a refcount or 2) holding the lock, it becomes superfluous, so I would just wipe it out.
On 2025/5/22 13:34, Oscar Salvador wrote:
[...]
Why can't we put all the burden on alloc_and_dissolve_hugetlb_folio(), which will check things again under the lock? I mean, I would be OK with checking upfront in replace_free_hugepage_folios() to save cycles, but the latter has only one user, alloc_contig_range(), which is not really a function expected to be optimized.
[...]
It seems that we cannot simply remove the folio_test_hugetlb() check. The reasons are as follows:
1) If we remove it, we will be unable to obtain the hstate corresponding to the folio, and consequently we won't be able to call alloc_and_dissolve_hugetlb_folio().
2) The alloc_and_dissolve_hugetlb_folio() function is also called from isolate_or_dissolve_huge_folio(). However, the folio_test_hugetlb() check in isolate_or_dissolve_huge_folio() cannot be removed.
On Thu, May 22, 2025 at 07:34:56PM +0800, Ge Yang wrote:
It seems that we cannot simply remove the folio_test_hugetlb() check. The reasons are as follows:
Yeah, my thought was whether we could move the folio_hstate() call into alloc_and_dissolve_hugetlb_folio(), since the latter really needs to take the lock. But isolate_or_dissolve_huge_page() also needs the hstate, not only to pass it on to alloc_and_dissolve_hugetlb_folio() but also to check whether the hstate is gigantic.
Umh, I kinda hate sprinkling the locks all around.
On May 22, 2025, at 19:49, Oscar Salvador osalvador@suse.de wrote:
[...]
Yeah, my thought was whether we could move the folio_hstate() call into alloc_and_dissolve_hugetlb_folio(), since the latter really needs to take the lock. But isolate_or_dissolve_huge_page() also needs the hstate, not only to pass it on to alloc_and_dissolve_hugetlb_folio() but also to check whether the hstate is gigantic.
But I think we could use "folio_order(folio) > MAX_PAGE_ORDER" to replace the hstate_is_gigantic() check, right? Then we could remove the hstate parameter from alloc_and_dissolve_hugetlb_folio() and obtain the hstate inside it.
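For reference, hstate_is_gigantic() is already defined in terms of the order, so for a hugetlb folio (whose order matches its hstate's order) the two checks should be equivalent — a sketch, assuming the current definitions:

/* include/linux/hugetlb.h (existing helper, shown for reference): */
static inline bool hstate_is_gigantic(struct hstate *h)
{
	return huge_page_order(h) > MAX_PAGE_ORDER;
}

/* Equivalent check on the folio itself, without needing the hstate: */
if (folio_order(folio) > MAX_PAGE_ORDER)
	return -ENOMEM;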
[...]
On Thu, May 22, 2025 at 08:39:39PM +0800, Muchun Song wrote:
But I think we could use "folio_order(folio) > MAX_PAGE_ORDER" to replace the hstate_is_gigantic() check, right? Then we could remove the hstate parameter from alloc_and_dissolve_hugetlb_folio() and obtain the hstate inside it.
Yes, I think we can do that. So something like the following (compile-tested only), maybe?
From d7199339e905f83b54d22849e8f21f631916ce94 Mon Sep 17 00:00:00 2001
From: Oscar Salvador <osalvador@suse.de>
Date: Thu, 22 May 2025 19:51:04 +0200
Subject: [PATCH] TMP
---
 mm/hugetlb.c | 38 +++++++++-----------------------------
 1 file changed, 9 insertions(+), 29 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd8971388236..20f08de9e37d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2787,15 +2787,13 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 /*
  * alloc_and_dissolve_hugetlb_folio - Allocate a new folio and dissolve
  * the old one
- * @h: struct hstate old page belongs to
  * @old_folio: Old folio to dissolve
  * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
-static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
-		struct folio *old_folio, struct list_head *list)
+static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio, struct list_head *list)
 {
-	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	struct hstate *h;
 	int nid = folio_nid(old_folio);
 	struct folio *new_folio = NULL;
 	int ret = 0;
@@ -2829,7 +2827,11 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 			cond_resched();
 			goto retry;
 		} else {
+			h = folio_hstate(old_folio);
+
 			if (!new_folio) {
+				gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
 				spin_unlock_irq(&hugetlb_lock);
 				new_folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
 								      nid, NULL, NULL);
@@ -2874,35 +2876,20 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,

 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list)
 {
-	struct hstate *h;
 	int ret = -EBUSY;

-	/*
-	 * The page might have been dissolved from under our feet, so make sure
-	 * to carefully check the state under the lock.
-	 * Return success when racing as if we dissolved the page ourselves.
-	 */
-	spin_lock_irq(&hugetlb_lock);
-	if (folio_test_hugetlb(folio)) {
-		h = folio_hstate(folio);
-	} else {
-		spin_unlock_irq(&hugetlb_lock);
-		return 0;
-	}
-	spin_unlock_irq(&hugetlb_lock);
-
 	/*
 	 * Fence off gigantic pages as there is a cyclic dependency between
 	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
 	 * of bailing out right away without further retrying.
 	 */
-	if (hstate_is_gigantic(h))
+	if (folio_order(folio) > MAX_PAGE_ORDER)
 		return -ENOMEM;

 	if (folio_ref_count(folio) && folio_isolate_hugetlb(folio, list))
 		ret = 0;
 	else if (!folio_ref_count(folio))
-		ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
+		ret = alloc_and_dissolve_hugetlb_folio(folio, list);

 	return ret;
 }
@@ -2916,7 +2903,6 @@ int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list)
  */
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
 {
-	struct hstate *h;
 	struct folio *folio;
 	int ret = 0;

@@ -2924,15 +2910,9 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
-		if (folio_test_hugetlb(folio)) {
-			h = folio_hstate(folio);
-		} else {
-			start_pfn++;
-			continue;
-		}

 		if (!folio_ref_count(folio)) {
-			ret = alloc_and_dissolve_hugetlb_folio(h, folio,
+			ret = alloc_and_dissolve_hugetlb_folio(folio,
 							       &isolate_list);
 			if (ret)
 				break;
--
2.49.0
On May 23, 2025, at 03:32, Oscar Salvador osalvador@suse.de wrote:
[...]
Pretty simple. The code LGTM.
Thanks.
On 2025/5/23 11:27, Muchun Song wrote:
[...]
Thanks.
The implementation of alloc_and_dissolve_hugetlb_folio() differs between kernel 6.6 and kernel 6.15. To facilitate backporting, I'm planning to submit another patch based on Oscar Salvador's suggestion.
On May 23, 2025, at 11:46, Ge Yang yangge1116@126.com wrote:
[...]
The implementation of alloc_and_dissolve_hugetlb_folio() differs between kernel 6.6 and kernel 6.15. To facilitate backporting, I'm planning to submit another patch based on Oscar Salvador's suggestion.
A separate improvement patch LGTM.
On Fri, May 23, 2025 at 11:46:51AM +0800, Ge Yang wrote:
The implementation of alloc_and_dissolve_hugetlb_folio() differs between kernel 6.6 and kernel 6.15. To facilitate backporting, I'm planning to submit another patch based on Oscar Salvador's suggestion.
If the code differs a lot between a stable version and upstream, the way to go is to submit a patch specific to the stable tree that applies there, e.g.: [PATCH 6.6]...
On 2025/5/23 13:30, Oscar Salvador wrote:
[...]
Ok, thanks.
On Thu, May 22, 2025 at 11:22:17AM +0800, yangge1116@126.com wrote:
[...]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
On 22.05.25 05:22, yangge1116@126.com wrote:
[...]
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d3ca6b..6c2e007 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,12 +2924,20 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)

 	while (start_pfn < end_pfn) {
 		folio = pfn_folio(start_pfn);
+
+		/*
+		 * The folio might have been dissolved from under our feet, so make sure
+		 * to carefully check the state under the lock.
+		 */
+		spin_lock_irq(&hugetlb_lock);
 		if (folio_test_hugetlb(folio)) {
 			h = folio_hstate(folio);
 		} else {
+			spin_unlock_irq(&hugetlb_lock);
 			start_pfn++;
 			continue;
 		}
+		spin_unlock_irq(&hugetlb_lock);
As mentioned elsewhere, this will grab the hugetlb_lock for each and every pfn in the range if there are no hugetlb folios (common case).
That should certainly *not* be done.
In case we see !folio_test_hugetlb(), we should just move on.
On 2025/5/26 20:41, David Hildenbrand wrote:
[...]
As mentioned elsewhere, this will grab the hugetlb_lock for each and every pfn in the range if there are no hugetlb folios (common case).
That should certainly *not* be done.
In case we see !folio_test_hugetlb(), we should just move on.
The main reason for acquiring the hugetlb_lock here is to obtain a valid hstate, as alloc_and_dissolve_hugetlb_folio() requires the hstate as a parameter. This approach is indeed not performance-friendly. However, in the patch available at https://lore.kernel.org/lkml/1747987559-23082-1-git-send-email-yangge1116@12..., all these operations will be removed.
On 26.05.25 14:57, Ge Yang wrote:
[...]
The main reason for acquiring the hugetlb_lock here is to obtain a valid hstate, as alloc_and_dissolve_hugetlb_folio() requires the hstate as a parameter. This approach is indeed not performance-friendly. However, in the patch available at https://lore.kernel.org/lkml/1747987559-23082-1-git-send-email-yangge1116@12..., all these operations will be removed.
No, you still take locks on anything that is !folio_ref_count(folio).
We should really have an early folio_test_hugetlb(folio) check.
hugetlb is on most systems out there the absolute *corner case*.
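Something along these lines (just a sketch, assuming the hstate-less alloc_and_dissolve_hugetlb_folio() from Oscar's draft, and glossing over how far to advance start_pfn):

	while (start_pfn < end_pfn) {
		folio = pfn_folio(start_pfn);

		/*
		 * Lockless fast path: most pfns are not hugetlb. The racy
		 * check is fine, because alloc_and_dissolve_hugetlb_folio()
		 * revalidates the folio state under hugetlb_lock anyway.
		 */
		if (!folio_test_hugetlb(folio)) {
			start_pfn++;
			continue;
		}

		if (!folio_ref_count(folio)) {
			ret = alloc_and_dissolve_hugetlb_folio(folio,
							       &isolate_list);
			if (ret)
				break;
		}
		start_pfn++;
	}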