From: Lance Yang lance.yang@linux.dev
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table.
While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch().
The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios.
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
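(Editorial aside: the boundary math above can be sanity-checked outside the kernel. The sketch below is a stand-alone user-space toy, not kernel code: PAGE_SHIFT, PMD_SIZE and a pmd_addr_end()-style clamp are redefined locally under x86_64-style assumptions (4 KiB pages, 2 MiB PMDs), and the addresses and folio size are invented for illustration.)

/* Toy model of the batch-size cap; constants and numbers are made up,
 * not taken from kernel headers. */
#include <stdio.h>

#define PAGE_SHIFT      12
#define PMD_SIZE        (1UL << 21)
#define PMD_MASK        (~(PMD_SIZE - 1))

/* Mimics pmd_addr_end(): end of the current PMD range, clamped to 'end'. */
static unsigned long pmd_addr_end(unsigned long addr, unsigned long end)
{
        unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

        return (boundary - 1 < end - 1) ? boundary : end;
}

int main(void)
{
        /* Invented example: a 64-page folio whose first mapped PTE sits
         * 16 pages before a PMD boundary, with plenty of VMA left. */
        unsigned long vm_end = 0x7f0000400000UL;
        unsigned long addr   = 0x7f00001f0000UL;
        unsigned int folio_nr_pages = 64;

        unsigned long end_addr = pmd_addr_end(addr, vm_end);
        unsigned int max_nr = (end_addr - addr) >> PAGE_SHIFT;

        /* The old code batched folio_nr_pages PTEs and could run past the
         * end of the PTE table; the cap stops at the table boundary. */
        printf("uncapped: %u pages, capped: %u pages\n",
               folio_nr_pages, max_nr);
        return 0;
}

With these made-up numbers the program prints "uncapped: 64 pages, capped: 16 pages", i.e. only the 16 PTEs left in the current page table are batched.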
Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand david@redhat.com
Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Suggested-by: Barry Song baohua@kernel.org
Acked-by: Barry Song baohua@kernel.org
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
---
v3 -> v4:
- Add Reported-by + Closes tags (per David)
- Pick RB from Lorenzo - thanks!
- Pick AB from David - thanks!
- https://lore.kernel.org/linux-mm/20250630011305.23754-1-lance.yang@linux.dev

v2 -> v3:
- Tweak changelog (per Barry and David)
- Pick AB from Barry - thanks!
- https://lore.kernel.org/linux-mm/20250627062319.84936-1-lance.yang@linux.dev

v1 -> v2:
- Update subject and changelog (per Barry)
- https://lore.kernel.org/linux-mm/20250627025214.30887-1-lance.yang@linux.dev

 mm/rmap.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1845,23 +1845,32 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
 #endif
 }
 
-/* We support batch unmapping of PTEs for lazyfree large folios */
-static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
-                        struct folio *folio, pte_t *ptep)
+static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
+                        struct page_vma_mapped_walk *pvmw,
+                        enum ttu_flags flags, pte_t pte)
 {
         const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
-        int max_nr = folio_nr_pages(folio);
-        pte_t pte = ptep_get(ptep);
+        unsigned long end_addr, addr = pvmw->address;
+        struct vm_area_struct *vma = pvmw->vma;
+        unsigned int max_nr;
+
+        if (flags & TTU_HWPOISON)
+                return 1;
+        if (!folio_test_large(folio))
+                return 1;
 
+        /* We may only batch within a single VMA and a single page table. */
+        end_addr = pmd_addr_end(addr, vma->vm_end);
+        max_nr = (end_addr - addr) >> PAGE_SHIFT;
+
+        /* We only support lazyfree batching for now ... */
         if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
-                return false;
+                return 1;
         if (pte_unused(pte))
-                return false;
-        if (pte_pfn(pte) != folio_pfn(folio))
-                return false;
+                return 1;
 
-        return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
-                               NULL, NULL) == max_nr;
+        return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags,
+                               NULL, NULL, NULL);
 }
 
 /*
@@ -2024,9 +2033,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         if (pte_dirty(pteval))
                                 folio_mark_dirty(folio);
                 } else if (likely(pte_present(pteval))) {
-                        if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
-                            can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
-                                nr_pages = folio_nr_pages(folio);
+                        nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
                         end_addr = address + nr_pages * PAGE_SIZE;
                         flush_cache_range(vma, address, end_addr);
 
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         hugetlb_remove_rmap(folio);
                 } else {
                         folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-                        folio_ref_sub(folio, nr_pages - 1);
                 }
                 if (vma->vm_flags & VM_LOCKED)
                         mlock_drain_local();
-                folio_put(folio);
-                /* We have already batched the entire folio */
-                if (nr_pages > 1)
+                folio_put_refs(folio, nr_pages);
+
+                /*
+                 * If we are sure that we batched the entire folio and cleared
+                 * all PTEs, we can just optimize and stop right here.
+                 */
+                if (nr_pages == folio_nr_pages(folio))
                         goto walk_done;
                 continue;
 walk_abort:
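(Editorial aside on the walk_done condition above: once partial batches are allowed, nr_pages > 1 no longer implies the whole folio has been unmapped, so the early exit now checks nr_pages == folio_nr_pages(folio). A throwaway user-space illustration with invented numbers:)

/* Toy illustration of why the early-exit test changed from (nr_pages > 1)
 * to (nr_pages == folio_nr_pages). All values are made up. */
#include <stdio.h>
#include <stdbool.h>

int main(void)
{
        unsigned int folio_nr_pages = 64;       /* large folio */
        unsigned int nr_pages = 16;             /* partial batch at a PMD boundary */

        bool old_stop = nr_pages > 1;                   /* would stop too early */
        bool new_stop = nr_pages == folio_nr_pages;     /* keeps walking */

        printf("old condition stops walk: %s (leaves %u PTEs mapped)\n",
               old_stop ? "yes" : "no", folio_nr_pages - nr_pages);
        printf("new condition stops walk: %s\n", new_stop ? "yes" : "no");
        return 0;
}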
On Tue, 1 Jul 2025 22:31:00 +0800 Lance Yang ioworker0@gmail.com wrote:
- Add Reported-by + Closes tags (per David)
- Pick RB from Lorenzo - thanks!
- Pick AB from David - thanks!
It generally isn't necessary to resend a patch to add these things - I update the changelog in place as they come in.
In this case I'll grab that Reported-by: and Closes:, thanks.
On 2025/7/2 05:17, Andrew Morton wrote:
On Tue, 1 Jul 2025 22:31:00 +0800 Lance Yang ioworker0@gmail.com wrote:
- Add Reported-by + Closes tags (per David)
- Pick RB from Lorenzo - thanks!
- Pick AB from David - thanks!
It generally isn't necessary to resend a patch to add these things - I update the changelog in place as they come in.
In this case I'll grab that Reported-by: and Closes:, thanks.
Ah, good to know that. Thanks for adding these tags!
On Tue, Jul 01, 2025 at 10:31:00PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table.
While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch().
The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios.
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand david@redhat.com
Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Suggested-by: Barry Song baohua@kernel.org
Acked-by: Barry Song baohua@kernel.org
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
LGTM, Reviewed-by: Harry Yoo harry.yoo@oracle.com
With a minor comment below.
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         hugetlb_remove_rmap(folio);
                 } else {
                         folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-                        folio_ref_sub(folio, nr_pages - 1);
                 }
                 if (vma->vm_flags & VM_LOCKED)
                         mlock_drain_local();
-                folio_put(folio);
-                /* We have already batched the entire folio */
-                if (nr_pages > 1)
+                folio_put_refs(folio, nr_pages);
+
+                /*
+                 * If we are sure that we batched the entire folio and cleared
+                 * all PTEs, we can just optimize and stop right here.
+                 */
+                if (nr_pages == folio_nr_pages(folio))
                         goto walk_done;
Just a minor comment.
We should probably teach page_vma_mapped_walk() to skip nr_pages pages, or just rely on next_pte: do { ... } while (pte_none(ptep_get(pvmw->pte))) loop in page_vma_mapped_walk() to skip those ptes?
Taking different paths depending on (nr_pages == folio_nr_pages(folio)) doesn't seem sensible.
On 2025/7/7 13:40, Harry Yoo wrote:
On Tue, Jul 01, 2025 at 10:31:00PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table.
While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch().
The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios.
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand david@redhat.com
Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Suggested-by: Barry Song baohua@kernel.org
Acked-by: Barry Song baohua@kernel.org
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
LGTM, Reviewed-by: Harry Yoo harry.yoo@oracle.com
Hi Harry,
Thanks for taking time to review!
With a minor comment below.
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         hugetlb_remove_rmap(folio);
                 } else {
                         folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-                        folio_ref_sub(folio, nr_pages - 1);
                 }
                 if (vma->vm_flags & VM_LOCKED)
                         mlock_drain_local();
-                folio_put(folio);
-                /* We have already batched the entire folio */
-                if (nr_pages > 1)
+                folio_put_refs(folio, nr_pages);
+
+                /*
+                 * If we are sure that we batched the entire folio and cleared
+                 * all PTEs, we can just optimize and stop right here.
+                 */
+                if (nr_pages == folio_nr_pages(folio))
                         goto walk_done;
Just a minor comment.
We should probably teach page_vma_mapped_walk() to skip nr_pages pages, or just rely on next_pte: do { ... } while (pte_none(ptep_get(pvmw->pte))) loop in page_vma_mapped_walk() to skip those ptes?
Good point. We handle partially-mapped folios by relying on the "next_pte" loop to skip those ptes. The common case we expect to handle is fully-mapped folios.
Taking different paths depending on (nr_pages == folio_nr_pages(folio)) doesn't seem sensible.
Adding more logic to page_vma_mapped_walk() for the rare partial-folio case seems like an over-optimization that would complicate the walker.
So, I'd prefer to keep it as is for now ;)
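(Editorial aside: the point about the existing next_pte loop can be modelled outside the kernel. The sketch below is a stand-alone toy, not page_vma_mapped_walk() itself: a PTE is just an int, and a pte_none()-style check skips the entries a partial batch has already cleared, so the walker resumes at the first remaining mapping without knowing the batch size.)

/* Toy model of the "next_pte" skip; everything here is invented for
 * illustration. */
#include <stdio.h>
#include <stdbool.h>

#define NR_PTES 8

static bool pte_none(int pte)
{
        return pte == 0;
}

int main(void)
{
        int ptes[NR_PTES] = { 1, 1, 1, 1, 1, 1, 1, 1 };  /* all mapped */
        int batch = 3;                  /* cleared by one batched unmap */
        int i;

        for (i = 0; i < batch; i++)
                ptes[i] = 0;

        /* Skip the now-empty entries, as the walker's next_pte loop would. */
        for (i = 0; i < NR_PTES && pte_none(ptes[i]); i++)
                ;

        printf("walk resumes at pte index %d of %d\n", i, NR_PTES);
        return 0;
}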
On Mon, Jul 7, 2025 at 1:40 PM Harry Yoo harry.yoo@oracle.com wrote:
On Tue, Jul 01, 2025 at 10:31:00PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table.
While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch().
The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios.
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand david@redhat.com
Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Suggested-by: Barry Song baohua@kernel.org
Acked-by: Barry Song baohua@kernel.org
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
LGTM, Reviewed-by: Harry Yoo harry.yoo@oracle.com
With a minor comment below.
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         hugetlb_remove_rmap(folio);
                 } else {
                         folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-                        folio_ref_sub(folio, nr_pages - 1);
                 }
                 if (vma->vm_flags & VM_LOCKED)
                         mlock_drain_local();
-                folio_put(folio);
-                /* We have already batched the entire folio */
-                if (nr_pages > 1)
+                folio_put_refs(folio, nr_pages);
+
+                /*
+                 * If we are sure that we batched the entire folio and cleared
+                 * all PTEs, we can just optimize and stop right here.
+                 */
+                if (nr_pages == folio_nr_pages(folio))
                         goto walk_done;
Just a minor comment.
We should probably teach page_vma_mapped_walk() to skip nr_pages pages, or just rely on next_pte: do { ... } while (pte_none(ptep_get(pvmw->pte))) loop in page_vma_mapped_walk() to skip those ptes?
Taking different paths depending on (nr_pages == folio_nr_pages(folio)) doesn't seem sensible.
Hi Harry,
I believe we've already had this discussion here: https://lore.kernel.org/linux-mm/5db6fb4c-079d-4237-80b3-637565457f39@redhat...
My main point is that nr_pages = folio_nr_pages(folio) is the typical/common case. Also, modifying page_vma_mapped_walk() feels like a layering violation.
--
Cheers,
Harry / Hyeonggon
Thanks
Barry
On Mon, Jul 07, 2025 at 11:40:55PM +0800, Barry Song wrote:
On Mon, Jul 7, 2025 at 1:40 PM Harry Yoo harry.yoo@oracle.com wrote:
On Tue, Jul 01, 2025 at 10:31:00PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table.
While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch().
The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios.
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand david@redhat.com
Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat...
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Suggested-by: Barry Song baohua@kernel.org
Acked-by: Barry Song baohua@kernel.org
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Acked-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
LGTM, Reviewed-by: Harry Yoo harry.yoo@oracle.com
With a minor comment below.
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         hugetlb_remove_rmap(folio);
                 } else {
                         folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-                        folio_ref_sub(folio, nr_pages - 1);
                 }
                 if (vma->vm_flags & VM_LOCKED)
                         mlock_drain_local();
-                folio_put(folio);
-                /* We have already batched the entire folio */
-                if (nr_pages > 1)
+                folio_put_refs(folio, nr_pages);
+
+                /*
+                 * If we are sure that we batched the entire folio and cleared
+                 * all PTEs, we can just optimize and stop right here.
+                 */
+                if (nr_pages == folio_nr_pages(folio))
                         goto walk_done;
Just a minor comment.
We should probably teach page_vma_mapped_walk() to skip nr_pages pages, or just rely on next_pte: do { ... } while (pte_none(ptep_get(pvmw->pte))) loop in page_vma_mapped_walk() to skip those ptes?
Taking different paths depending on (nr_pages == folio_nr_pages(folio)) doesn't seem sensible.
Hi Harry,
Hi Lance and Barry.
I believe we've already had this discussion here: https://lore.kernel.org/linux-mm/5db6fb4c-079d-4237-80b3-637565457f39@redhat...
My main point is that nr_pages = folio_nr_pages(folio) is the typical/common case. Also, modifying page_vma_mapped_walk() feels like a layering violation.
Agreed. Perhaps it's not worth the trouble, nevermind :)
The patch looks good to me as-is.
Hello:
This patch was applied to riscv/linux.git (fixes) by Andrew Morton akpm@linux-foundation.org:
On Tue, 1 Jul 2025 22:31:00 +0800 you wrote:
From: Lance Yang lance.yang@linux.dev
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table.
While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch().
[...]
Here is the summary with links:
  - [v4,1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
    https://git.kernel.org/riscv/c/ddd05742b45b
You are awesome, thank you!