From: Lance Yang lance.yang@linux.dev
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption.
Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked.
Cc: stable@vger.kernel.org
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Suggested-by: Dev Jain dev.jain@arm.com
Acked-by: David Hildenbrand david@redhat.com
Reviewed-by: Dev Jain dev.jain@arm.com
Signed-off-by: Lance Yang lance.yang@linux.dev
---
v4 -> v5:
- Move ptep_get() call after the !pvmw.pte check, which handles PMD-mapped THP migration entries.
- https://lore.kernel.org/linux-mm/20250930071053.36158-1-lance.yang@linux.dev...

v3 -> v4:
- Minor formatting tweak in try_to_map_unused_to_zeropage() function signature (per David and Dev)
- Collect Reviewed-by from Dev - thanks!
- https://lore.kernel.org/linux-mm/20250930060557.85133-1-lance.yang@linux.dev...

v2 -> v3:
- ptep_get() gets called only once per iteration (per Dev)
- https://lore.kernel.org/linux-mm/20250930043351.34927-1-lance.yang@linux.dev...

v1 -> v2:
- Avoid calling ptep_get() multiple times (per Dev)
- Double-check the uffd-wp bit (per David)
- Collect Acked-by from David - thanks!
- https://lore.kernel.org/linux-mm/20250928044855.76359-1-lance.yang@linux.dev...

 mm/migrate.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index ce83c2c3c287..e3065c9edb55 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -296,8 +296,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
 }
 
 static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
-					  struct folio *folio,
-					  unsigned long idx)
+					  struct folio *folio, pte_t old_pte, unsigned long idx)
 {
 	struct page *page = folio_page(folio, idx);
 	pte_t newpte;
@@ -306,7 +305,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 		return false;
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
+	VM_BUG_ON_PAGE(pte_present(old_pte), page);
 
 	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
 	    mm_forbids_zeropage(pvmw->vma->vm_mm))
@@ -322,6 +321,12 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 
 	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
 					pvmw->vma->vm_page_prot));
+
+	if (pte_swp_soft_dirty(old_pte))
+		newpte = pte_mksoft_dirty(newpte);
+	if (pte_swp_uffd_wp(old_pte))
+		newpte = pte_mkuffd_wp(newpte);
+
 	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
 	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
@@ -364,13 +369,13 @@ static bool remove_migration_pte(struct folio *folio,
 			continue;
 		}
 #endif
+		old_pte = ptep_get(pvmw.pte);
 		if (rmap_walk_arg->map_unused_to_zeropage &&
-		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+		    try_to_map_unused_to_zeropage(&pvmw, folio, old_pte, idx))
			continue;
 
 		folio_get(folio);
 		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
-		old_pte = ptep_get(pvmw.pte);
 
 		entry = pte_to_swp_entry(old_pte);
 		if (!is_migration_entry_young(entry))
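[Editorial aside] For background on the soft-dirty mechanism the commit message refers to: userspace trackers clear the per-process soft-dirty bits by writing "4" to /proc/<pid>/clear_refs and then read bit 55 of each /proc/<pid>/pagemap entry to see which pages were written since the last snapshot. Below is a minimal sketch of that flow (not CRIU itself); it assumes a kernel built with CONFIG_MEM_SOFT_DIRTY and omits most error handling.

/*
 * Minimal soft-dirty sketch: clear the bits, dirty a page, then check
 * bit 55 of the page's /proc/self/pagemap entry.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int soft_dirty(void *addr)
{
	long psize = sysconf(_SC_PAGESIZE);
	off_t off = ((uintptr_t)addr / psize) * sizeof(uint64_t);
	uint64_t ent = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	pread(fd, &ent, sizeof(ent), off);
	close(fd);
	return (ent >> 55) & 1;		/* bit 55: PTE is soft-dirty */
}

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *p = aligned_alloc(psize, psize);
	int fd = open("/proc/self/clear_refs", O_WRONLY);

	p[0] = 1;			/* make sure the page is mapped */
	write(fd, "4", 1);		/* "4": clear soft-dirty for all VMAs */
	close(fd);
	printf("after clear: %d\n", soft_dirty(p));
	p[0] = 2;			/* a write sets soft-dirty again */
	printf("after write: %d\n", soft_dirty(p));
	return 0;
}

A page whose mapping is silently replaced without carrying the bit over would show up as "clean" here even though its contents changed, which is exactly the incremental-snapshot breakage described above.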
On 30 Sep 2025, at 4:10, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption.
Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked.
Cc: stable@vger.kernel.org
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Suggested-by: Dev Jain dev.jain@arm.com
Acked-by: David Hildenbrand david@redhat.com
Reviewed-by: Dev Jain dev.jain@arm.com
Signed-off-by: Lance Yang lance.yang@linux.dev
v4 -> v5:
- Move ptep_get() call after the !pvmw.pte check, which handles PMD-mapped THP migration entries.
- https://lore.kernel.org/linux-mm/20250930071053.36158-1-lance.yang@linux.dev...
v3 -> v4:
- Minor formatting tweak in try_to_map_unused_to_zeropage() function signature (per David and Dev)
- Collect Reviewed-by from Dev - thanks!
- https://lore.kernel.org/linux-mm/20250930060557.85133-1-lance.yang@linux.dev...
v2 -> v3:
- ptep_get() gets called only once per iteration (per Dev)
- https://lore.kernel.org/linux-mm/20250930043351.34927-1-lance.yang@linux.dev...
v1 -> v2:
- Avoid calling ptep_get() multiple times (per Dev)
- Double-check the uffd-wp bit (per David)
- Collect Acked-by from David - thanks!
- https://lore.kernel.org/linux-mm/20250928044855.76359-1-lance.yang@linux.dev...
mm/migrate.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
Acked-by: Zi Yan ziy@nvidia.com
Best Regards, Yan, Zi
On Tue, Sep 30, 2025 at 04:10:40PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption.
Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked.
Cc: stable@vger.kernel.org
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Suggested-by: Dev Jain dev.jain@arm.com
Acked-by: David Hildenbrand david@redhat.com
Reviewed-by: Dev Jain dev.jain@arm.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Looks good to me, Reviewed-by: Harry Yoo harry.yoo@oracle.com
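[Editorial aside] For the uffd-wp side the commit message mentions, the userspace write-protect flow looks roughly like the sketch below. It is illustrative only: fault handling and most error checking are omitted, and it assumes a kernel exposing UFFD_FEATURE_PAGEFAULT_FLAG_WP.

/*
 * Rough userfaultfd write-protect sketch: register a range in WP mode and
 * arm write protection. From then on, a write to the range is reported to
 * the monitor as a WP fault; a PTE that silently loses its uffd-wp bit
 * (the bug fixed by this patch) never faults, so the write is missed.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd_wp_range(void *addr, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;
	return uffd;	/* the monitor read()s struct uffd_msg events from this fd */
}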
* Lance Yang lance.yang@linux.dev [250930 04:13]:
From: Lance Yang lance.yang@linux.dev
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption.
Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked.
Cc: stable@vger.kernel.org
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Suggested-by: Dev Jain dev.jain@arm.com
Acked-by: David Hildenbrand david@redhat.com
Reviewed-by: Dev Jain dev.jain@arm.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Reviewed-by: Liam R. Howlett Liam.Howlett@oracle.com
v4 -> v5:
- Move ptep_get() call after the !pvmw.pte check, which handles PMD-mapped THP migration entries.
- https://lore.kernel.org/linux-mm/20250930071053.36158-1-lance.yang@linux.dev...
v3 -> v4:
- Minor formatting tweak in try_to_map_unused_to_zeropage() function signature (per David and Dev)
- Collect Reviewed-by from Dev - thanks!
- https://lore.kernel.org/linux-mm/20250930060557.85133-1-lance.yang@linux.dev...
v2 -> v3:
- ptep_get() gets called only once per iteration (per Dev)
- https://lore.kernel.org/linux-mm/20250930043351.34927-1-lance.yang@linux.dev...
v1 -> v2:
- Avoid calling ptep_get() multiple times (per Dev)
- Double-check the uffd-wp bit (per David)
- Collect Acked-by from David - thanks!
- https://lore.kernel.org/linux-mm/20250928044855.76359-1-lance.yang@linux.dev...
mm/migrate.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
Feels like the mTHP implementation is hitting up against the buffers of the THP code being something of a mess... :)
On Tue, Sep 30, 2025 at 04:10:40PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for
It's slightly by-the-by, but CRIU in my view - as it relies on kernel implementation details that can always change to operate - is not actually something we have to strictly keep working.
HOWEVER, if we can reasonably do so without causing issues for us in the kernel we ought to do so.
incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption.
Again, uffd-wp is a total mess. We shouldn't be in a position where its state being correctly retained relies on everybody always getting the subtle, uncommented, open-coded details right everywhere all the time.
But this is again a general comment... :)
Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked.
Cc: stable@vger.kernel.org
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Suggested-by: Dev Jain dev.jain@arm.com
Acked-by: David Hildenbrand david@redhat.com
Reviewed-by: Dev Jain dev.jain@arm.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Overall LGTM, so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
v4 -> v5:
- Move ptep_get() call after the !pvmw.pte check, which handles PMD-mapped THP migration entries.
- https://lore.kernel.org/linux-mm/20250930071053.36158-1-lance.yang@linux.dev...
v3 -> v4:
- Minor formatting tweak in try_to_map_unused_to_zeropage() function signature (per David and Dev)
- Collect Reviewed-by from Dev - thanks!
- https://lore.kernel.org/linux-mm/20250930060557.85133-1-lance.yang@linux.dev...
v2 -> v3:
- ptep_get() gets called only once per iteration (per Dev)
- https://lore.kernel.org/linux-mm/20250930043351.34927-1-lance.yang@linux.dev...
v1 -> v2:
- Avoid calling ptep_get() multiple times (per Dev)
- Double-check the uffd-wp bit (per David)
- Collect Acked-by from David - thanks!
- https://lore.kernel.org/linux-mm/20250928044855.76359-1-lance.yang@linux.dev...
mm/migrate.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index ce83c2c3c287..e3065c9edb55 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -296,8 +296,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
 }
 
 static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
-					  struct folio *folio,
-					  unsigned long idx)
+					  struct folio *folio, pte_t old_pte, unsigned long idx)
 {
 	struct page *page = folio_page(folio, idx);
 	pte_t newpte;
@@ -306,7 +305,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 		return false;
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
+	VM_BUG_ON_PAGE(pte_present(old_pte), page);
Kinda ugly that we pass old_pte when it's available via the shared state object, but probably nothing to really concern ourselves about.
Guess you could argue it both ways :)
It'd be good to convert these VM_BUG_ON_*() to VM_WARN_ON_*() but I guess that's somewhat out of the scope of the code here and would be inconsistent to change it for just one condition.
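[Editorial aside] If someone did pursue that suggestion (purely hypothetical, not part of this patch), the conversion would presumably be a one-liner along the lines of:

	/* Hypothetical follow-up, not in this patch: warn once instead of BUG. */
	VM_WARN_ON_ONCE_PAGE(pte_present(old_pte), page);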
Thanks for the super energetic review!
On 2025/10/14 19:19, Lorenzo Stoakes wrote:
Feels like the mTHP implementation is hitting up against the buffers of the THP code being something of a mess... :)
Haha, yeah, it really feels that way sometimes ;)
On Tue, Sep 30, 2025 at 04:10:40PM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for
It's slightly by-the-by, but CRIU in my view - as it relies on kernel implementation details that can always change to operate - is not actually something we have to strictly keep working.
HOWEVER, if we can reasonably do so without causing issues for us in the kernel we ought to do so.
incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption.
Again, uffd-wp is a total mess. We shouldn't be in a position where its state being correctly retained relies on everybody always getting the subtle, uncommented, open-coded details right everywhere all the time.
But this is again a general comment... :)
:)
Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked.
Cc: stable@vger.kernel.org
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Suggested-by: Dev Jain dev.jain@arm.com
Acked-by: David Hildenbrand david@redhat.com
Reviewed-by: Dev Jain dev.jain@arm.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Overall LGTM, so:
Reviewed-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Cheers!
v4 -> v5:
- Move ptep_get() call after the !pvmw.pte check, which handles PMD-mapped THP migration entries.
- https://lore.kernel.org/linux-mm/20250930071053.36158-1-lance.yang@linux.dev...
v3 -> v4:
- Minor formatting tweak in try_to_map_unused_to_zeropage() function signature (per David and Dev)
- Collect Reviewed-by from Dev - thanks!
- https://lore.kernel.org/linux-mm/20250930060557.85133-1-lance.yang@linux.dev...
v2 -> v3:
- ptep_get() gets called only once per iteration (per Dev)
- https://lore.kernel.org/linux-mm/20250930043351.34927-1-lance.yang@linux.dev...
v1 -> v2:
- Avoid calling ptep_get() multiple times (per Dev)
- Double-check the uffd-wp bit (per David)
- Collect Acked-by from David - thanks!
- https://lore.kernel.org/linux-mm/20250928044855.76359-1-lance.yang@linux.dev...
mm/migrate.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index ce83c2c3c287..e3065c9edb55 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -296,8 +296,7 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
 }
 
 static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
-					  struct folio *folio,
-					  unsigned long idx)
+					  struct folio *folio, pte_t old_pte, unsigned long idx)
 {
 	struct page *page = folio_page(folio, idx);
 	pte_t newpte;
@@ -306,7 +305,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 		return false;
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
+	VM_BUG_ON_PAGE(pte_present(old_pte), page);
Kinda ugly that we pass old_pte when it's available via the shared state object, but probably nothing to really concern ourselves about.
Guess you could argue it both ways :)
It'd be good to convert these VM_BUG_ON_*() to VM_WARN_ON_*() but I guess that's
I agree.
somewhat out of the scope of the code here and would be inconsistent to change it for just one condition.
Since this fix already landed in the mainline, just leave it as is here :)
Thanks, Lance
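[Editorial aside] For readers who skipped the diffs: condensed from the patch above (surrounding context elided, comments added here for explanation), the fix amounts to reading the PTE once, only after the PMD-mapped-THP check, and carrying its soft-dirty and uffd-wp bits over to the new zeropage mapping.

/* remove_migration_pte(), condensed from the diff above: */
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
	if (!pvmw.pte) {	/* PMD-mapped THP migration entry: no PTE to read here */
		/* ... */
		continue;
	}
#endif
	old_pte = ptep_get(pvmw.pte);	/* read once, only after pvmw.pte is known valid */
	if (rmap_walk_arg->map_unused_to_zeropage &&
	    try_to_map_unused_to_zeropage(&pvmw, folio, old_pte, idx))
		continue;

/* try_to_map_unused_to_zeropage(): the new zeropage PTE inherits both bits: */
	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
				       pvmw->vma->vm_page_prot));
	if (pte_swp_soft_dirty(old_pte))
		newpte = pte_mksoft_dirty(newpte);
	if (pte_swp_uffd_wp(old_pte))
		newpte = pte_mkuffd_wp(newpte);
	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);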