Peter Xu peterx@redhat.com writes:
Ives van Hoorne from codesandbox.io reported an issue regarding possible data loss of uffd-wp when applied to memfds on heavily loaded systems. The sympton is some read page got data mismatch from the snapshot child VMs.
Here I can also reproduce with a Rust reproducer that was provided by Ives that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate 80 instances I can trigger the issues in ten minutes.
It turns out that we got some pages write-through even if uffd-wp is applied to the pte.
The problem is, when removing migration entries, we didn't really worry about write bit as long as we know it's not a write migration entry. That may not be true, for some memory types (e.g. writable shmem) mk_pte can return a pte with write bit set, then to recover the migration entry to its original state we need to explicit wr-protect the pte or it'll has the write bit set if it's a read migration entry.
For uffd it can cause write-through. I didn't verify, but I think it'll be the same for mprotect()ed pages and after migration we can miss the sigbus instead.
The relevant code on uffd was introduced in the anon support, which is commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration", 2020-04-07). However anon shouldn't suffer from this problem because anon should already have the write bit cleared always, so that may not be a proper Fixes target. To satisfy the need on the backport, I'm attaching the Fixes tag to the uffd-wp shmem support. Since no one had issue with mprotect, so I assume that's also the kernel version we should start to backport for stable, and we shouldn't need to worry before that.
Hi Peter, for the patch feel free to add:
Reviewed-by: Alistair Popple apopple@nvidia.com
I did wonder if this should be backported further for migrate_vma as well given that a migration failure there might lead a shmem read-only PTE to become read-write. I couldn't think of an obvious reason why that would cause an actual problem though.
I think folio_mkclean() will wrprotect the pte for writeback to swap, but it holds the page lock which prevents migrate_vma installing migration entries in the first place.
I suppose there is a small window there because migrate_vma will unlock the page before removing the migration entries. So to be safe we could consider going back to 8763cb45ab96 ("mm/migrate: new memory migration helper for use with device memory") but I doubt in practice it's a real problem.
Cc: Andrea Arcangeli aarcange@redhat.com Cc: stable@vger.kernel.org Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Reported-by: Ives van Hoorne ives@codesandbox.io Signed-off-by: Peter Xu peterx@redhat.com
mm/migrate.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c index dff333593a8a..8b6351c08c78 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio, pte = pte_mkdirty(pte); if (is_writable_migration_entry(entry)) pte = maybe_mkwrite(pte, vma);
else if (pte_swp_uffd_wp(*pvmw.pte))
else
/* NOTE: mk_pte can have write bit set */
pte = pte_wrprotect(pte);
if (pte_swp_uffd_wp(*pvmw.pte)) {
WARN_ON_ONCE(pte_write(pte)); pte = pte_mkuffd_wp(pte);
}
if (folio_test_anon(folio) && !is_readable_migration_entry(entry)) rmap_flags |= RMAP_EXCLUSIVE;