On 08.12.22 17:29, Peter Xu wrote:
On Thu, Dec 08, 2022 at 12:41:37PM +0100, David Hildenbrand wrote:
Currently, we don't enable writenotify when enabling userfaultfd-wp on a shared writable mapping (for now only shmem and hugetlb). The consequence is that vma->vm_page_prot will still include write permissions, to be set as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting, page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we only add permissions (e.g., mkwrite) but not remove permissions (e.g., wrprotect). For example, when enabling softdirty tracking, we enable writenotify. With uffd-wp on shared mappings, that changed. More details on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for uffd-wp PTEs/PMDs and manually write-protect them, which is error prone. Any code that uses vma->vm_page_prot to set PTE permissions is prone to such issues: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped write-protected as default and we will only allow selected PTEs that are definitely safe to be mapped without write-protection (see can_change_pte_writable()) to be writable. In the future, we might want to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more locations, for example, also when removing uffd-wp protection.
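To illustrate the idea, here is a minimal userspace sketch, not kernel code. All toy_* names and TOY_* bit values are hypothetical, and the writenotify check is reduced to the uffd-wp case only (the real vma_wants_writenotify() considers more conditions). It only models how a remap path that copies the default protection stops handing out write permission once writenotify is in effect:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag and protection bits; not the kernel's values. */
#define TOY_VM_WRITE   0x1u
#define TOY_VM_SHARED  0x2u
#define TOY_VM_UFFD_WP 0x4u
#define TOY_PROT_WRITE 0x1u

struct toy_vma {
	unsigned int vm_flags;
	unsigned int vm_page_prot;	/* default protection for remapped PTEs */
};

/* Simplified stand-in for vma_wants_writenotify(): uffd-wp case only. */
static bool toy_wants_writenotify(const struct toy_vma *vma)
{
	const unsigned int shared_writable = TOY_VM_SHARED | TOY_VM_WRITE;

	if ((vma->vm_flags & shared_writable) != shared_writable)
		return false;
	return (vma->vm_flags & TOY_VM_UFFD_WP) != 0;
}

/* Simplified stand-in for vma_set_page_prot(). */
static void toy_set_page_prot(struct toy_vma *vma)
{
	unsigned int prot = (vma->vm_flags & TOY_VM_WRITE) ? TOY_PROT_WRITE : 0;

	if (toy_wants_writenotify(vma))
		prot &= ~TOY_PROT_WRITE;	/* map write-protected by default */
	vma->vm_page_prot = prot;
}

/* A remap path (migration, NUMA hinting, ...) rebuilds the PTE from the default. */
static unsigned int toy_remap_pte(const struct toy_vma *vma)
{
	return vma->vm_page_prot;
}

/* Walk through the before/after behaviour for a shared writable mapping. */
static void toy_check(void)
{
	struct toy_vma vma = { .vm_flags = TOY_VM_SHARED | TOY_VM_WRITE };

	/* Without uffd-wp, the default protection includes write permission. */
	toy_set_page_prot(&vma);
	assert(toy_remap_pte(&vma) & TOY_PROT_WRITE);

	/* With uffd-wp, the recalculated default is write-protected. */
	vma.vm_flags |= TOY_VM_UFFD_WP;
	toy_set_page_prot(&vma);
	assert(!(toy_remap_pte(&vma) & TOY_PROT_WRITE));
}
```

In this model, restoring the write bit for PTEs that are known to be safe would be a separate, explicit step, which is the role can_change_pte_writable() plays in the description above.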
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even without NUMA hinting (which currently doesn't seem to be applicable to shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page tables to enforce the new default protection for the PTEs: we know that they cannot be uffd-wp'ed yet, because that can only happen after enabling uffd-wp for the VMA in general.
Also note that this prevents mprotect() on ranges with uffd-wp'ed PTEs from accidentally setting the write bit -- which would result in uffd-wp not triggering on later write access. This commit makes uffd-wp on shmem behave just like uffd-wp on anonymous memory (iow, less special) in that regard, even though mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu peterx@redhat.com
One trivial nit.
As discussed in [2], this is supposed to replace the fix by Peter: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
This survives vm/selftests and my reproducers:
- Migrating pages that are uffd-wp'ed using mbind() on a machine with 2 NUMA nodes
- Using a PROT_WRITE mapping with uffd-wp
- Using a PROT_READ|PROT_WRITE mapping with uffd-wp'ed pages and mprotect()'ing it PROT_WRITE
- Using a PROT_READ|PROT_WRITE mapping with uffd-wp'ed pages and temporarily mprotect()'ing it PROT_READ
uffd-wp properly triggers in all cases. On v6.1-rc8, all my reproducers fail.
It would be good to get some more testing feedback and review.
[2] https://lkml.kernel.org/r/20221202122748.113774-1-david@redhat.com
 fs/userfaultfd.c | 28 ++++++++++++++++++++++------
 mm/mmap.c        |  4 ++++
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 98ac37e34e3d..fb0733f2e623 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -108,6 +108,21 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
 	return ctx->features & UFFD_FEATURE_INITIALIZED;
 }

+static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
+				     vm_flags_t flags)
+{
+	const bool uffd_wp = !!((vma->vm_flags | flags) & VM_UFFD_WP);
IIUC this can be "uffd_wp_changed" then switch "|" to "^". But not a hot path at all, so shouldn't matter a lot.
Yes, let's do that (we can also remove the !! here):
This hunk will be:
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 98ac37e34e3d..a988485ada05 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -108,6 +108,21 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
 	return ctx->features & UFFD_FEATURE_INITIALIZED;
 }

+static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
+				     vm_flags_t flags)
+{
+	const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
+
+	vma->vm_flags = flags;
+	/*
+	 * For shared mappings, we want to enable writenotify while
+	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
+	 * recalculate vma->vm_page_prot whenever userfaultfd-wp changes.
+	 */
+	if ((vma->vm_flags & VM_SHARED) && uffd_wp_changed)
+		vma_set_page_prot(vma);
+}
+
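For reference, a small userspace sketch of the difference between the "|" and "^" forms. The toy_* helper names and the TOY_VM_UFFD_WP bit value are hypothetical and only serve the illustration; the point is that the XOR form is true only when the uffd-wp bit actually toggles, so the recalculation is skipped when the bit stays set across the update:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_VM_UFFD_WP 0x4u	/* hypothetical bit value */

/* "|" form: true whenever uffd-wp is set before or after the update. */
static bool toy_uffd_wp_either(unsigned int old_flags, unsigned int new_flags)
{
	return !!((old_flags | new_flags) & TOY_VM_UFFD_WP);
}

/* "^" form: true only when the uffd-wp bit differs between old and new. */
static bool toy_uffd_wp_changed(unsigned int old_flags, unsigned int new_flags)
{
	return (old_flags ^ new_flags) & TOY_VM_UFFD_WP;
}

/* Compare both forms over the four possible old/new combinations. */
static void toy_check_flags(void)
{
	/* Enabling or disabling uffd-wp: both forms agree. */
	assert(toy_uffd_wp_either(0, TOY_VM_UFFD_WP));
	assert(toy_uffd_wp_changed(0, TOY_VM_UFFD_WP));
	assert(toy_uffd_wp_either(TOY_VM_UFFD_WP, 0));
	assert(toy_uffd_wp_changed(TOY_VM_UFFD_WP, 0));

	/* Bit stays set (other flags change): only the "|" form fires. */
	assert(toy_uffd_wp_either(TOY_VM_UFFD_WP, TOY_VM_UFFD_WP | 0x1u));
	assert(!toy_uffd_wp_changed(TOY_VM_UFFD_WP, TOY_VM_UFFD_WP | 0x1u));

	/* Bit never set: neither form fires. */
	assert(!toy_uffd_wp_either(0x1u, 0x2u));
	assert(!toy_uffd_wp_changed(0x1u, 0x2u));
}
```

Assigning the masked value to a bool converts any nonzero result to true, which is why the "!!" is not needed.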
I'll wait for some more (+retest) before I resend tomorrow.
Thanks!