Base ====
This series is based on (and therefore should apply cleanly to) the tag "v5.12-rc7-mmots-2021-04-11-20-49", additionally with Peter's selftest cleanup series applied *first*:
https://lore.kernel.org/patchwork/cover/1412450/
Changelog =========
v1->v2: - Pick up Reviewed-by's. - Don't swapin page when a minor fault occurs. Notice that it needs to be swapped in, and just immediately fire the minor fault. Let a future CONTINUE deal with swapping in the page. [Peter] - Clarify comment about i_size checks in mm/userfaultfd.c. [Peter] - Only forward declare once (out of #ifdef) in hugetlb.h. [Peter]
Changes since [2]: - Squash the fixes ([2]) in with the original series ([1]). This makes reviewing easier, as we no longer have to sift through deltas undoing what we had done before. [Hugh, Peter] - Modify shmem_mcopy_atomic_pte() to use the new mcopy_atomic_install_ptes() helper, reducing code duplication. [Hugh] - Properly trigger handle_userfault() in the shmem_swapin_page() case. [Hugh] - Use shmem_getpage() instead of find_lock_page() to lookup the existing page in for continue. This properly deals with swapped-out pages. [Hugh] - Unconditionally pte_mkdirty() for anon memory (as before). [Peter] - Don't include userfaultfd_k.h in either hugetlb.h or shmem_fs.h. [Hugh] - Add comment for UFFD_FEATURE_MINOR_SHMEM (to match _HUGETLBFS). [Hugh] - Fix some small cleanup issues (parens, reworded conditionals, reduced plumbing of some parameters, simplify labels/gotos, ...). [Hugh, Peter]
Overview ========
See the series which added minor faults for hugetlbfs [3] for a detailed overview of minor fault handling in general. This series adds the same support for shmem-backed areas.
This series is structured as follows:
- Commits 1 and 2 are cleanups. - Commits 3 and 4 implement the new feature (minor fault handling for shmem). - Commits 5, 6, 7, 8 update the userfaultfd selftest to exercise the feature. - Commit 9 is one final cleanup, modifying an existing code path to re-use a new helper we've introduced. We rely on the selftest to show that this change doesn't break anything.
Use Case ========
In some cases it is useful to have VM memory backed by tmpfs instead of hugetlbfs. So, this feature will be used to support the same VM live migration use case described in my original series.
Additionally, Android folks (Lokesh Gidra lokeshgidra@google.com) hope to optimize the Android Runtime garbage collector using this feature:
"The plan is to use userfaultfd for concurrently compacting the heap. With this feature, the heap can be shared-mapped at another location where the GC-thread(s) could continue the compaction operation without the need to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads get faults on the heap, UFFDIO_CONTINUE can be used to resume execution. Furthermore, this feature enables updating references in the 'non-moving' portion of the heap efficiently. Without this feature, uneccessary page copying (ioctl(UFFDIO_COPY)) would be required."
[1] https://lore.kernel.org/patchwork/cover/1388144/ [2] https://lore.kernel.org/patchwork/patch/1408161/ [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@...
Axel Rasmussen (9): userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte userfaultfd/shmem: support minor fault registration for shmem userfaultfd/shmem: support UFFDIO_CONTINUE for shmem userfaultfd/selftests: use memfd_create for shmem test type userfaultfd/selftests: create alias mappings in the shmem test userfaultfd/selftests: reinitialize test context in each test userfaultfd/selftests: exercise minor fault handling shmem support userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes
fs/userfaultfd.c | 6 +- include/linux/hugetlb.h | 4 +- include/linux/shmem_fs.h | 15 +- include/linux/userfaultfd_k.h | 5 + include/uapi/linux/userfaultfd.h | 7 +- mm/hugetlb.c | 1 + mm/memory.c | 8 +- mm/shmem.c | 112 +++------ mm/userfaultfd.c | 183 ++++++++++----- tools/testing/selftests/vm/userfaultfd.c | 280 +++++++++++++++-------- 10 files changed, 377 insertions(+), 244 deletions(-)
-- 2.31.1.295.g9ea45b61b8-goog
Minimizing header file inclusion is desirable. In this case, we can do so just by forward declaring the enumeration our signature relies upon.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- include/linux/hugetlb.h | 4 +++- mm/hugetlb.c | 1 + 2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 09f1fd12a6fa..3f47650ab79b 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -11,7 +11,6 @@ #include <linux/kref.h> #include <linux/pgtable.h> #include <linux/gfp.h> -#include <linux/userfaultfd_k.h>
struct ctl_table; struct user_struct; @@ -135,6 +134,8 @@ void hugetlb_show_meminfo(void); unsigned long hugetlb_total_pages(void); vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); + +enum mcopy_atomic_mode; #ifdef CONFIG_USERFAULTFD int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, struct vm_area_struct *dst_vma, @@ -143,6 +144,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, enum mcopy_atomic_mode mode, struct page **pagep); #endif /* CONFIG_USERFAULTFD */ + bool hugetlb_reserve_pages(struct inode *inode, long from, long to, struct vm_area_struct *vma, vm_flags_t vm_flags); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 54d81d5947ed..b1652e747318 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -40,6 +40,7 @@ #include <linux/hugetlb_cgroup.h> #include <linux/node.h> #include <linux/page_owner.h> +#include <linux/userfaultfd_k.h> #include "internal.h"
int hugetlb_max_hstate __read_mostly;
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
Minimizing header file inclusion is desirable. In this case, we can do so just by forward declaring the enumeration our signature relies upon.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: Axel Rasmussen axelrasmussen@google.com
include/linux/hugetlb.h | 4 +++- mm/hugetlb.c | 1 + 2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 09f1fd12a6fa..3f47650ab79b 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -11,7 +11,6 @@ #include <linux/kref.h> #include <linux/pgtable.h> #include <linux/gfp.h> -#include <linux/userfaultfd_k.h> struct ctl_table; struct user_struct; @@ -135,6 +134,8 @@ void hugetlb_show_meminfo(void); unsigned long hugetlb_total_pages(void); vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags);
+enum mcopy_atomic_mode;
Wrongly placed: the CONFIG_USERFAULTFD=y CONFIG_HUGETLB_PAGE=n build fails. Better place it up above with struct ctl_table etc.
#ifdef CONFIG_USERFAULTFD int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, struct vm_area_struct *dst_vma, @@ -143,6 +144,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, enum mcopy_atomic_mode mode, struct page **pagep); #endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to, struct vm_area_struct *vma, vm_flags_t vm_flags); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 54d81d5947ed..b1652e747318 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -40,6 +40,7 @@ #include <linux/hugetlb_cgroup.h> #include <linux/node.h> #include <linux/page_owner.h> +#include <linux/userfaultfd_k.h> #include "internal.h" int hugetlb_max_hstate __read_mostly; -- 2.31.1.295.g9ea45b61b8-goog
Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte).
This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc.
While we're touching this function, two other small cleanup changes: - offset is equivalent to pgoff, so we can get rid of offset entirely. - Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- include/linux/shmem_fs.h | 15 +++++------- mm/shmem.c | 52 +++++++++++++--------------------------- mm/userfaultfd.c | 10 +++----- 3 files changed, 25 insertions(+), 52 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index d82b6f396588..919e36671fe6 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -122,21 +122,18 @@ static inline bool shmem_file(struct file *file) extern bool shmem_charge(struct inode *inode, long pages); extern void shmem_uncharge(struct inode *inode, long pages);
+#ifdef CONFIG_USERFAULTFD #ifdef CONFIG_SHMEM extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, + bool zeropage, struct page **pagep); -extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr); -#else +#else /* !CONFIG_SHMEM */ #define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \ - src_addr, pagep) ({ BUG(); 0; }) -#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \ - dst_addr) ({ BUG(); 0; }) -#endif + src_addr, zeropage, pagep) ({ BUG(); 0; }) +#endif /* CONFIG_SHMEM */ +#endif /* CONFIG_USERFAULTFD */
#endif diff --git a/mm/shmem.c b/mm/shmem.c index 26c76b13ad23..b72c55aa07fc 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2354,13 +2354,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode return inode; }
-static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - bool zeropage, - struct page **pagep) +#ifdef CONFIG_USERFAULTFD +int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + unsigned long src_addr, + bool zeropage, + struct page **pagep) { struct inode *inode = file_inode(dst_vma->vm_file); struct shmem_inode_info *info = SHMEM_I(inode); @@ -2372,7 +2373,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, struct page *page; pte_t _dst_pte, *dst_pte; int ret; - pgoff_t offset, max_off; + pgoff_t max_off;
ret = -ENOMEM; if (!shmem_inode_acct_block(inode, 1)) @@ -2383,7 +2384,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, if (!page) goto out_unacct_blocks;
- if (!zeropage) { /* mcopy_atomic */ + if (!zeropage) { /* COPY */ page_kaddr = kmap_atomic(page); ret = copy_from_user(page_kaddr, (const void __user *)src_addr, @@ -2397,7 +2398,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, /* don't free the page */ return -ENOENT; } - } else { /* mfill_zeropage_atomic */ + } else { /* ZEROPAGE */ clear_highpage(page); } } else { @@ -2405,15 +2406,15 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, *pagep = NULL; }
- VM_BUG_ON(PageLocked(page) || PageSwapBacked(page)); + VM_BUG_ON(PageLocked(page)); + VM_BUG_ON(PageSwapBacked(page)); __SetPageLocked(page); __SetPageSwapBacked(page); __SetPageUptodate(page);
ret = -EFAULT; - offset = linear_page_index(dst_vma, dst_addr); max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(offset >= max_off)) + if (unlikely(pgoff >= max_off)) goto out_release;
ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL, @@ -2439,7 +2440,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
ret = -EFAULT; max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(offset >= max_off)) + if (unlikely(pgoff >= max_off)) goto out_release_unlock;
ret = -EEXIST; @@ -2476,28 +2477,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, shmem_inode_unacct_blocks(inode, 1); goto out; } - -int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - struct page **pagep) -{ - return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, src_addr, false, pagep); -} - -int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr) -{ - struct page *page = NULL; - - return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, 0, true, &page); -} +#endif /* CONFIG_USERFAULTFD */
#ifdef CONFIG_TMPFS static const struct inode_operations shmem_symlink_inode_operations; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e14b3820c6a8..23fa2583bbd1 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -440,13 +440,9 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, dst_vma, dst_addr); } else { VM_WARN_ON_ONCE(wp_copy); - if (!zeropage) - err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, - dst_vma, dst_addr, - src_addr, page); - else - err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd, - dst_vma, dst_addr); + err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, + dst_addr, src_addr, zeropage, + page); }
return err;
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte).
This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc.
While we're touching this function, two other small cleanup changes:
- offset is equivalent to pgoff, so we can get rid of offset entirely.
- Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: Axel Rasmussen axelrasmussen@google.com
Acked-by: Hugh Dickins hughd@google.com though you've dropped one minor fix I did like, see below...
include/linux/shmem_fs.h | 15 +++++------- mm/shmem.c | 52 +++++++++++++--------------------------- mm/userfaultfd.c | 10 +++----- 3 files changed, 25 insertions(+), 52 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index d82b6f396588..919e36671fe6 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -122,21 +122,18 @@ static inline bool shmem_file(struct file *file) extern bool shmem_charge(struct inode *inode, long pages); extern void shmem_uncharge(struct inode *inode, long pages); +#ifdef CONFIG_USERFAULTFD #ifdef CONFIG_SHMEM extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr,
bool zeropage, struct page **pagep);
-extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr);
-#else +#else /* !CONFIG_SHMEM */ #define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
In a previous version, you quietly corrected that "dst_pte" to "dst_pmd": of course it makes no difference to the code generated, but it was a good correction, helping to prevent confusion.
src_addr, pagep) ({ BUG(); 0; })
-#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \
dst_addr) ({ BUG(); 0; })
-#endif
src_addr, zeropage, pagep) ({ BUG(); 0; })
+#endif /* CONFIG_SHMEM */ +#endif /* CONFIG_USERFAULTFD */ #endif diff --git a/mm/shmem.c b/mm/shmem.c index 26c76b13ad23..b72c55aa07fc 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2354,13 +2354,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode return inode; } -static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
bool zeropage,
struct page **pagep)
+#ifdef CONFIG_USERFAULTFD +int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
bool zeropage,
struct page **pagep)
{ struct inode *inode = file_inode(dst_vma->vm_file); struct shmem_inode_info *info = SHMEM_I(inode); @@ -2372,7 +2373,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, struct page *page; pte_t _dst_pte, *dst_pte; int ret;
- pgoff_t offset, max_off;
- pgoff_t max_off;
ret = -ENOMEM; if (!shmem_inode_acct_block(inode, 1)) @@ -2383,7 +2384,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, if (!page) goto out_unacct_blocks;
if (!zeropage) { /* mcopy_atomic */
if (!zeropage) { /* COPY */ page_kaddr = kmap_atomic(page); ret = copy_from_user(page_kaddr, (const void __user *)src_addr,
@@ -2397,7 +2398,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, /* don't free the page */ return -ENOENT; }
} else { /* mfill_zeropage_atomic */
} } else {} else { /* ZEROPAGE */ clear_highpage(page);
@@ -2405,15 +2406,15 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, *pagep = NULL; }
- VM_BUG_ON(PageLocked(page) || PageSwapBacked(page));
- VM_BUG_ON(PageLocked(page));
- VM_BUG_ON(PageSwapBacked(page)); __SetPageLocked(page); __SetPageSwapBacked(page); __SetPageUptodate(page);
ret = -EFAULT;
- offset = linear_page_index(dst_vma, dst_addr); max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off))
- if (unlikely(pgoff >= max_off)) goto out_release;
ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL, @@ -2439,7 +2440,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, ret = -EFAULT; max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(offset >= max_off))
- if (unlikely(pgoff >= max_off)) goto out_release_unlock;
ret = -EEXIST; @@ -2476,28 +2477,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, shmem_inode_unacct_blocks(inode, 1); goto out; }
-int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
struct page **pagep)
-{
- return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, false, pagep);
-}
-int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr)
-{
- struct page *page = NULL;
- return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, 0, true, &page);
-} +#endif /* CONFIG_USERFAULTFD */ #ifdef CONFIG_TMPFS static const struct inode_operations shmem_symlink_inode_operations; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e14b3820c6a8..23fa2583bbd1 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -440,13 +440,9 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, dst_vma, dst_addr); } else { VM_WARN_ON_ONCE(wp_copy);
if (!zeropage)
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
dst_vma, dst_addr,
src_addr, page);
else
err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, zeropage,
}page);
return err; -- 2.31.1.295.g9ea45b61b8-goog
This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag.
This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- fs/userfaultfd.c | 6 +++--- include/uapi/linux/userfaultfd.h | 7 ++++++- mm/memory.c | 8 +++++--- mm/shmem.c | 10 +++++++++- 4 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 14f92285d04f..9f3b8684cf3c 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma, }
if (vm_flags & VM_UFFD_MINOR) { - /* FIXME: Add minor fault interception for shmem. */ - if (!is_vm_hugetlb_page(vma)) + if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma))) return false; }
@@ -1941,7 +1940,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, /* report all available features and ioctls to userland */ uffdio_api.features = UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR - uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS; + uffdio_api.features &= + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); #endif uffdio_api.ioctls = UFFD_API_IOCTLS; ret = -EFAULT; diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index bafbeb1a2624..159a74e9564f 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -31,7 +31,8 @@ UFFD_FEATURE_MISSING_SHMEM | \ UFFD_FEATURE_SIGBUS | \ UFFD_FEATURE_THREAD_ID | \ - UFFD_FEATURE_MINOR_HUGETLBFS) + UFFD_FEATURE_MINOR_HUGETLBFS | \ + UFFD_FEATURE_MINOR_SHMEM) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -185,6 +186,9 @@ struct uffdio_api { * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults * can be intercepted (via REGISTER_MODE_MINOR) for * hugetlbfs-backed pages. + * + * UFFD_FEATURE_MINOR_SHMEM indicates the same support as + * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -196,6 +200,7 @@ struct uffdio_api { #define UFFD_FEATURE_SIGBUS (1<<7) #define UFFD_FEATURE_THREAD_ID (1<<8) #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9) +#define UFFD_FEATURE_MINOR_SHMEM (1<<10) __u64 features;
__u64 ioctls; diff --git a/mm/memory.c b/mm/memory.c index 4e358601c5d6..cc71a445c76c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3972,9 +3972,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf) * something). */ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) { - ret = do_fault_around(vmf); - if (ret) - return ret; + if (likely(!userfaultfd_minor(vmf->vma))) { + ret = do_fault_around(vmf); + if (ret) + return ret; + } }
ret = __do_fault(vmf); diff --git a/mm/shmem.c b/mm/shmem.c index b72c55aa07fc..3f48cb5e8404 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1785,7 +1785,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index, * vm. If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache. * - * vmf and fault_type are only supplied by shmem_fault: + * vma, vmf, and fault_type are only supplied by shmem_fault: * otherwise they are NULL. */ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, @@ -1820,6 +1820,14 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); + + if (page && vma && userfaultfd_minor(vma)) { + unlock_page(page); + put_page(page); + *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); + return 0; + } + if (xa_is_value(page)) { error = shmem_swapin_page(inode, index, &page, sgp, gfp, vma, fault_type);
On Mon, Apr 12, 2021 at 10:17:15PM -0700, Axel Rasmussen wrote:
This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag.
This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
Everything looks right to me, but it'll be great if Andrea or Hugh will have a look too.
Acked-by: Peter Xu peterx@redhat.com
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag.
This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults.
This is a very odd way to divide up the series: an "Intermission" half way through the implementation of MINOR/CONTINUE: this 3/9 makes little sense without the 4/9 to mm/userfaultfd.c which follows.
But, having said that, I won't object and Peter did not object, and I don't know of anyone else looking here: it will only give each of us more trouble to insist on repartitioning the series, and it's the end state that's far more important to me and to all of us.
And I'll even seize on it, to give myself an intermission after this one, until tomorrow (when I'll look at 4/9 and 9/9 - but shall not look at the selftests ones at all).
Most of this is okay, except the mm/shmem.c part; and I've just now realized that somewhere (whether in this patch or separately) there needs to be an update to Documentation/admin-guide/mm/userfaultfd.rst (admin-guide? how weird, but not this series' business to correct).
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
fs/userfaultfd.c | 6 +++--- include/uapi/linux/userfaultfd.h | 7 ++++++- mm/memory.c | 8 +++++--- mm/shmem.c | 10 +++++++++- 4 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 14f92285d04f..9f3b8684cf3c 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma, } if (vm_flags & VM_UFFD_MINOR) {
/* FIXME: Add minor fault interception for shmem. */
if (!is_vm_hugetlb_page(vma))
}if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma))) return false;
@@ -1941,7 +1940,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, /* report all available features and ioctls to userland */ uffdio_api.features = UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
- uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
- uffdio_api.features &=
~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
#endif uffdio_api.ioctls = UFFD_API_IOCTLS; ret = -EFAULT; diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index bafbeb1a2624..159a74e9564f 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -31,7 +31,8 @@ UFFD_FEATURE_MISSING_SHMEM | \ UFFD_FEATURE_SIGBUS | \ UFFD_FEATURE_THREAD_ID | \
UFFD_FEATURE_MINOR_HUGETLBFS)
UFFD_FEATURE_MINOR_HUGETLBFS | \
UFFD_FEATURE_MINOR_SHMEM)
#define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -185,6 +186,9 @@ struct uffdio_api { * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults * can be intercepted (via REGISTER_MODE_MINOR) for * hugetlbfs-backed pages.
*
* UFFD_FEATURE_MINOR_SHMEM indicates the same support as
*/* UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -196,6 +200,7 @@ struct uffdio_api { #define UFFD_FEATURE_SIGBUS (1<<7) #define UFFD_FEATURE_THREAD_ID (1<<8) #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9) +#define UFFD_FEATURE_MINOR_SHMEM (1<<10) __u64 features; __u64 ioctls; diff --git a/mm/memory.c b/mm/memory.c index 4e358601c5d6..cc71a445c76c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3972,9 +3972,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf) * something). */ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
ret = do_fault_around(vmf);
if (ret)
return ret;
if (likely(!userfaultfd_minor(vmf->vma))) {
ret = do_fault_around(vmf);
if (ret)
return ret;
}}
ret = __do_fault(vmf); diff --git a/mm/shmem.c b/mm/shmem.c index b72c55aa07fc..3f48cb5e8404 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1785,7 +1785,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
- vm. If we swap it in we mark it dirty since we also free the swap
- entry since a page cannot live in both the swap and page cache.
- vmf and fault_type are only supplied by shmem_fault:
*/
- vma, vmf, and fault_type are only supplied by shmem_fault:
- otherwise they are NULL.
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, @@ -1820,6 +1820,14 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
- if (page && vma && userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
return 0;
- }
Okay, Peter persuaded you to move that up here: where indeed it does look better than the earlier "swapped" version.
But will crash on swap as it's currently written: it needs to say if (!xa_is_value(page)) { unlock_page(page); put_page(page); }
I did say before that it's more robust to return from the swap case after doing the shmem_swapin_page(). But I might be slowly realizing that the ioctl to add the pte (in 4/9) will do its shmem_getpage_gfp(), and that will bring in the swap if user did not already do so: so I was wrong to claim more robustness the other way, this placement should be fine. I think.
if (xa_is_value(page)) { error = shmem_swapin_page(inode, index, &page, sgp, gfp, vma, fault_type); -- 2.31.1.295.g9ea45b61b8-goog
On Wed, Apr 14, 2021 at 12:36:13AM -0700, Hugh Dickins wrote:
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag.
This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults.
This is a very odd way to divide up the series: an "Intermission" half way through the implementation of MINOR/CONTINUE: this 3/9 makes little sense without the 4/9 to mm/userfaultfd.c which follows.
But, having said that, I won't object and Peter did not object, and I don't know of anyone else looking here: it will only give each of us more trouble to insist on repartitioning the series, and it's the end state that's far more important to me and to all of us.
Agreed, ideally it should be after patch 4 since this patch enables the feature already.
And I'll even seize on it, to give myself an intermission after this one, until tomorrow (when I'll look at 4/9 and 9/9 - but shall not look at the selftests ones at all).
Most of this is okay, except the mm/shmem.c part; and I've just now realized that somewhere (whether in this patch or separately) there needs to be an update to Documentation/admin-guide/mm/userfaultfd.rst (admin-guide? how weird, but not this series' business to correct).
(maybe some dir "devel" would suite better? But I do also see soft-dirty.rst, idle_page_tracking.rst,..)
[...]
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, @@ -1820,6 +1820,14 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
- if (page && vma && userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
return 0;
- }
Okay, Peter persuaded you to move that up here: where indeed it does look better than the earlier "swapped" version.
But will crash on swap as it's currently written: it needs to say if (!xa_is_value(page)) { unlock_page(page); put_page(page); }
And this is definitely true... Thanks,
I did say before that it's more robust to return from the swap case after doing the shmem_swapin_page(). But I might be slowly realizing that the ioctl to add the pte (in 4/9) will do its shmem_getpage_gfp(), and that will bring in the swap if user did not already do so: so I was wrong to claim more robustness the other way, this placement should be fine. I think.
if (xa_is_value(page)) { error = shmem_swapin_page(inode, index, &page, sgp, gfp, vma, fault_type); -- 2.31.1.295.g9ea45b61b8-goog
With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install PTEs for it.
This commit introduces a new helper: mcopy_atomic_install_ptes.
Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication).
Why add a new mcopy_atomic_install_ptes helper? A problem we have with continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mcopy_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte().
In a future commit, shmem_mcopy_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- mm/userfaultfd.c | 176 +++++++++++++++++++++++++++++++++++------------ 1 file changed, 131 insertions(+), 45 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 23fa2583bbd1..8df0438f5d6a 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -48,6 +48,87 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, return dst_vma; }
+/* + * Install PTEs, to map dst_addr (within dst_vma) to page. + * + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed), + * whether or not dst_vma is VM_SHARED. It also handles the more general + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file + * backed, or not). + * + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by + * shmem_mcopy_atomic_pte instead. + */ +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + bool newly_allocated, bool wp_copy) +{ + int ret; + pte_t _dst_pte, *dst_pte; + int writable; + bool vm_shared = dst_vma->vm_flags & VM_SHARED; + spinlock_t *ptl; + struct inode *inode; + pgoff_t offset, max_off; + + _dst_pte = mk_pte(page, dst_vma->vm_page_prot); + writable = dst_vma->vm_flags & VM_WRITE; + /* For private, non-anon we need CoW (don't write to page cache!) */ + if (!vma_is_anonymous(dst_vma) && !vm_shared) + writable = 0; + + if (writable || vma_is_anonymous(dst_vma)) + _dst_pte = pte_mkdirty(_dst_pte); + if (writable) { + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else + _dst_pte = pte_mkwrite(_dst_pte); + } else if (vm_shared) { + /* + * Since we didn't pte_mkdirty(), mark the page dirty or it + * could be freed from under us. We could do this + * unconditionally, but doing it only if !writable is faster. + */ + set_page_dirty(page); + } + + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + + if (vma_is_shmem(dst_vma)) { + /* serialize against truncate with the page table lock */ + inode = dst_vma->vm_file->f_inode; + offset = linear_page_index(dst_vma, dst_addr); + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); + ret = -EFAULT; + if (unlikely(offset >= max_off)) + goto out_unlock; + } + + ret = -EEXIST; + if (!pte_none(*dst_pte)) + goto out_unlock; + + inc_mm_counter(dst_mm, mm_counter(page)); + if (vma_is_shmem(dst_vma)) + page_add_file_rmap(page, false); + else + page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + + if (newly_allocated) + lru_cache_add_inactive_or_unevictable(page, dst_vma); + + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(dst_vma, dst_addr, dst_pte); + ret = 0; +out_unlock: + pte_unmap_unlock(dst_pte, ptl); + return ret; +} + static int mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, @@ -56,13 +137,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep, bool wp_copy) { - pte_t _dst_pte, *dst_pte; - spinlock_t *ptl; void *page_kaddr; int ret; struct page *page; - pgoff_t offset, max_off; - struct inode *inode;
if (!*pagep) { ret = -ENOMEM; @@ -99,43 +176,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL)) goto out_release;
- _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot)); - if (dst_vma->vm_flags & VM_WRITE) { - if (wp_copy) - _dst_pte = pte_mkuffd_wp(_dst_pte); - else - _dst_pte = pte_mkwrite(_dst_pte); - } - - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); - if (dst_vma->vm_file) { - /* the shmem MAP_PRIVATE case requires checking the i_size */ - inode = dst_vma->vm_file->f_inode; - offset = linear_page_index(dst_vma, dst_addr); - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - ret = -EFAULT; - if (unlikely(offset >= max_off)) - goto out_release_uncharge_unlock; - } - ret = -EEXIST; - if (!pte_none(*dst_pte)) - goto out_release_uncharge_unlock; - - inc_mm_counter(dst_mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, dst_vma, dst_addr, false); - lru_cache_add_inactive_or_unevictable(page, dst_vma); - - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(dst_vma, dst_addr, dst_pte); - - pte_unmap_unlock(dst_pte, ptl); - ret = 0; + ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr, + page, true, wp_copy); + if (ret) + goto out_release; out: return ret; -out_release_uncharge_unlock: - pte_unmap_unlock(dst_pte, ptl); out_release: put_page(page); goto out; @@ -176,6 +222,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, return ret; }
+/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ +static int mcontinue_atomic_pte(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + bool wp_copy) +{ + struct inode *inode = file_inode(dst_vma->vm_file); + pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); + struct page *page; + int ret; + + ret = shmem_getpage(inode, pgoff, &page, SGP_READ); + if (ret) + goto out; + if (!page) { + ret = -EFAULT; + goto out; + } + + ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr, + page, false, wp_copy); + if (ret) + goto out_release; + + unlock_page(page); + ret = 0; +out: + return ret; +out_release: + unlock_page(page); + put_page(page); + goto out; +} + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -415,11 +496,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, unsigned long dst_addr, unsigned long src_addr, struct page **page, - bool zeropage, + enum mcopy_atomic_mode mode, bool wp_copy) { ssize_t err;
+ if (mode == MCOPY_ATOMIC_CONTINUE) { + return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + wp_copy); + } + /* * The normal page fault path for a shmem will invoke the * fault, fill the hole in the file and COW it right away. The @@ -431,7 +517,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, * and not in the radix tree. */ if (!(dst_vma->vm_flags & VM_SHARED)) { - if (!zeropage) + if (mode == MCOPY_ATOMIC_NORMAL) err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, page, wp_copy); @@ -441,7 +527,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, } else { VM_WARN_ON_ONCE(wp_copy); err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, src_addr, zeropage, + dst_addr, src_addr, + mode != MCOPY_ATOMIC_NORMAL, page); }
@@ -463,7 +550,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, long copied; struct page *page; bool wp_copy; - bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
/* * Sanitize the command parameters: @@ -526,7 +612,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) goto out_unlock; - if (mcopy_mode == MCOPY_ATOMIC_CONTINUE) + if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE) goto out_unlock;
/* @@ -574,7 +660,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, BUG_ON(pmd_trans_huge(*dst_pmd));
err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, - src_addr, &page, zeropage, wp_copy); + src_addr, &page, mcopy_mode, wp_copy); cond_resched();
if (unlikely(err == -ENOENT)) {
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install PTEs for it.
s/PTEs/a PTE/
This commit introduces a new helper: mcopy_atomic_install_ptes.
The plural is misleading: it only installs a single pte, so I'm going to ask you to change it throughout to mcopy_atomic_install_pte() (I'm not thrilled with the "mcopy" nor the "atomic", but there you are being consistent with userfaultfd's peculiar naming, so let them be).
Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication).
Why add a new mcopy_atomic_install_ptes helper? A problem we have with continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mcopy_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte().
In a future commit, shmem_mcopy_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change.
(Actually that turns out to be a nice deletion of lines, but you're absolutely right to do it as a separate patch.)
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
mm/userfaultfd.c | 176 +++++++++++++++++++++++++++++++++++------------ 1 file changed, 131 insertions(+), 45 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 23fa2583bbd1..8df0438f5d6a 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -48,6 +48,87 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, return dst_vma; } +/*
- Install PTEs, to map dst_addr (within dst_vma) to page.
- This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- whether or not dst_vma is VM_SHARED. It also handles the more general
- MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- backed, or not).
- Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- shmem_mcopy_atomic_pte instead.
Right, I'm thinking in terms of five cases below (I'm not for a moment saying that you need to list these out in the comment, just saying that I could not get my head around the issues in this function without listing them out for myself):
1. anon private mcopy (using anon page newly allocated) 2. shmem private mcopy (using anon page newly allocated) 3. shmem private mcontinue (using page in cache from shmem_getpage) 4. shmem shared mcontinue (using page in cache from shmem_getpage) 5. shmem shared mcopy (using page in cache newly allocated)
Of which each has a VM_WRITE and a !VM_WRITE case; and the third and fourth cases are new in this patch (it really would have been better to introduce mcopy_atomic_install_pte() in a separate earlier patch, but don't change that now we've got this far); and the fifth case does *not* use mcopy_atomic_install_pte() in this patch, but will in future.
And while making these notes, let's hightlight again what is commented elsewhere, the odd nature of the second case: where userfaultfd short circuits to an anonymous CoW page without instantiating the shmem page. (Please double-check me on that: quite a lot of my comments below are about this case 2, so if I've got it wrong, then I've got a lot wrong.)
- */
+static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
mcopy_atomic_install_pte() throughout please.
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy)
+{
- int ret;
- pte_t _dst_pte, *dst_pte;
- int writable;
Sorry, it's silly of me, but I keep getting irritated by "int writable" in company with the various bools; and the way vm_shared is initialized below, but writable initialized later. Please humour me by making it bool writable = dst_vma->vm_flags & VM_WRITE;
- bool vm_shared = dst_vma->vm_flags & VM_SHARED;
And I've found that we also need bool page_in_cache = page->mapping; because an anonymous page does not at this point have page->mapping set, and does not yet satisfy PageAnon(page). Or other naming if you think of better; or its inverse with page_is_anon or whatever.
- spinlock_t *ptl;
- struct inode *inode;
- pgoff_t offset, max_off;
- _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
- writable = dst_vma->vm_flags & VM_WRITE;
- /* For private, non-anon we need CoW (don't write to page cache!) */
- if (!vma_is_anonymous(dst_vma) && !vm_shared)
writable = 0;
That appears to differ slightly from what was done before this patch: it is now making a case 2 VM_WRITE pte unwritable, incurring an unnecessary write fault later on. I think it would be better (for all fives cases) to say:
if (page_in_cache && !vm_shared) writable = false;
- if (writable || vma_is_anonymous(dst_vma))
_dst_pte = pte_mkdirty(_dst_pte);
And, unlike before, that is not marking the case 2 unwritable pte dirty. Which works okay, because add_to_swap()'s unconditional set_page_dirty() will make sure this page is written to swap before it is freed. But I'd rather not rely on that here: it's a detail which might get changed one day, and whoever changes it may not think to update mm/userfaultfd.c.
Sticking with Andrea's caution about marking a shared unwritable dirty, but happy as before to mark a private unwritable dirty:
if (writable || !page_in_cache) _dst_pte = pte_mkdirty(_dst_pte);
This does *not* mark the new cases 3 and 4 dirty when unwritable, but there's no chance of data loss in their case, because the kernel has not modified the page's data: the page from shmem_getpage() is already marked correctly (usually PageDirty, but there's a mapped-hole case where it might not be, and that is still correct).
(Why do we mark these pages dirty when writable? To skip a hardware fault when and if the page is written later; but I'm not sure whether that's necessarily a good idea - we don't know here whether it was a write fault which triggered all this. I also don't know what difference wp_copy, which skips the mkwrite, makes to this calculus; but follow the example of before.)
- if (writable) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
else
_dst_pte = pte_mkwrite(_dst_pte);
- }
Fine.
else if (vm_shared) {
/*
* Since we didn't pte_mkdirty(), mark the page dirty or it
* could be freed from under us. We could do this
* unconditionally, but doing it only if !writable is faster.
*/
set_page_dirty(page);
- }
But delete this block, as we all wanted. As I've argued above, the new cases 3 and 4 using shmem_getpage() do not need an extra set_page_dirty() here, and we can address case 5 when it's added.
- dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
- if (vma_is_shmem(dst_vma)) {
/* serialize against truncate with the page table lock */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
- }
- ret = -EEXIST;
- if (!pte_none(*dst_pte))
goto out_unlock;
- inc_mm_counter(dst_mm, mm_counter(page));
Hard to spot, but that's wrong: because mm_counter() depends on PageAnon to decide which count to adjust, and that does not get set until the page_add_new_anon_rmap(). I'd expect your tests to have left "Bad rss" warnings in the kernel log? This would be why. Just move the line down until after page_add_new_anon_rmap() - with a comment line to say why!
- if (vma_is_shmem(dst_vma))
No, that gets case 2 wrong: use "if (page_in_cache)" instead.
page_add_file_rmap(page, false);
- else
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- if (newly_allocated)
lru_cache_add_inactive_or_unevictable(page, dst_vma);
- set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
- ret = 0;
+out_unlock:
- pte_unmap_unlock(dst_pte, ptl);
- return ret;
+}
static int mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, @@ -56,13 +137,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep, bool wp_copy) {
- pte_t _dst_pte, *dst_pte;
- spinlock_t *ptl; void *page_kaddr; int ret; struct page *page;
- pgoff_t offset, max_off;
- struct inode *inode;
if (!*pagep) { ret = -ENOMEM; @@ -99,43 +176,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL)) goto out_release;
- _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
- if (dst_vma->vm_flags & VM_WRITE) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
else
_dst_pte = pte_mkwrite(_dst_pte);
- }
- dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
- if (dst_vma->vm_file) {
/* the shmem MAP_PRIVATE case requires checking the i_size */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_release_uncharge_unlock;
- }
- ret = -EEXIST;
- if (!pte_none(*dst_pte))
goto out_release_uncharge_unlock;
- inc_mm_counter(dst_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- lru_cache_add_inactive_or_unevictable(page, dst_vma);
- set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
- pte_unmap_unlock(dst_pte, ptl);
- ret = 0;
- ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
page, true, wp_copy);
- if (ret)
goto out_release;
out: return ret; -out_release_uncharge_unlock:
- pte_unmap_unlock(dst_pte, ptl);
out_release: put_page(page); goto out; @@ -176,6 +222,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, return ret; } +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
bool wp_copy)
+{
- struct inode *inode = file_inode(dst_vma->vm_file);
- pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
- struct page *page;
- int ret;
- ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
- if (ret)
goto out;
- if (!page) {
ret = -EFAULT;
goto out;
- }
Right, I'll go along with that. I did say to use SGP_CACHE, and I'm not sure why you did not, but perhaps were put off it by my remarks about a racing hole punch. Using SGP_READ here, you will not allocate an unnecessary page in that (exceptional) case, good; but you are left with inconsistent behaviour on fallocated (!PageUptodate: the page has been allocated, but not yet cleared or overwritten with user data) pages.
No bad data is leaked, but the inconsistency is that handle_userfault() believes there's a page at this offset, but mcontinue_atomic_pte() says there is not (and might they retry forever disagreeing?). It's a somewhat grey area: I'd say your mcontinue_atomic_pte() is the correct one (it is a hole, but one that we happen to have reserved future space for); but that we don't really want to complicate the other end for it (if we skip going the VM_UFFD_MINOR way, it's more of a problem for VM_UFFD_MISSING).
I think stick with SGP_READ as you have: just be aware at the userspace end that this case might occur (and you only have to fault the page into the other mapping to resolve it), if anyone is using fallocate().
All the rest of 4/9 looked fine to me, though I have worried about a couple more things.
One: whereas I tend to think of one call to handle_userfault() ending up in one call to mcopy_atomic_install_pte() to resolve it, I see that actually __mcopy_atomic() can be a loop over many pages, and I have not thought through all the possibilities that might allow, and now with the interspersal of MINOR and MISSING.
Two: mcopy_atomic_install_pte() can only install a pte, and it looks as if it handles tails of a compound page correctly (as might come from a MINOR userfault on a pre-existing shmem THP); but there is no mapping of huge page by pmd, and khugepaged's userfaultfd_armed() checks will (rightly) keep it from interfering. I guess later on, after all the userfaultfd-ing is done, khugepaged can come around and collapse to huge pages, if the file was on a huge mount: okay.
Hugh
- ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
page, false, wp_copy);
- if (ret)
goto out_release;
- unlock_page(page);
- ret = 0;
+out:
- return ret;
+out_release:
- unlock_page(page);
- put_page(page);
- goto out;
+}
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -415,11 +496,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, unsigned long dst_addr, unsigned long src_addr, struct page **page,
bool zeropage,
enum mcopy_atomic_mode mode, bool wp_copy)
{ ssize_t err;
- if (mode == MCOPY_ATOMIC_CONTINUE) {
return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
wp_copy);
- }
- /*
- The normal page fault path for a shmem will invoke the
- fault, fill the hole in the file and COW it right away. The
@@ -431,7 +517,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, * and not in the radix tree. */ if (!(dst_vma->vm_flags & VM_SHARED)) {
if (!zeropage)
if (mode == MCOPY_ATOMIC_NORMAL) err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, page, wp_copy);
@@ -441,7 +527,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, } else { VM_WARN_ON_ONCE(wp_copy); err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, zeropage,
dst_addr, src_addr,
}mode != MCOPY_ATOMIC_NORMAL, page);
@@ -463,7 +550,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, long copied; struct page *page; bool wp_copy;
- bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
/* * Sanitize the command parameters: @@ -526,7 +612,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) goto out_unlock;
- if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
- if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE) goto out_unlock;
/* @@ -574,7 +660,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, BUG_ON(pmd_trans_huge(*dst_pmd)); err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
src_addr, &page, zeropage, wp_copy);
cond_resched();src_addr, &page, mcopy_mode, wp_copy);
if (unlikely(err == -ENOENT)) { -- 2.31.1.295.g9ea45b61b8-goog
On Fri, Apr 16, 2021 at 4:47 PM Hugh Dickins hughd@google.com wrote:
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install PTEs for it.
s/PTEs/a PTE/
This commit introduces a new helper: mcopy_atomic_install_ptes.
The plural is misleading: it only installs a single pte, so I'm going to ask you to change it throughout to mcopy_atomic_install_pte() (I'm not thrilled with the "mcopy" nor the "atomic", but there you are being consistent with userfaultfd's peculiar naming, so let them be).
Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication).
Why add a new mcopy_atomic_install_ptes helper? A problem we have with continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mcopy_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte().
In a future commit, shmem_mcopy_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change.
(Actually that turns out to be a nice deletion of lines, but you're absolutely right to do it as a separate patch.)
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
mm/userfaultfd.c | 176 +++++++++++++++++++++++++++++++++++------------ 1 file changed, 131 insertions(+), 45 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 23fa2583bbd1..8df0438f5d6a 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -48,6 +48,87 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, return dst_vma; }
+/*
- Install PTEs, to map dst_addr (within dst_vma) to page.
- This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- whether or not dst_vma is VM_SHARED. It also handles the more general
- MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- backed, or not).
- Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- shmem_mcopy_atomic_pte instead.
Right, I'm thinking in terms of five cases below (I'm not for a moment saying that you need to list these out in the comment, just saying that I could not get my head around the issues in this function without listing them out for myself):
- anon private mcopy (using anon page newly allocated)
- shmem private mcopy (using anon page newly allocated)
- shmem private mcontinue (using page in cache from shmem_getpage)
- shmem shared mcontinue (using page in cache from shmem_getpage)
- shmem shared mcopy (using page in cache newly allocated)
Of which each has a VM_WRITE and a !VM_WRITE case; and the third and fourth cases are new in this patch (it really would have been better to introduce mcopy_atomic_install_pte() in a separate earlier patch, but don't change that now we've got this far); and the fifth case does *not* use mcopy_atomic_install_pte() in this patch, but will in future.
And while making these notes, let's hightlight again what is commented elsewhere, the odd nature of the second case: where userfaultfd short circuits to an anonymous CoW page without instantiating the shmem page. (Please double-check me on that: quite a lot of my comments below are about this case 2, so if I've got it wrong, then I've got a lot wrong.)
My understanding of case (2) is the same. In mfill_atomic_pte(), we call into mcopy_atomic_pte if !VM_SHARED, regardless of whether it's anon or shmem we're dealing with. That function allocates an anon page, and then mcopy_atomic_install_pte() will *not* mark it writable, so we get CoW semantics.
- */
+static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
mcopy_atomic_install_pte() throughout please.
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy)
+{
int ret;
pte_t _dst_pte, *dst_pte;
int writable;
Sorry, it's silly of me, but I keep getting irritated by "int writable" in company with the various bools; and the way vm_shared is initialized below, but writable initialized later. Please humour me by making it bool writable = dst_vma->vm_flags & VM_WRITE;
bool vm_shared = dst_vma->vm_flags & VM_SHARED;
And I've found that we also need bool page_in_cache = page->mapping; because an anonymous page does not at this point have page->mapping set, and does not yet satisfy PageAnon(page). Or other naming if you think of better; or its inverse with page_is_anon or whatever.
spinlock_t *ptl;
struct inode *inode;
pgoff_t offset, max_off;
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
writable = dst_vma->vm_flags & VM_WRITE;
/* For private, non-anon we need CoW (don't write to page cache!) */
if (!vma_is_anonymous(dst_vma) && !vm_shared)
writable = 0;
That appears to differ slightly from what was done before this patch: it is now making a case 2 VM_WRITE pte unwritable, incurring an unnecessary write fault later on. I think it would be better (for all fives cases) to say:
if (page_in_cache && !vm_shared) writable = false;
Agreed, this is more clear.
The case 2 difference is subtle, thanks for spotting it! I had assumed pages backing shmem VMAs would *always* be in the page cache, but due to the trick where we use an anon page in that particular case, this isn't true.
if (writable || vma_is_anonymous(dst_vma))
_dst_pte = pte_mkdirty(_dst_pte);
And, unlike before, that is not marking the case 2 unwritable pte dirty. Which works okay, because add_to_swap()'s unconditional set_page_dirty() will make sure this page is written to swap before it is freed. But I'd rather not rely on that here: it's a detail which might get changed one day, and whoever changes it may not think to update mm/userfaultfd.c.
Sticking with Andrea's caution about marking a shared unwritable dirty, but happy as before to mark a private unwritable dirty:
if (writable || !page_in_cache) _dst_pte = pte_mkdirty(_dst_pte);
This does *not* mark the new cases 3 and 4 dirty when unwritable, but there's no chance of data loss in their case, because the kernel has not modified the page's data: the page from shmem_getpage() is already marked correctly (usually PageDirty, but there's a mapped-hole case where it might not be, and that is still correct).
(Why do we mark these pages dirty when writable? To skip a hardware fault when and if the page is written later; but I'm not sure whether that's necessarily a good idea - we don't know here whether it was a write fault which triggered all this. I also don't know what difference wp_copy, which skips the mkwrite, makes to this calculus; but follow the example of before.)
Agreed - it's kind of the same thing as above, I had been assuming that "!vma_is_anonymous(dst_vma)" was equivalent to our new "page_in_cache", but case (2) violates that assumption. This is more clear, and given case (2), more correct.
Thanks for the thorough explanation and background!
if (writable) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
else
_dst_pte = pte_mkwrite(_dst_pte);
}
Fine.
else if (vm_shared) {
/*
* Since we didn't pte_mkdirty(), mark the page dirty or it
* could be freed from under us. We could do this
* unconditionally, but doing it only if !writable is faster.
*/
set_page_dirty(page);
}
But delete this block, as we all wanted. As I've argued above, the new cases 3 and 4 using shmem_getpage() do not need an extra set_page_dirty() here, and we can address case 5 when it's added.
At this point working on revising the patches, I'm not sure why we won't need this for case (5), but I at least agree that it's certainly not needed yet, and we can add it back (or something else) when we get to case (5) in that later patch.
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
if (vma_is_shmem(dst_vma)) {
/* serialize against truncate with the page table lock */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
}
ret = -EEXIST;
if (!pte_none(*dst_pte))
goto out_unlock;
inc_mm_counter(dst_mm, mm_counter(page));
Hard to spot, but that's wrong: because mm_counter() depends on PageAnon to decide which count to adjust, and that does not get set until the page_add_new_anon_rmap(). I'd expect your tests to have left "Bad rss" warnings in the kernel log? This would be why. Just move the line down until after page_add_new_anon_rmap() - with a comment line to say why!
if (vma_is_shmem(dst_vma))
No, that gets case 2 wrong: use "if (page_in_cache)" instead.
page_add_file_rmap(page, false);
else
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
if (newly_allocated)
lru_cache_add_inactive_or_unevictable(page, dst_vma);
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
ret = 0;
+out_unlock:
pte_unmap_unlock(dst_pte, ptl);
return ret;
+}
static int mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, @@ -56,13 +137,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep, bool wp_copy) {
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl; void *page_kaddr; int ret; struct page *page;
pgoff_t offset, max_off;
struct inode *inode; if (!*pagep) { ret = -ENOMEM;
@@ -99,43 +176,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL)) goto out_release;
_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
if (dst_vma->vm_flags & VM_WRITE) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
else
_dst_pte = pte_mkwrite(_dst_pte);
}
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
if (dst_vma->vm_file) {
/* the shmem MAP_PRIVATE case requires checking the i_size */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_release_uncharge_unlock;
}
ret = -EEXIST;
if (!pte_none(*dst_pte))
goto out_release_uncharge_unlock;
inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
lru_cache_add_inactive_or_unevictable(page, dst_vma);
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
pte_unmap_unlock(dst_pte, ptl);
ret = 0;
ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
page, true, wp_copy);
if (ret)
goto out_release;
out: return ret; -out_release_uncharge_unlock:
pte_unmap_unlock(dst_pte, ptl);
out_release: put_page(page); goto out; @@ -176,6 +222,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, return ret; }
+/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
bool wp_copy)
+{
struct inode *inode = file_inode(dst_vma->vm_file);
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
struct page *page;
int ret;
ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
if (ret)
goto out;
if (!page) {
ret = -EFAULT;
goto out;
}
Right, I'll go along with that. I did say to use SGP_CACHE, and I'm not sure why you did not, but perhaps were put off it by my remarks about a racing hole punch. Using SGP_READ here, you will not allocate an unnecessary page in that (exceptional) case, good; but you are left with inconsistent behaviour on fallocated (!PageUptodate: the page has been allocated, but not yet cleared or overwritten with user data) pages.
You're exactly right, I was worried about allocating an unwanted page in that case, and SGP_READ seemed to avoid it. At the time, I hadn't spotted the fallocated page difference, though.
No bad data is leaked, but the inconsistency is that handle_userfault() believes there's a page at this offset, but mcontinue_atomic_pte() says there is not (and might they retry forever disagreeing?). It's a somewhat grey area: I'd say your mcontinue_atomic_pte() is the correct one (it is a hole, but one that we happen to have reserved future space for); but that we don't really want to complicate the other end for it (if we skip going the VM_UFFD_MINOR way, it's more of a problem for VM_UFFD_MISSING).
I think stick with SGP_READ as you have: just be aware at the userspace end that this case might occur (and you only have to fault the page into the other mapping to resolve it), if anyone is using fallocate().
Right, I think that situation is okay.
I don't think we'll retry forever. handle_userfault() will have just put the faulting thread(s) to sleep, so they won't call back into handle_userfault() until the kernel wakes them up (in response to one of these mcopy ioctls).
If the userspace fault handler runs into this case, at the end of the day it will get an error to its ioctl() against the UFFD. We don't wake up the faulting threads in the error path, so they're still stuck.
My thinking is, at that point the fault handler could do something to "force" PageUptodate (e.g., I'd expect any read or write to the non-UFFD-registered VMA which points to this same underlying page to accomplish this). And then, retrying the ioctl ought to succeed. Since this is a relatively narrow case, this doesn't feel overly burdensome.
My worry about using SGP_CACHE instead is, there's the edge case where we might allocate a new page and just carry on. I don't see an obvious way for the *caller* of shmem_getpage_gfp() to know whether or not this case happened (generally we'd want to just discard the page). I suspect this is "likely" not what userspace wants to happen in this case, and it would happen sort of silently - no error returned, or chance to recover.
It also seems more likely to happen than the fallocated page case. If the userspace fault handler is a bit buggy and tries to CONTINUE an area with no existing page, we'll just install a newly allocated zero page and return success -- userspace in this case would prefer we return an error.
Anyway, it seems to me we're on the same page ( ;) ) - just wanted to provide some background. :)
All the rest of 4/9 looked fine to me, though I have worried about a couple more things.
One: whereas I tend to think of one call to handle_userfault() ending up in one call to mcopy_atomic_install_pte() to resolve it, I see that actually __mcopy_atomic() can be a loop over many pages, and I have not thought through all the possibilities that might allow, and now with the interspersal of MINOR and MISSING.
Two: mcopy_atomic_install_pte() can only install a pte, and it looks as if it handles tails of a compound page correctly (as might come from a MINOR userfault on a pre-existing shmem THP); but there is no mapping of huge page by pmd, and khugepaged's userfaultfd_armed() checks will (rightly) keep it from interfering. I guess later on, after all the userfaultfd-ing is done, khugepaged can come around and collapse to huge pages, if the file was on a huge mount: okay.
Hugh
ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
page, false, wp_copy);
if (ret)
goto out_release;
unlock_page(page);
ret = 0;
+out:
return ret;
+out_release:
unlock_page(page);
put_page(page);
goto out;
+}
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -415,11 +496,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, unsigned long dst_addr, unsigned long src_addr, struct page **page,
bool zeropage,
enum mcopy_atomic_mode mode, bool wp_copy)
{ ssize_t err;
if (mode == MCOPY_ATOMIC_CONTINUE) {
return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
wp_copy);
}
/* * The normal page fault path for a shmem will invoke the * fault, fill the hole in the file and COW it right away. The
@@ -431,7 +517,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, * and not in the radix tree. */ if (!(dst_vma->vm_flags & VM_SHARED)) {
if (!zeropage)
if (mode == MCOPY_ATOMIC_NORMAL) err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, page, wp_copy);
@@ -441,7 +527,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, } else { VM_WARN_ON_ONCE(wp_copy); err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, zeropage,
dst_addr, src_addr,
mode != MCOPY_ATOMIC_NORMAL, page); }
@@ -463,7 +550,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, long copied; struct page *page; bool wp_copy;
bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE); /* * Sanitize the command parameters:
@@ -526,7 +612,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) goto out_unlock;
if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE) goto out_unlock; /*
@@ -574,7 +660,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, BUG_ON(pmd_trans_huge(*dst_pmd));
err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
src_addr, &page, zeropage, wp_copy);
src_addr, &page, mcopy_mode, wp_copy); cond_resched(); if (unlikely(err == -ENOENT)) {
-- 2.31.1.295.g9ea45b61b8-goog
This is a preparatory commit. In the future, we want to be able to setup alias mappings for area_src and area_dst in the shmem test, like we do in the hugetlb_shared test. With a VMA obtained via mmap(MAP_ANONYMOUS | MAP_SHARED), it isn't clear how to do this.
So, mmap() with an fd, so we can create alias mappings. Use memfd_create instead of actually passing in a tmpfs path like hugetlb does, since it's more convenient / simpler to run, and works just as well.
Future commits will:
1. Setup the alias mappings. 2. Extend our tests to actually take advantage of this, to test new userfaultfd behavior being introduced in this series.
Also, a small fix in the area we're changing: when the hugetlb setup fails in main(), pass in the right argv[] so we actually print out the hugetlb file path.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 6339aeaeeff8..fc40831f818f 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -85,6 +85,7 @@ static bool test_uffdio_wp = false; static bool test_uffdio_minor = false;
static bool map_shared; +static int shm_fd; static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; @@ -277,8 +278,11 @@ static void shmem_release_pages(char *rel_area)
static void shmem_allocate_area(void **alloc_area) { + unsigned long offset = + alloc_area == (void **)&area_src ? 0 : nr_pages * page_size; + *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, - MAP_ANONYMOUS | MAP_SHARED, -1, 0); + MAP_SHARED, shm_fd, offset); if (*alloc_area == MAP_FAILED) err("mmap of memfd failed"); } @@ -1448,6 +1452,16 @@ int main(int argc, char **argv) err("Open of %s failed", argv[4]); if (ftruncate(huge_fd, 0)) err("ftruncate %s to size 0 failed", argv[4]); + } else if (test_type == TEST_SHMEM) { + shm_fd = memfd_create(argv[0], 0); + if (shm_fd < 0) + err("memfd_create"); + if (ftruncate(shm_fd, nr_pages * page_size * 2)) + err("ftruncate"); + if (fallocate(shm_fd, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, + nr_pages * page_size * 2)) + err("fallocate"); } printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n", nr_pages, nr_pages_per_cpu);
On Mon, Apr 12, 2021 at 10:17:17PM -0700, Axel Rasmussen wrote:
This is a preparatory commit. In the future, we want to be able to setup alias mappings for area_src and area_dst in the shmem test, like we do in the hugetlb_shared test. With a VMA obtained via mmap(MAP_ANONYMOUS | MAP_SHARED), it isn't clear how to do this.
So, mmap() with an fd, so we can create alias mappings. Use memfd_create instead of actually passing in a tmpfs path like hugetlb does, since it's more convenient / simpler to run, and works just as well.
Future commits will:
- Setup the alias mappings.
- Extend our tests to actually take advantage of this, to test new userfaultfd behavior being introduced in this series.
Also, a small fix in the area we're changing: when the hugetlb setup fails in main(), pass in the right argv[] so we actually print out the hugetlb file path.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
Reviewed-by: Peter Xu peterx@redhat.com
Previously, we just allocated two shm areas: area_src and area_dst. With this commit, change this so we also allocate area_src_alias, and area_dst_alias.
area_*_alias and area_* (respectively) point to the same underlying physical pages, but are different VMAs. In a future commit in this series, we'll leverage this setup to exercise minor fault handling support for shmem, just like we do in the hugetlb_shared test.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index fc40831f818f..1f65c4ab7994 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -278,13 +278,29 @@ static void shmem_release_pages(char *rel_area)
static void shmem_allocate_area(void **alloc_area) { - unsigned long offset = - alloc_area == (void **)&area_src ? 0 : nr_pages * page_size; + void *area_alias = NULL; + bool is_src = alloc_area == (void **)&area_src; + unsigned long offset = is_src ? 0 : nr_pages * page_size;
*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, offset); if (*alloc_area == MAP_FAILED) err("mmap of memfd failed"); + + area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, + MAP_SHARED, shm_fd, offset); + if (area_alias == MAP_FAILED) + err("mmap of memfd alias failed"); + + if (is_src) + area_src_alias = area_alias; + else + area_dst_alias = area_alias; +} + +static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset) +{ + *start = (unsigned long)area_dst_alias + offset; }
struct uffd_test_ops { @@ -314,7 +330,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = { .expected_ioctls = SHMEM_EXPECTED_IOCTLS, .allocate_area = shmem_allocate_area, .release_pages = shmem_release_pages, - .alias_mapping = noop_alias_mapping, + .alias_mapping = shmem_alias_mapping, };
static struct uffd_test_ops hugetlb_uffd_test_ops = {
On Mon, Apr 12, 2021 at 10:17:18PM -0700, Axel Rasmussen wrote:
static void shmem_allocate_area(void **alloc_area) {
- unsigned long offset =
alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
- void *area_alias = NULL;
- bool is_src = alloc_area == (void **)&area_src;
- unsigned long offset = is_src ? 0 : nr_pages * page_size;
*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, offset); if (*alloc_area == MAP_FAILED) err("mmap of memfd failed");
- area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
MAP_SHARED, shm_fd, offset);
- if (area_alias == MAP_FAILED)
err("mmap of memfd alias failed");
- if (is_src)
area_src_alias = area_alias;
- else
area_dst_alias = area_alias;
+}
It would be nice if shmem_allocate_area() could merge with hugetlb_allocate_area() somehow, but not that urgent.
Reviewed-by: Peter Xu peterx@redhat.com
Currently, the context (fds, mmap-ed areas, etc.) are global. Each test mutates this state in some way, in some cases really "clobbering it" (e.g., the events test mremap-ing area_dst over the top of area_src, or the minor faults tests overwriting the count_verify values in the test areas). We run the tests in a particular order, each test is careful to make the right assumptions about its starting state, etc.
But, this is fragile. It's better for a test's success or failure to not depend on what some other prior test case did to the global state.
To that end, clear and reinitialize the test context at the start of each test case, so whatever prior test cases did doesn't affect future tests.
This is particularly relevant to this series because the events test's mremap of area_dst screws up assumptions the minor fault test was relying on. This wasn't a problem for hugetlb, as we don't mremap in that case.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 221 +++++++++++++---------- 1 file changed, 127 insertions(+), 94 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 1f65c4ab7994..0ff01f437a39 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -89,7 +89,8 @@ static int shm_fd; static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; -static int uffd, uffd_flags, finished, *pipefd; +static int uffd = -1; +static int uffd_flags, finished, *pipefd; static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; static char *zeropage; pthread_attr_t attr; @@ -342,6 +343,121 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = {
static struct uffd_test_ops *uffd_test_ops;
+static int userfaultfd_open(uint64_t *features) +{ + struct uffdio_api uffdio_api; + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) + err("userfaultfd syscall not available in this kernel"); + uffd_flags = fcntl(uffd, F_GETFD, NULL); + + uffdio_api.api = UFFD_API; + uffdio_api.features = *features; + if (ioctl(uffd, UFFDIO_API, &uffdio_api)) + err("UFFDIO_API failed.\nPlease make sure to " + "run with either root or ptrace capability."); + if (uffdio_api.api != UFFD_API) + err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api); + + *features = uffdio_api.features; + return 0; +} + +static int uffd_test_ctx_init_ext(uint64_t *features) +{ + unsigned long nr, cpu; + + uffd_test_ops->allocate_area((void **)&area_src); + if (!area_src) + return 1; + uffd_test_ops->allocate_area((void **)&area_dst); + if (!area_dst) + return 1; + + uffd_test_ops->release_pages(area_src); + uffd_test_ops->release_pages(area_dst); + + if (userfaultfd_open(features)) + return 1; + + count_verify = malloc(nr_pages * sizeof(unsigned long long)); + if (!count_verify) + err("count_verify"); + + for (nr = 0; nr < nr_pages; nr++) { + *area_mutex(area_src, nr) = + (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER; + count_verify[nr] = *area_count(area_src, nr) = 1; + /* + * In the transition between 255 to 256, powerpc will + * read out of order in my_bcmp and see both bytes as + * zero, so leave a placeholder below always non-zero + * after the count, to avoid my_bcmp to trigger false + * positives. + */ + *(area_count(area_src, nr) + 1) = 1; + } + + pipefd = malloc(sizeof(int) * nr_cpus * 2); + if (!pipefd) + err("pipefd"); + for (cpu = 0; cpu < nr_cpus; cpu++) + if (pipe2(&pipefd[cpu * 2], O_CLOEXEC | O_NONBLOCK)) + err("pipe"); + + return 0; +} + +static inline int uffd_test_ctx_init(uint64_t features) +{ + return uffd_test_ctx_init_ext(&features); +} + +static inline int munmap_area(void **area) +{ + if (*area) + if (munmap(*area, nr_pages * page_size)) + err("munmap"); + + *area = NULL; + return 0; +} + +static int uffd_test_ctx_clear(void) +{ + int ret = 0; + size_t i; + + if (pipefd) { + for (i = 0; i < nr_cpus * 2; ++i) { + if (close(pipefd[i])) + err("close pipefd"); + } + free(pipefd); + pipefd = NULL; + } + + if (count_verify) { + free(count_verify); + count_verify = NULL; + } + + if (uffd != -1) { + if (close(uffd)) + err("close uffd"); + uffd = -1; + } + + huge_fd_off0 = NULL; + ret |= munmap_area((void **)&area_src); + ret |= munmap_area((void **)&area_src_alias); + ret |= munmap_area((void **)&area_dst); + ret |= munmap_area((void **)&area_dst_alias); + + return ret; +} + static int my_bcmp(char *str1, char *str2, size_t n) { unsigned long i; @@ -726,40 +842,6 @@ static int stress(struct uffd_stats *uffd_stats) return 0; }
-static int userfaultfd_open_ext(uint64_t *features) -{ - struct uffdio_api uffdio_api; - - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); - if (uffd < 0) { - fprintf(stderr, - "userfaultfd syscall not available in this kernel\n"); - return 1; - } - uffd_flags = fcntl(uffd, F_GETFD, NULL); - - uffdio_api.api = UFFD_API; - uffdio_api.features = *features; - if (ioctl(uffd, UFFDIO_API, &uffdio_api)) { - fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to " - "run with either root or ptrace capability.\n"); - return 1; - } - if (uffdio_api.api != UFFD_API) { - fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n", - (uint64_t)uffdio_api.api); - return 1; - } - - *features = uffdio_api.features; - return 0; -} - -static int userfaultfd_open(uint64_t features) -{ - return userfaultfd_open_ext(&features); -} - sigjmp_buf jbuf, *sigbuf;
static void sighndl(int sig, siginfo_t *siginfo, void *ptr) @@ -868,6 +950,8 @@ static int faulting_process(int signal_test) MREMAP_MAYMOVE | MREMAP_FIXED, area_src); if (area_dst == MAP_FAILED) err("mremap"); + /* Reset area_src since we just clobbered it */ + area_src = NULL;
for (; nr < nr_pages; nr++) { count = *area_count(area_dst, nr); @@ -961,10 +1045,9 @@ static int userfaultfd_zeropage_test(void) printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout);
- uffd_test_ops->release_pages(area_dst); - - if (userfaultfd_open(0)) + if (uffd_test_ctx_clear() || uffd_test_ctx_init(0)) return 1; + uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; @@ -981,7 +1064,6 @@ static int userfaultfd_zeropage_test(void) if (my_bcmp(area_dst, zeropage, page_size)) err("zeropage is not zero");
- close(uffd); printf("done.\n"); return 0; } @@ -999,12 +1081,11 @@ static int userfaultfd_events_test(void) printf("testing events (fork, remap, remove): "); fflush(stdout);
- uffd_test_ops->release_pages(area_dst); - features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | UFFD_FEATURE_EVENT_REMOVE; - if (userfaultfd_open(features)) + if (uffd_test_ctx_clear() || uffd_test_ctx_init(features)) return 1; + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
uffdio_register.range.start = (unsigned long) area_dst; @@ -1037,8 +1118,6 @@ static int userfaultfd_events_test(void) if (pthread_join(uffd_mon, NULL)) return 1;
- close(uffd); - uffd_stats_report(&stats, 1);
return stats.missing_faults != nr_pages; @@ -1058,11 +1137,10 @@ static int userfaultfd_sig_test(void) printf("testing signal delivery: "); fflush(stdout);
- uffd_test_ops->release_pages(area_dst); - features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; - if (userfaultfd_open(features)) + if (uffd_test_ctx_clear() || uffd_test_ctx_init(features)) return 1; + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
uffdio_register.range.start = (unsigned long) area_dst; @@ -1126,9 +1204,7 @@ static int userfaultfd_minor_test(void) printf("testing minor faults: "); fflush(stdout);
- uffd_test_ops->release_pages(area_dst); - - if (userfaultfd_open_ext(&features)) + if (uffd_test_ctx_clear() || uffd_test_ctx_init_ext(&features)) return 1; /* If kernel reports the feature isn't supported, skip the test. */ if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) { @@ -1183,8 +1259,6 @@ static int userfaultfd_minor_test(void) if (pthread_join(uffd_mon, NULL)) return 1;
- close(uffd); - uffd_stats_report(&stats, 1);
return stats.missing_faults != 0 || stats.minor_faults != nr_pages; @@ -1196,50 +1270,10 @@ static int userfaultfd_stress(void) char *tmp_area; unsigned long nr; struct uffdio_register uffdio_register; - unsigned long cpu; struct uffd_stats uffd_stats[nr_cpus];
- uffd_test_ops->allocate_area((void **)&area_src); - if (!area_src) - return 1; - uffd_test_ops->allocate_area((void **)&area_dst); - if (!area_dst) - return 1; - - if (userfaultfd_open(0)) - return 1; - - count_verify = malloc(nr_pages * sizeof(unsigned long long)); - if (!count_verify) { - perror("count_verify"); - return 1; - } - - for (nr = 0; nr < nr_pages; nr++) { - *area_mutex(area_src, nr) = (pthread_mutex_t) - PTHREAD_MUTEX_INITIALIZER; - count_verify[nr] = *area_count(area_src, nr) = 1; - /* - * In the transition between 255 to 256, powerpc will - * read out of order in my_bcmp and see both bytes as - * zero, so leave a placeholder below always non-zero - * after the count, to avoid my_bcmp to trigger false - * positives. - */ - *(area_count(area_src, nr) + 1) = 1; - } - - pipefd = malloc(sizeof(int) * nr_cpus * 2); - if (!pipefd) { - perror("pipefd"); + if (uffd_test_ctx_init(0)) return 1; - } - for (cpu = 0; cpu < nr_cpus; cpu++) { - if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) { - perror("pipe"); - return 1; - } - }
if (posix_memalign(&area, page_size, page_size)) err("out of memory"); @@ -1360,7 +1394,6 @@ static int userfaultfd_stress(void) uffd_stats_report(uffd_stats, nr_cpus); }
- close(uffd); return userfaultfd_zeropage_test() || userfaultfd_sig_test() || userfaultfd_events_test() || userfaultfd_minor_test(); }
On Mon, Apr 12, 2021 at 10:17:19PM -0700, Axel Rasmussen wrote:
Currently, the context (fds, mmap-ed areas, etc.) are global. Each test mutates this state in some way, in some cases really "clobbering it" (e.g., the events test mremap-ing area_dst over the top of area_src, or the minor faults tests overwriting the count_verify values in the test areas). We run the tests in a particular order, each test is careful to make the right assumptions about its starting state, etc.
But, this is fragile. It's better for a test's success or failure to not depend on what some other prior test case did to the global state.
To that end, clear and reinitialize the test context at the start of each test case, so whatever prior test cases did doesn't affect future tests.
This is particularly relevant to this series because the events test's mremap of area_dst screws up assumptions the minor fault test was relying on. This wasn't a problem for hugetlb, as we don't mremap in that case.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
tools/testing/selftests/vm/userfaultfd.c | 221 +++++++++++++---------- 1 file changed, 127 insertions(+), 94 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 1f65c4ab7994..0ff01f437a39 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -89,7 +89,8 @@ static int shm_fd; static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; -static int uffd, uffd_flags, finished, *pipefd; +static int uffd = -1; +static int uffd_flags, finished, *pipefd; static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; static char *zeropage; pthread_attr_t attr; @@ -342,6 +343,121 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = { static struct uffd_test_ops *uffd_test_ops; +static int userfaultfd_open(uint64_t *features) +{
- struct uffdio_api uffdio_api;
- uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
Keep UFFD_USER_MODE_ONLY?
[...]
@@ -961,10 +1045,9 @@ static int userfaultfd_zeropage_test(void) printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout);
- uffd_test_ops->release_pages(area_dst);
- if (userfaultfd_open(0))
- if (uffd_test_ctx_clear() || uffd_test_ctx_init(0)) return 1;
Would it look even nicer to init() at the entry of each test, and clear() after finish one test?
- uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
The rest looks good to me. Thanks,
On Tue, Apr 13, 2021 at 1:15 PM Peter Xu peterx@redhat.com wrote:
On Mon, Apr 12, 2021 at 10:17:19PM -0700, Axel Rasmussen wrote:
Currently, the context (fds, mmap-ed areas, etc.) are global. Each test mutates this state in some way, in some cases really "clobbering it" (e.g., the events test mremap-ing area_dst over the top of area_src, or the minor faults tests overwriting the count_verify values in the test areas). We run the tests in a particular order, each test is careful to make the right assumptions about its starting state, etc.
But, this is fragile. It's better for a test's success or failure to not depend on what some other prior test case did to the global state.
To that end, clear and reinitialize the test context at the start of each test case, so whatever prior test cases did doesn't affect future tests.
This is particularly relevant to this series because the events test's mremap of area_dst screws up assumptions the minor fault test was relying on. This wasn't a problem for hugetlb, as we don't mremap in that case.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
tools/testing/selftests/vm/userfaultfd.c | 221 +++++++++++++---------- 1 file changed, 127 insertions(+), 94 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 1f65c4ab7994..0ff01f437a39 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -89,7 +89,8 @@ static int shm_fd; static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; -static int uffd, uffd_flags, finished, *pipefd; +static int uffd = -1; +static int uffd_flags, finished, *pipefd; static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; static char *zeropage; pthread_attr_t attr; @@ -342,6 +343,121 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = {
static struct uffd_test_ops *uffd_test_ops;
+static int userfaultfd_open(uint64_t *features) +{
struct uffdio_api uffdio_api;
uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
Keep UFFD_USER_MODE_ONLY?
[...]
@@ -961,10 +1045,9 @@ static int userfaultfd_zeropage_test(void) printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout);
uffd_test_ops->release_pages(area_dst);
if (userfaultfd_open(0))
if (uffd_test_ctx_clear() || uffd_test_ctx_init(0)) return 1;
Would it look even nicer to init() at the entry of each test, and clear() after finish one test?
I slightly prefer clearing at the beginning, as it means we don't need to depend on the previous test being correct for this test to function. And, we don't need more complex error handling in the test cases to make sure we don't mess things up for the next test.
But, two things we can do to clean this up as-is:
The initialization function can just call clear itself, so tests don't need to worry about it.
And, with err(), we don't need these functions to return an int any more.
I'll send a version like that, we can see how it looks.
uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
The rest looks good to me. Thanks,
-- Peter Xu
Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the test slightly to pass in / check for the right feature flags.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- tools/testing/selftests/vm/userfaultfd.c | 29 ++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 0ff01f437a39..0830f155e6c2 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -484,6 +484,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp) static void continue_range(int ufd, __u64 start, __u64 len) { struct uffdio_continue req; + int ret;
req.range.start = start; req.range.len = len; @@ -492,6 +493,17 @@ static void continue_range(int ufd, __u64 start, __u64 len) if (ioctl(ufd, UFFDIO_CONTINUE, &req)) err("UFFDIO_CONTINUE failed for address 0x%" PRIx64, (uint64_t)start); + + /* + * Error handling within the kernel for continue is subtly different + * from copy or zeropage, so it may be a source of bugs. Trigger an + * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG. + */ + req.mapped = 0; + ret = ioctl(ufd, UFFDIO_CONTINUE, &req); + if (ret >= 0 || req.mapped != -EEXIST) + err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64, + ret, (int64_t) req.mapped); }
static void *locking_thread(void *arg) @@ -1196,7 +1208,7 @@ static int userfaultfd_minor_test(void) void *expected_page; char c; struct uffd_stats stats = { 0 }; - uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS; + uint64_t req_features, features_out;
if (!test_uffdio_minor) return 0; @@ -1204,10 +1216,18 @@ static int userfaultfd_minor_test(void) printf("testing minor faults: "); fflush(stdout);
- if (uffd_test_ctx_clear() || uffd_test_ctx_init_ext(&features)) + if (test_type == TEST_HUGETLB) + req_features = UFFD_FEATURE_MINOR_HUGETLBFS; + else if (test_type == TEST_SHMEM) + req_features = UFFD_FEATURE_MINOR_SHMEM; + else + return 1; + + features_out = req_features; + if (uffd_test_ctx_clear() || uffd_test_ctx_init_ext(&features_out)) return 1; - /* If kernel reports the feature isn't supported, skip the test. */ - if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) { + /* If kernel reports required features aren't supported, skip test. */ + if ((features_out & req_features) != req_features) { printf("skipping test due to lack of feature support\n"); fflush(stdout); return 0; @@ -1442,6 +1462,7 @@ static void set_test_type(const char *type) map_shared = true; test_type = TEST_SHMEM; uffd_test_ops = &shmem_uffd_test_ops; + test_uffdio_minor = true; } else { err("Unknown test type: %s", type); }
In a previous commit, we added the mcopy_atomic_install_ptes() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases.
In other words, shmem_mcopy_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mcopy_atomic_pte() use it directly, to reduce code duplication.
This requires that we refactor shmem_mcopy_atomic-pte() a bit:
Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it beforehand. This frees up mcopy_atomic_install_ptes() from having to care about this accounting, but it does mean we need to clean it up if we get a failure afterwards (shmem_uncharge()).
We can *almost* use shmem_charge() to do this, reducing code duplication. But, it does `inode->i_mapping->nrpages++`, which would double-count since shmem_add_to_page_cache() also does this.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com --- include/linux/userfaultfd_k.h | 5 ++++ mm/shmem.c | 52 +++++++---------------------------- mm/userfaultfd.c | 25 ++++++++--------- 3 files changed, 27 insertions(+), 55 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 794d1538b8ba..3e20bfa9ef80 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -53,6 +53,11 @@ enum mcopy_atomic_mode { MCOPY_ATOMIC_CONTINUE, };
+extern int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + bool newly_allocated, bool wp_copy); + extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, bool *mmap_changing, __u64 mode); diff --git a/mm/shmem.c b/mm/shmem.c index 3f48cb5e8404..9b12298405a4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2376,10 +2376,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, struct address_space *mapping = inode->i_mapping; gfp_t gfp = mapping_gfp_mask(mapping); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); - spinlock_t *ptl; void *page_kaddr; struct page *page; - pte_t _dst_pte, *dst_pte; int ret; pgoff_t max_off;
@@ -2389,8 +2387,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (!*pagep) { page = shmem_alloc_page(gfp, info, pgoff); - if (!page) - goto out_unacct_blocks; + if (!page) { + shmem_inode_unacct_blocks(inode, 1); + goto out; + }
if (!zeropage) { /* COPY */ page_kaddr = kmap_atomic(page); @@ -2430,59 +2430,27 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, if (ret) goto out_release;
- _dst_pte = mk_pte(page, dst_vma->vm_page_prot); - if (dst_vma->vm_flags & VM_WRITE) - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte)); - else { - /* - * We don't set the pte dirty if the vma has no - * VM_WRITE permission, so mark the page dirty or it - * could be freed from under us. We could do it - * unconditionally before unlock_page(), but doing it - * only if VM_WRITE is not set is faster. - */ - set_page_dirty(page); - } - - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); - - ret = -EFAULT; - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(pgoff >= max_off)) - goto out_release_unlock; - - ret = -EEXIST; - if (!pte_none(*dst_pte)) - goto out_release_unlock; - - lru_cache_add(page); - spin_lock_irq(&info->lock); info->alloced++; inode->i_blocks += BLOCKS_PER_PAGE; shmem_recalc_inode(inode); spin_unlock_irq(&info->lock);
- inc_mm_counter(dst_mm, mm_counter_file(page)); - page_add_file_rmap(page, false); - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr, + page, true, false); + if (ret) + goto out_release_uncharge;
- /* No need to invalidate - it was non-present before */ - update_mmu_cache(dst_vma, dst_addr, dst_pte); - pte_unmap_unlock(dst_pte, ptl); unlock_page(page); ret = 0; out: return ret; -out_release_unlock: - pte_unmap_unlock(dst_pte, ptl); - ClearPageDirty(page); +out_release_uncharge: delete_from_page_cache(page); + shmem_uncharge(inode, 1); out_release: unlock_page(page); put_page(page); -out_unacct_blocks: - shmem_inode_unacct_blocks(inode, 1); goto out; } #endif /* CONFIG_USERFAULTFD */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 8df0438f5d6a..3f73ba0b99f0 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, /* * Install PTEs, to map dst_addr (within dst_vma) to page. * - * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed), - * whether or not dst_vma is VM_SHARED. It also handles the more general - * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file - * backed, or not). - * - * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by - * shmem_mcopy_atomic_pte instead. + * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem + * and anon, and for both shared and private VMAs. */ -static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, struct page *page, - bool newly_allocated, bool wp_copy) +int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + bool newly_allocated, bool wp_copy) { int ret; pte_t _dst_pte, *dst_pte; @@ -116,8 +111,12 @@ static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, else page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- if (newly_allocated) - lru_cache_add_inactive_or_unevictable(page, dst_vma); + if (newly_allocated) { + if (vma_is_shmem(dst_vma) && vm_shared) + lru_cache_add(page); + else + lru_cache_add_inactive_or_unevictable(page, dst_vma); + }
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
In a previous commit, we added the mcopy_atomic_install_ptes() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases.
In other words, shmem_mcopy_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mcopy_atomic_pte() use it directly, to reduce code duplication.
This requires that we refactor shmem_mcopy_atomic-pte() a bit:
Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it beforehand. This frees up mcopy_atomic_install_ptes() from having to care about this accounting, but it does mean we need to clean it up if we get a failure afterwards (shmem_uncharge()).
We can *almost* use shmem_charge() to do this, reducing code duplication. But, it does `inode->i_mapping->nrpages++`, which would double-count since shmem_add_to_page_cache() also does this.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
include/linux/userfaultfd_k.h | 5 ++++ mm/shmem.c | 52 +++++++---------------------------- mm/userfaultfd.c | 25 ++++++++--------- 3 files changed, 27 insertions(+), 55 deletions(-)
Very nice, and it gets better.
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 794d1538b8ba..3e20bfa9ef80 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -53,6 +53,11 @@ enum mcopy_atomic_mode { MCOPY_ATOMIC_CONTINUE, }; +extern int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
mcopy_atomic_install_pte throughout as before.
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy);
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, bool *mmap_changing, __u64 mode); diff --git a/mm/shmem.c b/mm/shmem.c index 3f48cb5e8404..9b12298405a4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2376,10 +2376,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, struct address_space *mapping = inode->i_mapping; gfp_t gfp = mapping_gfp_mask(mapping); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
- spinlock_t *ptl; void *page_kaddr; struct page *page;
- pte_t _dst_pte, *dst_pte; int ret; pgoff_t max_off;
@@ -2389,8 +2387,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, if (!*pagep) { page = shmem_alloc_page(gfp, info, pgoff);
if (!page)
goto out_unacct_blocks;
if (!page) {
shmem_inode_unacct_blocks(inode, 1);
goto out;
}
if (!zeropage) { /* COPY */ page_kaddr = kmap_atomic(page); @@ -2430,59 +2430,27 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, if (ret) goto out_release;
- _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
- if (dst_vma->vm_flags & VM_WRITE)
_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
- else {
/*
* We don't set the pte dirty if the vma has no
* VM_WRITE permission, so mark the page dirty or it
* could be freed from under us. We could do it
* unconditionally before unlock_page(), but doing it
* only if VM_WRITE is not set is faster.
*/
set_page_dirty(page);
- }
- dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
- ret = -EFAULT;
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(pgoff >= max_off))
goto out_release_unlock;
- ret = -EEXIST;
- if (!pte_none(*dst_pte))
goto out_release_unlock;
- lru_cache_add(page);
- spin_lock_irq(&info->lock); info->alloced++; inode->i_blocks += BLOCKS_PER_PAGE; shmem_recalc_inode(inode); spin_unlock_irq(&info->lock);
- inc_mm_counter(dst_mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
- set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
- ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
page, true, false);
- if (ret)
goto out_release_uncharge;
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
- pte_unmap_unlock(dst_pte, ptl);
In reviewing 4/9, I said to take out mcopy_atomic_install_pte()'s set_page_dirty(). Just call it here instead, before the unlock_page(). You have an array of choices for how to write it:
if (!(dst_vma->vm_flags & VM_WRITE)) set_page_dirty(page); or if (!(dst_vma->vm_flags & VM_WRITE)) SetPageDirty(page); or set_page_dirty(page); or SetPageDirty(page);
Personally, I'd go for the last: this function has just modified the page, so it ought to mark it dirty: without second-guessing what mcopy_atomic_install_pte() might do to make that redundant in the VM_WRITE case. set_page_dirty() or SetPageDirty()? Some years ago I tended to favour the former for its preparatory PageDirty check; nowadays (on shmem) I favour the latter, to avoid the function call indirection which became more expensive with spectre+retpoline.
unlock_page(page); ret = 0; out: return ret; -out_release_unlock:
- pte_unmap_unlock(dst_pte, ptl);
- ClearPageDirty(page);
+out_release_uncharge:
Given how 4/9 was, this did still need the ClearPageDirty() before delete_from_page_cache(), to prevent a warning. But not needed here if SPD is done just before unlock_page().
delete_from_page_cache(page);
- shmem_uncharge(inode, 1);
out_release: unlock_page(page); put_page(page); -out_unacct_blocks:
- shmem_inode_unacct_blocks(inode, 1); goto out;
} #endif /* CONFIG_USERFAULTFD */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 8df0438f5d6a..3f73ba0b99f0 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, /*
- Install PTEs, to map dst_addr (within dst_vma) to page.
- This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- whether or not dst_vma is VM_SHARED. It also handles the more general
- MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- backed, or not).
- Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- shmem_mcopy_atomic_pte instead.
- This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
*/
- and anon, and for both shared and private VMAs.
-static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy)
+int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy)
{ int ret; pte_t _dst_pte, *dst_pte; @@ -116,8 +111,12 @@ static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, else page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- if (newly_allocated)
lru_cache_add_inactive_or_unevictable(page, dst_vma);
- if (newly_allocated) {
if (vma_is_shmem(dst_vma) && vm_shared)
lru_cache_add(page);
else
lru_cache_add_inactive_or_unevictable(page, dst_vma);
- }
This change is not required, you'll be glad to hear. Take a look at lru_cache_add_inactive_or_unevictable() in mm/swap.c: it's a wrapper to set PageMlocked and do associated accounting (if appropriate) before doing the lru_cache_add(page). And, strictly speaking, should have been used in shmem_mcopy_atomic_pte() all along.
I say "strictly speaking" because it would be exceptional to find a page needing Mlock there, and not an error to delay that Mlock until sometime later, before reclaim. Worth a sentence in the commit message, but not worth a "Fixes:" tag.
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); -- 2.31.1.295.g9ea45b61b8-goog
On Fri, Apr 16, 2021 at 5:34 PM Hugh Dickins hughd@google.com wrote:
On Mon, 12 Apr 2021, Axel Rasmussen wrote:
In a previous commit, we added the mcopy_atomic_install_ptes() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases.
In other words, shmem_mcopy_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mcopy_atomic_pte() use it directly, to reduce code duplication.
This requires that we refactor shmem_mcopy_atomic-pte() a bit:
Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it beforehand. This frees up mcopy_atomic_install_ptes() from having to care about this accounting, but it does mean we need to clean it up if we get a failure afterwards (shmem_uncharge()).
We can *almost* use shmem_charge() to do this, reducing code duplication. But, it does `inode->i_mapping->nrpages++`, which would double-count since shmem_add_to_page_cache() also does this.
Signed-off-by: Axel Rasmussen axelrasmussen@google.com
include/linux/userfaultfd_k.h | 5 ++++ mm/shmem.c | 52 +++++++---------------------------- mm/userfaultfd.c | 25 ++++++++--------- 3 files changed, 27 insertions(+), 55 deletions(-)
Very nice, and it gets better.
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 794d1538b8ba..3e20bfa9ef80 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -53,6 +53,11 @@ enum mcopy_atomic_mode { MCOPY_ATOMIC_CONTINUE, };
+extern int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
mcopy_atomic_install_pte throughout as before.
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy);
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, bool *mmap_changing, __u64 mode); diff --git a/mm/shmem.c b/mm/shmem.c index 3f48cb5e8404..9b12298405a4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2376,10 +2376,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, struct address_space *mapping = inode->i_mapping; gfp_t gfp = mapping_gfp_mask(mapping); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
spinlock_t *ptl; void *page_kaddr; struct page *page;
pte_t _dst_pte, *dst_pte; int ret; pgoff_t max_off;
@@ -2389,8 +2387,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (!*pagep) { page = shmem_alloc_page(gfp, info, pgoff);
if (!page)
goto out_unacct_blocks;
if (!page) {
shmem_inode_unacct_blocks(inode, 1);
goto out;
} if (!zeropage) { /* COPY */ page_kaddr = kmap_atomic(page);
@@ -2430,59 +2430,27 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, if (ret) goto out_release;
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
if (dst_vma->vm_flags & VM_WRITE)
_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
else {
/*
* We don't set the pte dirty if the vma has no
* VM_WRITE permission, so mark the page dirty or it
* could be freed from under us. We could do it
* unconditionally before unlock_page(), but doing it
* only if VM_WRITE is not set is faster.
*/
set_page_dirty(page);
}
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
ret = -EFAULT;
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(pgoff >= max_off))
goto out_release_unlock;
ret = -EEXIST;
if (!pte_none(*dst_pte))
goto out_release_unlock;
lru_cache_add(page);
spin_lock_irq(&info->lock); info->alloced++; inode->i_blocks += BLOCKS_PER_PAGE; shmem_recalc_inode(inode); spin_unlock_irq(&info->lock);
inc_mm_counter(dst_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
page, true, false);
if (ret)
goto out_release_uncharge;
/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
pte_unmap_unlock(dst_pte, ptl);
In reviewing 4/9, I said to take out mcopy_atomic_install_pte()'s set_page_dirty(). Just call it here instead, before the unlock_page(). You have an array of choices for how to write it:
if (!(dst_vma->vm_flags & VM_WRITE)) set_page_dirty(page);
or if (!(dst_vma->vm_flags & VM_WRITE)) SetPageDirty(page); or set_page_dirty(page); or SetPageDirty(page);
Personally, I'd go for the last: this function has just modified the page, so it ought to mark it dirty: without second-guessing what mcopy_atomic_install_pte() might do to make that redundant in the VM_WRITE case. set_page_dirty() or SetPageDirty()? Some years ago I tended to favour the former for its preparatory PageDirty check; nowadays (on shmem) I favour the latter, to avoid the function call indirection which became more expensive with spectre+retpoline.
Now reaching this comment after looking through patch 4:
Ah, this makes sense. I agree this is clearer too, as it doesn't split the concern across two files / functions.
unlock_page(page); ret = 0;
out: return ret; -out_release_unlock:
pte_unmap_unlock(dst_pte, ptl);
ClearPageDirty(page);
+out_release_uncharge:
Given how 4/9 was, this did still need the ClearPageDirty() before delete_from_page_cache(), to prevent a warning. But not needed here if SPD is done just before unlock_page().
delete_from_page_cache(page);
shmem_uncharge(inode, 1);
out_release: unlock_page(page); put_page(page); -out_unacct_blocks:
shmem_inode_unacct_blocks(inode, 1); goto out;
} #endif /* CONFIG_USERFAULTFD */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 8df0438f5d6a..3f73ba0b99f0 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, /*
- Install PTEs, to map dst_addr (within dst_vma) to page.
- This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- whether or not dst_vma is VM_SHARED. It also handles the more general
- MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- backed, or not).
- Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- shmem_mcopy_atomic_pte instead.
- This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
*/
- and anon, and for both shared and private VMAs.
-static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy)
+int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr, struct page *page,
bool newly_allocated, bool wp_copy)
{ int ret; pte_t _dst_pte, *dst_pte; @@ -116,8 +111,12 @@ static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, else page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
if (newly_allocated)
lru_cache_add_inactive_or_unevictable(page, dst_vma);
if (newly_allocated) {
if (vma_is_shmem(dst_vma) && vm_shared)
lru_cache_add(page);
else
lru_cache_add_inactive_or_unevictable(page, dst_vma);
}
This change is not required, you'll be glad to hear. Take a look at lru_cache_add_inactive_or_unevictable() in mm/swap.c: it's a wrapper to set PageMlocked and do associated accounting (if appropriate) before doing the lru_cache_add(page). And, strictly speaking, should have been used in shmem_mcopy_atomic_pte() all along.
I say "strictly speaking" because it would be exceptional to find a page needing Mlock there, and not an error to delay that Mlock until sometime later, before reclaim. Worth a sentence in the commit message, but not worth a "Fixes:" tag.
This makes sense. I had considered doing this before, but took a brief look at the functions and saw they were indeed different. But looking more closely, you're right that the wrapper is only different if some particular conditions are true, and so is fine to use in other cases.
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-- 2.31.1.295.g9ea45b61b8-goog
linux-kselftest-mirror@lists.linaro.org