Hi everybody,
as discussed in the linux-mm alignment session on Wednesday, this is part 1 of the COW fixes: fix the COW security issue using GUP-triggered unsharing of shared anonymous pages (ordinary, THP, hugetlb). In the meeting slides, this approach was referred to as "Copy On Read". If anybody wants to have access to the slides, please feel free to reach out.
The patches are based on v5.16-rc5 and available at: https://github.com/davidhildenbrand/linux/pull/new/unshare_v1
It is currently again possible for a child process to observe modifications of anonymous pages performed by the parent process after fork() in some cases, which is not only a violation of the POSIX semantics of MAP_PRIVATE, but more importantly a real security issue.
This issue, including other related COW issues, has been summarized at [1]: " 1. Observing Memory Modifications of Private Pages From A Child Process
Long story short: process-private memory might not be as private as you think once you fork(): successive modifications of private memory regions in the parent process can still be observed by the child process, for example, by smart use of vmsplice()+munmap().
The core problem is that pinning pages readable in a child process, such as done via the vmsplice system call, can result in a child process observing memory modifications done in the parent process the child is not supposed to observe. [1] contains an excellent summary and [2] contains further details. This issue was assigned CVE-2020-29374 [9].
For this to trigger, it's required to use a fork() without subsequent exec(), for example, as used under Android zygote. Without further details about an application that forks less-privileged child processes, one cannot really say what's actually affected and what's not -- see the details section at the end of this mail for a short sshd/openssh analysis.
While commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") fixed this issue (but resulted in other problems, e.g., broken ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page() simplification") unfortunately re-introduced part of the problem.
The original reproducer can be modified quite easily to use THP [3] and make the issue appear again on upstream kernels. I modified it to use hugetlb [4] and it triggers as well. The problem is certainly less severe with hugetlb than with THP; it merely highlights that we still have plenty of open holes we should be closing/fixing.
Regarding vmsplice(), the only known workaround is to disallow the vmsplice() system call ... or disable THP and hugetlb. But who knows what else can be used to achieve the same goal (RDMA? O_DIRECT?) -- in the end, it's a more generic issue."
This security issue was first reported by Jann Horn on 27 May 2020 and it currently affects anonymous THP and hugetlb again. The "security issue" part for hugetlb might be less important than for THP. However, with this approach it is easy to get the MAP_PRIVATE semantics right for all anonymous pages in that regard and to avoid any such information leaks without much added complexity.
Ordinary anonymous pages are currently not affected, because the COW logic was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification") for them to COW on "page_count() != 1" instead of "mapcount > 1", which unfortunately results in other COW issues, some of them documented in [1] as well.
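For illustration only -- this is not the exact do_wp_page() code, and the #ifdef switch below is made up purely for presentation -- the gist of that change to the write-fault reuse decision is:

    /*
     * Illustrative sketch of the anonymous-page reuse decision in the
     * write-fault path; condensed, not the exact upstream code.
     */
    static bool can_reuse_anon_page(struct page *page)
    {
    #ifdef OLD_COW_LOGIC   /* before commit 09854ba94c6a */
    	/* Copy only if somebody else still maps the page. */
    	return page_mapcount(page) == 1;
    #else                  /* since commit 09854ba94c6a */
    	/* Copy if anybody else holds a reference, e.g., a GUP pin. */
    	return page_count(page) == 1;
    #endif
    }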
To fix this COW issue once and for all, introduce GUP-triggered unsharing that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to traditional COW, unsharing will leave the copied page mapped write-protected in the page table, not having the semantics of a write fault.
Logically, unsharing is triggered "early", as soon as GUP performs the action that could result in a COW getting missed later and the security issue triggering. However, unlike the original fix, unsharing is not triggered via a write fault, avoiding the undesired side effects write faults had.
Long story short, GUP triggers unsharing if all of the following conditions are met:
* The page is mapped R/O
* We have an anonymous page, excluding KSM
* We want to read (!FOLL_WRITE)
* Unsharing is not disabled (!FOLL_NOUNSHARE)
* We want to take a reference (FOLL_GET or FOLL_PIN)
* The page is a shared anonymous page: mapcount > 1
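A condensed sketch of that check -- the helper name is made up for illustration, the actual checks in this series are spread over the GUP code, and the "mapped R/O" condition is checked on the PTE/PMD by the caller:

    static bool gup_wants_unshare(struct page *page, unsigned int gup_flags)
    {
    	/* Only reads that take a reference can miss a later COW. */
    	if (gup_flags & (FOLL_WRITE | FOLL_NOUNSHARE))
    		return false;
    	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
    		return false;
    	/* Only anonymous pages, excluding KSM pages, are of interest. */
    	if (!PageAnon(page) || PageKsm(page))
    		return false;
    	/*
    	 * Unsharing is only required for shared anonymous pages; for THP,
    	 * page_trans_huge_anon_shared() is used instead of this check.
    	 */
    	return page_mapcount(page) > 1;
    }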
To reliably detect shared anonymous THP without heavy locking, introduce a mapcount_seqcount that protects the mapcount of a THP and can be used to read an atomic mapcount value. The mapcount_seqcount is stored inside the memmap of the compound page -- to keep it minimal, factor a lockdep-free raw_seqcount_t out of the seqcount_t.
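For illustration, reading the mapcount then boils down to the usual seqcount retry pattern. The following is condensed from the total_mapcount() change in this series; the wrapper name is made up and the PageDoubleMap() adjustment is omitted for brevity:

    static int thp_anon_mapcount(struct page *page)
    {
    	unsigned int seqcount;
    	int i, ret;

    	do {
    		seqcount = thp_mapcount_read_begin(page);
    		ret = head_compound_mapcount(page);
    		for (i = 0; i < compound_nr(page); i++)
    			ret += atomic_read(&page[i]._mapcount) + 1;
    	} while (thp_mapcount_read_retry(page, seqcount));

    	return ret;
    }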
As this patch series introduces the same unsharing logic for any anonymous pages, it also paves the way to fix other COW issues, e.g., documented in [1], without reintroducing the security issue or reintroducing other issues we observed in the past (e.g., broken ptrace on pmem).
All reproducers for this COW issue have been consolidated in the selftest included in this series. Hopefully we'll get this fixed for good.
Future work:
* get_user_pages_fast_only() can currently spin on the mapcount_seqcount when reading the mapcount, which might be a rare event. While this is fine even when done from get_user_pages_fast_only() in IRQ context, we might want to just fail fast in get_user_pages_fast_only(). We already have patches prepared that add page_anon_maybe_shared() and page_trans_huge_anon_maybe_shared() that will return "true" in case spinning would be required and make get_user_pages_fast_only() fail fast. I'm excluding them for simplicity.
... even better would be finding a way to just not need the mapcount_seqcount, but THP splitting and PageDoubleMap() gives us a hard time -- but maybe we'll eventually find a way someday :)
* Part 2 will tackle the other user-space visible breakages / COW issues raised in [1]. This series is the basis for adjusting the COW logic once again without re-introducing the COW issue fixed in this series and without reintroducing the issues we saw with the original CVE fix (e.g., breaking ptrace on pmem). There might be further parts to improve the GUP long-term <-> MM synchronicity and to optimize some things around that.
The idea is Andrea's, and some patches are rewritten versions of his prototype patches. I cross-compiled and tested as well as possible.
I'll CC locking+selftest folks only on the relevant patch and the cover letter to minimize the noise. I'll put everyone on CC who was either involved with the COW issues in the past or attended the linux-mm alignment session on Wednesday. Apologies if I forget anyone :)
[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
David Hildenbrand (11):
  seqlock: provide lockdep-free raw_seqcount_t variant
  mm: thp: consolidate mapcount logic on THP split
  mm: simplify hugetlb and file-THP handling in __page_mapcount()
  mm: thp: simplify total_mapcount()
  mm: thp: allow for reading the THP mapcount atomically via a raw_seqlock_t
  mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
  mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (!hugetlb)
  mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE
  mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (hugetlb)
  mm: thp: introduce and use page_trans_huge_anon_shared()
  selftests/vm: add tests for the known COW security issues
 Documentation/locking/seqlock.rst         |  50 ++++
 include/linux/huge_mm.h                   |  72 +++++
 include/linux/mm.h                        |  14 +
 include/linux/mm_types.h                  |   9 +
 include/linux/seqlock.h                   | 145 +++++++---
 mm/gup.c                                  |  89 +++++-
 mm/huge_memory.c                          | 120 +++++++--
 mm/hugetlb.c                              | 129 +++++++--
 mm/memory.c                               | 136 ++++++++--
 mm/rmap.c                                 |  40 +--
 mm/swapfile.c                             |  35 ++-
 mm/util.c                                 |  24 +-
 tools/testing/selftests/vm/Makefile       |   1 +
 tools/testing/selftests/vm/gup_cow.c      | 312 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh |  16 ++
 15 files changed, 1044 insertions(+), 148 deletions(-)
 create mode 100644 tools/testing/selftests/vm/gup_cow.c
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ingo Molnar mingo@redhat.com
Cc: Will Deacon will@kernel.org
Cc: Waiman Long longman@redhat.com
Cc: Boqun Feng boqun.feng@gmail.com
Cc: Jonathan Corbet corbet@lwn.net
Acked-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 Documentation/locking/seqlock.rst |  50 +++++++++++
 include/linux/seqlock.h           | 145 +++++++++++++++++++++++-------
 2 files changed, 162 insertions(+), 33 deletions(-)
diff --git a/Documentation/locking/seqlock.rst b/Documentation/locking/seqlock.rst index 64405e5da63e..6f66ae29cc07 100644 --- a/Documentation/locking/seqlock.rst +++ b/Documentation/locking/seqlock.rst @@ -87,6 +87,56 @@ Read path:: } while (read_seqcount_retry(&foo_seqcount, seq));
+Raw sequence counters (``raw_seqcount_t``) +========================================== + +This is the raw counting mechanism, which does not protect against multiple +writers and does not perform any lockdep tracking. Write side critical sections +must thus be serialized by an external lock. + +It is primary useful when a fixed, minimal sequence counter size is +required and the lockdep overhead cannot be tolerated or is unused. +Prefer using a :ref:`seqcount_t`, a :ref:`seqlock_t` or a +:ref:`seqcount_locktype_t` if possible. + +The raw sequence counter is very similar to the :ref:`seqcount_t`, however, +it can only be used with functions that don't perform any implicit lockdep +tracking: primarily the *raw* function variants. + +Initialization:: + + /* dynamic */ + raw_seqcount_t foo_seqcount; + raw_seqcount_init(&foo_seqcount); + + /* static */ + static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount); + + /* C99 struct init */ + struct { + .seq = RAW_SEQCNT_ZERO(foo.seq), + } foo; + +Write path:: + + /* Serialized context with disabled preemption */ + + raw_write_seqcount_begin(&foo_seqcount); + + /* ... [[write-side critical section]] ... */ + + raw_write_seqcount_end(&foo_seqcount); + +Read path:: + + do { + seq = raw_read_seqcount_begin(&foo_seqcount); + + /* ... [[read-side critical section]] ... */ + + } while (raw_read_seqcount_retry(&foo_seqcount, seq)); + + .. _seqcount_locktype_t:
Sequence counters with associated locks (``seqcount_LOCKNAME_t``) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 37ded6b8fee6..c61fba1f9893 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -60,15 +60,27 @@ * serialization and non-preemptibility requirements, use a sequential * lock (seqlock_t) instead. * + * If it's undesired to have lockdep, especially when a fixed, minimal, + * structure size is required, use raw_seqcount_t along with the raw + * function variants. + * * See Documentation/locking/seqlock.rst */ + +typedef unsigned int raw_seqcount_t; + typedef struct seqcount { - unsigned sequence; + raw_seqcount_t sequence; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif } seqcount_t;
+static inline void __raw_seqcount_init(raw_seqcount_t *s) +{ + *s = 0; +} + static inline void __seqcount_init(seqcount_t *s, const char *name, struct lock_class_key *key) { @@ -76,9 +88,15 @@ static inline void __seqcount_init(seqcount_t *s, const char *name, * Make sure we are not reinitializing a held lock: */ lockdep_init_map(&s->dep_map, name, key, 0); - s->sequence = 0; + __raw_seqcount_init(&s->sequence); }
+/** + * raw_seqcount_init() - runtime initializer for raw_seqcount_t + * @s: Pointer to the raw_seqcount_t instance + */ +# define raw_seqcount_init(s) __raw_seqcount_init(s) + #ifdef CONFIG_DEBUG_LOCK_ALLOC
# define SEQCOUNT_DEP_MAP_INIT(lockname) \ @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s) # define seqcount_lockdep_reader_access(x) #endif
+/** + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t + */ +#define RAW_SEQCNT_ZERO() 0 + /** * SEQCNT_ZERO() - static initializer for seqcount_t * @name: Name of the seqcount_t instance */ -#define SEQCNT_ZERO(name) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(name) } +#define SEQCNT_ZERO(name) { .sequence = RAW_SEQCNT_ZERO(), SEQCOUNT_DEP_MAP_INIT(name) }
/* * Sequence counters with associated locks (seqcount_LOCKNAME_t) @@ -203,6 +226,12 @@ typedef struct seqcount_##lockname { \ __SEQ_LOCK(locktype *lock); \ } seqcount_##lockname##_t; \ \ +static __always_inline raw_seqcount_t * \ +__seqprop_##lockname##_raw_ptr(seqcount_##lockname##_t *s) \ +{ \ + return &s->seqcount.sequence; \ +} \ + \ static __always_inline seqcount_t * \ __seqprop_##lockname##_ptr(seqcount_##lockname##_t *s) \ { \ @@ -247,10 +276,45 @@ __seqprop_##lockname##_assert(const seqcount_##lockname##_t *s) \ __SEQ_LOCK(lockdep_assert_held(lockmember)); \ }
+/* + * __raw_seqprop() for raw_seqcount_t + */ + +static inline raw_seqcount_t *__raw_seqprop_raw_ptr(raw_seqcount_t *s) +{ + return s; +} + +static inline seqcount_t *__raw_seqprop_ptr(raw_seqcount_t *s) +{ + BUILD_BUG(); + return NULL; +} + +static inline unsigned int __raw_seqprop_sequence(const raw_seqcount_t *s) +{ + return READ_ONCE(*s); +} + +static inline bool __raw_seqprop_preemptible(const raw_seqcount_t *s) +{ + return false; +} + +static inline void __raw_seqprop_assert(const raw_seqcount_t *s) +{ + lockdep_assert_preemption_disabled(); +} + /* * __seqprop() for seqcount_t */
+static inline raw_seqcount_t *__seqprop_raw_ptr(seqcount_t *s) +{ + return &s->sequence; +} + static inline seqcount_t *__seqprop_ptr(seqcount_t *s) { return s; @@ -300,6 +364,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu seqcount_##lockname##_t: __seqprop_##lockname##_##prop((void *)(s))
#define __seqprop(s, prop) _Generic(*(s), \ + raw_seqcount_t: __raw_seqprop_##prop((void *)(s)), \ seqcount_t: __seqprop_##prop((void *)(s)), \ __seqprop_case((s), raw_spinlock, prop), \ __seqprop_case((s), spinlock, prop), \ @@ -307,6 +372,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu __seqprop_case((s), mutex, prop), \ __seqprop_case((s), ww_mutex, prop))
+#define seqprop_raw_ptr(s) __seqprop(s, raw_ptr) #define seqprop_ptr(s) __seqprop(s, ptr) #define seqprop_sequence(s) __seqprop(s, sequence) #define seqprop_preemptible(s) __seqprop(s, preemptible) @@ -314,7 +380,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
/** * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb() * barrier. Callers should ensure that smp_rmb() or equivalent ordering is @@ -339,7 +406,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
/** * raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the + * seqcount_LOCKNAME_t variants * * Return: count to be passed to read_seqcount_retry() */ @@ -365,7 +433,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
/** * raw_read_seqcount() - read the raw seqcount_t counter value - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * raw_read_seqcount opens a read critical section of the given * seqcount_t, without any lockdep checking, and without checking or @@ -386,7 +455,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu /** * raw_seqcount_begin() - begin a seqcount_t read critical section w/o * lockdep and w/o counter stabilization - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t, or any of the seqcount_LOCKNAME_t + * variants * * raw_seqcount_begin opens a read critical section of the given * seqcount_t. Unlike read_seqcount_begin(), this function will not wait @@ -411,7 +481,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
/** * __read_seqcount_retry() - end a seqcount_t read section w/o barrier - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * @start: count, from read_seqcount_begin() * * __read_seqcount_retry is like read_seqcount_retry, but has no smp_rmb() @@ -425,17 +496,19 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - do___read_seqcount_retry(seqprop_ptr(s), start) + do___read_seqcount_retry(seqprop_raw_ptr(s), start)
-static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start) +static inline int do___read_seqcount_retry(const raw_seqcount_t *s, + unsigned int start) { kcsan_atomic_next(0); - return unlikely(READ_ONCE(s->sequence) != start); + return unlikely(READ_ONCE(*s) != start); }
/** * read_seqcount_retry() - end a seqcount_t read critical section - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * @start: count, from read_seqcount_begin() * * read_seqcount_retry closes the read critical section of given @@ -445,9 +518,11 @@ static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - do_read_seqcount_retry(seqprop_ptr(s), start) + do_read_seqcount_retry(seqprop_raw_ptr(s), start) +#define raw_read_seqcount_retry(s, start) read_seqcount_retry(s, start)
-static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start) +static inline int do_read_seqcount_retry(const raw_seqcount_t *s, + unsigned int start) { smp_rmb(); return do___read_seqcount_retry(s, start); @@ -455,7 +530,8 @@ static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start)
/** * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * Context: check write_seqcount_begin() */ @@ -464,34 +540,35 @@ do { \ if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - do_raw_write_seqcount_begin(seqprop_ptr(s)); \ + do_raw_write_seqcount_begin(seqprop_raw_ptr(s)); \ } while (0)
-static inline void do_raw_write_seqcount_begin(seqcount_t *s) +static inline void do_raw_write_seqcount_begin(raw_seqcount_t *s) { kcsan_nestable_atomic_begin(); - s->sequence++; + (*s)++; smp_wmb(); }
/** * raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * Context: check write_seqcount_end() */ #define raw_write_seqcount_end(s) \ do { \ - do_raw_write_seqcount_end(seqprop_ptr(s)); \ + do_raw_write_seqcount_end(seqprop_raw_ptr(s)); \ \ if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0)
-static inline void do_raw_write_seqcount_end(seqcount_t *s) +static inline void do_raw_write_seqcount_end(raw_seqcount_t *s) { smp_wmb(); - s->sequence++; + (*s)++; kcsan_nestable_atomic_end(); }
@@ -516,7 +593,7 @@ do { \
static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass) { - do_raw_write_seqcount_begin(s); + do_raw_write_seqcount_begin(&s->sequence); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); }
@@ -563,12 +640,13 @@ do { \ static inline void do_write_seqcount_end(seqcount_t *s) { seqcount_release(&s->dep_map, _RET_IP_); - do_raw_write_seqcount_end(s); + do_raw_write_seqcount_end(&s->sequence); }
/** * raw_write_seqcount_barrier() - do a seqcount_t write barrier - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * This can be used to provide an ordering guarantee instead of the usual * consistency guarantee. It is one wmb cheaper, because it can collapse @@ -608,33 +686,34 @@ static inline void do_write_seqcount_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - do_raw_write_seqcount_barrier(seqprop_ptr(s)) + do_raw_write_seqcount_barrier(seqprop_raw_ptr(s))
-static inline void do_raw_write_seqcount_barrier(seqcount_t *s) +static inline void do_raw_write_seqcount_barrier(raw_seqcount_t *s) { kcsan_nestable_atomic_begin(); - s->sequence++; + (*s)++; smp_wmb(); - s->sequence++; + (*s)++; kcsan_nestable_atomic_end(); }
/** * write_seqcount_invalidate() - invalidate in-progress seqcount_t read * side operations - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * After write_seqcount_invalidate, no seqcount_t read side operations * will complete successfully and see data older than this. */ #define write_seqcount_invalidate(s) \ - do_write_seqcount_invalidate(seqprop_ptr(s)) + do_write_seqcount_invalidate(seqprop_raw_ptr(s))
-static inline void do_write_seqcount_invalidate(seqcount_t *s) +static inline void do_write_seqcount_invalidate(raw_seqcount_t *s) { smp_wmb(); kcsan_nestable_atomic_begin(); - s->sequence+=2; + (*s) += 2; kcsan_nestable_atomic_end(); }
On Dec 17, 2021, at 3:30 AM, David Hildenbrand david@redhat.com wrote:
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Is it possible to force some policy when raw_seqcount_t is used to prevent its abuse? For instance not to allow to acquire other (certain?) locks when it is held?
[ snip ]
+/**
+ * raw_seqcount_init() - runtime initializer for raw_seqcount_t
+ * @s: Pointer to the raw_seqcount_t instance
+ */
+# define raw_seqcount_init(s) __raw_seqcount_init(s)
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC

 # define SEQCOUNT_DEP_MAP_INIT(lockname) \
@@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
 # define seqcount_lockdep_reader_access(x)
 #endif

+/**
+ * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
+ */
+#define RAW_SEQCNT_ZERO() 0
I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.
Moreover, the documentation showed:
+	/* static */
+	static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount);
+
But RAW_SEQCNT_ZERO does not have an argument?
On 17.12.21 18:02, Nadav Amit wrote:
On Dec 17, 2021, at 3:30 AM, David Hildenbrand david@redhat.com wrote:
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Is it possible to force some policy when raw_seqcount_t is used to prevent its abuse? For instance not to allow to acquire other (certain?) locks when it is held?
Good question ... in this series we won't be taking additional locks on the reader or the writer side. Something like lockdep_forbid() / lockdep_allow() to disallow any kind of locking. I haven't heard of anything like that, maybe someone reading along has a clue?
The writer side might be easy to handle, but some seqcount operations that don't do the full read()->retry() cycle are problematic (->raw_read_seqcount).
[ snip ]
+/**
+ * raw_seqcount_init() - runtime initializer for raw_seqcount_t
+ * @s: Pointer to the raw_seqcount_t instance
+ */
+# define raw_seqcount_init(s) __raw_seqcount_init(s)
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC

 # define SEQCOUNT_DEP_MAP_INIT(lockname) \
@@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
 # define seqcount_lockdep_reader_access(x)
 #endif

+/**
+ * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
+ */
+#define RAW_SEQCNT_ZERO() 0
I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.
I think I just went for consistency with SEQCNT_ZERO() -- but I agree, that can just be simplified!
Thanks!
On 17.12.21 18:29, David Hildenbrand wrote:
On 17.12.21 18:02, Nadav Amit wrote:
On Dec 17, 2021, at 3:30 AM, David Hildenbrand david@redhat.com wrote:
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Is it possible to force some policy when raw_seqcount_t is used to prevent its abuse? For instance not to allow to acquire other (certain?) locks when it is held?
Good question ... in this series we won't be taking additional locks on the reader or the writer side. Something like lockdep_forbid() / lockdep_allow() to disallow any kind of locking. I haven't heard of anything like that, maybe someone reading along has a clue?
The writer side might be easy to handle, but some seqcount operations that don't do the full read()->retry() cycle are problematic (->raw_read_seqcount).
Sorry, I forgot to mention an important point: the raw_seqcount_t doesn't give you any additional "power" to abuse.
You can just use the ordinary seqcount_t with the raw_ functions. One example is mm->write_protect_seq . So whatever we would want to "invent" should also apply to the raw_ functions in general -- which might be undesired or impossible (IIRC IRQ context).
On Dec 17, 2021, at 9:49 AM, David Hildenbrand david@redhat.com wrote:
On 17.12.21 18:29, David Hildenbrand wrote:
On 17.12.21 18:02, Nadav Amit wrote:
On Dec 17, 2021, at 3:30 AM, David Hildenbrand david@redhat.com wrote:
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Is it possible to force some policy when raw_seqcount_t is used to prevent its abuse? For instance not to allow to acquire other (certain?) locks when it is held?
Good question ... in this series we won't be taking additional locks on the reader or the writer side. Something like lockdep_forbid() / lockdep_allow() to disallow any kind of locking. I haven't heard of anything like that, maybe someone reading along has a clue?
The writer side might be easy to handle, but some seqcount operations that don't do the full read()->retry() cycle are problematic (->raw_read_seqcount).
Sorry, I forgot to mention an important point: the raw_seqcount_t doesn't give you any additional "power" to abuse.
You can just use the ordinary seqcount_t with the raw_ functions. One example is mm->write_protect_seq . So whatever we would want to "invent" should also apply to the raw_ functions in general -- which might be undesired or impossible (IIRC IRQ context).
Thanks for the clarification. I was unfamiliar with raw_read_seqcount_begin() (and friends). Indeed it is very very rarely used.
On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Cannot tolerate? Could you please provide a reason and not just a statement?
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Even when the reader side uses raw_seqcount_begin/retry() the writer side still can use the non-raw variant which validates that the associated lock is held on write.
Aside of that your proposed extra raw sequence count needs extra care vs. PREEMPT_RT and this wants to be very clearly documented. Why?
The lock association has two purposes:
1) Lockdep coverage which unearthed bugs already
2) PREEMPT_RT livelock prevention
Assume the following:
  spin_lock(wrlock);
  write_seqcount_begin(seq);

  ---> preemption by a high priority reader

       seqcount_begin(seq); <-- live lock

The RT substitution does:

  seqcount_begin(seq)
    cnt = READ_ONCE(seq->sequence);

    if (cnt & 1) {
      lock(s->lock);
      unlock(s->lock);
    }
which prevents the deadlock because it makes the reader block on the associated lock, which allows the preempted writer to make progress.
This applies to raw_seqcount_begin() as well.
I have no objections against the construct itself, but this has to be properly documented vs. the restriction this imposes.
As you can see above the writer side therefore has to ensure that it cannot be preempted on PREEMPT_RT, which limits the possibilities of what you can do inside a preemption (or interrupt) disabled section on RT enabled kernels. See Documentation/locking/locktypes.rst for further information.
Thanks,
tglx
On 17.12.21 22:28, Thomas Gleixner wrote:
On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused.
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock.
The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change.
Hi Thomas,
thanks for your feedback!
Cannot tolerate? Could you please provide a reason and not just a statement?
Absolutely.
"struct page" is supposed to have a minimal size with a fixed layout. Embedding something inside such a structure can change the fixed layout in a way that it can just completely breaks any assumptions on location of values.
Therefore, embedding a complex structure in it is usually avoided -- and if we have to (spin_lock), we work around sudden size increases.
There are ways around it: allocate the lock and only store the pointer in the struct page. But that most certainly adds complexity, which is why I want to avoid it for now.
I'll extend that answer and add it to the patch description.
Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code.
Let's add a short documentation as well.
Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants.
Even when the reader side uses raw_seqcount_begin/retry() the writer side still can use the non-raw variant which validates that the associated lock is held on write.
Yes, that's my understanding as well.
Aside of that your proposed extra raw sequence count needs extra care vs. PREEMPT_RT and this wants to be very clearly documented. Why?
The lock association has two purposes:
1) Lockdep coverage which unearthed bugs already
Yes, that's a real shame to lose.
2) PREEMPT_RT livelock prevention

Assume the following:

  spin_lock(wrlock);
  write_seqcount_begin(seq);

  ---> preemption by a high priority reader

       seqcount_begin(seq); <-- live lock

The RT substitution does:

  seqcount_begin(seq)
    cnt = READ_ONCE(seq->sequence);

    if (cnt & 1) {
      lock(s->lock);
      unlock(s->lock);
    }

which prevents the deadlock because it makes the reader block on the associated lock, which allows the preempted writer to make progress.

This applies to raw_seqcount_begin() as well.
I have no objections against the construct itself, but this has to be properly documented vs. the restriction this imposes.
Absolutely, any input is highly appreciated.
But to mention it again: whatever you can do with raw_seqcount_t, you can do with seqcount_t, and there are already users relying completely on the raw_ function variants (see my other reply).
So the documentation should most probably be extended to cover the raw_ functions and seqcount_t in general.
As you can see above the writer side therefore has to ensure that it cannot be preempted on PREEMPT_RT, which limits the possibilities of what you can do inside a preemption (or interrupt) disabled section on RT enabled kernels. See Documentation/locking/locktypes.rst for further information.
It's going to be used for THP, which is currently incompatible with PREEMPT_RT (disabled in the Kconfig). But preemption is also disabled anyway, because we're using bit_spin_lock(), which disables preemption.
Certainly worth documenting!
Thanks for your input!
Let's consolidate the mapcount logic to make it easier to understand and to prepare for further changes.
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/huge_memory.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e5483347291c..4751d03947da 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2101,21 +2101,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pte = pte_offset_map(&_pmd, addr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, addr, pte, entry); - if (!pmd_migration) - atomic_inc(&page[i]._mapcount); pte_unmap(pte); }
if (!pmd_migration) { + /* Sub-page mapcount accounting for above small mappings. */ + int val = 1; + /* * Set PG_double_map before dropping compound_mapcount to avoid * false-negative page_mapped(). + * + * The first to set PageDoubleMap() has to increment all + * sub-page mapcounts by one. */ - if (compound_mapcount(page) > 1 && - !TestSetPageDoubleMap(page)) { - for (i = 0; i < HPAGE_PMD_NR; i++) - atomic_inc(&page[i]._mapcount); - } + if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) + val++; + + for (i = 0; i < HPAGE_PMD_NR; i++) + atomic_add(val, &page[i]._mapcount);
lock_page_memcg(page); if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand david@redhat.com wrote:
Let's consolidate the mapcount logic to make it easier to understand and to prepare for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Reviewed-by: Yang Shi shy828301@gmail.com
mm/huge_memory.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e5483347291c..4751d03947da 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2101,21 +2101,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pte = pte_offset_map(&_pmd, addr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, addr, pte, entry);
if (!pmd_migration)
atomic_inc(&page[i]._mapcount); pte_unmap(pte); } if (!pmd_migration) {
/* Sub-page mapcount accounting for above small mappings. */
int val = 1;
/* * Set PG_double_map before dropping compound_mapcount to avoid * false-negative page_mapped().
*
* The first to set PageDoubleMap() has to increment all
* sub-page mapcounts by one. */
if (compound_mapcount(page) > 1 &&
!TestSetPageDoubleMap(page)) {
for (i = 0; i < HPAGE_PMD_NR; i++)
atomic_inc(&page[i]._mapcount);
}
if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page))
val++;
for (i = 0; i < HPAGE_PMD_NR; i++)
atomic_add(val, &page[i]._mapcount); lock_page_memcg(page); if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
-- 2.31.1
On Fri, Dec 17, 2021 at 12:30:40PM +0100, David Hildenbrand wrote:
Let's consolidate the mapcount logic to make it easier to understand and to prepare for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/util.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..3239e75c148d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page) { int ret;
- ret = atomic_read(&page->_mapcount) + 1; + if (PageHuge(page)) + return compound_mapcount(page); /* * For file THP page->_mapcount contains total number of mapping * of the page: no need to look into compound_mapcount. */ - if (!PageAnon(page) && !PageHuge(page)) - return ret; + if (!PageAnon(page)) + return atomic_read(&page->_mapcount) + 1; + + ret = atomic_read(&page->_mapcount) + 1; page = compound_head(page); - ret += atomic_read(compound_mapcount_ptr(page)) + 1; + ret += head_compound_mapcount(page); if (PageDoubleMap(page)) ret--; return ret;
On Dec 17, 2021, at 3:30 AM, David Hildenbrand david@redhat.com wrote:
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
It would be useful to add “no functional change intended” or something.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
mm/util.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..3239e75c148d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page) { int ret;
- ret = atomic_read(&page->_mapcount) + 1;
- if (PageHuge(page))
return compound_mapcount(page);
Before you return, perhaps you can add an assertion like:
VM_BUG_ON(PageDoubleMap(page));
This would make the code clearer and would ease debugging in the future (if support for double-map is expanded).
On 17.12.21 18:16, Nadav Amit wrote:
On Dec 17, 2021, at 3:30 AM, David Hildenbrand david@redhat.com wrote:
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
It would be useful to add “no functional change intended” or something.
Absolutely, same applies to other "simplification" patches.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
mm/util.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..3239e75c148d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page) { int ret;
- ret = atomic_read(&page->_mapcount) + 1;
- if (PageHuge(page))
return compound_mapcount(page);
Before you return, perhaps you can add an assertion like:
VM_BUG_ON(PageDoubleMap(page));
This would make the code clearer and would ease debugging in the future (if support for double-map is expanded).
I'd probably have to add this to a couple of places -- and I assume anybody working on that has to grep the kernel for use of PageDoubleMap already.
Thanks!
On 12/17/21 03:30, David Hildenbrand wrote:
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance
It is too early to say if hugetlb double mapping will use PageDoubleMap(). I do not think (hope) it will be necessary. So, I think you can drop mention of it here.
to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Acked-by: Mike Kravetz mike.kravetz@oracle.com
On 17.12.21 19:06, Mike Kravetz wrote:
On 12/17/21 03:30, David Hildenbrand wrote:
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance
It is too early to say if hugetlb double mapping will use PageDoubleMap(). I do not think (hope) it will be necessary. So, I think you can drop mention of it here.
Desires have most certainly been expressed by a couple of parties -- to PTE-map huge pages :) Hopefully we'll find a way to avoid PageDoubleMap, I agree.
Dropping the comment!
to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Acked-by: Mike Kravetz mike.kravetz@oracle.com
Thanks!
On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand david@redhat.com wrote:
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Reviewed-by: Yang Shi shy828301@gmail.com
mm/util.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..3239e75c148d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page) { int ret;
ret = atomic_read(&page->_mapcount) + 1;
if (PageHuge(page))
return compound_mapcount(page); /* * For file THP page->_mapcount contains total number of mapping * of the page: no need to look into compound_mapcount. */
if (!PageAnon(page) && !PageHuge(page))
return ret;
if (!PageAnon(page))
return atomic_read(&page->_mapcount) + 1;
ret = atomic_read(&page->_mapcount) + 1; page = compound_head(page);
ret += atomic_read(compound_mapcount_ptr(page)) + 1;
ret += head_compound_mapcount(page); if (PageDoubleMap(page)) ret--; return ret;
-- 2.31.1
On Fri, Dec 17, 2021 at 12:30:41PM +0100, David Hildenbrand wrote:
Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
mm/util.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..3239e75c148d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page) { int ret;
- ret = atomic_read(&page->_mapcount) + 1;
- if (PageHuge(page))
return compound_mapcount(page);
It would be nice to make PageHuge() inlinable first. It's a shame that we need to do a function call for the PageHuge() check.
Otherwise, looks good:
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Let's simplify a bit, returning for PageHuge() early and using head_compound_mapcount() as we are only getting called for HEAD pages.
Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of total_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/huge_memory.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4751d03947da..826cabcad11a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2506,12 +2506,11 @@ int total_mapcount(struct page *page)
if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) + 1; + if (PageHuge(page)) + return head_compound_mapcount(page);
- compound = compound_mapcount(page); nr = compound_nr(page); - if (PageHuge(page)) - return compound; - ret = compound; + ret = compound = head_compound_mapcount(page); for (i = 0; i < nr; i++) ret += atomic_read(&page[i]._mapcount) + 1; /* File pages has compound_mapcount included in _mapcount */
On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand david@redhat.com wrote:
Let's simplify a bit, returning for PageHuge() early and using head_compound_mapcount() as we are only getting called for HEAD pages.
Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of total_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Reviewed-by: Yang Shi shy828301@gmail.com
mm/huge_memory.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4751d03947da..826cabcad11a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2506,12 +2506,11 @@ int total_mapcount(struct page *page)
if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) + 1;
if (PageHuge(page))
return head_compound_mapcount(page);
compound = compound_mapcount(page); nr = compound_nr(page);
if (PageHuge(page))
return compound;
ret = compound;
ret = compound = head_compound_mapcount(page); for (i = 0; i < nr; i++) ret += atomic_read(&page[i]._mapcount) + 1; /* File pages has compound_mapcount included in _mapcount */
-- 2.31.1
On Fri, Dec 17, 2021 at 12:30:42PM +0100, David Hildenbrand wrote:
Let's simplify a bit, returning for PageHuge() early and using head_compound_mapcount() as we are only getting called for HEAD pages.
Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of total_mapcount().
This is a preparation for further changes.
Reviewed-by: Peter Xu peterx@redhat.com Signed-off-by: David Hildenbrand david@redhat.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Currently, we are not able to read the mapcount of a THP atomically without expensive locking, for example, if the THP is getting split concurrently.
Also, we don't want mapcount readers to observe jitter on concurrent GUP and unmapping, like:

  2 -> 1 -> 2 -> 1

Instead, we want the mapcount of a THP to move in one direction only.
The main challenge to avoid such jitter is PageDoubleMap. If the compound_mapcount and the tail mapcounts move in the same direction, there is no problem. However when the compound_mapcount is decreased and reaches zero, the reader will see initially a decrease in the THP mapcount that will then be followed by the PageDoubleMap being cleared and the mapcount getting increased again. The act of clearing PageDoubleMap will lead readers to overestimate the mapcount until all tail mapcounts (that the PageDoubleMap flag kept artificially elevated) are finally released.
Introduce a raw_seqcount_t in the THP subpage at index 1 to allow reading the THP mapcount atomically without grabbing the page lock, avoiding racing with THP splitting or PageDoubleMap processing. For now, we only require the seqcount for anonymous THP.
We use a PG_locked-based bit spinlock to synchronize the writer side. Note that the PG_locked bit used is the one of the THP subpage at index 1, which is unused so far.
To make especially page_mapcount() safe to be called from IRQ context, as required by GUP via get_user_pages_fast_only() in the context of GUP-triggered unsharing of shared anonymous pages soon, make sure the reader side cannot deadlock if the writer side would be interrupted: disable local interrupts on the writer side. Note that they are already disabled during lock_page_memcg() in some configurations.
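Concretely, the writer side implemented by the thp_mapcount_lock() / thp_mapcount_unlock() helpers below boils down to:

    local_irq_save(irq_flags);                  /* don't deadlock against readers in IRQ context */
    bit_spin_lock(PG_locked, &page[1].flags);   /* serialize concurrent writers */
    raw_write_seqcount_begin(&page[1].mapcount_seqcount);

    /* ... update compound mapcount / tail mapcounts / PageDoubleMap ... */

    raw_write_seqcount_end(&page[1].mapcount_seqcount);
    bit_spin_unlock(PG_locked, &page[1].flags);
    local_irq_restore(irq_flags);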
Fortunately, we do have as of now (mm/Kconfig):

  config TRANSPARENT_HUGEPAGE
	bool "Transparent Hugepage Support"
	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT

so the disabling of interrupts in our case in particular has no effect on PREEMPT_RT, which is good.
We don't need this type of locking on the THP freeing path: Once the compound_mapcount of an anonymous THP drops to 0, it won't suddenly increase again, so PageDoubleMap cannot be cleared concurrently and consequently the seqlock only needs to be taken if the PageDoubleMap flag is found set.
Note: In the future, we could avoid disabling local interrupts on the writer side by providing alternative functions that can be called from IRQ context without deadlocking: These functions must not spin but instead have to signal that locking failed. OR maybe we'll find a way to just simplify that whole mapcount handling logic for anonymous THP, but for now none has been identified. Let's keep it simple for now.
This commit is based on prototype patches by Andrea.
Reported-by: Sergey Senozhatsky sergey.senozhatsky@gmail.com
Reported-by: Hugh Dickins hughd@google.com
Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
Co-developed-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Andrea Arcangeli aarcange@redhat.com
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 include/linux/huge_mm.h  | 65 ++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |  9 ++++++
 mm/huge_memory.c         | 56 +++++++++++++++++++++++-----------
 mm/rmap.c                | 40 +++++++++++++++----------
 mm/swapfile.c            | 35 +++++++++++++---------
 mm/util.c                | 17 +++++++----
 6 files changed, 170 insertions(+), 52 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f280f33ff223..44e02d47c65a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -318,6 +318,49 @@ static inline struct list_head *page_deferred_list(struct page *page) return &page[2].deferred_list; }
+static inline void thp_mapcount_seqcount_init(struct page *page) +{ + raw_seqcount_init(&page[1].mapcount_seqcount); +} + +static inline unsigned int thp_mapcount_read_begin(struct page *page) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + return raw_read_seqcount_begin(&page[1].mapcount_seqcount); +} + +static inline bool thp_mapcount_read_retry(struct page *page, + unsigned int seqcount) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + if (!raw_read_seqcount_retry(&page[1].mapcount_seqcount, seqcount)) + return false; + cpu_relax(); + return true; +} + +static inline void thp_mapcount_lock(struct page *page, + unsigned long *irq_flags) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + /* + * Prevent deadlocks in thp_mapcount_read_begin() if it is called in IRQ + * context. + */ + local_irq_save(*irq_flags); + bit_spin_lock(PG_locked, &page[1].flags); + raw_write_seqcount_begin(&page[1].mapcount_seqcount); +} + +static inline void thp_mapcount_unlock(struct page *page, + unsigned long irq_flags) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + raw_write_seqcount_end(&page[1].mapcount_seqcount); + bit_spin_unlock(PG_locked, &page[1].flags); + local_irq_restore(irq_flags); +} + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -467,6 +510,28 @@ static inline bool thp_migration_supported(void) { return false; } + +static inline unsigned int thp_mapcount_read_begin(struct page *page) +{ + return 0; +} + +static inline bool thp_mapcount_read_retry(struct page *page, + unsigned int seqcount) +{ + return false; +} + +static inline void thp_mapcount_lock(struct page *page, + unsigned long *irq_flags) +{ +} + +static inline void thp_mapcount_unlock(struct page *page, + unsigned long irq_flags) +{ +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/** diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c3a6e6209600..a85a2a75d4ff 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -151,6 +151,15 @@ struct page { unsigned char compound_order; atomic_t compound_mapcount; unsigned int compound_nr; /* 1 << compound_order */ + /* + * THP only: allow for atomic reading of the mapcount, + * for example when we might be racing with a concurrent + * THP split. Initialized for all THP but locking is + * so far only required for anon THP where such races + * apply. Write access is serialized via the + * PG_locked-based spinlock in the first tail page. + */ + raw_seqcount_t mapcount_seqcount; }; struct { /* Second tail page of compound page */ unsigned long _compound_pad_1; /* compound_head */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 826cabcad11a..1685821525e8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -527,6 +527,7 @@ void prep_transhuge_page(struct page *page)
INIT_LIST_HEAD(page_deferred_list(page)); set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); + thp_mapcount_seqcount_init(page); }
bool is_transparent_hugepage(struct page *page) @@ -1959,11 +1960,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, unsigned long haddr, bool freeze) { struct mm_struct *mm = vma->vm_mm; + unsigned long addr, irq_flags; struct page *page; pgtable_t pgtable; pmd_t old_pmd, _pmd; bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; - unsigned long addr; int i;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK); @@ -2108,6 +2109,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, /* Sub-page mapcount accounting for above small mappings. */ int val = 1;
+ /* + * lock_page_memcg() is taken before thp_mapcount_lock() in + * page_remove_anon_compound_rmap(), respect the same locking + * order. + */ + lock_page_memcg(page); + thp_mapcount_lock(page, &irq_flags); /* * Set PG_double_map before dropping compound_mapcount to avoid * false-negative page_mapped(). @@ -2121,7 +2129,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, for (i = 0; i < HPAGE_PMD_NR; i++) atomic_add(val, &page[i]._mapcount);
- lock_page_memcg(page); if (atomic_add_negative(-1, compound_mapcount_ptr(page))) { /* Last compound_mapcount is gone. */ __mod_lruvec_page_state(page, NR_ANON_THPS, @@ -2132,6 +2139,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, atomic_dec(&page[i]._mapcount); } } + thp_mapcount_unlock(page, irq_flags); unlock_page_memcg(page); }
@@ -2501,6 +2509,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, int total_mapcount(struct page *page) { int i, compound, nr, ret; + unsigned int seqcount; + bool double_map;
VM_BUG_ON_PAGE(PageTail(page), page);
@@ -2510,13 +2520,19 @@ int total_mapcount(struct page *page) return head_compound_mapcount(page);
nr = compound_nr(page); - ret = compound = head_compound_mapcount(page); - for (i = 0; i < nr; i++) - ret += atomic_read(&page[i]._mapcount) + 1; + + do { + seqcount = thp_mapcount_read_begin(page); + ret = compound = head_compound_mapcount(page); + for (i = 0; i < nr; i++) + ret += atomic_read(&page[i]._mapcount) + 1; + double_map = PageDoubleMap(page); + } while (thp_mapcount_read_retry(page, seqcount)); + /* File pages has compound_mapcount included in _mapcount */ if (!PageAnon(page)) return ret - compound * nr; - if (PageDoubleMap(page)) + if (double_map) ret -= nr; return ret; } @@ -2548,6 +2564,7 @@ int total_mapcount(struct page *page) int page_trans_huge_mapcount(struct page *page, int *total_mapcount) { int i, ret, _total_mapcount, mapcount; + unsigned int seqcount;
/* hugetlbfs shouldn't call it */ VM_BUG_ON_PAGE(PageHuge(page), page); @@ -2561,17 +2578,22 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
page = compound_head(page);
- _total_mapcount = ret = 0; - for (i = 0; i < thp_nr_pages(page); i++) { - mapcount = atomic_read(&page[i]._mapcount) + 1; - ret = max(ret, mapcount); - _total_mapcount += mapcount; - } - if (PageDoubleMap(page)) { - ret -= 1; - _total_mapcount -= thp_nr_pages(page); - } - mapcount = compound_mapcount(page); + do { + _total_mapcount = ret = 0; + + seqcount = thp_mapcount_read_begin(page); + for (i = 0; i < thp_nr_pages(page); i++) { + mapcount = atomic_read(&page[i]._mapcount) + 1; + ret = max(ret, mapcount); + _total_mapcount += mapcount; + } + if (PageDoubleMap(page)) { + ret -= 1; + _total_mapcount -= thp_nr_pages(page); + } + mapcount = compound_mapcount(page); + } while (thp_mapcount_read_retry(page, seqcount)); + ret += mapcount; _total_mapcount += mapcount; if (total_mapcount) diff --git a/mm/rmap.c b/mm/rmap.c index 163ac4e6bcee..0218052586e7 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1294,6 +1294,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
static void page_remove_anon_compound_rmap(struct page *page) { + unsigned long irq_flags; int i, nr;
if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) @@ -1308,23 +1309,30 @@ static void page_remove_anon_compound_rmap(struct page *page)
__mod_lruvec_page_state(page, NR_ANON_THPS, -thp_nr_pages(page));
- if (TestClearPageDoubleMap(page)) { - /* - * Subpages can be mapped with PTEs too. Check how many of - * them are still mapped. - */ - for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { - if (atomic_add_negative(-1, &page[i]._mapcount)) - nr++; - } + if (PageDoubleMap(page)) { + thp_mapcount_lock(page, &irq_flags); + if (TestClearPageDoubleMap(page)) { + /* + * Subpages can be mapped with PTEs too. Check how many + * of them are still mapped. + */ + for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { + if (atomic_add_negative(-1, &page[i]._mapcount)) + nr++; + } + thp_mapcount_unlock(page, irq_flags);
- /* - * Queue the page for deferred split if at least one small - * page of the compound page is unmapped, but at least one - * small page is still mapped. - */ - if (nr && nr < thp_nr_pages(page)) - deferred_split_huge_page(page); + /* + * Queue the page for deferred split if at least one + * small page of the compound page is unmapped, but at + * least one small page is still mapped. + */ + if (nr && nr < thp_nr_pages(page)) + deferred_split_huge_page(page); + } else { + thp_mapcount_unlock(page, irq_flags); + nr = thp_nr_pages(page); + } } else { nr = thp_nr_pages(page); } diff --git a/mm/swapfile.c b/mm/swapfile.c index e59e08ef46e1..82aeb927a7ba 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1610,6 +1610,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, struct swap_cluster_info *ci = NULL; unsigned char *map = NULL; int mapcount, swapcount = 0; + unsigned int seqcount;
/* hugetlbfs shouldn't call it */ VM_BUG_ON_PAGE(PageHuge(page), page); @@ -1625,7 +1626,6 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
page = compound_head(page);
- _total_mapcount = _total_swapcount = map_swapcount = 0; if (PageSwapCache(page)) { swp_entry_t entry;
@@ -1638,21 +1638,28 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, } if (map) ci = lock_cluster(si, offset); - for (i = 0; i < HPAGE_PMD_NR; i++) { - mapcount = atomic_read(&page[i]._mapcount) + 1; - _total_mapcount += mapcount; - if (map) { - swapcount = swap_count(map[offset + i]); - _total_swapcount += swapcount; + + do { + _total_mapcount = _total_swapcount = map_swapcount = 0; + + seqcount = thp_mapcount_read_begin(page); + for (i = 0; i < HPAGE_PMD_NR; i++) { + mapcount = atomic_read(&page[i]._mapcount) + 1; + _total_mapcount += mapcount; + if (map) { + swapcount = swap_count(map[offset + i]); + _total_swapcount += swapcount; + } + map_swapcount = max(map_swapcount, mapcount + swapcount); } - map_swapcount = max(map_swapcount, mapcount + swapcount); - } + if (PageDoubleMap(page)) { + map_swapcount -= 1; + _total_mapcount -= HPAGE_PMD_NR; + } + mapcount = compound_mapcount(page); + } while (thp_mapcount_read_retry(page, seqcount)); + unlock_cluster(ci); - if (PageDoubleMap(page)) { - map_swapcount -= 1; - _total_mapcount -= HPAGE_PMD_NR; - } - mapcount = compound_mapcount(page); map_swapcount += mapcount; _total_mapcount += mapcount; if (total_mapcount) diff --git a/mm/util.c b/mm/util.c index 3239e75c148d..f4b81c794da1 100644 --- a/mm/util.c +++ b/mm/util.c @@ -730,6 +730,8 @@ EXPORT_SYMBOL(folio_mapping); /* Slow path of page_mapcount() for compound pages */ int __page_mapcount(struct page *page) { + struct page *head_page; + unsigned int seqcount; int ret;
if (PageHuge(page)) @@ -741,11 +743,16 @@ int __page_mapcount(struct page *page) if (!PageAnon(page)) return atomic_read(&page->_mapcount) + 1;
- ret = atomic_read(&page->_mapcount) + 1; - page = compound_head(page); - ret += head_compound_mapcount(page); - if (PageDoubleMap(page)) - ret--; + /* The mapcount_seqlock is so far only required for anonymous THP. */ + head_page = compound_head(page); + do { + seqcount = thp_mapcount_read_begin(head_page); + ret = atomic_read(&page->_mapcount) + 1; + ret += head_compound_mapcount(head_page); + if (PageDoubleMap(head_page)) + ret--; + } while (thp_mapcount_read_retry(head_page, seqcount)); + return ret; } EXPORT_SYMBOL_GPL(__page_mapcount);
FAULT_FLAG_UNSHARE is a new type of page fault applicable to COW-able anonymous memory (including hugetlb but excluding KSM) and its purpose is to allow for unsharing of shared anonymous pages on selected GUP *read* access, in comparison to the traditional COW on *write* access.
In contrast to a COW, GUP-triggered unsharing will still maintain the write protection. It will be triggered by GUP to properly prevent a child process from finding ways via GUP to observe memory modifications of anonymous memory of the parent process after fork().
Rename the relevant functions to make it clear whether we're dealing with unsharing, COW, or both.
The hugetlb part will be added separately.
This commit is based on prototype patches by Andrea.
Co-developed-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Andrea Arcangeli aarcange@redhat.com
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 include/linux/mm.h |   4 ++
 mm/memory.c        | 136 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 119 insertions(+), 21 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index a7e4a9e7d807..37d1fb2f865e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -436,6 +436,9 @@ extern pgprot_t protection_map[16]; * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare a + * shared anonymous page (-> mapped R/O). Does not apply + * to KSM. * * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify * whether we would allow page faults to retry by specifying these two @@ -467,6 +470,7 @@ enum fault_flag { FAULT_FLAG_REMOTE = 1 << 7, FAULT_FLAG_INSTRUCTION = 1 << 8, FAULT_FLAG_INTERRUPTIBLE = 1 << 9, + FAULT_FLAG_UNSHARE = 1 << 10, };
/* diff --git a/mm/memory.c b/mm/memory.c index 8f1de811a1dc..7253a2ad4320 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2707,8 +2707,9 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range); * read non-atomically. Before making any commitment, on those architectures * or configurations (e.g. i386 with PAE) which might give a mix of unmatched * parts, do_swap_page must check under lock before unmapping the pte and - * proceeding (but do_wp_page is only called after already making such a check; - * and do_anonymous_page can safely check later on). + * proceeding (but do_wp_page_cow/do_wp_page_unshare is only called after + * already making such a check; and do_anonymous_page can safely check later + * on). */ static inline int pte_unmap_same(struct vm_fault *vmf) { @@ -2726,8 +2727,8 @@ static inline int pte_unmap_same(struct vm_fault *vmf) return same; }
-static inline bool cow_user_page(struct page *dst, struct page *src, - struct vm_fault *vmf) +static inline bool __wp_page_copy_user(struct page *dst, struct page *src, + struct vm_fault *vmf) { bool ret; void *kaddr; @@ -2952,7 +2953,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf) }
/* - * Handle the case of a page which we actually need to copy to a new page. + * Handle the case of a page which we actually need to copy to a new page, + * either due to COW or unsharing. * * Called with mmap_lock locked and the old page referenced, but * without the ptl held. @@ -2967,7 +2969,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf) * held to the old page, as well as updating the rmap. * - In any case, unlock the PTL and drop the reference we took to the old page. */ -static vm_fault_t wp_page_copy(struct vm_fault *vmf) +static vm_fault_t wp_page_copy(struct vm_fault *vmf, bool unshare) { struct vm_area_struct *vma = vmf->vma; struct mm_struct *mm = vma->vm_mm; @@ -2991,7 +2993,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (!new_page) goto oom;
- if (!cow_user_page(new_page, old_page, vmf)) { + if (!__wp_page_copy_user(new_page, old_page, vmf)) { /* * COW failed, if the fault was solved by other, * it's fine. If not, userspace would re-fault on @@ -3033,7 +3035,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); entry = mk_pte(new_page, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); - entry = maybe_mkwrite(pte_mkdirty(entry), vma); + if (unlikely(unshare)) { + if (pte_soft_dirty(vmf->orig_pte)) + entry = pte_mksoft_dirty(entry); + if (pte_uffd_wp(vmf->orig_pte)) + entry = pte_mkuffd_wp(entry); + } else { + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + }
/* * Clear the pte entry and flush it first, before updating the @@ -3050,6 +3059,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * mmu page tables (such as kvm shadow page tables), we want the * new page to be mapped directly into the secondary page table. */ + BUG_ON(unshare && pte_write(entry)); set_pte_at_notify(mm, vmf->address, vmf->pte, entry); update_mmu_cache(vma, vmf->address, vmf->pte); if (old_page) { @@ -3109,6 +3119,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) free_swap_cache(old_page); put_page(old_page); } + if (unlikely(unshare)) + return 0; return page_copied ? VM_FAULT_WRITE : 0; oom_free_new: put_page(new_page); @@ -3118,6 +3130,70 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) return VM_FAULT_OOM; }
+static __always_inline vm_fault_t wp_page_cow(struct vm_fault *vmf) +{ + return wp_page_copy(vmf, false); +} + +static __always_inline vm_fault_t wp_page_unshare(struct vm_fault *vmf) +{ + return wp_page_copy(vmf, true); +} + +/* + * This routine handles present pages, when GUP tries to take a read-only + * pin on a shared anonymous page. It's similar to do_wp_page_cow(), except that + * it keeps the pages mapped read-only and doesn't apply to KSM pages. + * + * If a parent process forks a child process, we share anonymous pages between + * both processes with COW semantics. Both processes will map these now shared + * anonymous pages read-only, and any write access triggers unsharing via COW. + * + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not + * set) and then unmaps the target page, we have: + * + * * page has mapcount == 1 and refcount > 1 + * * page is mapped read-only into the parent + * * page is pinned by the child and can still be read + * + * For now, we rely on refcount > 1 to perform the COW and trigger unsharing. + * However, that leads to other hard-to fix issues. + * + * GUP-triggered unsharing provides a parallel approach to trigger unsharing + * early, still allowing for relying on mapcount > 1 in COW code instead of on + * imprecise refcount > 1. Note that when we don't actually take a reference + * on the target page but instead use memory notifiers to synchronize to changes + * in the process page tables, unsharing is not required. + * + * Note that in the above scenario, it's impossible to distinguish during the + * write fault between: + * + * a) The parent process performed the pin and the child no longer has access + * to the page. + * + * b) The child process performed the pin and the child still has access to the + * page. + * + * In case of a), if we're dealing with a long-term read-only pin, the COW + * in the parent will result the pinned page differing from the page actually + * mapped into the process page tables in the parent: loss of synchronicity. + * Therefore, we really want to perform the copy when the read-only pin happens. + */ +static vm_fault_t do_wp_page_unshare(struct vm_fault *vmf) + __releases(vmf->ptl) +{ + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte); + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) && + page_mapcount(vmf->page) > 1) { + get_page(vmf->page); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return wp_page_unshare(vmf); + } + vmf->page = NULL; + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; +} + /** * finish_mkwrite_fault - finish page fault for a shared mapping, making PTE * writeable once the page is prepared @@ -3226,7 +3302,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf) * but allow concurrent faults), with pte both mapped and locked. * We return with mmap_lock still held, but pte unmapped and unlocked. */ -static vm_fault_t do_wp_page(struct vm_fault *vmf) +static vm_fault_t do_wp_page_cow(struct vm_fault *vmf) __releases(vmf->ptl) { struct vm_area_struct *vma = vmf->vma; @@ -3258,7 +3334,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) return wp_pfn_shared(vmf);
pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_cow(vmf); }
/* @@ -3296,7 +3372,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_cow(vmf); }
static void unmap_mapping_range_vma(struct vm_area_struct *vma, @@ -3670,7 +3746,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) }
if (vmf->flags & FAULT_FLAG_WRITE) { - ret |= do_wp_page(vmf); + ret |= do_wp_page_cow(vmf); if (ret & VM_FAULT_ERROR) ret &= VM_FAULT_ERROR; goto out; @@ -4428,6 +4504,16 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) /* `inline' is required to avoid gcc 4.1.2 build error */ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { + if (vmf->flags & FAULT_FLAG_UNSHARE) { + /* + * We'll simply split the THP and handle unsharing on the + * PTE level. Unsharing only applies to anon THPs and we + * shouldn't ever find them inside shared mappings. + */ + if (WARN_ON_ONCE(vmf->vma->vm_flags & VM_SHARED)) + return 0; + goto split_fallback; + } if (vma_is_anonymous(vmf->vma)) { if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) return handle_userfault(vmf, VM_UFFD_WP); @@ -4440,7 +4526,8 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) return ret; }
- /* COW or write-notify handled on pte level: split pmd. */ +split_fallback: + /* COW, unsharing or write-notify handled on pte level: split pmd. */ __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
return VM_FAULT_FALLBACK; @@ -4551,8 +4638,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) return do_fault(vmf); }
- if (!pte_present(vmf->orig_pte)) - return do_swap_page(vmf); + if (!pte_present(vmf->orig_pte)) { + if (likely(!(vmf->flags & FAULT_FLAG_UNSHARE))) + return do_swap_page(vmf); + return 0; + }
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) return do_numa_page(vmf); @@ -4564,9 +4654,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); goto unlock; } - if (vmf->flags & FAULT_FLAG_WRITE) { - if (!pte_write(entry)) - return do_wp_page(vmf); + if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { + if (!pte_write(entry)) { + if (vmf->flags & FAULT_FLAG_WRITE) + return do_wp_page_cow(vmf); + else + return do_wp_page_unshare(vmf); + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -4607,7 +4701,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, .pgoff = linear_page_index(vma, address), .gfp_mask = __get_fault_gfp_mask(vma), }; - unsigned int dirty = flags & FAULT_FLAG_WRITE; struct mm_struct *mm = vma->vm_mm; pgd_t *pgd; p4d_t *p4d; @@ -4634,7 +4727,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
/* NUMA case for anonymous PUDs would go here */
- if (dirty && !pud_write(orig_pud)) { + if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) { ret = wp_huge_pud(&vmf, orig_pud); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -4672,7 +4765,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf);
- if (dirty && !pmd_write(vmf.orig_pmd)) { + if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && + !pmd_write(vmf.orig_pmd)) { ret = wp_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret;
On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand david@redhat.com wrote:
If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not set) and then unmaps the target page, we have:
* page has mapcount == 1 and refcount > 1
All these games with mapcount makes me think this is still broken.
mapcount has been a horribly broken thing in the past, and I'm not convinced it's not a broken thing now.
vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
page_mapcount(vmf->page) > 1) {
What keeps the mapcount stable in here?
And I still believe that the whole notion that "COW should use mapcount" is pure and utter garbage.
If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.
mapcount is insane, and I think this is making this worse again.
Linus
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds torvalds@linux-foundation.org wrote:
If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.
mapcount is insane, and I think this is making this worse again.
Maybe I'm misreading this, but afaik
- get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
This just increments the page count, because mapcount == 1.
- fork()
- unmap in the original
- child now has "mapcount == 1" on a page again, but refcount is elevated, and child HAS TO COW before writing.
Notice? "mapcount" is complete BS. The number of times a page is mapped is irrelevant for COW. All that matters is that we get an exclusive access to the page before we can write to it.
Anybody who takes mapcount into account at COW time is broken, and it worries me how this is all mixing up with the COW logic.
Now, maybe this "unshare" case is sufficiently different from COW that it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it doesn't happen for a real COW.
But honestly, for "unshare", I still don't see that the mapcount matters. What does "mapcount == 1" mean? Why is it meaningful?
Because if COW does things right, and always breaks a COW based on refcount, then what's the problem with taking a read-only ref to the page whether it is mapped multiple times or mapped just once? Anybody who already had write access to the page can write to it regardless, and any new writers go through COW and get a new page.
I must be missing something really fundamental here, but to me it really reads like "mapcount can fundamentally never be relevant for COW, and if it's not relevant for COW, how can it be relevant for a read-only copy?"
Linus
On 17.12.21 20:22, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds torvalds@linux-foundation.org wrote:
If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.
mapcount is insane, and I think this is making this worse again.
Maybe I'm misreading this, but afaik
get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
This just increments the page count, because mapcount == 1.
fork()
unmap in the original
child now has "mapcount == 1" on a page again, but refcount is
elevated, and child HAS TO COW before writing.
Hi Linus,
This is just GUP before fork(), which is in general problematic/incompatible with sharing. What we're concerned about in the context of this series (see the security issue) is GUP after fork(). And we're not changing GUP before fork() or even the COW logic in the context of this series.
I agree that GUP before fork() has to be handled differently during fork(): don't share something that cannot possibly be shared in a safe way. Don't allow COW semantics for something that is just broken with COW.
Notice? "mapcount" is complete BS. The number of times a page is mapped is irrelevant for COW. All that matters is that we get an exclusive access to the page before we can write to it.
We have to be very careful about the two sides of the story: GUP before fork and GUP after fork.
Anybody who takes mapcount into account at COW time is broken, and it worries me how this is all mixing up with the COW logic.
Now, maybe this "unshare" case is sufficiently different from COW that it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it doesn't happen for a real COW.
But honestly, for "unshare", I still don't see that the mapcount matters. What does "mapcount == 1" mean? Why is it meaningful?
I'll reply to your first mail in a sec.
GUP is the problem with COW, not ordinary processes mapping a page (mapcount), where you will only get new sharers during fork() -- in a very controlled way. So GUP has to take care to unshare *before* taking a reference, such that we can never reach the point of missed COW. GUP really is the problematic bit with it all.
Without GUP, we'd be living in a wonderful world in regards to COW.
Because if COW does things right, and always breaks a COW based on refcount, then what's the problem with taking a read-only ref to the page whether it is mapped multiple times or mapped just once? Anybody who already had write access to the page can write to it regardless, and any new writers go through COW and get a new page.
Let's just take a look at what refcount does *wrong*. Let's use an adjusted version of your example above, because it's a perfect fit:
1. mem = mmap(pagesize, MAP_PRIVATE) -> refcount == 1
2. memset(mem, 0, pagesize); /* Page is mapped R/W */
3. fork() /* Page gets mapped R/O */ -> refcount > 1
4. child quits -> refcount == 1
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
And GUP sees something different than our MM -- and this is perfectly valid: the R/O pin is just reading page content we might be modifying afterwards. Take out 3. and 4. and it works as expected. This wouldn't happen when relying on the mapcount.
And 3+4 can really be anything that results in a R/O mapping of an anonymous page, even if it's just swapout followed by read fault that maps the page R/O.
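To make the sequence above concrete, here is a hedged userspace sketch (this is not one of the linked reproducers; error handling is omitted) that uses vmsplice() as the R/O pin in step 5:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t pagesize = sysconf(_SC_PAGESIZE);
	char *mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);	/* 1. refcount == 1 */
	struct iovec iov = { .iov_base = mem, .iov_len = pagesize };
	int fds[2];
	char first;

	memset(mem, 0, pagesize);		/* 2. page is mapped R/W */
	if (fork() == 0)			/* 3. page gets mapped R/O */
		_exit(0);
	wait(NULL);				/* 4. child quits */

	pipe(fds);
	vmsplice(fds[1], &iov, 1, 0);		/* 5. R/O pin via GUP */
	memset(mem, 0xff, pagesize);		/* 6. write fault */

	/* With the page_count() check, step 6 COWs and the pipe still sees 0x00. */
	read(fds[0], &first, 1);
	return first;
}

With a mapcount-based check, step 6 would instead reuse the page and the pipe would observe the 0xff bytes, i.e. stay coherent with the process page tables.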
I must be missing something realyl fundamental here, but to me it really reads like "mapcount can fundamentally never be relevant for COW, and if it's not relevant for COW, how can it be relevant for a read-only copy?"
It really is the right value to use. Only GUP is the problematic bit that has to trigger unsharing to not mess up COW logic later. Take GUP out of the equation and COW just works as expected with the mapcount -- as long as we can read an atomic value and synchronize against fork. (again, still composing the other mail :) )
On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand david@redhat.com wrote:
On 17.12.21 20:22, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds torvalds@linux-foundation.org wrote:
get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
This just increments the page count, because mapcount == 1.
fork()
unmap in the original
child now has "mapcount == 1" on a page again, but refcount is
elevated, and child HAS TO COW before writing.
Hi Linus,
This is just GUP before fork(), which is in general problematic/incompatible with sharing.
Note that my example was not meant to be an example of a problem per se, but purely as an example of how meaningless 'mapcount' is, and how 'mapcount==1' isn't really a very meaningful test.
So it wasn't meant to show "look, GUP before fork is problematic". We have that problem already solved at least for regular pages.
It was purely meant to show how "mapcount==1" isn't a meaningful thing to test, and my worry about how you're adding that nonsensical test to the new code.
Let's just take a look at what refcount does *wrong*. Let's use an adjusted version of your example above, because it's a perfect fit:
1. mem = mmap(pagesize, MAP_PRIVATE) -> refcount == 1
2. memset(mem, 0, pagesize); /* Page is mapped R/W */
3. fork() /* Page gets mapped R/O */ -> refcount > 1
4. child quits -> refcount == 1
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
I do not believe this is actually a bug.
You asked for a R/O pin, and you got one.
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
So what is your argument?
Linus
On Fri, Dec 17, 2021 at 12:36 PM Linus Torvalds torvalds@linux-foundation.org wrote:
- Take a R/O pin (RDMA, VFIO, ...)
-> refcount > 1
- memset(mem, 0xff, pagesize);
-> Write fault -> COW
I do not believe this is actually a bug.
You asked for a R/O pin, and you got one.
If you want a shared pin that actually follows the changes of your process around, then that is what you should have asked for.
At the time of such a shared pin, you can do what we already do: re-use the page if it has a refcount of 1. Or do an early COW event (feel free to avoid the "mark it writable and dirty").
But note: *refcount* of 1. Not "mapcount". Because mapcount would be broken garbage.
Linus
On Fri, Dec 17, 2021 at 12:39 PM Linus Torvalds torvalds@linux-foundation.org wrote:
At the time of such a shared pin, you can do what we already do: re-use the page if it has a refcount of 1. Or do an early COW event (feel free to avoid the "mark it writable and dirty").
Note that this also depends on fork() doing the right thing, marking things for "a fork() can not share this page any more".
Which it does for regular pages, and is exactly what that page_needs_cow_for_dma() logic is all about (and the special write_protect_seq around gup/fork).
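For reference, as of the v5.16 base of this series, that fork-side check looks roughly like this (paraphrased from memory, so treat it as a sketch rather than an exact copy):

static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
					  struct page *page)
{
	if (!is_cow_mapping(vma->vm_flags))
		return false;

	if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
		return false;

	return page_maybe_dma_pinned(page);
}

copy_present_pte() uses this to decide whether fork() must copy the page immediately instead of sharing it COW, with mm->write_protect_seq guarding against GUP-fast racing with fork().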
I do believe that huge-pages don't do it right. But I think that as you try to fix hugepages, you are now breaking the normal case.
If all your logic was only about hugepages, I wouldn't care so much. But you are playing questionable games with code that I think is correct.
Please explain why.
Linus
On 17.12.21 21:36, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand david@redhat.com wrote:
On 17.12.21 20:22, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds torvalds@linux-foundation.org wrote:
get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
This just increments the page count, because mapcount == 1.
fork()
unmap in the original
child now has "mapcount == 1" on a page again, but refcount is
elevated, and child HAS TO COW before writing.
Hi Linus,
This is just GUP before fork(), which is in general problematic/incompatible with sharing.
Note that my example was not meant to be an example of a problem per se, but purely as an example of how meaningless 'mapcount' is, and how 'mapcount==1' isn't really a very meaningful test.
So it wasn't mean to show "look, GUP before fork is problematic". We have that problem already solved at least for regular pages.
It was purely meant to show how "mapcount==1" isn't a meaningful thing to test, and my worry about how you're adding that nonsensical test to the new code.
Let's just take a look at what refcount does *wrong*. Let's use an adjusted version of your example above, because it's a perfect fit:
1. mem = mmap(pagesize, MAP_PRIVATE) -> refcount == 1
2. memset(mem, 0, pagesize); /* Page is mapped R/W */
3. fork() /* Page gets mapped R/O */ -> refcount > 1
4. child quits -> refcount == 1
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
I do not believe this is actually a bug.
It's debatable if it's a BUG or not (I think it is one). It's for sure inconsistent.
You asked for a R/O pin, and you got one.
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
Where in the code did I ask for a COW event? I asked for a R/O pin, not any kind of memory protection.
On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand david@redhat.com wrote:
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
Where in the code did I ask for a COW event? I asked for a R/O pin, not any kind of memory protection.
Why didn't you ask for a shared pin, if that is what you want?
We already support that.
If you don't like the read-only pins, don't use them. It's that simple.
Linus
On Fri, Dec 17, 2021 at 12:45:45PM -0800, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand david@redhat.com wrote:
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
Where in the code did I ask for a COW event? I asked for a R/O pin, not any kind of memory protection.
Why didn't you ask for a shared pin, if that is what you want?
We already support that.
If you don't like the read-only pins, don't use them. It's that simple.
So you are saying that if a GUP user wants to see changes made by userspace to the page after the GUP it must ask for FOLL_WRITE, even if it doesn't intend to write to the page?
That's news to me.
Or did I misunderstand you?
On Sat, Dec 18, 2021 at 2:52 PM Kirill A. Shutemov kirill@shutemov.name wrote:
So you are saying that if a GUP user wants to see changes made by userspace to the page after the GUP it must ask for FOLL_WRITE, even if it doesn't have intend to write to the page?
Yup. Put the onus very clearly on GUP.
It's a very special operation, and it's one of the operations that cause a lot of problems for the VM code. It's by no means the _only_ one - we've got a lot of other things that cause issues - but I think it's very clear that GUP is very very special, and nobody should say "I want GUP to do whatever".
There are two cases for GUP:
- look up the page as it is *NOW*
- look up the page in order to see any future state on it (and possibly modify it)
that "any future state" is a fundamentally much heavier operation. It's the one that really *has* to get rid of any sharing, and it has to make sure no sharing happens in the future either.
So if it is an anonymous page, it basically needs to act like a write. Even if that page is then used only for reading.
Note that here "if it's anonymous" is kind of a big deal. If it's a shared file-mapped page, at that point it's "just another reference". It's potentially problematic even in that case (think of "truncate()" that tries to force-unmap all pages from VMs), but for the shared case the answer is "if you truncate it and disassociate the page from the mapping, it's _your_ problem".
And once it acts as a write, and once it does that COW and we have exclusive access to it, it might as well be just writable and dirty. You've done the expensive part already. You've forced it to be private to that VM.
And this was all triggered by the user doing something very special, so no amount of "but POSIX" or whatever matters.
GUP is not great. If you use GUP, you get to deal with the downsides.
Linus
On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
I do not believe this is actually a bug.
You asked for a R/O pin, and you got one.
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the COW breaking the coherence. In these cases 'r/o pin' does not mean "snapshot the data", but it's only a promise not to write to the pages while still desiring coherence with the memory map.
Eg in RDMA we know of apps asking for a R/O pin of something in .bss then filling that something with data finally doing the actual DMA. Breaking COW after pin breaks those apps.
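A hedged sketch of what such a coherent r/o pin request looks like from a driver (uaddr and npages are placeholders; error handling is omitted):

/*
 * Sketch only: a long-term pin that stays coherent with the process page
 * tables, even though the device will only read from the memory.
 */
struct page **pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
int pinned = pin_user_pages_fast(uaddr, npages,
				 FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
				 pages);

/* ... program the device for read-only DMA to these pages ... */

unpin_user_pages(pages, pinned);
kvfree(pages);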
The above #5 can occur for O_DIRECT read and in that case the 'snapshot the data' is perfectly fine as racing the COW with the O_DIRECT read just resolves the race toward the read() direction.
IIRC there is some other scenario that motivated this patch?
Jason
On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe jgg@nvidia.com wrote:
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the COW breaking the coherence. In these case 'r/o pin' does not mean "snapshot the data", but its only a promise not to write to the pages and still desires coherence with the memory map.
Eg in RDMA we know of apps asking for a R/O pin of something in .bss then filling that something with data finally doing the actual DMA. Breaking COW after pin breaks those apps.
I agree.
And my argument is that those kinds of things that ask for a R/O pin are broken, and should just make sure to use the shared pins.
If the page was already writable, you can just re-use the page directly (marking it pinned, so that any subsequent fork() does the right pre-cow thing)
And if the page was *NOT* already writable, you do a COW - which might be sharing the page directly too, if it has no other references.
What's the downside of just doing this properly?
Again: if a DMA user wants coherent memory, then it should use the coherent pinning. Not some kind of "read-only sharing that looks at crazy mapcounts that have absolutely zero relevance to whether the page is coherent or not".
Linus
On 17.12.21 21:56, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe jgg@nvidia.com wrote:
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the COW breaking the coherence. In these case 'r/o pin' does not mean "snapshot the data", but its only a promise not to write to the pages and still desires coherence with the memory map.
Eg in RDMA we know of apps asking for a R/O pin of something in .bss then filling that something with data finally doing the actual DMA. Breaking COW after pin breaks those apps.
I agree.
I agree that breaking COW after a pin should never be done. Therefore, break the COW before the pin -> unsharing as implemented here.
And my argument is that those kinds of things that ask for a R/O pin are broken, and should just make sure to use the shared pins.
And trigger a write fault although they are just reading. To me this is just a band aid instead of eventually ...
...
What's the downside of just doing this properly?
Doing it properly by fixing GUP and not the COW logic. GUP and COW are just incompatible and we should unshare early.
Honestly, the memory corruptions we can trigger in user space due to the current COW logic *especially* with FOLL_WRITE users such as O_DIRECT, io_uring or vfio are not a joke anymore. (again, link in the cover letter)
On 17.12.21 21:47, Jason Gunthorpe wrote:
On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
I do not believe this is actually a bug.
You asked for a R/O pin, and you got one.
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
Hi Jason
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
I heard that statement often. Can you point me at the code?
VFIO: drivers/vfio/vfio_iommu_type1.c
vaddr_get_pfns() will end up doing a pin_user_pages_remote(FOLL_LONGTERM) without FOLL_FORCE|FOLL_WRITE.
Is that added automatically internally?
Note the comment in the next patch
+ *
+ * TODO: although the security issue described does no longer apply in any case,
+ * the full consistency between the pinned pages and the pages mapped into the
+ * page tables of the MM only apply to short-term pinnings only. For
+ * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be
+ * inefficient and still result in some consistency issues. Extend this
+ * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding
+ * FOLL_WRITE|FOLL_FORCE.
Eg in RDMA we know of apps asking for a R/O pin of something in .bss then filling that something with data finally doing the actual DMA. Breaking COW after pin breaks those apps.
The above #5 can occur for O_DIRECT read and in that case the 'snapshot the data' is perfectly fine as racing the COW with the O_DIRECT read just resolves the race toward the read() direction.
IIRC there is some other scenario that motivated this patch?
1. I want to fix the COW security issue as documented. Reproducers in patch #11
2. I want to fix all of the other issues as documented and linked in the cover letter that result from the imprecise page_count check in COW code. Especially the ones where we have memory corruptions, because this is just not acceptable. There are reproducers as well for everybody that doesn't believe me.
But this series really just wants to fix the security issue as "part 1". Without any more breakages.
I'm sorry, but it's all described in the cover letter. Maybe TL;DR
On Fri, Dec 17, 2021 at 10:04:11PM +0100, David Hildenbrand wrote:
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
I heard that statement often. Can you point me at the code?
VFIO: drivers/vfio/vfio_iommu_type1.c
vaddr_get_pfns() will end up doing a pin_user_pages_remote(FOLL_LONGTERM) without FOLL_FORCE|FOLL_WRITE.
Is that added automatically internally?
No, it is just that VFIO is broken in this regard. AFAIK VFIO users rarely use the read-only mode and haven't noticed this bug yet.
You can search for FOLL_FORCE and see the drivers that do it this way:
drivers/misc/habanalabs/common/memory.c:       FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
drivers/infiniband/core/umem.c:                gup_flags |= FOLL_FORCE;
drivers/media/v4l2-core/videobuf-dma-sg.c:     unsigned int flags = FOLL_FORCE;
drivers/media/common/videobuf2/frame_vector.c: FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
[etc]
No doubt there are others that do it right and wrong; this is badly documented and misunderstood.
But this series really just wants to fix the security issue as "part 1". Without any more breakages.
Sure, I'm just trying to understand why this example was brought up.
Jason
On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe jgg@nvidia.com wrote:
On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
I do not believe this is actually a bug.
You asked for a R/O pin, and you got one.
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the COW breaking the coherence. In these case 'r/o pin' does not mean "snapshot the data", but its only a promise not to write to the pages and still desires coherence with the memory map.
Eg in RDMA we know of apps asking for a R/O pin of something in .bss then filling that something with data finally doing the actual DMA. Breaking COW after pin breaks those apps.
The above #5 can occur for O_DIRECT read and in that case the 'snapshot the data' is perfectly fine as racing the COW with the O_DIRECT read just resolves the race toward the read() direction.
IIRC there is some other scenario that motivated this patch?
I think that there is an assumption that once a page is COW-broken, it would never have another write-fault that might lead to COW breaking later.
AFAIK at least after userfaultfd-WP followed by userfaultfd-write-unprotect a page might be write-protected and go through do_wp_page() a second time to be COW-broken again. In such case, I think the FOLL_FORCE|FOLL_WRITE would not help.
I suspect (not sure) that this might even happen with mprotect() since I do not see all code-paths preserving whether the page was writable.
On 17.12.21 22:15, Nadav Amit wrote:
On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe jgg@nvidia.com wrote:
On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1
6. memset(mem, 0xff, pagesize); -> Write fault -> COW
I do not believe this is actually a bug.
You asked for a R/O pin, and you got one.
Then somebody else modified that page, and you got exactly what you asked for - a COW event. The original R/O pin has the original page that it asked for, and can read it just fine.
To remind all, the GUP users, like RDMA, VFIO use FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the COW breaking the coherence. In these case 'r/o pin' does not mean "snapshot the data", but its only a promise not to write to the pages and still desires coherence with the memory map.
Eg in RDMA we know of apps asking for a R/O pin of something in .bss then filling that something with data finally doing the actual DMA. Breaking COW after pin breaks those apps.
The above #5 can occur for O_DIRECT read and in that case the 'snapshot the data' is perfectly fine as racing the COW with the O_DIRECT read just resolves the race toward the read() direction.
IIRC there is some other scenario that motivated this patch?
I think that there is an assumption that once a page is COW-broken, it would never have another write-fault that might lead to COW breaking later.
AFAIK at least after userfaultfd-WP followed by userfaultfd-write-unprotect a page might be write-protected and go through do_wp_page() a second time to be COW-broken again. In such case, I think the FOLL_FORCE|FOLL_WRITE would not help.
I suspect (not sure) that this might even happen with mprotect() since I do not see all code-paths preserving whether the page was writable.
uffd-wp and mprotect() are broken as well, yes. But the easiest example is just swap + read fault.
Section 2 and 3 in [1], along with reproducers.
Note that I didn't mention uffd-wp and mprotect(), because these require "manual intervention". With swap, it's not your application doing something "special".
[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
On Fri, Dec 17, 2021 at 09:15:45PM +0000, Nadav Amit wrote:
I think that there is an assumption that once a page is COW-broken, it would never have another write-fault that might lead to COW breaking later.
Yes, that is what Linus has been explaining, AFAICT
AFAIK at least after userfaultfd-WP followed by userfaultfd-write-unprotect a page might be write-protected and go through do_wp_page() a second time to be COW-broken again. In such case, I think the FOLL_FORCE|FOLL_WRITE would not help.
Right, and this is a good reason why refcount is running into trouble: it COWs too much in cases like that, because userfaultfd-WP doesn't align with the first assumption.
Jason
[ Going back in the thread to this one ]
On Fri, Dec 17, 2021 at 1:15 PM Nadav Amit namit@vmware.com wrote:
I think that there is an assumption that once a page is COW-broken, it would never have another write-fault that might lead to COW breaking later.
Right. I do think there are problems in the current code, I just think that the patches are a step back.
The problems with the current code are of two kinds:
- I think the largepage code (either THP or explicit hugetlb) doesn't do as good a job of this whole COW handling as the regular pages do
- some of the "you can make pages read-only again explicitly" kinds of loads.
But honestly, at least for the second case, if somebody does a GUP, and then starts playing mprotect games on the same virtual memory area that they did a GUP on, and are surprised when they get another COW fault that breaks their own connection with a page they did a GUP on earlier, that's their own fault.
So I think there's some of "If you broke it, you get to keep both pieces". Literally, in this case. You have your GUP page that you looked up, and you have your virtual address page that you caused COW on with mprotect() by making it read-only and then read-write again, then you have two different pages, and at some point it really is just "Well, don't do that then".
But yes, there's also some of "some code probably didn't get fully converted to the new world order". So if VFIO only uses FOLL_LONGTERM, and didn't ask for the COW breaking, then yes, VFIO will see page incoherencies. But that should be an issue of "VFIO should do the right thing".
So part of it is a combination of "if you do crazy things, you'll get crazy results". And some of it is some kernel pinning code that doesn't do the right thing to actually make sure it gets a shared page to be pinned.
And then there's THP and HUGETLB, that I do think needs fixing and aren't about those two kinds of cases.
I think we never got around to just doing the same thing we did for regular pages. I think the hugepage code simply doesn't follow that "COW on GUP, mark to not COW later" pattern.
Linus
On Fri, Dec 17, 2021 at 5:53 PM Linus Torvalds torvalds@linux-foundation.org wrote:
And then there's THP and HUGETLB, that I do think needs fixing and aren't about those two kinds of cases.
I think we never got around to just doing the same thing we did for regular pages. I think the hugepage code simply doesn't follow that "COW on GUP, mark to not COW later" pattern.
In particular, do_huge_pmd_wp_page() has this pattern:
	/*
	 * We can only reuse the page if nobody else maps the huge page or it's
	 * part.
	 */
	if (reuse_swap_page(page, NULL)) {
		... mark it writable ...
and that never got converted to "only mark it writable if we actually have exclusive access to this huge page".
So the problem is literally that reuse_swap_page() uses that "page_mapcount()" logic, and doesn't take into account that the page is actually used by a GUP reference.
Which is exactly why David then sees that "oh, we got a GUP reference to it, and now we're seeing the writes come through". Because that code looks at mapcount, and it shouldn't.
I think the hugepage code should use the exact same logic that the regular wp fault code does.
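As a hedged sketch of that direction (this is not the untested patch mentioned in the next mail, which is not reproduced here), the reuse_swap_page() call in do_huge_pmd_wp_page() would give way to a refcount check, along these lines:

	/* Sketch: reuse the huge page only when we have exclusive access. */
	if (page_count(page) == 1) {
		pmd_t entry;

		entry = pmd_mkyoung(vmf->orig_pmd);
		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
		/* haddr is the HPAGE_PMD_MASK-aligned fault address */
		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
		spin_unlock(vmf->ptl);
		return VM_FAULT_WRITE;
	}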
Linus
On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds torvalds@linux-foundation.org wrote:
I think the hugepage code should use the exact same logic that the regular wp fault code does.
IOW, I think that this stupid (AND UNTESTED) patch should likely just fix David's test-case with the hugepage and splice thing..
Or at least be somewhat close. But it should be paired with the GUP side doing the right thing too, of course. Maybe it already does, maybe it doesn't, I didn't check...
And maybe there's something subtle about the page_count() of a THP entry. Again, I can't really claim to have tested this all, but I'm hoping this makes somebody go "Ahh, now I see what Linus means"
Linus
On Fri, Dec 17, 2021 at 6:42 PM Linus Torvalds torvalds@linux-foundation.org wrote:
IOW, I think that this stupid (AND UNTESTED) patch should likely just fix David's test-case with the hugepage and splice thing..
Looking at that patch, the page lock is entirely pointless.
It existed because of that broken reuse_swap_page() that tried to count page mappings etc, but once you get rid of that - like we got rid of it for the regular pages - it's not even needed.
So as we hold the page table lock, and see a page_count() of 1, we could be done without any page lock at all. So that whole trylock/unlock is actually unnecessary.
That said, it's possibly worth re-using any swap cache pages at this point, and that would want the page lock. So some complexity in this area is likely worth it. Similar to how we did it in commit f4c4a3f48480 ("mm: free idle swap cache page after COW") for regular pages.
So that patch is not great, but I think it works as a guiding one.
And notice how *simple* it is. It doesn't require careful counting of swap entries that depend on page locking.
Linus
On 18.12.21 03:42, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds torvalds@linux-foundation.org wrote:
I think the hugepage code should use the exact same logic that the regular wp fault code does.
IOW, I think that this stupid (AND UNTESTED) patch should likely just fix David's test-case with the hugepage and splice thing..
Or at least be somewhat close. But it should be paired with the GUP side doing the right thing too, of course. Maybe it already does, maybe it doesn't, I didn't check...
And maybe there's something subtle about the page_count() of a THP entry. Again, I can't really claim to have tested this all, but I'm hoping this makes somebody go "Ahh, now I see what Linus means"
Not the reaction you are hoping for: "Gah, Linus still doesn't see why the page_count is just wrong". :)
See the mail I just wrote, let's get a common understanding of how our check should actually look like.
On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:
But honestly, at least for the second case, if somebody does a GUP, and then starts playing mprotect games on the same virtual memory area that they did a GUP on, and are surprised when they get another COW fault that breaks their own connection with a page they did a GUP on earlier, that's their own fault.
I've been told there are real workloads that do this.
Something like qemu will use GUP with VFIO to insert PCI devices into the guest and GUP with RDMA to do fast network copy of VM memory during VM migration.
qemu also uses the WP games to implement dirty tracking of VM memory during migration (and more? I'm not sure). It expects that during all of this nothing will COW the pages, as the two kinds of DMA must always go to the pages mapped to KVM.
The big trouble here is this all worked before, so it is a userspace visible regression.
Can this be made to work at all? I wonder: if qemu uses MAP_SHARED, e.g. via a memfd or something, does the COW then go away naturally?
Jason
On Dec 17, 2021, at 7:05 PM, Jason Gunthorpe jgg@nvidia.com wrote:
On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:
But honestly, at least for the second case, if somebody does a GUP, and then starts playing mprotect games on the same virtual memory area that they did a GUP on, and are surprised when they get another COW fault that breaks their own connection with a page they did a GUP on earlier, that's their own fault.
I've been told there are real workloads that do this.
Something like qemu will use GUP with VFIO to insert PCI devices into the guest and GUP with RDMA to do fast network copy of VM memory during VM migration.
qemu also uses the WP games to implement dirty tracking of VM memory during migration (and more? I'm not sure). It expects that during all of this nothing will COW the pages, as the two kinds of DMA must always go to the pages mapped to KVM.
The big trouble here is this all worked before, so it is a userspace visible regression.
In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that. Otherwise, if the page is written and you use uffd for dirty tracking, what do you actually achieve?
You can return EAGAIN (which is documented and actually returned while “mmap_changing”) in such case. This would not break userspace, but indeed still likely to cause a performance regression.
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit namit@vmware.com wrote:
In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that.
Ack. If uffd-wp finds a page that is pinned, just skip it as not write-protectable.
Because some of the pinners might be writing to it, of course - just not through the page tables.
So that sounds like the right thing to do. I _think_ we discussed this the last time this came up. I have some dim memory of that. Jason, ring a bell?
Linus
On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit namit@vmware.com wrote:
In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that.
Ack. If uffd-wp finds a page that is pinned, just skip it as not write-protectable.
Because some of the pinners might be writing to it, of course - just not through the page tables.
That doesn't address the qemu use case though. The RDMA pin is the 'coherent r/o pin' we discussed before, which requires that the pages remain un-write-protected and the HW DMA is read only.
The VFIO pin will enable dirty page tracking in the system IOMMU so it gets the same effect from qemu's perspective as the CPU WP is doing.
In these operations every single page of the guest will be pinned, so skip it just means userfault fd wp doesn't work at all.
Qemu needs some solution to be able to dirty track the CPU memory for migration..
So that sounds like the right thing to do. I _think_ we discussed this the last time this came up. I have some dim memory of that. Jason, ring a bell?
We talked about clear_refs a lot, but the use case was never really clear, I think. Plus that discussion never finalized into anything.
David's latest summary seems accurate; if I paraphrase at a high level, Linus's approach always does enough COWs but might do extra, while David's approach tries to do exactly the right number of COWs.
It looks like, to have the same functionality with Linus's approach, we need to have a way for userspace to opt out of COW and work in an entirely deterministic non-COW world. WP & GUP can never work together otherwise, which leaves qemu stranded.
Or, we follow David's approach and make COW be precise and accept the complexity..
Jason
On 18.12.21 19:42, Jason Gunthorpe wrote:
On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit namit@vmware.com wrote:
In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that.
Ack. If uffd-wp finds a page that is pinned, just skip it as not write-protectable.
Because some of the pinners might be writing to it, of course - just not through the page tables.
That doesn't address the qemu use case though. The RDMA pin is the 'coherent r/o pin' we discussed before, which requires that the pages remain un-write-protected and the HW DMA is read only.
The VFIO pin will enable dirty page tracking in the system IOMMU so it gets the same effect from qemu's perspective as the CPU WP is doing.
In these operations every single page of the guest will be pinned, so skip it just means userfault fd wp doesn't work at all.
Qemu needs some solution to be able to dirty track the CPU memory for migration..
So that sounds like the right thing to do. I _think_ we discussed this the last time this came up. I have some dim memory of that. Jason, ring a bell?
We talked about clear_refs a lot, but the use case was never really clear, I think. Plus that discussion never finalized into anything.
David's latest summary seems accurate; if I paraphrase at a high level, Linus's approach always does enough COWs but might do extra, while David's approach tries to do exactly the right number of COWs.
It looks like, to have the same functionality with Linus's approach, we need to have a way for userspace to opt out of COW and work in an entirely deterministic non-COW world. WP & GUP can never work together otherwise, which leaves qemu stranded.
Or, we follow David's approach and make COW be precise and accept the complexity..
Thanks Jason,
I would really enjoy us discussing how we can eventually make the *precise* COW model work instead of living with a broken MM subsystem, as all the reproducers show. IMHO we should stop throwing more band-aids at it.
Is my approach complete? It sounds like it's not, because Linus raised a good point that the mapcount in its current state might not be stable for our use case. I'm very happy that he reviewed this series.
I have some ideas to make the "_mapcount" of anonymous pages express exactly that: how many active (PTE mapped) users do we have and how many inactive (swap entries, migration entries) do we have. We can certainly discuss any such approaches, but first there should be the will to try getting it right ...
On Dec 18, 2021, at 10:42 AM, Jason Gunthorpe jgg@nvidia.com wrote:
On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit namit@vmware.com wrote:
In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that.
Ack. If uffd-wp finds a page that is pinned, just skip it as not write-protectable.
Because some of the pinners might be writing to it, of course - just not through the page tables.
That doesn't address the qemu use case though. The RDMA pin is the 'coherent r/o pin' we discussed before, which requires that the pages remain un-write-protected and the HW DMA is read only.
The VFIO pin will enable dirty page tracking in the system IOMMU so it gets the same effect from qemu's perspective as the CPU WP is doing.
In these operations every single page of the guest will be pinned, so skip it just means userfault fd wp doesn't work at all.
Qemu needs some solution to be able to dirty track the CPU memory for migration..
My bad. I misunderstood the scenario.
Yes, I guess that you pin the pages early for RDMA registration, which is also something you may do for io_uring buffers. This would render userfaultfd unusable.
I do not see how it can be solved without custom, potentially complicated logic, which the page_count() approach wants to avoid.
The only thing I can think of is requiring the pinned regions to be first madvise’d with MADV_DONTFORK and not COW’ing in such a case. But this would break existing code.
On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit namit@vmware.com wrote:
Yes, I guess that you pin the pages early for RDMA registration, which is also something you may do for io_uring buffers. This would render userfaultfd unusable.
I think this is all on userfaultfd.
That code literally stole two of the bits from the page table layout - bits that we could have used for better things.
And guess what? Because it required those two bits in the page tables, and because that's non-portable, it turns out that UFFD_WP can only be enabled and only works on x86-64 in the first place.
So UFFD_WP is fundamentally non-portable. Don't use it.
Anyway, the good news is that I think that exactly because uffd_wp stole two bits from the page table layout, it already has all the knowledge it needs to handle this entirely on its own. It's just too lazy to do so now.
In particular, it has that special UFFD_WP bit that basically says "this page is actually writable, but I've made it read-only just to get the fault for soft-dirty".
And the hint here is that if the page truly *was* writable, then COW just shouldn't happen, and all that the page fault code should do is set soft-dirty and return with the page set writable.
And if the page was *not* writable, then UFFD_WP wasn't actually needed in the first place, but the uffd code just sets it blindly.
Notice? It _should_ be just an operation based purely on the page table contents, never even looking at the page AT ALL. Not even the page count, much less some mapcount thing.
Done right, that soft-dirty thing could work even with no page backing at all, I think.
But as far as I know, we've actually never seen a workload that does all this, so.. Does anybody even have a test-case?
Because I do think that UFFD_WP really should never really look at the page, and this issue is actually independent of the "page_count() vs page_mapcount()" discussion.
(Somewhat related aside: Looking up the page is actually one of the more expensive operations of a page fault and a lot of other page table manipulation functions - it's where most of the cache misses happen. That's true on the page fault side, but it's also true for things like copy_page_range() etc)
Linus
On Dec 18, 2021, at 2:53 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit namit@vmware.com wrote:
Yes, I guess that you pin the pages early for RDMA registration, which is also something you may do for io_uring buffers. This would render userfaultfd unusable.
I think this is all on userfaultfd.
That code literally stole two of the bits from the page table layout - bits that we could have used for better things.
And guess what? Because it required those two bits in the page tables, and because that's non-portable, it turns out that UFFD_WP can only be enabled and only works on x86-64 in the first place.
So UFFD_WP is fundamentally non-portable. Don't use it.
I have always felt that the PTE software-bits limit is very artificial. We can just allocate two adjacent pages when needed, one for PTEs and one for extra software bits. A software bit in the PTE can indicate that “extra software bits” are relevant (to save cache-misses), and a bit in the PTEs' page-struct indicates whether there is an adjacent “extra software bits” page.
I’ve done something similar once in a research project. It is rather similar to what is done for PTI at the PGD level.
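For illustration, a hypothetical sketch of what such an adjacent "extra software bits" page could look like; none of these helpers exist in the kernel, and it assumes the shadow entries are unsigned-long sized and sit at the same offset exactly one page above the PTE:

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Hypothetical: if every PTE page were allocated together with an
 * adjacent shadow page, the extra bits for a given pte_t * would live
 * at the same offset, exactly one page further up.
 */
static inline unsigned long *pte_extra_bits(pte_t *ptep)
{
	return (unsigned long *)((unsigned long)ptep + PAGE_SIZE);
}

static inline bool pte_test_extra_bit(pte_t *ptep, unsigned long bit)
{
	/* Only meaningful if the PTE's "extra bits present" flag is set. */
	return *pte_extra_bits(ptep) & bit;
}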
Anyway, the good news is that I think that exactly because uffd_wp stole two bits from the page table layout, it already has all the knowledge it needs to handle this entirely on its own. It's just too lazy to do so now.
In particular, it has that special UFFD_WP bit that basically says "this page is actually writable, but I've made it read-only just to get the fault for soft-dirty".
And the hint here is that if the page truly *was* writable, then COW just shouldn't happen, and all that the page fault code should do is set soft-dirty and return with the page set writable.
And if the page was *not* writable, then UFFD_WP wasn't actually needed in the first place, but the uffd code just sets it blindly.
I don’t think that I am following. The write-protection of UFFD means that the userspace wants to intervene before anything else (including COW).
UFFD_WP indications are recorded per PTE (i.e., not VMA). So if userspace wants to intervene on write, it must use UFFD_WP even if the page is write-protected. The kernel then has to keep the UFFD_WP indication to call userspace upon a write.
Notice? It _should_ be just an operation based purely on the page table contents, never even looking at the page AT ALL. Not even the page count, much less some mapcount thing.
Done right, that soft-dirty thing could work even with no page backing at all, I think.
But as far as I know, we've actually never seen a workload that does all this, so.. Does anybody even have a test-case?
Because I do think that UFFD_WP really should never really look at the page, and this issue is actually independent of the "page_count() vs page_mapcount()" discussion.
I can think of two examples for reasonable flows of UFFD:
[ M = Monitor thread; F = Faulting thread ]
(A) Userspace page-fault tracking (e.g., for memory migration):
1. M: WP memory.
2. F: WP page-fault: provide UFFD notification.
3. M: Unprotect the page.
4. M: Wake the faulting thread (usually as part of the unprotect).
5. F: Retry the page-fault (and handle COW).
(B) Userspace memory snapshots:
1. M: Write-protect memory.
2. M: Copy the memory to disk.
3. M: Write-unprotect memory (e.g., speculatively, as you expect a page to be written to and do not want to pay the #PF price).
[ notice that the un-protection is logical, not really in the PTEs]
4. F: Get a page-fault (later) and handle it (because it might or might not need COW)
There may be “crazier” flows (e.g., wake the faulting thread and emulate the instruction that triggered the write with ptrace), but let’s put those aside.
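As a rough illustration of the monitor side of flow (A), here is a sketch using the userfaultfd(2) UFFDIO_WRITEPROTECT API. The uffd setup and registration of the range with UFFDIO_REGISTER_MODE_WP are assumed to have happened already; the page_size parameter and the omission of error handling are simplifications:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void monitor_flow_a(int uffd, unsigned long start, unsigned long len,
			   unsigned long page_size)
{
	struct uffdio_writeprotect wp;
	struct uffd_msg msg;

	/* 1. M: write-protect the range */
	wp.range.start = start;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* 2. F: write faults; M: reads the notification */
	read(uffd, &msg, sizeof(msg));
	if (msg.event != UFFD_EVENT_PAGEFAULT ||
	    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
		return;

	/* 3.+4. M: unprotect the faulting page and wake the thread */
	wp.range.start = msg.arg.pagefault.address & ~(page_size - 1);
	wp.range.len = page_size;
	wp.mode = 0;			/* clear WP; wakes by default */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* 5. F: retries the fault and proceeds (possibly via COW) */
}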
IIUC the solution you propose, it tries to address flows such as (A).
I am not sure whether the proposal is to change the write-protection API to only provide notifications (i.e., not block until after the page-fault, as done today), but I do not see how it addresses (B).
I am not saying it is impossible, but I think that the solution would complicate the code by making UFFD a special case.
On Sat, Dec 18, 2021 at 4:19 PM Nadav Amit namit@vmware.com wrote:
I have always felt that the PTE software-bits limit is very artificial. We can just allocate two adjacent pages when needed, one for PTEs and one for extra software bits. A software bit in the PTE can indicate that “extra software bits” are relevant (to save cache-misses), and a bit in the PTEs' page-struct indicates whether there is an adjacent “extra software bits” page.
Hmm. That doesn't sound very bad, no. And it would be nice to have more software bits (and have them portably).
I don’t think that I am following. The write-protection of UFFD means that the userspace wants to intervene before anything else (including COW).
The point I was making (badly) is that UFFD_WP is only needed for the case where the pte isn't already non-writable for other reasons.
UFFD_WP indications are recorded per PTE (i.e., not VMA).
The changing of those bits is basically a bastardized 'mprotect()', and it does already require the vma to be marked VM_UFFD_WP.
And the way you set (or clear) the bits is with a range operation. It really could have been done with mprotect(), and with actual explicit vma bits.
The fact that it now uses the page table bit is rather random. I think it would actually be cleaner to make that userfaultfd_writeprotect truly *be* a vma range.
Right now it's kind of "half this, half that".
Of course, it's possible that because of this situation, some users do a lot of fine-grained VM_UFFD_WP setting, and they kind of expect to not have issues with lots of vma fragments. So practical concerns may have made the implementation set in stone.
(I have only ever seen the kernel side of uffd, not the actual user side, so I'm not sure about the use patterns).
That said, your suggestion of a shadow sw page table bit thing would also work. And it would solve some problems we have in core areas (notably "page_special()" which right now has that ARCH_HAS_PTE_SPECIAL thing).
It would make it really easy to have that "this page table entry is pinned" flag too.
Linus
On Dec 18, 2021, at 4:35 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
(I have only ever seen the kernel side of uffd, not the actual user side, so I'm not sure about the use patterns).
I use it at a very fine granularity, and I suspect QEMU and CRIU do so too.
That said, your suggestion of a shadow sw page table bit thing would also work. And it would solve some problems we have in core areas (notably "page_special()" which right now has that ARCH_HAS_PTE_SPECIAL thing).
It would make it really easy to have that "this page table entry is pinned" flag too.
I found my old messy code for the software-PTE thing.
I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]
My rationale was that:
1. It does not bind you to have the same size for PTE and “extra-PTE”.
2. The PMD-page struct is anyhow hot (since you acquired the PTL).
3. Allocating the “extra-PTE” dynamically does not require rewiring the page-tables, which would require a TLB flush.
I think there is a place to hold a pointer in the PMD-page-struct (_pt_pad_1, we just need to keep the lowest bit clear so the kernel won’t mistake it for a compound page).
I still don’t know what exactly you have in mind for making use out of it for the COW issue. Keeping a pin-count (which requires internal API changes for unpin_user_page() and friends?) or having “was ever pinned” sticky bit? And then changing page_needs_cow_for_dma() to look at the PTE so copy_present_pte() would break the COW eagerly?
Anyhow, I can clean it up and send (although it is rather simple and I ignored many things, such as THP, remap, etc), but I am not sure I have the time now to fully address the COW problem. I will wait for Monday for David’s response.
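For illustration, a rough sketch of the lookup this would imply; it is hypothetical, not existing kernel code, and it assumes the shadow array pointer is stashed in the page-table page's _pt_pad_1 with bit 0 kept clear, as described above:

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Hypothetical: the pointer to the "extra PTEs" array is stashed in the
 * page-table page's struct page (_pt_pad_1), with bit 0 kept clear so
 * it can never be mistaken for a compound_head pointer.
 */
static unsigned long *pte_shadow_entry(pmd_t *pmd, unsigned long addr)
{
	struct page *ptpage = pmd_page(*pmd);	/* the PTE page itself */
	unsigned long *shadow = (unsigned long *)ptpage->_pt_pad_1;

	if (!shadow)
		return NULL;		/* no shadow page allocated (yet) */
	return shadow + pte_index(addr);
}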
On 12/18/21 22:02, Nadav Amit wrote:
On Dec 18, 2021, at 4:35 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
(I have only ever seen the kernel side of uffd, not the actual user side, so I'm not sure about the use patterns).
I use it at a very fine granularity, and I suspect QEMU and CRIU do so too.
That said, your suggestion of a shadow sw page table bit thing would also work. And it would solve some problems we have in core areas (notably "page_special()" which right now has that ARCH_HAS_PTE_SPECIAL thing).
It would make it really easy to have that "this page table entry is pinned" flag too.
I found my old messy code for the software-PTE thing.
I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]
My rationale was that:
- It does not bind you to have the same size for PTE and “extra-PTE”
- The PMD-page struct is anyhow hot (since you acquired the PTL)
- Allocating the “extra-PTE” dynamically does not require rewiring the page-tables, which would require a TLB flush.
I think there is a place to hold a pointer in the PMD-page-struct (_pt_pad_1, we just need to keep the lowest bit clear so the kernel won’t mistake it for a compound page).
I still don’t know what exactly you have in mind for making use out of it for the COW issue. Keeping a pin-count (which requires internal API changes for unpin_user_page() and friends?) or having “was ever pinned” sticky bit? And then changing page_needs_cow_for_dma() to look at the PTE so copy_present_pte() would break the COW eagerly?
Anyhow, I can clean it up and send (although it is rather simple and I ignored many things, such as THP, remap, etc), but I am not sure I have the time now to fully address the COW problem. I will wait for Monday for David’s response.
Hi Nadav,
A couple of thoughts about this part of the design:
a) The PMD-page-struct approach won't help as much, because (assuming that we're using it in an attempt to get a true, perfect pin count), you are combining the pin counts of a PMD's worth of pages. OTOH...maybe that actually *is* OK, assuming you don't overflow--except that you can only answer the "is it dma-pinned?" question at a PMD level. That's a contradiction of your stated desire above to have very granular control.
Also, because of not having bit 0 available in page._pt_pad_1, I think the count would have to be implemented as adding and subtracting 2, instead of 1 (in order to keep the value even), further reducing the counter range.
b) If, instead, you used your older 2-adjacent pages approach, then Linus' comment makes more sense here: we could use the additional struct page to hold an exact pin count, per page. That way, you can actually build a wrapper function such as:
page_really_is_dma_pinned()
...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().
Furthermore, this direction is extensible and supports solving other "I am out of space in struct page" problems, at the cost of more pages, of course.
As an aside, I find it instructive that we're talking about this approach, instead of extending struct page. The lesson I'm taking away is: allocating more space for some cases (2x pages) is better than having *all* struct pages be larger than they are now.
Anyway, the pin count implementation would look somewhat like the existing hpage_pincount, which similarly has ample space for a separate, exact pin count. In other words, this sort of thing (mostly-pseudo code):
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..646761388025 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page)
 	return PageCompound(page) && compound_order(page) > 1;
 }
 
+static inline bool shadow_page_pincount_available(struct page *page)
+{
+	/*
+	 * TODO: Nadav: connect this up with the shadow page table
+	 * implementation, and return an appropriate answer.
+	 */
+
+	return false; // hardcoded for now, for compile testing
+}
+
 static inline int head_compound_pincount(struct page *head)
 {
 	return atomic_read(compound_pincount_ptr(head));
@@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page)
 	return head_compound_pincount(page);
 }
 
+static inline int shadow_page_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page);
+
+	return atomic_read(shadow_page_pincount_ptr(page));
+}
+
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
@@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page)
 	if (hpage_pincount_available(page))
 		return compound_pincount(page) > 0;
 
+	if (shadow_page_pincount_available(page))
+		return shadow_page_pincount(page) > 0;
+
 	/*
 	 * page_ref_count() is signed. If that refcount overflows, then
 	 * page_ref_count() returns a negative value, and callers will avoid
c) The "was it ever pinned" sticky bit is not a workable concept, at the struct page level. A counter is required, in order to allow pages to live out their normal lives to their fullest potential. The only time we even temporarily got away with this kind of stickiness was at a higher level, and only per-process, not per-physical-page. Processes come and go, but the struct pages are more or less forever, so once you mark one sticky like this, it's out of play.
thanks,
On Sun, Dec 19, 2021 at 12:01:59AM -0800, John Hubbard wrote:
On 12/18/21 22:02, Nadav Amit wrote:
I found my old messy code for the software-PTE thing.
I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]
a) The PMD-page-struct approach won't help as much, because (assuming that we're using it in an attempt to get a true, perfect pin count), you are combining the pin counts of a PMD's worth of pages. OTOH...maybe that actually *is* OK, assuming you don't overflow--except that you can only answer the "is it dma-pinned?" question at a PMD level. That's a contradiction of your stated desire above to have very granular control.
Also, because of not having bit 0 available in page._pt_pad_1, I think the count would have to be implemented as adding and subtracting 2, instead of 1 (in order to keep the value even), further reducing the counter range.
I think you misunderstood Nadav's approach. He's talking about making an extra side-allocation per PMD if you're using uffd, and storing extra information in it. I think it's a worthwhile approach.
On Sat, Dec 18, 2021 at 10:02 PM Nadav Amit namit@vmware.com wrote:
I found my old messy code for the software-PTE thing.
I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]
Ok, I understand why that ends up being the choice, but it makes it too ugly and messy to look up to be worth it, I think.
I still don’t know what exactly you have in mind for making use out of it for the COW issue.
So the truly fundamental question for COW (and for a long-term GUP) is fairly simple:
- Is the page I have truly owned exclusively by this VM?
If that _isn't_ the case, you absolutely have to COW.
If that _is_ the case, you can re-use the page.
That is really it, boiled down to the pure basics.
And if you aren't sure whether you are the ultimate and only authority over the page, then COW is the "safer" option, in that breaking sharing is fundamentally better than over-sharing.
Now, the reason I like "page_count()==1" is that it is a 100% certain way to know that you own the page absolutely and clearly.
There is no question what-so-ever about it.
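To make that concrete, a condensed sketch of the rule; this is only a simplification of what the current do_wp_page() reuse check for anonymous pages boils down to, with page locking and swap-cache handling deliberately elided:

#include <linux/mm.h>

/*
 * Exactly one reference means nobody else - not GUP, not the swap
 * cache, not another mapping - can possibly hold on to this page.
 */
static bool sketch_can_reuse_anon_page(struct page *page)
{
	return PageAnon(page) && !PageKsm(page) && page_count(page) == 1;
}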
And the reason I hate "page_mapcount()==1" with a passion is that it is NOTHING OF THE KIND. It is an entirely meaningless number. It doesn't mean anything at all.
Even if the page mapcount is exactly right, it could easily and trivially be a result of "fork, then unmap in either parent or child".
Now that page_mapcount() is unquestionably 1, but despite that, at some point the page was shared by another VM, and you can not know whether you really have exclusive access.
And that "even if page mapcount is exactly right" is a big issue in itself, as I hope I've explained.
It requires page locking, it requires that you take swapcache users into account, it is just a truly messy and messed up thing.
There really is absolutely no reason for page_mapcount to exist. It's a mistake. We have it for completely broken historical reasons.
It's WRONG.
Now, if "page_count()==1" is so great, what is the issue? Problem solved.
No, while page_count()==1 is one really fundamental marker (unlike the mapcount), it does have problems too.
Because yes, "page_count()==1" does mean that you have truly exclusive ownership of the page, but the reverse is not true.
The way the current regular VM code handles that "the reverse is not true" is by making "the page is writable" be the second way you can say "you clearly have full ownership of the page".
So that's why you then have the "maybe_pinned()" thing in fork() and in swap cache creation that keeps such a page writable, and doesn't do the virtual copy and make it read-only again.
But that's also why it has problems with write-protect (whether mprotect or uffd_wp).
Anyway, that was a long explanation to make the thinking clear, and finally come to the actual answer to your question:
Adding another bit in the page tables - *purely* to say "this VM owns the page outright" - would be fairly powerful. And fairly simple.
Then any COW event will set that bit - because when you actually COW, the page you install is *yours*. No questions asked.
And fork() would simply clear that bit (unless the page was one of the pinned pages that we simply copy).
See how simple that kind of concept is.
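As a minimal sketch of that concept: the "exclusive" marker and the Set/Clear/Test helpers below are hypothetical names chosen only for illustration, none of this is existing kernel code:

#include <linux/mm.h>

/* Hypothetical helpers for a per-page "exclusively owned" marker: */
bool PageAnonExclusive(struct page *page);
void SetPageAnonExclusive(struct page *page);
void ClearPageAnonExclusive(struct page *page);

/* fork(): the child gets a reference, so the page is no longer exclusive. */
static void sketch_fork_share_page(struct page *page)
{
	get_page(page);			/* virtual copy: share the page */
	ClearPageAnonExclusive(page);
}

/* COW fault: the freshly copied page is ours, no questions asked. */
static void sketch_cow_install_copy(struct page *new_page)
{
	SetPageAnonExclusive(new_page);
}

/* Write fault: an exclusive page can simply be reused and mapped writable. */
static bool sketch_wp_can_reuse(struct page *page)
{
	return PageAnonExclusive(page);
}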
And please, see how INCREDIBLY BROKEN page_mapcount() is. It really fundamentally is pure and utter garbage. It in no way says "I have exclusive ownership of this page", because even if the mapcount is 1 *now*, it could have been something else earlier, and some other VM could have gotten a reference to it before the current VM did so.
This is why I will categorically NAK any stupid attempt to re-introduce page_mapcount() for COW or GUP handling. It's unacceptably fundamentally broken.
Btw, the extra bit doesn't really have to be in the page tables. It could be a bit in the page itself. We could add another page bit that we just clear when we do the "add ref to page as you make a virtual copy during fork() etc".
And no, we can't use "pincount" either, because it's not exact. The fact that the page count is so elevated that we think it's pinned is a _heuristic_, and that's ok when you have the opposite problem, and ask "*might* this page be pinned". You want to never get a false negative, but it can get a false positive.
Linus
Btw, the extra bit doesn't really have to be in the page tables. It could be a bit in the page itself. We could add another page bit that we just clear when we do the "add ref to page as you make a virtual copy during fork() etc".
^ I'm playing with the idea of using a page bit to express: "This page is exclusive". On a CoW fault, if that bit is set, I can simply reuse the page.
The semantics under which the bit gets set are slightly different from what you describe, and I'm playing with additional unsharing (on GUP R/O) that avoids mapping the copied page similarly R/O and simply sets the bit.
But the general idea could fly I think, devil's in the detail ...
David, you said that you were working on some alternative model. Is it perhaps along these same lines below?
I was thinking that a bit in the page tables to say "this page is exclusive to this VM" would be a really simple thing to deal with for fork() and swapout and friends.
But we don't have such a bit in general, since many architectures have very limited sets of SW bits, and even when they exist we've spent them on things like UFFD_WP.
But the more I think about the "bit doesn't even have to be in the page tables", the more I think maybe that's the solution.
A bit in the 'struct page' itself.
For hugepages, you'd have to distribute said bit when you split the hugepage.
But other than that it looks quite simple: anybody who does a virtual copy will inevitably be messing with the page refcount, so clearing the "exclusive ownership" bit wouldn't be costly: the 'struct page' cacheline is already getting dirtied.
Or what was your model you were implying you were thinking about in your other email? You said
"I might have had an idea yesterday on how to fix most of the issues without relying on the mapcount, doing it similar [..]"
but I didn't then reply to that email because I had just written this other long email to Nadav.
Linus
On Sun, Dec 19, 2021 at 9:27 AM Linus Torvalds torvalds@linux-foundation.org wrote:
Adding another bit in the page tables - *purely* to say "this VM owns the page outright" - would be fairly powerful. And fairly simple.
Then any COW event will set that bit - because when you actually COW, the page you install is *yours*. No questions asked.
[ snip snip ]
Btw, the extra bit doesn't really have to be in the page tables. It could be a bit in the page itself. We could add another page bit that we just clear when we do the "add ref to page as you make a virtual copy during fork() etc".
And no, we can't use "pincount" either, because it's not exact. The fact that the page count is so elevated that we think it's pinned is a _heuristic_, and that's ok when you have the opposite problem, and ask "*might* this page be pinned". You want to never get a false negative, but it can get a false positive.
Linus
On 19.12.21 18:44, Linus Torvalds wrote:
David, you said that you were working on some alternative model. Is it perhaps along these same lines below?
I was thinking that a bit in the page tables to say "this page is exclusive to this VM" would be a really simple thing to deal with for fork() and swapout and friends.
But we don't have such a bit in general, since many architectures have very limited sets of SW bits, and even when they exist we've spent them on things like UFFD_WP.
But the more I think about the "bit doesn't even have to be in the page tables", the more I think maybe that's the solution.
A bit in the 'struct page' itself.
Exactly what I am prototyping right now.
For hugepages, you'd have to distribute said bit when you split the hugepage.
Yes, that's one tricky part ...
But other than that it looks quite simple: anybody who does a virtual copy will inevitably be messing with the page refcount, so clearing the "exclusive ownership" bit wouldn't be costly: the 'struct page' cacheline is already getting dirtied.
Or what was your model you were implying you were thinking about in your other email? You said
I'm playing with the idea of not always setting the bit during COW but only on GUP request (either manually if possible or via FOLL_UNSHARE). That's a bit more tricky but allows for decoupling that approach completely from the page_pin() counter.
fork() is allowed to clear the bit if page_count() == 1 and share the page. So no GUP -> no fork() performance changes (!). Otherwise the bit can only vanish if we swapout/migrate the page, in which case there are no additional GUP references on the page that rely on it!
The bit can be set directly if we have to copy the page in the fault handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fall back to the slow variant if the bit isn't already set)
So the bit can "vanish" whenever there is no additional reference on the page. GUP syncs against fork() and can thereby set the bit/request to set the bit.
I'm trying to decouple it completely from the page_pin() counter to also be able to handle FOLL_GET (O_DIRECT reproducers unfortunately) correctly.
Not set in stone, just an idea of what I'm playing with right now ... and I have to triple-check if
* page is PTE mapped in the page table I'm walking
* page_count() == 1
really means that "this is the only reference". I do strongly believe so .. :)
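To make the GUP side of this concrete, a rough sketch of the rule: FAULT_FLAG_UNSHARE is the flag introduced by this series, while the "exclusive" page marker and helper names are hypothetical:

#include <linux/mm.h>

bool PageAnonExclusive(struct page *page);	/* hypothetical marker */

/*
 * Read-only pinning of an anonymous page: only allowed once the page is
 * marked exclusive; otherwise request an unshare fault and retry.
 */
static int sketch_anon_ro_pin_allowed(struct page *page,
				      unsigned int *fault_flags)
{
	if (PageAnonExclusive(page))
		return 0;		/* safe: cannot be COWed away anymore */

	*fault_flags |= FAULT_FLAG_UNSHARE;	/* flag from this series */
	return -EAGAIN;			/* caller retries after the fault */
}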
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
On 19.12.21 18:44, Linus Torvalds wrote:
David, you said that you were working on some alternative model. Is it perhaps along these same lines below?
I was thinking that a bit in the page tables to say "this page is exclusive to this VM" would be a really simple thing to deal with for fork() and swapout and friends.
But we don't have such a bit in general, since many architectures have very limited sets of SW bits, and even when they exist we've spent them on things like UFFD_WP.
But the more I think about the "bit doesn't even have to be in the page tables", the more I think maybe that's the solution.
A bit in the 'struct page' itself.
Exactly what I am prototyping right now.
For hugepages, you'd have to distribute said bit when you split the hugepage.
Yes, that's one tricky part ...
That part shouldn't be that tricky ...
Can we get rid of ->mapcount altogether? Three states:
- Not mapped
- Mapped exactly once
- Possibly mapped more than once
I appreciate "Not mapped" is not a state that anon pages can meaningfully have (maybe when they go into the swap cache?)
And this information would only be present on the head page (ie stored per folio). If one VMA has multiple PTEs that map the same folio, then hopefully that only counts as mapped once.
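For illustration, the three states could be expressed as something like the following; this is only an assumption about the shape of the idea, not a real patch:

/* Sketch only: a per-folio tri-state replacing the per-page mapcount. */
enum folio_map_state {
	FOLIO_NOT_MAPPED,	/* not mapped anywhere */
	FOLIO_MAPPED_ONCE,	/* mapped by exactly one page table */
	FOLIO_MAPPED_SHARED,	/* possibly mapped more than once */
};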
I must admit about half this conversation is going over my head. I need more time to understand all the constraints than exists between emails :-)
On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox willy@infradead.org wrote:
Can we get rid of ->mapcount altogether? Three states:
- Not mapped
- Mapped exactly once
- Possibly mapped more than once
I don't think even that is useful. We should get rid of mapcount entirely.
It doesn't actually help to know "mapped exactly once", exactly because even when that's true, there may be non-mapped references to the page.
Mapped references really aren't that special in general.
One case where it *can* be special is on virtually indexed cache architectures, where "is this mapped anywhere else" can be an issue for cache flushing.
There the page_mapcount() can actually really matter, but it's such an odd case that I'm not convinced it should be something the kernel VM code should bend over backwards for.
And the count could be useful for 'rmap' operations, where you can stop walking the rmap once you've found all mapped cases (page migration being one case of this). But again, I'm not convinced the serialization is worth it, or that it's a noticeable win.
However, I'm not 100% convinced it's worth it even there, and I'm not sure we necessarily use it there.
So in general, I think page_mapcount() can be useful as a count for those things that are _literally_ about "where is this page mapped". Page migration, virtual cache coherency, things like that can literally be about "how many different virtual mappings can we find".
It's just that pages can have a number of non-mapped users too, so mapcount isn't all that meaningful in general.
And you can look it up with rmap too, and so I do think it would be worth discussing whether we really should strive to maintain 'mapcount' at all.
I appreciate "Not mapped" is not a state that anon pages can meaningfully have (maybe when they go into the swap cache?)
Absolutely. And we can keep references around to an anonymous page even without it having any mapping or swap cache at all (ie "gup + unmap").
So "Not mapped at all" is a possible case, without the page being free'd.
Linus
On Sun, Dec 19, 2021 at 01:27:07PM -0800, Linus Torvalds wrote:
On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox willy@infradead.org wrote:
Can we get rid of ->mapcount altogether? Three states:
- Not mapped
- Mapped exactly once
- Possibly mapped more than once
I don't think even that is useful. We should get rid of mapcount entirely.
It doesn't actually help to know "mapped exactly once", exactly because even when that's true, there may be non-mapped references to the page.
Mapped references really aren't that special in general.
One case where it *can* be special is on virtually indexed cache architectures, where "is this mapped anywhere else" can be an issue for cache flushing.
There the page_mapcount() can actually really matter, but it's such an odd case that I'm not convinced it should be something the kernel VM code should bend over backwards for.
And the count could be useful for 'rmap' operations, where you can stop walking the rmap once you've found all mapped cases (page migration being one case of this). But again, I'm not convinced the serialization is worth it, or that it's a noticeable win.
However, I'm not 100% convinced it's worth it even there, and I'm not sure we necessarily use it there.
So in general, I think page_mapcount() can be useful as a count for those things that are _literally_ about "where is this page mapped". Page migration, virtual cache coherency, things like that can literally be about "how many different virtual mappings can we find".
It's just that pages can have a number of non-mapped users too, so mapcount isn't all that meaningful in general.
And you can look it up with rmap too, and so I do think it would be worth discussing whether we really should strive to maintain 'mapcount' at all.
Yes, agreed, I was thinking that we could use "not mapped at all" as an optimisation to avoid doing rmap walks. eg __unmap_and_move().
Perhaps more interestingly in truncate_cleanup_page():

	if (page_mapped(page))
		unmap_mapping_page(page);

where we can skip the i_mmap rbtree walk if we know the page isn't mapped. I'd be willing to give up that optimisation if we had "this page was never mapped" (ie if page_mapped() was allowed to return false positives).
On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox willy@infradead.org wrote:
Yes, agreed, I was thinking that we could use "not mapped at all" as an optimisation to avoid doing rmap walks. eg __unmap_and_move().
So the thing is, it's a very dodgy optimization for a rather simple reason: what if somebody pages the page in?
So even "not mapped at all" is questionable.
You have to check that it's also not a swapcache page, and hold the page lock for that check, at the very least.
And by then, you're really in a very unusual situation - and my gut feel says not one worth optimizing for (because anon pages are _usually_ mapped at least once).
But I dunno - it might depend on your load. Maybe you have some very special load that happens to trigger this case a lot?
Linus
On Sun, Dec 19, 2021 at 01:53:36PM -0800, Linus Torvalds wrote:
On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox willy@infradead.org wrote:
Yes, agreed, I was thinking that we could use "not mapped at all" as an optimisation to avoid doing rmap walks. eg __unmap_and_move().
So the thing is, it's a very dodgy optimization for a rather simple reason: what if somebody pages the page in?
So even "not mapped at all" is questionable.
You have to check that it's also not a swapcache page, and hold the page lock for that check, at the very least.
And by then, you're really in a very unusual situation - and my gut feel says not one worth optimizing for (because anon pages are _usually_ mapped at least once).
I'd like to get rid of ->mapcount for file pages too. And those are definitely never mapped in the majority of cases.
On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox willy@infradead.org wrote:
I'd like to get rid of ->mapcount for file pages too. And those are definitely never mapped in the majority of cases.
Fair enough.
You'd probably be better off checking "is this mapping mapped" though. Because otherwise you have to get the page lock to serialize each page.
Linus
On Sun, Dec 19, 2021 at 02:12:04PM -0800, Linus Torvalds wrote:
On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox willy@infradead.org wrote:
I'd like to get rid of ->mapcount for file pages too. And those are definitely never mapped in the majority of cases.
Fair enough.
You'd probably be better off checking "is this mapping mapped" though. Because otherwise you have to get the page lock to serialize each page.
Truncate already has the page locked, eg

truncate_inode_pages_range()
  find_lock_entries()
  truncate_cleanup_page()
    if (page_mapped(page))
      unmap_mapping_page(page)
I think anyone calling unmap_mapping_page() really ought to have the page lock. Oh, we actually have an assert already to that effect ;-)

	VM_BUG_ON(!PageLocked(page));
On Sun, Dec 19, 2021 at 09:12:01PM +0000, Matthew Wilcox wrote:
Can we get rid of ->mapcount altogether? Three states:
This might be a step in the right direction?
From f723fb7cf2519428eee75e9e779907f80258f302 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" willy@infradead.org Date: Mon, 20 Dec 2021 13:24:04 -0500 Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount
All callers of reuse_swap_page() currently pass NULL, indicating that they don't use the painstakingly calculated map_swapcount. That lets us further remove it from page_trans_huge_map_swapcount() and page_trans_huge_mapcount().
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
---
I don't know if this is a helpful patch to add at this point or whether it will get in the way of David's much more extensive work.
 include/linux/mm.h   | 10 +++-------
 include/linux/swap.h |  6 +++---
 mm/huge_memory.c     | 32 +++++++++++---------------------
 mm/khugepaged.c      |  2 +-
 mm/memory.c          |  2 +-
 mm/swapfile.c        | 33 +++++++++++----------------------
 6 files changed, 30 insertions(+), 55 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..286eb4155c80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -840,19 +840,15 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
-int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
+int page_trans_huge_mapcount(struct page *page);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
-static inline int page_trans_huge_mapcount(struct page *page,
-		int *total_mapcount)
+static inline int page_trans_huge_mapcount(struct page *page)
 {
-	int mapcount = page_mapcount(page);
-	if (total_mapcount)
-		*total_mapcount = mapcount;
-	return mapcount;
+	return page_mapcount(page);
 }
 #endif
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..1d38d9475c4d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -514,7 +514,7 @@ extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
-extern bool reuse_swap_page(struct page *, int *);
+extern bool reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
@@ -680,8 +680,8 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page, total_map_swapcount) \
-	(page_trans_huge_mapcount(page, total_map_swapcount) == 1)
+#define reuse_swap_page(page) \
+	(page_trans_huge_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..6ed86a8f6a5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,7 +1322,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	 * We can only reuse the page if nobody else maps the huge page or it's
 	 * part.
 	 */
-	if (reuse_swap_page(page, NULL)) {
+	if (reuse_swap_page(page)) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2542,38 +2542,28 @@ int total_mapcount(struct page *page)
  * need full accuracy to avoid breaking page pinning, because
  * page_trans_huge_mapcount() is slower than page_mapcount().
  */
-int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
+int page_trans_huge_mapcount(struct page *page)
 {
-	int i, ret, _total_mapcount, mapcount;
+	int i, ret;
 
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 
-	if (likely(!PageTransCompound(page))) {
-		mapcount = atomic_read(&page->_mapcount) + 1;
-		if (total_mapcount)
-			*total_mapcount = mapcount;
-		return mapcount;
-	}
+	if (likely(!PageTransCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
 
 	page = compound_head(page);
 
-	_total_mapcount = ret = 0;
+	ret = 0;
 	for (i = 0; i < thp_nr_pages(page); i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
+		int mapcount = atomic_read(&page[i]._mapcount) + 1;
 		ret = max(ret, mapcount);
-		_total_mapcount += mapcount;
 	}
-	if (PageDoubleMap(page)) {
+
+	if (PageDoubleMap(page))
 		ret -= 1;
-		_total_mapcount -= thp_nr_pages(page);
-	}
-	mapcount = compound_mapcount(page);
-	ret += mapcount;
-	_total_mapcount += mapcount;
-	if (total_mapcount)
-		*total_mapcount = _total_mapcount;
-	return ret;
+
+	return ret + compound_mapcount(page);
 }
 
 /* Racy check whether the huge page can be split */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1a..11794bdf513a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -681,7 +681,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		if (!pte_write(pteval) && PageSwapCache(page) &&
-		    !reuse_swap_page(page, NULL)) {
+		    !reuse_swap_page(page)) {
 			/*
 			 * Page is in the swap cache and cannot be re-used.
 			 * It cannot be collapsed into a THP.
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..dd85fd07cb24 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..bc0810c3b2a5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1601,31 +1601,30 @@ static bool page_swapped(struct page *page)
 	return false;
 }
 
-static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+static int page_trans_huge_map_swapcount(struct page *page,
 					 int *total_swapcount)
 {
-	int i, map_swapcount, _total_mapcount, _total_swapcount;
+	int i, map_swapcount, _total_swapcount;
 	unsigned long offset = 0;
 	struct swap_info_struct *si;
 	struct swap_cluster_info *ci = NULL;
 	unsigned char *map = NULL;
-	int mapcount, swapcount = 0;
+	int swapcount = 0;
 
 	/* hugetlbfs shouldn't call it */
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
-		mapcount = page_trans_huge_mapcount(page, total_mapcount);
 		if (PageSwapCache(page))
 			swapcount = page_swapcount(page);
 		if (total_swapcount)
 			*total_swapcount = swapcount;
-		return mapcount + swapcount;
+		return swapcount + page_trans_huge_mapcount(page);
 	}
 
 	page = compound_head(page);
 
-	_total_mapcount = _total_swapcount = map_swapcount = 0;
+	_total_swapcount = map_swapcount = 0;
 	if (PageSwapCache(page)) {
 		swp_entry_t entry;
 
@@ -1639,8 +1638,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 	if (map)
 		ci = lock_cluster(si, offset);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
-		_total_mapcount += mapcount;
+		int mapcount = atomic_read(&page[i]._mapcount) + 1;
 		if (map) {
 			swapcount = swap_count(map[offset + i]);
 			_total_swapcount += swapcount;
@@ -1648,15 +1646,9 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 		map_swapcount = max(map_swapcount, mapcount + swapcount);
 	}
 	unlock_cluster(ci);
-	if (PageDoubleMap(page)) {
+	if (PageDoubleMap(page))
 		map_swapcount -= 1;
-		_total_mapcount -= HPAGE_PMD_NR;
-	}
-	mapcount = compound_mapcount(page);
-	map_swapcount += mapcount;
-	_total_mapcount += mapcount;
-	if (total_mapcount)
-		*total_mapcount = _total_mapcount;
+	map_swapcount += compound_mapcount(page);
 	if (total_swapcount)
 		*total_swapcount = _total_swapcount;
 
@@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
  * reuse_swap_page() returns false, but it may be always overwritten
  * (see the other implementation for CONFIG_SWAP=n).
  */
-bool reuse_swap_page(struct page *page, int *total_map_swapcount)
+bool reuse_swap_page(struct page *page)
 {
-	int count, total_mapcount, total_swapcount;
+	int count, total_swapcount;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return false;
-	count = page_trans_huge_map_swapcount(page, &total_mapcount,
-					      &total_swapcount);
-	if (total_map_swapcount)
-		*total_map_swapcount = total_mapcount + total_swapcount;
+	count = page_trans_huge_map_swapcount(page, &total_swapcount);
 	if (count == 1 && PageSwapCache(page) &&
 	    (likely(!PageTransCompound(page)) ||
 	     /* The remaining swap count will be freed soon */
On Mon, Dec 20, 2021 at 06:37:30PM +0000, Matthew Wilcox wrote:
+++ b/mm/memory.c
@@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;

[...]

@@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
  * reuse_swap_page() returns false, but it may be always overwritten
  * (see the other implementation for CONFIG_SWAP=n).
  */
-bool reuse_swap_page(struct page *page, int *total_map_swapcount)
+bool reuse_swap_page(struct page *page)
 {
-	int count, total_mapcount, total_swapcount;
+	int count, total_swapcount;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return false;
-	count = page_trans_huge_map_swapcount(page, &total_mapcount,
-					      &total_swapcount);
-	if (total_map_swapcount)
-		*total_map_swapcount = total_mapcount + total_swapcount;
+	count = page_trans_huge_map_swapcount(page, &total_swapcount);
 	if (count == 1 && PageSwapCache(page) &&
 	    (likely(!PageTransCompound(page)) ||
 	     /* The remaining swap count will be freed soon */
It makes me wonder if reuse_swap_page() can also be based on refcount instead of mapcount?
On Mon, Dec 20, 2021 at 10:53 AM Matthew Wilcox willy@infradead.org wrote:
It makes me wonder if reuse_swap_page() can also be based on refcount instead of mapcount?
I suspect it doesn't even need refcount.
For regular pages, after we've copied the page, all we do right now is
	if (page_copied)
		free_swap_cache(old_page);
which is basically just an optimistic trylock_page() followed by try_to_free_swap().
And that then pretty much simply checks "are there any swap users left" and deletes it from the swap cache if not.
The "free_swap_cache()" thing is actually just an optimization to avoid having memory pressure do it later. So it doesn't have to be exact.
In fact, I thought that swap is so unusual that it's not even needed at all, but I was wrong. See how this was re-introduced in commit f4c4a3f48480 ("mm: free idle swap cache page after COW") because yes, some loads still have swap space allocated.
In theory, it would probably be a good idea at COW time to see if the page ref is 2, and if it's a swap cache page, and try to do that swap cache removal even earlier, so that the page actually gets re-used (instead of copied and then the swap entry removed).
But swap is such a non-issue these days that I doubt it matters, and it's probably better to keep the swap handling in the unusual path.
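A minimal sketch of that "in theory" idea, under the assumption that it would be keyed off a refcount of exactly two (one for the PTE mapping, one for the swap cache); this is not existing or proposed code:

#include <linux/mm.h>
#include <linux/swap.h>

/*
 * At COW time: if the only other reference is the swap cache, try to
 * drop it so the page becomes exclusively owned and can be reused
 * instead of copied.
 */
static bool sketch_try_make_swapcache_page_exclusive(struct page *page)
{
	bool reusable = false;

	if (!PageSwapCache(page) || page_count(page) != 2)
		return false;
	if (!trylock_page(page))
		return false;
	if (try_to_free_swap(page))
		reusable = page_count(page) == 1;
	unlock_page(page);
	return reusable;
}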
So mapcount and refcount aren't what matters for the swap cache.
The swap count obviously *does* matter - because it means that some mapping has a reference to this swap entry (not as a page, but as an actual swap pointer).
But the mapcount is irrelevant - any users that have the swap page actually mapped don't actually need it to be a swapcache page.
Even the refcount doesn't really matter, afaik. The only "refcount" we care about is that swapcount - that's what actually reference counts the swap cases.
try_to_free_swap() does check for one particular kind of reference: it does a check for PageWriteback(). We don't want to remove the thing from the swap cache if it's under active IO.
(This codepath does need the page lock, though, thus all those "page_trylock()" things).
Linus
On Mon, Dec 20, 2021 at 10:37 AM Matthew Wilcox willy@infradead.org wrote:
This might be a step in the right direction?
Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount
Well, that patch seems to be a no-op removal of dead code, so absolutely yes.
That said, I think it would be good to split it up. I looked at that patch and went "is that really a no-op" to the point of recreating it.
I think it would be good to make it multiple patches that are each individually trivial. IOW, start with
(1) remove second argument to reuse_swap_page() that is always NULL, without making any other changes
(2) that now made 'total_mapcount' unused in reuse_swap_page(), remove it as an argument from page_trans_huge_map_swapcount()
(3) that now made 'total_mapcount' unused in page_trans_huge_mapcount(), remove it as an argument there too.
because as it stands, that patch of yours looks like it is changing a lot of things, and I think it would be clearer to remove one thing at a time as it becomes obviously not used.
Hmm?
Linus
On Mon, Dec 20, 2021 at 11:15:14AM -0800, Linus Torvalds wrote:
Well, that patch seems to be a no-op removal of dead code, so absolutely yes.
That said, I think it would be good to split it up. I looked at that patch and went "is that really a no-op" to the point of recreating it.
I think it would be good to make it multiple patches that are each individally trivial. IOW, start with
(1) remove second argument to reuse_swap_page() that is always NULL, without making any other changes
(2) that now made 'total_mapcount' unused in reuse_swap_page(), remove it as an argument from page_trans_huge_map_swapcount()
(3) that now made 'total_mapcount' unused in page_trans_huge_mapcount(), remove it as an argument there too.
Hah, that was actually how I did it originally (without actually committing at each step, and with a few "Oh, hang on, now we can avoid calculating this too" stops and restarts along the way), but I thought it all hung together logically as a single change. It's hard to see things from the other person's perspective at times.
On Mon, Dec 20, 2021 at 1:02 PM Matthew Wilcox willy@infradead.org wrote:
Hah, that was actually how I did it originally (without actually committing at each step, and with a few "Oh, hang on, now we can avoid calculating this too" stops and restarts along the way), but I thought it all hung together logically as a single change. It's hard to see things from the other person's perspective at times.
In just about any other area, I wouldn't mind one bigger patch that just removes code that isn't used.
But when it's in the vm code, and it's pretty grotty, I do prefer seeing three patches that individually are much easier to see that "yeah, this doesn't actually change anything at all".
The combined patch may be exactly the same thing, it's just much harder to see that "oh, now it's not used any more".
That was perhaps especially true since a number of the changes also ended up doing statement simplification when the old layout made no sense any more with part of the results not used.
So your 3-patch series was much easier to look at and go "Yeah, I believe each of these patches is a no-op".
So ACK on all those patches.
Linus
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fall back to the slow variant if the bit isn't already set)
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
I think the point of Linus's plan really is that the new bit is derived from page_count, we get to set the new bit when we observe page_count==1 in various situations and we clear the new bit whenever we write protect with the intent to copy.
GUP doesn't interact with this bit. A writable page would still be the second way you can say "you clearly have full ownership of the page", so GUP just checks writability and bumps the refcount. The challenge of GUP remains to sanely sequence it with things that are doing WP.
The elevated GUP refcount prevents the page from getting the bit set again, regardless of what happens to it.
Then on the WP sides.. Obviously we clear the bit when applying a WP for copy. So all the bad GUP cases are halted now, as with a cleared bit and a != 1 refcount COW must happen.
Then, it is also the case that most often a read-only page will have this bit cleared, UFFD WP being the exception.
UFFD WP works fine as it will have the bit set in the cases we care about and COW will not happen.
If the bit is not set then everything works as it does today and you get extra COWs. We still have to fix the places that are missing the extra COWs by making them check the page refcount.
It seems this new bit is acting as a 'COW disable', so, the accuracy of COW vs GUP&speculative pagerefs now relies on setting the bit as aggressively as possible when it is safe and cheap to do so.
If I got it right this is why it is not just mapcount reduced to 1 bit. It is quite different, even though "this VM owns the page outright" sure sounds like "mapcount == 1"..
It seems like an interesting direction - the security properties seem good as we only have to take care of sites applying WP to decide what to do with the extra bit, and all places that set the bit to 1 do so after testing refcount under various locks preventing PTE WP.
That just leaves the THP splitting.. I suppose we get the PTL, then compute the current value of the new bit based on refcount and diffuse it to all tail pages, then update the PMD and release the PTL. Safe against concurrent WP - don't need DoubleMap horrors because it isn't a counter.
Not set in stone, just an idea I'm playing with right now ... and I have to triple-check if
- page is PTE mapped in the page table I'm walking
- page_count() == 1
Really means that "this is the only reference". I do strongly believe so ... :)
AFAIK the only places that can break this are places putting struct page memory into special PTEs. Which is horrific and is just bugs, but I think I've seen it from time to time :(
ZONE_DEVICE is also messed up, of course, but that is just more reasons ZONE_DEVICE refcounting needs fixing and you should ignore it.
Jason
On Mon, Dec 20, 2021 at 09:03:12PM -0400, Jason Gunthorpe wrote:
That just leaves the THP splitting.. I suppose we get the PTL, then compute the current value of the new bit based on refcount and diffuse it to all tail pages, then update the PMD and release the PTL. Safe against concurrent WP - don't need DoubleMap horrors because it isn't a counter.
One of the things I've been trying to figure out is how we do can_split_huge_page(). Maybe an rmap walk to figure out how many refcounts we would subtract if we did unmap it from everywhere it's currently mapped? (just to be clear, we call unmap_page() as the next thing, so I don't mind warming up the rbtree cachelines if it's mapped anywhere)
On 21.12.21 02:03, Jason Gunthorpe wrote:
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
handler (COW or unshare). Outside of COW/Unshare code, the bit can only be set if page_count() == 1 and we sync against fork(). (and that's the problem for gup-fast-only that I'm investigating right now, because it would then always have to fall back to the slow variant if the bit isn't already set)
[in the meantime I figured out which pageflag we can reuse for anon pages, which is at least one step in the right direction]
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
And exactly that is to be figured out.
Note that I am also trying to make any kind of R/O pin on an anonymous page work as expected, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
In the pagefault handler we can then always reuse a PageAnonExclusive() page, because we know it's exclusive and it will stay exclusive because concurrent fork() isn't possible.
I think the point of Linus's plan really is that the new bit is derived from page_count, we get to set the new bit when we observe page_count==1 in various situations and we clear the new bit whenever we write protect with the intent to copy.
Here is one problem I'm fighting with:
Assume we set the bit whenever we create a new anon page (either due to COW, ordinary fault, unsharing request, ..., even if it's mapped R/O first). We know the page is exclusive at that point because we created it and fork() could not happen yet.
fork() is the only code that can share the page between processes and turn it non-exclusive.
We can only clear the bit during fork() -- to turn the exclusive page into a shared one and map it R/O into both processes -- when we are sure that *nobody* concurrently takes a reference on the page that would be problematic (-> GUP).
So to clear the bit during fork, we have to
(1) Check against page_count == 1
(2) Synchronize against GUP
(2) is easy using the mmap_lock and the mm->write_protect_seq
BUT, it would mean that whenever we fork() and there is one additional reference on a page (even if it's from the swapcache), we would slow down fork() even if there was never any GUP. This would apply to any process out there that does a fork() ...
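To spell that out, the per-page decision fork() would have to make under such a scheme looks roughly like this (hypothetical helper, using the proposed PageAnonExclusive bit; assumes the PT lock is held and that copy_page_range() is wrapped in mm->write_protect_seq as today):

static bool fork_can_share_anon_page(struct page *page)
{
    if (!PageAnonExclusive(page))
        return true;                /* already shared, nothing to do */
    if (page_count(page) != 1)
        return false;               /* could be pinned -> must copy */
    ClearPageAnonExclusive(page);   /* share it, map R/O into both */
    return true;
}

Any additional reference -- including one from the swapcache -- would force the copy, which is exactly the fork() slowdown described above.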
So the idea is to mark a page only exclusive as soon as someone needs the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN or selected FOLL_GET like O_DIRECT). This can happen in my current approach in two ways:
(1) Set the bit when we know we are the only users
We can set PageAnonExclusive() in case *we sync against fork* and the page cannot get unmapped (pt lock) when:
* The page is mapped writable
* The page is mapped readable and page_count == 1
This should work during ordinary GUP in many cases.
If we cannot set the page exclusive, we have to trigger a page fault.
(2) During pagefaults when FOLL_FAULT_UNSHARE is set.
GUP will set FOLL_FAULT_UNSHARE for a pagefault when required (again, e.g., for FOLL_PIN or selected FOLL_GET users) and manual setting of the bit failed. The page fault will then try once again to set the bit if there is a page mapped, and if that fails, do the COW/unshare and set the bit.
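Concretely, (1) could look roughly like this in the GUP slow path (hypothetical helper; assumes the PT lock is held so the page cannot get unmapped, and that we synchronize against fork() as described above):

static bool gup_try_mark_anon_exclusive(struct page *page, pte_t pte)
{
    if (PageAnonExclusive(page))
        return true;
    if (pte_write(pte) || page_count(page) == 1) {
        SetPageAnonExclusive(page);
        return true;
    }
    return false;   /* fall back to (2): a FOLL_FAULT_UNSHARE fault */
}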
The above should work fairly reliably with GUP. But indeed, gup-fast-only is the problem. I'm still investigating what kind of lightweight synchronization we could do against fork() such that we wouldn't try setting a page PageAnonExclusive() while fork() concurrently shares the page.
We could eventually use the page lock and do a try_lock(), both in fork() and in gup-fast-only. fork() would only clear the bit if the try_lock() succeeded. gup-fast-only would only be able to set the bit and not fall back to the slow path if try_lock() succeeded.
But I'm still investigating if there are better alternatives ...
GUP doesn't interact with this bit. A writable page would still be the second way you can say "you clearly have full ownership of the page", so GUP just checks writability and bumps the refcount. The challenge of GUP remains to sanely sequence it with things that are doing WP.
The elevated GUP refcount prevents the page from getting the bit set again, regardless of what happens to it.
Then on the WP sides.. Obviously we clear the bit when applying a WP for copy. So all the bad GUP cases are halted now, as with a cleared bit and a != 1 refcount COW must happen.
Then, it is also the case that most often a read-only page will have this bit cleared, UFFD WP being the exception.
UFFD WP works fine as it will have the bit set in the cases we care about and COW will not happen.
If the bit is not set then everything works as it does today and you get extra COWs. We still have to fix the places that are missing the extra COWs by making them check the page refcount.
It seems this new bit is acting as a 'COW disable', so, the accuracy of COW vs GUP&speculative pagerefs now relies on setting the bit as aggressively as possible when it is safe and cheap to do so.
But we really want to avoid degrading fork() for everybody that doesn't do heavy GUP ...
If I got it right this is why it is not just mapcount reduced to 1 bit. It is quite different, even though "this VM owns the page outright" sure sounds like "mapcount == 1"..
It seems like an interesting direction - the security properties seem good as we only have to take care of sites applying WP to decide what to do with the extra bit, and all places that set the bit to 1 do so after testing refcount under various locks preventing PTE WP.
That just leaves the THP splitting.. I suppose we get the PTL, then compute the current value of the new bit based on refcount and diffuse it to all tail pages, then update the PMD and release the PTL. Safe against concurrent WP - don't need DoubleMap horrors because it isn't a counter.
Not set in stone, just an idea I'm playing with right now ... and I have to triple-check if
- page is PTE mapped in the page table I'm walking
- page_count() == 1
Really means that "this is the only reference". I do strongly believe so ... :)
AFAIK the only places that can break this are places putting struct page memory into special PTEs. Which is horrific and is just bugs, but I think I've seen it from time to time :(
As we only care about anon pages, I think that doesn't apply. At least that's what I hope.
On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
And exactly that is to be figured out.
Note that I am also trying to make any kind of R/O pin on an anonymous page work as expected, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
Why? AFAIK we don't have bugs here. If the page is RO and has an elevated refcount it cannot be 'PageAnonExclusive' and so any place that wants to drop the WP just cannot. What is the issue?
BUT, it would mean that whenever we fork() and there is one additional reference on a page (even if it's from the swapcache), we would slow down fork() even if there was never any GUP. This would apply to any process out there that does a fork() ...
You mean because we'd copy?
Is this common? Linus' prior email was talking as though swap is so rare we shouldn't optimize for it?
So the idea is to mark a page only exclusive as soon as someone needs the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN or selected FOLL_GET like O_DIRECT). This can happen in my current approach in two ways:
(1) Set the bit when we know we are the only users
We can set PageAnonExclusive() in case *we sync against fork* and the page cannot get unmapped (pt lock) when:
- The page is mapped writable
- The page is mapped readable and page_count == 1
I'm still not sure I see that all this complexity is netting a gain?
If we cannot set the page exclusive, we have to trigger a page fault.
(2) During pagefaults when FOLL_FAULT_UNSHARE is set.
Why do we need FOLL_FAULT_UNSHARE? AFAICT that was part of this series because of mapcount; once the hugetlb COW is fixed to use refcount properly, as Linus showed, the bugs this was trying to fix go away.
And as discussed before it is OK if READ gup becomes incoherent, that is its defined semantic.
The above should work fairly reliably with GUP. But indeed, gup-fast-only is the problem. I'm still investigating what kind of lightweight synchronization we could do against fork() such that we wouldn't try setting a page PageAnonExclusive() while fork() concurrently shares the page.
We could eventually use the page lock and do a try_lock(), both in fork() and in gup-fast-only. fork() would only clear the bit if the try_lock() succeeded. gup-fast-only would only be able to set the bit and not fall back to the slow path if try_lock() succeeded.
I suspect that is worse than just having fork clear the bit and leave GUP as-is. A trylock is an atomic; clearing PageAnonExclusive does not need to be atomic, it is protected by the PTL.
Then on the WP sides.. Obviously we clear the bit when applying a WP for copy. So all the bad GUP cases are halted now, as with a cleared bit and a != 1 refcount COW must happen.
But we really want to avoid degrading fork() for everybody that doesn't do heavy GUP ...
fork() already has to dirty the struct page cache line for refcount, setting a flag seems minor at that point? At least we shouldn't discard this nice understandable approach without a measurement....
Remember fork is already incrementing mapcount so if we kill mapcount it is a win for fork to replace the mapcount atomic with a non-atomic flag.
AFAIK the only places that can break this are places putting struct page memory into special PTEs. Which is horrific and is just bugs, but I think I've seen it from time to time :(
As we only care about anon pages, I think that doesn't apply. At least that's what I hope.
You are optimistic :)
Jason
On 21.12.21 15:28, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
And exactly that is to be figured out.
Note that I am also trying to make any kind of R/O pin on an anonymous page work as expected, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
Why? AFAIK we don't have bugs here. If the page is RO and has an elevated refcount it cannot be 'PageAnonExclusive' and so any place that wants to drop the WP just cannot. What is the issue?
Sure it can.
1. Map page R/W
2. Pin it R/W
3. Swapout
4. Read access
Page is now mapped R/O and *has to be* marked PageAnonExclusive(), to properly skip the COW fault. That's literally 60% of the reproducers we have that need fixing.
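In userspace terms the sequence is roughly the following (sketch only; the R/W pin could come from O_DIRECT, vmsplice(), RDMA, ... -- pin_buffer_rw() below is just a stand-in for whatever takes that pin):

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 1;                           /* 1. map + populate R/W        */
    pin_buffer_rw(p, 4096);             /* 2. page gets pinned R/W      */
    madvise(p, 4096, MADV_PAGEOUT);     /* 3. swap it out               */
    (void)*(volatile char *)p;          /* 4. read access -> mapped R/O */

At step 4 the page is faulted back from the swapcache and mapped R/O; if it is not marked exclusive at that point, a later write fault will COW and disconnect the mapping from the pinned page.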
But what I think you actually mean is if we want to get R/O pins right.
BUT, it would mean that whenever we fork() and there is one additional reference on a page (even if it's from the swapcache), we would slow down fork() even if there was never any GUP. This would apply to any process out there that does a fork() ...
You mean because we'd copy?
Yes.
Is this common? Linus' prior email was talking as though swap is so rare we shouldn't optimize for it?
At least in the enterprise segment having swap enabled is mostly a hard documented requirement. On customer installations swap is still common, and is even getting replaced by zswap, which is enabled automatically in many installations ...
So in the world I live and work in, swap is used frequently.
So the idea is to mark a page only exclusive as soon as someone needs the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN or selected FOLL_GET like O_DIRECT). This can happen in my current approach in two ways:
(1) Set the bit when we know we are the only users
We can set PageAnonExclusive() in case *we sync against fork* and the page cannot get unmapped (pt lock) when:
- The page is mapped writable
- The page is mapped readable and page_count == 1
I'm still not sure I see that all this complexity is netting a gain?
Avoid copy on fork().
If we cannot set the page exclusive, we have to trigger a page fault.
(2) During pagefaults when FOLL_FAULT_UNSHARE is set.
Why do we need FOLL_FAULT_UNSHARE? AFAICT that was part of this series because of mapcount; once the hugetlb COW is fixed to use refcount properly, as Linus showed, the bugs this was trying to fix go away.
The purpose of FOLL_FAULT_UNSHARE in the !mapcount version is to cleanly support R/O pins without the need for FOLL_WRITE.
And it's comparatively easy to add on top. This is not core of the complexity, really.
And as discussed before it is OK if READ gup becomes incoherent, that is its defined semantic.
And that's where I still disagree.
But anyhow, this is really more about FOLL_FAULT_UNSHARE, which is pretty easy and natural to add on top and just gets this right.
The above should work fairly reliably with GUP. But indeed, gup-fast-only is the problem. I'm still investigating what kind of lightweight synchronization we could do against fork() such that we wouldn't try setting a page PageAnonExclusive() while fork() concurrently shares the page.
We could eventually use the page lock and do a try_lock(), both in fork() and in gup-fast-only. fork() would only clear the bit if the try_lock() succeeded. gup-fast-only would only be able to set the bit and not fall back to the slow path if try_lock() succeeded.
I suspect that is worse than just having fork clear the bit and leave GUP as-is. A trylock is an atomic; clearing PageAnonExclusive does not need to be atomic, it is protected by the PTL.
There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now:
1) Whenever mapping an anonymous page R/W (after COW, during ordinary fault, on swapin), we mark the page exclusive. We must never lose the PageAnonExclusive bit, not during migration, not during swapout.
fork() will process the bit for each and every process, even if there was no GUP, and will copy if there are additional references.
2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We can lose the PageAnonExclusive bit during migration and swapout, because that can only happen when there are no additional references.
fork() will process the bit only if there was GUP. Ordinary fork() is left unchanged.
Getting R/O pins supported in the same way just means that we have to check on a R/O pin whether the page is PageAnonExclusive, and if that's not the case, trigger a FAULT_FLAG_UNSHARE fault. That's really the only "complexity" on top, which is really easy without the mapcount.
On Tue, Dec 21, 2021 at 04:19:33PM +0100, David Hildenbrand wrote:
Note that I am also trying to make any kind of R/O pin on an anonymous page work as expected, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
Why? AFAIK we don't have bugs here. If the page is RO and has an elevated refcount it cannot be 'PageAnonExclusive' and so any place that wants to drop the WP just cannot. What is the issue?
But what I think you actually mean is if we want to get R/O pins right.
What I meant was a page that is GUP'd RO, is not PageAnonExclusive and has an elevated refcount. Those cannot be transformed to PageAnonExclusive, or re-used during COW, but also they don't have problems today. Either places are like O_DIRECT read and are tolerant of a false COW, or they are broken like VFIO and should be using FOLL_FORCE|FOLL_WRITE, which turns them into a WRITE and then we know they get PageAnonExclusive.
So, the swap issue is fixed directly with PageAnonExclusive and no change to READ GUP is required, at least in your #1 scenario, AFAICT..
There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now:
1) Whenever mapping an anonymous page R/W (after COW, during ordinary fault, on swapin), we mark the page exclusive. We must never lose the PageAnonExclusive bit, not during migration, not during swapout.
I prefer this one as well.
It allows us to keep Linus's simple logic that refcount == 1 means always safe to re-use, no matter what.
And refcount != 1 goes on to consider the additional bit to decide what to do. The simple bit really means 'we know this page has one PTE so ignore the refcount for COW reuse decisions'.
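In do_wp_page() terms the reuse decision would then read roughly like this (heavily simplified sketch -- the real path also deals with the page lock, swapcache, KSM, ...; wp_page_reuse()/wp_page_copy() as in v5.16 mm/memory.c, PageAnonExclusive() being the proposed bit):

    if (PageAnon(page) &&
        (page_count(page) == 1 || PageAnonExclusive(page))) {
        SetPageAnonExclusive(page); /* we own it outright: keep/set the bit */
        return wp_page_reuse(vmf);
    }
    /* additional references and no bit: we must copy */
    return wp_page_copy(vmf);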
fork() will process the bit for each and every process, even if there was no GUP, and will copy if there are additional references.
Yes, just like it does today already for mapcount.
2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We can lose the PageAnonExclusive bit during migration and swapout, because that can only happen when there are no additional references.
I haven't thought of a way this is achievable.
At least not without destroying GUP fast..
Idea #2 is really a "this page is GUP'd" flag with some sneaky logic to clear it. That comes along with all the races too because as an idea it is fundamentally about GUP which runs without locks.
Jason
On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand david@redhat.com wrote:
On 21.12.21 02:03, Jason Gunthorpe wrote:
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
And exactly that is to be figured out.
So my preference would be to just always maintain the "exclusive to this VM" bit in the 'struct page', because that makes things easier to think about.
[ Of course - the bit could be reversed, and be a 'not exclusive to this VM' bit, semantically the set-or-cleared issue doesn't matter. Also, when I talk about some "exclusive to this VM" bit, I'm purely talking about pages that are marked PageAnon(), so the bit may or may not even exist for other pager types ]
And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test.
Note that I am trying to make also any kind of R/O pins on an anonymous page work as expected as well, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
I do think the existing "maybe_pinned()" logic is fine for that. The "exclusive to this VM" bit can be used to *help* that decision - because only an exclusive page can be pinned - but I don't think it should _replace_ that logic.
There's a quite fundamental difference between
(a) COW and GUP: these two operations _have_ to know that they get an exclusive page in order to re-use or look up the page respectively
(b) the pre-cow logic in fork() or the "add this to the swap cache" logic in vmscan that decides whether a page can be turned into a COW page by adding a reference count to it (whether due to fork or swap cache doesn't matter - the end result is the same).
The difference is that in (a) the thing we *have* to get right is whether a page is exclusively owned by that VM or not. We can COW too much, but we can never share a page unless it's exclusive. That's true whether it's pinned or not.
In (b), the "have to get right" is different. In (b), it's perfectly ok to COW an exclusive page and turn it non-exclusive. But we must never COW a pinned page.
So (a) and (b) are very different situations, and have different logic.
If we always maintain an exclusive bit for AnonPage pages, then both (a) and (b) can use that bit, but they'll use it very differently. In (a) we'll refuse to look it up and will force a 'handle_mm_fault()' to get an exclusive copy. And in (b), we just use it as a "we know only exclusive pages can be pinned", so it's just another check for page_needs_cow_for_dma(), the same way we currently check "MMF_HAS_PINNED" to narrow down the whole "page count indicates this may be a pinned page" question.
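For (b) that would indeed just be another early-out next to the existing checks -- roughly (sketch; page_needs_cow_for_dma() exists today, while the PageAnonExclusive() test and the wrapper name are made up for illustration):

static bool fork_needs_copy_for_dma(struct vm_area_struct *vma,
                                    struct page *page)
{
    /* only an exclusive page can have been pinned */
    if (!PageAnonExclusive(page))
        return false;
    return page_needs_cow_for_dma(vma, page);
}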
And the "page is exclusive" would actually be the *common* case for almost all pages. Any time you've written to a page and you haven't forked after the write (and it hasn't been turned into a swap page), that page would be exclusive to that VM.
Doesn't this seem like really straightforward semantics to maintain (and think about)?
I'd like the exclusive page bit to *not* be directly about "has this page been pinned" exactly because we already have too many special cases for GUP. It would be nicer to have a page bit that has very clear semantics even in the absence of GUP.
Linus
On 21.12.21 18:05, Linus Torvalds wrote:
On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand david@redhat.com wrote:
On 21.12.21 02:03, Jason Gunthorpe wrote:
I'm having a hard time imagining how gup_fast can maintain any sort of bit - it lacks all forms of locks so how can we do an atomic test and set between two pieces of data?
And exactly that is to be figured out.
So my preference would be to just always maintain the "exclusive to this VM" bit in the 'struct page', because that makes things easier to think about.
[ Of course - the bit could be reversed, and be a 'not exclusive to this VM' bit, semantically the set-or-cleared issue doesn't matter. Also, when I talk about some "exclusive to this VM" bit, I'm purely talking about pages that are marked PageAnon(), so the bit may or may not even exist for other pager types ]
Yes, whatever I say applies to PageAnon() only -- including the (overloaded) bit, now called PG_anon_exclusive.
And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test.
Right, the simplest way is simply failing GUP-fast if the bit isn't set, forcing it into the slow path. If that primarily happens for R/O pins after fork(), fine with me.
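Something along these lines in the gup-fast PTE walk, next to the existing checks (sketch; put_compound_head() and the pte_unmap label as in v5.16 mm/gup.c, PageAnonExclusive() being the proposed bit -- where exactly the bit lives for compound pages is still open):

    if (PageAnon(head) && !PageAnonExclusive(head)) {
        put_compound_head(head, 1, flags);
        goto pte_unmap;     /* force the slow path */
    }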
Note that I am also trying to make any kind of R/O pin on an anonymous page work as expected, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
I do think the existing "maybe_pinned()" logic is fine for that. The "exclusive to this VM" bit can be used to *help* that decision - because only an exclusive page can be pinned - but I don't think it should _replace_ that logic.
The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to FOLL_PIN unfortunately. So I'm *trying* to make it more generic such that such corner cases can be handled correctly as well. But yeah, I'll see where this goes ... O_DIRECT has to be fixed one way or the other.
John H. mentioned that he wants to look into converting that to FOLL_PIN. So maybe that will work eventually.
There's a quite fundamental difference between
(a) COW and GUP: these two operations _have_ to know that they get an exclusive page in order to re-use or look up the page respectively
(b) the pre-cow logic in fork() or the "add this to the swap cache" logic in vmscan that decides whether a page can be turned into a COW page by adding a reference count to it (whether due to fork or swap cache doesn't matter - the end result is the same).
The difference is that in (a) the thing we *have* to get right is whether a page is exclusively owned by that VM or not. We can COW too much, but we can never share a page unless it's exclusive. That's true whether it's pinned or not.
Exactly. Once a page is "exclusive" it must not get shared *unless* we can turn it into a "shared" page during fork().
There are some ugly corner cases that will require some thought.
In (b), the "have to get right" is different. In (b), it's perfectly ok to COW an exclusive page and turn it non-exclusive. But we must never COW a pinned page.
So (a) and (b) are very different situations, and have different logic.
If we always maintain an exclusive bit for AnonPage pages, then both (a) and (b) can use that bit, but they'll use it very differently. In (a) we'll refuse to look it up and will force a 'handle_mm_fault()' to get an exclusive copy. And in (b), we just use it as a "we know only exclusive pages can be pinned", so it's just another check for page_needs_cow_for_dma(), the same way we currently check "MMF_HAS_PINNED" to narrow down the whole "page count indicates this may be a pinned page" question.
If we use page_needs_cow_for_dma() for that purpose we can still have other references from our process referencing the page, including right now O_DIRECT ones. So the safest thing to do would be relying on the same logic as we do in the COW path regarding the pagecount ... but that might result in unnecessary copies as I mentioned.
It would be perfect if just anything that modifies page content would be using FOLL_PIN, unfortunately that's not reality ...
And the "page is exclusive" would actually be the *common* case for almost all pages. Any time you've written to a page and you haven't forked after the write (and it hasn't been turned into a swap page), that page would be exclusive to that VM.
Yes. Essentially every time we create a new anonymous page it would end up as exclusive. Or if we're in a fault and can convert the "exclusive" page into a "shared" page (essentially the COW reuse logic).
Doesn't this seem like really straightforward semantics to maintain (and think about)?
I'd like the exclusive page bit to *not* be directly about "has this page been pinned" exactly because we already have too many special cases for GUP. It would be nicer to have a page bit that has very clear semantics even in the absence of GUP.
What adds complexity to correctly maintain the "exclusive" state are at least:
* KSM (might be harder, have to think about it)
* migration (might be easy to just copy the bit)
* fork() with migration/swap entries that reference a page that is "exclusive". I'll have to think about that more.
So I have plenty of stuff to look into.
Just so we're on the same page what I'd like to achieve with anonymous pages:
1) If we take a R/W pin on an anonymous page, we will always pin an "exclusive page".
2) If we take a R/O pin on an anonymous page, we will always pin an "exclusive page", even if the page is mapped R/O.
3) "Exclusive" pages cannot be turned "shared" during fork (and ksm? :/ ) if pinned.
4) "Exclusive" pages can be turned "shared" during fork if not pinned.
5) "Exclusive" pages will never be COWed but remain there for all eternity, until unmapped ... well or until converted into "shared" again if possible
Ideally we'd handle O_DIRECT ... :(
2) is certainly the cherry on top. But it just means that R/O pins don't have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics.
I hope we agree that R/O pins don't have to be the weird kid if we can "get it right" with the same approach.
On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand david@redhat.com wrote:
I do think the existing "maybe_pinned()" logic is fine for that. The "exclusive to this VM" bit can be used to *help* that decision - because only an exclusive page can be pinned - but I don't think it should _replace_ that logic.
The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to FOLL_PIN unfortunately. So I'm *trying* to make it more generic such that such corner cases can be handled correctly as well. But yeah, I'll see where this goes ... O_DIRECT has to be fixed one way or the other.
John H. mentioned that he wants to look into converting that to FOLL_PIN. So maybe that will work eventually.
I'd really prefer that as the plan.
What exactly is the issue with O_DIRECT? Is it purely that it uses "put_page()" instead of "unpin", or what?
I really think that if people look up pages and expect those pages to stay coherent with the VM they looked it up for, they _have_ to actively tell the VM layer - which means using FOLL_PIN.
Note that this is in absolutely no way a "new" issue. It has *always* been true. If some O_DIRECT path depends on pinning behavior, it has never worked correctly, and it is entirely on O_DIRECT, and not at all a VM issue. We've had people doing GUP games forever, and being burnt by those games not working reliably.
GUP (before we even had the notion of pinning) would always just take a reference to the page, but it would not guarantee that that exact page then kept an association with the VM.
Now, in *practice* this all works if:
(a) the GUP user had always written to the page since the fork (either explicitly, or with FOLL_WRITE obviously acting as such)
(b) the GUP user never forks afterwards until the IO is done
(c) the GUP user plays no other VM games on that address
and it's also very possible that it has worked by pure luck (ie we've had a lot of random code that actively mis-used things and it would work in practice just because COW would happen to cut the right direction etc).
Is there some particular GUP user you happen to care about more than others? I think it's a valid option to try to fix things up one by one, even if you don't perhaps fix _all_ cases.
Linus
On 21.12.21 19:00, Linus Torvalds wrote:
On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand david@redhat.com wrote:
I do think the existing "maybe_pinned()" logic is fine for that. The "exclusive to this VM" bit can be used to *help* that decision - because only an exclusive page can be pinned - but I don't think it should _replace_ that logic.
The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to FOLL_PIN unfortunately. So I'm *trying* to make it more generic such that such corner cases can be handled correctly as well. But yeah, I'll see where this goes ... O_DIRECT has to be fixed one way or the other.
John H. mentioned that he wants to look into converting that to FOLL_PIN. So maybe that will work eventually.
I'd really prefer that as the plan.
What exactly is the issue with O_DIRECT? Is it purely that it uses "put_page()" instead of "unpin", or what?
I really think that if people look up pages and expect those pages to stay coherent with the VM they looked it up for, they _have_ to actively tell the VM layer - which means using FOLL_PIN.
Note that this is in absolutely no way a "new" issue. It has *always* been true. If some O_DIRECT path depends on pinning behavior, it has never worked correctly, and it is entirely on O_DIRECT, and not at all a VM issue. We've had people doing GUP games forever, and being burnt by those games not working reliably.
GUP (before we even had the notion of pinning) would always just take a reference to the page, but it would not guarantee that that exact page then kept an association with the VM.
Now, in *practice* this all works if:
(a) the GUP user had always written to the page since the fork (either explicitly, or with FOLL_WRITE obviously acting as such)
(b) the GUP user never forks afterwards until the IO is done
(c) the GUP user plays no other VM games on that address
and it's also very possible that it has worked by pure luck (ie we've had a lot of random code that actively mis-used things and it would work in practice just because COW would happen to cut the right direction etc).
Is there some particular GUP user you happen to care about more than others? I think it's a valid option to try to fix things up one by one, even if you don't perhaps fix _all_ cases.
Yes, of course. The important part for me is to have a rough idea in how to tackle all pieces and have a reliable design/approach. Besides the security issue, highest priority is getting R/W pins (FOLL_WRITE) right, including O_DIRECT, because that can silently break existing use cases.
Lower priority is getting R/O pins on anonymous memory right, because that never worked reliably. Lowest priority is getting R/O pins on MAP_PRIVATE file memory right.
I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion while I struggle with PageAnonExclusive() and R/W pins :)
[noting that I'll not get too much done within the next 2 weeks]
On 12/21/21 10:28, David Hildenbrand wrote: ...
I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion while I struggle with PageAnonExclusive() and R/W pins :)
Yes, I'll sign up for that (unless someone else who is faster is already working on it). I've tried a couple times in the past, but without the proper level of determination to see it through. So this time will be different. :)
[noting that I'll not get too much done within the next 2 weeks]
Likewise. Starting in early January.
thanks,
On Tue 21-12-21 18:40:30, David Hildenbrand wrote:
On 21.12.21 18:05, Linus Torvalds wrote:
On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand david@redhat.com wrote:
Note that I am also trying to make any kind of R/O pin on an anonymous page work as expected, to fix any kind of GUP after fork() and GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page similarly has to make sure that the page is exclusive -- even if it's mapped R/O (!).
I do think the existing "maybe_pinned()" logic is fine for that. The "exclusive to this VM" bit can be used to *help* that decision - because only an exclusive page can be pinned - but I don't think it should _replace_ that logic.
The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to FOLL_PIN unfortunately. So I'm *trying* to make it more generic such that such corner cases can be handled correctly as well. But yeah, I'll see where this goes ... O_DIRECT has to be fixed one way or the other.
John H. mentioned that he wants to look into converting that to FOLL_PIN. So maybe that will work eventually.
For record we always intended (and still intend) to make O_DIRECT use FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP and pages acquired through get_page() in a single bio (such as zero page) and thus it is non-trivial to do the right thing on IO completion (unpin or just put_page).
Honza
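One way out would be to record per bio whether its data pages were pinned, so that completion knows which release to call -- a sketch with a made-up flag name, not actual block-layer plumbing:

    /* on completion, for each data page of the bio: */
    if (bio_flagged(bio, BIO_PAGE_PINNED))  /* hypothetical flag */
        unpin_user_page(page);
    else
        put_page(page);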
On Tue, Dec 21, 2021 at 10:07 AM Jan Kara jack@suse.cz wrote:
For record we always intended (and still intend) to make O_DIRECT use FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP and pages acquired through get_page() in a single bio (such as zero page) and thus it is non-trivial to do the right thing on IO completion (unpin or just put_page).
Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but it might make it much easier to debug and catch.
If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set.
And that would catch anybody who ends up using other pages for unpin(), and you could have a WARN_ON() for it (obviously also trigger on the page count being too small to unpin).
That way, at least from a kernel debugging and development standpoint it would make it easy to see "ok, this unpinning got a page that wasn't pinned", and it would help find these cases where some situation had used just a get_page() rather than a pin to get a page pointer.
No?
Linus
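The check itself would be tiny -- e.g. something like this on the unpin side (sketch, say in unpin_user_page(); PageAnonExclusive() being the proposed bit):

    if (PageAnon(page))
        VM_WARN_ON_ONCE_PAGE(!PageAnonExclusive(page), page);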
On 21.12.21 19:30, Linus Torvalds wrote:
On Tue, Dec 21, 2021 at 10:07 AM Jan Kara jack@suse.cz wrote:
For record we always intended (and still intend) to make O_DIRECT use FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP and pages acquired through get_page() in a single bio (such as zero page) and thus it is non-trivial to do the right thing on IO completion (unpin or just put_page).
Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but it might make it much easier to debug and catch.
If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set.
And that would catch anybody who ends up using other pages for unpin(), and you could have a WARN_ON() for it (obviously also trigger on the page count being too small to unpin).
It would also catch if someone would be wrongly dropping the exclusive flag although there are users (pin) relying on the page staying exclusive.
That way, at least from a kernel debugging and development standpoint it would make it easy to see "ok, this unpinning got a page that wasn't pinned"
For that purpose the pincount would already kind of work. Not precise, but at least something ("this page cannot possibly have been pinned").
On Tue, Dec 21, 2021 at 10:51 AM David Hildenbrand david@redhat.com wrote:
For that purpose the pincount would already kind of work. Not precise, but at least something ("this page cannot possibly have been pinned").
That part actually exists already, ie put_page_refs() has this:
#ifdef CONFIG_DEBUG_VM
	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
		return;
#endif
And yeah, it shouldn't have that '#ifdef CONFIG_DEBUG_VM' there, but I think it's because the non-CONFIG_DEBUG_VM #define for VM_WARN_ON_ONCE_PAGE() is broken, and doesn't return 0.
Linus
On 12/21/21 10:30, Linus Torvalds wrote:
On Tue, Dec 21, 2021 at 10:07 AM Jan Kara jack@suse.cz wrote:
For record we always intended (and still intend) to make O_DIRECT use FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP and pages acquired through get_page() in a single bio (such as zero page) and thus it is non-trivial to do the right thing on IO completion (unpin or just put_page).
Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but it might make it much easier to debug and catch.
If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set.
And that would catch anybody who ends up using other pages for unpin(), and you could have a WARN_ON() for it (obviously also trigger on the page count being too small to unpin).
That way, at least from a kernel debugging and development standpoint it would make it easy to see "ok, this unpinning got a page that wasn't pinned", and it would help find these cases where some situation had used just a get_page() rather than a pin to get a page pointer.
No?
Linus
Yes, this is especially welcome, because it means that after enough time sitting in the -mm tree, we can reasonably expect to catch the most important cases, if any were missed. That makes it a whole other level of useful, as compared to local testing hacks.
thanks,
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
2) is certainly the cherry on top. But it just means that R/O pins don't have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics.
I still don't agree with this - when you come to patches can you have this work at the end and under a good cover letter? Maybe it will make more sense then.
Thanks, Jason
On 21.12.21 20:07, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
2) is certainly the cherry on top. But it just means that R/O pins don't have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics.
I still don't agree with this - when you come to patches can you have this work at the end and under a good cover letter? Maybe it will make more sense then.
Yes. But really, I think it's the logical consequence of what Linus said [1]:
"And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test."
We must not FOLL_PIN a page that is not exclusive (not only on gup-fast, but really, on any gup). If we special case R/O FOLL_PIN, we cannot enable the sanity check on unpin as suggested by Linus [2]:
"If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set."
There are really only two feasible options I see when we want to take a R/O FOLL_PIN on a !PageAnonExclusive() anon page:
(1) Fail the pinning completely. This implies that we'll have to fail O_DIRECT once converted to FOLL_PIN.
(2) Request to mark the page PageAnonExclusive() via a FAULT_FLAG_UNSHARE and let it succeed.
Anything else would require additional accounting that we already discussed in the past is hard -- for example, to differentiate R/O from R/W pins requiring two pin counters.
The only impact would be that FOLL_PIN after fork() has to go via a FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would be nice to optimize this, to *not* do that, but again ... this would require even more counters I think, for example, to differentiate between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
So unless we discover a way to do additional accounting for ordinary 4k pages, I think we really can only do (1) or (2) to make sure we never ever pin a !PageAnonExclusive() page.
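In code, the difference between (1) and (2) is only what the GUP slow path does when a R/O FOLL_PIN hits a non-exclusive anon page -- roughly (sketch; trigger_unshare_fault() is a made-up stand-in for faulting with FAULT_FLAG_UNSHARE and retrying the lookup):

    /* anon page found mapped R/O in the page table: */
    if ((flags & FOLL_PIN) && !PageAnonExclusive(page)) {
        /* option (1): return ERR_PTR(-EPERM); */
        /* option (2): */
        return trigger_unshare_fault(vma, address);
    }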
[1] https://lkml.kernel.org/r/CAHk-=wgQq3H6wfkW7+MmduVgBOqHeiXQN97yCMd+m1mM-1xCL...
[2] https://lkml.kernel.org/r/CAHk-=wiyxQ==vnHFHW99S_OPwA=u1Qrfg2OGr_6zPcBAuhQY2...
On 22.12.21 09:51, David Hildenbrand wrote:
On 21.12.21 20:07, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
2) is certainly the cherry on top. But it just means that R/O pins don't have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics.
I still don't agree with this - when you come to patches can you have this work at the end and under a good cover letter? Maybe it will make more sense then.
Yes. But really, I think it's the logical consequence of what Linus said [1]:
"And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test."
We must not FOLL_PIN a page that is not exclusive (not only on gup-fast, but really, on any gup). If we special case R/O FOLL_PIN, we cannot enable the sanity check on unpin as suggested by Linus [2]:
"If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set."
There are really only two feasible options I see when we want to take a R/O FOLL_PIN on a !PageAnonExclusive() anon page:
(1) Fail the pinning completely. This implies that we'll have to fail O_DIRECT once converted to FOLL_PIN.
(2) Request to mark the page PageAnonExclusive() via a FAULT_FLAG_UNSHARE and let it succeed.
Anything else would require additional accounting that we already discussed in the past is hard -- for example, to differentiate R/O from R/W pins requiring two pin counters.
The only impact would be that FOLL_PIN after fork() has to go via a FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would be nice to optimize this, to *not* do that, but again ... this would require even more counters I think, for example, to differentiate between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
So unless we discover a way to do additional accounting for ordinary 4k pages, I think we really can only do (1) or (2) to make sure we never ever pin a !PageAnonExclusive() page.
BTW, I just wondered if the optimization should actually be that R/O short-term FOLL_PIN users should actually be using FOLL_GET instead. So O_DIRECT with R/O would already be doing the right thing.
And it somewhat aligns with what we found: only R/W short-term FOLL_GET is problematic, where we can lose writes to the page from the device via O_DIRECT.
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
But I might be missing something, just an idea.
On Wed 22-12-21 10:58:36, David Hildenbrand wrote:
On 22.12.21 09:51, David Hildenbrand wrote:
On 21.12.21 20:07, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
2) is certainly the cherry on top. But it just means that R/O pins don't have to be the weird kid. And yes, achieving 2) would require FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do what existing COW logic does, just bypass the "map writable" and "trigger write fault" semantics.
I still don't agree with this - when you come to patches can you have this work at the end and under a good cover letter? Maybe it will make more sense then.
Yes. But really, I think it's the logical consequence of what Linus said [1]:
"And then all GUP-fast would need to do is to refuse to look up a page that isn't exclusive to that VM. We already have the situation that GUP-fast can fail for non-writable pages etc, so it's just another test."
We must not FOLL_PIN a page that is not exclusive (not only on gup-fast, but really, on any gup). If we special case R/O FOLL_PIN, we cannot enable the sanity check on unpin as suggested by Linus [2]:
"If we only set the exclusive VM bit on pages that get mapped into user space, and we guarantee that GUP only looks up such pages, then we can also add a debug test to the "unpin" case that the bit is still set."
There are really only two feasible options I see when we want to take a R/O FOLL_PIN on a !PageAnonExclusive() anon page:
(1) Fail the pinning completely. This implies that we'll have to fail O_DIRECT once converted to FOLL_PIN.
(2) Request to mark the page PageAnonExclusive() via a FAULT_FLAG_UNSHARE and let it succeed.
Anything else would require additional accounting that we already discussed in the past is hard -- for example, to differentiate R/O from R/W pins requiring two pin counters.
The only impact would be that FOLL_PIN after fork() has to go via a FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would be nice to optimize this, to *not* do that, but again ... this would require even more counters I think, for example, to differentiate between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
So unless we discover a way to do additional accounting for ordinary 4k pages, I think we really can only do (1) or (2) to make sure we never ever pin a !PageAnonExclusive() page.
BTW, I just wondered if the optimization should actually be that R/O short-term FOLL_PIN users should actually be using FOLL_GET instead. So O_DIRECT with R/O would already be doing the right thing.
And it somewhat aligns with what we found: only R/W short-term FOLL_GET is problematic, where we can lose writes to the page from the device via O_DIRECT.
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users - mostly inside mm - who need this) but also page data. Eventually, we even wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Honza
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we had the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we had the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
Well, I guess the question here is: Which GUP user needs only R/O access to page data and is so performance critical that it would be worth it to sacrifice API clarity for speed? I'm not aware of any but I was not looking really hard...
Honza
On 22.12.21 15:42, Jan Kara wrote:
On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
Well, I guess the question here is: Which GUP user needs only R/O access to page data and is so performance critical that it would be worth it to sacrifice API clarity for speed? I'm not aware of any but I was not looking really hard...
I'd be interested in examples as well. Maybe databases that use O_DIRECT after fork()?
On Wed 22-12-21 15:48:34, David Hildenbrand wrote:
On 22.12.21 15:42, Jan Kara wrote:
On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
Well, I guess the question here is: Which GUP user needs only R/O access to page data and is so performance critical that it would be worth it to sacrifice API clarity for speed? I'm not aware of any but I was not looking really hard...
I'd be interested in examples as well. Maybe databases that use O_DIRECT after fork()?
Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify page data (and so we need to detect them both for COW and filesystem needs). O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it is worth it.
Honza
On Wed, Dec 22, 2021 at 05:08:46PM +0100, Jan Kara wrote:
On Wed 22-12-21 15:48:34, David Hildenbrand wrote:
On 22.12.21 15:42, Jan Kara wrote:
On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
IIUC, our COW logic makes sure that a shared anonymous page that might still be used by a R/O FOLL_GET cannot be modified, because any attempt to modify it would result in a copy.
Well, we defined FOLL_PIN to mean the intent that the caller wants to access not only page state (for which is enough FOLL_GET and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
For file pages we need this data vs no-data access distinction so that filesystems can detect when someone can be accessing page data although the page is unmapped. Practically, filesystems care most about when someone can be *modifying* page data (we need to make sure data is stable e.g. when writing back data to disk or doing data checksumming or other operations) so using FOLL_GET when wanting to only read page data should be OK for filesystems but honestly I would be reluctant to break the rule of "use FOLL_PIN when wanting to access page data" to keep things simple and reasonably easy to understand for parties such as filesystem developers or driver developers who all need to interact with pinned pages...
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
Well, I guess the question here is: Which GUP user needs only R/O access to page data and is so performance critical that it would be worth it to sacrifice API clarity for speed? I'm not aware of any but I was not looking really hard...
I'd be interested in examples as well. Maybe databases that use O_DIRECT after fork()?
Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify page data (and so we need to detect them both for COW and filesystem needs). O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it is worth it.
Wow, I didn't realise the plan was to make FOLL_PIN the "default". I hoped it was weird crap that was going away soon. Looks like we'd better fix all the bugs in it then ...
On Wed, Dec 22, 2021 at 8:08 AM Jan Kara jack@suse.cz wrote:
Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify page data (and so we need to detect them both for COW and filesystem needs).
Well, O_DIRECT reads do, but not necessarily writes.
And hey, even reads have been dodgy in the past when we didn't really have the pinning logic - there's been a lot of users that just wanted it to work for their particular use-case rather than in general and in all situations..
Linus
On Wed 22-12-21 10:40:18, Linus Torvalds wrote:
On Wed, Dec 22, 2021 at 8:08 AM Jan Kara jack@suse.cz wrote:
Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify page data (and so we need to detect them both for COW and filesystem needs).
Well, O_DIRECT reads do, but not necessarily writes.
I agree.
And hey, even reads have been dodgy in the past when we didn't really have the pinning logic - there's been a lot of users that just wanted it to work for their particular use-case rather than in general and in all situations..
Yes, but currently a malicious user can take the system down (BUG_ON) or cause DIF/DIX failures if he is nasty and tries hard enough with O_DIRECT reads (practically, the window is small so I haven't really seen a report that I could trace to O_DIRECT reads but in principle the problem is the same as with pinning & dirtying done e.g. by video capture drivers and there we've seen these problem happen). So forcing pinning for O_DIRECT reads is IMO mandatory.
Honza
On Thu, Dec 23, 2021 at 4:54 AM Jan Kara jack@suse.cz wrote:
So forcing pinning for O_DIRECT reads is IMO mandatory.
I don't disagree.
And I do think the eventual aim should be to do it for writes too even if they don't necessarily require it (since they write _from_ the VM data, not _to_ the VM data - the "read-vs-write direction has always been confusing when it comes to GUP").
Partly just for consistency in the IO paths - I think people want to share as much as possible in there - but also just to make sure that we're all done with the "wrong-way-cow" kind of issues for good.
If we get to the point where the legacy GUP is used only for very special things (looking up physical pages for debug and trace purposes etc), I think that would be lovely.
That may be a pretty long-range goal, though.
Linus
On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
FYI, in my current tree, there's a gup_put_folio() which replaces put_compound_head:
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
        if (flags & FOLL_PIN) {
                node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
                if (hpage_pincount_available(&folio->page))
                        hpage_pincount_sub(&folio->page, refs);
                else
                        refs *= GUP_PIN_COUNTING_BIAS;
        }

        folio_put_refs(folio, refs);
}
That can become non-static if it's needed. I'm still working on that series, because I'd like to get it to a point where we return one folio pointer instead of N page pointers. Not quite there yet.
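For illustration, if gup_put_folio() did become non-static, a flag-aware unpin wrapper could be as small as the sketch below; gup_put_page() is a made-up name, not something that exists today:

/*
 * Hypothetical wrapper: release a reference taken by GUP using the same
 * gup_flags the caller passed at pin time, so a plain FOLL_GET reference
 * and a FOLL_PIN pin go through one helper.
 */
void gup_put_page(struct page *page, unsigned int gup_flags)
{
        gup_put_folio(page_folio(page), 1, gup_flags);
}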
On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
FYI, in my current tree, there's a gup_put_folio() which replaces put_compound_head:
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
        if (flags & FOLL_PIN) {
                node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
                if (hpage_pincount_available(&folio->page))
                        hpage_pincount_sub(&folio->page, refs);
                else
                        refs *= GUP_PIN_COUNTING_BIAS;
        }

        folio_put_refs(folio, refs);
}
That can become non-static if it's needed. I'm still working on that series, because I'd like to get it to a point where we return one folio pointer instead of N page pointers. Not quite there yet.
I'm keen to see what that looks like, every driver I'm working on that calls PUP goes through gyrations to recover contiguous pages, so this is most welcomed!
Jason
On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
FYI, in my current tree, there's a gup_put_folio() which replaces put_compound_head:
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
        if (flags & FOLL_PIN) {
                node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
                if (hpage_pincount_available(&folio->page))
                        hpage_pincount_sub(&folio->page, refs);
                else
                        refs *= GUP_PIN_COUNTING_BIAS;
        }

        folio_put_refs(folio, refs);
}
That can become non-static if it's needed. I'm still working on that series, because I'd like to get it to a point where we return one folio pointer instead of N page pointers. Not quite there yet.
I'm keen to see what that looks like, every driver I'm working on that calls PUP goes through gyrations to recover contiguous pages, so this is most welcomed!
I'm about to take some time off, so alas, you won't see it any time soon. It'd be good to talk with some of the interested users because it's actually a pretty tricky problem. We can't just return an array of the struct folios because the actual memory you want to access might be anywhere in that folio, and you don't want to have to redo the lookup just to find out which subpages of the folio are meant.
So I'm currently thinking about returning a bio_vec:
struct bio_vec {
        struct page *bv_page;
        unsigned int bv_len;
        unsigned int bv_offset;
};
In the iomap patchset which should go upstream in the next merge window, you can iterate over a bio like this:
struct folio_iter fi;

bio_for_each_folio_all(fi, bio)
        iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
There aren't any equivalent helpers for a bvec yet, but obviously we can add them so that you can iterate over each folio in a contiguous range.
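For illustration only: if such helpers were added, usage could mirror the bio version above. Both bvec_for_each_folio() and the reuse of struct folio_iter here are made up, not existing interfaces:

        struct folio_iter fi;

        /* hypothetical: walk each folio covered by one contiguous bvec */
        bvec_for_each_folio(fi, &bvec)
                iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);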
But now that each component in it is variable length, the caller can't know how large an array of bio_vecs to allocate.
1. The callee can allocate the array and let the caller free it when it's finished.
2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs over (potentially) multiple calls.
3. The caller can overallocate and ignore that most of the array isn't used.
Any preferences? I don't like #3.
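For illustration, option #2 from the caller's side might look roughly like the sketch below. pin_user_pages_bvec() is a made-up name and the error handling is elided:

/*
 * Hypothetical caller loop for option #2: a small on-stack bio_vec array is
 * refilled over multiple calls, trading repeated page table walks for not
 * having to allocate a large output array up front.
 */
static int process_range(unsigned long start, unsigned long len)
{
        struct bio_vec bvecs[16];
        unsigned long done = 0;

        while (done < len) {
                int i, n;

                n = pin_user_pages_bvec(start + done, len - done, FOLL_PIN,
                                        bvecs, ARRAY_SIZE(bvecs));
                if (n <= 0)
                        return n ? n : -EFAULT;
                for (i = 0; i < n; i++) {
                        /* hand bvecs[i] to the consumer ... */
                        done += bvecs[i].bv_len;
                }
        }
        return 0;
}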
On Fri, Dec 24, 2021 at 04:53:38AM +0000, Matthew Wilcox wrote:
On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
Right, from an API perspective we really want people to use FOLL_PIN.
To optimize this case in particular it would help if we would have the FOLL flags on the unpin path. Then we could just decide internally "well, short-term R/O FOLL_PIN can be really lightweight, we can treat this like a FOLL_GET instead". And we would need that as well if we were to keep different counters for R/O vs. R/W pinned.
FYI, in my current tree, there's a gup_put_folio() which replaces put_compound_head:
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
        if (flags & FOLL_PIN) {
                node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
                if (hpage_pincount_available(&folio->page))
                        hpage_pincount_sub(&folio->page, refs);
                else
                        refs *= GUP_PIN_COUNTING_BIAS;
        }

        folio_put_refs(folio, refs);
}
That can become non-static if it's needed. I'm still working on that series, because I'd like to get it to a point where we return one folio pointer instead of N page pointers. Not quite there yet.
I'm keen to see what that looks like, every driver I'm working on that calls PUP goes through gyrations to recover contiguous pages, so this is most welcomed!
I'm about to take some time off, so alas, you won't see it any time soon. It'd be good to talk with some of the interested users because it's actually a pretty tricky problem.
Sure, it is a good idea
We can't just return an array of the struct folios because the actual memory you want to access might be anywhere in that folio, and you don't want to have to redo the lookup just to find out which subpages of the folio are meant.
Yep
So I'm currently thinking about returning a bio_vec:
struct bio_vec {
        struct page *bv_page;
        unsigned int bv_len;
        unsigned int bv_offset;
};
The cases I'm looking at basically want an efficient list of physical addresses + lengths. They don't care about pages or folios, eg often the next step is to build a SGL and DMA map it which largely ignores all of that.
As the memory used to hold the output of pin_user_pages() is all temporary, there is a sensitivity to allocate the memory quickly, but also to have enough of it so that we don't have to do redundant work in pin_user_pages() - eg traversing to the same PMD table again and again.
But now that each component in it is variable length, the caller can't know how large an array of bio_vecs to allocate.
And the array entry is now 2x the size and there is no way to scatter the array to 4k segments?
- The callee can allocate the array and let the caller free it when it's finished
It is not bad, but a bit tricky; a lot of the GUP code executes in an IRQ-disabled state, so it has to use a pre-allocating scheme. We also can't scan everything twice and hope it didn't change, so exact preallocation doesn't seem likely either.
- The caller passes in a (small, fixed-size, on-stack) array of bio_vecs over (potentially) multiple calls.
It is slow, because we do redundant work traversing the same locks and page tables again and again..
- The caller can overallocate and ignore that most of the array isn't used.
Any preferences? I don't like #3.
#3 is OK for my applications because we immediately turn around and copy the output to something else and free the memory anyhow...
However, being an array means we can't reliably allocate more than 4k and with 16 bytes per entry that isn't even able to store a full PTE table.
What would be nice for these cases is if the caller can supply an array of 4k pages and GUP will fill them in. In many cases we'd probably pass in up to 2M worth of pages or something.
There is some batching balance here where we want to minimize the temporary memory consumed by GUP's output (and the time to allocate it!) but also minimize the redundant work inside GUP repeatedly walking the same tables and locks.
eg ideally GUP would stop at some natural alignment boundary if it could tell it can't fully fill the buffer. Then the next iteration would not redo the same locks.
I was once thinking about something like storing an array of PFNs and using the high bits to encode that the PFN is not 4k. It would allow efficient packing of the common fragmented cases. To make it work you'd need to have each 4k page grow the pfn list up from the start and the pfn sizes down from the end. A size entry is only consumed if the pfn bits can't encode the size directly so the packing can be a perfect 8 bytes per PFN for the common 4k and 2M aligned cases.
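For illustration, the per-entry encoding could look something like the sketch below. All names and the exact bit layout are made up; it only shows the "order in the high bits, explicit size only when needed" idea:

/*
 * Hypothetical 8-byte output entry: the low bits hold the PFN, a small field
 * in the otherwise-unused high bits holds the order.  Only when the order
 * does not fit is an explicit size consumed from the far end of the 4k
 * output page.
 */
#define PFNVEC_ORDER_SHIFT      56
#define PFNVEC_ORDER_MASK       (0xffULL << PFNVEC_ORDER_SHIFT)
#define PFNVEC_ORDER_EXT        0xff    /* size stored at the tail instead */

static inline u64 pfnvec_encode(unsigned long pfn, unsigned int order)
{
        if (order < PFNVEC_ORDER_EXT)
                return (u64)pfn | ((u64)order << PFNVEC_ORDER_SHIFT);
        return (u64)pfn | PFNVEC_ORDER_MASK;
}

static inline unsigned long pfnvec_pfn(u64 entry)
{
        return entry & ~PFNVEC_ORDER_MASK;
}

static inline bool pfnvec_has_ext_size(u64 entry)
{
        return (entry & PFNVEC_ORDER_MASK) == PFNVEC_ORDER_MASK;
}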
Jason
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
What adds complexity to correctly maintain the "exclusive" state are at least:
- KSM (might be harder, have to think about it)
I know little about it, but isn't KSM like fork where it is trying to WP pages with the intention of copying them? Shouldn't KSM completely reject WP'ing a page that is under any kind of writable GUP?
Jason
On 22.12.21 00:59, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
What adds complexity to correctly maintain the "exclusive" state are at least:
- KSM (might be harder, have to think about it)
I know little about it, but isn't KSM like fork where it is trying to WP pages with the intention of copying them? Shouldn't KSM completely reject WP'ing a page that is under any kind of writable GUP?
I think KSM will, similar to fork(), always have to try removing PageAnonExclusive() while synchronizing against concurrent GUP pins. If that fails, the page cannot be converted to KSM and consequently not be shared.
That will need some rework of KSM AFAIU, but shouldn't be impossible to do.
On Tue 21-12-21 19:59:16, Jason Gunthorpe wrote:
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
What adds complexity to correctly maintain the "exclusive" state are at least:
- KSM (might be harder, have to think about it)
I know little about it, but isn't KSM like fork where it is trying to WP pages with the intention of copying them? Shouldn't KSM completely reject WP'ing a page that is under any kind of writable GUP?
I know little about KSM as well but I think fundamentally it has similar requirements for anon pages as filesystems have for page cache pages e.g. when doing block deduplication or data checksumming... I.e., it needs to make sure data in the page is stable and nobody can modify it.
Honza
On 17.12.21 20:04, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand david@redhat.com wrote:
- If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
- set) and then unmaps the target page, we have:
- page has mapcount == 1 and refcount > 1
Hi Linus,
All these games with mapcount make me think this is still broken.
mapcount has been a horribly broken thing in the past, and I'm not convinced it's not a broken thing now.
It all started when Jann detected the security issue in GUP -- and this patch set is fixing the problem exactly there, in GUP itself. Are you aware of COW issues regarding the mapcount if we removed GUP from the equation? My point is that COW without GUP works perfectly fine, but I'll be happy to learn about other cases I was ignoring so far.
Unfortunately, page_count() is even more unreliable, and the issues we're just detecting (see the link in the cover letter: memory corruptions inside user space -- e.g., lost DMA writes) are even worse than what we had before -- IMHO of course.
vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
page_mapcount(vmf->page) > 1) {
What keeps the mapcount stable in here?
So, we're reading an atomic value here. It’s read via atomic_read for regular pages, and the THP mapcount case has also been made atomic (as lockless as page_count) in patch #5.
If a page is mapped exactly once, page_mapcount(page) == 1 and there is nothing to do.
If the page is mapped more than once, page_mapcount(page) > 1 and we would have to trigger unsharing. And it's true that the value is unstable in this case, but we really only care about page_mapcount(page) > 1 vs. page_mapcount(page) == 1. In this respect, there is no difference from the instability of the page_count and the mapcount -- we still only care if it's > 1 or == 1.
So the only case we could care about is concurrent additional mappings that can increment the mapcount -- which can only happen due to concurrent fork. So if we're reading page_mapcount(page) == 1 the only way we can get page_mapcount(page) > 1 is due to fork(). But we're holding the mmap_lock in read mode during faults and fork requires the mmap_lock in write mode.
And I still believe that the whole notion that "COW should use mapcount" is pure and utter garbage.
If we are doing a COW, we need an *exclusive* access to the page. That is not mapcount, that is the page ref.
I thought about this a lot, because initially I had the same opinion.
But really, we don't care about any speculative references (pagecache, migration, daemon, pagevec, ...) or any short-term "I just want to grab this reference real quick such that the page can't get freed" references.
All we care about are GUP references, and we attack that problem at the root by triggering unsharing exactly at the point where GUP comes into play.
So IMHO GUP is the problem and needs unsharing either:
* On write access to a shared anonymous page, which is just COW as we know it.
* On read access to a shared anonymous page, which is what we're proposing in this patch set.
So as soon as GUP comes into play, even if only pinning R/O, we have to trigger unsharing. Doing so enforces the invariant that it is impossible to take a GUP pin on an anonymous page with a mapcount > 1. In turn, the COW code does not need to worry about the GUP-after-fork() security issue anymore and can focus on handling COW faults optimally, as if GUP just didn't exist.
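Condensed, the check being argued for looks roughly like the sketch below. It is modeled on the gup_must_unshare() hunks quoted later in this thread; the real helper in the series takes more context, and this sketch omits the THP/hugetlb details:

/*
 * Sketch: does a R/O GUP of this page have to trigger unsharing first?
 * "Shared" here means: anonymous, not KSM, and mapped more than once.
 */
static inline bool ro_gup_must_unshare(unsigned int gup_flags, struct page *page)
{
        if (gup_flags & FOLL_WRITE)
                return false;   /* the write fault path copies anyway */
        if (!PageAnon(page) || PageKsm(page))
                return false;
        return page_mapcount(page) > 1;
}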
On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand david@redhat.com wrote:
If a page is mapped exactly once, page_mapcount(page) == 1 and there is nothing to do.
Why?
You state that, but you stating that doesn't magically make it so.
What makes "mapcount==1" stable and special? Your "it's an atomic_read()" argument is nonsense - it implies that the count can be changing, but you will get _one_ answer.
What makes that one answer of a changing count special?
What if there are other references to that same page, gotten with vmsplice(), and just about to be mapped into another address space?
This is the meat of my argument. You claim that "mapcount==1" is special. I claim that you haven't explained why it would be. And I do not believe it is.
Linus
On 17.12.21 21:51, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand david@redhat.com wrote:
If a page is mapped exactly once, page_mapcount(page) == 1 and there is nothing to do.
Why?
You state that, but you stating that doesn't magically make it so.
What makes "mapcount==1" stable and special? Your "it's an atomic_read()" argument is nonsense - it implies that the count can be changing, but you will get _one_ answer.
And I explained how it can not increment. And the only way is via fork(), which cannot run concurrently.
What makes that one answer of a changing count special?
What if there are other references to that same page, gotten with vmsplice(), and just about to be mapped into another address space?
If we have a shared anonymous page we cannot have GUP references, not even R/O ones. Because GUP would have unshared and copied the page, resulting in a R/O mapped anonymous page.
What am I missing?
On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand david@redhat.com wrote:
If we have a shared anonymous page we cannot have GUP references, not even R/O ones. Because GUP would have unshared and copied the page, resulting in a R/O mapped anonymous page.
Doing a GUP on an actual shared page is wrong to begin with.
You even know that, you try to use "page_mapcount() > 1" to disallow it.
My point is that it's wrong regardless, and that "mapcount" is dubious, and that COW cannot - and must not - use mapcount, and that I think your shared case should strive to avoid it for the exact same reason.
So, what I think should happen is:
(a) GUP makes sure that it only ever looks up pages that can be shared with this VM. This may involve breaking COW early with any past fork().
(b) it marks such pages so that any future work will not cause them to COW either
Note that (a) is not necessarily "always COW and have to allocate and copy new page". In particular, if the page is already writable, you know you already have exclusive access to it and don't need to COW.
And if it isn't writable, then the other common case is "the cow has only one user, and it's us" - that's the "refcount == 1" case.
And (b) is what we do with that page_maybe_dma_pinned() logic for fork(), but also for things like swap cache creation (eg see commit feb889fb40fa: "mm: don't put pinned pages into the swap cache").
Note that this code all already exists, and already works - even without getting the (very expensive) mmap_sem. So it works with fast-GUP and it can race with concurrent forking by another thread, which is why we also have that seqcount thing.
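For reference, rule (b) on the fork() side already looks roughly like this in mainline -- a sketch condensed from page_needs_cow_for_dma() and its copy_present_pte() caller, with the details simplified:

/*
 * Simplified: decide at fork() time whether the child must get its own copy
 * right away because the parent's page may be pinned for DMA.
 */
static bool must_copy_at_fork(struct vm_area_struct *src_vma, struct page *page)
{
        if (!is_cow_mapping(src_vma->vm_flags))
                return false;                   /* not a COW mapping at all */
        if (!test_bit(MMF_HAS_PINNED, &src_vma->vm_mm->flags))
                return false;                   /* this mm never pinned anything */
        return page_maybe_dma_pinned(page);     /* likely pinned: copy now */
}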
As far as I can tell, your "mapcount" logic fundamentally requires mmap_sem for the fork() race avoidance, for example.
So this is why I don't like the mapcount games - I think they are very fragile, and not at all as logical as the two simple rules a/b above.
I believe you can make mapcount games _work_ - we used to have something like that. It was incredibly fragile, and it had its own set of bugs, but with enough care it's doable.
But my argument really is that I think it's the wrong approach, and that we should simply strive to follow the two simple conceptual rules above.
Linus
On 17.12.21 22:36, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand david@redhat.com wrote:
If we have a shared anonymous page we cannot have GUP references, not even R/O ones. Because GUP would have unshared and copied the page, resulting in a R/O mapped anonymous page.
Doing a GUP on an actual shared page is wrong to begin with.
You even know that, you try to use "page_mapcount() > 1" to disallow it.
GUP is incompatible with shared anonymous pages, therefore it has to trigger unsharing, correct.
My point is that it's wrong regardless, and that "mapcount" is dubious, and that COW cannot - and must not - use mapcount, and that I think your shared case should strive to avoid it for the exact same reason.
For now I have not heard a compelling argument why the mapcount is dubious, I repeat:
* mapcount can only increase due to fork()
* mapcount can decrease due to unmap / zap
We can protect from the transition == 1 -> >1 using the mmap_lock.
For COW the mapcount is the only thing that matters *if we take GUP* out of the equation. And that's exactly what we propose: take GUP out of the equation by making it trigger unsharing.
OTOH, take a look at which issues resulted from the page_count changes. That's what I call dubious, sorry to say.
So, what I think should happen is:
(a) GUP makes sure that it only ever looks up pages that can be shared with this VM. This may involve breaking COW early with any past fork().
Is that unsharing as we propose it?
(b) it marks such pages so that any future work will not cause them to COW either
Right, exactly. GUP before fork does not result in a page getting shared again.
Note that (a) is not necessarily "always COW and have to allocate and copy new page". In particular, if the page is already writable, you know you already have exclusive access to it and don't need to COW.
And if it isn't writable, then the other common case is "the cow has only one user, and it's us" - that's the "refcount == 1" case.
And (b) is what we do with that page_maybe_dma_pinned() logic for fork(), but also for things like swap cache creation (eg see commit feb889fb40fa: "mm: don't put pinned pages into the swap cache").
I fully agree with b). GUP before fork is a totally different set of problems than GUP after fork.
Note that this code all already exists, and already works - even without getting the (very expensive) mmap_sem. So it works with fast-GUP and it can race with concurrent forking by another thread, which is why we also have that seqcount thing.
I know, I studied it intensively :)
As far as I can tell, your "mapcount" logic fundamentally requires mmap_sem for the fork() race avoidance, for example.
Yes. Or any other more lightweight synchronization in the future. For now this is just perfect.
So this is why I don't like the mapcount games - I think they are very fragile, and not at all as logical as the two simple rules a/b above.
I don't really see anything fragile, really. I'm happy to learn as always.
I believe you can make mapcount games _work_ - we used to have something like that. It was incredibly fragile, and it had its own set of bugs, but with enough care it's doable.
We made it work, and it was comparatively simple.
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand david@redhat.com wrote:
For now I have not heard a compelling argument why the mapcount is dubious, I repeat:
- mapcount can only increase due to fork()
- mapcount can decrease due to unmap / zap
We can protect from the transition == 1 -> >1 using the mmap_lock.
For COW the mapcount is the only thing that matters *if we take GUP* out of the equation. And that's exactly what we propose: take GUP out of the equation by making it trigger unsharing.
What do you have against just doing what we already do in other parts, that a/b thing?
Which avoids the whole mmap_sem issue. That was a big issue for the rdma people, afaik.
Linus
On 17.12.21 22:50, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand david@redhat.com wrote:
For now I have not heard a compelling argument why the mapcount is dubious, I repeat:
- mapcount can only increase due to fork()
- mapcount can decrease due to unmap / zap
We can protect from the transition == 1 -> >1 using the mmap_lock.
For COW the mapcount is the only thing that matters *if we take GUP* out of the equation. And that's exactly what we propose: take GUP out of the equation by making it trigger unsharing.
What do you have against just doing what we already do in other parts, that a/b thing?
Let me put it that way: I just want to get all of this fixed for good. I don't particularly care how. *But* I will fight for something that is superior, logically makes sense (at least to me :) ) and not super complicated. And I call also avoiding unnecessary COW "superior".
I do know that what this series proposes fixes the CVE: GUP after fork. I do know that the part 2 we'll be sending out next year will fix everything else we discovered so far, and it will rely on this as a basis, to not reintroduce any other COW issues we've seen so far.
If someone can propose something comparable that makes all discovered problems go away I'll be *extremely* happy. We have reproducers for all issues, so it's easy to verify, and I'm planning on extending the selftests to cover even more corner cases.
So far, I am not convinced that using the mapcount is dubious or problematic, I just don't see how. COW is all about sharing pages between processes, each expressed in the mapcount. It's a pure optimization for exactly that purpose.
GUP is the problem, not COW, not the mapcount. To me the mapcount is the only thing that makes sense in COW+unsharing logic, and GUP has to be taught to identify it and resolve it -> unshare when it detects a shared anonymous page.
Which avoids the whole mmap_sem issue. That was a big issue for the rdma people, afaik.
While I do care about future use cases, I cannot possibly see fork() not requiring the mmap_lock in the foreseeable future. Just so much depends on it as of now.
And after all, fixing everything what we discovered so far is more important to me than something like that for the future. We have other problems to solve in that regard.
----------------------------------------------------------------------
I didn't want to talk about hugetlb here but I will just because it's a good example why using the refcount is just wrong -- because unnecessary COWs are just absolutely problematic.
Assume you have a R/O mapped huge page that is only mapped into your process and you get a write fault. What should your COW logic do?
a) Rely on the mapcount? Yes, iff GUP has been taught to unshare properly, because then it expresses exactly what we want to know. mapcount == 1 -> reuse. mapcount > 1 -> COW.
b) Rely on the refcount? If we have a speculative reference on the page we would COW. As huge pages are a scarce resource we can easily just not have a free huge page anymore and crash the application. The app didn't do anything wrong.
So teaching the hugetlb COW code to rely on the refcount would just be highly fragile.
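To make the two policies concrete, the decision the hugetlb write-fault path would have to make boils down to something like the sketch below (illustrative only, not the actual hugetlb_cow() code):

/* a) mapcount-based: with GUP unsharing in place, mapcount == 1 means "exclusive to us". */
static bool hugetlb_can_reuse_mapcount(struct page *page)
{
        return page_mapcount(page) == 1;
}

/*
 * b) refcount-based: any transient reference (pagevec, migration, ...) forces
 * a copy, and a failed huge page allocation at that point kills the app.
 */
static bool hugetlb_can_reuse_refcount(struct page *page)
{
        return page_count(page) == 1;
}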
On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand david@redhat.com wrote:
While I do care about future use cases, I cannot possibly see fork() not requiring the mmap_lock in the foreseeable future. Just so much depends on it as of now.
It's not that *fork()* depends on it.
Of course fork() takes the mmap_sem.
It's that fast-gup really really doesn't want it, and can't take it.
So any fast-gup user fundamentally cannot look at mapcount(), because that would be fundamentally wrong and racy, and could race with fork.
And yet, as far as I can tell, that's *exactly* what your gup patches do, with gup_pte_range() adding
+       if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
+               put_compound_head(head, 1, flags);
+               goto pte_unmap;
+       }
which looks at the page mapcount without holding the mmap sem at all.
And see my other email - I think there are other examples of your patches looking at data that isn't stable because you don't hold the right locks.
And you can't even do the optimistic case without taking the lock, because in your world, a COW that optimistically copies in the case of a race condition is fundamentally *wrong* and buggy. Because in your world-view, GUP and COW are very different and have different rules, but you need things to be *exact*, and they aren't.
And none of this is anything at least I can think about, because I don't see what the "design" is.
I really have a hard time following what the rules actually are. You seem to think that "page_mapcount()" is a really simple rule, and I fundamentally disagree. It's a _very_ complicated thing indeed, with locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!
See why I'm so unhappy?
We *did* do the page_mapcount() thing. It was bad. It forced COW to always take the page lock. There's a very real reason why I'm pushing my "let's have a _design_ here", instead of your "let's look at page_mapcount without even doing the locking".
And yes, I *know* that fork() takes the mmap_sem, and likely always will. That really isn't the problem here. The problem is that your page_mapcount() paths DO NOT take that lock.
Btw, maybe I'm misreading things. I looked at the individual patches, I didn't apply them, maybe I missed something. But I don't think I am.
Linus
On 17.12.21 23:58, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand david@redhat.com wrote:
While I do care about future use cases, I cannot possibly see fork() not requiring the mmap_lock in the foreseeable future. Just so much depends on it as of now.
It's not that *fork()* depends on it.
Of course fork() takes the mmap_sem.
It's that fast-gup really really doesn't want it, and can't take it.
Oh, sorry, I was misreading your mail.
So any fast-gup user fundamentally cannot look at mapcount(), because that would be fundamentally wrong and racy, and could race with fork.
So we're concerned about fork() racing with gup-fast-only. gup-fast-only runs essentially lockless.
As we read an atomic mapcount, the relevant thing to happen would be that we read a mapcount of 1 and decide to share the page, but there is a concurrent fork() such that the mapcount is increased.
So the parent process has to be the only process owning that page for this to trigger (mapcount == 1). In that situation, we would pin the page in gup-fast-only.
BUT this is just like GUP before fork() and caught using the mm->write_protect_seq, so we'd immediately unpin it and not actually return it from get-user-pages-fast. No harm done AFAIKS.
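For completeness, the sequence-count handshake in the existing gup-fast code is roughly the following -- heavily simplified, the real code lives in internal_get_user_pages_fast() and the lockless walk itself is elided here:

static int pin_fast_sketch(unsigned long start, unsigned long nr_pages,
                           unsigned int gup_flags, struct page **pages)
{
        unsigned int seq;
        int nr_pinned = 0;

        seq = raw_read_seqcount(&current->mm->write_protect_seq);
        if (seq & 1)
                return 0;               /* fork() in progress, take the slow path */

        /* ... lockless page table walk, pinning into pages[] ... */

        if ((gup_flags & FOLL_PIN) &&
            read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
                unpin_user_pages(pages, nr_pinned);
                return 0;               /* raced with fork(), return nothing */
        }
        return nr_pinned;
}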
And yet, as far as I can tell, that's *exactly* what your gup patches do, with gup_pte_range() adding
        if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
                put_compound_head(head, 1, flags);
                goto pte_unmap;
        }
which looks at the page mapcount without holding the mmap sem at all.
And see my other email - I think there are other examples of your patches looking at data that isn't stable because you don't hold the right locks.
We rely on PageAnon(), PageKsm() and the mapcount. To my understanding, they are stable for our use in pagefault handling code under mmap_lock and in gup-fast because of above reasoning.
And you can't even do the optimistic case without taking the lock, because in your world, a COW that optimistically copies in the case of a race condition is fundamentally *wrong* and buggy. Because in your world-view, GUP and COW are very different and have different rules, but you need things to be *exact*, and they aren't.
And none of this is anything at least I can think about, because I don't see what the "design" is.
I really have a hard time following what the rules actually are. You seem to think that "page_mapcount()" is a really simple rule, and I fundamentally disagree. It's a _very_ complicated thing indeed, with locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!
See why I'm so unhappy?
I see why you're unhappy, and I appreciate the productive discussion :) But I think we just have to complete the big picture of what we're proposing and how the mapcount is safe to be used for this purpose.
I mean, I'm happy if you actually find a flaw in the current design proposal.
We *did* do the page_mapcount() thing. It was bad. It forced COW to always take the page lock. There's a very real reason why I'm pushing my "let's have a _design_ here", instead of your "let's look at page_mapcount without even doing the locking".
The locking semantics just have to be clarified and written in stone -- if we don't find any flaws.
But this will be my last mail for today, have a nice weekend Linus!
On Dec 17, 2021, at 3:29 PM, David Hildenbrand david@redhat.com wrote:
On 17.12.21 23:58, Linus Torvalds wrote:
And you can't even do the optimistic case without taking the lock, because in your world, a COW that optimistically copies in the case of a race condition is fundamentally *wrong* and buggy. Because in your world-view, GUP and COW are very different and have different rules, but you need things to be *exact*, and they aren’t.
I understand the discussion mainly revolves around correctness, which is obviously the most important property, but I would like to mention that having transient get_page() calls causing unnecessary COWs can cause hard-to-analyze and hard-to-avoid performance degradation. COW means a page copy, a TLB flush and potentially a TLB shootdown, which is the most painful, specifically on VMs.
So I think that any solution should be able to limit the cases/number of unnecessary COW operations to be minimal.
On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit namit@vmware.com wrote:
I understand the discussion mainly revolves around correctness, which is obviously the most important property, but I would like to mention that having transient get_page() calls causing unnecessary COWs can cause hard-to-analyze and hard-to-avoid performance degradation.
Note that the COW itself is pretty cheap. Yes, there's the page allocation and copy, but it's mostly a local thing.
So that falls under the "good to avoid" heading, but in the end it's not an immense deal.
In contrast, the page lock has been an actual big user-visible latency issue, to the point of correctness.
A couple of years ago, we literally had NMI watchdog timeouts due to the page wait-queues growing basically boundlessly. This was some customer internal benchmark code that I never saw, so it wasn't *quite* clear exactly what was going on, but we ended up having to split up the page wait list traversal using bookmark entries, because it was such a huge latency issue.
That was mostly NUMA balancing faults, I think, but the point I'm making is that avoiding the page lock can be a *much* bigger deal than avoiding some local allocation and copying of a page of data. There are real loads where the page-lock gets insanely bad, and I think it's because we use it much too much.
See commit 2554db916586 ("sched/wait: Break up long wake list walk") for some of that saga.
So I really think that having to serialize with the page lock in order to do some "exact page use counting" is a false economy. Yes, maybe you'd be able to avoid a COW or two, but at what locking cost?
Linus
On Dec 17, 2021, at 8:02 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit namit@vmware.com wrote:
I understand the discussion mainly revolves around correctness, which is obviously the most important property, but I would like to mention that having transient get_page() calls causing unnecessary COWs can cause hard-to-analyze and hard-to-avoid performance degradation.
Note that the COW itself is pretty cheap. Yes, there's the page allocation and copy, but it's mostly a local thing.
I don’t know about the page-lock overhead, but I understand your argument.
Having said that, I do know a bit about TLB flushes, which you did not mention as overheads of COW. Such flushes can be quite expensive on multithreaded workloads (specifically on VMs, but lets put those aside).
Take for instance memcached and assume you overcommit memory with a very fast swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached often accesses a page first for read and shortly after for write. I encountered, in a similar scenario, that the page reference that lru_cache_add() takes during the first faultin event (for read), causes a COW on a write page-fault that happens shortly after [1]. So on memcached I assume this would also trigger frequent unnecessary COWs.
Besides page allocation and copy, COW would then require a TLB flush, which, when performed locally, might not be too bad (~200 cycles). But if memcached has many threads, as it usually does, then you need a TLB shootdown and this one can be expensive (microseconds). If you start getting a TLB shootdown storm, you may avoid some IPIs since you see that other CPUs already queued IPIs for the target CPU. But then the kernel would flush the entire TLB on the target CPU, as it realizes that multiple TLB flushes were queued, and as it assumes that a full TLB flush would be cheaper.
[ I can try to run a benchmark during the weekend to measure the impact, as I did not really measure the impact on memcached before/after 5.8. ]
So I am in no position to prioritize one overhead over the other, but I do not think that COW can be characterized as mostly-local and cheap in the case of multithreaded workloads.
[1] https://lore.kernel.org/linux-mm/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail....
On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote:
Take for instance memcached and assume you overcommit memory with a very fast swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached often accesses a page first for read and shortly after for write. I encountered, in a similar scenario, that the page reference that lru_cache_add() takes during the first faultin event (for read), causes a COW on a write page-fault that happens shortly after [1]. So on memcached I assume this would also trigger frequent unnecessary COWs.
Why are we comparing page_count() against 1 and not 1 + PageLRU(page)? Having a reference from the LRU should be expected. Is it because of some race that we'd need to take the page lock to protect against?
On Dec 17, 2021, at 9:03 PM, Matthew Wilcox willy@infradead.org wrote:
On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote:
Take for instance memcached and assume you overcommit memory with a very fast swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached often accesses a page first for read and shortly after for write. I encountered, in a similar scenario, that the page reference that lru_cache_add() takes during the first faultin event (for read), causes a COW on a write page-fault that happens shortly after [1]. So on memcached I assume this would also trigger frequent unnecessary COWs.
Why are we comparing page_count() against 1 and not 1 + PageLRU(page)? Having a reference from the LRU should be expected. Is it because of some race that we'd need to take the page lock to protect against?
IIUC, the reference that is taken on the page is taken before SetPageLRU() is called and the reference is later dropped:
lru_add_drain()
  lru_add_drain_cpu()
    __pagevec_lru_add()
      __pagevec_lru_add_fn()
        SetPageLRU()      <- sets the LRU
      release_pages()     <- drops the reference
It is one scenario I encountered. There might be others that take transient references on pages that cause unnecessary COWs. I think David and Andrea had a few in mind. To trigger a COW bug I once used mlock()/munlock(), which take such a transient reference. But who knows how many other cases exist (KSM? vmscan?)
On Fri, Dec 17, 2021 at 9:03 PM Matthew Wilcox willy@infradead.org wrote:
Why are we comparing page_count() against 1 and not 1 + PageLRU(page)? Having a reference from the LRU should be expected. Is it because of some race that we'd need to take the page lock to protect against?
The LRU doesn't actually count towards a reference - the LRU list is maintained independently of the lifetime of the page (and is torn down on last release - which wouldn't work if the LRU list itself held a ref to the page).
But at least some of the code that gathers up pages to then put them on the LRU list takes a ref to the pages before passing them off, just to guarantee to keep them around during the operation.
So yes, various things can increment page counts in a transitory manner.
I still *much* prefer a reliable COW over one that doesn't happen enough.
The page count can have these (on the whole fairly rare) blips. That's ok. The page count is still *reliable*, in ways that the mapcount can never be. The mapcount fundamentally doesn't show "other non-mapped users".
So Nadav is correct that unnecessary cow events will cause extra work (and the TLB flush is a good point - just marking a page writable as-is is much cheaper).
But I'm looking at the actual code, and the actual logic, and I am dismissing the whole mapcount games completely.
David has a 10-patch series (plus one test) of complex, grotty, hard-to-understand code with new flags.
I posted a patch that removed 10 lines, and fixes the problem case his test-case was designed for.
I think that really speaks to the issues.
My approach is *simpler* and a hell of a lot more robust. And as mentioned, I can explain it.
And christ the thing I'm advocating for is WHAT WE ALREADY DO FOR 99.99% of all cases. Why? Because it's literally how the regular COW paths work TODAY.
And we had benchmarks show performance improvements (or no movement at all) from when we made those changes. Not the downsides that people claim.
It's only the THP paths that are broken (and possibly some individual mis-uses of GUP - people have mentioned virtio).
So now people are trying to do a fragile, complex thing that was shown to be problematic for the common case, and they are doing it for the insanely rare case? When a ten-line removal patch fixes that one too?
Linus
PS. Yes, yes, that 10-line removal patch is obviously still not tested, it's still likely incomplete because the THP case needs to do the page-pinning logic on the other side too, so I'm very obviously over-simplifying. But the fact that the *normal* pages already do this correctly - and don't use mapcount - should really make people go "Hmm".
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand david@redhat.com wrote:
For now I have not heard a compelling argument why the mapcount is dubious, I repeat:
- mapcount can only increase due to fork()
- mapcount can decrease due to unmap / zap
And to answer the "why is this dubious", let' sjust look at your actual code that I reacted to:
+       vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+       if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+           page_mapcount(vmf->page) > 1) {
Note how you don't just check page_mapcount(). Why not? Because mapcount is completely immaterial if it's not a PageAnon page, so you test for that.
So even when you do the mapcount read as one atomic thing, it's one atomic thing that depends on _other_ things, and all these checks are not atomic.
But a PageAnon() page can actually become a swap-backed page, and as far as I can tell, your code doesn't have any locking to protect against that.
So now you need not only the mmap_sem (to protect against fork), you also need the page lock (to protect against rmap changing the type of page).
I don't see you taking the page lock anywhere. Maybe the page table lock ends up serializing sufficiently with the rmap code that it ends up working
In the do_wp_page() path, we currently do those kinds of racy checks too, but then we do a trylock_page, and re-do them. And at any time there is any question about things, we fall back to copying - because a copy is always safe.
Well, it's always safe if we have the rule that "once we've pinned things, we don't cause them to be COW again".
But that "it's safe if" was exactly my (b) case.
That's why I much prefer the model I'm trying to push - it's conceptually quite simple. I can literally explain mine at a conceptual level with that "break pre-existing COW, make sure no future COW" model.
In contrast, I look at your page_mapcount() code, and I go "there is no conceptual rules here, and the actual implementation details look dodgy".
I personally like having clear conceptual rules - as opposed to random implementation details.
Linus
On 17.12.21 23:18, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand david@redhat.com wrote:
For now I have not heard a compelling argument why the mapcount is dubious, I repeat:
- mapcount can only increase due to fork()
- mapcount can decrease due to unmap / zap
And to answer the "why is this dubious", let' sjust look at your actual code that I reacted to:
vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
page_mapcount(vmf->page) > 1) {
Note how you don't just check page_mapcount(). Why not? Because mapcount is completely immaterial if it's not a PageAnon page, so you test for that.
So even when you do the mapcount read as one atomic thing, it's one atomic thing that depends on _other_ things, and all these checks are not atomic.
But a PageAnon() page can actually become a swap-backed page, and as far as I can tell, your code doesn't have any locking to protect against that.
The pages stay PageAnon(). swap-backed pages simply set a bit IIRC. mapcount still applies.
So now you need not only the mmap_sem (to protect against fork), you also need the page lock (to protect against rmap changing the type of page).
No, I don't think so. But I'm happy to be proven wrong because I might just be missing something important.
I don't see you taking the page lock anywhere. Maybe the page table lock ends up serializing sufficiently with the rmap code that it ends up working
In the do_wp_page() path, we currently do those kinds of racy checks too, but then we do a trylock_page, and re-do them. And at any time there is any question about things, we fall back to copying - because a copy is always safe.
Yes, I studied that code in detail as well.
Well, it's always safe if we have the rule that "once we've pinned things, we don't cause them to be COW again".
We should also be handling FOLL_GET, but that's a completely different discussion.
But that "it's safe if" was exactly my (b) case.
That's why I much prefer the model I'm trying to push - it's conceptually quite simple. I can literally explain mine at a conceptual level with that "break pre-existing COW, make sure no future COW" model.
:)
We really might be talking about the same thing, just that my point is that the mapcount is the right thing to use when making the decision whether to break COW -> trigger unsharing.
In contrast, I look at your page_mapcount() code, and I go "there is no conceptual rules here, and the actual implementation details look dodgy".
I personally like having clear conceptual rules - as opposed to random implementation details.
Oh, don't get me wrong, me too. But for me it all just makes perfect sense.
What we document is:
"The fault is an unsharing request to unshare a shared anonymous page (-> mapped R/O). Does not apply to KSM."
And the code checks for exactly that. And in that context the mapcount just expresses exactly what we want. Again, unless I am missing something important that you raise above.
Anyhow, it's late in Germany. thanks for the discussion Linus!
On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand david@redhat.com wrote:
The pages stay PageAnon(). swap-backed pages simply set a bit IIRC. mapcount still applies.
Our code-base is too large for me to remember all the details, but if we still end up having PageAnon for swapbacked pages, then mapcount can increase from another process faulting in a pte with that swap entry.
And mmap_sem doesn't protect against that. Again, page_lock() does.
And taking the page lock was a big performance issue.
One of the reasons that new COW handling is so nice is that you can do things like
        if (!trylock_page(page))
                goto copy;
exactly because in the a/b world order, the copy case is always safe.
In your model, as far as I can tell, you leave the page read-only and a subsequent COW fault _can_ happen, which means that now the subsequent COW needs to be very careful, because if it ever copies a page that was GUP'ed, you just broke the rules.
So COWing too much is a bug (because it breaks the page from the GUP), but COWing too little is an even worse problem (because it means that now the GUP user can see data it shouldn't have seen).
Our old code literally COWed too little. It's why all those changes happened in the first place.
This is why I'm pushing that whole story line of
(1) COW is based purely on refcounting, because that's the only thing that obviously can never COW too little.
(2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then makes sure to not mark pinned pages COW again (that "(b)" rule).
and here "don't use page_mapcount()" really is about that (1).
You do seem to have kept (1) in that your COW rules don't seem to change (but maybe I missed it), but because your GUP-vs-COW semantics are very different indeed, I'm not at all convinced about (2).
Linus
On 18.12.21 00:20, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand david@redhat.com wrote:
The pages stay PageAnon(). swap-backed pages simply set a bit IIRC. mapcount still applies.
Our code-base is too large for me to remember all the details, but if we still end up having PageAnon for swapbacked pages, then mapcount can increase from another process faulting in a pte with that swap entry.
"Our code-base is too large for me to remember all the details". I second that.
You may have a valid point about the mapcount regarding concurrent swapin in the current code. I'll have to think further about whether it could be a problem and whether it can be handled without heavy synchronization (I think the concern is that GUP unsharing could miss an unshare because, when seeing mapcount == 1, it doesn't detect that there are other page sharers expressed not via the mapcount but via swap entries).
Do you have any other concerns regarding the semantics/stability of the following points (as discussed, fork() is not the issue because it can be handled via write_protect_seq or something comparable; handling per-process things is not the problem):
a) Using PageAnon(): It cannot possibly change in the pagefault path or in the gup-fast-only path (otherwise there would be a use-after-free already).
b) Using PageKsm(): It cannot possibly change in the pagefault path or in the gup-fast path (otherwise there would be a use-after-free already).
c) Using mapcount: It cannot possibly change due to fork() in a way we care about but cannot detect (mapcount going from == 1 to > 1 concurrently) in the pagefault path or in the gup-fast path.
Your point for c) is that we might currently not handle swap correctly. Any other concerns, especially regarding the mapcount, or is that it?
IIUC, any GUP approach to detect necessary unsharing would at least require a check for a) and b). What we're arguing about is c).
And mmap_sem doesn't protect against that. Again, page_lock() does.
And taking the page lock was a big performance issue.
One of the reasons that new COW handling is so nice is that you can do things like
if (!trylock_page(page)) goto copy;
exactly because in the a/b world order, the copy case is always safe.
In your model, as far as I can tell, you leave the page read-only and a subsequent COW fault _can_ happen, which means that now the subsequent COW needs to be very careful, because if it ever copies a page that was GUP'ed, you just broke the rules.
So COWing too much is a bug (because it breaks the page from the GUP), but COWing too little is an even worse problem (because it means that now the GUP user can see data it shouldn't have seen).
Good summary, I'll extend below.
Our old code literally COWed too little. It's why all those changes happened in the first place.
Let's see if we can agree on some things to get a common understanding.
What can happen with COW is:
1) Missed COW
We miss a COW, therefore someone has access to a wrong page.
This is the security issue as in patch #11. The security issue documented in [1].
2) Unnecessary COW
We do a COW, but there are no other valid users, so it's just overhead + noise.
The performance issue documented in section 5 in [1].
3) Wrong COW
We do a COW but there are other valid users (-> GUP).
The memory corruption issue documented in section 2 and 3 in [1].
Most notably, the io_uring reproducer which races with the page_maybe_dma_pinned() check in the current code can trigger this easily, and exactly this issue is what gives me nightmares. [2]
Does that make sense? If we agree on the above, then here is how the currently discussed approaches differ:
page_count != 1:
* 1) cannot happen
* 2) can happen easily (speculative references due to pagecache, migration, daemon, pagevec, ...)
* 3) can happen in the current code

mapcount > 1:
* 1) your concern is that this can happen due to concurrent swapin
* 2) cannot happen.
* 3) your concern is that this can happen due to concurrent swapin
If we can agree on that: I can see why you dislike the mapcount; can you see why I dislike the page_count?
Ideally we'd really have a fast and reliable check for "is this page shared and could get used by multiple processes -- either multiple processes are already mapping it R/O or could map it via the swap R/O later".
This is why I'm pushing that whole story line of
(1) COW is based purely on refcounting, because that's the only thing that obviously can never COW too little.
I am completely missing how 2) or 3) could *ever* be handled properly for page_count != 1. 3) is obviously more important and gives me nightmares.
And that's what I've been trying to communicate the whole time: page_count is absolutely fragile, because anything that results in a page getting mapped R/O into a page table can trigger 3). And as [2] proves, that can even happen with *swap*.
(see how we're running into the same swap issues with both approaches? Stupid swap :) )
(2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then makes sure to not mark pinned pages COW again (that "(b)" rule).
and here "don't use page_mapcount()" really is about that (1).
You do seem to have kept (1) in that your COW rules don't seem to change (but maybe I missed it), but because your GUP-vs-COW semantics are very different indeed, I'm not at all convinced about (2).
Oh yes, sorry, not in the context of this series. The point is that the current page_count != 1 covers mapcount > 1, so we can adjust that separately later.
You mentioned "design", so let's assume we have a nice function:
/*
 * Check if an anon page is shared or exclusively used by a single
 * process: if shared, the page is shared by multiple processes either
 * mapping the page R/O ("active sharing") or having swap entries that
 * could result in the page getting mapped R/O ("inactive sharing").
 *
 * This function is safe to be called under mmap_lock in read/write mode
 * because it prevents concurrent fork() sharing the page.
 * This function is safe to be called from gup-fast-only in IRQ context,
 * as it detects concurrent fork() sharing the page
 */
bool page_anon_shared();
Can we agree that that would be a suitable function for (1) and (2) instead of using either the page_count or the mapcount directly? (Yes, how to actually make it reliable in the face of swapin is still to be discussed, but it might be a problem worth solving if that's the way to go.)
For hugetlb, this would really have to use the mapcount as explained (after all, fortunately there is no swap ...).
[1] https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/
[2] https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_...
[ Cutting down ruthlessly to the core of the issue ]
On Sat, Dec 18, 2021 at 1:58 AM David Hildenbrand david@redhat.com wrote:
Missed COW
Unnecessary COW
Wrong COW
Does that make sense? If we agree on the above, then here is how the currently discussed approaches differ:
page_count != 1:
- cannot happen
- can happen easily (speculative references due to pagecache, migration, daemon, pagevec, ...)
- can happen in the current code
I claim that (1) "cannot happen" is a huge mother of a deal. It's *LITERALLY* the bug you are chasing, and it's the security issue, so on a bug scale, it's about the worst there is.
I further then claim that (2) "happen easily" is you just making things up. Yes, it can happen. But no, it's not actually that common, and since (2) is harmless from a correctness standpoint, it is purely about performance.
And as mentioned, not using the mapcount actually makes *common* operations much simpler and faster. You don't need the page lock to serialize the mapcount.
So (2) is a performance argument, and you haven't actually shown it to be a problem.
Which really only leaves (3). Which I've already explained what the fix is: don't ever mark pages that shouldn't be COW'ed as being COW pages.
(3) is really that simple, although it ended up depending on Jason and John Hubbard and others doing that FOLL_PIN logic to distinguish "I just want to see a random page, and I don't care about COW" from "I want to get a page, and that page needs to be coherent with this VM and not be COW'ed away"
So I'm not claiming (3) is "trivial", but at the same time it's certainly not some fundamentally complicated thing, and it's easy to explain what is going on.
mapcount > 1:
- your concern is that this can happen due to concurrent swapin
- cannot happen.
- your concern is that this can happen due to concurrent swapin
No, my concern about (1) is that IT IS WRONG.
"mapcount" means nothing for COW. I even gave you an example of exactly where it means nothing. It's crazy. It's illogical. And it's complicated as hell.
The fact that only one user maps a page is simply not meaningful. That page can have other users that you don't know anything about, and that don't show up in the mapcount.
That page can be swapcached, in which case mapcount can change radically in ways that you earlier indicated cannot happen. You were wrong.
But even if you fix that - by taking the page lock in every single place - there are still *other* users that for all you know may want the old contents. You don't know.
The only thing that says "no other users" is the page count. Not the mapcount.
In other words, I claim that
(a) mapcount is fundamentally the wrong thing to test. You can be the only mapper, without being the "owner" of the page.
(b) it's *LITERALLY* the direct and present source of that bug in the testcase you added, where a page with a mapcount of 1 has other concurrent users and needs to be COW'ed but isn't.
(c) it's complicated and expensive to calculate (where one big part of the expense is the page lock synchronization requirements, but there are others)
And this all happens for that "case (1)", which is the worst and scariest of them all.
In contrast to that, your argument that "(2) cannot happen" is a total non-argument. (2) isn't the problem.
And I claim that (3) can happen because you're testing the wrong counter, so who knows if the COW is wrong or not?
I am completely missing how 2) or 3) could *ever* be handled properly for page_count != 1. 3) is obviously more important and gives me nightmares.
Ok, so if I tell you how (2) and (3) are handled properly, you will just admit you were wrong?
Here's how they are handled properly with page counts. I have told you this before, but I'll summarize:
(2) is handled semantically properly by definition - it may be "unnecessary", but it has no semantic meaning
This is an IMPORTANT thing to realize. The fact is, (2) is not in the same class as (1) or (3).
And honestly - we've been doing this for all the common cases already since at least 5.9, and your performance argument simply has not really reared its head. Which makes the whole argument moot. I claim that it simplifies lots of common operations and avoids having to serialize on a lock that has been a real and major problem. You claim it's extra overhead and can cause extra COW events. Neither of us has any numbers worth anything, but at least I can point to the fact that all the *normal* VM paths have been doing the thing I advocate for many releases now, and the sky most definitely is NOT falling.
So that only leaves (3).
Handling (3) really is so conceptually simple that I feel silly for repeating it: if you don't want a COW to happen, then you mark the page as being not-COW.
That sounds so simple as to be stupid. But it really is the solution. It's what that pinning logic does, and keeps that "page may be pinned" state around, and then operations like fork() that would otherwise create a COW mapping of it will just not do it.
So that incredibly simple approach does require actual code: it requires that explicit "fork() needs to copy instead of COW" code, it requires that "if it's pinned, we don't make a new swapcache entry out of it". So it's real code, and it's a real issue, but it's conceptually absolutely trivial, and the code is usually really simple to understand too.
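To make that concrete, here is a condensed paraphrase of the fork()-side rule (the helper name must_copy_instead_of_cow() is made up for illustration; the real check lives in the PTE copy path and additionally consults MMF_HAS_PINNED):

static bool must_copy_instead_of_cow(struct vm_area_struct *vma,
				     struct page *page)
{
	/* Only private, COW-able mappings are affected at all. */
	if (!is_cow_mapping(vma->vm_flags))
		return false;
	/*
	 * If the page may be DMA-pinned, eagerly copy it into the child
	 * now instead of write-protecting both sides for a later COW.
	 */
	return page_maybe_dma_pinned(page);
}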
So you have a *trivial* concept, and you have simple code that could be described to a slightly developmentally challenged waterfowl. If you're one of the programmers doing the "explain your code to a rubber ducky", you can look at code like this:
/*
 * Anonymous process memory has backing store?
 * Try to allocate it some swap space here.
 * Lazyfree page could be freed directly
 */
if (PageAnon(page) && PageSwapBacked(page)) {
	if (!PageSwapCache(page)) {
		if (!(sc->gfp_mask & __GFP_IO))
			goto keep_locked;
		if (page_maybe_dma_pinned(page))
			goto keep_locked;
and you can explain that page_maybe_dma_pinned() test to your rubber ducky, and that rubber ducky will literally nod its head. It gets it.
To recap: (1) is important, and page_count() is the only thing that guarantees "you get full access to a page only when it's *obviously* exclusively yours". (2) is NOT important, but could be a performance issue, but we have real data from the past year that it isn't. (3) is important, and has a really spectacularly simple conceptual fix with quite simple code too.
In contrast, with the "mapcount" games you can't even explain why they should work, and the patches I see are actively buggy because everything is so subtle.
Linus
On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds torvalds@linux-foundation.org wrote:
To recap: (1) is important, and page_count() is the only thing that guarantees "you get full access to a page only when it's *obviously* exclusively yours". (2) is NOT important, but could be a performance issue, but we have real data from the past year that it isn't. (3) is important, and has a really spectacularly simple conceptual fix with quite simple code too.
In contrast, with the "mapcount" games you can't even explain why they should work, and the patches I see are actively buggy because everything is so subtle.
So to challenge you, please explain exactly how mapcount works to solve (1) and (3), and how it incidentally guarantees that (2) doesn't happen.
And that really involves explaining the actual code too. I can explain the high-level concepts in literally a couple of sentences.
For (1), "the page_count()==1 guarantees you are the only owner, so a COW event can re-use the page" really explains it. And the code is pretty simple too. There's nothing subtle about "goto copy" when pagecount is not 1. And even the locking is simple: "we hold the page table lock, we found a page, it has only one ref to it, we own it"
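Spelled out as code, the rule boils down to something like the following (a simplified paraphrase of the anonymous-page path in the write-fault handler; the trylock_page()/recheck dance and swapcache handling are omitted):

	if (PageAnon(page) && !PageKsm(page) && page_count(page) == 1) {
		/* Provably the sole owner: reuse the page in place. */
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}
	/* Any other reference, wherever it comes from: copying is safe. */
	return wp_page_copy(vmf);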
Our VM is *incredibly* complicated. There really are serious advantages to having simple rules in place.
And for (2), the simple rule is "yeah, we can cause spurious cow events". That's not only simple to explain, it's simple to code for. Suddenly you don't need to worry. "Copying the page is always safe". That's a really really powerful statement.
Now, admittedly (3) is the one that ends up being more complicated, but the *concept* sure is simple. "If you don't want to COW this page, then don't mark it for COW".
The *code* for (3) is admittedly a bit more complicated. The "don't mark it for COW" is simple to say, but we do have that fairly odd locking thing with fork() doing a seqcount_write_begin/end, and then GUP does the read-seqcount thing with retry. So it's a bit unusual, and I don't think we have that particular pattern anywhere else, but it's one well-defined lock and while unusual it's not *complicated* as far as kernel locking rules go. It's unusual and perhaps not trivial, but in the end those seqcount code sequences are maybe 10 lines total, and they don't interact with anything else.
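As an illustration of the pattern itself (not the exact fork()/GUP-fast code; wp_seq stands in for mm->write_protect_seq and the two functions are just placeholders):

#include <linux/errno.h>
#include <linux/seqlock.h>

static seqcount_t wp_seq = SEQCNT_ZERO(wp_seq);

/* fork() side: bracket the phase that write-protects PTEs for COW. */
static void writer_side(void)
{
	raw_write_seqcount_begin(&wp_seq);
	/* ... write-protect parent PTEs and copy them into the child ... */
	raw_write_seqcount_end(&wp_seq);
}

/* GUP-fast side: pin locklessly, then verify we did not race with fork(). */
static int reader_side(void)
{
	unsigned int seq = raw_read_seqcount(&wp_seq);

	if (seq & 1)		/* a writer is currently active: back off */
		return -EAGAIN;
	/* ... walk the page tables without locks, grab page references ... */
	if (read_seqcount_retry(&wp_seq, seq))
		return -EAGAIN;	/* raced: drop the pins and use the slow path */
	return 0;
}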
And yes, the "don't mark it for COW" means that write-protecting something is special, mainly because we sadly do not have extra bits in the page tables. It would be *really* easy if we could just hide this "don't COW this page" in the page table. Truly trivial. We don't, because of portability across different architectures ;(
So I'll freely give you that my (3) is somewhat painful, but it's painful with a really simple concept.
And the places that get (3) wrong are generally places that nobody has been able to care about. I didn't realize the problem with creating a swap page after the fact for a while, so that commit feb889fb40fa ("mm: don't put pinned pages into the swap cache") came later, but it's literally a very simple two-liner.
The commit message for commit feb889fb40fa may be worth reading. It very much explains the spirit of the thing, and is much longer than the trivial patch itself.
Simple and clear concepts matter. Code gets complicated even then, but complex code with complex concepts is a bad combination.
Linus
On 18.12.21 20:52, Linus Torvalds wrote:
On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds torvalds@linux-foundation.org wrote:
To recap: (1) is important, and page_count() is the only thing that guarantees "you get full access to a page only when it's *obviously* exclusively yours". (2) is NOT important, but could be a performance issue, but we have real data from the past year that it isn't. (3) is important, and has a really spectacularly simple conceptual fix with quite simple code too.
In contrast, with the "mapcount" games you can't even explain why they should work, and the patches I see are actively buggy because everything is so subtle.
So to challenge you, please explain exactly how mapcount works to solve (1) and (3), and how it incidentally guarantees that (2) doesn't happen.
Oh, there is no need for additional challenges, I've been challenged with this problem for way too long already ;)
And again, I appreciate this discussion and your feedback. I want to get all of this fixed ASAP, even if it's not going to be the way I proposed. Any input is appreciated (as long as people don't scream at me).
To get to your point: I thought about your remarks regarding swap and it makes sense. The mapcount itself is not expressive enough to catch mapcount == 1 vs. mapcount > 1 reliably.
What *would* work for precise COW, I think, is having both an active and an inactive count instead of just the active (mapcount) part, whereby:
active: page tables mapping this page -> mapcount
inactive: page tables referencing this page via swap/migration entry
An alternative would be to just know that there are inactive users. We'd have to read both values atomically in one shot.
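Purely as a hypothetical illustration of the "one shot" idea (none of these names exist anywhere, and this is not what the series does): both counts could be packed into a single atomic_t:

#include <linux/atomic.h>

#define ANON_INACTIVE_SHIFT	16
#define ANON_ACTIVE_MASK	0xffffU

/* Hypothetical: read the active and inactive sharers in one shot. */
static inline bool anon_page_maybe_shared(const atomic_t *sharers)
{
	unsigned int val = atomic_read(sharers);
	unsigned int active = val & ANON_ACTIVE_MASK;
	unsigned int inactive = val >> ANON_INACTIVE_SHIFT;

	/* Shared if there is more than one current or potential user. */
	return active + inactive > 1;
}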
There would be ways to store that information in the _mapcount but it certainly adds a lot of complexity, and ...
And that really involves explaining the actual code too. I can explain the high-level concepts in literally a couple of sentences.
For (1), "the page_count()==1 guarantees you are the only owner, so a COW event can re-use the page" really explains it. And the code is pretty simple too. There's nothing subtle about "goto copy" when pagecount is not 1. And even the locking is simple: "we hold the page table lock, we found a page, it has only one ref to it, we own it"
Our VM is *incredibly* complicated. There really are serious advantages to having simple rules in place.
... you have a point here.
That said, I hope we can agree that the page_count is not the perfect solution. I hope we can at least tweak it for now to get rid of 3) Wrong COW.
And for (2), the simple rule is "yeah, we can cause spurious cow events". That's not only simple to explain, it's simple to code for. Suddenly you don't need to worry. "Copying the page is always safe". That's a really really powerful statement.
Now, admittedly (3) is the one that ends up being more complicated, but the *concept* sure is simple. "If you don't want to COW this page, then don't mark it for COW".
The *code* for (3) is admittedly a bit more complicated. The "don't mark it for COW" is simple to say, but we do have that fairly odd locking thing with fork() doing a seqcount_write_begin/end, and then GUP does the read-seqcount thing with retry. So it's a bit unusual, and I don't think we have that particular pattern anywhere else, but it's one well-defined lock and while unusual it's not *complicated* as far as kernel locking rules go. It's unusual and perhaps not trivial, but in the end those seqcount code sequences are maybe 10 lines total, and they don't interact with anything else.
And yes, the "don't mark it for COW" means that write-protecting something is special, mainly because we sadly do not have extra bits in the page tables. It would be *really* easy if we could just hide this "don't COW this page" in the page table. Truly trivial. We don't, because of portability across different architectures ;(
So I'll freely give you that my (3) is somewhat painful, but it's painful with a really simple concept.
Thanks for admitting that!
I might have had an idea yesterday on how to fix most of the issues without relying on the mapcount, doing it similarly to (but slightly differently from) what you propose here. Let's call it a mixture of the unsharing approach and your approach. I cannot promise anything, so ...
... I'll go play with it and share some details ASAP. At least it sounds comparatively simple in my head.
And the places that get (3) wrong are generally places that nobody has been able to care about. I didn't realize the problem with creating a swap page after the fact for a while, so that commit feb889fb40fa ("mm: don't put pinned pages into the swap cache") came later, but it's literally a very simple two-liner.
Just to give you my perspective:
Personally I don't care too much about 2). The only reasons why I somewhat care about "Unnecessary COW" are:
* It's challenging for hugetlb use, as I explained. We might still want to use the mapcount there.
* It's mostly a symptom of our eventually too simple COW logic that effectively leads to 3).
While I do care about 1) (Missed COW) for our customers, I *especially* care about 3) (Wrong COW), simply because silent memory corruptions in user space are not acceptable.
As you say, fixing 1) the "page_count" way might be easy, at least for THP.
Simple example: Have swapping enabled and register a fixed io_uring buffer at the wrong time. Fixed io_uring buffers are now a commodity feature for unprivileged user space ...
So that's why I so deeply care about all of this.
It is currently possible for a child process to observe modifications of anonymous pages by the parent process after fork() in some cases, which is not only a userspace visible violation of the POSIX semantics of MAP_PRIVATE, but more importantly a real security issue.
This issue, including other related COW issues, has been summarized in [1]: " 1. Observing Memory Modifications of Private Pages From A Child Process
Long story short: process-private memory might not be as private as you think once you fork(): successive modifications of private memory regions in the parent process can still be observed by the child process, for example, by smart use of vmsplice()+munmap().
The core problem is that pinning pages readable in a child process, such as done via the vmsplice system call, can result in a child process observing memory modifications done in the parent process the child is not supposed to observe. [1] contains an excellent summary and [2] contains further details. This issue was assigned CVE-2020-29374 [9].
For this to trigger, it's required to use a fork() without subsequent exec(), for example, as used under Android zygote. Without further details about an application that forks less-privileged child processes, one cannot really say what's actually affected and what's not -- see the details section the end of this mail for a short sshd/openssh analysis.
While commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") fixed this issue and resulted in other problems (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page() simplification") re-introduced part of the problem unfortunately.
The original reproducer can be modified quite easily to use THP [3] and make the issue appear again on upstream kernels. I modified it to use hugetlb [4] and it triggers as well. The problem is certainly less severe with hugetlb than with THP; it merely highlights that we still have plenty of open holes we should be closing/fixing.
Regarding vmsplice(), the only known workaround is to disallow the vmsplice() system call ... or disable THP and hugetlb. But who knows what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in the end, it's a more generic issue. "
This security issue / MAP_PRIVATE POSIX violation was first reported by Jann Horn on 27 May 2020 and it currently affects anonymous THP and hugetlb.
Ordinary anonymous pages are currently not affected, because the COW logic was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification") for them to COW on "page_count() != 1" instead of "mapcount > 1", which unfortunately results in other COW issues, some of them documented in [1] as well.
To fix this COW issue once and for all, introduce GUP-triggered unsharing that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to traditional COW, unsharing will leave the copied page mapped write-protected in the page table, not having the semantics of a write fault.
Logically, unsharing is triggered "early", as soon as GUP performs the action that could later result in a missed COW and trigger the security issue; however, unlike before, unsharing is not triggered via a write fault with its undesired side effects.
GUP triggers unsharing if all of the following conditions are met:
* The page is mapped R/O
* We have an anonymous page, excluding KSM
* We want to read (!FOLL_WRITE)
* Unsharing is not disabled (!FOLL_NOUNSHARE)
* We want to take a reference (FOLL_GET or FOLL_PIN)
* The page is a shared anonymous page: mapcount > 1
As this patch introduces the same unsharing logic also for ordinary PTE-mapped anonymous pages, it also paves the way to fix the other known COW related issues documented in [1] without reintroducing the security issue or reintroducing other issues we observed in the past (e.g., broken ptrace on pmem).
We had better leave the follow_page() API alone: it's an internal API, its users don't actually allow user space to read page content, and they don't expect to get "NULL" for actually present pages -- because they usually don't trigger faults. Introduce and use FOLL_NOUNSHARE for that purpose. We could also think about using it for other corner cases, such as get_dump_page().
Note: GUP users that use memory notifiers to synchronize with the MM don't have to bother about unsharing: they don't actually take a reference on the pages and are properly synchronized against MM changes to never result in consistency issues.
Add a TODO item that the mechanism should be extended to improve GUP long-term as a whole, avoiding the requirement for FOLL_WRITE|FOLL_FORCE.
The hugetlb case will be handled separately.
This commit is based on prototype patches by Andrea.
[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
Co-developed-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Andrea Arcangeli aarcange@redhat.com
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 include/linux/mm.h | 10 ++++++
 mm/gup.c           | 90 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c   |  7 ++++
 3 files changed, 104 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 37d1fb2f865e..ebcdaed60701 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2975,6 +2975,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ #define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */ #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */ +#define FOLL_NOUNSHARE 0x100000 /* don't trigger unsharing on shared anon pages */
/* * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each @@ -3029,6 +3030,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, * releasing pages: get_user_pages*() pages must be released via put_page(), * while pin_user_pages*() pages must be released via unpin_user_page(). * + * FOLL_NOUNSHARE should be set when no unsharing should be triggered when + * eventually taking a read-only reference on a shared anonymous page, because + * we are sure that user space cannot use that reference for reading the page + * after eventually unmapping the page. FOLL_NOUNSHARE is implicitly set for the + * follow_page() API. + * * Please see Documentation/core-api/pin_user_pages.rst for more information. */
@@ -3043,6 +3050,9 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) return 0; }
+extern bool gup_must_unshare(unsigned int flags, struct page *page, + bool is_head); + typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, unsigned long size, pte_fn_t fn, void *data); diff --git a/mm/gup.c b/mm/gup.c index 2c51e9748a6a..2a83388c3fb4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,53 @@ struct follow_page_context { unsigned int page_mask; };
+/* + * Indicates for which pages that are write-protected in the page table, + * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the + * GUP pin will remain consistent with the pages mapped into the page tables + * of the MM. + * + * This handling is required to guarantee that a child process that triggered + * a read-only GUP before unmapping the page of interest cannot observe + * modifications of shared anonymous pages with COW semantics in the parent + * after fork(). + * + * TODO: although the security issue described does no longer apply in any case, + * the full consistency between the pinned pages and the pages mapped into the + * page tables of the MM only apply to short-term pinnings only. For + * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be + * inefficient and still result in some consistency issues. Extend this + * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding + * FOLL_WRITE|FOLL_FORCE. + * + * This function is safe to be called in IRQ context. + */ +bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head) +{ + /* We only care about read faults where unsharing is desired. */ + if (flags & (FOLL_WRITE | FOLL_NOUNSHARE)) + return false; + /* + * We only care when the reference count of the page is to get + * increased. In particular, GUP users that rely on memory notifiers + * instead don't have to trigger unsharing. + */ + if (!(flags & (FOLL_GET|FOLL_PIN))) + return false; + if (!PageAnon(page)) + return false; + if (PageKsm(page)) + return false; + if (PageHuge(page)) + /* TODO: handle hugetlb as well. */ + return false; + if (is_head) { + VM_BUG_ON(!PageTransHuge(page)); + return page_trans_huge_mapcount(page, NULL) > 1; + } + return page_mapcount(page) > 1; +} + static void hpage_pincount_add(struct page *page, int refs) { VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); @@ -543,6 +590,14 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } }
+ /* + * If unsharing is required, keep retrying to unshare until the + * page becomes exclusive. + */ + if (!pte_write(pte) && gup_must_unshare(flags, page, false)) { + page = ERR_PTR(-EMLINK); + goto out; + } /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ if (unlikely(!try_grab_page(page, flags))) { page = ERR_PTR(-ENOMEM); @@ -790,6 +845,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches * the device's dev_pagemap metadata to avoid repeating expensive lookups. * + * When getting an anonymous page and the caller has to trigger unsharing + * of a shared anonymous page first, -EMLINK is returned. The caller should + * trigger a fault with FAULT_FLAG_UNSHARE set. With FOLL_NOUNSHARE set, will + * never require unsharing and consequently not return -EMLINK. + * * On output, the @ctx->page_mask is set according to the size of the page. * * Return: the mapped (struct page *), %NULL if no mapping exists, or @@ -845,6 +905,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, if (vma_is_secretmem(vma)) return NULL;
+ /* + * Don't require unsharing in case we stumble over a read-only mapped, + * shared anonymous page: this is an internal API only and callers don't + * actually use it for exposing page content to user space. + */ + foll_flags |= FOLL_NOUNSHARE; page = follow_page_mask(vma, address, foll_flags, &ctx); if (ctx.pgmap) put_dev_pagemap(ctx.pgmap); @@ -910,7 +976,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, * is, *@locked will be set to 0 and -EBUSY returned. */ static int faultin_page(struct vm_area_struct *vma, - unsigned long address, unsigned int *flags, int *locked) + unsigned long address, unsigned int *flags, bool unshare, + int *locked) { unsigned int fault_flags = 0; vm_fault_t ret; @@ -935,6 +1002,12 @@ static int faultin_page(struct vm_area_struct *vma, */ fault_flags |= FAULT_FLAG_TRIED; } + if (unshare) { + VM_BUG_ON(unshare && *flags & FOLL_NOUNSHARE); + fault_flags |= FAULT_FLAG_UNSHARE; + /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */ + VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE); + }
ret = handle_mm_fault(vma, address, fault_flags, NULL); if (ret & VM_FAULT_ERROR) { @@ -1156,8 +1229,9 @@ static long __get_user_pages(struct mm_struct *mm, cond_resched();
page = follow_page_mask(vma, start, foll_flags, &ctx); - if (!page) { - ret = faultin_page(vma, start, &foll_flags, locked); + if (!page || PTR_ERR(page) == -EMLINK) { + ret = faultin_page(vma, start, &foll_flags, + PTR_ERR(page) == -EMLINK, locked); switch (ret) { case 0: goto retry; @@ -2311,6 +2385,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, goto pte_unmap; }
+ if (!pte_write(pte) && gup_must_unshare(flags, page, false)) { + put_compound_head(head, 1, flags); + goto pte_unmap; + } + VM_BUG_ON_PAGE(compound_head(page) != head, page);
/* @@ -2554,6 +2633,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0; }
+ if (!pmd_write(orig) && gup_must_unshare(flags, head, true)) { + put_compound_head(head, refs, flags); + return 0; + } + *nr += refs; SetPageReferenced(head); return 1; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1685821525e8..57842e8b13d4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1375,6 +1375,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page = pmd_page(*pmd); VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+ /* + * If unsharing is required, keep retrying to unshare until the + * page becomes exclusive. + */ + if (!pmd_write(*pmd) && gup_must_unshare(flags, page, true)) + return ERR_PTR(-EMLINK); + if (!try_grab_page(page, flags)) return ERR_PTR(-ENOMEM);
Let's support FAULT_FLAG_UNSHARE to implement GUP-triggered unsharing, preparing for its use in the GUP paths when there is a need to unshare a shared anonymous hugetlb page.
We'll make use of it next by setting FAULT_FLAG_UNSHARE in case we detect that unsharing is necessary.
This commit is based on a prototype patch by Andrea.
Co-developed-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Andrea Arcangeli aarcange@redhat.com
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 63 insertions(+), 23 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a1baa198519a..5f2863b046ef 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5130,14 +5130,15 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, }
/* - * Hugetlb_cow() should be called with page lock of the original hugepage held. + * __wp_hugetlb() should be called with page lock of the original hugepage held. * Called with hugetlb_fault_mutex_table held and pte_page locked so we * cannot race with other handlers or page migration. * Keep the pte_same checks anyway to make transition from the mutex easier. */ -static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, pte_t *ptep, - struct page *pagecache_page, spinlock_t *ptl) +static __always_inline vm_fault_t +__wp_hugetlb(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, struct page *pagecache_page, + spinlock_t *ptl, bool unshare) { pte_t pte; struct hstate *h = hstate_vma(vma); @@ -5151,11 +5152,21 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, old_page = pte_page(pte);
retry_avoidcopy: - /* If no-one else is actually using this page, avoid the copy - * and just make the page writable */ - if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { - page_move_anon_rmap(old_page, vma); - set_huge_ptep_writable(vma, haddr, ptep); + if (!unshare) { + /* + * If no-one else is actually using this page, avoid the copy + * and just make the page writable. + */ + if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { + page_move_anon_rmap(old_page, vma); + set_huge_ptep_writable(vma, haddr, ptep); + return 0; + } + } else if (!PageAnon(old_page) || page_mapcount(old_page) == 1) { + /* + * GUP-triggered unsharing only applies to shared anonymous + * pages. If that does no longer apply, there is nothing to do. + */ return 0; }
@@ -5256,11 +5267,11 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) { ClearHPageRestoreReserve(new_page);
- /* Break COW */ + /* Break COW or unshare */ huge_ptep_clear_flush(vma, haddr, ptep); mmu_notifier_invalidate_range(mm, range.start, range.end); set_huge_pte_at(mm, haddr, ptep, - make_huge_pte(vma, new_page, 1)); + make_huge_pte(vma, new_page, !unshare)); page_remove_rmap(old_page, true); hugepage_add_new_anon_rmap(new_page, vma, haddr); SetHPageMigratable(new_page); @@ -5270,7 +5281,10 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, spin_unlock(ptl); mmu_notifier_invalidate_range_end(&range); out_release_all: - /* No restore in case of successful pagetable update (Break COW) */ + /* + * No restore in case of successful pagetable update (Break COW or + * unshare) + */ if (new_page != old_page) restore_reserve_on_error(h, vma, haddr, new_page); put_page(new_page); @@ -5281,6 +5295,23 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, return ret; }
+static vm_fault_t +wp_hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, struct page *pagecache_page, + spinlock_t *ptl) +{ + return __wp_hugetlb(mm, vma, address, ptep, pagecache_page, ptl, + false); +} + +static vm_fault_t +wp_hugetlb_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, + struct page *pagecache_page, spinlock_t *ptl) +{ + return __wp_hugetlb(mm, vma, address, ptep, pagecache_page, ptl, true); +} + /* Return the pagecache page at a given address within a VMA */ static struct page *hugetlbfs_pagecache_page(struct hstate *h, struct vm_area_struct *vma, unsigned long address) @@ -5393,7 +5424,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, /* * Currently, we are forced to kill the process in the event the * original mapper has unmapped pages from the child due to a failed - * COW. Warn that such a situation has occurred as it may not be obvious + * COW/unsharing. Warn that such a situation has occurred as it may not + * be obvious. */ if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) { pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n", @@ -5519,7 +5551,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, hugetlb_count_add(pages_per_huge_page(h), mm); if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { /* Optimization, do the COW without a second fault */ - ret = hugetlb_cow(mm, vma, address, ptep, page, ptl); + ret = wp_hugetlb_cow(mm, vma, address, ptep, page, ptl); }
spin_unlock(ptl); @@ -5649,14 +5681,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, goto out_mutex;
/* - * If we are going to COW the mapping later, we examine the pending - * reservations for this page now. This will ensure that any + * If we are going to COW/unshare the mapping later, we examine the + * pending reservations for this page now. This will ensure that any * allocations necessary to record that reservation occur outside the * spinlock. For private mappings, we also lookup the pagecache * page now as it is used to determine if a reservation has been * consumed. */ - if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) { + if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && + !huge_pte_write(entry)) { if (vma_needs_reservation(h, vma, haddr) < 0) { ret = VM_FAULT_OOM; goto out_mutex; @@ -5671,14 +5704,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ptl = huge_pte_lock(h, mm, ptep);
- /* Check for a racing update before calling hugetlb_cow */ + /* + * Check for a racing update before calling wp_hugetlb_cow / + * wp_hugetlb_unshare + */ if (unlikely(!pte_same(entry, huge_ptep_get(ptep)))) goto out_ptl;
/* - * hugetlb_cow() requires page locks of pte_page(entry) and - * pagecache_page, so here we need take the former one - * when page != pagecache_page or !pagecache_page. + * wp_hugetlb_cow()/wp_hugetlb_unshare() requires page locks of + * pte_page(entry) and pagecache_page, so here we need take the former + * one when page != pagecache_page or !pagecache_page. */ page = pte_page(entry); if (page != pagecache_page) @@ -5691,11 +5727,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) { if (!huge_pte_write(entry)) { - ret = hugetlb_cow(mm, vma, address, ptep, - pagecache_page, ptl); + ret = wp_hugetlb_cow(mm, vma, address, ptep, + pagecache_page, ptl); goto out_put_page; } entry = huge_pte_mkdirty(entry); + } else if (flags & FAULT_FLAG_UNSHARE && !huge_pte_write(entry)) { + ret = wp_hugetlb_unshare(mm, vma, address, ptep, pagecache_page, + ptl); + goto out_put_page; } entry = pte_mkyoung(entry); if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
Similar to the !hugetlb variant, invoke unsharing for shared anonymous pages when required during GUP by setting FAULT_FLAG_UNSHARE in hugetlb code as well.
FAULT_FLAG_UNSHARE will trigger unsharing of shared anonymous pages during GUP, resulting in a child process no longer being able to observe memory modifications performed by the parent after fork() to anonymous shared hugetlb pages.
This commit is based on prototype patches by Andrea.
Co-developed-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Andrea Arcangeli aarcange@redhat.com
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 mm/gup.c     |  3 +--
 mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c index 2a83388c3fb4..35d1b28e3829 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -67,8 +67,7 @@ bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head) if (PageKsm(page)) return false; if (PageHuge(page)) - /* TODO: handle hugetlb as well. */ - return false; + return __page_mapcount(page) > 1; if (is_head) { VM_BUG_ON(!PageTransHuge(page)); return page_trans_huge_mapcount(page, NULL) > 1; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f2863b046ef..dc42018ee1a6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5971,6 +5971,25 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma, } }
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte, + bool *unshare) +{ + pte_t pteval = huge_ptep_get(pte); + + *unshare = false; + if (is_swap_pte(pteval)) + return true; + if (huge_pte_write(pteval)) + return false; + if (flags & FOLL_WRITE) + return true; + if (gup_must_unshare(flags, pte_page(pteval), true)) { + *unshare = true; + return true; + } + return false; +} + long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *position, unsigned long *nr_pages, @@ -5985,6 +6004,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, while (vaddr < vma->vm_end && remainder) { pte_t *pte; spinlock_t *ptl = NULL; + bool unshare; int absent; struct page *page;
@@ -6035,9 +6055,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, * both cases, and because we can't follow correct pages * directly from any kind of swap entries. */ - if (absent || is_swap_pte(huge_ptep_get(pte)) || - ((flags & FOLL_WRITE) && - !huge_pte_write(huge_ptep_get(pte)))) { + if (absent || + __follow_hugetlb_must_fault(flags, pte, &unshare)) { vm_fault_t ret; unsigned int fault_flags = 0;
@@ -6045,6 +6064,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, spin_unlock(ptl); if (flags & FOLL_WRITE) fault_flags |= FAULT_FLAG_WRITE; + else if (unshare) + fault_flags |= FAULT_FLAG_UNSHARE; if (locked) fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; @@ -6734,7 +6755,21 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, goto out; pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { - page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT); + struct page *head_page = pmd_page(*pmd); + + /* + * follow_huge_pmd() is only called when coming via + * follow_page(), where we set FOLL_NOUNSHARE. Ordinary GUP + * goes via follow_hugetlb_page(), where we can properly unshare + * if required. + */ + if (WARN_ON_ONCE(!huge_pte_write(pte) && + gup_must_unshare(flags, head_page, true))) { + page = NULL; + goto out; + } + + page = head_page + ((address & ~PMD_MASK) >> PAGE_SHIFT); /* * try_grab_page() should always succeed here, because: a) we * hold the pmd (ptl) lock, and b) we've just checked that the
Let's add an optimized way to check "page_trans_huge_mapcount() > 1" that is allowed to break the loop early.
This commit is based on a prototype patch by Andrea.
Co-developed-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Andrea Arcangeli aarcange@redhat.com
Reviewed-by: Peter Xu peterx@redhat.com
Signed-off-by: David Hildenbrand david@redhat.com
---
 include/linux/huge_mm.h |  7 +++++++
 mm/gup.c                |  2 +-
 mm/huge_memory.c        | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 44e02d47c65a..3a9d8cf64219 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -361,6 +361,8 @@ static inline void thp_mapcount_unlock(struct page *page, local_irq_restore(irq_flags); }
+extern bool page_trans_huge_anon_shared(struct page *page); + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -532,6 +534,11 @@ static inline void thp_mapcount_unlock(struct page *page, { }
+static inline bool page_trans_huge_anon_shared(struct page *page) +{ + return false; +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/** diff --git a/mm/gup.c b/mm/gup.c index 35d1b28e3829..496575ff9ac8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -70,7 +70,7 @@ bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head) return __page_mapcount(page) > 1; if (is_head) { VM_BUG_ON(!PageTransHuge(page)); - return page_trans_huge_mapcount(page, NULL) > 1; + return page_trans_huge_anon_shared(page); } return page_mapcount(page) > 1; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 57842e8b13d4..dced82274f1d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1281,6 +1281,40 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) spin_unlock(vmf->ptl); }
+ +static bool __page_trans_huge_anon_shared(struct page *page) +{ + int i, mapcount; + + mapcount = head_compound_mapcount(page); + if (mapcount > 1) + return true; + if (PageDoubleMap(page)) + mapcount -= 1; + for (i = 0; i < thp_nr_pages(page); i++) { + if (atomic_read(&page[i]._mapcount) + mapcount + 1 > 1) + return true; + } + return false; +} + +/* A lightweight check corresponding to "page_trans_huge_mapcount() > 1". */ +bool page_trans_huge_anon_shared(struct page *page) +{ + unsigned int seqcount; + bool shared; + + VM_BUG_ON_PAGE(PageHuge(page) || PageTail(page), page); + VM_BUG_ON_PAGE(!PageAnon(page) || !PageTransHuge(page), page); + + do { + seqcount = thp_mapcount_read_begin(page); + shared = __page_trans_huge_anon_shared(page); + } while (thp_mapcount_read_retry(page, seqcount)); + + return shared; +} + vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma;
Let's make sure the security issue / MAP_PRIVATE violation of POSIX semantics doesn't reappear, using variations of the original vmsplice reproducer. Ideally, we'd also test some more cases with R/O long-term pinnings -- but the existing mechanisms like RDMA or VFIO require rather complicated setups not suitable for simple selftests.
In the future we might be able to add some O_DIRECT test and maybe extend the gup tests in the kernel accordingly.
Using barrier() is a little clunky, but "volatile" seems to be generally frowned upon and makes checkpatch angry.
Cc: Shuah Khan shuah@kernel.org
Signed-off-by: David Hildenbrand david@redhat.com
---
 tools/testing/selftests/vm/Makefile       |   1 +
 tools/testing/selftests/vm/gup_cow.c      | 312 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh |  16 ++
 3 files changed, 329 insertions(+)
 create mode 100644 tools/testing/selftests/vm/gup_cow.c
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 1607322a112c..dad6037d735f 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -27,6 +27,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS) LDLIBS = -lrt -lpthread TEST_GEN_FILES = compaction_test TEST_GEN_FILES += gup_test +TEST_GEN_FILES += gup_cow TEST_GEN_FILES += hmm-tests TEST_GEN_FILES += hugepage-mmap TEST_GEN_FILES += hugepage-mremap diff --git a/tools/testing/selftests/vm/gup_cow.c b/tools/testing/selftests/vm/gup_cow.c new file mode 100644 index 000000000000..9d44ed2ffdfc --- /dev/null +++ b/tools/testing/selftests/vm/gup_cow.c @@ -0,0 +1,312 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * GUP (Get User Pages) interaction with COW (Copy On Write) tests. + * + * Copyright 2021, Red Hat, Inc. + * + * Author(s): David Hildenbrand david@redhat.com + */ +#define _GNU_SOURCE +#include <stdlib.h> +#include <string.h> +#include <stdbool.h> +#include <stdint.h> +#include <unistd.h> +#include <errno.h> +#include <fcntl.h> +#include <sys/mman.h> +#include <sys/wait.h> + +#include "../kselftest.h" + +#define barrier() asm volatile("" ::: "memory") + +static size_t pagesize; +static size_t thpsize; +static size_t hugetlbsize; + +struct shared_mem { + bool parent_ready; + bool child_ready; +}; +struct shared_mem *shared; + +static size_t detect_thpsize(void) +{ + int fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", + O_RDONLY); + size_t size = 0; + char buf[15]; + int ret; + + if (fd < 0) + return 0; + + ret = pread(fd, buf, sizeof(buf), 0); + if (ret < 0 || ret == sizeof(buf)) + goto out; + buf[ret] = 0; + + size = strtoul(buf, NULL, 10); +out: + close(fd); + if (size < pagesize) + size = 0; + return size; +} + +static uint64_t pagemap_get_entry(int fd, void *addr) +{ + const unsigned long pfn = (unsigned long)addr / pagesize; + uint64_t entry; + int ret; + + ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry)); + if (ret != sizeof(entry)) + ksft_exit_fail_msg("reading pagemap failed\n"); + return entry; +} + +static bool page_is_populated(void *addr) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + uint64_t entry; + bool ret; + + if (fd < 0) + ksft_exit_fail_msg("opening pagemap failed\n"); + + /* Present or swapped. */ + entry = pagemap_get_entry(fd, addr); + ret = !!(entry & 0xc000000000000000ull); + close(fd); + return ret; +} + +static int child_vmsplice_fn(unsigned char *mem, size_t size) +{ + struct iovec iov = { + .iov_base = mem, + .iov_len = size, + }; + size_t cur, total, transferred; + char *old, *new; + int fds[2]; + + old = malloc(size); + new = malloc(size); + + /* Backup the original content. */ + memcpy(old, mem, size); + + if (pipe(fds) < 0) + return -errno; + + /* Trigger a read-only pin. */ + transferred = vmsplice(fds[1], &iov, 1, 0); + if (transferred < 0) + return -errno; + if (transferred == 0) + return -EINVAL; + + /* Unmap it from our page tables. */ + if (munmap(mem, size) < 0) + return -errno; + + /* Wait until the parent modified it. */ + barrier(); + shared->child_ready = true; + barrier(); + while (!shared->parent_ready) + barrier(); + barrier(); + + /* See if we still read the old values. 
*/ + total = 0; + while (total < transferred) { + cur = read(fds[0], new + total, transferred - total); + if (cur < 0) + return -errno; + total += cur; + } + + return memcmp(old, new, transferred); +} + +static void test_child_ro_gup(unsigned char *mem, size_t size) +{ + int ret; + + /* Populate the page. */ + memset(mem, 0, size); + + shared->parent_ready = false; + shared->child_ready = false; + barrier(); + + ret = fork(); + if (ret < 0) { + ksft_exit_fail_msg("fork failed\n"); + } else if (!ret) { + ret = child_vmsplice_fn(mem, size); + exit(ret); + } + + barrier(); + while (!shared->child_ready) + barrier(); + /* Modify the page. */ + barrier(); + memset(mem, 0xff, size); + barrier(); + shared->parent_ready = true; + + wait(&ret); + if (WIFEXITED(ret)) + ret = WEXITSTATUS(ret); + else + ret = -EINVAL; + + ksft_test_result(!ret, "child has correct MAP_PRIVATE semantics\n"); +} + +static void test_anon_ro_gup_child(void) +{ + unsigned char *mem; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mem == MAP_FAILED) { + ksft_test_result_fail("mmap failed\n"); + return; + } + + ret = madvise(mem, pagesize, MADV_NOHUGEPAGE); + /* Ignore if not around on a kernel. */ + if (ret && ret != -EINVAL) { + ksft_test_result_fail("madvise failed\n"); + goto out; + } + + test_child_ro_gup(mem, pagesize); +out: + munmap(mem, pagesize); +} + +static void test_anon_thp_ro_gup_child(void) +{ + unsigned char *mem, *mmap_mem; + size_t mmap_size; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + if (!thpsize) + ksft_test_result_skip("THP size not detected\n"); + + mmap_size = 2 * thpsize; + mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_mem == MAP_FAILED) { + ksft_test_result_fail("mmap failed\n"); + return; + } + + mem = (unsigned char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + + ret = madvise(mem, thpsize, MADV_HUGEPAGE); + if (ret) { + ksft_test_result_fail("madvise(MADV_HUGEPAGE) failed\n"); + goto out; + } + + /* + * Touch the first sub-page and test of we get another sub-page + * populated. + */ + mem[0] = 0; + if (!page_is_populated(mem + pagesize)) { + ksft_test_result_skip("Did not get a THP populated\n"); + goto out; + } + + test_child_ro_gup(mem, thpsize); +out: + munmap(mmap_mem, mmap_size); +} + +static void test_anon_hugetlb_ro_gup_child(void) +{ + unsigned char *mem, *dummy; + + ksft_print_msg("[RUN] %s\n", __func__); + + if (!hugetlbsize) + ksft_test_result_skip("hugetlb size not detected\n"); + + ksft_print_msg("[INFO] Assuming hugetlb size of %zd bytes\n", + hugetlbsize); + + mem = mmap(NULL, hugetlbsize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); + if (mem == MAP_FAILED) { + ksft_test_result_skip("need more free huge pages\n"); + return; + } + + /* + * We need a total of two hugetlb pages to handle COW/unsharing + * properly. + */ + dummy = mmap(NULL, hugetlbsize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); + if (dummy == MAP_FAILED) { + ksft_test_result_skip("need more free huge pages\n"); + goto out; + } + munmap(dummy, hugetlbsize); + + test_child_ro_gup(mem, hugetlbsize); +out: + munmap(mem, hugetlbsize); +} + +int main(int argc, char **argv) +{ + int err; + + pagesize = getpagesize(); + thpsize = detect_thpsize(); + /* For simplicity, we'll rely on the thp size. 
*/ + hugetlbsize = thpsize; + + ksft_print_header(); + ksft_set_plan(3); + + /* We need an easy way to talk to our child. */ + shared = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); + if (shared == MAP_FAILED) + ksft_exit_fail_msg("mmap(MAP_SHARED)\n"); + + /* + * Tests for the security issue reported by Jann Horn that originally + * resulted in CVE-2020-29374. More generally, it's a violation of + * POSIX MAP_PRIVATE semantics, because some other process can modify + * pages that are supposed to be private to one process. + * + * So let's test that process-private pages stay private using the + * known vmsplice reproducer. + */ + test_anon_ro_gup_child(); + test_anon_thp_ro_gup_child(); + test_anon_hugetlb_ro_gup_child(); + + err = ksft_get_fail_cnt(); + if (err) + ksft_exit_fail_msg("%d out of %d tests failed\n", + err, ksft_test_num()); + return ksft_exit_pass(); +} diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh index a24d30af3094..80e441e0ae45 100755 --- a/tools/testing/selftests/vm/run_vmtests.sh +++ b/tools/testing/selftests/vm/run_vmtests.sh @@ -168,6 +168,22 @@ else echo "[PASS]" fi
+echo "--------------------------------------------------------" +echo "running "GUP interaction with COW tests. +echo "--------------------------------------------------------" +./gup_cow +ret_val=$? + +if [ $ret_val -eq 0 ]; then + echo "[PASS]" +elif [ $ret_val -eq $ksft_skip ]; then + echo "[SKIP]" + exitcode=$ksft_skip +else + echo "[FAIL]" + exitcode=1 +fi + echo "-------------------" echo "running userfaultfd" echo "-------------------"