The patch titled
Subject: mm/uffd: fix vma check on userfault for wp
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-uffd-fix-vma-check-on-userfault-for-wp.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/uffd: fix vma check on userfault for wp
Date: Mon, 24 Oct 2022 15:33:35 -0400
There was a report that the pte-marker code can be reached even when
uffd-wp is not compiled in for file-backed memory, here:
https://lore.kernel.org/all/YzeR+R6b4bwBlBHh@x1n/T/#u
I finally got time to revisit this and found that the root cause is that
we messed up the vma check, so that on a !PTE_MARKER_UFFD_WP system we
would allow UFFDIO_REGISTER of MINOR & WP upon shmem because the check
was wrong:

	if (vm_flags & VM_UFFD_MINOR)
		return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);

Here anything passes on shmem as long as minor mode is requested, so the
later check that rejects uffd-wp on shmem when pte markers are not
compiled in is never reached.
Axel did it right when introducing minor mode but I messed it up in
b1f9e876862d when moving code around. Fix it.
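A small stand-alone model of the two check shapes may make the effect
clearer (hypothetical user-space C, not the kernel function; the flag
values and helper names are made up for illustration).  With the old
shape, a shmem vma registered with MINOR|WP is accepted even when pte
markers are not compiled in; with the new shape the WP restriction
still applies:

	#include <stdbool.h>
	#include <stdio.h>

	#define VM_UFFD_MINOR	0x1	/* illustrative values only */
	#define VM_UFFD_WP	0x2

	/* Old shape: the early return for MINOR skips the WP check. */
	static bool can_userfault_old(bool anon, bool shmem, bool hugetlb,
				      unsigned long vm_flags, bool pte_markers)
	{
		if (vm_flags & VM_UFFD_MINOR)
			return hugetlb || shmem;	/* returns here for shmem */
		if (!pte_markers && (vm_flags & VM_UFFD_WP) && !anon)
			return false;			/* never reached for MINOR|WP */
		return anon || hugetlb || shmem;
	}

	/* New shape: MINOR only filters vma types; WP is still checked. */
	static bool can_userfault_new(bool anon, bool shmem, bool hugetlb,
				      unsigned long vm_flags, bool pte_markers)
	{
		if ((vm_flags & VM_UFFD_MINOR) && !hugetlb && !shmem)
			return false;
		if (!pte_markers && (vm_flags & VM_UFFD_WP) && !anon)
			return false;			/* now applies to MINOR|WP too */
		return anon || hugetlb || shmem;
	}

	int main(void)
	{
		/* shmem vma, MINOR|WP requested, pte markers not compiled in */
		unsigned long flags = VM_UFFD_MINOR | VM_UFFD_WP;

		printf("old check allows: %d\n",
		       can_userfault_old(false, true, false, flags, false));	/* 1 */
		printf("new check allows: %d\n",
		       can_userfault_new(false, true, false, flags, false));	/* 0 */
		return 0;
	}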
Link: https://lkml.kernel.org/r/20221024193336.1233616-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221024193336.1233616-2-peterx@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/userfaultfd_k.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/include/linux/userfaultfd_k.h~mm-uffd-fix-vma-check-on-userfault-for-wp
+++ a/include/linux/userfaultfd_k.h
@@ -146,9 +146,9 @@ static inline bool userfaultfd_armed(str
static inline bool vma_can_userfault(struct vm_area_struct *vma,
unsigned long vm_flags)
{
- if (vm_flags & VM_UFFD_MINOR)
- return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
-
+ if ((vm_flags & VM_UFFD_MINOR) &&
+ (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ return false;
#ifndef CONFIG_PTE_MARKER_UFFD_WP
/*
* If user requested uffd-wp but not enabled pte markers for
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-uffd-fix-vma-check-on-userfault-for-wp.patch
revert-mm-uffd-fix-warning-without-pte_marker_uffd_wp-compiled-in.patch
selftests-vm-use-memfd-for-uffd-hugetlb-tests.patch
selftests-vm-use-memfd-for-hugetlb-madvise-test.patch
selftests-vm-use-memfd-for-hugepage-mremap-test.patch
selftests-vm-drop-mnt-point-for-hugetlb-in-run_vmtestssh.patch
mm-hugetlb-unify-clearing-of-restorereserve-for-private-pages.patch
Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner") assumes that when the owner field is changed to NULL,
the lock will become free soon.  Commit 48dfb5d2560d ("locking/rwsem:
Disable preemption while trying for rwsem lock") disables preemption
when acquiring the rwsem for write.  However, preemption has not yet
been disabled when acquiring a read lock on an rwsem.  So a reader can
add RWSEM_READER_BIAS to the count without setting owner to signal a
reader, then get preempted by an RT task which spins in the writer
slowpath because owner remains NULL, leading to a livelock.
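Sketched as an interleaving (an illustration only, assuming the
preempted reader and the RT writer share a CPU; not taken from a trace):

  Reader                            RT writer (handoff bit set)
  ------                            ---------------------------
  add RWSEM_READER_BIAS to count
    (sem->owner still NULL)
  <preempted before setting owner
   or backing the bias off>
                                    rwsem_try_write_lock() fails
                                      (count carries the reader bias)
                                    rwsem_spin_on_owner() returns
                                      OWNER_NULL
                                    goto trylock_again, repeat forever;
                                      the preempted reader never runs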
One way to fix that is to disable preemption before the read lock
attempt and then remove RWSEM_READER_BIAS immediately after a failed
trylock, before re-enabling preemption.  This will remove some
optimizations that can be done by delaying the RWSEM_READER_BIAS
backoff.  Alternatively, the preempt_enable() could be delayed into
rwsem_down_read_slowpath(), even until after the wait_lock has been
acquired and released.  Another possible alternative is to limit the
number of trylock attempts without sleeping.  The last alternative
seems to be the least messy and is what this patch implements.
The limit is set to 8 to allow enough time for the other task to
hopefully complete its action.
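For the shape of the chosen alternative only, here is a toy, stand-alone
model of the bounded-retry loop (try_write_lock() and owner_looks_null()
are hypothetical stand-ins; the real change to rwsem_down_write_slowpath()
is in the diff below):

	#include <stdbool.h>
	#include <stdio.h>

	static bool try_write_lock(void)   { return false; } /* lock stays busy in this toy */
	static bool owner_looks_null(void) { return true; }  /* reader bias set, owner unset */

	int main(void)
	{
		int retries = 0;

		for (;;) {
			if (try_write_lock())
				break;				/* got the lock */
			if (owner_looks_null() && retries < 8) {
				retries++;
				continue;			/* trylock again without sleeping */
			}
			printf("going to sleep after %d NULL-owner retries\n", retries);
			retries = 0;
			break;					/* the real code would schedule() here */
		}
		return 0;
	}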
New lock events were added to track the number of NULL-owner retries
with the handoff flag set before a successful trylock.  Running a
96-thread locking microbenchmark with an equal number of readers and
writers on a 2-socket 96-thread system for 15 seconds produced the
following stats.  Note that none of the locking threads were RT tasks.
  Retries of successful trylock    Count
  -----------------------------    -----
               1                    1738
               2                      19
               3                      11
               4                       2
               5                       1
               6                       1
               7                       1
               8                       0
               X                       1
The last row is the one failed attempt that needed more than 8 retries.
So a maximum retry count of 8 should capture most of them if no RT
task is in the mix.
Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
Reported-by: Mukesh Ojha <quic_mojha(a)quicinc.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
Reviewed-and-tested-by: Mukesh Ojha <quic_mojha(a)quicinc.com>
Cc: stable(a)vger.kernel.org
---
kernel/locking/rwsem.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be2df9ea7c30..c68d76fc8c68 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1115,6 +1115,7 @@ static struct rw_semaphore __sched *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
struct rwsem_waiter waiter;
+ int null_owner_retries;
DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */
@@ -1156,7 +1157,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
set_current_state(state);
trace_contention_begin(sem, LCB_F_WRITE);
- for (;;) {
+ for (null_owner_retries = 0;;) {
if (rwsem_try_write_lock(sem, &waiter)) {
/* rwsem_try_write_lock() implies ACQUIRE on success */
break;
@@ -1182,8 +1183,21 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
owner_state = rwsem_spin_on_owner(sem);
preempt_enable();
- if (owner_state == OWNER_NULL)
+ /*
+ * A NULL owner doesn't guarantee that the lock is free.
+ * An incoming reader will temporarily increment the
+ * reader count without changing owner, and
+ * rwsem_try_write_lock() will fail if the reader is
+ * not able to decrement it in time.  Allow 8 trylock
+ * attempts when hitting a NULL owner before going
+ * to sleep.
+ */
+ if ((owner_state == OWNER_NULL) &&
+ (null_owner_retries < 8)) {
+ null_owner_retries++;
goto trylock_again;
+ }
+ null_owner_retries = 0;
}
schedule();
--
2.31.1
A non-first waiter can potentially spin in the for loop of
rwsem_down_write_slowpath() without sleeping but fail to acquire the
lock even when the rwsem is free, if the following sequence happens:
Non-first RT waiter         First waiter                 Lock holder
-------------------         ------------                 -----------
Acquire wait_lock
rwsem_try_write_lock():
  Set handoff bit if RT or
    wait too long
  Set waiter->handoff_set
Release wait_lock
                            Acquire wait_lock
                            Inherit waiter->handoff_set
                            Release wait_lock
                                                          Clear owner
                                                          Release lock
if (waiter.handoff_set) {
  rwsem_spin_on_owner();
  if (OWNER_NULL)
    goto trylock_again;
}
trylock_again:
Acquire wait_lock
rwsem_try_write_lock():
  if (first->handoff_set && (waiter != first))
    return false;
Release wait_lock
A non-first waiter cannot really acquire the rwsem even if it
mistakenly believes that it can spin on an OWNER_NULL value.  If that
waiter happens to be an RT task running on the same CPU as the first
waiter, it can block the first waiter from acquiring the rwsem, leading
to a livelock.  Fix this problem by making sure that a non-first waiter
cannot spin in the slowpath loop without sleeping.
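For reference, the slowpath branch being guarded looks roughly like the
following (condensed from the context visible in the other rwsem diff in
this mail, before the NULL-owner retry limit is applied; not a complete
listing).  Since handoff_set is now only ever set on the first waiter, a
non-first waiter's waiter.handoff_set stays false, so it falls through to
schedule() instead of taking the trylock_again path:

	/* inside the for (;;) loop of rwsem_down_write_slowpath() */
	if (waiter.handoff_set) {
		enum owner_state owner_state;

		preempt_disable();
		owner_state = rwsem_spin_on_owner(sem);
		preempt_enable();

		if (owner_state == OWNER_NULL)
			goto trylock_again;
	}

	schedule();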
Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
Reviewed-and-tested-by: Mukesh Ojha <quic_mojha(a)quicinc.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
Cc: stable(a)vger.kernel.org
---
kernel/locking/rwsem.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44873594de03..be2df9ea7c30 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
*/
if (first->handoff_set && (waiter != first))
return false;
-
- /*
- * First waiter can inherit a previously set handoff
- * bit and spin on rwsem if lock acquisition fails.
- */
- if (waiter == first)
- waiter->handoff_set = true;
}
new = count;
if (count & RWSEM_LOCK_MASK) {
+ /*
+ * A waiter (first or not) can set the handoff bit
+ * if it is an RT task or waits in the wait queue
+ * for too long.
+ */
if (has_handoff || (!rt_task(waiter->task) &&
!time_after(jiffies, waiter->timeout)))
return false;
@@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
/*
- * We have either acquired the lock with handoff bit cleared or
- * set the handoff bit.
+ * We have either acquired the lock with handoff bit cleared or set
+ * the handoff bit. Only the first waiter can have its handoff_set
+ * set here to enable optimistic spinning in the slowpath loop.
*/
if (new & RWSEM_FLAG_HANDOFF) {
- waiter->handoff_set = true;
+ first->handoff_set = true;
lockevent_inc(rwsem_wlock_handoff);
return false;
}
--
2.31.1