From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm: mempolicy: fix the wrong return value and potential pages leak of mbind
Commit d883544515aa ("mm: mempolicy: make the behavior consistent when
MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
of mbind() for a couple of corner cases. But it altered the errno
returned for some other cases; for example, mbind() should return -EFAULT
when part or all of the memory range specified by nodemask and maxnode
points outside the accessible address space, or when there is an unmapped
hole in the memory range specified by addr and len.
Fix this by preserving the errno returned by queue_pages_range(). In
addition, the pagelist may not be empty even though queue_pages_range()
returns an error; in that case put the pages back on the LRU, since
mbind_range() is not called to actually apply the policy, so those pages
should not be migrated. This was also the behavior before the
problematic commit.
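For illustration, here is a minimal userspace sketch (not part of this
patch; written for this changelog) that exercises the unmapped-hole case,
which should report -EFAULT again with this fix applied:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <numaif.h>           /* mbind(); link with -lnuma */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          long psz = sysconf(_SC_PAGESIZE);
          unsigned long nodemask = 1;     /* node 0 */
          char *addr = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (addr == MAP_FAILED)
                  return 1;
          /* punch an unmapped hole into the middle of the range */
          munmap(addr + psz, psz);

          /* with the fix, the errno from queue_pages_range() is preserved */
          if (mbind(addr, 3 * psz, MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, MPOL_MF_STRICT | MPOL_MF_MOVE) == -1)
                  printf("mbind: errno=%d (EFAULT expected)\n", errno);
          return 0;
  }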
Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.al…
Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Li Xinhai <lixinhai.lxh(a)gmail.com>
Reviewed-by: Li Xinhai <lixinhai.lxh(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org> [4.19 and 5.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
--- a/mm/mempolicy.c~mm-mempolicy-fix-the-wrong-return-value-and-potential-pages-leak-of-mbind
+++ a/mm/mempolicy.c
@@ -672,7 +672,9 @@ static const struct mm_walk_ops queue_pa
* 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
* specified.
* 0 - queue pages successfully or no misplaced page.
- * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
+ * errno - i.e. misplaced pages with MPOL_MF_STRICT specified (-EIO) or
+ * memory range specified by nodemask and maxnode points outside
+ * your accessible address space (-EFAULT)
*/
static int
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
@@ -1286,7 +1288,7 @@ static long do_mbind(unsigned long start
flags | MPOL_MF_INVERT, &pagelist);
if (ret < 0) {
- err = -EIO;
+ err = ret;
goto up_out;
}
@@ -1305,10 +1307,12 @@ static long do_mbind(unsigned long start
if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
err = -EIO;
- } else
- putback_movable_pages(&pagelist);
-
+ } else {
up_out:
+ if (!list_empty(&pagelist))
+ putback_movable_pages(&pagelist);
+ }
+
up_write(&mm->mmap_sem);
mpol_out:
mpol_put(new);
_
The patch titled
Subject: x86/mm: Split vmalloc_sync_all()
has been added to the -mm tree. Its filename is
x86-mm-split-vmalloc_sync_all.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-split-vmalloc_sync_all.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/x86-mm-split-vmalloc_sync_all.patch
------------------------------------------------------
From: Joerg Roedel <jroedel(a)suse.de>
Subject: x86/mm: Split vmalloc_sync_all()
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also added unnecessary cycles on
architectures that don't need it.
Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.
To avoid the unnecessary work on x86-64 and to gain the performance back,
split up vmalloc_sync_all() into two functions:
* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()
Most call-sites of vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above-mentioned commit.
Shile Zhang directed us to a report of an 80% regression in reaim
throughput.
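As an aside, a hypothetical caller-side sketch of the "mappings" half
(the function and buffer names are made up for this example; the
rationale matches the GHES hunk below): memory an NMI handler will touch
must have its freshly created page-table entries propagated up front,
because an NMI cannot take a vmalloc fault:

  #include <linux/mm.h>
  #include <linux/vmalloc.h>

  static void *nmi_scratch;       /* hypothetical buffer touched from NMI */

  static int alloc_nmi_scratch(void)
  {
          nmi_scratch = vmalloc(PAGE_SIZE);
          if (!nmi_scratch)
                  return -ENOMEM;

          /*
           * The new vmalloc area may have allocated p4d/pud pages that
           * other tasks' PGDs have not picked up yet; sync the new
           * mappings before they are touched from NMI context.
           */
          vmalloc_sync_mappings();
          return 0;
  }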
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK…
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.…
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reported-by: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com> [GHES]
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/mm/fault.c | 26 ++++++++++++++++++++++++--
drivers/acpi/apei/ghes.c | 2 +-
include/linux/vmalloc.h | 5 +++--
kernel/notifier.c | 2 +-
mm/nommu.c | 10 +++++++---
mm/vmalloc.c | 11 +++++++----
6 files changed, 43 insertions(+), 13 deletions(-)
--- a/arch/x86/mm/fault.c~x86-mm-split-vmalloc_sync_all
+++ a/arch/x86/mm/fault.c
@@ -189,7 +189,7 @@ static inline pmd_t *vmalloc_sync_one(pg
return pmd_k;
}
-void vmalloc_sync_all(void)
+static void vmalloc_sync(void)
{
unsigned long address;
@@ -216,6 +216,16 @@ void vmalloc_sync_all(void)
}
}
+void vmalloc_sync_mappings(void)
+{
+ vmalloc_sync();
+}
+
+void vmalloc_sync_unmappings(void)
+{
+ vmalloc_sync();
+}
+
/*
* 32-bit:
*
@@ -318,11 +328,23 @@ out:
#else /* CONFIG_X86_64: */
-void vmalloc_sync_all(void)
+void vmalloc_sync_mappings(void)
{
+ /*
+ * 64-bit mappings might allocate new p4d/pud pages
+ * that need to be propagated to all tasks' PGDs.
+ */
sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
}
+void vmalloc_sync_unmappings(void)
+{
+ /*
+ * Unmappings never allocate or free p4d/pud pages.
+ * No work is required here.
+ */
+}
+
/*
* 64-bit:
*
--- a/drivers/acpi/apei/ghes.c~x86-mm-split-vmalloc_sync_all
+++ a/drivers/acpi/apei/ghes.c
@@ -171,7 +171,7 @@ int ghes_estatus_pool_init(int num_ghes)
* New allocation must be visible in all pgd before it can be found by
* an NMI allocating from the pool.
*/
- vmalloc_sync_all();
+ vmalloc_sync_mappings();
rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
if (rc)
--- a/include/linux/vmalloc.h~x86-mm-split-vmalloc_sync_all
+++ a/include/linux/vmalloc.h
@@ -126,8 +126,9 @@ extern int remap_vmalloc_range_partial(s
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
-void vmalloc_sync_all(void);
-
+void vmalloc_sync_mappings(void);
+void vmalloc_sync_unmappings(void);
+
/*
* Lowlevel-APIs (not for driver use!)
*/
--- a/kernel/notifier.c~x86-mm-split-vmalloc_sync_all
+++ a/kernel/notifier.c
@@ -554,7 +554,7 @@ NOKPROBE_SYMBOL(notify_die);
int register_die_notifier(struct notifier_block *nb)
{
- vmalloc_sync_all();
+ vmalloc_sync_mappings();
return atomic_notifier_chain_register(&die_chain, nb);
}
EXPORT_SYMBOL_GPL(register_die_notifier);
--- a/mm/nommu.c~x86-mm-split-vmalloc_sync_all
+++ a/mm/nommu.c
@@ -359,10 +359,14 @@ void vm_unmap_aliases(void)
EXPORT_SYMBOL_GPL(vm_unmap_aliases);
/*
- * Implement a stub for vmalloc_sync_all() if the architecture chose not to
- * have one.
+ * Implement a stub for vmalloc_sync_[un]mapping() if the architecture
+ * chose not to have one.
*/
-void __weak vmalloc_sync_all(void)
+void __weak vmalloc_sync_mappings(void)
+{
+}
+
+void __weak vmalloc_sync_unmappings(void)
{
}
--- a/mm/vmalloc.c~x86-mm-split-vmalloc_sync_all
+++ a/mm/vmalloc.c
@@ -1259,7 +1259,7 @@ static bool __purge_vmap_area_lazy(unsig
* First make sure the mappings are removed from all page-tables
* before they are freed.
*/
- vmalloc_sync_all();
+ vmalloc_sync_unmappings();
/*
* TODO: to calculate a flush range without looping.
@@ -3050,16 +3050,19 @@ int remap_vmalloc_range(struct vm_area_s
EXPORT_SYMBOL(remap_vmalloc_range);
/*
- * Implement a stub for vmalloc_sync_all() if the architecture chose not to
- * have one.
+ * Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose
+ * not to have one.
*
* The purpose of this function is to make sure the vmalloc area
* mappings are identical in all page-tables in the system.
*/
-void __weak vmalloc_sync_all(void)
+void __weak vmalloc_sync_mappings(void)
{
}
+void __weak vmalloc_sync_unmappings(void)
+{
+}
static int f(pte_t *pte, unsigned long addr, void *data)
{
_
Patches currently in -mm which might be from jroedel(a)suse.de are
x86-mm-split-vmalloc_sync_all.patch
The patch titled
Subject: mm/vmalloc: fix performance regression caused by needless vmalloc_sync_all()
has been removed from the -mm tree. Its filename was
mm-vmalloc-fix-regression-caused-by-needless-vmalloc_sync_all.patch
This patch was dropped because an alternative patch was merged
------------------------------------------------------
From: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Subject: mm/vmalloc: fix performance regression caused by needless vmalloc_sync_all()
vmalloc_sync_all() was put in the common path in __purge_vmap_area_lazy()
to address a sync issue that only happens on X86_32 with PTI enabled. It
is needless on X86_64, where it caused a big regression in UnixBench
Shell8 testing. A similar regression was also reported by the 0-day
kernel test robot in reaim benchmarking:
https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK…
Fix it by restricting the call to the configurations that actually need it.
[akpm(a)linux-foundation.org: simplify config expression, use IS_ENABLED()]
[akpm(a)linux-foundation.org: build fix - go back to using an ifdef]
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.…
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Signed-off-by: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmalloc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/mm/vmalloc.c~mm-vmalloc-fix-regression-caused-by-needless-vmalloc_sync_all
+++ a/mm/vmalloc.c
@@ -1255,11 +1255,17 @@ static bool __purge_vmap_area_lazy(unsig
if (unlikely(valist == NULL))
return false;
+#ifdef CONFIG_X86_PAE
/*
- * First make sure the mappings are removed from all page-tables
- * before they are freed.
+ * First make sure the mappings are removed from all pagetables before
+ * they are freed.
+ *
+ * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is the
+ * case on a PAE kernel with PTI enabled.
*/
- vmalloc_sync_all();
+ if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
+ vmalloc_sync_all();
+#endif
/*
* TODO: to calculate a flush range without looping.
_
Patches currently in -mm which might be from shile.zhang(a)linux.alibaba.com are
Just got one of these for debugging some unrelated issues, and noticed
that Lenovo seems to have gone back to using RMI4 over SMBus with
Synaptics touchpads on some of their new systems, particularly this one.
So, let's enable RMI mode for the X1 Extreme 2nd Generation.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Cc: stable(a)vger.kernel.org
---
drivers/input/mouse/synaptics.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c
index 56fae3472114..704558d449a2 100644
--- a/drivers/input/mouse/synaptics.c
+++ b/drivers/input/mouse/synaptics.c
@@ -177,6 +177,7 @@ static const char * const smbus_pnp_ids[] = {
"LEN0096", /* X280 */
"LEN0097", /* X280 -> ALPS trackpoint */
"LEN009b", /* T580 */
+ "LEN0402", /* X1 Extreme 2nd Generation */
"LEN200f", /* T450s */
"LEN2054", /* E480 */
"LEN2055", /* E580 */
--
2.21.0
Hello all,
I get an error and a warning from a typical 5.4.0-rc7 boot.
------x--------x--error---x---------------x----
$cat 5.4.0-rc7-error.txt
[ 2.064029] Couldn't get size: 0x800000000000000e
[ 12.906185] tpm_tis MSFT0101:00: IRQ index 0 not found
$
-------------x------------------x--------------
----------x----------warning----x-------------x-------------------------x-------------
$cat 5.4.0-rc7-warn.txt
[ 0.249749] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.249783] #3
[ 0.253901] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 2.011803] i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
[ 2.013717] rtc rtc0: invalid alarm value: 2019-11-14T14:63:40
[ 2.234933] sdhci-pci 0000:00:1e.6: failed to setup card detect gpio
[ 2.280028] i2c_hid i2c-ELAN1300:00: i2c-ELAN1300:00 supply vdd not found, using dummy regulator
[ 2.280065] i2c_hid i2c-ELAN1300:00: i2c-ELAN1300:00 supply vddl not found, using dummy regulator
[ 3.043252] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 3.043254] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 15.114547] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 16.059585] uvcvideo 1-6:1.0: Entity type for entity Extension 4 was not initialized!
[ 16.059587] uvcvideo 1-6:1.0: Entity type for entity Processing 2 was not initialized!
[ 16.059588] uvcvideo 1-6:1.0: Entity type for entity Camera 1 was not initialized!
[ 23.368830] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 23.368835] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 1415.852546] done.
[ 1416.144078] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 1416.144083] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 1421.652063] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 1421.652068] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
$
------------------x-------------------x-------------------x----------------
--
software engineer
rajagiri school of engineering and technology
The following commit has been merged into the locking/core branch of tip:
Commit-ID: ca16d5bee59807bf04deaab0a8eccecd5061528c
Gitweb: https://git.kernel.org/tip/ca16d5bee59807bf04deaab0a8eccecd5061528c
Author: Yang Tao <yang.tao172(a)zte.com.cn>
AuthorDate: Wed, 06 Nov 2019 22:55:35 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Fri, 15 Nov 2019 19:10:49 +01:00
futex: Prevent robust futex exit race
Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be voluntary
or caused by a signal or fault. This prevents waiters from blocking forever.
The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task robust
list.
After a lock operation has succeeded the futex is queued in the robust list
linked list and the op_pending pointer is cleared.
After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.
The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
tries to wake up a potential waiter.
This is race free for the lock operation but unlock has two race scenarios
where waiters might not be woken up. These issues can be observed with
regular robust pthread mutexes. PI aware pthread mutexes are not affected.
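For reference, the user space side referred to above, as a minimal
sketch using the POSIX robust-mutex API (glibc maintains the robust list
and the op_pending pointer internally; compile with -pthread):

  #include <errno.h>
  #include <pthread.h>

  int main(void)
  {
          pthread_mutex_t m;
          pthread_mutexattr_t a;

          pthread_mutexattr_init(&a);
          pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
          pthread_mutex_init(&m, &a);

          /*
           * If the previous owner died holding the lock, the kernel's
           * exit path set OWNER_DIED and the lock attempt returns
           * EOWNERDEAD instead of blocking forever.
           */
          if (pthread_mutex_lock(&m) == EOWNERDEAD) {
                  /* recover shared state, then mark the mutex usable */
                  pthread_mutex_consistent(&m);
          }
          pthread_mutex_unlock(&m);
          return 0;
  }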
(1) Unlocking task is killed after unlocking the futex value in user space
before being able to wake a waiter.
        pthread_mutex_unlock()
                |
                V
        atomic_exchange_rel (&mutex->__data.__lock, 0)
                        <------------------------
                |                               |
                V                               | killed
        lll_futex_wake ()                       |
                                                |
                                                |(__lock = 0)
                                                |(enter kernel)
                                                |
                                                V
                                        do_exit()
                                         exit_mm()
                                          mm_release()
                                           exit_robust_list()
                                            handle_futex_death()
                                                |
                                                |(__lock = 0)
                                                |(uval = 0)
                                                |
                                                V
                if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                        return 0;
The sanity check which ensures that the user space futex is owned by
the exiting task prevents the wakeup of waiters, which in consequence
block forever.
(2) Waiting task is killed after a wakeup and before it can acquire the
futex in user space.
        OWNER                                   WAITER

                                                futex_wait()
        pthread_mutex_unlock()                       |
                |                                    |
                |(__lock = 0)                        |
                |                                    |
                V                                    |
        futex_wake() ---------------------->    wakeup()
                                                     |
                                                     |(return to userspace)
                                                     |(__lock = 0)
                                                     |
                                                     V
                        oldval = mutex->__data.__lock
                                              <----------------- killed
        atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,    |
                id | assume_other_futex_waiters, 0)                    |
                                                                       |
                                                                       |(enter kernel)
                                                                       |
                                                                       V
                                                               do_exit()
                                                                       |
                                                                       |
                                                                       V
                                                       handle_futex_death()
                                                                       |
                                                                       |(__lock = 0)
                                                                       |(uval = 0)
                                                                       |
                                                                       V
                if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                        return 0;
The sanity check which ensures that the user space futex is owned
by the exiting task prevents the wakeup of waiters, which seems to
be correct as the exiting task does not own the futex value, but
the consequence is that other waiters won't be woken up and block
forever.
In both scenarios the following conditions are true:
- task->robust_list->list_op_pending != NULL
- user space futex value == 0
- Regular futex (not PI)
If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems.
As this might be a false positive it can cause spurious wakeups, but the
waiter side has to handle other types of unrelated wakeups, e.g. signals,
gracefully anyway, so such a spurious wakeup will not affect the
correctness of these operations.
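To see why spurious wakeups are harmless, consider the usual shape of a
futex-based lock acquisition in user space (an illustrative sketch, not
glibc's actual code): every wakeup, spurious or not, leads back to
re-checking the lock word:

  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void futex_lock(atomic_uint *uaddr, unsigned int tid)
  {
          unsigned int expected = 0;

          /* try to take the uncontended lock: 0 -> TID */
          while (!atomic_compare_exchange_strong(uaddr, &expected, tid)) {
                  /*
                   * Sleep only while the word still holds the value we
                   * just observed; any wakeup, spurious or not, returns
                   * here and the lock word is re-checked before the
                   * lock is considered taken.
                   */
                  syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                          NULL, NULL, 0);
                  expected = 0;
          }
  }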
This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and subsequently
in malfunction of the owner died handling in user space.
The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting tasks robust list.
The eventually woken up waiter will observe the uncontended lock value and
take it over.
[ tglx: Massaged changelog and comment. Made the return explicit and not
depend on the subsequent check and added constants to hand into
handle_futex_death() instead of plain numbers. Fixed a few coding
style issues. ]
Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao <yang.tao172(a)zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59(a)zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.c…
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
---
kernel/futex.c | 58 +++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 51 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 43229f8..49eaf5b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3452,11 +3452,16 @@ err_unlock:
return ret;
}
+/* Constants for the pending_op argument of handle_futex_death */
+#define HANDLE_DEATH_PENDING true
+#define HANDLE_DEATH_LIST false
+
/*
* Process a futex-list entry, check whether it's owned by the
* dying task, and do notification if so:
*/
-static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
+static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
+ bool pi, bool pending_op)
{
u32 uval, uninitialized_var(nval), mval;
int err;
@@ -3469,6 +3474,42 @@ retry:
if (get_user(uval, uaddr))
return -1;
+ /*
+ * Special case for regular (non PI) futexes. The unlock path in
+ * user space has two race scenarios:
+ *
+ * 1. The unlock path releases the user space futex value and
+ * before it can execute the futex() syscall to wake up
+ * waiters it is killed.
+ *
+ * 2. A woken up waiter is killed before it can acquire the
+ * futex in user space.
+ *
+ * In both cases the TID validation below prevents a wakeup of
+ * potential waiters which can cause these waiters to block
+ * forever.
+ *
+ * In both cases the following conditions are met:
+ *
+ * 1) task->robust_list->list_op_pending != NULL
+ * @pending_op == true
+ * 2) User space futex value == 0
+ * 3) Regular futex: @pi == false
+ *
+ * If these conditions are met, it is safe to attempt waking up a
+ * potential waiter without touching the user space futex value and
+ * trying to set the OWNER_DIED bit. The user space futex value is
+ * uncontended and the rest of the user space mutex state is
+ * consistent, so a woken waiter will just take over the
+ * uncontended futex. Setting the OWNER_DIED bit would create
+ * inconsistent state and malfunction of the user space owner died
+ * handling.
+ */
+ if (pending_op && !pi && !uval) {
+ futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
+ return 0;
+ }
+
if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
return 0;
@@ -3588,10 +3629,11 @@ void exit_robust_list(struct task_struct *curr)
* A pending lock might already be on the list, so
* don't process it twice:
*/
- if (entry != pending)
+ if (entry != pending) {
if (handle_futex_death((void __user *)entry + futex_offset,
- curr, pi))
+ curr, pi, HANDLE_DEATH_LIST))
return;
+ }
if (rc)
return;
entry = next_entry;
@@ -3605,9 +3647,10 @@ void exit_robust_list(struct task_struct *curr)
cond_resched();
}
- if (pending)
+ if (pending) {
handle_futex_death((void __user *)pending + futex_offset,
- curr, pip);
+ curr, pip, HANDLE_DEATH_PENDING);
+ }
}
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
@@ -3784,7 +3827,8 @@ void compat_exit_robust_list(struct task_struct *curr)
if (entry != pending) {
void __user *uaddr = futex_uaddr(entry, futex_offset);
- if (handle_futex_death(uaddr, curr, pi))
+ if (handle_futex_death(uaddr, curr, pi,
+ HANDLE_DEATH_LIST))
return;
}
if (rc)
@@ -3803,7 +3847,7 @@ void compat_exit_robust_list(struct task_struct *curr)
if (pending) {
void __user *uaddr = futex_uaddr(pending, futex_offset);
- handle_futex_death(uaddr, curr, pip);
+ handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
}
}
The following commit has been merged into the locking/core branch of tip:
Commit-ID: 3e0fe4f0ac713a156b96638aa6c1fb3d2c548e5a
Gitweb: https://git.kernel.org/tip/3e0fe4f0ac713a156b96638aa6c1fb3d2c548e5a
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Wed, 06 Nov 2019 22:55:46 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Fri, 15 Nov 2019 19:11:27 +01:00
futex: Prevent exit livelock
Oleg provided the following test case:
    int main(void)
    {
            struct sched_param sp = {};

            sp.sched_priority = 2;
            assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

            int lock = vfork();
            if (!lock) {
                    sp.sched_priority = 1;
                    assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
                    _exit(0);
            }
            syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
            return 0;
    }
This creates an unkillable RT process spinning in futex_lock_pi() on a UP
machine or if the process is affine to a single CPU. The reason is:
    parent                              child

     set FIFO prio 2

     vfork()                    ->      set FIFO prio 1
      implies wait_for_child()          sched_setscheduler(...)
                                        exit()
                                        do_exit()
                                        ....
                                        mm_release()
                                          tsk->futex_state = FUTEX_STATE_EXITING;
                                          exit_futex(); (NOOP in this case)
                                          complete() --> wakes parent
     sys_futex()
        loop infinite because
        tsk->futex_state == FUTEX_STATE_EXITING
The same problem can also happen through regular preemption:

    task holds futex
    ...
    do_exit()
        tsk->futex_state = FUTEX_STATE_EXITING;

    --> preemption (unrelated wakeup of some other higher prio task, e.g. timer)

    switch_to(other_task)

    return to user
    sys_futex()
        loop infinite as above
Just for the fun of it the futex exit cleanup could trigger the wakeup
itself before the task sets its futex state to DEAD.
To cure this, the handling of the exiting owner is changed so that:
- A refcount is held on the task
- The task pointer is stored in a caller visible location
- The caller drops all locks (hash bucket, mmap_sem) and blocks
on task::futex_exit_mutex. When the mutex is acquired then
the exiting task has completed the cleanup and the state
is consistent and can be reevaluated.
This is not a pretty solution, but the only alternative would be returning
an error code to user space, which would break the state consistency
guarantee and open another can of problems, including regressions.
For stable backports the preparatory commits 01e06025a2f8 .. 8d4da5b197dc
are required as well, but for anything older than 5.3.y the backports are
going to be provided when this hits mainline as the other dependencies for
those kernels are definitely not stable material.
Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems")
Reported-by: Oleg Nesterov <oleg(a)redhat.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Stable Team <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
---
kernel/futex.c | 106 +++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 91 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 4f9d7a4..03c518e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1176,6 +1176,36 @@ out_error:
return ret;
}
+/**
+ * wait_for_owner_exiting - Block until the owner has exited
+ * @exiting: Pointer to the exiting task
+ *
+ * Caller must hold a refcount on @exiting.
+ */
+static void wait_for_owner_exiting(int ret, struct task_struct *exiting)
+{
+ if (ret != -EBUSY) {
+ WARN_ON_ONCE(exiting);
+ return;
+ }
+
+ if (WARN_ON_ONCE(ret == -EBUSY && !exiting))
+ return;
+
+ mutex_lock(&exiting->futex_exit_mutex);
+ /*
+ * No point in doing state checking here. If the waiter got here
+ * while the task was in exec()->exec_futex_release() then it can
+ * have any FUTEX_STATE_* value when the waiter has acquired the
+ * mutex. OK, if running, EXITING or DEAD if it reached exit()
+ * already. Highly unlikely and not a problem. Just one more round
+ * through the futex maze.
+ */
+ mutex_unlock(&exiting->futex_exit_mutex);
+
+ put_task_struct(exiting);
+}
+
static int handle_exit_race(u32 __user *uaddr, u32 uval,
struct task_struct *tsk)
{
@@ -1237,7 +1267,8 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
* it after doing proper sanity checks.
*/
static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
- struct futex_pi_state **ps)
+ struct futex_pi_state **ps,
+ struct task_struct **exiting)
{
pid_t pid = uval & FUTEX_TID_MASK;
struct futex_pi_state *pi_state;
@@ -1276,7 +1307,19 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
- put_task_struct(p);
+ /*
+ * If the owner task is between FUTEX_STATE_EXITING and
+ * FUTEX_STATE_DEAD then store the task pointer and keep
+ * the reference on the task struct. The calling code will
+ * drop all locks, wait for the task to reach
+ * FUTEX_STATE_DEAD and then drop the refcount. This is
+ * required to prevent a live lock when the current task
+ * preempted the exiting task between the two states.
+ */
+ if (ret == -EBUSY)
+ *exiting = p;
+ else
+ put_task_struct(p);
return ret;
}
@@ -1315,7 +1358,8 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
static int lookup_pi_state(u32 __user *uaddr, u32 uval,
struct futex_hash_bucket *hb,
- union futex_key *key, struct futex_pi_state **ps)
+ union futex_key *key, struct futex_pi_state **ps,
+ struct task_struct **exiting)
{
struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1330,7 +1374,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uaddr, uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps, exiting);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1358,6 +1402,8 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
* lookup
* @task: the task to perform the atomic lock work for. This will
* be "current" except in the case of requeue pi.
+ * @exiting: Pointer to store the task pointer of the owner task
+ * which is in the middle of exiting
* @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
*
* Return:
@@ -1366,11 +1412,17 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
* - <0 - error
*
* The hb->lock and futex_key refs shall be held by the caller.
+ *
+ * @exiting is only set when the return value is -EBUSY. If so, this holds
+ * a refcount on the exiting task on return and the caller needs to drop it
+ * after waiting for the exit to complete.
*/
static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
union futex_key *key,
struct futex_pi_state **ps,
- struct task_struct *task, int set_waiters)
+ struct task_struct *task,
+ struct task_struct **exiting,
+ int set_waiters)
{
u32 uval, newval, vpid = task_pid_vnr(task);
struct futex_q *top_waiter;
@@ -1440,7 +1492,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uaddr, newval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps, exiting);
}
/**
@@ -1858,6 +1910,8 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
* @key1: the from futex key
* @key2: the to futex key
* @ps: address to store the pi_state pointer
+ * @exiting: Pointer to store the task pointer of the owner task
+ * which is in the middle of exiting
* @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
*
* Try and get the lock on behalf of the top waiter if we can do it atomically.
@@ -1865,16 +1919,20 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
* then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.
* hb1 and hb2 must be held by the caller.
*
+ * @exiting is only set when the return value is -EBUSY. If so, this holds
+ * a refcount on the exiting task on return and the caller needs to drop it
+ * after waiting for the exit to complete.
+ *
* Return:
* - 0 - failed to acquire the lock atomically;
* - >0 - acquired the lock, return value is vpid of the top_waiter
* - <0 - error
*/
-static int futex_proxy_trylock_atomic(u32 __user *pifutex,
- struct futex_hash_bucket *hb1,
- struct futex_hash_bucket *hb2,
- union futex_key *key1, union futex_key *key2,
- struct futex_pi_state **ps, int set_waiters)
+static int
+futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2, union futex_key *key1,
+ union futex_key *key2, struct futex_pi_state **ps,
+ struct task_struct **exiting, int set_waiters)
{
struct futex_q *top_waiter = NULL;
u32 curval;
@@ -1911,7 +1969,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
*/
vpid = task_pid_vnr(top_waiter->task);
ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
- set_waiters);
+ exiting, set_waiters);
if (ret == 1) {
requeue_pi_wake_futex(top_waiter, key2, hb2);
return vpid;
@@ -2040,6 +2098,8 @@ retry_private:
}
if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
+ struct task_struct *exiting = NULL;
+
/*
* Attempt to acquire uaddr2 and wake the top waiter. If we
* intend to requeue waiters, force setting the FUTEX_WAITERS
@@ -2047,7 +2107,8 @@ retry_private:
* faults rather in the requeue loop below.
*/
ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
- &key2, &pi_state, nr_requeue);
+ &key2, &pi_state,
+ &exiting, nr_requeue);
/*
* At this point the top_waiter has either taken uaddr2 or is
@@ -2074,7 +2135,8 @@ retry_private:
* If that call succeeds then we have pi_state and an
* initial refcount on it.
*/
- ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
+ ret = lookup_pi_state(uaddr2, ret, hb2, &key2,
+ &pi_state, &exiting);
}
switch (ret) {
@@ -2104,6 +2166,12 @@ retry_private:
hb_waiters_dec(hb2);
put_futex_key(&key2);
put_futex_key(&key1);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
cond_resched();
goto retry;
default:
@@ -2810,6 +2878,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
{
struct hrtimer_sleeper timeout, *to;
struct futex_pi_state *pi_state = NULL;
+ struct task_struct *exiting = NULL;
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
@@ -2831,7 +2900,8 @@ retry:
retry_private:
hb = queue_lock(&q);
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
+ &exiting, 0);
if (unlikely(ret)) {
/*
* Atomic work succeeded and we got the lock,
@@ -2854,6 +2924,12 @@ retry_private:
*/
queue_unlock(hb);
put_futex_key(&q.key);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
cond_resched();
goto retry;
default:
Hello all,
I found a warning during a kernel build (5.3.11-rc1+).
-----------------x----------x-----------------
kernel/exit.o: warning: objtool: __x64_sys_exit_group()+0x14: unreachable instruction
-------------------x---------------x-----------
Related details:
---------------
$uname -a
Linux debian 5.3.11-rc1+ #6 SMP Tue Nov 12 01:23:06 IST 2019 x86_64 GNU/Linux
$
$gcc --version
gcc (Debian 9.2.1-14) 9.2.1 20191025
------x---has-been-cut-here--x------
(gdb) l __x64_sys_exit_group
987 /*
988 * this kills every thread in the thread group. Note that any externally
989 * wait4()-ing process will get the correct exit code - even if this
990 * thread is not the thread group leader.
991 */
992 SYSCALL_DEFINE1(exit_group, int, error_code)
993 {
994 do_group_exit((error_code & 0xff) << 8);
995 /* NOTREACHED */
996 return 0;
(gdb)
(gdb) l *__x64_sys_exit_group+0x14
0xffffffff81085404 is in __x64_sys_exit_group (kernel/exit.c:996).
991 */
992 SYSCALL_DEFINE1(exit_group, int, error_code)
993 {
994 do_group_exit((error_code & 0xff) << 8);
995 /* NOTREACHED */
996 return 0;
997 }
998
999 struct waitid_info {
1000 pid_t pid;
(gdb)
(gdb) l *__x64_sys_exit_group
0xffffffff810853f0 is in __x64_sys_exit_group (kernel/exit.c:992).
987 /*
988 * this kills every thread in the thread group. Note that any externally
989 * wait4()-ing process will get the correct exit code - even if this
990 * thread is not the thread group leader.
991 */
992 SYSCALL_DEFINE1(exit_group, int, error_code)
993 {
994 do_group_exit((error_code & 0xff) << 8);
995 /* NOTREACHED */
996 return 0;
(gdb)
--------------------x-------------x-----------------------------
The output of objdump -r -S -l --disassemble kernel/exit.o is attached.
--
software engineer
rajagiri school of engineering and technology
This is the start of the stable review cycle for the 4.4.202 release.
There are 20 patches in this series, all of which will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 17 Nov 2019 06:18:31 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.202-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.202-rc1
Vineela Tummalapalli <vineela.tummalapalli(a)intel.com>
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
Josh Poimboeuf <jpoimboe(a)redhat.com>
x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
Michal Hocko <mhocko(a)suse.com>
x86/tsx: Add config options to set tsx=on|off|auto
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation/taa: Add documentation for TSX Async Abort
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/tsx: Add "auto" option to the tsx= cmdline parameter
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation/taa: Add sysfs reporting for TSX Async Abort
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation/taa: Add mitigation for TSX Async Abort
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/cpu: Add a helper function x86_read_arch_cap_msr()
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/msr: Add the IA32_TSX_CTRL MSR
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: use Intel speculation bugs and features as derived in generic x86 code
Jim Mattson <jmattson(a)google.com>
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
Sean Christopherson <sean.j.christopherson(a)intel.com>
KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
Ben Hutchings <ben(a)decadent.org.uk>
KVM: Introduce kvm_get_arch_capabilities()
Nicholas Piggin <npiggin(a)gmail.com>
powerpc/boot: Request no dynamic linker for boot wrapper
Nicholas Piggin <npiggin(a)gmail.com>
powerpc: Fix compiling a BE kernel with a powerpc64le toolchain
Michael Ellerman <mpe(a)ellerman.id.au>
powerpc/Makefile: Use cflags-y/aflags-y for setting endian options
Jonas Gorski <jonas.gorski(a)gmail.com>
MIPS: BCM63XX: fix switch core reset on BCM6368
Junaid Shahid <junaids(a)google.com>
kvm: mmu: Don't read PDPTEs when paging is not enabled
-------------
Diffstat:
Documentation/ABI/testing/sysfs-devices-system-cpu | 2 +
Documentation/hw-vuln/tsx_async_abort.rst | 268 +++++++++++++++++++++
Documentation/kernel-parameters.txt | 62 +++++
Documentation/x86/tsx_async_abort.rst | 117 +++++++++
Makefile | 4 +-
arch/mips/bcm63xx/reset.c | 2 +-
arch/powerpc/Makefile | 31 ++-
arch/powerpc/boot/wrapper | 24 +-
arch/x86/Kconfig | 45 ++++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/msr-index.h | 16 ++
arch/x86/include/asm/nospec-branch.h | 4 +-
arch/x86/include/asm/processor.h | 7 +
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/bugs.c | 143 ++++++++++-
arch/x86/kernel/cpu/common.c | 93 ++++---
arch/x86/kernel/cpu/cpu.h | 18 ++
arch/x86/kernel/cpu/intel.c | 5 +
arch/x86/kernel/cpu/tsx.c | 140 +++++++++++
arch/x86/kvm/cpuid.c | 12 +
arch/x86/kvm/vmx.c | 15 --
arch/x86/kvm/x86.c | 53 +++-
drivers/base/cpu.c | 17 ++
include/linux/cpu.h | 5 +
25 files changed, 1019 insertions(+), 70 deletions(-)