From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm: mempolicy: fix the wrong return value and potential pages leak of mbind
Commit d883544515aa ("mm: mempolicy: make the behavior consistent when
MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
of mbind() for a couple of corner cases. But it altered the errno
returned for some other cases; for example, mbind() should return -EFAULT
when part or all of the memory range specified by nodemask and maxnode
points outside the accessible address space, or when there is an unmapped
hole in the memory range specified by addr and len.
Fix this by preserving the errno returned by queue_pages_range(). In
addition, the pagelist may not be empty even though queue_pages_range()
returns an error; in that case put the pages back on the LRU, since
mbind_range() is not called to actually apply the policy, so those pages
should not be migrated. This was also the behavior before the
problematic commit.
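For illustration, here is a minimal userspace sketch (not part of this
patch; written for this changelog) that exercises the unmapped-hole case,
which should report -EFAULT again with this fix applied:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <numaif.h>           /* mbind(); link with -lnuma */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          long psz = sysconf(_SC_PAGESIZE);
          unsigned long nodemask = 1;     /* node 0 */
          char *addr = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (addr == MAP_FAILED)
                  return 1;
          /* punch an unmapped hole into the middle of the range */
          munmap(addr + psz, psz);

          /* with the fix, the errno from queue_pages_range() is preserved */
          if (mbind(addr, 3 * psz, MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, MPOL_MF_STRICT | MPOL_MF_MOVE) == -1)
                  printf("mbind: errno=%d (EFAULT expected)\n", errno);
          return 0;
  }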
Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.al…
Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Li Xinhai <lixinhai.lxh(a)gmail.com>
Reviewed-by: Li Xinhai <lixinhai.lxh(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org> [4.19 and 5.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
--- a/mm/mempolicy.c~mm-mempolicy-fix-the-wrong-return-value-and-potential-pages-leak-of-mbind
+++ a/mm/mempolicy.c
@@ -672,7 +672,9 @@ static const struct mm_walk_ops queue_pa
* 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were
* specified.
* 0 - queue pages successfully or no misplaced page.
- * -EIO - there is misplaced page and only MPOL_MF_STRICT was specified.
+ * errno - i.e. misplaced pages with MPOL_MF_STRICT specified (-EIO) or
+ * memory range specified by nodemask and maxnode points outside
+ * your accessible address space (-EFAULT)
*/
static int
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
@@ -1286,7 +1288,7 @@ static long do_mbind(unsigned long start
flags | MPOL_MF_INVERT, &pagelist);
if (ret < 0) {
- err = -EIO;
+ err = ret;
goto up_out;
}
@@ -1305,10 +1307,12 @@ static long do_mbind(unsigned long start
if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
err = -EIO;
- } else
- putback_movable_pages(&pagelist);
-
+ } else {
up_out:
+ if (!list_empty(&pagelist))
+ putback_movable_pages(&pagelist);
+ }
+
up_write(&mm->mmap_sem);
mpol_out:
mpol_put(new);
_
The patch titled
Subject: x86/mm: Split vmalloc_sync_all()
has been added to the -mm tree. Its filename is
x86-mm-split-vmalloc_sync_all.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-split-vmalloc_sync_all.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/x86-mm-split-vmalloc_sync_all.patch
------------------------------------------------------
From: Joerg Roedel <jroedel(a)suse.de>
Subject: x86/mm: Split vmalloc_sync_all()
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also added unnecessary cycles on
architectures that don't need it.
Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.
To avoid the unnecessary work on x86-64 and to gain the performance back,
split up vmalloc_sync_all() into two functions:
* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()
Most call-sites of vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above-mentioned commit.
Shile Zhang directed us to a report of an 80% regression in reaim
throughput.
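As an aside, a hypothetical caller-side sketch of the "mappings" half
(the function and buffer names are made up for this example; the
rationale matches the GHES hunk below): memory an NMI handler will touch
must have its freshly created page-table entries propagated up front,
because an NMI cannot take a vmalloc fault:

  #include <linux/mm.h>
  #include <linux/vmalloc.h>

  static void *nmi_scratch;       /* hypothetical buffer touched from NMI */

  static int alloc_nmi_scratch(void)
  {
          nmi_scratch = vmalloc(PAGE_SIZE);
          if (!nmi_scratch)
                  return -ENOMEM;

          /*
           * The new vmalloc area may have allocated p4d/pud pages that
           * other tasks' PGDs have not picked up yet; sync the new
           * mappings before they are touched from NMI context.
           */
          vmalloc_sync_mappings();
          return 0;
  }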
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK…
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.…
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reported-by: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com> [GHES]
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/mm/fault.c | 26 ++++++++++++++++++++++++--
drivers/acpi/apei/ghes.c | 2 +-
include/linux/vmalloc.h | 5 +++--
kernel/notifier.c | 2 +-
mm/nommu.c | 10 +++++++---
mm/vmalloc.c | 11 +++++++----
6 files changed, 43 insertions(+), 13 deletions(-)
--- a/arch/x86/mm/fault.c~x86-mm-split-vmalloc_sync_all
+++ a/arch/x86/mm/fault.c
@@ -189,7 +189,7 @@ static inline pmd_t *vmalloc_sync_one(pg
return pmd_k;
}
-void vmalloc_sync_all(void)
+static void vmalloc_sync(void)
{
unsigned long address;
@@ -216,6 +216,16 @@ void vmalloc_sync_all(void)
}
}
+void vmalloc_sync_mappings(void)
+{
+ vmalloc_sync();
+}
+
+void vmalloc_sync_unmappings(void)
+{
+ vmalloc_sync();
+}
+
/*
* 32-bit:
*
@@ -318,11 +328,23 @@ out:
#else /* CONFIG_X86_64: */
-void vmalloc_sync_all(void)
+void vmalloc_sync_mappings(void)
{
+ /*
+ * 64-bit mappings might allocate new p4d/pud pages
+ * that need to be propagated to all tasks' PGDs.
+ */
sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
}
+void vmalloc_sync_unmappings(void)
+{
+ /*
+ * Unmappings never allocate or free p4d/pud pages.
+ * No work is required here.
+ */
+}
+
/*
* 64-bit:
*
--- a/drivers/acpi/apei/ghes.c~x86-mm-split-vmalloc_sync_all
+++ a/drivers/acpi/apei/ghes.c
@@ -171,7 +171,7 @@ int ghes_estatus_pool_init(int num_ghes)
* New allocation must be visible in all pgd before it can be found by
* an NMI allocating from the pool.
*/
- vmalloc_sync_all();
+ vmalloc_sync_mappings();
rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
if (rc)
--- a/include/linux/vmalloc.h~x86-mm-split-vmalloc_sync_all
+++ a/include/linux/vmalloc.h
@@ -126,8 +126,9 @@ extern int remap_vmalloc_range_partial(s
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
-void vmalloc_sync_all(void);
-
+void vmalloc_sync_mappings(void);
+void vmalloc_sync_unmappings(void);
+
/*
* Lowlevel-APIs (not for driver use!)
*/
--- a/kernel/notifier.c~x86-mm-split-vmalloc_sync_all
+++ a/kernel/notifier.c
@@ -554,7 +554,7 @@ NOKPROBE_SYMBOL(notify_die);
int register_die_notifier(struct notifier_block *nb)
{
- vmalloc_sync_all();
+ vmalloc_sync_mappings();
return atomic_notifier_chain_register(&die_chain, nb);
}
EXPORT_SYMBOL_GPL(register_die_notifier);
--- a/mm/nommu.c~x86-mm-split-vmalloc_sync_all
+++ a/mm/nommu.c
@@ -359,10 +359,14 @@ void vm_unmap_aliases(void)
EXPORT_SYMBOL_GPL(vm_unmap_aliases);
/*
- * Implement a stub for vmalloc_sync_all() if the architecture chose not to
- * have one.
+ * Implement a stub for vmalloc_sync_[un]mapping() if the architecture
+ * chose not to have one.
*/
-void __weak vmalloc_sync_all(void)
+void __weak vmalloc_sync_mappings(void)
+{
+}
+
+void __weak vmalloc_sync_unmappings(void)
{
}
--- a/mm/vmalloc.c~x86-mm-split-vmalloc_sync_all
+++ a/mm/vmalloc.c
@@ -1259,7 +1259,7 @@ static bool __purge_vmap_area_lazy(unsig
* First make sure the mappings are removed from all page-tables
* before they are freed.
*/
- vmalloc_sync_all();
+ vmalloc_sync_unmappings();
/*
* TODO: to calculate a flush range without looping.
@@ -3050,16 +3050,19 @@ int remap_vmalloc_range(struct vm_area_s
EXPORT_SYMBOL(remap_vmalloc_range);
/*
- * Implement a stub for vmalloc_sync_all() if the architecture chose not to
- * have one.
+ * Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose
+ * not to have one.
*
* The purpose of this function is to make sure the vmalloc area
* mappings are identical in all page-tables in the system.
*/
-void __weak vmalloc_sync_all(void)
+void __weak vmalloc_sync_mappings(void)
{
}
+void __weak vmalloc_sync_unmappings(void)
+{
+}
static int f(pte_t *pte, unsigned long addr, void *data)
{
_
Patches currently in -mm which might be from jroedel(a)suse.de are
x86-mm-split-vmalloc_sync_all.patch
The patch titled
Subject: mm/vmalloc: fix performance regression caused by needless vmalloc_sync_all()
has been removed from the -mm tree. Its filename was
mm-vmalloc-fix-regression-caused-by-needless-vmalloc_sync_all.patch
This patch was dropped because an alternative patch was merged
------------------------------------------------------
From: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Subject: mm/vmalloc: fix performance regression caused by needless vmalloc_sync_all()
vmalloc_sync_all() was put in the common path in __purge_vmap_area_lazy()
to address a sync issue that only happens on X86_32 with PTI enabled. It
is needless on X86_64, where it caused a big regression in UnixBench
Shell8 testing. A similar regression was also reported by the 0-day
kernel test robot in reaim benchmarking:
https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK…
Fix it by restricting the call to the configurations that actually need it.
[akpm(a)linux-foundation.org: simplify config expression, use IS_ENABLED()]
[akpm(a)linux-foundation.org: build fix - go back to using an ifdef]
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.…
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Signed-off-by: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmalloc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--- a/mm/vmalloc.c~mm-vmalloc-fix-regression-caused-by-needless-vmalloc_sync_all
+++ a/mm/vmalloc.c
@@ -1255,11 +1255,17 @@ static bool __purge_vmap_area_lazy(unsig
if (unlikely(valist == NULL))
return false;
+#ifdef CONFIG_X86_PAE
/*
- * First make sure the mappings are removed from all page-tables
- * before they are freed.
+ * First make sure the mappings are removed from all pagetables before
+ * they are freed.
+ *
+ * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is the
+ * case on a PAE kernel with PTI enabled.
*/
- vmalloc_sync_all();
+ if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
+ vmalloc_sync_all();
+#endif
/*
* TODO: to calculate a flush range without looping.
_
Patches currently in -mm which might be from shile.zhang(a)linux.alibaba.com are
Just got one of these for debugging some unrelated issues, and noticed
that Lenovo seems to have gone back to using RMI4 over SMBus with
Synaptics touchpads on some of their new systems, particularly this one.
So, let's enable RMI mode for the X1 Extreme 2nd Generation.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Cc: stable(a)vger.kernel.org
---
drivers/input/mouse/synaptics.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c
index 56fae3472114..704558d449a2 100644
--- a/drivers/input/mouse/synaptics.c
+++ b/drivers/input/mouse/synaptics.c
@@ -177,6 +177,7 @@ static const char * const smbus_pnp_ids[] = {
"LEN0096", /* X280 */
"LEN0097", /* X280 -> ALPS trackpoint */
"LEN009b", /* T580 */
+ "LEN0402", /* X1 Extreme 2nd Generation */
"LEN200f", /* T450s */
"LEN2054", /* E480 */
"LEN2055", /* E580 */
--
2.21.0
Hello all,
I get an error and a warning from a typical 5.4.0-rc7 boot.
------x--------x--error---x---------------x----
$cat 5.4.0-rc7-error.txt
[ 2.064029] Couldn't get size: 0x800000000000000e
[ 12.906185] tpm_tis MSFT0101:00: IRQ index 0 not found
$
-------------x------------------x--------------
----------x----------warning----x-------------x-------------------------x-------------
$cat 5.4.0-rc7-warn.txt
[ 0.249749] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.249783] #3
[ 0.253901] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 2.011803] i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
[ 2.013717] rtc rtc0: invalid alarm value: 2019-11-14T14:63:40
[ 2.234933] sdhci-pci 0000:00:1e.6: failed to setup card detect gpio
[ 2.280028] i2c_hid i2c-ELAN1300:00: i2c-ELAN1300:00 supply vdd not found, using dummy regulator
[ 2.280065] i2c_hid i2c-ELAN1300:00: i2c-ELAN1300:00 supply vddl not found, using dummy regulator
[ 3.043252] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 3.043254] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 15.114547] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 16.059585] uvcvideo 1-6:1.0: Entity type for entity Extension 4 was not initialized!
[ 16.059587] uvcvideo 1-6:1.0: Entity type for entity Processing 2 was not initialized!
[ 16.059588] uvcvideo 1-6:1.0: Entity type for entity Camera 1 was not initialized!
[ 23.368830] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 23.368835] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 1415.852546] done.
[ 1416.144078] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 1416.144083] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
[ 1421.652063] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x83 has wMaxPacketSize 0, skipping
[ 1421.652068] usb 1-8: config 1 interface 1 altsetting 0 endpoint 0x3 has wMaxPacketSize 0, skipping
$
------------------x-------------------x-------------------x----------------
--
software engineer
rajagiri school of engineering and technology
The following commit has been merged into the locking/core branch of tip:
Commit-ID: ca16d5bee59807bf04deaab0a8eccecd5061528c
Gitweb: https://git.kernel.org/tip/ca16d5bee59807bf04deaab0a8eccecd5061528c
Author: Yang Tao <yang.tao172(a)zte.com.cn>
AuthorDate: Wed, 06 Nov 2019 22:55:35 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Fri, 15 Nov 2019 19:10:49 +01:00
futex: Prevent robust futex exit race
Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be voluntary
or caused by a signal or fault. This prevents waiters from blocking forever.
The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task robust
list.
After a lock operation has succeeded the futex is queued in the robust list
linked list and the op_pending pointer is cleared.
After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.
The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
tries to wake up a potential waiter.
This is race free for the lock operation but unlock has two race scenarios
where waiters might not be woken up. These issues can be observed with
regular robust pthread mutexes. PI aware pthread mutexes are not affected.
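For reference, the user space side referred to above, as a minimal
sketch using the POSIX robust-mutex API (glibc maintains the robust list
and the op_pending pointer internally; compile with -pthread):

  #include <errno.h>
  #include <pthread.h>

  int main(void)
  {
          pthread_mutex_t m;
          pthread_mutexattr_t a;

          pthread_mutexattr_init(&a);
          pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
          pthread_mutex_init(&m, &a);

          /*
           * If the previous owner died holding the lock, the kernel's
           * exit path set OWNER_DIED and the lock attempt returns
           * EOWNERDEAD instead of blocking forever.
           */
          if (pthread_mutex_lock(&m) == EOWNERDEAD) {
                  /* recover shared state, then mark the mutex usable */
                  pthread_mutex_consistent(&m);
          }
          pthread_mutex_unlock(&m);
          return 0;
  }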
(1) Unlocking task is killed after unlocking the futex value in user space
before being able to wake a waiter.
        pthread_mutex_unlock()
                |
                V
        atomic_exchange_rel (&mutex->__data.__lock, 0)
                        <------------------------
                |                               |
                V                               | killed
        lll_futex_wake ()                       |
                                                |
                                                |(__lock = 0)
                                                |(enter kernel)
                                                |
                                                V
                                        do_exit()
                                         exit_mm()
                                          mm_release()
                                           exit_robust_list()
                                            handle_futex_death()
                                                |
                                                |(__lock = 0)
                                                |(uval = 0)
                                                |
                                                V
                if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                        return 0;
The sanity check which ensures that the user space futex is owned by
the exiting task prevents the wakeup of waiters, which in consequence
block forever.
(2) Waiting task is killed after a wakeup and before it can acquire the
futex in user space.
        OWNER                                   WAITER

                                                futex_wait()
        pthread_mutex_unlock()                       |
                |                                    |
                |(__lock = 0)                        |
                |                                    |
                V                                    |
        futex_wake() ---------------------->    wakeup()
                                                     |
                                                     |(return to userspace)
                                                     |(__lock = 0)
                                                     |
                                                     V
                        oldval = mutex->__data.__lock
                                              <----------------- killed
        atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,    |
                id | assume_other_futex_waiters, 0)                    |
                                                                       |
                                                                       |(enter kernel)
                                                                       |
                                                                       V
                                                               do_exit()
                                                                       |
                                                                       |
                                                                       V
                                                       handle_futex_death()
                                                                       |
                                                                       |(__lock = 0)
                                                                       |(uval = 0)
                                                                       |
                                                                       V
                if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                        return 0;
The sanity check which ensures that the user space futex is owned
by the exiting task prevents the wakeup of waiters, which seems to
be correct as the exiting task does not own the futex value, but
the consequence is that other waiters won't be woken up and block
forever.
In both scenarios the following conditions are true:
- task->robust_list->list_op_pending != NULL
- user space futex value == 0
- Regular futex (not PI)
If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems.
As this might be a false positive it can cause spurious wakeups, but the
waiter side has to handle other types of unrelated wakeups, e.g. signals,
gracefully anyway, so such a spurious wakeup will not affect the
correctness of these operations.
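To see why spurious wakeups are harmless, consider the usual shape of a
futex-based lock acquisition in user space (an illustrative sketch, not
glibc's actual code): every wakeup, spurious or not, leads back to
re-checking the lock word:

  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void futex_lock(atomic_uint *uaddr, unsigned int tid)
  {
          unsigned int expected = 0;

          /* try to take the uncontended lock: 0 -> TID */
          while (!atomic_compare_exchange_strong(uaddr, &expected, tid)) {
                  /*
                   * Sleep only while the word still holds the value we
                   * just observed; any wakeup, spurious or not, returns
                   * here and the lock word is re-checked before the
                   * lock is considered taken.
                   */
                  syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                          NULL, NULL, 0);
                  expected = 0;
          }
  }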
This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and subsequently
in malfunction of the owner died handling in user space.
The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting tasks robust list.
The eventually woken up waiter will observe the uncontended lock value and
take it over.
[ tglx: Massaged changelog and comment. Made the return explicit and not
depend on the subsequent check and added constants to hand into
handle_futex_death() instead of plain numbers. Fixed a few coding
style issues. ]
Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao <yang.tao172(a)zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59(a)zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.c…
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
---
kernel/futex.c | 58 +++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 51 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 43229f8..49eaf5b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3452,11 +3452,16 @@ err_unlock:
return ret;
}
+/* Constants for the pending_op argument of handle_futex_death */
+#define HANDLE_DEATH_PENDING true
+#define HANDLE_DEATH_LIST false
+
/*
* Process a futex-list entry, check whether it's owned by the
* dying task, and do notification if so:
*/
-static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
+static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
+ bool pi, bool pending_op)
{
u32 uval, uninitialized_var(nval), mval;
int err;
@@ -3469,6 +3474,42 @@ retry:
if (get_user(uval, uaddr))
return -1;
+ /*
+ * Special case for regular (non PI) futexes. The unlock path in
+ * user space has two race scenarios:
+ *
+ * 1. The unlock path releases the user space futex value and
+ * before it can execute the futex() syscall to wake up
+ * waiters it is killed.
+ *
+ * 2. A woken up waiter is killed before it can acquire the
+ * futex in user space.
+ *
+ * In both cases the TID validation below prevents a wakeup of
+ * potential waiters which can cause these waiters to block
+ * forever.
+ *
+ * In both cases the following conditions are met:
+ *
+ * 1) task->robust_list->list_op_pending != NULL
+ * @pending_op == true
+ * 2) User space futex value == 0
+ * 3) Regular futex: @pi == false
+ *
+ * If these conditions are met, it is safe to attempt waking up a
+ * potential waiter without touching the user space futex value and
+ * trying to set the OWNER_DIED bit. The user space futex value is
+ * uncontended and the rest of the user space mutex state is
+ * consistent, so a woken waiter will just take over the
+ * uncontended futex. Setting the OWNER_DIED bit would create
+ * inconsistent state and malfunction of the user space owner died
+ * handling.
+ */
+ if (pending_op && !pi && !uval) {
+ futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
+ return 0;
+ }
+
if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
return 0;
@@ -3588,10 +3629,11 @@ void exit_robust_list(struct task_struct *curr)
* A pending lock might already be on the list, so
* don't process it twice:
*/
- if (entry != pending)
+ if (entry != pending) {
if (handle_futex_death((void __user *)entry + futex_offset,
- curr, pi))
+ curr, pi, HANDLE_DEATH_LIST))
return;
+ }
if (rc)
return;
entry = next_entry;
@@ -3605,9 +3647,10 @@ void exit_robust_list(struct task_struct *curr)
cond_resched();
}
- if (pending)
+ if (pending) {
handle_futex_death((void __user *)pending + futex_offset,
- curr, pip);
+ curr, pip, HANDLE_DEATH_PENDING);
+ }
}
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
@@ -3784,7 +3827,8 @@ void compat_exit_robust_list(struct task_struct *curr)
if (entry != pending) {
void __user *uaddr = futex_uaddr(entry, futex_offset);
- if (handle_futex_death(uaddr, curr, pi))
+ if (handle_futex_death(uaddr, curr, pi,
+ HANDLE_DEATH_LIST))
return;
}
if (rc)
@@ -3803,7 +3847,7 @@ void compat_exit_robust_list(struct task_struct *curr)
if (pending) {
void __user *uaddr = futex_uaddr(pending, futex_offset);
- handle_futex_death(uaddr, curr, pip);
+ handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
}
}
The following commit has been merged into the locking/core branch of tip:
Commit-ID: 3e0fe4f0ac713a156b96638aa6c1fb3d2c548e5a
Gitweb: https://git.kernel.org/tip/3e0fe4f0ac713a156b96638aa6c1fb3d2c548e5a
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Wed, 06 Nov 2019 22:55:46 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Fri, 15 Nov 2019 19:11:27 +01:00
futex: Prevent exit livelock
Oleg provided the following test case:
    int main(void)
    {
            struct sched_param sp = {};

            sp.sched_priority = 2;
            assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

            int lock = vfork();
            if (!lock) {
                    sp.sched_priority = 1;
                    assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
                    _exit(0);
            }
            syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
            return 0;
    }
This creates an unkillable RT process spinning in futex_lock_pi() on a UP
machine or if the process is affine to a single CPU. The reason is:
    parent                              child

     set FIFO prio 2

     vfork()                    ->      set FIFO prio 1
      implies wait_for_child()          sched_setscheduler(...)
                                        exit()
                                        do_exit()
                                        ....
                                        mm_release()
                                          tsk->futex_state = FUTEX_STATE_EXITING;
                                          exit_futex(); (NOOP in this case)
                                          complete() --> wakes parent
     sys_futex()
        loop infinite because
        tsk->futex_state == FUTEX_STATE_EXITING
The same problem can also happen through regular preemption:

    task holds futex
    ...
    do_exit()
        tsk->futex_state = FUTEX_STATE_EXITING;

    --> preemption (unrelated wakeup of some other higher prio task, e.g. timer)

    switch_to(other_task)

    return to user
    sys_futex()
        loop infinite as above
Just for the fun of it the futex exit cleanup could trigger the wakeup
itself before the task sets its futex state to DEAD.
To cure this, the handling of the exiting owner is changed so that:
- A refcount is held on the task
- The task pointer is stored in a caller visible location
- The caller drops all locks (hash bucket, mmap_sem) and blocks
on task::futex_exit_mutex. When the mutex is acquired then
the exiting task has completed the cleanup and the state
is consistent and can be reevaluated.
This is not a pretty solution, but the only alternative would be returning
an error code to user space, which would break the state consistency
guarantee and open another can of problems, including regressions.
For stable backports the preparatory commits 01e06025a2f8 .. 8d4da5b197dc
are required as well, but for anything older than 5.3.y the backports are
going to be provided when this hits mainline as the other dependencies for
those kernels are definitely not stable material.
Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems")
Reported-by: Oleg Nesterov <oleg(a)redhat.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Stable Team <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
---
kernel/futex.c | 106 +++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 91 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 4f9d7a4..03c518e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1176,6 +1176,36 @@ out_error:
return ret;
}
+/**
+ * wait_for_owner_exiting - Block until the owner has exited
+ * @exiting: Pointer to the exiting task
+ *
+ * Caller must hold a refcount on @exiting.
+ */
+static void wait_for_owner_exiting(int ret, struct task_struct *exiting)
+{
+ if (ret != -EBUSY) {
+ WARN_ON_ONCE(exiting);
+ return;
+ }
+
+ if (WARN_ON_ONCE(ret == -EBUSY && !exiting))
+ return;
+
+ mutex_lock(&exiting->futex_exit_mutex);
+ /*
+ * No point in doing state checking here. If the waiter got here
+ * while the task was in exec()->exec_futex_release() then it can
+ * have any FUTEX_STATE_* value when the waiter has acquired the
+ * mutex. OK, if running, EXITING or DEAD if it reached exit()
+ * already. Highly unlikely and not a problem. Just one more round
+ * through the futex maze.
+ */
+ mutex_unlock(&exiting->futex_exit_mutex);
+
+ put_task_struct(exiting);
+}
+
static int handle_exit_race(u32 __user *uaddr, u32 uval,
struct task_struct *tsk)
{
@@ -1237,7 +1267,8 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
* it after doing proper sanity checks.
*/
static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
- struct futex_pi_state **ps)
+ struct futex_pi_state **ps,
+ struct task_struct **exiting)
{
pid_t pid = uval & FUTEX_TID_MASK;
struct futex_pi_state *pi_state;
@@ -1276,7 +1307,19 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
- put_task_struct(p);
+ /*
+ * If the owner task is between FUTEX_STATE_EXITING and
+ * FUTEX_STATE_DEAD then store the task pointer and keep
+ * the reference on the task struct. The calling code will
+ * drop all locks, wait for the task to reach
+ * FUTEX_STATE_DEAD and then drop the refcount. This is
+ * required to prevent a live lock when the current task
+ * preempted the exiting task between the two states.
+ */
+ if (ret == -EBUSY)
+ *exiting = p;
+ else
+ put_task_struct(p);
return ret;
}
@@ -1315,7 +1358,8 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
static int lookup_pi_state(u32 __user *uaddr, u32 uval,
struct futex_hash_bucket *hb,
- union futex_key *key, struct futex_pi_state **ps)
+ union futex_key *key, struct futex_pi_state **ps,
+ struct task_struct **exiting)
{
struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1330,7 +1374,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uaddr, uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps, exiting);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1358,6 +1402,8 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
* lookup
* @task: the task to perform the atomic lock work for. This will
* be "current" except in the case of requeue pi.
+ * @exiting: Pointer to store the task pointer of the owner task
+ * which is in the middle of exiting
* @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
*
* Return:
@@ -1366,11 +1412,17 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
* - <0 - error
*
* The hb->lock and futex_key refs shall be held by the caller.
+ *
+ * @exiting is only set when the return value is -EBUSY. If so, this holds
+ * a refcount on the exiting task on return and the caller needs to drop it
+ * after waiting for the exit to complete.
*/
static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
union futex_key *key,
struct futex_pi_state **ps,
- struct task_struct *task, int set_waiters)
+ struct task_struct *task,
+ struct task_struct **exiting,
+ int set_waiters)
{
u32 uval, newval, vpid = task_pid_vnr(task);
struct futex_q *top_waiter;
@@ -1440,7 +1492,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uaddr, newval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps, exiting);
}
/**
@@ -1858,6 +1910,8 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
* @key1: the from futex key
* @key2: the to futex key
* @ps: address to store the pi_state pointer
+ * @exiting: Pointer to store the task pointer of the owner task
+ * which is in the middle of exiting
* @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
*
* Try and get the lock on behalf of the top waiter if we can do it atomically.
@@ -1865,16 +1919,20 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
* then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.
* hb1 and hb2 must be held by the caller.
*
+ * @exiting is only set when the return value is -EBUSY. If so, this holds
+ * a refcount on the exiting task on return and the caller needs to drop it
+ * after waiting for the exit to complete.
+ *
* Return:
* - 0 - failed to acquire the lock atomically;
* - >0 - acquired the lock, return value is vpid of the top_waiter
* - <0 - error
*/
-static int futex_proxy_trylock_atomic(u32 __user *pifutex,
- struct futex_hash_bucket *hb1,
- struct futex_hash_bucket *hb2,
- union futex_key *key1, union futex_key *key2,
- struct futex_pi_state **ps, int set_waiters)
+static int
+futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2, union futex_key *key1,
+ union futex_key *key2, struct futex_pi_state **ps,
+ struct task_struct **exiting, int set_waiters)
{
struct futex_q *top_waiter = NULL;
u32 curval;
@@ -1911,7 +1969,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
*/
vpid = task_pid_vnr(top_waiter->task);
ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
- set_waiters);
+ exiting, set_waiters);
if (ret == 1) {
requeue_pi_wake_futex(top_waiter, key2, hb2);
return vpid;
@@ -2040,6 +2098,8 @@ retry_private:
}
if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
+ struct task_struct *exiting = NULL;
+
/*
* Attempt to acquire uaddr2 and wake the top waiter. If we
* intend to requeue waiters, force setting the FUTEX_WAITERS
@@ -2047,7 +2107,8 @@ retry_private:
* faults rather in the requeue loop below.
*/
ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
- &key2, &pi_state, nr_requeue);
+ &key2, &pi_state,
+ &exiting, nr_requeue);
/*
* At this point the top_waiter has either taken uaddr2 or is
@@ -2074,7 +2135,8 @@ retry_private:
* If that call succeeds then we have pi_state and an
* initial refcount on it.
*/
- ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
+ ret = lookup_pi_state(uaddr2, ret, hb2, &key2,
+ &pi_state, &exiting);
}
switch (ret) {
@@ -2104,6 +2166,12 @@ retry_private:
hb_waiters_dec(hb2);
put_futex_key(&key2);
put_futex_key(&key1);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
cond_resched();
goto retry;
default:
@@ -2810,6 +2878,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
{
struct hrtimer_sleeper timeout, *to;
struct futex_pi_state *pi_state = NULL;
+ struct task_struct *exiting = NULL;
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
@@ -2831,7 +2900,8 @@ retry:
retry_private:
hb = queue_lock(&q);
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
+ &exiting, 0);
if (unlikely(ret)) {
/*
* Atomic work succeeded and we got the lock,
@@ -2854,6 +2924,12 @@ retry_private:
*/
queue_unlock(hb);
put_futex_key(&q.key);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
cond_resched();
goto retry;
default:
Hello all,
I found a warning during a kernel build (5.3.11-rc1+).
-----------------x----------x-----------------
kernel/exit.o: warning: objtool: __x64_sys_exit_group()+0x14: unreachable instruction
-------------------x---------------x-----------
Related details:
---------------
$uname -a
Linux debian 5.3.11-rc1+ #6 SMP Tue Nov 12 01:23:06 IST 2019 x86_64 GNU/Linux
$
$gcc --version
gcc (Debian 9.2.1-14) 9.2.1 20191025
------x---has-been-cut-here--x------
(gdb) l __x64_sys_exit_group
987 /*
988 * this kills every thread in the thread group. Note that any externally
989 * wait4()-ing process will get the correct exit code - even if this
990 * thread is not the thread group leader.
991 */
992 SYSCALL_DEFINE1(exit_group, int, error_code)
993 {
994 do_group_exit((error_code & 0xff) << 8);
995 /* NOTREACHED */
996 return 0;
(gdb)
(gdb) l *__x64_sys_exit_group+0x14
0xffffffff81085404 is in __x64_sys_exit_group (kernel/exit.c:996).
991 */
992 SYSCALL_DEFINE1(exit_group, int, error_code)
993 {
994 do_group_exit((error_code & 0xff) << 8);
995 /* NOTREACHED */
996 return 0;
997 }
998
999 struct waitid_info {
1000 pid_t pid;
(gdb)
(gdb) l *__x64_sys_exit_group
0xffffffff810853f0 is in __x64_sys_exit_group (kernel/exit.c:992).
987 /*
988 * this kills every thread in the thread group. Note that any externally
989 * wait4()-ing process will get the correct exit code - even if this
990 * thread is not the thread group leader.
991 */
992 SYSCALL_DEFINE1(exit_group, int, error_code)
993 {
994 do_group_exit((error_code & 0xff) << 8);
995 /* NOTREACHED */
996 return 0;
(gdb)
--------------------x-------------x-----------------------------
The output of objdump -r -S -l --disassemble kernel/exit.o is attached.
--
software engineer
rajagiri school of engineering and technology
This is the start of the stable review cycle for the 4.4.202 release.
There are 20 patches in this series, all of which will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 17 Nov 2019 06:18:31 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.202-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.202-rc1
Vineela Tummalapalli <vineela.tummalapalli(a)intel.com>
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
Josh Poimboeuf <jpoimboe(a)redhat.com>
x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
Michal Hocko <mhocko(a)suse.com>
x86/tsx: Add config options to set tsx=on|off|auto
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation/taa: Add documentation for TSX Async Abort
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/tsx: Add "auto" option to the tsx= cmdline parameter
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation/taa: Add sysfs reporting for TSX Async Abort
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation/taa: Add mitigation for TSX Async Abort
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/cpu: Add a helper function x86_read_arch_cap_msr()
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/msr: Add the IA32_TSX_CTRL MSR
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: use Intel speculation bugs and features as derived in generic x86 code
Jim Mattson <jmattson(a)google.com>
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
Sean Christopherson <sean.j.christopherson(a)intel.com>
KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
Ben Hutchings <ben(a)decadent.org.uk>
KVM: Introduce kvm_get_arch_capabilities()
Nicholas Piggin <npiggin(a)gmail.com>
powerpc/boot: Request no dynamic linker for boot wrapper
Nicholas Piggin <npiggin(a)gmail.com>
powerpc: Fix compiling a BE kernel with a powerpc64le toolchain
Michael Ellerman <mpe(a)ellerman.id.au>
powerpc/Makefile: Use cflags-y/aflags-y for setting endian options
Jonas Gorski <jonas.gorski(a)gmail.com>
MIPS: BCM63XX: fix switch core reset on BCM6368
Junaid Shahid <junaids(a)google.com>
kvm: mmu: Don't read PDPTEs when paging is not enabled
-------------
Diffstat:
Documentation/ABI/testing/sysfs-devices-system-cpu | 2 +
Documentation/hw-vuln/tsx_async_abort.rst | 268 +++++++++++++++++++++
Documentation/kernel-parameters.txt | 62 +++++
Documentation/x86/tsx_async_abort.rst | 117 +++++++++
Makefile | 4 +-
arch/mips/bcm63xx/reset.c | 2 +-
arch/powerpc/Makefile | 31 ++-
arch/powerpc/boot/wrapper | 24 +-
arch/x86/Kconfig | 45 ++++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/msr-index.h | 16 ++
arch/x86/include/asm/nospec-branch.h | 4 +-
arch/x86/include/asm/processor.h | 7 +
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/bugs.c | 143 ++++++++++-
arch/x86/kernel/cpu/common.c | 93 ++++---
arch/x86/kernel/cpu/cpu.h | 18 ++
arch/x86/kernel/cpu/intel.c | 5 +
arch/x86/kernel/cpu/tsx.c | 140 +++++++++++
arch/x86/kvm/cpuid.c | 12 +
arch/x86/kvm/vmx.c | 15 --
arch/x86/kvm/x86.c | 53 +++-
drivers/base/cpu.c | 17 ++
include/linux/cpu.h | 5 +
25 files changed, 1019 insertions(+), 70 deletions(-)