This is the start of the stable review cycle for the 6.3.11 release. There are 29 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sat, 01 Jul 2023 18:41:39 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.3.11-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.3.y and the diffstat can be found below.
thanks,
greg k-h
------------- Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 6.3.11-rc1
Ricardo Cañuelo ricardo.canuelo@collabora.com Revert "thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe"
Mike Hommey mh@glandium.org HID: logitech-hidpp: add HIDPP_QUIRK_DELAYED_INIT for the T651.
Jason Gerecke jason.gerecke@wacom.com HID: wacom: Use ktime_t rather than int when dealing with timestamps
Ludvig Michaelsson ludvig.michaelsson@yubico.com HID: hidraw: fix data race on device refcount
Zhang Shurong zhang_shurong@foxmail.com fbdev: fix potential OOB read in fast_imageblit()
Linus Torvalds torvalds@linux-foundation.org gup: add warning if some caller would seem to want stack expansion
Linus Torvalds torvalds@linux-foundation.org mm: always expand the stack with the mmap write lock held
Linus Torvalds torvalds@linux-foundation.org execve: expand new process stack manually ahead of time
Liam R. Howlett Liam.Howlett@oracle.com mm: make find_extend_vma() fail if write lock not held
Linus Torvalds torvalds@linux-foundation.org powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
Linus Torvalds torvalds@linux-foundation.org mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
Ben Hutchings ben@decadent.org.uk arm/mm: Convert to using lock_mm_and_find_vma()
Ben Hutchings ben@decadent.org.uk riscv/mm: Convert to using lock_mm_and_find_vma()
Ben Hutchings ben@decadent.org.uk mips/mm: Convert to using lock_mm_and_find_vma()
Michael Ellerman mpe@ellerman.id.au powerpc/mm: Convert to using lock_mm_and_find_vma()
Linus Torvalds torvalds@linux-foundation.org arm64/mm: Convert to using lock_mm_and_find_vma()
Linus Torvalds torvalds@linux-foundation.org mm: make the page fault mmap locking killable
Linus Torvalds torvalds@linux-foundation.org mm: introduce new 'lock_mm_and_find_vma()' page fault helper
Peng Zhang zhangpeng.00@bytedance.com maple_tree: fix potential out-of-bounds access in mas_wr_end_piv()
Oliver Hartkopp socketcan@hartkopp.net can: isotp: isotp_sendmsg(): fix return error fix on TX path
Wyes Karny wyes.karny@amd.com cpufreq: amd-pstate: Make amd-pstate EPP driver name hyphenated
Thomas Gleixner tglx@linutronix.de x86/smp: Cure kexec() vs. mwait_play_dead() breakage
Thomas Gleixner tglx@linutronix.de x86/smp: Use dedicated cache-line for mwait_play_dead()
Thomas Gleixner tglx@linutronix.de x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
Tony Battersby tonyb@cybernetics.com x86/smp: Dont access non-existing CPUID leaf
Thomas Gleixner tglx@linutronix.de x86/smp: Make stop_other_cpus() more robust
Borislav Petkov (AMD) bp@alien8.de x86/microcode/AMD: Load late on both threads too
David Woodhouse dwmw@amazon.co.uk mm/mmap: Fix error return in do_vmi_align_munmap()
Liam R. Howlett Liam.Howlett@oracle.com mm/mmap: Fix error path in do_vmi_align_munmap()
-------------
Diffstat:
 Makefile | 4 +-
 arch/alpha/Kconfig | 1 +
 arch/alpha/mm/fault.c | 13 +--
 arch/arc/Kconfig | 1 +
 arch/arc/mm/fault.c | 11 +--
 arch/arm/Kconfig | 1 +
 arch/arm/mm/fault.c | 63 +++---------
 arch/arm64/Kconfig | 1 +
 arch/arm64/mm/fault.c | 44 ++-------
 arch/csky/Kconfig | 1 +
 arch/csky/mm/fault.c | 22 +----
 arch/hexagon/Kconfig | 1 +
 arch/hexagon/mm/vm_fault.c | 18 +---
 arch/ia64/mm/fault.c | 36 ++-----
 arch/loongarch/Kconfig | 1 +
 arch/loongarch/mm/fault.c | 16 ++--
 arch/m68k/mm/fault.c | 9 +-
 arch/microblaze/mm/fault.c | 5 +-
 arch/mips/Kconfig | 1 +
 arch/mips/mm/fault.c | 12 +--
 arch/nios2/Kconfig | 1 +
 arch/nios2/mm/fault.c | 17 +---
 arch/openrisc/mm/fault.c | 5 +-
 arch/parisc/mm/fault.c | 23 +++--
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/mm/copro_fault.c | 14 +--
 arch/powerpc/mm/fault.c | 39 +-------
 arch/riscv/Kconfig | 1 +
 arch/riscv/mm/fault.c | 31 +++---
 arch/s390/mm/fault.c | 5 +-
 arch/sh/Kconfig | 1 +
 arch/sh/mm/fault.c | 17 +---
 arch/sparc/Kconfig | 1 +
 arch/sparc/mm/fault_32.c | 32 ++-----
 arch/sparc/mm/fault_64.c | 8 +-
 arch/um/kernel/trap.c | 11 ++-
 arch/x86/Kconfig | 1 +
 arch/x86/include/asm/cpu.h | 2 +
 arch/x86/include/asm/smp.h | 2 +
 arch/x86/kernel/cpu/microcode/amd.c | 2 +-
 arch/x86/kernel/process.c | 28 +++++-
 arch/x86/kernel/smp.c | 73 ++++++++------
 arch/x86/kernel/smpboot.c | 81 ++++++++++++++--
 arch/x86/mm/fault.c | 52 +---------
 arch/xtensa/Kconfig | 1 +
 arch/xtensa/mm/fault.c | 14 +--
 drivers/cpufreq/amd-pstate.c | 2 +-
 drivers/hid/hid-logitech-hidpp.c | 2 +-
 drivers/hid/hidraw.c | 9 +-
 drivers/hid/wacom_wac.c | 6 +-
 drivers/hid/wacom_wac.h | 2 +-
 drivers/iommu/amd/iommu_v2.c | 4 +-
 drivers/iommu/iommu-sva.c | 2 +-
 drivers/thermal/mediatek/auxadc_thermal.c | 14 +--
 drivers/video/fbdev/core/sysimgblt.c | 2 +-
 fs/binfmt_elf.c | 6 +-
 fs/exec.c | 38 ++++----
 include/linux/mm.h | 16 ++--
 lib/maple_tree.c | 11 ++-
 mm/Kconfig | 4 +
 mm/gup.c | 14 ++-
 mm/memory.c | 127 +++++++++++++++++++++++++
 mm/mmap.c | 153 +++++++++++++++++++++++-------
 mm/nommu.c | 17 ++--
 net/can/isotp.c | 5 +-
 65 files changed, 614 insertions(+), 544 deletions(-)
From: "Liam R. Howlett" Liam.Howlett@oracle.com
commit 606c812eb1d5b5fb0dd9e330ca94b52d7c227830 upstream.
The error unrolling was leaving the VMAs detached in many cases and leaving the locked_vm statistic altered, and skipping the unrolling entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the necessary goto for the failed vma tree write, and fix the locked_vm statistic by only updating after the vma tree write succeeds.
Fixes: 763ecb035029 ("mm: remove the vma linked list") Reported-by: Vegard Nossum vegard.nossum@oracle.com Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [ dwmw2: Strictly, the original patch wasn't *re-attaching* the detached VMAs. They *were* still attached but just had the 'detached' flag set, which is an optimisation. Which doesn't exist in 6.3, so drop that. Also drop the call to vma_start_write() which came in with the per-VMA locking in 6.4. ] Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/mmap.c | 29 +++++++++++------------------ 1 file changed, 11 insertions(+), 18 deletions(-)
--- a/mm/mmap.c +++ b/mm/mmap.c @@ -2280,19 +2280,6 @@ int split_vma(struct vma_iterator *vmi, return __split_vma(vmi, vma, addr, new_below); }
-static inline int munmap_sidetree(struct vm_area_struct *vma, - struct ma_state *mas_detach) -{ - mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1); - if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) - return -ENOMEM; - - if (vma->vm_flags & VM_LOCKED) - vma->vm_mm->locked_vm -= vma_pages(vma); - - return 0; -} - /* * do_vmi_align_munmap() - munmap the aligned region from @start to @end. * @vmi: The vma iterator @@ -2314,6 +2301,7 @@ do_vmi_align_munmap(struct vma_iterator struct maple_tree mt_detach; int count = 0; int error = -ENOMEM; + unsigned long locked_vm = 0; MA_STATE(mas_detach, &mt_detach, 0, 0); mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); mt_set_external_lock(&mt_detach, &mm->mmap_lock); @@ -2359,9 +2347,11 @@ do_vmi_align_munmap(struct vma_iterator if (error) goto end_split_failed; } - error = munmap_sidetree(next, &mas_detach); - if (error) - goto munmap_sidetree_failed; + mas_set_range(&mas_detach, next->vm_start, next->vm_end - 1); + if (mas_store_gfp(&mas_detach, next, GFP_KERNEL)) + goto munmap_gather_failed; + if (next->vm_flags & VM_LOCKED) + locked_vm += vma_pages(next);
count++; #ifdef CONFIG_DEBUG_VM_MAPLE_TREE @@ -2407,10 +2397,12 @@ do_vmi_align_munmap(struct vma_iterator } #endif /* Point of no return */ + error = -ENOMEM; vma_iter_set(vmi, start); if (vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL)) - return -ENOMEM; + goto clear_tree_failed;
+ mm->locked_vm -= locked_vm; mm->map_count -= count; /* * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or @@ -2440,8 +2432,9 @@ do_vmi_align_munmap(struct vma_iterator validate_mm(mm); return downgrade ? 1 : 0;
+clear_tree_failed: userfaultfd_error: -munmap_sidetree_failed: +munmap_gather_failed: end_split_failed: __mt_destroy(&mt_detach); start_split_failed:
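The locked_vm part of this fix follows a common pattern: tally the change in a local variable while gathering, and touch the shared statistic only once the operation can no longer fail. A minimal sketch of that pattern in plain C follows. It is not kernel code; struct fake_mm, fake_munmap() and the commit_fails parameter are hypothetical names introduced purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the two mm counters the patch touches. */
struct fake_mm {
	unsigned long locked_vm;
	int map_count;
};

/*
 * Gather 'n' locked regions, then commit. 'commit_fails' models the
 * final tree write failing (an assumption made for illustration).
 */
int fake_munmap(struct fake_mm *mm, const unsigned long *pages, int n,
		int commit_fails)
{
	unsigned long locked_vm = 0;	/* local tally, not yet visible in mm */
	int count = 0;
	int i;

	assert(mm != NULL && pages != NULL);

	for (i = 0; i < n; i++) {
		locked_vm += pages[i];
		count++;
	}

	if (commit_fails)
		return -ENOMEM;		/* mm is untouched, nothing to undo */

	/* Point of no return: only now update the shared statistics. */
	mm->locked_vm -= locked_vm;
	mm->map_count -= count;
	return 0;
}
```

Because the counters are only written after the last fallible step, the error path needs no compensation code for them.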
From: David Woodhouse dwmw@amazon.co.uk
commit 6c26bd4384da24841bac4f067741bbca18b0fb74 upstream,
If mas_store_gfp() in the gather loop failed, the 'error' variable that ultimately gets returned was not being set. In many cases, its original value of -ENOMEM was still in place, and that was fine. But if VMAs had been split at the start or end of the range, then 'error' could be zero.
Change to the 'error = foo(); if (error) goto â¦' idiom to fix the bug.
Also clean up a later case which avoided the same bug by *explicitly* setting error = -ENOMEM right before calling the function that might return -ENOMEM.
In a final cosmetic change, move the 'Point of no return' comment to *after* the goto. That's been in the wrong place since the preallocation was removed, and this new error path was added.
Fixes: 606c812eb1d5 ("mm/mmap: Fix error path in do_vmi_align_munmap()") Signed-off-by: David Woodhouse dwmw@amazon.co.uk Cc: stable@vger.kernel.org Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Reviewed-by: Liam R. Howlett Liam.Howlett@oracle.com Signed-off-by: David Woodhouse dwmw@amazon.co.uk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/mmap.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/mmap.c +++ b/mm/mmap.c @@ -2348,7 +2348,8 @@ do_vmi_align_munmap(struct vma_iterator goto end_split_failed; } mas_set_range(&mas_detach, next->vm_start, next->vm_end - 1); - if (mas_store_gfp(&mas_detach, next, GFP_KERNEL)) + error = mas_store_gfp(&mas_detach, next, GFP_KERNEL); + if (error) goto munmap_gather_failed; if (next->vm_flags & VM_LOCKED) locked_vm += vma_pages(next); @@ -2396,12 +2397,12 @@ do_vmi_align_munmap(struct vma_iterator BUG_ON(count != test_count); } #endif - /* Point of no return */ - error = -ENOMEM; vma_iter_set(vmi, start); - if (vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL)) + error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL); + if (error) goto clear_tree_failed;
+ /* Point of no return */ mm->locked_vm -= locked_vm; mm->map_count -= count; /*
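The bug class described in this commit message can be reproduced outside the kernel. The sketch below is a simplified illustration, not the mm/mmap.c code: fake_store(), gather_buggy() and gather_fixed() are hypothetical names, and the "earlier step" stands in for a successful VMA split that leaves 'error' at zero.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical fallible step, standing in for a tree store. */
static int fake_store(int should_fail)
{
	assert(should_fail == 0 || should_fail == 1);
	return should_fail ? -ENOMEM : 0;
}

/* Buggy shape: the failing call's result is tested but never saved. */
int gather_buggy(int store_fails)
{
	int error = -ENOMEM;

	error = fake_store(0);		/* earlier step succeeds: error == 0 */
	if (error)
		goto out;

	if (fake_store(store_fails))
		goto out;		/* failure, but error still holds 0 */
out:
	return error;
}

/* Fixed shape: 'error = foo(); if (error) goto ...' on every step. */
int gather_fixed(int store_fails)
{
	int error;

	error = fake_store(0);
	if (error)
		goto out;

	error = fake_store(store_fails);
	if (error)
		goto out;
out:
	return error;
}
```

With a failing store, gather_buggy() reports success to its caller while gather_fixed() returns -ENOMEM.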
On Thu, 2023-06-29 at 20:43 +0200, Greg Kroah-Hartman wrote:
From: David Woodhouse dwmw@amazon.co.uk
commit 6c26bd4384da24841bac4f067741bbca18b0fb74 upstream,
If mas_store_gfp() in the gather loop failed, the 'error' variable that ultimately gets returned was not being set. In many cases, its original value of -ENOMEM was still in place, and that was fine. But if VMAs had been split at the start or end of the range, then 'error' could be zero.
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
Hrm, that isn't what the original commit message said. It said:
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
This far into the 21st century, we don't see a lot of tools injecting Mojibake any more; the mantra of "everything is UTF-8, all of the time" mostly seems to work.
Granted, there are more important problems in the world, but it'd be good to identify where that happened and file bugs if needed.
On Mon, Jul 03, 2023 at 10:23:59AM +0100, David Woodhouse wrote:
On Thu, 2023-06-29 at 20:43 +0200, Greg Kroah-Hartman wrote:
From: David Woodhouse dwmw@amazon.co.uk
commit 6c26bd4384da24841bac4f067741bbca18b0fb74 upstream,
If mas_store_gfp() in the gather loop failed, the 'error' variable that ultimately gets returned was not being set. In many cases, its original value of -ENOMEM was still in place, and that was fine. But if VMAs had been split at the start or end of the range, then 'error' could be zero.
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
Hrm, that isn't what the original commit message said. It said:
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
This far into the 21st century, we don't see a lot of tools injecting Mojibake any more; the mantra of "everything is UTF-8, all of the time" mostly seems to work.
Granted, there are more important problems in the world, but it'd be good to identify where that happened and file bugs if needed.
This is probably due to me going from 'git format-patch' to 'quilt import' and then to 'git am' as part of the workflow I use here. I'll try to narrow it down as to where this went wrong...
thanks,
greg k-h
From: Borislav Petkov (AMD) bp@alien8.de
commit a32b0f0db3f396f1c9be2fe621e77c09ec3d8e7d upstream.
Do the same as early loading - load on both threads.
Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230605141332.25948-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/cpu/microcode/amd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -705,7 +705,7 @@ static enum ucode_state apply_microcode_ rdmsr(MSR_AMD64_PATCH_LEVEL, rev, dummy);
/* need to apply patch? */ - if (rev >= mc_amd->hdr.patch_id) { + if (rev > mc_amd->hdr.patch_id) { ret = UCODE_OK; goto out; }
From: Thomas Gleixner tglx@linutronix.de
commit 1f5e7eb7868e42227ac426c96d437117e6e06e8e upstream.
Tony reported intermittent lockups on poweroff. His analysis identified the wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that on SME enabled machines a kexec() does not leave any stale data in the caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly from CPUID. But that readout does not check whether the CPUID leaf is available or not. If it's not available the CPU will return the value of the highest supported leaf instead. Depending on the content the "SME" bit might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes the observed hangs go away, but it does not fix the underlying problem:
    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds which causes the system to hang.
This issue exists independent of WBINVD, but the delays caused by WBINVD make it more prominent.
Make this more robust by adding a cpumask which is initialized to the online CPU mask before sending the IPIs and CPUs clear their bit in stop_this_cpu() after the WBINVD completed. Check for that cpumask to become empty in stop_other_cpus() instead of watching num_online_cpus().
The cpumask cannot plug all holes either, but it's better than a raw counter and allows to restrict the NMI fallback IPI to be sent only the CPUs which have not reported within the timeout window.
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use") Reported-by: Tony Battersby tonyb@cybernetics.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Ashok Raj ashok.raj@intel.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics... Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/include/asm/cpu.h | 2 + arch/x86/kernel/process.c | 23 +++++++++++++++- arch/x86/kernel/smp.c | 62 +++++++++++++++++++++++++++++---------------- 3 files changed, 64 insertions(+), 23 deletions(-)
--- a/arch/x86/include/asm/cpu.h +++ b/arch/x86/include/asm/cpu.h @@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void); int intel_find_matching_signature(void *mc, unsigned int csig, int cpf); int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
+extern struct cpumask cpus_stop_mask; + #endif /* _ASM_X86_CPU_H */ --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -752,13 +752,23 @@ bool xen_set_default_idle(void) } #endif
+struct cpumask cpus_stop_mask; + void __noreturn stop_this_cpu(void *dummy) { + unsigned int cpu = smp_processor_id(); + local_irq_disable(); + /* - * Remove this CPU: + * Remove this CPU from the online mask and disable it + * unconditionally. This might be redundant in case that the reboot + * vector was handled late and stop_other_cpus() sent an NMI. + * + * According to SDM and APM NMIs can be accepted even after soft + * disabling the local APIC. */ - set_cpu_online(smp_processor_id(), false); + set_cpu_online(cpu, false); disable_local_APIC(); mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
@@ -776,6 +786,15 @@ void __noreturn stop_this_cpu(void *dumm */ if (cpuid_eax(0x8000001f) & BIT(0)) native_wbinvd(); + + /* + * This brings a cache line back and dirties it, but + * native_stop_other_cpus() will overwrite cpus_stop_mask after it + * observed that all CPUs reported stop. This write will invalidate + * the related cache line on this CPU. + */ + cpumask_clear_cpu(cpu, &cpus_stop_mask); + for (;;) { /* * Use native_halt() so that memory contents don't change --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -27,6 +27,7 @@ #include <asm/mmu_context.h> #include <asm/proto.h> #include <asm/apic.h> +#include <asm/cpu.h> #include <asm/idtentry.h> #include <asm/nmi.h> #include <asm/mce.h> @@ -146,31 +147,43 @@ static int register_stop_handler(void)
static void native_stop_other_cpus(int wait) { - unsigned long flags; - unsigned long timeout; + unsigned int cpu = smp_processor_id(); + unsigned long flags, timeout;
if (reboot_force) return;
- /* - * Use an own vector here because smp_call_function - * does lots of things not suitable in a panic situation. - */ + /* Only proceed if this is the first CPU to reach this code */ + if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1) + return;
/* - * We start by using the REBOOT_VECTOR irq. - * The irq is treated as a sync point to allow critical - * regions of code on other cpus to release their spin locks - * and re-enable irqs. Jumping straight to an NMI might - * accidentally cause deadlocks with further shutdown/panic - * code. By syncing, we give the cpus up to one second to - * finish their work before we force them off with the NMI. + * 1) Send an IPI on the reboot vector to all other CPUs. + * + * The other CPUs should react on it after leaving critical + * sections and re-enabling interrupts. They might still hold + * locks, but there is nothing which can be done about that. + * + * 2) Wait for all other CPUs to report that they reached the + * HLT loop in stop_this_cpu() + * + * 3) If #2 timed out send an NMI to the CPUs which did not + * yet report + * + * 4) Wait for all other CPUs to report that they reached the + * HLT loop in stop_this_cpu() + * + * #3 can obviously race against a CPU reaching the HLT loop late. + * That CPU will have reported already and the "have all CPUs + * reached HLT" condition will be true despite the fact that the + * other CPU is still handling the NMI. Again, there is no + * protection against that as "disabled" APICs still respond to + * NMIs. */ - if (num_online_cpus() > 1) { - /* did someone beat us here? */ - if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1) - return; + cpumask_copy(&cpus_stop_mask, cpu_online_mask); + cpumask_clear_cpu(cpu, &cpus_stop_mask);
+ if (!cpumask_empty(&cpus_stop_mask)) { /* sync above data before sending IRQ */ wmb();
@@ -183,12 +196,12 @@ static void native_stop_other_cpus(int w * CPUs reach shutdown state. */ timeout = USEC_PER_SEC; - while (num_online_cpus() > 1 && timeout--) + while (!cpumask_empty(&cpus_stop_mask) && timeout--) udelay(1); }
/* if the REBOOT_VECTOR didn't work, try with the NMI */ - if (num_online_cpus() > 1) { + if (!cpumask_empty(&cpus_stop_mask)) { /* * If NMI IPI is enabled, try to register the stop handler * and send the IPI. In any case try to wait for the other @@ -200,7 +213,8 @@ static void native_stop_other_cpus(int w
pr_emerg("Shutting down cpus with NMI\n");
- apic_send_IPI_allbutself(NMI_VECTOR); + for_each_cpu(cpu, &cpus_stop_mask) + apic->send_IPI(cpu, NMI_VECTOR); } /* * Don't wait longer than 10 ms if the caller didn't @@ -208,7 +222,7 @@ static void native_stop_other_cpus(int w * one or more CPUs do not reach shutdown state. */ timeout = USEC_PER_MSEC * 10; - while (num_online_cpus() > 1 && (wait || timeout--)) + while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--)) udelay(1); }
@@ -216,6 +230,12 @@ static void native_stop_other_cpus(int w disable_local_APIC(); mcheck_cpu_clear(this_cpu_ptr(&cpu_info)); local_irq_restore(flags); + + /* + * Ensure that the cpus_stop_mask cache lines are invalidated on + * the other CPUs. See comment vs. SME in stop_this_cpu(). + */ + cpumask_clear(&cpus_stop_mask); }
/*
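The idea of replacing the raw online counter with a per-CPU "has reported" mask can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel implementation: a single unsigned long with C11 atomics stands in for struct cpumask, and stop_begin(), stop_report(), stop_all_reported() and stop_pending() are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for cpus_stop_mask: one bit per CPU. */
static atomic_ulong stop_mask;

/* Initiator: expect a report from every online CPU except itself. */
void stop_begin(unsigned long online_mask, unsigned int self)
{
	assert(self < MAX_CPUS);
	atomic_store(&stop_mask, online_mask & ~(1UL << self));
}

/* Stopping CPU: clear the own bit only after the slow work is done. */
void stop_report(unsigned int cpu)
{
	assert(cpu < MAX_CPUS);
	atomic_fetch_and(&stop_mask, ~(1UL << cpu));
}

/* Initiator: true once every expected CPU has reported. */
bool stop_all_reported(void)
{
	return atomic_load(&stop_mask) == 0;
}

/* Initiator: the CPUs that still need the fallback (NMI in the patch). */
unsigned long stop_pending(void)
{
	return atomic_load(&stop_mask);
}
```

Unlike a counter, the mask says which CPUs are still outstanding, which is what lets the patch send the NMI only to those CPUs.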
From: Tony Battersby tonyb@cybernetics.com
commit 9b040453d4440659f33dc6f0aa26af418ebfe70b upstream.
stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. Intel CPUs return the content of the highest supported leaf when a non-existing leaf is read, while AMD CPUs return all zeros for unsupported leafs.
So the result of the test on Intel CPUs is lottery.
While harmless it's incorrect and causes the conditional wbinvd() to be issued where not required.
Check whether the leaf is supported before reading it.
[ tglx: Adjusted changelog ]
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use") Signed-off-by: Tony Battersby tonyb@cybernetics.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Mario Limonciello mario.limonciello@amd.com Reviewed-by: Borislav Petkov (AMD) bp@alien8.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.c... Link: https://lore.kernel.org/r/20230615193330.322186388@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/process.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -756,6 +756,7 @@ struct cpumask cpus_stop_mask;
void __noreturn stop_this_cpu(void *dummy) { + struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info); unsigned int cpu = smp_processor_id();
local_irq_disable(); @@ -770,7 +771,7 @@ void __noreturn stop_this_cpu(void *dumm */ set_cpu_online(cpu, false); disable_local_APIC(); - mcheck_cpu_clear(this_cpu_ptr(&cpu_info)); + mcheck_cpu_clear(c);
/* * Use wbinvd on processors that support SME. This provides support @@ -784,7 +785,7 @@ void __noreturn stop_this_cpu(void *dumm * Test the CPUID bit directly because the machine might've cleared * X86_FEATURE_SME due to cmdline options. */ - if (cpuid_eax(0x8000001f) & BIT(0)) + if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0))) native_wbinvd();
/*
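The guard added here, checking the highest supported extended leaf before trusting a leaf's contents, can be modelled without touching real CPUID. In the sketch below the CPUID read is abstracted behind a function pointer so the logic can be exercised anywhere; sme_reported(), read_eax_fn and fake_garbage_eax() are hypothetical names, and fake_garbage_eax() only models a CPU that returns unrelated data for an unsupported leaf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SME_LEAF 0x8000001fu

/* Hypothetical abstraction of "read EAX of a CPUID leaf". */
typedef uint32_t (*read_eax_fn)(uint32_t leaf);

/* Only consult the leaf if the CPU says it exists. */
bool sme_reported(uint32_t max_ext_leaf, read_eax_fn read_eax)
{
	assert(read_eax != NULL);

	if (max_ext_leaf < SME_LEAF)
		return false;	/* leaf absent: its content is meaningless */

	return (read_eax(SME_LEAF) & 1u) != 0;
}

/* Models a CPU answering an unsupported leaf with unrelated bits. */
uint32_t fake_garbage_eax(uint32_t leaf)
{
	(void)leaf;
	return 0xffffffffu;
}
```

Without the range check, the first case in the test would report SME as present purely by accident.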
From: Thomas Gleixner tglx@linutronix.de
commit 2affa6d6db28855e6340b060b809c23477aa546e upstream.
The wmb()s before sending the IPIs are not synchronizing anything.
If at all then the apic IPI functions have to provide or act as appropriate barriers.
Remove these cargo cult barriers which have no explanation of what they are synchronizing.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Borislav Petkov (AMD) bp@alien8.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615193330.378358382@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/smp.c | 6 ------ 1 file changed, 6 deletions(-)
--- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -184,9 +184,6 @@ static void native_stop_other_cpus(int w cpumask_clear_cpu(cpu, &cpus_stop_mask);
if (!cpumask_empty(&cpus_stop_mask)) { - /* sync above data before sending IRQ */ - wmb(); - apic_send_IPI_allbutself(REBOOT_VECTOR);
/* @@ -208,9 +205,6 @@ static void native_stop_other_cpus(int w * CPUs to stop. */ if (!smp_no_nmi_ipi && !register_stop_handler()) { - /* Sync above data before sending IRQ */ - wmb(); - pr_emerg("Shutting down cpus with NMI\n");
for_each_cpu(cpu, &cpus_stop_mask)
From: Thomas Gleixner tglx@linutronix.de
commit f9c9987bf52f4e42e940ae217333ebb5a4c3b506 upstream.
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an obvious choice as all what is needed is a cache line which is not written by other CPUs.
But there is a use case where a "dead" CPU needs to be brought out of MWAIT: kexec().
This is required as kexec() can overwrite text, pagetables, stacks and the monitored cacheline of the original kernel. The latter causes MWAIT to resume execution which obviously causes havoc on the kexec kernel which results usually in triple faults.
Use a dedicated per CPU storage to prepare for that.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Ashok Raj ashok.raj@intel.com Reviewed-by: Borislav Petkov (AMD) bp@alien8.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/smpboot.c | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-)
--- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map); DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info); EXPORT_PER_CPU_SYMBOL(cpu_info);
+struct mwait_cpu_dead { + unsigned int control; + unsigned int status; +}; + +/* + * Cache line aligned data for mwait_play_dead(). Separate on purpose so + * that it's unlikely to be touched by other CPUs. + */ +static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead); + /* Logical package management. We might want to allocate that dynamically */ unsigned int __max_logical_packages __read_mostly; EXPORT_SYMBOL(__max_logical_packages); @@ -1750,10 +1761,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0); */ static inline void mwait_play_dead(void) { + struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead); unsigned int eax, ebx, ecx, edx; unsigned int highest_cstate = 0; unsigned int highest_subcstate = 0; - void *mwait_ptr; int i;
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD || @@ -1788,13 +1799,6 @@ static inline void mwait_play_dead(void) (highest_subcstate - 1); }
- /* - * This should be a memory location in a cache line which is - * unlikely to be touched by other processors. The actual - * content is immaterial as it is not actually modified in any way. - */ - mwait_ptr = &current_thread_info()->flags; - wbinvd();
while (1) { @@ -1806,9 +1810,9 @@ static inline void mwait_play_dead(void) * case where we return around the loop. */ mb(); - clflush(mwait_ptr); + clflush(md); mb(); - __monitor(mwait_ptr, 0, 0); + __monitor(md, 0, 0); mb(); __mwait(eax, 0);
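What DEFINE_PER_CPU_ALIGNED buys here is that each CPU's monitored word sits in a cache line of its own. A rough userspace analogue using C11 alignment is sketched below. The 64-byte line size is an assumption (it is not stated in the patch), and dead_slot, slot_for() and same_cache_line() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_CACHE_LINE 64	/* assumption: typical x86 line size */

/* Aligning the first member aligns, and pads, the whole struct. */
struct dead_slot {
	_Alignas(ASSUMED_CACHE_LINE) unsigned int control;
	unsigned int status;
};

/* One slot per hypothetical CPU. */
static struct dead_slot slots[4];

struct dead_slot *slot_for(unsigned int cpu)
{
	assert(cpu < 4);
	return &slots[cpu];
}

/* True if both addresses fall into the same assumed cache line. */
bool same_cache_line(const void *a, const void *b)
{
	return (uintptr_t)a / ASSUMED_CACHE_LINE ==
	       (uintptr_t)b / ASSUMED_CACHE_LINE;
}
```

A write to one CPU's slot therefore cannot hit the line another CPU is monitoring, while control and status of the same CPU deliberately share a line.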
From: Thomas Gleixner tglx@linutronix.de
commit d7893093a7417527c0d73c9832244e65c9d0114f upstream.
TLDR: It's a mess.
When kexec() is executed on a system with offline CPUs, which are parked in mwait_play_dead() it can end up in a triple fault during the bootup of the kexec kernel or cause hard to diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's text, page tables, data and stack. If it writes to the cache line which is monitored by a previously offlined CPU, MWAIT resumes execution and ends up executing the wrong text, dereferencing overwritten page tables or corrupting the kexec kernels data.
Cure this by bringing the offlined CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT resume execution. The written control word tells the offlined CPUs to issue HLT, which does not have the MWAIT problem.
That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as those make it come out of HLT.
A follow up change will put them into INIT, which protects at least against NMI and SMI.
Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case") Reported-by: Ashok Raj ashok.raj@intel.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Ashok Raj ashok.raj@intel.com Reviewed-by: Ashok Raj ashok.raj@intel.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/include/asm/smp.h | 2 + arch/x86/kernel/smp.c | 5 +++ arch/x86/kernel/smpboot.c | 59 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 66 insertions(+)
--- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -131,6 +131,8 @@ void wbinvd_on_cpu(int cpu); int wbinvd_on_all_cpus(void); void cond_wakeup_cpu0(void);
+void smp_kick_mwait_play_dead(void); + void native_smp_send_reschedule(int cpu); void native_send_call_func_ipi(const struct cpumask *mask); void native_send_call_func_single_ipi(int cpu); --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -21,6 +21,7 @@ #include <linux/interrupt.h> #include <linux/cpu.h> #include <linux/gfp.h> +#include <linux/kexec.h>
#include <asm/mtrr.h> #include <asm/tlbflush.h> @@ -157,6 +158,10 @@ static void native_stop_other_cpus(int w if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1) return;
+ /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */ + if (kexec_in_progress) + smp_kick_mwait_play_dead(); + /* * 1) Send an IPI on the reboot vector to all other CPUs. * --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -53,6 +53,7 @@ #include <linux/tboot.h> #include <linux/gfp.h> #include <linux/cpuidle.h> +#include <linux/kexec.h> #include <linux/numa.h> #include <linux/pgtable.h> #include <linux/overflow.h> @@ -106,6 +107,9 @@ struct mwait_cpu_dead { unsigned int status; };
+#define CPUDEAD_MWAIT_WAIT 0xDEADBEEF +#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEAD + /* * Cache line aligned data for mwait_play_dead(). Separate on purpose so * that it's unlikely to be touched by other CPUs. @@ -168,6 +172,10 @@ static void smp_callin(void) { int cpuid;
+ /* Mop up eventual mwait_play_dead() wreckage */ + this_cpu_write(mwait_cpu_dead.status, 0); + this_cpu_write(mwait_cpu_dead.control, 0); + /* * If waken up by an INIT in an 82489DX configuration * cpu_callout_mask guarantees we don't get here before @@ -1799,6 +1807,10 @@ static inline void mwait_play_dead(void) (highest_subcstate - 1); }
+ /* Set up state for the kexec() hack below */ + md->status = CPUDEAD_MWAIT_WAIT; + md->control = CPUDEAD_MWAIT_WAIT; + wbinvd();
while (1) { @@ -1816,10 +1828,57 @@ static inline void mwait_play_dead(void) mb(); __mwait(eax, 0);
+ if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) { + /* + * Kexec is about to happen. Don't go back into mwait() as + * the kexec kernel might overwrite text and data including + * page tables and stack. So mwait() would resume when the + * monitor cache line is written to and then the CPU goes + * south due to overwritten text, page tables and stack. + * + * Note: This does _NOT_ protect against a stray MCE, NMI, + * SMI. They will resume execution at the instruction + * following the HLT instruction and run into the problem + * which this is trying to prevent. + */ + WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT); + while(1) + native_halt(); + } + cond_wakeup_cpu0(); } }
+/* + * Kick all "offline" CPUs out of mwait on kexec(). See comment in + * mwait_play_dead(). + */ +void smp_kick_mwait_play_dead(void) +{ + u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT; + struct mwait_cpu_dead *md; + unsigned int cpu, i; + + for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) { + md = per_cpu_ptr(&mwait_cpu_dead, cpu); + + /* Does it sit in mwait_play_dead() ? */ + if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT) + continue; + + /* Wait up to 5ms */ + for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) { + /* Bring it out of mwait */ + WRITE_ONCE(md->control, newstate); + udelay(5); + } + + if (READ_ONCE(md->status) != newstate) + pr_err_once("CPU%u is stuck in mwait_play_dead()\n", cpu); + } +} + void hlt_play_dead(void) { if (__this_cpu_read(cpu_info.x86) >= 4)
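The control/status handshake between the kexec-ing CPU and a parked CPU can be walked through in a single-threaded model. This is a simplified sketch, not the kernel code: there is no real MWAIT, the retry loop and timeout are left out, and park(), parked_on_wake() and kick_once() are hypothetical names. Only the two magic values are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEAD_WAIT	0xDEADBEEFu	/* CPUDEAD_MWAIT_WAIT in the patch */
#define DEAD_KEXEC_HLT	0x4A17DEADu	/* CPUDEAD_MWAIT_KEXEC_HLT in the patch */

struct dead_state {
	uint32_t control;	/* written by the kicker */
	uint32_t status;	/* written by the parked CPU */
};

/* Parked CPU: announce that it is about to wait on 'control'. */
void park(struct dead_state *md)
{
	assert(md != NULL);
	md->status = DEAD_WAIT;
	md->control = DEAD_WAIT;
}

/* Parked CPU, after each wakeup: true means "acknowledge and go to HLT". */
bool parked_on_wake(struct dead_state *md)
{
	if (md->control != DEAD_KEXEC_HLT)
		return false;		/* spurious wakeup, wait again */
	md->status = DEAD_KEXEC_HLT;
	return true;
}

/* Kicker: one attempt. True once the parked CPU has acknowledged. */
bool kick_once(struct dead_state *md)
{
	if (md->status == DEAD_KEXEC_HLT)
		return true;
	md->control = DEAD_KEXEC_HLT;	/* this write is what ends the wait */
	return false;
}
```

In the patch the kicker repeats this step with a short delay for up to 5ms and logs an error if the acknowledgement never arrives.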
From: Wyes Karny wyes.karny@amd.com
commit f4aad639302a07454dcb23b408dcadf8a9efb031 upstream.
The amd-pstate passive mode driver name is hyphenated. To make the amd-pstate active mode driver consistent with that, rename "amd_pstate_epp" to "amd-pstate-epp".
Fixes: ffa5096a7c33 ("cpufreq: amd-pstate: implement Pstate EPP support for the AMD processors")
Cc: All applicable stable@vger.kernel.org
Reviewed-by: Gautham R. Shenoy gautham.shenoy@amd.com
Signed-off-by: Wyes Karny wyes.karny@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Perry Yuan Perry.Yuan@amd.com
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/cpufreq/amd-pstate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/cpufreq/amd-pstate.c
+++ b/drivers/cpufreq/amd-pstate.c
@@ -1272,7 +1272,7 @@ static struct cpufreq_driver amd_pstate_
 	.online = amd_pstate_epp_cpu_online,
 	.suspend = amd_pstate_epp_suspend,
 	.resume = amd_pstate_epp_resume,
-	.name = "amd_pstate_epp",
+	.name = "amd-pstate-epp",
 	.attr = amd_pstate_epp_attr,
 };
From: Oliver Hartkopp socketcan@hartkopp.net
commit e38910c0072b541a91954682c8b074a93e57c09b upstream.
Commit d674a8f123b4 ("can: isotp: isotp_sendmsg(): fix return error on FC timeout on TX path") introduced the previously missing return value for the protocol error case.

But the way the error value is read and passed to user space does not follow the common scheme of clearing the error after reading it, which the sock_error() function provides. As a result, the following write() attempt reports an error even though everything is working.
Fixes: d674a8f123b4 ("can: isotp: isotp_sendmsg(): fix return error on FC timeout on TX path")
Reported-by: Carsten Schmidt carsten.schmidt-achim@t-online.de
Signed-off-by: Oliver Hartkopp socketcan@hartkopp.net
Link: https://lore.kernel.org/all/20230607072708.38809-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde mkl@pengutronix.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 net/can/isotp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -1079,8 +1079,9 @@ wait_free_buffer:
 		if (err)
 			goto err_event_drop;

-		if (sk->sk_err)
-			return -sk->sk_err;
+		err = sock_error(sk);
+		if (err)
+			return err;
 	}

 	return size;
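The difference between the old and the new error reporting can be shown with a small userspace model. This is a sketch only: struct fake_sock, report_without_clearing() and fake_sock_error() are hypothetical names, EIO is an arbitrary example error, and the model only mimics the read-and-clear behaviour of sock_error(), not how the kernel implements it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the single struct sock field that matters here. */
struct fake_sock {
	int sk_err;	/* pending error as a positive errno, 0 if none */
};

/* Old pattern: report the pending error but leave it set. */
static int report_without_clearing(struct fake_sock *sk)
{
	assert(sk != NULL);
	return sk->sk_err ? -sk->sk_err : 0;
}

/* Model of the sock_error() scheme: report the error and clear it. */
static int fake_sock_error(struct fake_sock *sk)
{
	int err;

	assert(sk != NULL);
	err = sk->sk_err;
	sk->sk_err = 0;
	return -err;
}
```

With the old pattern a second call still returns the stale error, which corresponds to the spurious failure on the following write(); with the read-and-clear variant the second call returns 0.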
From: Peng Zhang zhangpeng.00@bytedance.com
commit cd00dd2585c4158e81fdfac0bbcc0446afbad26d upstream.
Check the write offset end bounds before using it as the offset into the pivot array. This avoids a possible out-of-bounds access on the pivot array if the write extends to the last slot in the node, in which case the node maximum should be used as the end pivot.
akpm: this doesn't affect any current callers, but new users of mapletree may encounter this problem if backported into earlier kernels, so let's fix it in -stable kernels just in case.
Link: https://lkml.kernel.org/r/20230506024752.2550-1-zhangpeng.00@bytedance.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Peng Zhang zhangpeng.00@bytedance.com
Reviewed-by: Liam R. Howlett Liam.Howlett@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 lib/maple_tree.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4287,11 +4287,13 @@ done:

 static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
 {
-	while ((wr_mas->mas->last > wr_mas->end_piv) &&
-	       (wr_mas->offset_end < wr_mas->node_end))
-		wr_mas->end_piv = wr_mas->pivots[++wr_mas->offset_end];
+	while ((wr_mas->offset_end < wr_mas->node_end) &&
+	       (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
+		wr_mas->offset_end++;

-	if (wr_mas->mas->last > wr_mas->end_piv)
+	if (wr_mas->offset_end < wr_mas->node_end)
+		wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
+	else
 		wr_mas->end_piv = wr_mas->mas->max;
 }

@@ -4448,7 +4450,6 @@ static inline void *mas_wr_store_entry(s
 	}

 	/* At this point, we are at the leaf node that needs to be altered. */
-	wr_mas->end_piv = wr_mas->r_max;
 	mas_wr_end_piv(wr_mas);

 	if (!wr_mas->entry)
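The reordered bounds check in mas_wr_end_piv() can be tried out on a simplified model. The sketch below assumes a node whose last slot has no stored pivot, so valid pivot indices are 0 .. node_end - 1 and the last slot is bounded by the node maximum. struct wr_model and model_wr_end_piv() are hypothetical names; real maple tree nodes carry considerably more state.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the write state used by the patch. */
struct wr_model {
	const unsigned long *pivots;	/* valid indices: 0 .. node_end - 1 */
	unsigned int node_end;		/* index of the last slot in use */
	unsigned long node_max;		/* implied upper bound of the last slot */
	unsigned long last;		/* last index covered by the write */
	unsigned int offset_end;	/* in: start slot, out: slot holding last */
	unsigned long end_piv;		/* out: upper bound of that slot */
};

/* Check the offset against node_end before indexing the pivot array. */
static void model_wr_end_piv(struct wr_model *w)
{
	assert(w != NULL && w->offset_end <= w->node_end);

	while (w->offset_end < w->node_end &&
	       w->last > w->pivots[w->offset_end])
		w->offset_end++;

	if (w->offset_end < w->node_end)
		w->end_piv = w->pivots[w->offset_end];
	else
		w->end_piv = w->node_max;	/* last slot: use the node maximum */
}
```

A write that ends in the last slot leaves offset_end equal to node_end and takes end_piv from node_max, so pivots[node_end] is never read.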
From: Linus Torvalds torvalds@linux-foundation.org
commit c2508ec5a58db67093f4fb8bf89a9a7c53a109e9 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma" code, but extends it to also take the mmap lock for writing in case we actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly special logic to serialize the stack segment expansion (since we only hold the mmap lock for reading) that doesn't match the normal VM locking.
That slight violation of locking worked well, right up until it didn't: the maple tree code really does want proper locking even for simple extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it up to do the necessary write locking, and make it available as a helper function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal stack-grows-down case. Which is all architectures except for PA-RISC and IA64. So some rare architectures can't use the helper, but if they care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an optimistic "mmap_upgrade_trylock()" to make it quicker to go from a read-lock (for the common case) to taking the write lock (for having to extend the vma) in the normal single-threaded situation where there is no other locking activity.
But that _is_ all the very uncommon special case, so while it would be nice to have such an operation, it probably doesn't matter in reality. I did put in the skeleton code for such a possible future expansion, even if it only acts as pseudo-documentation for what we're doing.
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/Kconfig | 1 arch/x86/mm/fault.c | 52 ---------------------- include/linux/mm.h | 2 mm/Kconfig | 4 + mm/memory.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 130 insertions(+), 50 deletions(-)
--- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -274,6 +274,7 @@ config X86 select HAVE_GENERIC_VDSO select HOTPLUG_SMT if SMP select IRQ_FORCED_THREADING + select LOCK_MM_AND_FIND_VMA select NEED_PER_CPU_EMBED_FIRST_CHUNK select NEED_PER_CPU_PAGE_FIRST_CHUNK select NEED_SG_DMA_LENGTH --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -879,12 +879,6 @@ __bad_area(struct pt_regs *regs, unsigne __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); }
-static noinline void -bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address) -{ - __bad_area(regs, error_code, address, 0, SEGV_MAPERR); -} - static inline bool bad_area_access_from_pkeys(unsigned long error_code, struct vm_area_struct *vma) { @@ -1333,51 +1327,10 @@ void do_user_addr_fault(struct pt_regs * } #endif
- /* - * Kernel-mode access to the user address space should only occur - * on well-defined single instructions listed in the exception - * tables. But, an erroneous kernel fault occurring outside one of - * those areas which also holds mmap_lock might deadlock attempting - * to validate the fault against the address space. - * - * Only do the expensive exception table search when we might be at - * risk of a deadlock. This happens if we - * 1. Failed to acquire mmap_lock, and - * 2. The access did not originate in userspace. - */ - if (unlikely(!mmap_read_trylock(mm))) { - if (!user_mode(regs) && !search_exception_tables(regs->ip)) { - /* - * Fault from code in kernel from - * which we do not expect faults. - */ - bad_area_nosemaphore(regs, error_code, address); - return; - } retry: - mmap_read_lock(mm); - } else { - /* - * The above down_read_trylock() might have succeeded in - * which case we'll have missed the might_sleep() from - * down_read(): - */ - might_sleep(); - } - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) { - bad_area(regs, error_code, address); - return; - } - if (likely(vma->vm_start <= address)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - bad_area(regs, error_code, address); - return; - } - if (unlikely(expand_stack(vma, address))) { - bad_area(regs, error_code, address); + bad_area_nosemaphore(regs, error_code, address); return; }
@@ -1385,7 +1338,6 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: if (unlikely(access_error(error_code, vma))) { bad_area_access_error(regs, error_code, address, vma); return; --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2190,6 +2190,8 @@ void unmap_mapping_pages(struct address_ pgoff_t start, pgoff_t nr, bool even_cows); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); +struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, + unsigned long address, struct pt_regs *regs); #else static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1202,6 +1202,10 @@ config LRU_GEN_STATS This option has a per-memcg and per-node memory overhead. # }
+config LOCK_MM_AND_FIND_VMA + bool + depends on !STACK_GROWSUP + source "mm/damon/Kconfig"
endmenu --- a/mm/memory.c +++ b/mm/memory.c @@ -5230,6 +5230,127 @@ vm_fault_t handle_mm_fault(struct vm_are } EXPORT_SYMBOL_GPL(handle_mm_fault);
+#ifdef CONFIG_LOCK_MM_AND_FIND_VMA +#include <linux/extable.h> + +static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) +{ + /* Even if this succeeds, make it clear we *might* have slept */ + if (likely(mmap_read_trylock(mm))) { + might_sleep(); + return true; + } + + if (regs && !user_mode(regs)) { + unsigned long ip = instruction_pointer(regs); + if (!search_exception_tables(ip)) + return false; + } + + mmap_read_lock(mm); + return true; +} + +static inline bool mmap_upgrade_trylock(struct mm_struct *mm) +{ + /* + * We don't have this operation yet. + * + * It should be easy enough to do: it's basically a + * atomic_long_try_cmpxchg_acquire() + * from RWSEM_READER_BIAS -> RWSEM_WRITER_LOCKED, but + * it also needs the proper lockdep magic etc. + */ + return false; +} + +static inline bool upgrade_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) +{ + mmap_read_unlock(mm); + if (regs && !user_mode(regs)) { + unsigned long ip = instruction_pointer(regs); + if (!search_exception_tables(ip)) + return false; + } + mmap_write_lock(mm); + return true; +} + +/* + * Helper for page fault handling. + * + * This is kind of equivalend to "mmap_read_lock()" followed + * by "find_extend_vma()", except it's a lot more careful about + * the locking (and will drop the lock on failure). + * + * For example, if we have a kernel bug that causes a page + * fault, we don't want to just use mmap_read_lock() to get + * the mm lock, because that would deadlock if the bug were + * to happen while we're holding the mm lock for writing. + * + * So this checks the exception tables on kernel faults in + * order to only do this all for instructions that are actually + * expected to fault. + * + * We can also actually take the mm lock for writing if we + * need to extend the vma, which helps the VM layer a lot. 
+ */ +struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, + unsigned long addr, struct pt_regs *regs) +{ + struct vm_area_struct *vma; + + if (!get_mmap_lock_carefully(mm, regs)) + return NULL; + + vma = find_vma(mm, addr); + if (likely(vma && (vma->vm_start <= addr))) + return vma; + + /* + * Well, dang. We might still be successful, but only + * if we can extend a vma to do so. + */ + if (!vma || !(vma->vm_flags & VM_GROWSDOWN)) { + mmap_read_unlock(mm); + return NULL; + } + + /* + * We can try to upgrade the mmap lock atomically, + * in which case we can continue to use the vma + * we already looked up. + * + * Otherwise we'll have to drop the mmap lock and + * re-take it, and also look up the vma again, + * re-checking it. + */ + if (!mmap_upgrade_trylock(mm)) { + if (!upgrade_mmap_lock_carefully(mm, regs)) + return NULL; + + vma = find_vma(mm, addr); + if (!vma) + goto fail; + if (vma->vm_start <= addr) + goto success; + if (!(vma->vm_flags & VM_GROWSDOWN)) + goto fail; + } + + if (expand_stack(vma, addr)) + goto fail; + +success: + mmap_write_downgrade(mm); + return vma; + +fail: + mmap_write_unlock(mm); + return NULL; +} +#endif + #ifndef __PAGETABLE_P4D_FOLDED /* * Allocate p4d page table.
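The lock transitions of the new helper can be followed in a small single-threaded model. This is a rough sketch under several assumptions: all model_* names and the MODEL_* lock states are hypothetical, the lock is just a state variable, the exception table check and the failure paths of expand_stack() are left out, and stack expansion is reduced to moving the start of the vma.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_lock { MODEL_UNLOCKED, MODEL_READ_LOCKED, MODEL_WRITE_LOCKED };

struct model_vma {
	unsigned long start, end;	/* covers [start, end) */
	bool grows_down;		/* stand-in for VM_GROWSDOWN */
};

struct model_mm {
	struct model_vma *vmas;		/* sorted by address, non-overlapping */
	size_t nr;
	enum model_lock lock;
};

/* Like find_vma(): first vma that ends above addr, or NULL. */
static struct model_vma *model_find_vma(struct model_mm *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->nr; i++)
		if (addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/* Returns the vma with the read lock held, or NULL with the lock dropped. */
static struct model_vma *model_lock_mm_and_find_vma(struct model_mm *mm,
						    unsigned long addr)
{
	struct model_vma *vma;

	assert(mm->lock == MODEL_UNLOCKED);
	mm->lock = MODEL_READ_LOCKED;

	vma = model_find_vma(mm, addr);
	if (vma && vma->start <= addr)
		return vma;			/* common case, read lock only */

	if (!vma || !vma->grows_down) {
		mm->lock = MODEL_UNLOCKED;
		return NULL;
	}

	/* No atomic upgrade: drop the read lock, take the write lock, recheck. */
	mm->lock = MODEL_UNLOCKED;
	mm->lock = MODEL_WRITE_LOCKED;
	vma = model_find_vma(mm, addr);
	if (!vma || (vma->start > addr && !vma->grows_down)) {
		mm->lock = MODEL_UNLOCKED;
		return NULL;
	}
	if (vma->start > addr)
		vma->start = addr;		/* stand-in for expand_stack() */

	mm->lock = MODEL_READ_LOCKED;		/* models mmap_write_downgrade() */
	return vma;
}
```

The point of the model is the contract visible to the architecture fault handlers: on success the caller holds the lock for reading, on failure no lock is held, which is why the converted callers go to their "nosemaphore" error paths.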
From: Linus Torvalds torvalds@linux-foundation.org
commit eda0047296a16d65a7f2bc60a408f70d178b2014 upstream.
This is done as a separate patch from introducing the new lock_mm_and_find_vma() helper, because while it's an obvious change, it's not what x86 used to do in this area.
We already abort the page fault on fatal signals anyway, so why should we wait for the mmap lock only to then abort later? With the new helper function that returns without the lock held on failure anyway, this is particularly easy and straightforward.
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/memory.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5247,8 +5247,7 @@ static inline bool get_mmap_lock_careful
 			return false;
 	}

-	mmap_read_lock(mm);
-	return true;
+	return !mmap_read_lock_killable(mm);
 }

 static inline bool mmap_upgrade_trylock(struct mm_struct *mm)
@@ -5272,8 +5271,7 @@ static inline bool upgrade_mmap_lock_car
 		if (!search_exception_tables(ip))
 			return false;
 	}
-	mmap_write_lock(mm);
-	return true;
+	return !mmap_write_lock_killable(mm);
 }

 /*
From: Linus Torvalds torvalds@linux-foundation.org
commit ae870a68b5d13d67cf4f18d47bb01ee3fee40acb upstream.
This converts arm64 to use the new page fault helper. It was very straightforward, but still needed a fix for the "obvious" conversion I initially did. Thanks to Suren for the fix and testing.
Fixed-and-tested-by: Suren Baghdasaryan surenb@google.com Unnecessary-code-removal-by: Liam R. Howlett Liam.Howlett@oracle.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/arm64/Kconfig | 1 + arch/arm64/mm/fault.c | 44 +++++++------------------------------------- 2 files changed, 8 insertions(+), 37 deletions(-)
--- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -219,6 +219,7 @@ config ARM64 select IRQ_DOMAIN select IRQ_FORCED_THREADING select KASAN_VMALLOC if KASAN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE select NEED_SG_DMA_LENGTH --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -483,27 +483,14 @@ static void do_bad_area(unsigned long fa #define VM_FAULT_BADMAP ((__force vm_fault_t)0x010000) #define VM_FAULT_BADACCESS ((__force vm_fault_t)0x020000)
-static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, +static vm_fault_t __do_page_fault(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long addr, unsigned int mm_flags, unsigned long vm_flags, struct pt_regs *regs) { - struct vm_area_struct *vma = find_vma(mm, addr); - - if (unlikely(!vma)) - return VM_FAULT_BADMAP; - /* * Ok, we have a good vm_area for this memory access, so we can handle * it. - */ - if (unlikely(vma->vm_start > addr)) { - if (!(vma->vm_flags & VM_GROWSDOWN)) - return VM_FAULT_BADMAP; - if (expand_stack(vma, addr)) - return VM_FAULT_BADMAP; - } - - /* * Check that the permissions on the VMA allow for the fault which * occurred. */ @@ -585,31 +572,14 @@ static int __kprobes do_page_fault(unsig
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
- /* - * As per x86, we may deadlock here. However, since the kernel only - * validly references user space from well defined areas of the code, - * we can bug out early if this is from code which shouldn't. - */ - if (!mmap_read_trylock(mm)) { - if (!user_mode(regs) && !search_exception_tables(regs->pc)) - goto no_context; retry: - mmap_read_lock(mm); - } else { - /* - * The above mmap_read_trylock() might have succeeded in which - * case, we'll have missed the might_sleep() from down_read(). - */ - might_sleep(); -#ifdef CONFIG_DEBUG_VM - if (!user_mode(regs) && !search_exception_tables(regs->pc)) { - mmap_read_unlock(mm); - goto no_context; - } -#endif + vma = lock_mm_and_find_vma(mm, addr, regs); + if (unlikely(!vma)) { + fault = VM_FAULT_BADMAP; + goto done; }
- fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs); + fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs);
/* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) {
From: Michael Ellerman mpe@ellerman.id.au
commit e6fe228c4ffafdfc970cf6d46883a1f481baf7ea upstream.
Signed-off-by: Michael Ellerman mpe@ellerman.id.au Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/fault.c | 41 ++++------------------------------------- 2 files changed, 5 insertions(+), 37 deletions(-)
--- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -263,6 +263,7 @@ config PPC select IRQ_DOMAIN select IRQ_FORCED_THREADING select KASAN_VMALLOC if KASAN && MODULES + select LOCK_MM_AND_FIND_VMA select MMU_GATHER_PAGE_SIZE select MMU_GATHER_RCU_TABLE_FREE select MMU_GATHER_MERGE_VMAS --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -84,11 +84,6 @@ static int __bad_area(struct pt_regs *re return __bad_area_nosemaphore(regs, address, si_code); }
-static noinline int bad_area(struct pt_regs *regs, unsigned long address) -{ - return __bad_area(regs, address, SEGV_MAPERR); -} - static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, struct vm_area_struct *vma) { @@ -481,40 +476,12 @@ static int ___do_page_fault(struct pt_re * we will deadlock attempting to validate the fault against the * address space. Luckily the kernel only validly references user * space from well defined areas of code, which are listed in the - * exceptions table. - * - * As the vast majority of faults will be valid we will only perform - * the source reference check when there is a possibility of a deadlock. - * Attempt to lock the address space, if we cannot we then validate the - * source. If this is invalid we can skip the address space check, - * thus avoiding the deadlock. - */ - if (unlikely(!mmap_read_trylock(mm))) { - if (!is_user && !search_exception_tables(regs->nip)) - return bad_area_nosemaphore(regs, address); - + * exceptions table. lock_mm_and_find_vma() handles that logic. + */ retry: - mmap_read_lock(mm); - } else { - /* - * The above down_read_trylock() might have succeeded in - * which case we'll have missed the might_sleep() from - * down_read(): - */ - might_sleep(); - } - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) - return bad_area(regs, address); - - if (unlikely(vma->vm_start > address)) { - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) - return bad_area(regs, address); - - if (unlikely(expand_stack(vma, address))) - return bad_area(regs, address); - } + return bad_area_nosemaphore(regs, address);
if (unlikely(access_pkey_error(is_write, is_exec, (error_code & DSISR_KEYFAULT), vma)))
From: Ben Hutchings ben@decadent.org.uk
commit 4bce37a68ff884e821a02a731897a8119e0c37b7 upstream.
Signed-off-by: Ben Hutchings ben@decadent.org.uk Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/mips/Kconfig | 1 + arch/mips/mm/fault.c | 12 ++---------- 2 files changed, 3 insertions(+), 10 deletions(-)
--- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -94,6 +94,7 @@ config MIPS select HAVE_VIRT_CPU_ACCOUNTING_GEN if 64BIT || !SMP select IRQ_FORCED_THREADING select ISA if EISA + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_REL if MODULES select MODULES_USE_ELF_RELA if MODULES && 64BIT select PERF_USE_VMALLOC --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -99,21 +99,13 @@ static void __do_page_fault(struct pt_re
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: si_code = SEGV_ACCERR;
if (write) {
From: Ben Hutchings ben@decadent.org.uk
commit 7267ef7b0b77f4ed23b7b3c87d8eca7bd9c2d007 upstream.
Signed-off-by: Ben Hutchings ben@decadent.org.uk Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/riscv/Kconfig | 1 + arch/riscv/mm/fault.c | 31 +++++++++++++------------------ 2 files changed, 14 insertions(+), 18 deletions(-)
--- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -119,6 +119,7 @@ config RISCV select HAVE_SYSCALL_TRACEPOINTS select IRQ_DOMAIN select IRQ_FORCED_THREADING + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA if MODULES select MODULE_SECTIONS if MODULES select OF --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -83,13 +83,13 @@ static inline void mm_fault_error(struct BUG(); }
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr) +static inline void +bad_area_nosemaphore(struct pt_regs *regs, int code, unsigned long addr) { /* * Something tried to access memory that isn't in our memory map. * Fix it, but check if it's kernel or user first. */ - mmap_read_unlock(mm); /* User mode accesses just cause a SIGSEGV */ if (user_mode(regs)) { do_trap(regs, SIGSEGV, code, addr); @@ -99,6 +99,15 @@ static inline void bad_area(struct pt_re no_context(regs, addr); }
+static inline void +bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, + unsigned long addr) +{ + mmap_read_unlock(mm); + + bad_area_nosemaphore(regs, code, addr); +} + static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr) { pgd_t *pgd, *pgd_k; @@ -286,23 +295,10 @@ asmlinkage void do_page_fault(struct pt_ else if (cause == EXC_INST_PAGE_FAULT) flags |= FAULT_FLAG_INSTRUCTION; retry: - mmap_read_lock(mm); - vma = find_vma(mm, addr); + vma = lock_mm_and_find_vma(mm, addr, regs); if (unlikely(!vma)) { tsk->thread.bad_cause = cause; - bad_area(regs, mm, code, addr); - return; - } - if (likely(vma->vm_start <= addr)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - tsk->thread.bad_cause = cause; - bad_area(regs, mm, code, addr); - return; - } - if (unlikely(expand_stack(vma, addr))) { - tsk->thread.bad_cause = cause; - bad_area(regs, mm, code, addr); + bad_area_nosemaphore(regs, code, addr); return; }
@@ -310,7 +306,6 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it. */ -good_area: code = SEGV_ACCERR;
if (unlikely(access_error(cause, vma))) {
From: Ben Hutchings ben@decadent.org.uk
commit 8b35ca3e45e35a26a21427f35d4093606e93ad0a upstream.
arm has an additional check for address < FIRST_USER_ADDRESS before expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere (generally as 0), move that check to the generic expand_downwards().
Signed-off-by: Ben Hutchings ben@decadent.org.uk Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/arm/Kconfig | 1 arch/arm/mm/fault.c | 63 +++++++++++----------------------------------------- mm/mmap.c | 2 - 3 files changed, 16 insertions(+), 50 deletions(-)
--- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -125,6 +125,7 @@ config ARM select HAVE_UID16 select HAVE_VIRT_CPU_ACCOUNTING_GEN select IRQ_FORCED_THREADING + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_REL select NEED_DMA_MAP_STATE select OF_EARLY_FLATTREE if OF --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -232,37 +232,11 @@ static inline bool is_permission_fault(u return false; }
-static vm_fault_t __kprobes -__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int flags, - unsigned long vma_flags, struct pt_regs *regs) -{ - struct vm_area_struct *vma = find_vma(mm, addr); - if (unlikely(!vma)) - return VM_FAULT_BADMAP; - - if (unlikely(vma->vm_start > addr)) { - if (!(vma->vm_flags & VM_GROWSDOWN)) - return VM_FAULT_BADMAP; - if (addr < FIRST_USER_ADDRESS) - return VM_FAULT_BADMAP; - if (expand_stack(vma, addr)) - return VM_FAULT_BADMAP; - } - - /* - * ok, we have a good vm_area for this memory access, check the - * permissions on the VMA allow for the fault which occurred. - */ - if (!(vma->vm_flags & vma_flags)) - return VM_FAULT_BADACCESS; - - return handle_mm_fault(vma, addr & PAGE_MASK, flags, regs); -} - static int __kprobes do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; int sig, code; vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; @@ -301,31 +275,21 @@ do_page_fault(unsigned long addr, unsign
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
- /* - * As per x86, we may deadlock here. However, since the kernel only - * validly references user space from well defined areas of the code, - * we can bug out early if this is from code which shouldn't. - */ - if (!mmap_read_trylock(mm)) { - if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc)) - goto no_context; retry: - mmap_read_lock(mm); - } else { - /* - * The above down_read_trylock() might have succeeded in - * which case, we'll have missed the might_sleep() from - * down_read() - */ - might_sleep(); -#ifdef CONFIG_DEBUG_VM - if (!user_mode(regs) && - !search_exception_tables(regs->ARM_pc)) - goto no_context; -#endif + vma = lock_mm_and_find_vma(mm, addr, regs); + if (unlikely(!vma)) { + fault = VM_FAULT_BADMAP; + goto bad_area; }
- fault = __do_page_fault(mm, addr, flags, vm_flags, regs); + /* + * ok, we have a good vm_area for this memory access, check the + * permissions on the VMA allow for the fault which occurred. + */ + if (!(vma->vm_flags & vm_flags)) + fault = VM_FAULT_BADACCESS; + else + fault = handle_mm_fault(vma, addr & PAGE_MASK, flags, regs);
/* If we need to retry but a fatal signal is pending, handle the * signal first. We do not need to release the mmap_lock because @@ -356,6 +320,7 @@ retry: if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS)))) return 0;
+bad_area: /* * If we are in kernel mode at this point, we * have no context to handle this fault with. --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1999,7 +1999,7 @@ int expand_downwards(struct vm_area_stru int error = 0;
address &= PAGE_MASK; - if (address < mmap_min_addr) + if (address < mmap_min_addr || address < FIRST_USER_ADDRESS) return -EPERM;
/* Enforce stack_guard_gap */
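The combined lower-bound check that this patch moves into expand_downwards() can be sketched as a standalone function. The sketch assumes a 4096-byte page size and takes the two limits as parameters; model_expand_floor_check() and MODEL_PAGE_MASK are hypothetical names, whereas in the kernel mmap_min_addr is a runtime tunable and FIRST_USER_ADDRESS is a per-architecture constant.

```c
#include <assert.h>
#include <errno.h>

/* Assumption for this sketch: 4096-byte pages. */
#define MODEL_PAGE_MASK (~(4096UL - 1))

/*
 * Refuse to grow a stack below either limit. The address is rounded down
 * to its page boundary first, as in the patched code.
 */
static int model_expand_floor_check(unsigned long address,
				    unsigned long min_addr,
				    unsigned long first_user_address)
{
	address &= MODEL_PAGE_MASK;
	if (address < min_addr || address < first_user_address)
		return -EPERM;
	return 0;
}
```

On architectures where FIRST_USER_ADDRESS is 0 the second comparison can never be true for an unsigned address, so only the mmap_min_addr limit applies there.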
From: Linus Torvalds torvalds@linux-foundation.org
commit a050ba1e7422f2cc60ff8bfde3f96d34d00cb585 upstream.
This does the simple pattern conversion of alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa to the lock_mm_and_find_vma() helper. They all have the regular fault handling pattern without odd special cases.
The remaining architectures all have something that keeps us from a straightforward conversion: ia64 and parisc have stacks that can grow both up as well as down (and ia64 has special address region checks).
And m68k, microblaze, openrisc, sparc64, and um end up having extra rules about only expanding the stack down a limited amount below the user space stack pointer. That is something that x86 used to do too (long long ago), and it probably could just be skipped, but it still makes the conversion less than trivial.
Note that this conversion was done manually and with the exception of alpha without any build testing, because I have a fairly limited cross-building environment. The cases are all simple, and I went through the changes several times, but...
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/alpha/Kconfig | 1 + arch/alpha/mm/fault.c | 13 +++---------- arch/arc/Kconfig | 1 + arch/arc/mm/fault.c | 11 +++-------- arch/csky/Kconfig | 1 + arch/csky/mm/fault.c | 22 +++++----------------- arch/hexagon/Kconfig | 1 + arch/hexagon/mm/vm_fault.c | 18 ++++-------------- arch/loongarch/Kconfig | 1 + arch/loongarch/mm/fault.c | 16 ++++++---------- arch/nios2/Kconfig | 1 + arch/nios2/mm/fault.c | 17 ++--------------- arch/sh/Kconfig | 1 + arch/sh/mm/fault.c | 17 ++--------------- arch/sparc/Kconfig | 1 + arch/sparc/mm/fault_32.c | 32 ++++++++------------------------ arch/xtensa/Kconfig | 1 + arch/xtensa/mm/fault.c | 14 +++----------- 18 files changed, 45 insertions(+), 124 deletions(-)
--- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -29,6 +29,7 @@ config ALPHA select GENERIC_SMP_IDLE_THREAD select HAVE_ARCH_AUDITSYSCALL select HAVE_MOD_ARCH_SPECIFIC + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select ODD_RT_SIGACTION select OLD_SIGSUSPEND --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -119,20 +119,12 @@ do_page_fault(unsigned long address, uns flags |= FAULT_FLAG_USER; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore;
/* Ok, we have a good vm_area for this memory access, so we can handle it. */ - good_area: si_code = SEGV_ACCERR; if (cause < 0) { if (!(vma->vm_flags & VM_EXEC)) @@ -192,6 +184,7 @@ retry: bad_area: mmap_read_unlock(mm);
+ bad_area_nosemaphore: if (user_mode(regs)) goto do_sigsegv;
--- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -41,6 +41,7 @@ config ARC select HAVE_PERF_EVENTS select HAVE_SYSCALL_TRACEPOINTS select IRQ_DOMAIN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select OF select OF_EARLY_FLATTREE --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -113,15 +113,9 @@ void do_page_fault(unsigned long address
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (unlikely(address < vma->vm_start)) { - if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address)) - goto bad_area; - } + goto bad_area_nosemaphore;
/* * vm_area is good, now check permissions for this memory access @@ -161,6 +155,7 @@ retry: bad_area: mmap_read_unlock(mm);
+bad_area_nosemaphore: /* * Major/minor page fault accounting * (in case of retry we only land here once) --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -96,6 +96,7 @@ config CSKY select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_STACKPROTECTOR select HAVE_SYSCALL_TRACEPOINTS + select LOCK_MM_AND_FIND_VMA select MAY_HAVE_SPARSE_IRQ select MODULES_USE_ELF_RELA if MODULES select OF --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -97,13 +97,12 @@ static inline void mm_fault_error(struct BUG(); }
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr) +static inline void bad_area_nosemaphore(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr) { /* * Something tried to access memory that isn't in our memory map. * Fix it, but check if it's kernel or user first. */ - mmap_read_unlock(mm); /* User mode accesses just cause a SIGSEGV */ if (user_mode(regs)) { do_trap(regs, SIGSEGV, code, addr); @@ -238,20 +237,9 @@ asmlinkage void do_page_fault(struct pt_ if (is_write(regs)) flags |= FAULT_FLAG_WRITE; retry: - mmap_read_lock(mm); - vma = find_vma(mm, addr); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) { - bad_area(regs, mm, code, addr); - return; - } - if (likely(vma->vm_start <= addr)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - bad_area(regs, mm, code, addr); - return; - } - if (unlikely(expand_stack(vma, addr))) { - bad_area(regs, mm, code, addr); + bad_area_nosemaphore(regs, mm, code, addr); return; }
@@ -259,11 +247,11 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it. */ -good_area: code = SEGV_ACCERR;
if (unlikely(access_error(regs, vma))) { - bad_area(regs, mm, code, addr); + mmap_read_unlock(mm); + bad_area_nosemaphore(regs, mm, code, addr); return; }
--- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -28,6 +28,7 @@ config HEXAGON select GENERIC_SMP_IDLE_THREAD select STACKTRACE_SUPPORT select GENERIC_CLOCKEVENTS_BROADCAST + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select GENERIC_CPU_DEVICES select ARCH_WANT_LD_ORPHAN_WARN --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -57,21 +57,10 @@ void do_page_fault(unsigned long address
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); - if (!vma) - goto bad_area; + vma = lock_mm_and_find_vma(mm, address, regs); + if (unlikely(!vma)) + goto bad_area_nosemaphore;
- if (vma->vm_start <= address) - goto good_area; - - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - - if (expand_stack(vma, address)) - goto bad_area; - -good_area: /* Address space is OK. Now check access rights. */ si_code = SEGV_ACCERR;
@@ -143,6 +132,7 @@ good_area: bad_area: mmap_read_unlock(mm);
+bad_area_nosemaphore: if (user_mode(regs)) { force_sig_fault(SIGSEGV, si_code, (void __user *)address); return; --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -125,6 +125,7 @@ config LOONGARCH select HAVE_VIRT_CPU_ACCOUNTING_GEN if !SMP select IRQ_FORCED_THREADING select IRQ_LOONGARCH_CPU + select LOCK_MM_AND_FIND_VMA select MMU_GATHER_MERGE_VMAS if MMU select MODULES_USE_ELF_RELA if MODULES select NEED_PER_CPU_EMBED_FIRST_CHUNK --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -169,22 +169,18 @@ static void __kprobes __do_page_fault(st
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); - if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (!expand_stack(vma, address)) - goto good_area; + vma = lock_mm_and_find_vma(mm, address, regs); + if (unlikely(!vma)) + goto bad_area_nosemaphore; + goto good_area; + /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: do_sigsegv(regs, write, address, si_code); return;
--- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -16,6 +16,7 @@ config NIOS2 select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_KGDB select IRQ_DOMAIN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select OF select OF_EARLY_FLATTREE --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -86,27 +86,14 @@ asmlinkage void do_page_fault(struct pt_
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
- if (!mmap_read_trylock(mm)) { - if (!user_mode(regs) && !search_exception_tables(regs->ea)) - goto bad_area_nosemaphore; retry: - mmap_read_lock(mm); - } - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: code = SEGV_ACCERR;
switch (cause) { --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -56,6 +56,7 @@ config SUPERH select HAVE_STACKPROTECTOR select HAVE_SYSCALL_TRACEPOINTS select IRQ_FORCED_THREADING + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select NEED_SG_DMA_LENGTH select NO_DMA if !MMU && !DMA_COHERENT --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -439,21 +439,9 @@ asmlinkage void __kprobes do_page_fault( }
retry: - mmap_read_lock(mm); - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) { - bad_area(regs, error_code, address); - return; - } - if (likely(vma->vm_start <= address)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - bad_area(regs, error_code, address); - return; - } - if (unlikely(expand_stack(vma, address))) { - bad_area(regs, error_code, address); + bad_area_nosemaphore(regs, error_code, address); return; }
@@ -461,7 +449,6 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: if (unlikely(access_error(error_code, vma))) { bad_area_access_error(regs, error_code, address); return; --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -56,6 +56,7 @@ config SPARC32 select DMA_DIRECT_REMAP select GENERIC_ATOMIC64 select HAVE_UID16 + select LOCK_MM_AND_FIND_VMA select OLD_SIGACTION select ZONE_DMA
--- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -143,28 +143,19 @@ asmlinkage void do_sparc_fault(struct pt if (pagefault_disabled() || !mm) goto no_context;
+ if (!from_user && address >= PAGE_OFFSET) + goto no_context; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
retry: - mmap_read_lock(mm); - - if (!from_user && address >= PAGE_OFFSET) - goto bad_area; - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: code = SEGV_ACCERR; if (write) { if (!(vma->vm_flags & VM_WRITE)) @@ -321,17 +312,9 @@ static void force_user_fault(unsigned lo
code = SEGV_MAPERR;
- mmap_read_lock(mm); - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; -good_area: + goto bad_area_nosemaphore; code = SEGV_ACCERR; if (write) { if (!(vma->vm_flags & VM_WRITE)) @@ -350,6 +333,7 @@ good_area: return; bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: __do_fault_siginfo(code, SIGSEGV, tsk->thread.kregs, address); return;
--- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -49,6 +49,7 @@ config XTENSA select HAVE_SYSCALL_TRACEPOINTS select HAVE_VIRT_CPU_ACCOUNTING_GEN select IRQ_DOMAIN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select PERF_USE_VMALLOC select TRACE_IRQFLAGS_SUPPORT --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -130,23 +130,14 @@ void do_page_fault(struct pt_regs *regs) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); - + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore;
/* Ok, we have a good vm_area for this memory access, so * we can handle it.. */
-good_area: code = SEGV_ACCERR;
if (is_write) { @@ -205,6 +196,7 @@ good_area: */ bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: if (user_mode(regs)) { force_sig_fault(SIGSEGV, code, (void *) address); return;
From: Linus Torvalds torvalds@linux-foundation.org
commit 2cd76c50d0b41cec5c87abfcdf25b236a2793fb6 upstream.
This is one of the simple cases, except there's no pt_regs pointer. Which is fine, as lock_mm_and_find_vma() is set up to work fine with a NULL pt_regs.
Powerpc already enabled LOCK_MM_AND_FIND_VMA for the main CPU faulting, so we can just use the helper without any extra work.
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/powerpc/mm/copro_fault.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -33,19 +33,11 @@ int copro_handle_mm_fault(struct mm_stru
 	if (mm->pgd == NULL)
 		return -EFAULT;
 
-	mmap_read_lock(mm);
-	ret = -EFAULT;
-	vma = find_vma(mm, ea);
+	vma = lock_mm_and_find_vma(mm, ea, NULL);
 	if (!vma)
-		goto out_unlock;
-
-	if (ea < vma->vm_start) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			goto out_unlock;
-		if (expand_stack(vma, ea))
-			goto out_unlock;
-	}
+		return -EFAULT;
 
+	ret = -EFAULT;
 	is_write = dsisr & DSISR_ISSTORE;
 	if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
From: Liam R. Howlett Liam.Howlett@oracle.com
commit f440fa1ac955e2898893f9301568435eb5cdfc4b upstream.
Make calls to extend_vma() and find_extend_vma() fail if the write lock is required.
To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order.
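The transitional flag described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct toy_vma` and `toy_expand_downwards()` are hypothetical names, not kernel APIs, and the guard-gap and accounting checks of the real helper are left out. It shows just the one rule the patch adds for downward growth: a caller that is not write-locked gets -EAGAIN when the grown stack would directly abut the previous mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a downward-growing stack mapping [start, end). */
struct toy_vma {
	unsigned long start;
	unsigned long end;
	unsigned long prev_end;	/* end of the mapping below, 0 if there is none */
};

/*
 * Toy model of the transitional rule: a caller that only holds the read
 * lock is refused with -EAGAIN when the grown stack would directly abut
 * the previous mapping; write-locked callers are always allowed here.
 * The guard-gap and rlimit checks of the real code are left out.
 */
static int toy_expand_downwards(struct toy_vma *vma, unsigned long address,
				bool write_locked)
{
	assert(vma->start < vma->end);

	if (address >= vma->start)
		return 0;	/* already covered, nothing to grow */
	if (vma->prev_end && !write_locked && vma->prev_end == address)
		return -EAGAIN;
	vma->start = address;
	return 0;
}
```

In this model a legacy (read-locked) caller keeps working in the common case and only fails in the situation where merging with a neighbour would be possible, which matches the "legacy users will continue to work in all the common cases" wording above.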
Co-Developed-by: Matthew Wilcox (Oracle) willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/binfmt_elf.c    |  6 +++---
 fs/exec.c          |  5 +++--
 include/linux/mm.h | 10 +++++++---
 mm/memory.c        |  2 +-
 mm/mmap.c          | 50 +++++++++++++++++++++++++++++++++-----------------
 mm/nommu.c         |  3 ++-
 6 files changed, 49 insertions(+), 27 deletions(-)
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -320,10 +320,10 @@ create_elf_tables(struct linux_binprm *b
 	 * Grow the stack manually; some architectures have a limit on how
 	 * far ahead a user-space access may be in order to grow the stack.
 	 */
-	if (mmap_read_lock_killable(mm))
+	if (mmap_write_lock_killable(mm))
 		return -EINTR;
-	vma = find_extend_vma(mm, bprm->p);
-	mmap_read_unlock(mm);
+	vma = find_extend_vma_locked(mm, bprm->p, true);
+	mmap_write_unlock(mm);
 	if (!vma)
 		return -EFAULT;
--- a/fs/exec.c +++ b/fs/exec.c @@ -204,7 +204,8 @@ static struct page *get_arg_page(struct
#ifdef CONFIG_STACK_GROWSUP if (write) { - ret = expand_downwards(bprm->vma, pos); + /* We claim to hold the lock - nobody to race with */ + ret = expand_downwards(bprm->vma, pos, true); if (ret < 0) return NULL; } @@ -852,7 +853,7 @@ int setup_arg_pages(struct linux_binprm stack_base = vma->vm_end - stack_expand; #endif current->mm->start_stack = bprm->p; - ret = expand_stack(vma, stack_base); + ret = expand_stack_locked(vma, stack_base, true); if (ret) ret = -EFAULT;
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3065,11 +3065,13 @@ extern vm_fault_t filemap_page_mkwrite(s
extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ -extern int expand_stack(struct vm_area_struct *vma, unsigned long address); +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked); +#define expand_stack(vma,addr) expand_stack_locked(vma,addr,false)
/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ -extern int expand_downwards(struct vm_area_struct *vma, - unsigned long address); +int expand_downwards(struct vm_area_struct *vma, unsigned long address, + bool write_locked); #if VM_GROWSUP extern int expand_upwards(struct vm_area_struct *vma, unsigned long address); #else @@ -3170,6 +3172,8 @@ unsigned long change_prot_numa(struct vm #endif
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *, + unsigned long addr, bool write_locked); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, --- a/mm/memory.c +++ b/mm/memory.c @@ -5336,7 +5336,7 @@ struct vm_area_struct *lock_mm_and_find_ goto fail; }
- if (expand_stack(vma, addr)) + if (expand_stack_locked(vma, addr, true)) goto fail;
success: --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1898,7 +1898,8 @@ static int acct_stack_growth(struct vm_a * PA-RISC uses this for its stack; IA64 for its Register Backing Store. * vma is the last one with address > vma->vm_end. Have to extend vma. */ -int expand_upwards(struct vm_area_struct *vma, unsigned long address) +int expand_upwards(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *next; @@ -1922,6 +1923,8 @@ int expand_upwards(struct vm_area_struct if (gap_addr < address || gap_addr > TASK_SIZE) gap_addr = TASK_SIZE;
+ if (!write_locked) + return -EAGAIN; next = find_vma_intersection(mm, vma->vm_end, gap_addr); if (next && vma_is_accessible(next)) { if (!(next->vm_flags & VM_GROWSUP)) @@ -1991,7 +1994,8 @@ int expand_upwards(struct vm_area_struct /* * vma is the first one with address < vma->vm_start. Have to extend vma. */ -int expand_downwards(struct vm_area_struct *vma, unsigned long address) +int expand_downwards(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { struct mm_struct *mm = vma->vm_mm; MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_start); @@ -2005,10 +2009,13 @@ int expand_downwards(struct vm_area_stru /* Enforce stack_guard_gap */ prev = mas_prev(&mas, 0); /* Check that both stack segments have the same anon_vma? */ - if (prev && !(prev->vm_flags & VM_GROWSDOWN) && - vma_is_accessible(prev)) { - if (address - prev->vm_end < stack_guard_gap) + if (prev) { + if (!(prev->vm_flags & VM_GROWSDOWN) && + vma_is_accessible(prev) && + (address - prev->vm_end < stack_guard_gap)) return -ENOMEM; + if (!write_locked && (prev->vm_end == address)) + return -EAGAIN; }
if (mas_preallocate(&mas, GFP_KERNEL)) @@ -2087,13 +2094,14 @@ static int __init cmdline_parse_stack_gu __setup("stack_guard_gap=", cmdline_parse_stack_guard_gap);
#ifdef CONFIG_STACK_GROWSUP -int expand_stack(struct vm_area_struct *vma, unsigned long address) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { - return expand_upwards(vma, address); + return expand_upwards(vma, address, write_locked); }
-struct vm_area_struct * -find_extend_vma(struct mm_struct *mm, unsigned long addr) +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, + unsigned long addr, bool write_locked) { struct vm_area_struct *vma, *prev;
@@ -2101,20 +2109,25 @@ find_extend_vma(struct mm_struct *mm, un vma = find_vma_prev(mm, addr, &prev); if (vma && (vma->vm_start <= addr)) return vma; - if (!prev || expand_stack(prev, addr)) + if (!prev) + return NULL; + if (expand_stack_locked(prev, addr, write_locked)) return NULL; if (prev->vm_flags & VM_LOCKED) populate_vma_page_range(prev, addr, prev->vm_end, NULL); return prev; } #else -int expand_stack(struct vm_area_struct *vma, unsigned long address) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { - return expand_downwards(vma, address); + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) + return -EINVAL; + return expand_downwards(vma, address, write_locked); }
-struct vm_area_struct * -find_extend_vma(struct mm_struct *mm, unsigned long addr) +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, + unsigned long addr, bool write_locked) { struct vm_area_struct *vma; unsigned long start; @@ -2125,10 +2138,8 @@ find_extend_vma(struct mm_struct *mm, un return NULL; if (vma->vm_start <= addr) return vma; - if (!(vma->vm_flags & VM_GROWSDOWN)) - return NULL; start = vma->vm_start; - if (expand_stack(vma, addr)) + if (expand_stack_locked(vma, addr, write_locked)) return NULL; if (vma->vm_flags & VM_LOCKED) populate_vma_page_range(vma, addr, start, NULL); @@ -2136,6 +2147,11 @@ find_extend_vma(struct mm_struct *mm, un } #endif
+struct vm_area_struct *find_extend_vma(struct mm_struct *mm, + unsigned long addr) +{ + return find_extend_vma_locked(mm, addr, false); +} EXPORT_SYMBOL_GPL(find_extend_vma);
/* --- a/mm/nommu.c +++ b/mm/nommu.c @@ -643,7 +643,8 @@ struct vm_area_struct *find_extend_vma(s * expand a stack to a given address * - not supported under NOMMU conditions */ -int expand_stack(struct vm_area_struct *vma, unsigned long address) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { return -ENOMEM; }
From: Linus Torvalds torvalds@linux-foundation.org
commit f313c51d26aa87e69633c9b46efb37a930faca71 upstream.
This is a small step towards a model where GUP itself would not expand the stack, and any user that needs GUP to not look up existing mappings, but actually expand on them, would have to do so manually before-hand, and with the mm lock held for writing.
It turns out that execve() already did almost exactly that, except it didn't take the mm lock at all (it's single-threaded so no locking technically needed, but it could cause lockdep errors). And it only did it for the CONFIG_STACK_GROWSUP case, since in that case GUP has obviously never expanded the stack downwards.
So just make that CONFIG_STACK_GROWSUP case do the right thing with locking, and enable it generally. This will eventually help GUP, and in the meantime avoids a special case and the lockdep issue.
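The locking sequence described above (take the lock for writing, grow the stack by hand, downgrade to a read lock, then do the page lookup) can be sketched in user space as follows. All names here (`toy_mm`, `toy_get_arg_page()`, the `min_addr` parameter) are hypothetical, and the lock is modelled as a plain state variable rather than a real rwsem, so this illustrates only the ordering, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock state is tracked as a simple enum; no real concurrency is modelled. */
enum toy_lock { TOY_UNLOCKED, TOY_READ, TOY_WRITE };

struct toy_mm {
	enum toy_lock lock;
	unsigned long stack_start;	/* stack mapping is [stack_start, stack_end) */
	unsigned long stack_end;
};

/*
 * Returns true if 'pos' is backed by the stack mapping after the call.
 * Growth happens only for writes, only under the write lock, and the
 * lookup that follows never expands anything. The lock is always
 * released before returning.
 */
static bool toy_get_arg_page(struct toy_mm *mm, unsigned long pos, bool write,
			     unsigned long min_addr)
{
	bool mapped;

	assert(mm->lock == TOY_UNLOCKED);

	if (write && pos < mm->stack_start) {
		mm->lock = TOY_WRITE;		/* like mmap_write_lock() */
		if (pos < min_addr) {		/* expansion refused */
			mm->lock = TOY_UNLOCKED;
			return false;
		}
		mm->stack_start = pos;		/* manual expansion */
		mm->lock = TOY_READ;		/* like mmap_write_downgrade() */
	} else {
		mm->lock = TOY_READ;		/* like mmap_read_lock() */
	}

	/* Pure lookup under the read lock, standing in for GUP. */
	mapped = pos >= mm->stack_start && pos < mm->stack_end;
	mm->lock = TOY_UNLOCKED;		/* like mmap_read_unlock() */
	return mapped;
}
```

Both branches reach the lookup holding the lock for reading, so a single unlock path serves the expanding and the non-expanding case.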
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/exec.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -199,34 +199,39 @@ static struct page *get_arg_page(struct
 		int write)
 {
 	struct page *page;
+	struct vm_area_struct *vma = bprm->vma;
+	struct mm_struct *mm = bprm->mm;
 	int ret;
-	unsigned int gup_flags = 0;
 
-#ifdef CONFIG_STACK_GROWSUP
-	if (write) {
-		/* We claim to hold the lock - nobody to race with */
-		ret = expand_downwards(bprm->vma, pos, true);
-		if (ret < 0)
+	/*
+	 * Avoid relying on expanding the stack down in GUP (which
+	 * does not work for STACK_GROWSUP anyway), and just do it
+	 * by hand ahead of time.
+	 */
+	if (write && pos < vma->vm_start) {
+		mmap_write_lock(mm);
+		ret = expand_downwards(vma, pos, true);
+		if (unlikely(ret < 0)) {
+			mmap_write_unlock(mm);
 			return NULL;
-	}
-#endif
-
-	if (write)
-		gup_flags |= FOLL_WRITE;
+		}
+		mmap_write_downgrade(mm);
+	} else
+		mmap_read_lock(mm);
 
 	/*
 	 * We are doing an exec(). 'current' is the process
-	 * doing the exec and bprm->mm is the new process's mm.
+	 * doing the exec and 'mm' is the new process's mm.
 	 */
-	mmap_read_lock(bprm->mm);
-	ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
+	ret = get_user_pages_remote(mm, pos, 1,
+			write ? FOLL_WRITE : 0,
 			&page, NULL, NULL);
-	mmap_read_unlock(bprm->mm);
+	mmap_read_unlock(mm);
 	if (ret <= 0)
 		return NULL;
 
 	if (write)
-		acct_arg_size(bprm, vma_pages(bprm->vma));
+		acct_arg_size(bprm, vma_pages(vma));
 
 	return page;
 }
From: Linus Torvalds torvalds@linux-foundation.org
commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream.
This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that.
It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended.
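The new calling convention is easiest to see from the caller's side. The following user-space sketch uses hypothetical names (`toy_mm`, `toy_expand_stack()`, `toy_fault()`) and models the lock as a state variable. It assumes, based on the description above, that a successful call hands the lock back in read mode so the rest of the fault path can carry on as before, while a failed call returns NULL with the lock already dropped, which is why callers jump to a label placed after the unlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_lock { TOY_UNLOCKED, TOY_READ, TOY_WRITE };

struct toy_vma {
	unsigned long start;	/* mapping is [start, end) */
	unsigned long end;
	bool grows_down;
};

struct toy_mm {
	enum toy_lock lock;
	struct toy_vma stack;	/* the only mapping in this toy address space */
};

/*
 * Called with the read lock held. On success the (re-looked-up) vma is
 * returned with the read lock held again; on failure NULL is returned
 * and the lock is no longer held at all.
 */
static struct toy_vma *toy_expand_stack(struct toy_mm *mm, unsigned long addr)
{
	struct toy_vma *vma;

	assert(mm->lock == TOY_READ);
	mm->lock = TOY_UNLOCKED;	/* drop the read lock ... */
	mm->lock = TOY_WRITE;		/* ... and take it for writing */

	vma = &mm->stack;		/* look up again: the old pointer may be stale */
	if (addr >= vma->end || !vma->grows_down) {
		mm->lock = TOY_UNLOCKED;
		return NULL;
	}
	if (addr < vma->start)
		vma->start = addr;
	mm->lock = TOY_READ;		/* downgrade for the caller */
	return vma;
}

/* Caller pattern used by the converted fault handlers. */
static int toy_fault(struct toy_mm *mm, unsigned long addr)
{
	struct toy_vma *vma;

	mm->lock = TOY_READ;
	vma = &mm->stack;
	if (addr >= vma->end)
		goto bad_area;
	if (addr < vma->start) {
		vma = toy_expand_stack(mm, addr);
		if (!vma)
			goto bad_area_nosemaphore;	/* lock already dropped */
	}
	mm->lock = TOY_UNLOCKED;
	return 0;

bad_area:
	mm->lock = TOY_UNLOCKED;
bad_area_nosemaphore:
	assert(mm->lock == TOY_UNLOCKED);
	return -1;
}
```

The two error labels mirror the `bad_area` / `bad_area_nosemaphore` split in the per-architecture hunks: the first unlocks, the second is entered only when nothing is held.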
Tested-by: Vegard Nossum vegard.nossum@oracle.com
Tested-by: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de # ia64
Tested-by: Frank Scheiner frank.scheiner@web.de # ia64
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/ia64/mm/fault.c         |   36 ++----------
 arch/m68k/mm/fault.c         |    9 ++-
 arch/microblaze/mm/fault.c   |    5 +
 arch/openrisc/mm/fault.c     |    5 +
 arch/parisc/mm/fault.c       |   23 +++-----
 arch/s390/mm/fault.c         |    5 +
 arch/sparc/mm/fault_64.c     |    8 +-
 arch/um/kernel/trap.c        |   11 ++-
 drivers/iommu/amd/iommu_v2.c |    4 -
 drivers/iommu/iommu-sva.c    |    2
 fs/binfmt_elf.c              |    2
 fs/exec.c                    |    4 -
 include/linux/mm.h           |   16 +----
 mm/gup.c                     |    6 +-
 mm/memory.c                  |   10 +++
 mm/mmap.c                    |  121 ++++++++++++++++++++++++++++++++++---------
 mm/nommu.c                   |   18 ++----
 17 files changed, 169 insertions(+), 116 deletions(-)
--- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -110,10 +110,12 @@ retry: * register backing store that needs to expand upwards, in * this case vma will be null, but prev_vma will ne non-null */ - if (( !vma && prev_vma ) || (address < vma->vm_start) ) - goto check_expansion; + if (( !vma && prev_vma ) || (address < vma->vm_start) ) { + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore; + }
- good_area: code = SEGV_ACCERR;
/* OK, we've got a good vm_area for this memory area. Check the access permissions: */ @@ -177,35 +179,9 @@ retry: mmap_read_unlock(mm); return;
- check_expansion: - if (!(prev_vma && (prev_vma->vm_flags & VM_GROWSUP) && (address == prev_vma->vm_end))) { - if (!vma) - goto bad_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start) - || REGION_OFFSET(address) >= RGN_MAP_LIMIT) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; - } else { - vma = prev_vma; - if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start) - || REGION_OFFSET(address) >= RGN_MAP_LIMIT) - goto bad_area; - /* - * Since the register backing store is accessed sequentially, - * we disallow growing it by more than a page at a time. - */ - if (address > vma->vm_end + PAGE_SIZE - sizeof(long)) - goto bad_area; - if (expand_upwards(vma, address)) - goto bad_area; - } - goto good_area; - bad_area: mmap_read_unlock(mm); + bad_area_nosemaphore: if ((isr & IA64_ISR_SP) || ((isr & IA64_ISR_NA) && (isr & IA64_ISR_CODE_MASK) == IA64_ISR_CODE_LFETCH)) { --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -105,8 +105,9 @@ retry: if (address + 256 < rdusp()) goto map_err; } - if (expand_stack(vma, address)) - goto map_err; + vma = expand_stack(mm, address); + if (!vma) + goto map_err_nosemaphore;
/* * Ok, we have a good vm_area for this memory access, so @@ -196,10 +197,12 @@ bus_err: goto send_sig;
map_err: + mmap_read_unlock(mm); +map_err_nosemaphore: current->thread.signo = SIGSEGV; current->thread.code = SEGV_MAPERR; current->thread.faddr = address; - goto send_sig; + return send_fault_sig(regs);
acc_err: current->thread.signo = SIGSEGV; --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -192,8 +192,9 @@ retry: && (kernel_mode(regs) || !store_updates_sp(regs))) goto bad_area; } - if (expand_stack(vma, address)) - goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore;
good_area: code = SEGV_ACCERR; --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -127,8 +127,9 @@ retry: if (address + PAGE_SIZE < regs->sp) goto bad_area; } - if (expand_stack(vma, address)) - goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore;
/* * Ok, we have a good vm_area for this memory access, so --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -288,15 +288,19 @@ void do_page_fault(struct pt_regs *regs, retry: mmap_read_lock(mm); vma = find_vma_prev(mm, address, &prev_vma); - if (!vma || address < vma->vm_start) - goto check_expansion; + if (!vma || address < vma->vm_start) { + if (!prev || !(prev->vm_flags & VM_GROWSUP)) + goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore; + } + /* * Ok, we have a good vm_area for this memory access. We still need to * check the access permissions. */
-good_area: - if ((vma->vm_flags & acc_type) != acc_type) goto bad_area;
@@ -347,17 +351,13 @@ good_area: mmap_read_unlock(mm); return;
-check_expansion: - vma = prev_vma; - if (vma && (expand_stack(vma, address) == 0)) - goto good_area; - /* * Something tried to access memory that isn't in our memory map.. */ bad_area: mmap_read_unlock(mm);
+bad_area_nosemaphore: if (user_mode(regs)) { int signo, si_code;
@@ -449,7 +449,7 @@ handle_nadtlb_fault(struct pt_regs *regs { unsigned long insn = regs->iir; int breg, treg, xreg, val = 0; - struct vm_area_struct *vma, *prev_vma; + struct vm_area_struct *vma; struct task_struct *tsk; struct mm_struct *mm; unsigned long address; @@ -485,7 +485,7 @@ handle_nadtlb_fault(struct pt_regs *regs /* Search for VMA */ address = regs->ior; mmap_read_lock(mm); - vma = find_vma_prev(mm, address, &prev_vma); + vma = vma_lookup(mm, address); mmap_read_unlock(mm);
/* @@ -494,7 +494,6 @@ handle_nadtlb_fault(struct pt_regs *regs */ acc_type = (insn & 0x40) ? VM_WRITE : VM_READ; if (vma - && address >= vma->vm_start && (vma->vm_flags & acc_type) == acc_type) val = 1; } --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -433,8 +433,9 @@ retry: if (unlikely(vma->vm_start > address)) { if (!(vma->vm_flags & VM_GROWSDOWN)) goto out_up; - if (expand_stack(vma, address)) - goto out_up; + vma = expand_stack(mm, address); + if (!vma) + goto out; }
/* --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -383,8 +383,9 @@ continue_fault: goto bad_area; } } - if (expand_stack(vma, address)) - goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. @@ -487,8 +488,9 @@ exit_exception: * Fix it, but check if it's kernel or user first.. */ bad_area: - insn = get_fault_insn(regs, insn); mmap_read_unlock(mm); +bad_area_nosemaphore: + insn = get_fault_insn(regs, insn);
handle_kernel_fault: do_kernel_fault(regs, si_code, fault_code, insn, address); --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -47,14 +47,15 @@ retry: vma = find_vma(mm, address); if (!vma) goto out; - else if (vma->vm_start <= address) + if (vma->vm_start <= address) goto good_area; - else if (!(vma->vm_flags & VM_GROWSDOWN)) + if (!(vma->vm_flags & VM_GROWSDOWN)) goto out; - else if (is_user && !ARCH_IS_STACKGROW(address)) - goto out; - else if (expand_stack(vma, address)) + if (is_user && !ARCH_IS_STACKGROW(address)) goto out; + vma = expand_stack(mm, address); + if (!vma) + goto out_nosemaphore;
good_area: *code_out = SEGV_ACCERR; --- a/drivers/iommu/amd/iommu_v2.c +++ b/drivers/iommu/amd/iommu_v2.c @@ -485,8 +485,8 @@ static void do_fault(struct work_struct flags |= FAULT_FLAG_REMOTE;
mmap_read_lock(mm); - vma = find_extend_vma(mm, address); - if (!vma || address < vma->vm_start) + vma = vma_lookup(mm, address); + if (!vma) /* failed to get a vma in the right range */ goto out;
--- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -203,7 +203,7 @@ iommu_sva_handle_iopf(struct iommu_fault
mmap_read_lock(mm);
- vma = find_extend_vma(mm, prm->addr); + vma = vma_lookup(mm, prm->addr); if (!vma) /* Unmapped area */ goto out_put_mm; --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -322,7 +322,7 @@ create_elf_tables(struct linux_binprm *b */ if (mmap_write_lock_killable(mm)) return -EINTR; - vma = find_extend_vma_locked(mm, bprm->p, true); + vma = find_extend_vma_locked(mm, bprm->p); mmap_write_unlock(mm); if (!vma) return -EFAULT; --- a/fs/exec.c +++ b/fs/exec.c @@ -210,7 +210,7 @@ static struct page *get_arg_page(struct */ if (write && pos < vma->vm_start) { mmap_write_lock(mm); - ret = expand_downwards(vma, pos, true); + ret = expand_downwards(vma, pos); if (unlikely(ret < 0)) { mmap_write_unlock(mm); return NULL; @@ -858,7 +858,7 @@ int setup_arg_pages(struct linux_binprm stack_base = vma->vm_end - stack_expand; #endif current->mm->start_stack = bprm->p; - ret = expand_stack_locked(vma, stack_base, true); + ret = expand_stack_locked(vma, stack_base); if (ret) ret = -EFAULT;
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3065,18 +3065,11 @@ extern vm_fault_t filemap_page_mkwrite(s
extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ -int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, - bool write_locked); -#define expand_stack(vma,addr) expand_stack_locked(vma,addr,false) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address); +struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr);
/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ -int expand_downwards(struct vm_area_struct *vma, unsigned long address, - bool write_locked); -#if VM_GROWSUP -extern int expand_upwards(struct vm_area_struct *vma, unsigned long address); -#else - #define expand_upwards(vma, address) (0) -#endif +int expand_downwards(struct vm_area_struct *vma, unsigned long address);
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr); @@ -3171,9 +3164,8 @@ unsigned long change_prot_numa(struct vm unsigned long start, unsigned long end); #endif
-struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); struct vm_area_struct *find_extend_vma_locked(struct mm_struct *, - unsigned long addr, bool write_locked); + unsigned long addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, --- a/mm/gup.c +++ b/mm/gup.c @@ -1096,7 +1096,7 @@ static long __get_user_pages(struct mm_s
/* first iteration or cross vma bound */ if (!vma || start >= vma->vm_end) { - vma = find_extend_vma(mm, start); + vma = vma_lookup(mm, start); if (!vma && in_gate_area(mm, start)) { ret = get_gate_page(mm, start & PAGE_MASK, gup_flags, &vma, @@ -1265,8 +1265,8 @@ int fixup_user_fault(struct mm_struct *m fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
retry: - vma = find_extend_vma(mm, address); - if (!vma || address < vma->vm_start) + vma = vma_lookup(mm, address); + if (!vma) return -EFAULT;
if (!vma_permits_fault(vma, fault_flags)) --- a/mm/memory.c +++ b/mm/memory.c @@ -5336,7 +5336,7 @@ struct vm_area_struct *lock_mm_and_find_ goto fail; }
- if (expand_stack_locked(vma, addr, true)) + if (expand_stack_locked(vma, addr)) goto fail;
success: @@ -5620,6 +5620,14 @@ int __access_remote_vm(struct mm_struct if (mmap_read_lock_killable(mm)) return 0;
+	/* We might need to expand the stack to access it */
+	vma = vma_lookup(mm, addr);
+	if (!vma) {
+		vma = expand_stack(mm, addr);
+		if (!vma)
+			return 0;
+	}
+
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
 		int bytes, ret, offset;
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1898,8 +1898,7 @@ static int acct_stack_growth(struct vm_a
  * PA-RISC uses this for its stack; IA64 for its Register Backing Store.
  * vma is the last one with address > vma->vm_end. Have to extend vma.
  */
-int expand_upwards(struct vm_area_struct *vma, unsigned long address,
-		bool write_locked)
+static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *next;
@@ -1923,8 +1922,6 @@ int expand_upwards(struct vm_area_struct
 	if (gap_addr < address || gap_addr > TASK_SIZE)
 		gap_addr = TASK_SIZE;

-	if (!write_locked)
-		return -EAGAIN;
 	next = find_vma_intersection(mm, vma->vm_end, gap_addr);
 	if (next && vma_is_accessible(next)) {
 		if (!(next->vm_flags & VM_GROWSUP))
@@ -1993,15 +1990,18 @@ int expand_upwards(struct vm_area_struct

 /*
  * vma is the first one with address < vma->vm_start. Have to extend vma.
+ * mmap_lock held for writing.
  */
-int expand_downwards(struct vm_area_struct *vma, unsigned long address,
-		bool write_locked)
+int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_start);
 	struct vm_area_struct *prev;
 	int error = 0;

+	if (!(vma->vm_flags & VM_GROWSDOWN))
+		return -EFAULT;
+
 	address &= PAGE_MASK;
 	if (address < mmap_min_addr || address < FIRST_USER_ADDRESS)
 		return -EPERM;
@@ -2014,8 +2014,6 @@ int expand_downwards(struct vm_area_stru
 		    vma_is_accessible(prev) &&
 		    (address - prev->vm_end < stack_guard_gap))
 			return -ENOMEM;
-		if (!write_locked && (prev->vm_end == address))
-			return -EAGAIN;
 	}

 	if (mas_preallocate(&mas, GFP_KERNEL))
@@ -2094,14 +2092,12 @@ static int __init cmdline_parse_stack_gu
 __setup("stack_guard_gap=", cmdline_parse_stack_guard_gap);

 #ifdef CONFIG_STACK_GROWSUP
-int expand_stack_locked(struct vm_area_struct *vma, unsigned long address,
-		bool write_locked)
+int expand_stack_locked(struct vm_area_struct *vma, unsigned long address)
 {
-	return expand_upwards(vma, address, write_locked);
+	return expand_upwards(vma, address);
 }

-struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm,
-		unsigned long addr, bool write_locked)
+struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, unsigned long addr)
 {
 	struct vm_area_struct *vma, *prev;

@@ -2111,23 +2107,21 @@ struct vm_area_struct *find_extend_vma_l
 		return vma;
 	if (!prev)
 		return NULL;
-	if (expand_stack_locked(prev, addr, write_locked))
+	if (expand_stack_locked(prev, addr))
 		return NULL;
 	if (prev->vm_flags & VM_LOCKED)
 		populate_vma_page_range(prev, addr, prev->vm_end, NULL);
 	return prev;
 }
 #else
-int expand_stack_locked(struct vm_area_struct *vma, unsigned long address,
-		bool write_locked)
+int expand_stack_locked(struct vm_area_struct *vma, unsigned long address)
 {
 	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
 		return -EINVAL;
-	return expand_downwards(vma, address, write_locked);
+	return expand_downwards(vma, address);
 }

-struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm,
-		unsigned long addr, bool write_locked)
+struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, unsigned long addr)
 {
 	struct vm_area_struct *vma;
 	unsigned long start;
@@ -2139,7 +2133,7 @@ struct vm_area_struct *find_extend_vma_l
 	if (vma->vm_start <= addr)
 		return vma;
 	start = vma->vm_start;
-	if (expand_stack_locked(vma, addr, write_locked))
+	if (expand_stack_locked(vma, addr))
 		return NULL;
 	if (vma->vm_flags & VM_LOCKED)
 		populate_vma_page_range(vma, addr, start, NULL);
@@ -2147,12 +2141,91 @@ struct vm_area_struct *find_extend_vma_l
 }
 #endif

-struct vm_area_struct *find_extend_vma(struct mm_struct *mm,
-				       unsigned long addr)
+/*
+ * IA64 has some horrid mapping rules: it can expand both up and down,
+ * but with various special rules.
+ *
+ * We'll get rid of this architecture eventually, so the ugliness is
+ * temporary.
+ */
+#ifdef CONFIG_IA64
+static inline bool vma_expand_ok(struct vm_area_struct *vma, unsigned long addr)
+{
+	return REGION_NUMBER(addr) == REGION_NUMBER(vma->vm_start) &&
+		REGION_OFFSET(addr) < RGN_MAP_LIMIT;
+}
+
+/*
+ * IA64 stacks grow down, but there's a special register backing store
+ * that can grow up. Only sequentially, though, so the new address must
+ * match vm_end.
+ */
+static inline int vma_expand_up(struct vm_area_struct *vma, unsigned long addr)
+{
+	if (!vma_expand_ok(vma, addr))
+		return -EFAULT;
+	if (vma->vm_end != (addr & PAGE_MASK))
+		return -EFAULT;
+	return expand_upwards(vma, addr);
+}
+
+static inline bool vma_expand_down(struct vm_area_struct *vma, unsigned long addr)
+{
+	if (!vma_expand_ok(vma, addr))
+		return -EFAULT;
+	return expand_downwards(vma, addr);
+}
+
+#elif defined(CONFIG_STACK_GROWSUP)
+
+#define vma_expand_up(vma,addr) expand_upwards(vma, addr)
+#define vma_expand_down(vma, addr) (-EFAULT)
+
+#else
+
+#define vma_expand_up(vma,addr) (-EFAULT)
+#define vma_expand_down(vma, addr) expand_downwards(vma, addr)
+
+#endif
+
+/*
+ * expand_stack(): legacy interface for page faulting. Don't use unless
+ * you have to.
+ *
+ * This is called with the mm locked for reading, drops the lock, takes
+ * the lock for writing, tries to look up a vma again, expands it if
+ * necessary, and downgrades the lock to reading again.
+ *
+ * If no vma is found or it can't be expanded, it returns NULL and has
+ * dropped the lock.
+ */
+struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
 {
-	return find_extend_vma_locked(mm, addr, false);
+	struct vm_area_struct *vma, *prev;
+
+	mmap_read_unlock(mm);
+	if (mmap_write_lock_killable(mm))
+		return NULL;
+
+	vma = find_vma_prev(mm, addr, &prev);
+	if (vma && vma->vm_start <= addr)
+		goto success;
+
+	if (prev && !vma_expand_up(prev, addr)) {
+		vma = prev;
+		goto success;
+	}
+
+	if (vma && !vma_expand_down(vma, addr))
+		goto success;
+
+	mmap_write_unlock(mm);
+	return NULL;
+
+success:
+	mmap_write_downgrade(mm);
+	return vma;
 }
-EXPORT_SYMBOL_GPL(find_extend_vma);

 /*
  * Ok - we have the memory areas we should free on a maple tree so release them,
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -631,24 +631,20 @@ struct vm_area_struct *find_vma(struct m
 EXPORT_SYMBOL(find_vma);

 /*
- * find a VMA
- * - we don't extend stack VMAs under NOMMU conditions
- */
-struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr)
-{
-	return find_vma(mm, addr);
-}
-
-/*
  * expand a stack to a given address
  * - not supported under NOMMU conditions
  */
-int expand_stack_locked(struct vm_area_struct *vma, unsigned long address,
-		bool write_locked)
+int expand_stack_locked(struct vm_area_struct *vma, unsigned long addr)
 {
 	return -ENOMEM;
 }

+struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
+{
+	mmap_read_unlock(mm);
+	return NULL;
+}
+
 /*
  * look up the first VMA exactly that exactly matches addr
  * - should be called with mm->mmap_lock at least held readlocked

From: Linus Torvalds torvalds@linux-foundation.org
commit a425ac5365f6cb3cc47bf83e6bff0213c10445f7 upstream.
It feels very unlikely that anybody would want to do a GUP in an unmapped area under the stack pointer, but real users sometimes do some really strange things. So add a (temporary) warning for the case where a GUP fails and expanding the stack might have made it work.
It's trivial to do the expansion in the caller as part of getting the mm lock in the first place - see __access_remote_vm() for ptrace, for example - it's just that it's unnecessarily painful to do it deep in the guts of the GUP lookup when we might have to drop and re-take the lock.
I doubt anybody actually does anything quite this strange, but let's be proactive: adding these warnings is simple, and will make debugging it much easier if they trigger.
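For readers less familiar with the two lookup helpers, the distinction the warning relies on can be modelled in userspace. This is only a sketch under stated assumptions: `struct toy_vma` and the `toy_*` functions are hypothetical stand-ins rather than kernel code, and the areas are assumed to be sorted by address and non-overlapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
	bool grows_down;	/* models the VM_GROWSDOWN flag */
};

/* Like find_vma(): first area whose end lies above addr.
 * The result may start above addr, i.e. addr can sit in a gap below it. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *v, size_t n,
					  unsigned long addr)
{
	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].vm_end)
			return &v[i];
	return NULL;
}

/* Like vma_lookup(): only an area that actually contains addr. */
static const struct toy_vma *toy_vma_lookup(const struct toy_vma *v, size_t n,
					    unsigned long addr)
{
	const struct toy_vma *vma = toy_find_vma(v, n, addr);

	return (vma && addr >= vma->vm_start) ? vma : NULL;
}

/* True when the lookup fails but the next area up is a grows-down stack,
 * which is the condition the WARN_ON_ONCE() in the patch checks. */
static bool toy_would_warn(const struct toy_vma *v, size_t n,
			   unsigned long addr)
{
	const struct toy_vma *vma = toy_find_vma(v, n, addr);

	return vma && addr < vma->vm_start && vma->grows_down;
}
```

In this model an address in the gap just below a grows-down area fails `toy_vma_lookup()` but makes `toy_would_warn()` return true, while an address below an ordinary area fails quietly.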
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/gup.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1096,7 +1096,11 @@ static long __get_user_pages(struct mm_s

 		/* first iteration or cross vma bound */
 		if (!vma || start >= vma->vm_end) {
-			vma = vma_lookup(mm, start);
+			vma = find_vma(mm, start);
+			if (vma && (start < vma->vm_start)) {
+				WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);
+				vma = NULL;
+			}
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
@@ -1265,9 +1269,13 @@ int fixup_user_fault(struct mm_struct *m
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

 retry:
-	vma = vma_lookup(mm, address);
+	vma = find_vma(mm, address);
 	if (!vma)
 		return -EFAULT;
+	if (address < vma->vm_start ) {
+		WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);
+		return -EFAULT;
+	}

 	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;

From: Zhang Shurong zhang_shurong@foxmail.com
commit c2d22806aecb24e2de55c30a06e5d6eb297d161d upstream.
There is a potential OOB read in fast_imageblit(): the index in "colortab[(*src >> 4)]" can become negative because of "const char *s = image->data, *src". This change makes sure the index for colortab is always positive or zero.
Similar commit: https://patchwork.kernel.org/patch/11746067
Potential bug report: https://groups.google.com/g/syzkaller-bugs/c/9ubBXKeKXf4/m/k-QXy4UgAAAJ
Signed-off-by: Zhang Shurong zhang_shurong@foxmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller deller@gmx.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/video/fbdev/core/sysimgblt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/video/fbdev/core/sysimgblt.c
+++ b/drivers/video/fbdev/core/sysimgblt.c
@@ -189,7 +189,7 @@ static void fast_imageblit(const struct
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
 	u32 bit_mask, eorx, shift;
-	const char *s = image->data, *src;
+	const u8 *s = image->data, *src;
 	u32 *dst;
 	const u32 *tab;
 	size_t tablen;

From: Ludvig Michaelsson ludvig.michaelsson@yubico.com
commit 944ee77dc6ec7b0afd8ec70ffc418b238c92f12b upstream.
The hidraw_open() function increments the hidraw device reference counter. The counter has no dedicated synchronization mechanism, resulting in a potential data race when concurrently opening a device.
The race is a regression introduced by commit 8590222e4b02 ("HID: hidraw: Replace hidraw device table mutex with a rwsem"). While minors_rwsem is intended to protect the hidraw_table itself, by instead acquiring the lock for writing, the reference counter is also protected. This is symmetrical to hidraw_release().
Link: https://github.com/systemd/systemd/issues/27947
Fixes: 8590222e4b02 ("HID: hidraw: Replace hidraw device table mutex with a rwsem")
Cc: stable@vger.kernel.org
Signed-off-by: Ludvig Michaelsson ludvig.michaelsson@yubico.com
Link: https://lore.kernel.org/r/20230621-hidraw-race-v1-1-a58e6ac69bab@yubico.com
Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/hidraw.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/drivers/hid/hidraw.c
+++ b/drivers/hid/hidraw.c
@@ -272,7 +272,12 @@ static int hidraw_open(struct inode *ino
 		goto out;
 	}

-	down_read(&minors_rwsem);
+	/*
+	 * Technically not writing to the hidraw_table but a write lock is
+	 * required to protect the device refcount. This is symmetrical to
+	 * hidraw_release().
+	 */
+	down_write(&minors_rwsem);
 	if (!hidraw_table[minor] || !hidraw_table[minor]->exist) {
 		err = -ENODEV;
 		goto out_unlock;
@@ -301,7 +306,7 @@ static int hidraw_open(struct inode *ino
 	spin_unlock_irqrestore(&hidraw_table[minor]->list_lock, flags);
 	file->private_data = list;
 out_unlock:
-	up_read(&minors_rwsem);
+	up_write(&minors_rwsem);
 out:
 	if (err < 0)
 		kfree(list);

From: Jason Gerecke jason.gerecke@wacom.com
commit 9a6c0e28e215535b2938c61ded54603b4e5814c5 upstream.
Code which interacts with timestamps needs to use the ktime_t type returned by functions like ktime_get. The int type does not offer enough space to store these values, and attempting to use it is a recipe for problems. In this particular case, overflows would occur when calculating/storing timestamps leading to incorrect values being reported to userspace. In some cases these bad timestamps cause input handling in userspace to appear hung.
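A small userspace sketch shows why an int cannot hold these values. The kernel's ktime_t is a signed 64-bit nanosecond count; `toy_ktime_t` and both helper functions below are hypothetical names used only for illustration. Converting an out-of-range value to int is implementation-defined; GCC and Clang keep the low 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's ktime_t (signed 64-bit nanoseconds). */
typedef int64_t toy_ktime_t;

/* Models the old code: a nanosecond timestamp squeezed into an int.
 * Anything beyond roughly 2.1 seconds no longer fits in 32 bits. */
static int store_as_int(toy_ktime_t t)
{
	return (int)t;
}

/* Models the patched arithmetic: the timestamp of a frame that arrived
 * frames_after intervals before the packet was received, computed
 * entirely in 64 bits. */
static toy_ktime_t frame_timestamp(toy_ktime_t received, toy_ktime_t interval,
				   int frames_after)
{
	assert(frames_after >= 0);
	return received - (toy_ktime_t)frames_after * interval;
}
```

Five seconds of uptime is 5000000000 ns, which already does not survive a round trip through `store_as_int()`, while `frame_timestamp()` with the 15000000 ns default interval from the patch keeps the full value.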
Link: https://gitlab.freedesktop.org/libinput/libinput/-/issues/901
Fixes: 17d793f3ed53 ("HID: wacom: insert timestamp to packed Bluetooth (BT) events")
CC: stable@vger.kernel.org
Signed-off-by: Jason Gerecke jason.gerecke@wacom.com
Reviewed-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Link: https://lore.kernel.org/r/20230608213828.2108-1-jason.gerecke@wacom.com
Signed-off-by: Benjamin Tissoires bentiss@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/wacom_wac.c | 6 +++---
 drivers/hid/wacom_wac.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/hid/wacom_wac.c
+++ b/drivers/hid/wacom_wac.c
@@ -1309,7 +1309,7 @@ static void wacom_intuos_pro2_bt_pen(str
 	struct input_dev *pen_input = wacom->pen_input;
 	unsigned char *data = wacom->data;
 	int number_of_valid_frames = 0;
-	int time_interval = 15000000;
+	ktime_t time_interval = 15000000;
 	ktime_t time_packet_received = ktime_get();
 	int i;

@@ -1343,7 +1343,7 @@ static void wacom_intuos_pro2_bt_pen(str
 	if (number_of_valid_frames) {
 		if (wacom->hid_data.time_delayed)
 			time_interval = ktime_get() - wacom->hid_data.time_delayed;
-		time_interval /= number_of_valid_frames;
+		time_interval = div_u64(time_interval, number_of_valid_frames);
 		wacom->hid_data.time_delayed = time_packet_received;
 	}

@@ -1354,7 +1354,7 @@ static void wacom_intuos_pro2_bt_pen(str
 		bool range = frame[0] & 0x20;
 		bool invert = frame[0] & 0x10;
 		int frames_number_reversed = number_of_valid_frames - i - 1;
-		int event_timestamp = time_packet_received - frames_number_reversed * time_interval;
+		ktime_t event_timestamp = time_packet_received - frames_number_reversed * time_interval;

 		if (!valid)
 			continue;
--- a/drivers/hid/wacom_wac.h
+++ b/drivers/hid/wacom_wac.h
@@ -324,7 +324,7 @@ struct hid_data {
 	int ps_connected;
 	bool pad_input_event_flag;
 	unsigned short sequence_number;
-	int time_delayed;
+	ktime_t time_delayed;
 };

 struct wacom_remote_data {

From: Mike Hommey mh@glandium.org
commit 5fe251112646d8626818ea90f7af325bab243efa upstream.
Commit 498ba2069035 ("HID: logitech-hidpp: Don't restart communication if not necessary") put restarting communication behind the HIDPP_QUIRK_DELAYED_INIT flag. Restarting was apparently necessary on the T651, but the flag was not set for it.
Fixes: 498ba2069035 ("HID: logitech-hidpp: Don't restart communication if not necessary")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Hommey mh@glandium.org
Link: https://lore.kernel.org/r/20230617230957.6mx73th4blv7owqk@glandium.org
Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/hid-logitech-hidpp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hid/hid-logitech-hidpp.c
+++ b/drivers/hid/hid-logitech-hidpp.c
@@ -4364,7 +4364,7 @@ static const struct hid_device_id hidpp_
 	{ /* wireless touchpad T651 */
 	  HID_BLUETOOTH_DEVICE(USB_VENDOR_ID_LOGITECH,
 		USB_DEVICE_ID_LOGITECH_T651),
-	  .driver_data = HIDPP_QUIRK_CLASS_WTP },
+	  .driver_data = HIDPP_QUIRK_CLASS_WTP | HIDPP_QUIRK_DELAYED_INIT },
 	{ /* Mouse Logitech Anywhere MX */
 	  LDJ_DEVICE(0x1017), .driver_data = HIDPP_QUIRK_HI_RES_SCROLL_1P0 },
 	{ /* Mouse logitech M560 */

From: Ricardo Cañuelo ricardo.canuelo@collabora.com
commit 86edac7d3888c715fe3a81bd61f3617ecfe2e1dd upstream.
This reverts commit f05c7b7d9ea9477fcc388476c6f4ade8c66d2d26.
That change was causing a regression in the generic-adc-thermal-probed bootrr test as reported in the kernelci-results list [1]. A proper rework will take longer, so revert it for now.
[1] https://groups.io/g/kernelci-results/message/42660
Fixes: f05c7b7d9ea9 ("thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe")
Signed-off-by: Ricardo Cañuelo ricardo.canuelo@collabora.com
Suggested-by: AngeloGioacchino Del Regno angelogioacchino.delregno@collabora.com
Reviewed-by: AngeloGioacchino Del Regno angelogioacchino.delregno@collabora.com
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
Link: https://lore.kernel.org/r/20230525121811.3360268-1-ricardo.canuelo@collabora...
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/thermal/mediatek/auxadc_thermal.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

--- a/drivers/thermal/mediatek/auxadc_thermal.c
+++ b/drivers/thermal/mediatek/auxadc_thermal.c
@@ -1142,12 +1142,7 @@ static int mtk_thermal_probe(struct plat
 		return -ENODEV;
 	}

-	auxadc_base = devm_of_iomap(&pdev->dev, auxadc, 0, NULL);
-	if (IS_ERR(auxadc_base)) {
-		of_node_put(auxadc);
-		return PTR_ERR(auxadc_base);
-	}
-
+	auxadc_base = of_iomap(auxadc, 0);
 	auxadc_phys_base = of_get_phys_base(auxadc);

 	of_node_put(auxadc);
@@ -1163,12 +1158,7 @@ static int mtk_thermal_probe(struct plat
 		return -ENODEV;
 	}

-	apmixed_base = devm_of_iomap(&pdev->dev, apmixedsys, 0, NULL);
-	if (IS_ERR(apmixed_base)) {
-		of_node_put(apmixedsys);
-		return PTR_ERR(apmixed_base);
-	}
-
+	apmixed_base = of_iomap(apmixedsys, 0);
 	apmixed_phys_base = of_get_phys_base(apmixedsys);

 	of_node_put(apmixedsys);

Hello!
Early report of failures.
Arm64 fails with GCC-11 on the following configurations:
* lkftconfig
* lkftconfig-64k_page_size
* lkftconfig-debug
* lkftconfig-debug-kmemleak
* lkftconfig-kasan
* lkftconfig-kselftest
* lkftconfig-kunit
* lkftconfig-libgpiod
* lkftconfig-perf
* lkftconfig-rcutorture
lkftconfig is basically defconfig + a few fragments [1]. The suffixes mean that we're enabling a few other kconfigs.
Failure:
-----8<-----
/builds/linux/arch/arm64/mm/fault.c: In function 'do_page_fault':
/builds/linux/arch/arm64/mm/fault.c:576:9: error: 'vma' undeclared (first use in this function); did you mean 'vmap'?
  576 |         vma = lock_mm_and_find_vma(mm, addr, regs);
      |         ^~~
      |         vmap
/builds/linux/arch/arm64/mm/fault.c:576:9: note: each undeclared identifier is reported only once for each function it appears in
/builds/linux/arch/arm64/mm/fault.c:579:17: error: label 'done' used but not defined
  579 |                 goto done;
      |                 ^~~~
make[4]: *** [/builds/linux/scripts/Makefile.build:252: arch/arm64/mm/fault.o] Error 1
make[4]: Target 'arch/arm64/mm/' not remade because of errors.
----->8-----
We're expecting to see more failures on other architectures, and so will follow-up with that.
[1] https://github.com/Linaro/meta-lkft/tree/kirkstone/meta/recipes-kernel/linux...
Greetings!
Daniel Díaz daniel.diaz@linaro.org
On Thu, 29 Jun 2023 at 12:47, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 6.3.11 release. There are 29 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sat, 01 Jul 2023 18:41:39 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.3.11-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.3.y and the diffstat can be found below.
thanks,
greg k-h
On Thu, Jun 29, 2023 at 03:54:03PM -0600, Daniel Díaz wrote:
Failure:
-----8<-----
/builds/linux/arch/arm64/mm/fault.c: In function 'do_page_fault':
/builds/linux/arch/arm64/mm/fault.c:576:9: error: 'vma' undeclared (first use in this function); did you mean 'vmap'?
  576 |         vma = lock_mm_and_find_vma(mm, addr, regs);
      |         ^~~
      |         vmap
/builds/linux/arch/arm64/mm/fault.c:576:9: note: each undeclared identifier is reported only once for each function it appears in
/builds/linux/arch/arm64/mm/fault.c:579:17: error: label 'done' used but not defined
  579 |                 goto done;
      |                 ^~~~
make[4]: *** [/builds/linux/scripts/Makefile.build:252: arch/arm64/mm/fault.o] Error 1
make[4]: Target 'arch/arm64/mm/' not remade because of errors.
----->8-----
Is this also failing in Linus's tree?
thanks,
greg k-h
Hello!
On Thu, 29 Jun 2023 at 23:19, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
Is this also failing in Linus's tree?
(Sorry for the previous top-post.)
No, only here on 6.3.
Greetings!
Daniel Díaz daniel.diaz@linaro.org
On Thu, Jun 29, 2023 at 11:25:13PM -0600, Daniel Díaz wrote:
Hello!
On Thu, 29 Jun 2023 at 23:19, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
Is this also failing in Linus's tree?
(Sorry for the previous top-post.)
No, only here on 6.3.
Ok, found the problem, will push out a -rc2 now, thanks for the quick notice!
greg k-h