This is the start of the stable review cycle for the 6.4.1 release. There are 28 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sat, 01 Jul 2023 18:41:39 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at:
	https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.1-rc1.g...
or in the git tree and branch at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 6.4.1-rc1
Ricardo Cañuelo ricardo.canuelo@collabora.com Revert "thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe"
Mike Hommey mh@glandium.org HID: logitech-hidpp: add HIDPP_QUIRK_DELAYED_INIT for the T651.
Ludvig Michaelsson ludvig.michaelsson@yubico.com HID: hidraw: fix data race on device refcount
Zhang Shurong zhang_shurong@foxmail.com fbdev: fix potential OOB read in fast_imageblit()
Hugh Dickins hughd@google.com mm/khugepaged: fix regression in collapse_file()
Linus Torvalds torvalds@linux-foundation.org gup: add warning if some caller would seem to want stack expansion
Jason Gerecke jason.gerecke@wacom.com HID: wacom: Use ktime_t rather than int when dealing with timestamps
Linus Torvalds torvalds@linux-foundation.org mm: always expand the stack with the mmap write lock held
Linus Torvalds torvalds@linux-foundation.org execve: expand new process stack manually ahead of time
Liam R. Howlett Liam.Howlett@oracle.com mm: make find_extend_vma() fail if write lock not held
Linus Torvalds torvalds@linux-foundation.org powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
Linus Torvalds torvalds@linux-foundation.org mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
Ben Hutchings ben@decadent.org.uk arm/mm: Convert to using lock_mm_and_find_vma()
Ben Hutchings ben@decadent.org.uk riscv/mm: Convert to using lock_mm_and_find_vma()
Ben Hutchings ben@decadent.org.uk mips/mm: Convert to using lock_mm_and_find_vma()
Michael Ellerman mpe@ellerman.id.au powerpc/mm: Convert to using lock_mm_and_find_vma()
Linus Torvalds torvalds@linux-foundation.org arm64/mm: Convert to using lock_mm_and_find_vma()
Linus Torvalds torvalds@linux-foundation.org mm: make the page fault mmap locking killable
Linus Torvalds torvalds@linux-foundation.org mm: introduce new 'lock_mm_and_find_vma()' page fault helper
Peng Zhang zhangpeng.00@bytedance.com maple_tree: fix potential out-of-bounds access in mas_wr_end_piv()
Oliver Hartkopp socketcan@hartkopp.net can: isotp: isotp_sendmsg(): fix return error fix on TX path
Wyes Karny wyes.karny@amd.com cpufreq: amd-pstate: Make amd-pstate EPP driver name hyphenated
Thomas Gleixner tglx@linutronix.de x86/smp: Cure kexec() vs. mwait_play_dead() breakage
Thomas Gleixner tglx@linutronix.de x86/smp: Use dedicated cache-line for mwait_play_dead()
Thomas Gleixner tglx@linutronix.de x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
Tony Battersby tonyb@cybernetics.com x86/smp: Dont access non-existing CPUID leaf
Thomas Gleixner tglx@linutronix.de x86/smp: Make stop_other_cpus() more robust
Borislav Petkov (AMD) bp@alien8.de x86/microcode/AMD: Load late on both threads too
-------------
Diffstat:
 Makefile                                  |   4 +-
 arch/alpha/Kconfig                        |   1 +
 arch/alpha/mm/fault.c                     |  13 +--
 arch/arc/Kconfig                          |   1 +
 arch/arc/mm/fault.c                       |  11 +--
 arch/arm/Kconfig                          |   1 +
 arch/arm/mm/fault.c                       |  63 ++++-----------
 arch/arm64/Kconfig                        |   1 +
 arch/arm64/mm/fault.c                     |  47 ++---------
 arch/csky/Kconfig                         |   1 +
 arch/csky/mm/fault.c                      |  22 ++----
 arch/hexagon/Kconfig                      |   1 +
 arch/hexagon/mm/vm_fault.c                |  18 +----
 arch/ia64/mm/fault.c                      |  36 ++-------
 arch/loongarch/Kconfig                    |   1 +
 arch/loongarch/mm/fault.c                 |  16 ++--
 arch/m68k/mm/fault.c                      |   9 ++-
 arch/microblaze/mm/fault.c                |   5 +-
 arch/mips/Kconfig                         |   1 +
 arch/mips/mm/fault.c                      |  12 +--
 arch/nios2/Kconfig                        |   1 +
 arch/nios2/mm/fault.c                     |  17 +---
 arch/openrisc/mm/fault.c                  |   5 +-
 arch/parisc/mm/fault.c                    |  23 +++---
 arch/powerpc/Kconfig                      |   1 +
 arch/powerpc/mm/copro_fault.c             |  14 +---
 arch/powerpc/mm/fault.c                   |  39 +--------
 arch/riscv/Kconfig                        |   1 +
 arch/riscv/mm/fault.c                     |  31 +++-----
 arch/s390/mm/fault.c                      |   5 +-
 arch/sh/Kconfig                           |   1 +
 arch/sh/mm/fault.c                        |  17 +---
 arch/sparc/Kconfig                        |   1 +
 arch/sparc/mm/fault_32.c                  |  32 ++------
 arch/sparc/mm/fault_64.c                  |   8 +-
 arch/um/kernel/trap.c                     |  11 +--
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/asm/cpu.h                |   2 +
 arch/x86/include/asm/smp.h                |   2 +
 arch/x86/kernel/cpu/microcode/amd.c       |   2 +-
 arch/x86/kernel/process.c                 |  28 ++++++-
 arch/x86/kernel/smp.c                     |  73 ++++++++++-------
 arch/x86/kernel/smpboot.c                 |  81 ++++++++++++++++---
 arch/x86/mm/fault.c                       |  52 +-----------
 arch/xtensa/Kconfig                       |   1 +
 arch/xtensa/mm/fault.c                    |  14 +---
 drivers/cpufreq/amd-pstate.c              |   2 +-
 drivers/hid/hid-logitech-hidpp.c          |   2 +-
 drivers/hid/hidraw.c                      |   9 ++-
 drivers/hid/wacom_wac.c                   |   6 +-
 drivers/hid/wacom_wac.h                   |   2 +-
 drivers/iommu/amd/iommu_v2.c              |   4 +-
 drivers/iommu/iommu-sva.c                 |   2 +-
 drivers/thermal/mediatek/auxadc_thermal.c |  14 +---
 drivers/video/fbdev/core/sysimgblt.c      |   2 +-
 fs/binfmt_elf.c                           |   6 +-
 fs/exec.c                                 |  38 +++++----
 include/linux/mm.h                        |  16 ++--
 lib/maple_tree.c                          |  11 +--
 mm/Kconfig                                |   4 +
 mm/gup.c                                  |  14 +++-
 mm/khugepaged.c                           |   7 +-
 mm/memory.c                               | 127 ++++++++++++++++++++++++++++++
 mm/mmap.c                                 | 121 ++++++++++++++++++++++++----
 mm/nommu.c                                |  17 ++--
 net/can/isotp.c                           |   5 +-
 66 files changed, 605 insertions(+), 531 deletions(-)
From: Borislav Petkov (AMD) bp@alien8.de
commit a32b0f0db3f396f1c9be2fe621e77c09ec3d8e7d upstream.
Do the same as early loading - load on both threads.
Signed-off-by: Borislav Petkov (AMD) bp@alien8.de
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230605141332.25948-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/kernel/cpu/microcode/amd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -705,7 +705,7 @@ static enum ucode_state apply_microcode_
 	rdmsr(MSR_AMD64_PATCH_LEVEL, rev, dummy);

 	/* need to apply patch? */
-	if (rev >= mc_amd->hdr.patch_id) {
+	if (rev > mc_amd->hdr.patch_id) {
 		ret = UCODE_OK;
 		goto out;
 	}
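The one-character change above is easy to misread, so here is a minimal user-space sketch of the comparison before and after the patch. The names skip_load_old() and skip_load_new() are hypothetical and only model the "need to apply patch?" test in the hunk; they are not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the "need to apply patch?" test.
 * rev      - revision currently reported by the CPU
 * patch_id - revision of the microcode patch to be loaded
 * Both helpers return true when the load would be skipped.
 */

/* Before the patch: an equal revision is treated as "already loaded". */
static bool skip_load_old(uint32_t rev, uint32_t patch_id)
{
	return rev >= patch_id;
}

/* After the patch: only a strictly newer revision skips the load. */
static bool skip_load_new(uint32_t rev, uint32_t patch_id)
{
	return rev > patch_id;
}
```

With equal revisions the old test skips the load and the new one does not. That is presumably what lets the second thread of a core go through the load even if it already reports the revision its sibling just applied.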
From: Thomas Gleixner tglx@linutronix.de
commit 1f5e7eb7868e42227ac426c96d437117e6e06e8e upstream.
Tony reported intermittent lockups on poweroff. His analysis identified the wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that on SME enabled machines a kexec() does not leave any stale data in the caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly from CPUID. But that readout does not check whether the CPUID leaf is available or not. If it's not available the CPU will return the value of the highest supported leaf instead. Depending on the content the "SME" bit might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes the observed hangs go away, but it does not fix the underlying problem:
    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds which causes the system to hang.
This issue exists independent of WBINVD, but the delays caused by WBINVD make it more prominent.
Make this more robust by adding a cpumask which is initialized to the online CPU mask before sending the IPIs and CPUs clear their bit in stop_this_cpu() after the WBINVD completed. Check for that cpumask to become empty in stop_other_cpus() instead of watching num_online_cpus().
The cpumask cannot plug all holes either, but it's better than a raw counter and allows to restrict the NMI fallback IPI to be sent only the CPUs which have not reported within the timeout window.
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby tonyb@cybernetics.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Borislav Petkov (AMD) bp@alien8.de
Reviewed-by: Ashok Raj ashok.raj@intel.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics...
Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/include/asm/cpu.h |    2 +
 arch/x86/kernel/process.c  |   23 +++++++++++++++-
 arch/x86/kernel/smp.c      |   62 +++++++++++++++++++++++++++++----------------
 3 files changed, 64 insertions(+), 23 deletions(-)

--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);

+extern struct cpumask cpus_stop_mask;
+
 #endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,13 +759,23 @@ bool xen_set_default_idle(void)
 }
 #endif

+struct cpumask cpus_stop_mask;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
+	unsigned int cpu = smp_processor_id();
+
 	local_irq_disable();
+
 	/*
-	 * Remove this CPU:
+	 * Remove this CPU from the online mask and disable it
+	 * unconditionally. This might be redundant in case that the reboot
+	 * vector was handled late and stop_other_cpus() sent an NMI.
+	 *
+	 * According to SDM and APM NMIs can be accepted even after soft
+	 * disabling the local APIC.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	set_cpu_online(cpu, false);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));

@@ -783,6 +793,15 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	if (cpuid_eax(0x8000001f) & BIT(0))
 		native_wbinvd();
+
+	/*
+	 * This brings a cache line back and dirties it, but
+	 * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+	 * observed that all CPUs reported stop. This write will invalidate
+	 * the related cache line on this CPU.
+	 */
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -146,31 +147,43 @@ static int register_stop_handler(void)

 static void native_stop_other_cpus(int wait)
 {
-	unsigned long flags;
-	unsigned long timeout;
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags, timeout;

 	if (reboot_force)
 		return;

-	/*
-	 * Use an own vector here because smp_call_function
-	 * does lots of things not suitable in a panic situation.
-	 */
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+		return;

 	/*
-	 * We start by using the REBOOT_VECTOR irq.
-	 * The irq is treated as a sync point to allow critical
-	 * regions of code on other cpus to release their spin locks
-	 * and re-enable irqs. Jumping straight to an NMI might
-	 * accidentally cause deadlocks with further shutdown/panic
-	 * code. By syncing, we give the cpus up to one second to
-	 * finish their work before we force them off with the NMI.
+	 * 1) Send an IPI on the reboot vector to all other CPUs.
+	 *
+	 *    The other CPUs should react on it after leaving critical
+	 *    sections and re-enabling interrupts. They might still hold
+	 *    locks, but there is nothing which can be done about that.
+	 *
+	 * 2) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * 3) If #2 timed out send an NMI to the CPUs which did not
+	 *    yet report
+	 *
+	 * 4) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * #3 can obviously race against a CPU reaching the HLT loop late.
+	 * That CPU will have reported already and the "have all CPUs
+	 * reached HLT" condition will be true despite the fact that the
+	 * other CPU is still handling the NMI. Again, there is no
+	 * protection against that as "disabled" APICs still respond to
+	 * NMIs.
 	 */
-	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
+	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);

+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/* sync above data before sending IRQ */
 		wmb();

@@ -183,12 +196,12 @@ static void native_stop_other_cpus(int w
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (!cpumask_empty(&cpus_stop_mask) && timeout--)
 			udelay(1);
 	}

 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
@@ -200,7 +213,8 @@ static void native_stop_other_cpus(int w

 			pr_emerg("Shutting down cpus with NMI\n");

-			apic_send_IPI_allbutself(NMI_VECTOR);
+			for_each_cpu(cpu, &cpus_stop_mask)
+				apic->send_IPI(cpu, NMI_VECTOR);
 		}
 		/*
 		 * Don't wait longer than 10 ms if the caller didn't
@@ -208,7 +222,7 @@ static void native_stop_other_cpus(int w
 		 * one or more CPUs do not reach shutdown state.
 		 */
 		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
 			udelay(1);
 	}

@@ -216,6 +230,12 @@ static void native_stop_other_cpus(int w
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cpus_stop_mask cache lines are invalidated on
+	 * the other CPUs. See comment vs. SME in stop_this_cpu().
+	 */
+	cpumask_clear(&cpus_stop_mask);
 }

 /*
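The ordering problem described in the changelog above can be sketched as a small sequential model. Everything here is hypothetical: struct stop_model and its helpers only mimic num_online_cpus() and cpus_stop_mask for up to 64 CPUs, and the real code runs concurrently on several CPUs, which this sketch does not attempt to reproduce.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, limited to 64 CPUs so the mask fits one word. */
struct stop_model {
	unsigned int online;	/* stands in for num_online_cpus() */
	uint64_t stop_mask;	/* stands in for cpus_stop_mask */
};

static uint64_t cpu_bit(unsigned int cpu)
{
	return UINT64_C(1) << cpu;
}

/* Initiating CPU: copy the online mask and drop its own bit. */
static void stop_begin(struct stop_model *m, unsigned int self,
		       unsigned int ncpus)
{
	uint64_t all = (ncpus >= 64) ? UINT64_MAX : cpu_bit(ncpus) - 1;

	m->online = ncpus;
	m->stop_mask = all & ~cpu_bit(self);
}

/* Target CPU, first step: set_cpu_online(cpu, false), before WBINVD. */
static void cpu_goes_offline(struct stop_model *m)
{
	m->online--;
}

/* Target CPU, last step: clear its bit after WBINVD has completed. */
static void cpu_reports_stopped(struct stop_model *m, unsigned int cpu)
{
	m->stop_mask &= ~cpu_bit(cpu);
}

/* Old wait condition: watches the online counter. */
static bool old_wait_done(const struct stop_model *m)
{
	return m->online <= 1;
}

/* New wait condition: watches the stop mask. */
static bool new_wait_done(const struct stop_model *m)
{
	return m->stop_mask == 0;
}
```

In this model the old condition becomes true as soon as the target marks itself offline, while the new one only becomes true once the target has reported after its cache flush.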
From: Tony Battersby tonyb@cybernetics.com
commit 9b040453d4440659f33dc6f0aa26af418ebfe70b upstream.
stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. Intel CPUs return the content of the highest supported leaf when a non-existing leaf is read, while AMD CPUs return all zeros for unsupported leafs.
So the result of the test on Intel CPUs is lottery.
While harmless it's incorrect and causes the conditional wbinvd() to be issued where not required.
Check whether the leaf is supported before reading it.
[ tglx: Adjusted changelog ]
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Signed-off-by: Tony Battersby tonyb@cybernetics.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Mario Limonciello mario.limonciello@amd.com
Reviewed-by: Borislav Petkov (AMD) bp@alien8.de
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.c...
Link: https://lore.kernel.org/r/20230615193330.322186388@linutronix.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/kernel/process.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -763,6 +763,7 @@ struct cpumask cpus_stop_mask;

 void __noreturn stop_this_cpu(void *dummy)
 {
+	struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
 	unsigned int cpu = smp_processor_id();

 	local_irq_disable();
@@ -777,7 +778,7 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	set_cpu_online(cpu, false);
 	disable_local_APIC();
-	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+	mcheck_cpu_clear(c);

 	/*
 	 * Use wbinvd on processors that support SME. This provides support
@@ -791,7 +792,7 @@ void __noreturn stop_this_cpu(void *dumm
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
 		native_wbinvd();

 	/*
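A rough user-space sketch of the "check the leaf before reading it" rule from this patch follows. The callback type cpuid_eax_fn and the two fake CPUs are hypothetical stand-ins for cpuid_eax(); only the leaf number 0x8000001f and the use of bit 0 come from the diff above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SME_LEAF 0x8000001fu

/* Hypothetical stand-in for cpuid_eax(): returns EAX for a given leaf. */
typedef uint32_t (*cpuid_eax_fn)(uint32_t leaf);

/*
 * Fake CPU that answers any leaf with all bits set, modelling a CPU
 * which echoes unrelated data for a leaf it does not implement.
 */
static uint32_t fake_eax_garbage(uint32_t leaf)
{
	(void)leaf;
	return 0xffffffffu;
}

/* Fake CPU that answers any leaf with zero. */
static uint32_t fake_eax_zero(uint32_t leaf)
{
	(void)leaf;
	return 0;
}

/* Only trust bit 0 of the leaf when the CPU reports the leaf as supported. */
static bool needs_wbinvd(uint32_t extended_cpuid_level, cpuid_eax_fn read_eax)
{
	if (extended_cpuid_level < SME_LEAF)
		return false;

	return (read_eax(SME_LEAF) & 1u) != 0;
}
```

The garbage-returning fake is the interesting case: without the level check its bit 0 would trigger the cache flush even though the leaf does not exist.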
From: Thomas Gleixner tglx@linutronix.de
commit 2affa6d6db28855e6340b060b809c23477aa546e upstream.
The wmb()s before sending the IPIs are not synchronizing anything.
If at all then the apic IPI functions have to provide or act as appropriate barriers.
Remove these cargo cult barriers which have no explanation of what they are synchronizing.
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Borislav Petkov (AMD) bp@alien8.de
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.378358382@linutronix.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/kernel/smp.c |    6 ------
 1 file changed, 6 deletions(-)

--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -184,9 +184,6 @@ static void native_stop_other_cpus(int w
 	cpumask_clear_cpu(cpu, &cpus_stop_mask);

 	if (!cpumask_empty(&cpus_stop_mask)) {
-		/* sync above data before sending IRQ */
-		wmb();
-
 		apic_send_IPI_allbutself(REBOOT_VECTOR);

 		/*
@@ -208,9 +205,6 @@ static void native_stop_other_cpus(int w
 		 * CPUs to stop.
 		 */
 		if (!smp_no_nmi_ipi && !register_stop_handler()) {
-			/* Sync above data before sending IRQ */
-			wmb();
-
 			pr_emerg("Shutting down cpus with NMI\n");

 			for_each_cpu(cpu, &cpus_stop_mask)
From: Thomas Gleixner tglx@linutronix.de
commit f9c9987bf52f4e42e940ae217333ebb5a4c3b506 upstream.
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an obvious choice as all what is needed is a cache line which is not written by other CPUs.
But there is a use case where a "dead" CPU needs to be brought out of MWAIT: kexec().
This is required as kexec() can overwrite text, pagetables, stacks and the monitored cacheline of the original kernel. The latter causes MWAIT to resume execution which obviously causes havoc on the kexec kernel which results usually in triple faults.
Use a dedicated per CPU storage to prepare for that.
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Reviewed-by: Ashok Raj ashok.raj@intel.com
Reviewed-by: Borislav Petkov (AMD) bp@alien8.de
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/kernel/smpboot.c |   24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,17 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);

+struct mwait_cpu_dead {
+	unsigned int control;
+	unsigned int status;
+};
+
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
+
 /* Logical package management. We might want to allocate that dynamically */
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
@@ -1758,10 +1769,10 @@ EXPORT_SYMBOL_GPL(cond_wakeup_cpu0);
  */
 static inline void mwait_play_dead(void)
 {
+	struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
-	void *mwait_ptr;
 	int i;

 	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
@@ -1796,13 +1807,6 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}

-	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors. The actual
-	 * content is immaterial as it is not actually modified in any way.
-	 */
-	mwait_ptr = &current_thread_info()->flags;
-
 	wbinvd();

 	while (1) {
@@ -1814,9 +1818,9 @@ static inline void mwait_play_dead(void)
 		 * case where we return around the loop.
 		 */
 		mb();
-		clflush(mwait_ptr);
+		clflush(md);
 		mb();
-		__monitor(mwait_ptr, 0, 0);
+		__monitor(md, 0, 0);
 		mb();
 		__mwait(eax, 0);
From: Thomas Gleixner tglx@linutronix.de
commit d7893093a7417527c0d73c9832244e65c9d0114f upstream.
TLDR: It's a mess.
When kexec() is executed on a system with offline CPUs, which are parked in mwait_play_dead() it can end up in a triple fault during the bootup of the kexec kernel or cause hard to diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's text, page tables, data and stack. If it writes to the cache line which is monitored by a previously offlined CPU, MWAIT resumes execution and ends up executing the wrong text, dereferencing overwritten page tables or corrupting the kexec kernels data.
Cure this by bringing the offlined CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT resume execution. The written control word tells the offlined CPUs to issue HLT, which does not have the MWAIT problem.
That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as those make it come out of HLT.
A follow up change will put them into INIT, which protects at least against NMI and SMI.
Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj ashok.raj@intel.com
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Tested-by: Ashok Raj ashok.raj@intel.com
Reviewed-by: Ashok Raj ashok.raj@intel.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/include/asm/smp.h |    2 +
 arch/x86/kernel/smp.c      |    5 +++
 arch/x86/kernel/smpboot.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@ void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 void cond_wakeup_cpu0(void);

+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>

 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -157,6 +158,10 @@ static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
 		return;

+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
+
 	/*
 	 * 1) Send an IPI on the reboot vector to all other CPUs.
 	 *
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@
 #include <linux/tboot.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/kexec.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
@@ -106,6 +107,9 @@ struct mwait_cpu_dead {
 	unsigned int status;
 };

+#define CPUDEAD_MWAIT_WAIT 0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEAD
+
 /*
  * Cache line aligned data for mwait_play_dead(). Separate on purpose so
  * that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@ static void smp_callin(void)
 {
 	int cpuid;

+	/* Mop up eventual mwait_play_dead() wreckage */
+	this_cpu_write(mwait_cpu_dead.status, 0);
+	this_cpu_write(mwait_cpu_dead.control, 0);
+
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
 	 * cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}

+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
 	wbinvd();

 	while (1) {
@@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
 		mb();
 		__mwait(eax, 0);

+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while(1)
+				native_halt();
+		}
+
 		cond_wakeup_cpu0();
 	}
 }

+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+	struct mwait_cpu_dead *md;
+	unsigned int cpu, i;
+
+	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+		/* Does it sit in mwait_play_dead() ? */
+		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+			continue;
+
+		/* Wait up to 5ms */
+		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+			/* Bring it out of mwait */
+			WRITE_ONCE(md->control, newstate);
+			udelay(5);
+		}
+
+		if (READ_ONCE(md->status) != newstate)
+			pr_err_once("CPU%u is stuck in mwait_play_dead()\n", cpu);
+	}
+}
+
 void __noreturn hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)
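The control/status handshake added by this patch can be followed in a sequential sketch. The two magic values are taken from the diff; struct mwait_model and the three helpers are hypothetical, model a single parked CPU, and replace the concurrent READ_ONCE()/WRITE_ONCE() accesses and the retry loop of the real code with plain steps.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUDEAD_MWAIT_WAIT      0xDEADBEEFu
#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEADu

/* Hypothetical model of the per-CPU mwait_cpu_dead data of one CPU. */
struct mwait_model {
	uint32_t control;	/* written by the kicking CPU */
	uint32_t status;	/* written by the parked CPU */
};

/* Parked CPU: state set up before it enters the MONITOR/MWAIT loop. */
static void enter_mwait(struct mwait_model *md)
{
	md->status = CPUDEAD_MWAIT_WAIT;
	md->control = CPUDEAD_MWAIT_WAIT;
}

/*
 * Parked CPU: what happens each time MWAIT returns.
 * Returns true when the CPU leaves MWAIT for good and goes to HLT,
 * false when the wakeup was spurious and it would re-arm the monitor.
 */
static bool on_mwait_return(struct mwait_model *md)
{
	if (md->control == CPUDEAD_MWAIT_KEXEC_HLT) {
		md->status = CPUDEAD_MWAIT_KEXEC_HLT;
		return true;
	}
	return false;
}

/*
 * Kicking CPU: one attempt to move a parked CPU to HLT.
 * Returns false when the CPU is not sitting in MWAIT at all.
 */
static bool kick_to_hlt(struct mwait_model *md)
{
	if (md->status != CPUDEAD_MWAIT_WAIT)
		return false;

	/* In the real code this write also triggers the monitor. */
	md->control = CPUDEAD_MWAIT_KEXEC_HLT;
	return true;
}
```

A CPU that never parked (both fields zero, as after smp_callin()) is left alone, and a CPU that already acknowledged with the HLT value is not kicked again.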
From: Wyes Karny wyes.karny@amd.com
commit f4aad639302a07454dcb23b408dcadf8a9efb031 upstream.
amd-pstate passive mode driver is hyphenated. So make amd-pstate active mode driver consistent with that rename "amd_pstate_epp" to "amd-pstate-epp".
Fixes: ffa5096a7c33 ("cpufreq: amd-pstate: implement Pstate EPP support for the AMD processors")
Cc: All applicable stable@vger.kernel.org
Reviewed-by: Gautham R. Shenoy gautham.shenoy@amd.com
Signed-off-by: Wyes Karny wyes.karny@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Perry Yuan Perry.Yuan@amd.com
Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/cpufreq/amd-pstate.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/cpufreq/amd-pstate.c
+++ b/drivers/cpufreq/amd-pstate.c
@@ -1356,7 +1356,7 @@ static struct cpufreq_driver amd_pstate_
 	.online = amd_pstate_epp_cpu_online,
 	.suspend = amd_pstate_epp_suspend,
 	.resume = amd_pstate_epp_resume,
-	.name = "amd_pstate_epp",
+	.name = "amd-pstate-epp",
 	.attr = amd_pstate_epp_attr,
 };
From: Oliver Hartkopp socketcan@hartkopp.net
commit e38910c0072b541a91954682c8b074a93e57c09b upstream.
With commit d674a8f123b4 ("can: isotp: isotp_sendmsg(): fix return error on FC timeout on TX path") the missing correct return value in the case of a protocol error was introduced.
But the way the error value has been read and sent to the user space does not follow the common scheme to clear the error after reading which is provided by the sock_error() function. This leads to an error report at the following write() attempt although everything should be working.
Fixes: d674a8f123b4 ("can: isotp: isotp_sendmsg(): fix return error on FC timeout on TX path")
Reported-by: Carsten Schmidt carsten.schmidt-achim@t-online.de
Signed-off-by: Oliver Hartkopp socketcan@hartkopp.net
Link: https://lore.kernel.org/all/20230607072708.38809-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde mkl@pengutronix.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 net/can/isotp.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -1112,8 +1112,9 @@ wait_free_buffer:
 		if (err)
 			goto err_event_drop;

-		if (sk->sk_err)
-			return -sk->sk_err;
+		err = sock_error(sk);
+		if (err)
+			return err;
 	}

 	return size;
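The difference between reading the pending error and reading-and-clearing it, as the changelog describes for sock_error(), can be sketched in user space. struct sock_model, peek_error() and take_error() are hypothetical names; the sketch only assumes that a pending error is stored as a positive number and reported as its negative.

```c
#include <stdatomic.h>

/* Hypothetical model of a socket with one pending error slot. */
struct sock_model {
	atomic_int sk_err;	/* 0 means no pending error */
};

/*
 * Old pattern from the removed lines: report the error but leave it
 * pending, so the next caller sees the same error again.
 */
static int peek_error(struct sock_model *sk)
{
	int err = atomic_load(&sk->sk_err);

	return err ? -err : 0;
}

/*
 * Read-and-clear pattern: fetch the pending error and reset the slot
 * in one step, so the error is reported exactly once.
 */
static int take_error(struct sock_model *sk)
{
	return -atomic_exchange(&sk->sk_err, 0);
}
```

With the first helper a follow-up call still reports the stale error, which matches the symptom in the changelog of a later write() failing although nothing is wrong.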
From: Peng Zhang zhangpeng.00@bytedance.com
commit cd00dd2585c4158e81fdfac0bbcc0446afbad26d upstream.
Check the write offset end bounds before using it as the offset into the pivot array. This avoids a possible out-of-bounds access on the pivot array if the write extends to the last slot in the node, in which case the node maximum should be used as the end pivot.
akpm: this doesn't affect any current callers, but new users of mapletree may encounter this problem if backported into earlier kernels, so let's fix it in -stable kernels in case of this.
Link: https://lkml.kernel.org/r/20230506024752.2550-1-zhangpeng.00@bytedance.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Peng Zhang zhangpeng.00@bytedance.com
Reviewed-by: Liam R. Howlett Liam.Howlett@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 lib/maple_tree.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4263,11 +4263,13 @@ done:

 static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
 {
-	while ((wr_mas->mas->last > wr_mas->end_piv) &&
-	       (wr_mas->offset_end < wr_mas->node_end))
-		wr_mas->end_piv = wr_mas->pivots[++wr_mas->offset_end];
+	while ((wr_mas->offset_end < wr_mas->node_end) &&
+	       (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
+		wr_mas->offset_end++;

-	if (wr_mas->mas->last > wr_mas->end_piv)
+	if (wr_mas->offset_end < wr_mas->node_end)
+		wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
+	else
 		wr_mas->end_piv = wr_mas->mas->max;
 }

@@ -4424,7 +4426,6 @@ static inline void *mas_wr_store_entry(s
 	}

 	/* At this point, we are at the leaf node that needs to be altered. */
-	wr_mas->end_piv = wr_mas->r_max;
 	mas_wr_end_piv(wr_mas);

 	if (!wr_mas->entry)
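The bounds-check-first ordering of the patched mas_wr_end_piv() can be tried out on a flat array. struct wr_model is a hypothetical simplification of ma_wr_state: it assumes that pivots[] holds exactly node_end entries and that the last slot has no pivot of its own, so its end is the node maximum.

```c
#include <stddef.h>

/* Hypothetical, flattened stand-in for the fields used by the patch. */
struct wr_model {
	const unsigned long *pivots;	/* valid indexes: 0 .. node_end - 1 */
	size_t node_end;		/* offset of the last slot */
	size_t offset_end;		/* in: start offset, out: end offset */
	unsigned long last;		/* last index covered by the write */
	unsigned long node_max;		/* maximum index of the node */
	unsigned long end_piv;		/* out: pivot ending the write */
};

/*
 * Same shape as the patched function: test the offset against node_end
 * before it is used as an index into pivots[].
 */
static void wr_end_piv(struct wr_model *w)
{
	while (w->offset_end < w->node_end &&
	       w->last > w->pivots[w->offset_end])
		w->offset_end++;

	if (w->offset_end < w->node_end)
		w->end_piv = w->pivots[w->offset_end];
	else
		w->end_piv = w->node_max;	/* write reaches the last slot */
}
```

When the write extends into the last slot the loop stops at node_end without reading pivots[node_end], and the node maximum is used instead, which is the case the changelog calls out.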
From: Linus Torvalds torvalds@linux-foundation.org
commit c2508ec5a58db67093f4fb8bf89a9a7c53a109e9 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma" code, but extends it to also take the mmap lock for writing in case we actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly special logic to serialize the stack segment expansion (since we only hold the mmap lock for reading) that doesn't match the normal VM locking.
That slight violation of locking worked well, right up until it didn't: the maple tree code really does want proper locking even for simple extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it up to do the necessary write locking, and make it available as a helper function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal stack-grows-down case. Which is all architectures except for PA-RISC and IA64. So some rare architectures can't use the helper, but if they care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an optimistic "mmap_upgrade_trylock()" to make it quicker to go from a read-lock (for the common case) to taking the write lock (for having to extend the vma) in the normal single-threaded situation where there is no other locking activity.
But that _is_ all the very uncommon special case, so while it would be nice to have such an operation, it probably doesn't matter in reality. I did put in the skeleton code for such a possible future expansion, even if it only acts as pseudo-documentation for what we're doing.
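As a reading aid, the flow described above can be modelled in user space. This is a heavily simplified sketch and not the helper from mm/memory.c: the types, find_vma_model() and lock_and_find() are hypothetical, the lock is a plain state variable, only the stack-grows-down case is handled, and the step back to a read lock after expansion is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

enum lock_model { UNLOCKED, READ_LOCKED, WRITE_LOCKED };

struct vma_model {
	unsigned long start, end;	/* covers [start, end) */
	bool grows_down;		/* stands in for VM_GROWSDOWN */
};

struct mm_model {
	struct vma_model *vmas;		/* sorted by address, non-overlapping */
	size_t count;
	enum lock_model lock;
	unsigned int write_locks;	/* how often the write lock was taken */
};

/* First vma that ends above addr, like find_vma(); NULL if none. */
static struct vma_model *find_vma_model(struct mm_model *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->count; i++)
		if (addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/* Returns the vma with the lock held for reading, or NULL with it dropped. */
static struct vma_model *lock_and_find(struct mm_model *mm, unsigned long addr)
{
	struct vma_model *vma;

	mm->lock = READ_LOCKED;
	vma = find_vma_model(mm, addr);
	if (vma && vma->start <= addr)
		return vma;		/* common case: no expansion needed */

	if (!vma || !vma->grows_down) {
		mm->lock = UNLOCKED;
		return NULL;
	}

	/* Expansion needed: retake the lock for writing and look up again. */
	mm->lock = WRITE_LOCKED;
	mm->write_locks++;
	vma = find_vma_model(mm, addr);
	if (!vma || (vma->start > addr && !vma->grows_down)) {
		mm->lock = UNLOCKED;
		return NULL;
	}
	if (vma->start > addr)
		vma->start = addr;	/* models the stack expansion */

	mm->lock = READ_LOCKED;		/* assumed downgrade before returning */
	return vma;
}
```

Only a fault just below a grows-down vma takes the write lock in this model; ordinary faults inside an existing vma stay on the read lock, which is the split the message above argues for.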
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/Kconfig    |    1
 arch/x86/mm/fault.c |   52 ----------------------
 include/linux/mm.h  |    2
 mm/Kconfig          |    4 +
 mm/memory.c         |  121 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 130 insertions(+), 50 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -276,6 +276,7 @@ config X86
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_SMT if SMP
 	select IRQ_FORCED_THREADING
+	select LOCK_MM_AND_FIND_VMA
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
 	select NEED_SG_DMA_LENGTH
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -880,12 +880,6 @@ __bad_area(struct pt_regs *regs, unsigne
 	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
 }

-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
-{
-	__bad_area(regs, error_code, address, 0, SEGV_MAPERR);
-}
-
 static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 		struct vm_area_struct *vma)
 {
@@ -1366,51 +1360,10 @@ void do_user_addr_fault(struct pt_regs *
 lock_mmap:
 #endif /* CONFIG_PER_VMA_LOCK */

-	/*
-	 * Kernel-mode access to the user address space should only occur
-	 * on well-defined single instructions listed in the exception
-	 * tables. But, an erroneous kernel fault occurring outside one of
-	 * those areas which also holds mmap_lock might deadlock attempting
-	 * to validate the fault against the address space.
-	 *
-	 * Only do the expensive exception table search when we might be at
-	 * risk of a deadlock. This happens if we
-	 * 1. Failed to acquire mmap_lock, and
-	 * 2. The access did not originate in userspace.
-	 */
-	if (unlikely(!mmap_read_trylock(mm))) {
-		if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
-			/*
-			 * Fault from code in kernel from
-			 * which we do not expect faults.
-			 */
-			bad_area_nosemaphore(regs, error_code, address);
-			return;
-		}
 retry:
-		mmap_read_lock(mm);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case we'll have missed the might_sleep() from
-		 * down_read():
-		 */
-		might_sleep();
-	}
-
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (unlikely(!vma)) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (likely(vma->vm_start <= address))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}

@@ -1418,7 +1371,6 @@ retry:
 	 * Ok, we have a good vm_area for this memory access, so
 	 * we can handle it..
 	 */
-good_area:
 	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address, vma);
 		return;
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2325,6 +2325,8 @@ void unmap_mapping_pages(struct address_
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
+		unsigned long address, struct pt_regs *regs);
 #else
 static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 					 unsigned long address, unsigned int flags,
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1206,6 +1206,10 @@ config PER_VMA_LOCK
 	  This feature allows locking each virtual memory area separately when
 	  handling page faults instead of taking mmap_lock.
+config LOCK_MM_AND_FIND_VMA + bool + depends on !STACK_GROWSUP + source "mm/damon/Kconfig"
endmenu --- a/mm/memory.c +++ b/mm/memory.c @@ -5262,6 +5262,127 @@ out: } EXPORT_SYMBOL_GPL(handle_mm_fault);
+#ifdef CONFIG_LOCK_MM_AND_FIND_VMA +#include <linux/extable.h> + +static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) +{ + /* Even if this succeeds, make it clear we *might* have slept */ + if (likely(mmap_read_trylock(mm))) { + might_sleep(); + return true; + } + + if (regs && !user_mode(regs)) { + unsigned long ip = instruction_pointer(regs); + if (!search_exception_tables(ip)) + return false; + } + + mmap_read_lock(mm); + return true; +} + +static inline bool mmap_upgrade_trylock(struct mm_struct *mm) +{ + /* + * We don't have this operation yet. + * + * It should be easy enough to do: it's basically a + * atomic_long_try_cmpxchg_acquire() + * from RWSEM_READER_BIAS -> RWSEM_WRITER_LOCKED, but + * it also needs the proper lockdep magic etc. + */ + return false; +} + +static inline bool upgrade_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) +{ + mmap_read_unlock(mm); + if (regs && !user_mode(regs)) { + unsigned long ip = instruction_pointer(regs); + if (!search_exception_tables(ip)) + return false; + } + mmap_write_lock(mm); + return true; +} + +/* + * Helper for page fault handling. + * + * This is kind of equivalend to "mmap_read_lock()" followed + * by "find_extend_vma()", except it's a lot more careful about + * the locking (and will drop the lock on failure). + * + * For example, if we have a kernel bug that causes a page + * fault, we don't want to just use mmap_read_lock() to get + * the mm lock, because that would deadlock if the bug were + * to happen while we're holding the mm lock for writing. + * + * So this checks the exception tables on kernel faults in + * order to only do this all for instructions that are actually + * expected to fault. + * + * We can also actually take the mm lock for writing if we + * need to extend the vma, which helps the VM layer a lot. 
+ */ +struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, + unsigned long addr, struct pt_regs *regs) +{ + struct vm_area_struct *vma; + + if (!get_mmap_lock_carefully(mm, regs)) + return NULL; + + vma = find_vma(mm, addr); + if (likely(vma && (vma->vm_start <= addr))) + return vma; + + /* + * Well, dang. We might still be successful, but only + * if we can extend a vma to do so. + */ + if (!vma || !(vma->vm_flags & VM_GROWSDOWN)) { + mmap_read_unlock(mm); + return NULL; + } + + /* + * We can try to upgrade the mmap lock atomically, + * in which case we can continue to use the vma + * we already looked up. + * + * Otherwise we'll have to drop the mmap lock and + * re-take it, and also look up the vma again, + * re-checking it. + */ + if (!mmap_upgrade_trylock(mm)) { + if (!upgrade_mmap_lock_carefully(mm, regs)) + return NULL; + + vma = find_vma(mm, addr); + if (!vma) + goto fail; + if (vma->vm_start <= addr) + goto success; + if (!(vma->vm_flags & VM_GROWSDOWN)) + goto fail; + } + + if (expand_stack(vma, addr)) + goto fail; + +success: + mmap_write_downgrade(mm); + return vma; + +fail: + mmap_write_unlock(mm); + return NULL; +} +#endif + #ifdef CONFIG_PER_VMA_LOCK /* * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
From: Linus Torvalds torvalds@linux-foundation.org
commit eda0047296a16d65a7f2bc60a408f70d178b2014 upstream.
This is done as a separate patch from introducing the new lock_mm_and_find_vma() helper, because while it's an obvious change, it's not what x86 used to do in this area.
We already abort the page fault on fatal signals anyway, so why should we wait for the mmap lock only to then abort later? With the new helper function that returns without the lock held on failure anyway, this is particularly easy and straightforward.
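The return convention is the only subtle part here: the killable lock functions return 0 on success and a negative errno when a fatal signal interrupts the wait, so the helper just negates that into a bool. A toy user-space model of that shape, with every toy_* name and field invented for the sketch (it assumes an uncontended lock is taken right away even with a signal pending):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for this sketch; none of these are kernel types. */
struct toy_mm {
	bool read_locked;
	bool contended;			/* someone else holds the lock */
	bool fatal_signal_pending;	/* e.g. SIGKILL arrived while waiting */
};

/*
 * Models the *_killable() convention: 0 on success, -EINTR if a fatal
 * signal interrupts the wait. An uncontended lock is taken immediately.
 */
static int toy_read_lock_killable(struct toy_mm *mm)
{
	assert(!mm->read_locked);	/* the toy does not model recursion */

	if (mm->contended && mm->fatal_signal_pending)
		return -EINTR;
	mm->contended = false;	/* pretend the other holder eventually let go */
	mm->read_locked = true;
	return 0;
}

/* Mirrors the shape of the patched helper: true only if the lock is held. */
static bool toy_get_lock_carefully(struct toy_mm *mm)
{
	return !toy_read_lock_killable(mm);
}
```

A false return therefore always means "lock not held", which is the same state the helper already leaves behind on its other failure paths.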
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/memory.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5279,8 +5279,7 @@ static inline bool get_mmap_lock_careful
 			return false;
 	}

-	mmap_read_lock(mm);
-	return true;
+	return !mmap_read_lock_killable(mm);
 }

 static inline bool mmap_upgrade_trylock(struct mm_struct *mm)
@@ -5304,8 +5303,7 @@ static inline bool upgrade_mmap_lock_car
 		if (!search_exception_tables(ip))
 			return false;
 	}
-	mmap_write_lock(mm);
-	return true;
+	return !mmap_write_lock_killable(mm);
 }

 /*
From: Linus Torvalds torvalds@linux-foundation.org
commit ae870a68b5d13d67cf4f18d47bb01ee3fee40acb upstream.
This converts arm64 to use the new page fault helper. It was very straightforward, but still needed a fix for the "obvious" conversion I initially did. Thanks to Suren for the fix and testing.
Fixed-and-tested-by: Suren Baghdasaryan surenb@google.com Unnecessary-code-removal-by: Liam R. Howlett Liam.Howlett@oracle.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/arm64/Kconfig | 1 + arch/arm64/mm/fault.c | 47 ++++++++--------------------------------------- 2 files changed, 9 insertions(+), 39 deletions(-)
--- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -225,6 +225,7 @@ config ARM64 select IRQ_DOMAIN select IRQ_FORCED_THREADING select KASAN_VMALLOC if KASAN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE select NEED_SG_DMA_LENGTH --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -483,27 +483,14 @@ static void do_bad_area(unsigned long fa #define VM_FAULT_BADMAP ((__force vm_fault_t)0x010000) #define VM_FAULT_BADACCESS ((__force vm_fault_t)0x020000)
-static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, +static vm_fault_t __do_page_fault(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long addr, unsigned int mm_flags, unsigned long vm_flags, struct pt_regs *regs) { - struct vm_area_struct *vma = find_vma(mm, addr); - - if (unlikely(!vma)) - return VM_FAULT_BADMAP; - /* * Ok, we have a good vm_area for this memory access, so we can handle * it. - */ - if (unlikely(vma->vm_start > addr)) { - if (!(vma->vm_flags & VM_GROWSDOWN)) - return VM_FAULT_BADMAP; - if (expand_stack(vma, addr)) - return VM_FAULT_BADMAP; - } - - /* * Check that the permissions on the VMA allow for the fault which * occurred. */ @@ -617,31 +604,15 @@ static int __kprobes do_page_fault(unsig } lock_mmap: #endif /* CONFIG_PER_VMA_LOCK */ - /* - * As per x86, we may deadlock here. However, since the kernel only - * validly references user space from well defined areas of the code, - * we can bug out early if this is from code which shouldn't. - */ - if (!mmap_read_trylock(mm)) { - if (!user_mode(regs) && !search_exception_tables(regs->pc)) - goto no_context; + retry: - mmap_read_lock(mm); - } else { - /* - * The above mmap_read_trylock() might have succeeded in which - * case, we'll have missed the might_sleep() from down_read(). - */ - might_sleep(); -#ifdef CONFIG_DEBUG_VM - if (!user_mode(regs) && !search_exception_tables(regs->pc)) { - mmap_read_unlock(mm); - goto no_context; - } -#endif + vma = lock_mm_and_find_vma(mm, addr, regs); + if (unlikely(!vma)) { + fault = VM_FAULT_BADMAP; + goto done; }
- fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs); + fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs);
/* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -660,9 +631,7 @@ retry: } mmap_read_unlock(mm);
-#ifdef CONFIG_PER_VMA_LOCK done: -#endif /* * Handle the "normal" (no error) case first. */
From: Michael Ellerman mpe@ellerman.id.au
commit e6fe228c4ffafdfc970cf6d46883a1f481baf7ea upstream.
Signed-off-by: Michael Ellerman mpe@ellerman.id.au Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/fault.c | 41 ++++------------------------------------- 2 files changed, 5 insertions(+), 37 deletions(-)
--- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -278,6 +278,7 @@ config PPC select IRQ_DOMAIN select IRQ_FORCED_THREADING select KASAN_VMALLOC if KASAN && MODULES + select LOCK_MM_AND_FIND_VMA select MMU_GATHER_PAGE_SIZE select MMU_GATHER_RCU_TABLE_FREE select MMU_GATHER_MERGE_VMAS --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -84,11 +84,6 @@ static int __bad_area(struct pt_regs *re return __bad_area_nosemaphore(regs, address, si_code); }
-static noinline int bad_area(struct pt_regs *regs, unsigned long address) -{ - return __bad_area(regs, address, SEGV_MAPERR); -} - static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, struct vm_area_struct *vma) { @@ -515,40 +510,12 @@ lock_mmap: * we will deadlock attempting to validate the fault against the * address space. Luckily the kernel only validly references user * space from well defined areas of code, which are listed in the - * exceptions table. - * - * As the vast majority of faults will be valid we will only perform - * the source reference check when there is a possibility of a deadlock. - * Attempt to lock the address space, if we cannot we then validate the - * source. If this is invalid we can skip the address space check, - * thus avoiding the deadlock. - */ - if (unlikely(!mmap_read_trylock(mm))) { - if (!is_user && !search_exception_tables(regs->nip)) - return bad_area_nosemaphore(regs, address); - + * exceptions table. lock_mm_and_find_vma() handles that logic. + */ retry: - mmap_read_lock(mm); - } else { - /* - * The above down_read_trylock() might have succeeded in - * which case we'll have missed the might_sleep() from - * down_read(): - */ - might_sleep(); - } - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) - return bad_area(regs, address); - - if (unlikely(vma->vm_start > address)) { - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) - return bad_area(regs, address); - - if (unlikely(expand_stack(vma, address))) - return bad_area(regs, address); - } + return bad_area_nosemaphore(regs, address);
if (unlikely(access_pkey_error(is_write, is_exec, (error_code & DSISR_KEYFAULT), vma)))
From: Ben Hutchings ben@decadent.org.uk
commit 4bce37a68ff884e821a02a731897a8119e0c37b7 upstream.
Signed-off-by: Ben Hutchings ben@decadent.org.uk
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/mips/Kconfig    |  1 +
 arch/mips/mm/fault.c | 12 ++----------
 2 files changed, 3 insertions(+), 10 deletions(-)

--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -91,6 +91,7 @@ config MIPS
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN if 64BIT || !SMP
 	select IRQ_FORCED_THREADING
 	select ISA if EISA
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_REL if MODULES
 	select MODULES_USE_ELF_RELA if MODULES && 64BIT
 	select PERF_USE_VMALLOC
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -99,21 +99,13 @@ static void __do_page_fault(struct pt_re

 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-	if (expand_stack(vma, address))
-		goto bad_area;
+		goto bad_area_nosemaphore;
 /*
  * Ok, we have a good vm_area for this memory access, so
  * we can handle it..
  */
-good_area:
 	si_code = SEGV_ACCERR;

 	if (write) {
From: Ben Hutchings ben@decadent.org.uk
commit 7267ef7b0b77f4ed23b7b3c87d8eca7bd9c2d007 upstream.
Signed-off-by: Ben Hutchings ben@decadent.org.uk Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/riscv/Kconfig | 1 + arch/riscv/mm/fault.c | 31 +++++++++++++------------------ 2 files changed, 14 insertions(+), 18 deletions(-)
--- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -126,6 +126,7 @@ config RISCV select IRQ_DOMAIN select IRQ_FORCED_THREADING select KASAN_VMALLOC if KASAN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA if MODULES select MODULE_SECTIONS if MODULES select OF --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -84,13 +84,13 @@ static inline void mm_fault_error(struct BUG(); }
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr) +static inline void +bad_area_nosemaphore(struct pt_regs *regs, int code, unsigned long addr) { /* * Something tried to access memory that isn't in our memory map. * Fix it, but check if it's kernel or user first. */ - mmap_read_unlock(mm); /* User mode accesses just cause a SIGSEGV */ if (user_mode(regs)) { do_trap(regs, SIGSEGV, code, addr); @@ -100,6 +100,15 @@ static inline void bad_area(struct pt_re no_context(regs, addr); }
+static inline void +bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, + unsigned long addr) +{ + mmap_read_unlock(mm); + + bad_area_nosemaphore(regs, code, addr); +} + static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr) { pgd_t *pgd, *pgd_k; @@ -287,23 +296,10 @@ void handle_page_fault(struct pt_regs *r else if (cause == EXC_INST_PAGE_FAULT) flags |= FAULT_FLAG_INSTRUCTION; retry: - mmap_read_lock(mm); - vma = find_vma(mm, addr); + vma = lock_mm_and_find_vma(mm, addr, regs); if (unlikely(!vma)) { tsk->thread.bad_cause = cause; - bad_area(regs, mm, code, addr); - return; - } - if (likely(vma->vm_start <= addr)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - tsk->thread.bad_cause = cause; - bad_area(regs, mm, code, addr); - return; - } - if (unlikely(expand_stack(vma, addr))) { - tsk->thread.bad_cause = cause; - bad_area(regs, mm, code, addr); + bad_area_nosemaphore(regs, code, addr); return; }
@@ -311,7 +307,6 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it. */ -good_area: code = SEGV_ACCERR;
if (unlikely(access_error(cause, vma))) {
From: Ben Hutchings ben@decadent.org.uk
commit 8b35ca3e45e35a26a21427f35d4093606e93ad0a upstream.
arm has an additional check for address < FIRST_USER_ADDRESS before expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere (generally as 0), move that check to the generic expand_downwards().
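As a rough illustration of the check being moved, here is a small stand-alone sketch of just the address test at the top of expand_downwards() after this change. The function name, the page size and the way the two limits are passed in as parameters are assumptions for the sketch; in the kernel, mmap_min_addr and FIRST_USER_ADDRESS come from the configuration and the architecture.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical values for the sketch; real ones are per-architecture/config. */
#define TOY_PAGE_SIZE	4096UL
#define TOY_PAGE_MASK	(~(TOY_PAGE_SIZE - 1))

/*
 * Models only the address sanity check: round the fault address down to
 * a page boundary, then refuse anything below either lower limit.
 */
static int toy_expand_check(unsigned long address,
			    unsigned long mmap_min_addr,
			    unsigned long first_user_address)
{
	/* the mask trick only works for a power-of-two page size */
	assert((TOY_PAGE_SIZE & (TOY_PAGE_SIZE - 1)) == 0);

	address &= TOY_PAGE_MASK;
	if (address < mmap_min_addr || address < first_user_address)
		return -EPERM;
	return 0;
}
```

On architectures where FIRST_USER_ADDRESS is 0 the second comparison can never be true for an unsigned address, so the generic check behaves as before there.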
Signed-off-by: Ben Hutchings ben@decadent.org.uk Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/arm/Kconfig | 1 arch/arm/mm/fault.c | 63 +++++++++++----------------------------------------- mm/mmap.c | 2 - 3 files changed, 16 insertions(+), 50 deletions(-)
--- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -125,6 +125,7 @@ config ARM select HAVE_UID16 select HAVE_VIRT_CPU_ACCOUNTING_GEN select IRQ_FORCED_THREADING + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_REL select NEED_DMA_MAP_STATE select OF_EARLY_FLATTREE if OF --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -232,37 +232,11 @@ static inline bool is_permission_fault(u return false; }
-static vm_fault_t __kprobes -__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int flags, - unsigned long vma_flags, struct pt_regs *regs) -{ - struct vm_area_struct *vma = find_vma(mm, addr); - if (unlikely(!vma)) - return VM_FAULT_BADMAP; - - if (unlikely(vma->vm_start > addr)) { - if (!(vma->vm_flags & VM_GROWSDOWN)) - return VM_FAULT_BADMAP; - if (addr < FIRST_USER_ADDRESS) - return VM_FAULT_BADMAP; - if (expand_stack(vma, addr)) - return VM_FAULT_BADMAP; - } - - /* - * ok, we have a good vm_area for this memory access, check the - * permissions on the VMA allow for the fault which occurred. - */ - if (!(vma->vm_flags & vma_flags)) - return VM_FAULT_BADACCESS; - - return handle_mm_fault(vma, addr & PAGE_MASK, flags, regs); -} - static int __kprobes do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; int sig, code; vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; @@ -301,31 +275,21 @@ do_page_fault(unsigned long addr, unsign
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
- /* - * As per x86, we may deadlock here. However, since the kernel only - * validly references user space from well defined areas of the code, - * we can bug out early if this is from code which shouldn't. - */ - if (!mmap_read_trylock(mm)) { - if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc)) - goto no_context; retry: - mmap_read_lock(mm); - } else { - /* - * The above down_read_trylock() might have succeeded in - * which case, we'll have missed the might_sleep() from - * down_read() - */ - might_sleep(); -#ifdef CONFIG_DEBUG_VM - if (!user_mode(regs) && - !search_exception_tables(regs->ARM_pc)) - goto no_context; -#endif + vma = lock_mm_and_find_vma(mm, addr, regs); + if (unlikely(!vma)) { + fault = VM_FAULT_BADMAP; + goto bad_area; }
- fault = __do_page_fault(mm, addr, flags, vm_flags, regs); + /* + * ok, we have a good vm_area for this memory access, check the + * permissions on the VMA allow for the fault which occurred. + */ + if (!(vma->vm_flags & vm_flags)) + fault = VM_FAULT_BADACCESS; + else + fault = handle_mm_fault(vma, addr & PAGE_MASK, flags, regs);
/* If we need to retry but a fatal signal is pending, handle the * signal first. We do not need to release the mmap_lock because @@ -356,6 +320,7 @@ retry: if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS)))) return 0;
+bad_area: /* * If we are in kernel mode at this point, we * have no context to handle this fault with. --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2036,7 +2036,7 @@ int expand_downwards(struct vm_area_stru int error = 0;
address &= PAGE_MASK; - if (address < mmap_min_addr) + if (address < mmap_min_addr || address < FIRST_USER_ADDRESS) return -EPERM;
/* Enforce stack_guard_gap */
From: Linus Torvalds torvalds@linux-foundation.org
commit a050ba1e7422f2cc60ff8bfde3f96d34d00cb585 upstream.
This does the simple pattern conversion of alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa to the lock_mm_and_find_vma() helper. They all have the regular fault handling pattern without odd special cases.
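For readers who want the "regular fault handling pattern" spelled out, the following is a single-threaded toy model of the converted caller and of the contract it relies on: a non-NULL return means the read lock is held, a NULL return means the lock has already been dropped, so the failure path must not unlock again (the "bad_area_nosemaphore" label in the diffs). All toy_* names are invented, and the stack growth is simplified to a plain assignment with no guard gap or limit checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for this sketch; none of these are kernel types. */
enum toy_lock { TOY_UNLOCKED, TOY_READ, TOY_WRITE };

struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	bool grows_down;
};

struct toy_mm {
	enum toy_lock lock;
	struct toy_vma *vmas;		/* sorted by address */
	size_t nr_vmas;
};

/* Like find_vma(): first vma that ends above addr; it may start above it too. */
static struct toy_vma *toy_find_vma(struct toy_mm *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->nr_vmas; i++)
		if (addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/* Non-NULL return: read lock held. NULL return: lock already dropped. */
static struct toy_vma *toy_lock_mm_and_find_vma(struct toy_mm *mm,
						unsigned long addr)
{
	struct toy_vma *vma;

	assert(mm->lock == TOY_UNLOCKED);
	mm->lock = TOY_READ;
	vma = toy_find_vma(mm, addr);
	if (vma && vma->start <= addr)
		return vma;
	if (!vma || !vma->grows_down) {
		mm->lock = TOY_UNLOCKED;
		return NULL;
	}
	/*
	 * Stack growth needs the write lock. This toy is single-threaded,
	 * so the re-lookup the kernel does after re-locking is omitted.
	 */
	mm->lock = TOY_WRITE;
	vma->start = addr;
	mm->lock = TOY_READ;	/* downgrade before returning */
	return vma;
}

/* Caller shape after the conversion: 0 on success, -1 for a bad address. */
static int toy_handle_fault(struct toy_mm *mm, unsigned long addr)
{
	struct toy_vma *vma = toy_lock_mm_and_find_vma(mm, addr);

	if (!vma)
		return -1;	/* "bad_area_nosemaphore": nothing to unlock */
	/* ... permission checks and the real fault handling would go here ... */
	mm->lock = TOY_UNLOCKED;	/* normal path drops the read lock */
	return 0;
}
```

The point of the model is the asymmetry: paths reached after a successful lookup still unlock (the old "bad_area" label), while the lookup failure path does not.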
The remaining architectures all have something that keeps us from a straightforward conversion: ia64 and parisc have stacks that can grow up as well as down (and ia64 has special address region checks).
And m68k, microblaze, openrisc, sparc64, and um end up having extra rules about only expanding the stack down a limited amount below the user space stack pointer. That is something that x86 used to do too (long long ago), and it probably could just be skipped, but it still makes the conversion less than trivial.
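The kind of extra rule meant here can be sketched as a predicate on the fault address and the user stack pointer. This is only an illustration of the general shape: the function name and the slack value are hypothetical, and each of the architectures listed has its own exact condition.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical slack for the sketch; each architecture picks its own rule. */
#define TOY_STACK_SLACK 256UL

/*
 * A fault below the stack vma may only grow the stack if it lands no
 * more than TOY_STACK_SLACK bytes below the user stack pointer.
 */
static bool toy_may_expand_stack(unsigned long fault_addr, unsigned long user_sp)
{
	/* keep the toy arithmetic from wrapping around */
	assert(fault_addr <= ULONG_MAX - TOY_STACK_SLACK);

	return fault_addr + TOY_STACK_SLACK >= user_sp;
}
```

Because such a test needs the register state and sits between the lookup and the expansion, it does not drop straight into the common helper, which is why these conversions are less than trivial.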
Note that this conversion was done manually and with the exception of alpha without any build testing, because I have a fairly limited cross-building environment. The cases are all simple, and I went through the changes several times, but...
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/alpha/Kconfig | 1 + arch/alpha/mm/fault.c | 13 +++---------- arch/arc/Kconfig | 1 + arch/arc/mm/fault.c | 11 +++-------- arch/csky/Kconfig | 1 + arch/csky/mm/fault.c | 22 +++++----------------- arch/hexagon/Kconfig | 1 + arch/hexagon/mm/vm_fault.c | 18 ++++-------------- arch/loongarch/Kconfig | 1 + arch/loongarch/mm/fault.c | 16 ++++++---------- arch/nios2/Kconfig | 1 + arch/nios2/mm/fault.c | 17 ++--------------- arch/sh/Kconfig | 1 + arch/sh/mm/fault.c | 17 ++--------------- arch/sparc/Kconfig | 1 + arch/sparc/mm/fault_32.c | 32 ++++++++------------------------ arch/xtensa/Kconfig | 1 + arch/xtensa/mm/fault.c | 14 +++----------- 18 files changed, 45 insertions(+), 124 deletions(-)
--- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -30,6 +30,7 @@ config ALPHA select HAS_IOPORT select HAVE_ARCH_AUDITSYSCALL select HAVE_MOD_ARCH_SPECIFIC + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select ODD_RT_SIGACTION select OLD_SIGSUSPEND --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -119,20 +119,12 @@ do_page_fault(unsigned long address, uns flags |= FAULT_FLAG_USER; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore;
/* Ok, we have a good vm_area for this memory access, so we can handle it. */ - good_area: si_code = SEGV_ACCERR; if (cause < 0) { if (!(vma->vm_flags & VM_EXEC)) @@ -192,6 +184,7 @@ retry: bad_area: mmap_read_unlock(mm);
+ bad_area_nosemaphore: if (user_mode(regs)) goto do_sigsegv;
--- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -41,6 +41,7 @@ config ARC select HAVE_PERF_EVENTS select HAVE_SYSCALL_TRACEPOINTS select IRQ_DOMAIN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select OF select OF_EARLY_FLATTREE --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -113,15 +113,9 @@ void do_page_fault(unsigned long address
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (unlikely(address < vma->vm_start)) { - if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address)) - goto bad_area; - } + goto bad_area_nosemaphore;
/* * vm_area is good, now check permissions for this memory access @@ -161,6 +155,7 @@ retry: bad_area: mmap_read_unlock(mm);
+bad_area_nosemaphore: /* * Major/minor page fault accounting * (in case of retry we only land here once) --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -96,6 +96,7 @@ config CSKY select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_STACKPROTECTOR select HAVE_SYSCALL_TRACEPOINTS + select LOCK_MM_AND_FIND_VMA select MAY_HAVE_SPARSE_IRQ select MODULES_USE_ELF_RELA if MODULES select OF --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -97,13 +97,12 @@ static inline void mm_fault_error(struct BUG(); }
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr) +static inline void bad_area_nosemaphore(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr) { /* * Something tried to access memory that isn't in our memory map. * Fix it, but check if it's kernel or user first. */ - mmap_read_unlock(mm); /* User mode accesses just cause a SIGSEGV */ if (user_mode(regs)) { do_trap(regs, SIGSEGV, code, addr); @@ -238,20 +237,9 @@ asmlinkage void do_page_fault(struct pt_ if (is_write(regs)) flags |= FAULT_FLAG_WRITE; retry: - mmap_read_lock(mm); - vma = find_vma(mm, addr); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) { - bad_area(regs, mm, code, addr); - return; - } - if (likely(vma->vm_start <= addr)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - bad_area(regs, mm, code, addr); - return; - } - if (unlikely(expand_stack(vma, addr))) { - bad_area(regs, mm, code, addr); + bad_area_nosemaphore(regs, mm, code, addr); return; }
@@ -259,11 +247,11 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it. */ -good_area: code = SEGV_ACCERR;
if (unlikely(access_error(regs, vma))) { - bad_area(regs, mm, code, addr); + mmap_read_unlock(mm); + bad_area_nosemaphore(regs, mm, code, addr); return; }
--- a/arch/hexagon/Kconfig +++ b/arch/hexagon/Kconfig @@ -28,6 +28,7 @@ config HEXAGON select GENERIC_SMP_IDLE_THREAD select STACKTRACE_SUPPORT select GENERIC_CLOCKEVENTS_BROADCAST + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select GENERIC_CPU_DEVICES select ARCH_WANT_LD_ORPHAN_WARN --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -57,21 +57,10 @@ void do_page_fault(unsigned long address
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); - if (!vma) - goto bad_area; + vma = lock_mm_and_find_vma(mm, address, regs); + if (unlikely(!vma)) + goto bad_area_nosemaphore;
- if (vma->vm_start <= address) - goto good_area; - - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - - if (expand_stack(vma, address)) - goto bad_area; - -good_area: /* Address space is OK. Now check access rights. */ si_code = SEGV_ACCERR;
@@ -143,6 +132,7 @@ good_area: bad_area: mmap_read_unlock(mm);
+bad_area_nosemaphore: if (user_mode(regs)) { force_sig_fault(SIGSEGV, si_code, (void __user *)address); return; --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -130,6 +130,7 @@ config LOONGARCH select HAVE_VIRT_CPU_ACCOUNTING_GEN if !SMP select IRQ_FORCED_THREADING select IRQ_LOONGARCH_CPU + select LOCK_MM_AND_FIND_VMA select MMU_GATHER_MERGE_VMAS if MMU select MODULES_USE_ELF_RELA if MODULES select NEED_PER_CPU_EMBED_FIRST_CHUNK --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -169,22 +169,18 @@ static void __kprobes __do_page_fault(st
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); - if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (!expand_stack(vma, address)) - goto good_area; + vma = lock_mm_and_find_vma(mm, address, regs); + if (unlikely(!vma)) + goto bad_area_nosemaphore; + goto good_area; + /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: do_sigsegv(regs, write, address, si_code); return;
--- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -16,6 +16,7 @@ config NIOS2 select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_KGDB select IRQ_DOMAIN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select OF select OF_EARLY_FLATTREE --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -86,27 +86,14 @@ asmlinkage void do_page_fault(struct pt_
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
- if (!mmap_read_trylock(mm)) { - if (!user_mode(regs) && !search_exception_tables(regs->ea)) - goto bad_area_nosemaphore; retry: - mmap_read_lock(mm); - } - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: code = SEGV_ACCERR;
switch (cause) { --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -59,6 +59,7 @@ config SUPERH select HAVE_STACKPROTECTOR select HAVE_SYSCALL_TRACEPOINTS select IRQ_FORCED_THREADING + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select NEED_SG_DMA_LENGTH select NO_DMA if !MMU && !DMA_COHERENT --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -439,21 +439,9 @@ asmlinkage void __kprobes do_page_fault( }
retry: - mmap_read_lock(mm); - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) { - bad_area(regs, error_code, address); - return; - } - if (likely(vma->vm_start <= address)) - goto good_area; - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { - bad_area(regs, error_code, address); - return; - } - if (unlikely(expand_stack(vma, address))) { - bad_area(regs, error_code, address); + bad_area_nosemaphore(regs, error_code, address); return; }
@@ -461,7 +449,6 @@ retry: * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: if (unlikely(access_error(error_code, vma))) { bad_area_access_error(regs, error_code, address); return; --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -57,6 +57,7 @@ config SPARC32 select DMA_DIRECT_REMAP select GENERIC_ATOMIC64 select HAVE_UID16 + select LOCK_MM_AND_FIND_VMA select OLD_SIGACTION select ZONE_DMA
--- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -143,28 +143,19 @@ asmlinkage void do_sparc_fault(struct pt if (pagefault_disabled() || !mm) goto no_context;
+ if (!from_user && address >= PAGE_OFFSET) + goto no_context; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
retry: - mmap_read_lock(mm); - - if (!from_user && address >= PAGE_OFFSET) - goto bad_area; - - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. */ -good_area: code = SEGV_ACCERR; if (write) { if (!(vma->vm_flags & VM_WRITE)) @@ -321,17 +312,9 @@ static void force_user_fault(unsigned lo
code = SEGV_MAPERR;
- mmap_read_lock(mm); - vma = find_vma(mm, address); + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; -good_area: + goto bad_area_nosemaphore; code = SEGV_ACCERR; if (write) { if (!(vma->vm_flags & VM_WRITE)) @@ -350,6 +333,7 @@ good_area: return; bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: __do_fault_siginfo(code, SIGSEGV, tsk->thread.kregs, address); return;
--- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -49,6 +49,7 @@ config XTENSA select HAVE_SYSCALL_TRACEPOINTS select HAVE_VIRT_CPU_ACCOUNTING_GEN select IRQ_DOMAIN + select LOCK_MM_AND_FIND_VMA select MODULES_USE_ELF_RELA select PERF_USE_VMALLOC select TRACE_IRQFLAGS_SUPPORT --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -130,23 +130,14 @@ void do_page_fault(struct pt_regs *regs) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
retry: - mmap_read_lock(mm); - vma = find_vma(mm, address); - + vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) - goto bad_area; - if (vma->vm_start <= address) - goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; + goto bad_area_nosemaphore;
/* Ok, we have a good vm_area for this memory access, so * we can handle it.. */
-good_area: code = SEGV_ACCERR;
if (is_write) { @@ -205,6 +196,7 @@ good_area: */ bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: if (user_mode(regs)) { force_sig_fault(SIGSEGV, code, (void *) address); return;
From: Linus Torvalds torvalds@linux-foundation.org
commit 2cd76c50d0b41cec5c87abfcdf25b236a2793fb6 upstream.
This is one of the simple cases, except there's no pt_regs pointer. Which is fine, as lock_mm_and_find_vma() is set up to work fine with a NULL pt_regs.
Powerpc already enabled LOCK_MM_AND_FIND_VMA for the main CPU faulting, so we can just use the helper without any extra work.
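A rough user-space sketch of why a NULL pt_regs is harmless, assuming the helper only consults the register state to decide whether a contended kernel-mode fault may sleep on the lock. Every toy_* name below is hypothetical and exists only for illustration; none of this is the kernel's actual code, and the killable wait is simplified to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; not real kernel types. */
struct toy_regs {
	bool user_mode;		/* did the fault come from user space? */
	bool has_fixup;		/* does the faulting instruction have an exception-table entry? */
};

struct toy_mm {
	bool contended;		/* pretend the initial trylock fails */
	int readers;		/* number of read locks currently held */
};

/*
 * Model of taking the mmap lock "carefully".
 * Returns true with the read lock held, false with no lock held.
 */
static bool toy_lock_mm_carefully(struct toy_mm *mm, const struct toy_regs *regs)
{
	assert(mm != NULL);

	if (!mm->contended) {		/* fast path: trylock succeeded */
		mm->readers++;
		return true;
	}

	/*
	 * Only a kernel-mode fault with register state available can be
	 * rejected here. With regs == NULL the check is skipped entirely.
	 */
	if (regs && !regs->user_mode && !regs->has_fixup)
		return false;

	mm->readers++;			/* slow path: wait for the lock (assumed to succeed) */
	return true;
}
```

In this model a caller such as the coprocessor fault path, which has no register state to offer, simply passes NULL and always ends up on the plain "take the lock" path.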
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/powerpc/mm/copro_fault.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -33,19 +33,11 @@ int copro_handle_mm_fault(struct mm_stru
 	if (mm->pgd == NULL)
 		return -EFAULT;
 
-	mmap_read_lock(mm);
-	ret = -EFAULT;
-	vma = find_vma(mm, ea);
+	vma = lock_mm_and_find_vma(mm, ea, NULL);
 	if (!vma)
-		goto out_unlock;
-
-	if (ea < vma->vm_start) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			goto out_unlock;
-		if (expand_stack(vma, ea))
-			goto out_unlock;
-	}
+		return -EFAULT;
 
+	ret = -EFAULT;
 	is_write = dsisr & DSISR_ISSTORE;
 	if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
From: Liam R. Howlett Liam.Howlett@oracle.com
commit f440fa1ac955e2898893f9301568435eb5cdfc4b upstream.
Make calls to extend_vma() and find_extend_vma() fail if the write lock is required but not held.
To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order.
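A small user-space sketch of the transitional flag, loosely modelled on the checks this patch adds to expand_downwards(). The toy_* names and the simplified vma structure are assumptions made for illustration; the real function does considerably more (accounting, maple tree updates, page alignment).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified vma; not the kernel's struct. */
struct toy_vma {
	unsigned long start, end;
	bool grows_down;
	bool accessible;
};

/*
 * Grow 'vma' down to 'address'. 'prev' is the mapping just below it
 * (or NULL). Assumes prev->end <= address <= vma->start.
 */
static int toy_expand_downwards(struct toy_vma *vma, const struct toy_vma *prev,
				unsigned long address, unsigned long guard_gap,
				bool write_locked)
{
	assert(vma != NULL && address <= vma->start);

	if (prev) {
		/* Enforce the guard gap against a non-stack neighbour. */
		if (!prev->grows_down && prev->accessible &&
		    address - prev->end < guard_gap)
			return -ENOMEM;
		/* Growing right up against prev is refused for read-lockers. */
		if (!write_locked && prev->end == address)
			return -EAGAIN;
	}

	vma->start = address;
	return 0;
}
```

A legacy caller passing false still succeeds in the common case where the stack does not end up touching its neighbour, while a converted caller passing true succeeds in both cases.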
Co-Developed-by: Matthew Wilcox (Oracle) willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/binfmt_elf.c    |  6 +++---
 fs/exec.c          |  5 +++--
 include/linux/mm.h | 10 +++++++---
 mm/memory.c        |  2 +-
 mm/mmap.c          | 50 +++++++++++++++++++++++++++++++++-----------------
 mm/nommu.c         |  3 ++-
 6 files changed, 49 insertions(+), 27 deletions(-)
--- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -320,10 +320,10 @@ create_elf_tables(struct linux_binprm *b * Grow the stack manually; some architectures have a limit on how * far ahead a user-space access may be in order to grow the stack. */ - if (mmap_read_lock_killable(mm)) + if (mmap_write_lock_killable(mm)) return -EINTR; - vma = find_extend_vma(mm, bprm->p); - mmap_read_unlock(mm); + vma = find_extend_vma_locked(mm, bprm->p, true); + mmap_write_unlock(mm); if (!vma) return -EFAULT;
--- a/fs/exec.c +++ b/fs/exec.c @@ -205,7 +205,8 @@ static struct page *get_arg_page(struct
#ifdef CONFIG_STACK_GROWSUP if (write) { - ret = expand_downwards(bprm->vma, pos); + /* We claim to hold the lock - nobody to race with */ + ret = expand_downwards(bprm->vma, pos, true); if (ret < 0) return NULL; } @@ -853,7 +854,7 @@ int setup_arg_pages(struct linux_binprm stack_base = vma->vm_end - stack_expand; #endif current->mm->start_stack = bprm->p; - ret = expand_stack(vma, stack_base); + ret = expand_stack_locked(vma, stack_base, true); if (ret) ret = -EFAULT;
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3192,11 +3192,13 @@ extern vm_fault_t filemap_page_mkwrite(s
extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ -extern int expand_stack(struct vm_area_struct *vma, unsigned long address); +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked); +#define expand_stack(vma,addr) expand_stack_locked(vma,addr,false)
/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ -extern int expand_downwards(struct vm_area_struct *vma, - unsigned long address); +int expand_downwards(struct vm_area_struct *vma, unsigned long address, + bool write_locked); #if VM_GROWSUP extern int expand_upwards(struct vm_area_struct *vma, unsigned long address); #else @@ -3297,6 +3299,8 @@ unsigned long change_prot_numa(struct vm #endif
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *, + unsigned long addr, bool write_locked); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, --- a/mm/memory.c +++ b/mm/memory.c @@ -5368,7 +5368,7 @@ struct vm_area_struct *lock_mm_and_find_ goto fail; }
- if (expand_stack(vma, addr)) + if (expand_stack_locked(vma, addr, true)) goto fail;
success: --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1935,7 +1935,8 @@ static int acct_stack_growth(struct vm_a * PA-RISC uses this for its stack; IA64 for its Register Backing Store. * vma is the last one with address > vma->vm_end. Have to extend vma. */ -int expand_upwards(struct vm_area_struct *vma, unsigned long address) +int expand_upwards(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *next; @@ -1959,6 +1960,8 @@ int expand_upwards(struct vm_area_struct if (gap_addr < address || gap_addr > TASK_SIZE) gap_addr = TASK_SIZE;
+ if (!write_locked) + return -EAGAIN; next = find_vma_intersection(mm, vma->vm_end, gap_addr); if (next && vma_is_accessible(next)) { if (!(next->vm_flags & VM_GROWSUP)) @@ -2028,7 +2031,8 @@ int expand_upwards(struct vm_area_struct /* * vma is the first one with address < vma->vm_start. Have to extend vma. */ -int expand_downwards(struct vm_area_struct *vma, unsigned long address) +int expand_downwards(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { struct mm_struct *mm = vma->vm_mm; MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_start); @@ -2042,10 +2046,13 @@ int expand_downwards(struct vm_area_stru /* Enforce stack_guard_gap */ prev = mas_prev(&mas, 0); /* Check that both stack segments have the same anon_vma? */ - if (prev && !(prev->vm_flags & VM_GROWSDOWN) && - vma_is_accessible(prev)) { - if (address - prev->vm_end < stack_guard_gap) + if (prev) { + if (!(prev->vm_flags & VM_GROWSDOWN) && + vma_is_accessible(prev) && + (address - prev->vm_end < stack_guard_gap)) return -ENOMEM; + if (!write_locked && (prev->vm_end == address)) + return -EAGAIN; }
if (mas_preallocate(&mas, GFP_KERNEL)) @@ -2124,13 +2131,14 @@ static int __init cmdline_parse_stack_gu __setup("stack_guard_gap=", cmdline_parse_stack_guard_gap);
#ifdef CONFIG_STACK_GROWSUP -int expand_stack(struct vm_area_struct *vma, unsigned long address) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { - return expand_upwards(vma, address); + return expand_upwards(vma, address, write_locked); }
-struct vm_area_struct * -find_extend_vma(struct mm_struct *mm, unsigned long addr) +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, + unsigned long addr, bool write_locked) { struct vm_area_struct *vma, *prev;
@@ -2138,20 +2146,25 @@ find_extend_vma(struct mm_struct *mm, un vma = find_vma_prev(mm, addr, &prev); if (vma && (vma->vm_start <= addr)) return vma; - if (!prev || expand_stack(prev, addr)) + if (!prev) + return NULL; + if (expand_stack_locked(prev, addr, write_locked)) return NULL; if (prev->vm_flags & VM_LOCKED) populate_vma_page_range(prev, addr, prev->vm_end, NULL); return prev; } #else -int expand_stack(struct vm_area_struct *vma, unsigned long address) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { - return expand_downwards(vma, address); + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) + return -EINVAL; + return expand_downwards(vma, address, write_locked); }
-struct vm_area_struct * -find_extend_vma(struct mm_struct *mm, unsigned long addr) +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, + unsigned long addr, bool write_locked) { struct vm_area_struct *vma; unsigned long start; @@ -2162,10 +2175,8 @@ find_extend_vma(struct mm_struct *mm, un return NULL; if (vma->vm_start <= addr) return vma; - if (!(vma->vm_flags & VM_GROWSDOWN)) - return NULL; start = vma->vm_start; - if (expand_stack(vma, addr)) + if (expand_stack_locked(vma, addr, write_locked)) return NULL; if (vma->vm_flags & VM_LOCKED) populate_vma_page_range(vma, addr, start, NULL); @@ -2173,6 +2184,11 @@ find_extend_vma(struct mm_struct *mm, un } #endif
+struct vm_area_struct *find_extend_vma(struct mm_struct *mm, + unsigned long addr) +{ + return find_extend_vma_locked(mm, addr, false); +} EXPORT_SYMBOL_GPL(find_extend_vma);
/* --- a/mm/nommu.c +++ b/mm/nommu.c @@ -643,7 +643,8 @@ struct vm_area_struct *find_extend_vma(s * expand a stack to a given address * - not supported under NOMMU conditions */ -int expand_stack(struct vm_area_struct *vma, unsigned long address) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, + bool write_locked) { return -ENOMEM; }
From: Linus Torvalds torvalds@linux-foundation.org
commit f313c51d26aa87e69633c9b46efb37a930faca71 upstream.
This is a small step towards a model where GUP itself would not expand the stack, and any user that needs GUP to not look up existing mappings, but actually expand on them, would have to do so manually before-hand, and with the mm lock held for writing.
It turns out that execve() already did almost exactly that, except it didn't take the mm lock at all (it's single-threaded so no locking technically needed, but it could cause lockdep errors). And it only did it for the CONFIG_STACK_GROWSUP case, since in that case GUP has obviously never expanded the stack downwards.
So just make that CONFIG_STACK_GROWSUP case do the right thing with locking, and enable it generally. This will eventually help GUP, and in the meantime avoids a special case and the lockdep issue.
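The locking sequence described above can be sketched in user space as follows. This is only a toy state machine under stated assumptions: the toy_* helpers are hypothetical stand-ins for the mmap lock operations, the "expansion" is a single assignment, and the page lookup is reduced to a range check.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_lock { TOY_UNLOCKED, TOY_READ, TOY_WRITE };

/* Hypothetical mm: one lock state and the lowest address of the stack. */
struct toy_mm {
	enum toy_lock lock;
	unsigned long stack_start;
};

static void toy_write_lock(struct toy_mm *mm)   { assert(mm->lock == TOY_UNLOCKED); mm->lock = TOY_WRITE; }
static void toy_write_unlock(struct toy_mm *mm) { assert(mm->lock == TOY_WRITE);    mm->lock = TOY_UNLOCKED; }
static void toy_downgrade(struct toy_mm *mm)    { assert(mm->lock == TOY_WRITE);    mm->lock = TOY_READ; }
static void toy_read_lock(struct toy_mm *mm)    { assert(mm->lock == TOY_UNLOCKED); mm->lock = TOY_READ; }
static void toy_read_unlock(struct toy_mm *mm)  { assert(mm->lock == TOY_READ);     mm->lock = TOY_UNLOCKED; }

/* Returns true if the page at 'pos' could be looked up. */
static bool toy_get_arg_page(struct toy_mm *mm, unsigned long pos, bool write,
			     unsigned long min_addr)
{
	bool ok;

	if (write && pos < mm->stack_start) {
		/* Expand by hand, under the write lock, before any lookup. */
		toy_write_lock(mm);
		if (pos < min_addr) {		/* stand-in for a failed expansion */
			toy_write_unlock(mm);
			return false;
		}
		mm->stack_start = pos;
		toy_downgrade(mm);		/* continue as a reader */
	} else {
		toy_read_lock(mm);
	}

	/* Stand-in for the page lookup: it never expands anything itself. */
	ok = pos >= mm->stack_start;
	toy_read_unlock(mm);
	return ok;
}
```

Both branches reach the lookup holding the lock for reading, so the single unlock after it is correct whichever path was taken.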
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/exec.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)
--- a/fs/exec.c +++ b/fs/exec.c @@ -200,34 +200,39 @@ static struct page *get_arg_page(struct int write) { struct page *page; + struct vm_area_struct *vma = bprm->vma; + struct mm_struct *mm = bprm->mm; int ret; - unsigned int gup_flags = 0;
-#ifdef CONFIG_STACK_GROWSUP - if (write) { - /* We claim to hold the lock - nobody to race with */ - ret = expand_downwards(bprm->vma, pos, true); - if (ret < 0) + /* + * Avoid relying on expanding the stack down in GUP (which + * does not work for STACK_GROWSUP anyway), and just do it + * by hand ahead of time. + */ + if (write && pos < vma->vm_start) { + mmap_write_lock(mm); + ret = expand_downwards(vma, pos, true); + if (unlikely(ret < 0)) { + mmap_write_unlock(mm); return NULL; - } -#endif - - if (write) - gup_flags |= FOLL_WRITE; + } + mmap_write_downgrade(mm); + } else + mmap_read_lock(mm);
/* * We are doing an exec(). 'current' is the process - * doing the exec and bprm->mm is the new process's mm. + * doing the exec and 'mm' is the new process's mm. */ - mmap_read_lock(bprm->mm); - ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags, + ret = get_user_pages_remote(mm, pos, 1, + write ? FOLL_WRITE : 0, &page, NULL, NULL); - mmap_read_unlock(bprm->mm); + mmap_read_unlock(mm); if (ret <= 0) return NULL;
if (write) - acct_arg_size(bprm, vma_pages(bprm->vma)); + acct_arg_size(bprm, vma_pages(vma));
return page; }
From: Linus Torvalds torvalds@linux-foundation.org
commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream.
This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that.
It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended.
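To illustrate the changed calling convention, here is a user-space sketch of a fault handler using a toy expand_stack(). It assumes a single downward-growing stack mapping and a write lock that can always be taken; all toy_* names are hypothetical and the model ignores guard gaps and upward growth.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_lock { TOY_UNLOCKED, TOY_READ, TOY_WRITE };

struct toy_vma { unsigned long start, end; };

/* Hypothetical mm with exactly one stack mapping. */
struct toy_mm {
	enum toy_lock lock;
	struct toy_vma stack;
	unsigned long lowest_allowed;	/* stand-in for all expansion limits */
};

/*
 * Called with the lock held for reading. On success returns the vma with
 * the lock held for reading again; on failure returns NULL with the lock
 * already dropped.
 */
static struct toy_vma *toy_expand_stack(struct toy_mm *mm, unsigned long addr)
{
	assert(mm->lock == TOY_READ);
	mm->lock = TOY_UNLOCKED;	/* drop the read lock ... */
	mm->lock = TOY_WRITE;		/* ... and take it for writing */

	/* Look up again: things may have changed while unlocked. */
	if (addr >= mm->stack.start && addr < mm->stack.end)
		goto success;
	if (addr < mm->stack.start && addr >= mm->lowest_allowed) {
		mm->stack.start = addr;
		goto success;
	}

	mm->lock = TOY_UNLOCKED;
	return NULL;

success:
	mm->lock = TOY_READ;		/* downgrade back to a reader */
	return &mm->stack;
}

static int toy_fault(struct toy_mm *mm, unsigned long addr)
{
	struct toy_vma *vma = NULL;

	assert(mm->lock == TOY_UNLOCKED);
	mm->lock = TOY_READ;
	if (addr >= mm->stack.start && addr < mm->stack.end)
		vma = &mm->stack;

	if (!vma) {
		vma = toy_expand_stack(mm, addr);
		if (!vma)
			return -EFAULT;	/* "nosemaphore" path: nothing to unlock */
	}

	/* ... handle the fault using the returned vma ... */
	mm->lock = TOY_UNLOCKED;
	return 0;
}
```

The point of the separate failure label in the converted architectures is visible here: after a NULL return the caller must not unlock again.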
Tested-by: Vegard Nossum vegard.nossum@oracle.com
Tested-by: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de # ia64
Tested-by: Frank Scheiner frank.scheiner@web.de # ia64
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/ia64/mm/fault.c         |  36 ++----------
 arch/m68k/mm/fault.c         |   9 ++-
 arch/microblaze/mm/fault.c   |   5 +
 arch/openrisc/mm/fault.c     |   5 +
 arch/parisc/mm/fault.c       |  23 +++-----
 arch/s390/mm/fault.c         |   5 +
 arch/sparc/mm/fault_64.c     |   8 +-
 arch/um/kernel/trap.c        |  11 ++-
 drivers/iommu/amd/iommu_v2.c |   4 -
 drivers/iommu/iommu-sva.c    |   2
 fs/binfmt_elf.c              |   2
 fs/exec.c                    |   4 -
 include/linux/mm.h           |  16 +----
 mm/gup.c                     |   6 +-
 mm/memory.c                  |  10 +++
 mm/mmap.c                    | 121 ++++++++++++++++++++++++++++++++++---------
 mm/nommu.c                   |  18 ++----
 17 files changed, 169 insertions(+), 116 deletions(-)
--- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -110,10 +110,12 @@ retry: * register backing store that needs to expand upwards, in * this case vma will be null, but prev_vma will ne non-null */ - if (( !vma && prev_vma ) || (address < vma->vm_start) ) - goto check_expansion; + if (( !vma && prev_vma ) || (address < vma->vm_start) ) { + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore; + }
- good_area: code = SEGV_ACCERR;
/* OK, we've got a good vm_area for this memory area. Check the access permissions: */ @@ -177,35 +179,9 @@ retry: mmap_read_unlock(mm); return;
- check_expansion: - if (!(prev_vma && (prev_vma->vm_flags & VM_GROWSUP) && (address == prev_vma->vm_end))) { - if (!vma) - goto bad_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; - if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start) - || REGION_OFFSET(address) >= RGN_MAP_LIMIT) - goto bad_area; - if (expand_stack(vma, address)) - goto bad_area; - } else { - vma = prev_vma; - if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start) - || REGION_OFFSET(address) >= RGN_MAP_LIMIT) - goto bad_area; - /* - * Since the register backing store is accessed sequentially, - * we disallow growing it by more than a page at a time. - */ - if (address > vma->vm_end + PAGE_SIZE - sizeof(long)) - goto bad_area; - if (expand_upwards(vma, address)) - goto bad_area; - } - goto good_area; - bad_area: mmap_read_unlock(mm); + bad_area_nosemaphore: if ((isr & IA64_ISR_SP) || ((isr & IA64_ISR_NA) && (isr & IA64_ISR_CODE_MASK) == IA64_ISR_CODE_LFETCH)) { --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -105,8 +105,9 @@ retry: if (address + 256 < rdusp()) goto map_err; } - if (expand_stack(vma, address)) - goto map_err; + vma = expand_stack(mm, address); + if (!vma) + goto map_err_nosemaphore;
/* * Ok, we have a good vm_area for this memory access, so @@ -196,10 +197,12 @@ bus_err: goto send_sig;
map_err: + mmap_read_unlock(mm); +map_err_nosemaphore: current->thread.signo = SIGSEGV; current->thread.code = SEGV_MAPERR; current->thread.faddr = address; - goto send_sig; + return send_fault_sig(regs);
acc_err: current->thread.signo = SIGSEGV; --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -192,8 +192,9 @@ retry: && (kernel_mode(regs) || !store_updates_sp(regs))) goto bad_area; } - if (expand_stack(vma, address)) - goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore;
good_area: code = SEGV_ACCERR; --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -127,8 +127,9 @@ retry: if (address + PAGE_SIZE < regs->sp) goto bad_area; } - if (expand_stack(vma, address)) - goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore;
/* * Ok, we have a good vm_area for this memory access, so --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -288,15 +288,19 @@ void do_page_fault(struct pt_regs *regs, retry: mmap_read_lock(mm); vma = find_vma_prev(mm, address, &prev_vma); - if (!vma || address < vma->vm_start) - goto check_expansion; + if (!vma || address < vma->vm_start) { + if (!prev || !(prev->vm_flags & VM_GROWSUP)) + goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore; + } + /* * Ok, we have a good vm_area for this memory access. We still need to * check the access permissions. */
-good_area: - if ((vma->vm_flags & acc_type) != acc_type) goto bad_area;
@@ -347,17 +351,13 @@ good_area: mmap_read_unlock(mm); return;
-check_expansion: - vma = prev_vma; - if (vma && (expand_stack(vma, address) == 0)) - goto good_area; - /* * Something tried to access memory that isn't in our memory map.. */ bad_area: mmap_read_unlock(mm);
+bad_area_nosemaphore: if (user_mode(regs)) { int signo, si_code;
@@ -449,7 +449,7 @@ handle_nadtlb_fault(struct pt_regs *regs { unsigned long insn = regs->iir; int breg, treg, xreg, val = 0; - struct vm_area_struct *vma, *prev_vma; + struct vm_area_struct *vma; struct task_struct *tsk; struct mm_struct *mm; unsigned long address; @@ -485,7 +485,7 @@ handle_nadtlb_fault(struct pt_regs *regs /* Search for VMA */ address = regs->ior; mmap_read_lock(mm); - vma = find_vma_prev(mm, address, &prev_vma); + vma = vma_lookup(mm, address); mmap_read_unlock(mm);
/* @@ -494,7 +494,6 @@ handle_nadtlb_fault(struct pt_regs *regs */ acc_type = (insn & 0x40) ? VM_WRITE : VM_READ; if (vma - && address >= vma->vm_start && (vma->vm_flags & acc_type) == acc_type) val = 1; } --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -457,8 +457,9 @@ retry: if (unlikely(vma->vm_start > address)) { if (!(vma->vm_flags & VM_GROWSDOWN)) goto out_up; - if (expand_stack(vma, address)) - goto out_up; + vma = expand_stack(mm, address); + if (!vma) + goto out; }
/* --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -383,8 +383,9 @@ continue_fault: goto bad_area; } } - if (expand_stack(vma, address)) - goto bad_area; + vma = expand_stack(mm, address); + if (!vma) + goto bad_area_nosemaphore; /* * Ok, we have a good vm_area for this memory access, so * we can handle it.. @@ -487,8 +488,9 @@ exit_exception: * Fix it, but check if it's kernel or user first.. */ bad_area: - insn = get_fault_insn(regs, insn); mmap_read_unlock(mm); +bad_area_nosemaphore: + insn = get_fault_insn(regs, insn);
handle_kernel_fault: do_kernel_fault(regs, si_code, fault_code, insn, address); --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -47,14 +47,15 @@ retry: vma = find_vma(mm, address); if (!vma) goto out; - else if (vma->vm_start <= address) + if (vma->vm_start <= address) goto good_area; - else if (!(vma->vm_flags & VM_GROWSDOWN)) + if (!(vma->vm_flags & VM_GROWSDOWN)) goto out; - else if (is_user && !ARCH_IS_STACKGROW(address)) - goto out; - else if (expand_stack(vma, address)) + if (is_user && !ARCH_IS_STACKGROW(address)) goto out; + vma = expand_stack(mm, address); + if (!vma) + goto out_nosemaphore;
good_area: *code_out = SEGV_ACCERR; --- a/drivers/iommu/amd/iommu_v2.c +++ b/drivers/iommu/amd/iommu_v2.c @@ -485,8 +485,8 @@ static void do_fault(struct work_struct flags |= FAULT_FLAG_REMOTE;
mmap_read_lock(mm); - vma = find_extend_vma(mm, address); - if (!vma || address < vma->vm_start) + vma = vma_lookup(mm, address); + if (!vma) /* failed to get a vma in the right range */ goto out;
--- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -175,7 +175,7 @@ iommu_sva_handle_iopf(struct iommu_fault
mmap_read_lock(mm);
- vma = find_extend_vma(mm, prm->addr); + vma = vma_lookup(mm, prm->addr); if (!vma) /* Unmapped area */ goto out_put_mm; --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -322,7 +322,7 @@ create_elf_tables(struct linux_binprm *b */ if (mmap_write_lock_killable(mm)) return -EINTR; - vma = find_extend_vma_locked(mm, bprm->p, true); + vma = find_extend_vma_locked(mm, bprm->p); mmap_write_unlock(mm); if (!vma) return -EFAULT; --- a/fs/exec.c +++ b/fs/exec.c @@ -211,7 +211,7 @@ static struct page *get_arg_page(struct */ if (write && pos < vma->vm_start) { mmap_write_lock(mm); - ret = expand_downwards(vma, pos, true); + ret = expand_downwards(vma, pos); if (unlikely(ret < 0)) { mmap_write_unlock(mm); return NULL; @@ -859,7 +859,7 @@ int setup_arg_pages(struct linux_binprm stack_base = vma->vm_end - stack_expand; #endif current->mm->start_stack = bprm->p; - ret = expand_stack_locked(vma, stack_base, true); + ret = expand_stack_locked(vma, stack_base); if (ret) ret = -EFAULT;
--- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3192,18 +3192,11 @@ extern vm_fault_t filemap_page_mkwrite(s
extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ -int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, - bool write_locked); -#define expand_stack(vma,addr) expand_stack_locked(vma,addr,false) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address); +struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr);
/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ -int expand_downwards(struct vm_area_struct *vma, unsigned long address, - bool write_locked); -#if VM_GROWSUP -extern int expand_upwards(struct vm_area_struct *vma, unsigned long address); -#else - #define expand_upwards(vma, address) (0) -#endif +int expand_downwards(struct vm_area_struct *vma, unsigned long address);
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr); @@ -3298,9 +3291,8 @@ unsigned long change_prot_numa(struct vm unsigned long start, unsigned long end); #endif
-struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); struct vm_area_struct *find_extend_vma_locked(struct mm_struct *, - unsigned long addr, bool write_locked); + unsigned long addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, --- a/mm/gup.c +++ b/mm/gup.c @@ -1096,7 +1096,7 @@ static long __get_user_pages(struct mm_s
/* first iteration or cross vma bound */ if (!vma || start >= vma->vm_end) { - vma = find_extend_vma(mm, start); + vma = vma_lookup(mm, start); if (!vma && in_gate_area(mm, start)) { ret = get_gate_page(mm, start & PAGE_MASK, gup_flags, &vma, @@ -1265,8 +1265,8 @@ int fixup_user_fault(struct mm_struct *m fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
retry: - vma = find_extend_vma(mm, address); - if (!vma || address < vma->vm_start) + vma = vma_lookup(mm, address); + if (!vma) return -EFAULT;
if (!vma_permits_fault(vma, fault_flags)) --- a/mm/memory.c +++ b/mm/memory.c @@ -5368,7 +5368,7 @@ struct vm_area_struct *lock_mm_and_find_ goto fail; }
- if (expand_stack_locked(vma, addr, true)) + if (expand_stack_locked(vma, addr)) goto fail;
success: @@ -5713,6 +5713,14 @@ int __access_remote_vm(struct mm_struct if (mmap_read_lock_killable(mm)) return 0;
+ /* We might need to expand the stack to access it */ + vma = vma_lookup(mm, addr); + if (!vma) { + vma = expand_stack(mm, addr); + if (!vma) + return 0; + } + /* ignore errors, just check how much was successfully transferred */ while (len) { int bytes, ret, offset; --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1935,8 +1935,7 @@ static int acct_stack_growth(struct vm_a * PA-RISC uses this for its stack; IA64 for its Register Backing Store. * vma is the last one with address > vma->vm_end. Have to extend vma. */ -int expand_upwards(struct vm_area_struct *vma, unsigned long address, - bool write_locked) +static int expand_upwards(struct vm_area_struct *vma, unsigned long address) { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *next; @@ -1960,8 +1959,6 @@ int expand_upwards(struct vm_area_struct if (gap_addr < address || gap_addr > TASK_SIZE) gap_addr = TASK_SIZE;
- if (!write_locked) - return -EAGAIN; next = find_vma_intersection(mm, vma->vm_end, gap_addr); if (next && vma_is_accessible(next)) { if (!(next->vm_flags & VM_GROWSUP)) @@ -2030,15 +2027,18 @@ int expand_upwards(struct vm_area_struct
/* * vma is the first one with address < vma->vm_start. Have to extend vma. + * mmap_lock held for writing. */ -int expand_downwards(struct vm_area_struct *vma, unsigned long address, - bool write_locked) +int expand_downwards(struct vm_area_struct *vma, unsigned long address) { struct mm_struct *mm = vma->vm_mm; MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_start); struct vm_area_struct *prev; int error = 0;
+ if (!(vma->vm_flags & VM_GROWSDOWN)) + return -EFAULT; + address &= PAGE_MASK; if (address < mmap_min_addr || address < FIRST_USER_ADDRESS) return -EPERM; @@ -2051,8 +2051,6 @@ int expand_downwards(struct vm_area_stru vma_is_accessible(prev) && (address - prev->vm_end < stack_guard_gap)) return -ENOMEM; - if (!write_locked && (prev->vm_end == address)) - return -EAGAIN; }
if (mas_preallocate(&mas, GFP_KERNEL)) @@ -2131,14 +2129,12 @@ static int __init cmdline_parse_stack_gu __setup("stack_guard_gap=", cmdline_parse_stack_guard_gap);
#ifdef CONFIG_STACK_GROWSUP -int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, - bool write_locked) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address) { - return expand_upwards(vma, address, write_locked); + return expand_upwards(vma, address); }
-struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, - unsigned long addr, bool write_locked) +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma, *prev;
@@ -2148,23 +2144,21 @@ struct vm_area_struct *find_extend_vma_l return vma; if (!prev) return NULL; - if (expand_stack_locked(prev, addr, write_locked)) + if (expand_stack_locked(prev, addr)) return NULL; if (prev->vm_flags & VM_LOCKED) populate_vma_page_range(prev, addr, prev->vm_end, NULL); return prev; } #else -int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, - bool write_locked) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long address) { if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) return -EINVAL; - return expand_downwards(vma, address, write_locked); + return expand_downwards(vma, address); }
-struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, - unsigned long addr, bool write_locked) +struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma; unsigned long start; @@ -2176,7 +2170,7 @@ struct vm_area_struct *find_extend_vma_l if (vma->vm_start <= addr) return vma; start = vma->vm_start; - if (expand_stack_locked(vma, addr, write_locked)) + if (expand_stack_locked(vma, addr)) return NULL; if (vma->vm_flags & VM_LOCKED) populate_vma_page_range(vma, addr, start, NULL); @@ -2184,12 +2178,91 @@ struct vm_area_struct *find_extend_vma_l } #endif
-struct vm_area_struct *find_extend_vma(struct mm_struct *mm, - unsigned long addr) +/* + * IA64 has some horrid mapping rules: it can expand both up and down, + * but with various special rules. + * + * We'll get rid of this architecture eventually, so the ugliness is + * temporary. + */ +#ifdef CONFIG_IA64 +static inline bool vma_expand_ok(struct vm_area_struct *vma, unsigned long addr) +{ + return REGION_NUMBER(addr) == REGION_NUMBER(vma->vm_start) && + REGION_OFFSET(addr) < RGN_MAP_LIMIT; +} + +/* + * IA64 stacks grow down, but there's a special register backing store + * that can grow up. Only sequentially, though, so the new address must + * match vm_end. + */ +static inline int vma_expand_up(struct vm_area_struct *vma, unsigned long addr) +{ + if (!vma_expand_ok(vma, addr)) + return -EFAULT; + if (vma->vm_end != (addr & PAGE_MASK)) + return -EFAULT; + return expand_upwards(vma, addr); +} + +static inline bool vma_expand_down(struct vm_area_struct *vma, unsigned long addr) +{ + if (!vma_expand_ok(vma, addr)) + return -EFAULT; + return expand_downwards(vma, addr); +} + +#elif defined(CONFIG_STACK_GROWSUP) + +#define vma_expand_up(vma,addr) expand_upwards(vma, addr) +#define vma_expand_down(vma, addr) (-EFAULT) + +#else + +#define vma_expand_up(vma,addr) (-EFAULT) +#define vma_expand_down(vma, addr) expand_downwards(vma, addr) + +#endif + +/* + * expand_stack(): legacy interface for page faulting. Don't use unless + * you have to. + * + * This is called with the mm locked for reading, drops the lock, takes + * the lock for writing, tries to look up a vma again, expands it if + * necessary, and downgrades the lock to reading again. + * + * If no vma is found or it can't be expanded, it returns NULL and has + * dropped the lock. 
+ */ +struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr) { - return find_extend_vma_locked(mm, addr, false); + struct vm_area_struct *vma, *prev; + + mmap_read_unlock(mm); + if (mmap_write_lock_killable(mm)) + return NULL; + + vma = find_vma_prev(mm, addr, &prev); + if (vma && vma->vm_start <= addr) + goto success; + + if (prev && !vma_expand_up(prev, addr)) { + vma = prev; + goto success; + } + + if (vma && !vma_expand_down(vma, addr)) + goto success; + + mmap_write_unlock(mm); + return NULL; + +success: + mmap_write_downgrade(mm); + return vma; } -EXPORT_SYMBOL_GPL(find_extend_vma);
/* * Ok - we have the memory areas we should free on a maple tree so release them, --- a/mm/nommu.c +++ b/mm/nommu.c @@ -631,24 +631,20 @@ struct vm_area_struct *find_vma(struct m EXPORT_SYMBOL(find_vma);
/* - * find a VMA - * - we don't extend stack VMAs under NOMMU conditions - */ -struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr) -{ - return find_vma(mm, addr); -} - -/* * expand a stack to a given address * - not supported under NOMMU conditions */ -int expand_stack_locked(struct vm_area_struct *vma, unsigned long address, - bool write_locked) +int expand_stack_locked(struct vm_area_struct *vma, unsigned long addr) { return -ENOMEM; }
+struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr) +{ + mmap_read_unlock(mm); + return NULL; +} + /* * look up the first VMA exactly that exactly matches addr * - should be called with mm->mmap_lock at least held readlocked
From: Jason Gerecke jason.gerecke@wacom.com
commit 9a6c0e28e215535b2938c61ded54603b4e5814c5 upstream.
Code which interacts with timestamps needs to use the ktime_t type returned by functions like ktime_get. The int type does not offer enough space to store these values, and attempting to use it is a recipe for problems. In this particular case, overflows would occur when calculating/storing timestamps leading to incorrect values being reported to userspace. In some cases these bad timestamps cause input handling in userspace to appear hung.
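A user-space illustration of the size problem, assuming a 32-bit int and a monotonic clock counting nanoseconds in a signed 64-bit value. toy_ktime_t and the helper names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t toy_ktime_t;	/* stand-in: signed 64-bit nanoseconds */

#define TOY_NSEC_PER_SEC 1000000000LL

/* Can this nanosecond value be stored in a plain int without loss? */
static bool toy_fits_in_int(toy_ktime_t ns)
{
	return ns >= INT_MIN && ns <= INT_MAX;
}

/*
 * Timestamp of frame 'i' out of 'valid_frames', counted back from the
 * time the packet arrived, computed entirely in 64 bits.
 */
static toy_ktime_t toy_event_timestamp(toy_ktime_t received, int valid_frames,
				       int i, toy_ktime_t interval)
{
	int frames_reversed;

	assert(i >= 0 && i < valid_frames);
	frames_reversed = valid_frames - i - 1;
	return received - frames_reversed * interval;
}
```

The default 15 ms interval fits in an int, but with a 32-bit int any absolute timestamp beyond roughly 2.147 seconds of clock time does not, which is why every variable holding such a value needs the wider type.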
Link: https://gitlab.freedesktop.org/libinput/libinput/-/issues/901
Fixes: 17d793f3ed53 ("HID: wacom: insert timestamp to packed Bluetooth (BT) events")
CC: stable@vger.kernel.org
Signed-off-by: Jason Gerecke jason.gerecke@wacom.com
Reviewed-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Link: https://lore.kernel.org/r/20230608213828.2108-1-jason.gerecke@wacom.com
Signed-off-by: Benjamin Tissoires bentiss@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/wacom_wac.c | 6 +++---
 drivers/hid/wacom_wac.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/hid/wacom_wac.c
+++ b/drivers/hid/wacom_wac.c
@@ -1314,7 +1314,7 @@ static void wacom_intuos_pro2_bt_pen(str
 	struct input_dev *pen_input = wacom->pen_input;
 	unsigned char *data = wacom->data;
 	int number_of_valid_frames = 0;
-	int time_interval = 15000000;
+	ktime_t time_interval = 15000000;
 	ktime_t time_packet_received = ktime_get();
 	int i;
 
@@ -1348,7 +1348,7 @@ static void wacom_intuos_pro2_bt_pen(str
 	if (number_of_valid_frames) {
 		if (wacom->hid_data.time_delayed)
 			time_interval = ktime_get() - wacom->hid_data.time_delayed;
-		time_interval /= number_of_valid_frames;
+		time_interval = div_u64(time_interval, number_of_valid_frames);
 		wacom->hid_data.time_delayed = time_packet_received;
 	}
 
@@ -1359,7 +1359,7 @@ static void wacom_intuos_pro2_bt_pen(str
 		bool range = frame[0] & 0x20;
 		bool invert = frame[0] & 0x10;
 		int frames_number_reversed = number_of_valid_frames - i - 1;
-		int event_timestamp = time_packet_received - frames_number_reversed * time_interval;
+		ktime_t event_timestamp = time_packet_received - frames_number_reversed * time_interval;
 
 		if (!valid)
 			continue;
--- a/drivers/hid/wacom_wac.h
+++ b/drivers/hid/wacom_wac.h
@@ -324,7 +324,7 @@ struct hid_data {
 	int ps_connected;
 	bool pad_input_event_flag;
 	unsigned short sequence_number;
-	int time_delayed;
+	ktime_t time_delayed;
 };
 
 struct wacom_remote_data {
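To illustrate the overflow described in the wacom commit message: a nanosecond count outgrows a 32-bit int after roughly 2.1 seconds, so a timestamp stored in an int is quickly wrong. The following is only a userspace sketch, not the driver code. The names `ktime_ns_t`, `fits_in_int32()` and `frame_interval()` are hypothetical, and `int64_t` merely stands in for the kernel's signed 64-bit `ktime_t`.

```c
#include <stdint.h>

/* Userspace stand-in for ktime_t: a signed 64-bit count of nanoseconds. */
typedef int64_t ktime_ns_t;

/* Returns 1 if the nanosecond value can be stored in a 32-bit int, else 0. */
int fits_in_int32(ktime_ns_t ns)
{
	return ns >= INT32_MIN && ns <= INT32_MAX;
}

/*
 * Per-frame interval computed entirely in 64 bits, in the spirit of the
 * patched code: (now - last) divided by the number of valid frames.
 * The caller must pass frames > 0.
 */
ktime_ns_t frame_interval(ktime_ns_t now, ktime_ns_t last, int frames)
{
	return (now - last) / frames;
}
```

The default interval in the patch (15000000 ns) fits in an int, but a gap of three seconds between packets (3000000000 ns) does not, which is the kind of value that would have been truncated.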
From: Linus Torvalds torvalds@linux-foundation.org
commit a425ac5365f6cb3cc47bf83e6bff0213c10445f7 upstream.
It feels very unlikely that anybody would want to do a GUP in an unmapped area under the stack pointer, but real users sometimes do some really strange things. So add a (temporary) warning for the case where a GUP fails and expanding the stack might have made it work.
It's trivial to do the expansion in the caller as part of getting the mm lock in the first place - see __access_remote_vm() for ptrace, for example - it's just that it's unnecessarily painful to do it deep in the guts of the GUP lookup when we might have to drop and re-take the lock.
I doubt anybody actually does anything quite this strange, but let's be proactive: adding these warnings is simple, and will make debugging it much easier if they trigger.
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/gup.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1096,7 +1096,11 @@ static long __get_user_pages(struct mm_s
 
 		/* first iteration or cross vma bound */
 		if (!vma || start >= vma->vm_end) {
-			vma = vma_lookup(mm, start);
+			vma = find_vma(mm, start);
+			if (vma && (start < vma->vm_start)) {
+				WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);
+				vma = NULL;
+			}
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
@@ -1265,9 +1269,13 @@ int fixup_user_fault(struct mm_struct *m
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 retry:
-	vma = vma_lookup(mm, address);
+	vma = find_vma(mm, address);
 	if (!vma)
 		return -EFAULT;
+	if (address < vma->vm_start ) {
+		WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);
+		return -EFAULT;
+	}
 
 	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
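The GUP patch above swaps vma_lookup() for find_vma() plus an explicit range check, so that a miss just below a grows-down region can be flagged. A toy model of that distinction, assuming a sorted array of regions instead of the real maple tree, might look like the sketch below. All `toy_` names are hypothetical and nothing here is kernel API.

```c
#include <stddef.h>

/* Toy region: [start, end), plus a flag standing in for VM_GROWSDOWN. */
struct toy_vma {
	unsigned long start;
	unsigned long end;
	int grows_down;
};

/* Like find_vma(): the first region (sorted by address) that ends above addr. */
const struct toy_vma *toy_find_vma(const struct toy_vma *v, size_t n,
				   unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].end)
			return &v[i];
	return NULL;
}

/* Like vma_lookup(): only a region that actually contains addr. */
const struct toy_vma *toy_vma_lookup(const struct toy_vma *v, size_t n,
				     unsigned long addr)
{
	const struct toy_vma *vma = toy_find_vma(v, n, addr);

	return (vma && addr >= vma->start) ? vma : NULL;
}

/* The case the patch warns about: addr falls in the gap below a grows-down region. */
int toy_miss_below_stack(const struct toy_vma *v, size_t n, unsigned long addr)
{
	const struct toy_vma *vma = toy_find_vma(v, n, addr);

	return vma && addr < vma->start && vma->grows_down;
}
```

With regions {0x1000-0x2000} and a grows-down {0x8000-0x9000}, address 0x7000 is a lookup miss in both versions, but only the find_vma()-style code can tell that the next region above is a stack that an expansion might have reached.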
From: Hugh Dickins hughd@google.com
commit e8c716bc6812202ccf4ce0f0bad3428b794fb39c upstream.
There is no xas_pause(&xas) in collapse_file()'s main loop, at the points where it does xas_unlock_irq(&xas) and then continues.
That would explain why, once two weeks ago and twice yesterday, I have hit the VM_BUG_ON_PAGE(page != xas_load(&xas), page) since "mm/khugepaged: fix iteration in collapse_file" removed the xas_set(&xas, index) just before it: xas.xa_node could be left pointing to a stale node, if there was concurrent activity on the file which transformed its xarray.
I tried inserting xas_pause()s, but then even bootup crashed on that VM_BUG_ON_PAGE(): there appears to be a subtle "nextness" implicit in xas_pause().
xas_next() and xas_pause() are good for use in simple loops, but not in this one: xas_set() worked well until now, so use xas_set(&xas, index) explicitly at the head of the loop; and change that VM_BUG_ON_PAGE() not to need its own xas_set(), and not to interfere with the xa_state (which would probably stop the crashes from xas_pause(), but I trust that less).
The user-visible effects of this bug (if VM_BUG_ONs are configured out) would be data loss and data leak - potentially - though in practice I expect it is more likely that a subsequent check (e.g. on mapping or on nr_none) would notice an inconsistency, and just abandon the collapse.
Link: https://lore.kernel.org/linux-mm/f18e4b64-3f88-a8ab-56cc-d1f5f9c58d4@google....
Fixes: c8a8f3b4a95a ("mm/khugepaged: fix iteration in collapse_file")
Signed-off-by: Hugh Dickins hughd@google.com
Cc: stable@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: Matthew Wilcox willy@infradead.org
Cc: David Stevens stevensd@chromium.org
Cc: Peter Xu peterx@redhat.com
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/khugepaged.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1918,9 +1918,9 @@ static int collapse_file(struct mm_struc
 		}
 	} while (1);
 
-	xas_set(&xas, start);
 	for (index = start; index < end; index++) {
-		page = xas_next(&xas);
+		xas_set(&xas, index);
+		page = xas_load(&xas);
 
 		VM_BUG_ON(index != xas.xa_index);
 		if (is_shmem) {
@@ -1935,7 +1935,6 @@ static int collapse_file(struct mm_struc
 						result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
-					xas_set(&xas, index + 1);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
 					result = SCAN_FAIL;
@@ -2071,7 +2070,7 @@ static int collapse_file(struct mm_struc
 
 	xas_lock_irq(&xas);
 
-	VM_BUG_ON_PAGE(page != xas_load(&xas), page);
+	VM_BUG_ON_PAGE(page != xa_load(xas.xa, index), page);
 
 	/*
 	 * We control three references to the page:
From: Zhang Shurong zhang_shurong@foxmail.com
commit c2d22806aecb24e2de55c30a06e5d6eb297d161d upstream.
There is a potential OOB read at fast_imageblit, for "colortab[(*src >> 4)]" can become a negative value due to "const char *s = image->data, *src". This change makes sure the index for colortab is always positive or zero.
Similar commit: https://patchwork.kernel.org/patch/11746067
Potential bug report: https://groups.google.com/g/syzkaller-bugs/c/9ubBXKeKXf4/m/k-QXy4UgAAAJ
Signed-off-by: Zhang Shurong zhang_shurong@foxmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller deller@gmx.de
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/video/fbdev/core/sysimgblt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/video/fbdev/core/sysimgblt.c
+++ b/drivers/video/fbdev/core/sysimgblt.c
@@ -189,7 +189,7 @@ static void fast_imageblit(const struct
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
 	u32 bit_mask, eorx, shift;
-	const char *s = image->data, *src;
+	const u8 *s = image->data, *src;
 	u32 *dst;
 	const u32 *tab;
 	size_t tablen;
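The fbdev fix above hinges on how a byte behaves when it is shifted and used as an array index. As a small userspace sketch (the function names are hypothetical, and explicit `signed char` is used because the signedness of plain `char` depends on the platform and compiler flags), the same `>> 4` yields a negative index for a signed byte with the high bit set, and a value in 0..15 for an unsigned one:

```c
#include <stdint.h>

/*
 * Index derived from a signed byte. For bytes with the high bit set the
 * value is negative before the shift; right-shifting a negative value is
 * implementation-defined in C, and common compilers keep the sign.
 */
int nibble_index_signed(signed char byte)
{
	return byte >> 4;
}

/* Index derived from an unsigned byte: always in the range 0..15. */
int nibble_index_unsigned(uint8_t byte)
{
	return byte >> 4;
}
```

Using the unsigned variant as a table index cannot reach below the start of the table, which is the property the one-line type change restores.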
From: Ludvig Michaelsson ludvig.michaelsson@yubico.com
commit 944ee77dc6ec7b0afd8ec70ffc418b238c92f12b upstream.
The hidraw_open() function increments the hidraw device reference counter. The counter has no dedicated synchronization mechanism, resulting in a potential data race when concurrently opening a device.
The race is a regression introduced by commit 8590222e4b02 ("HID: hidraw: Replace hidraw device table mutex with a rwsem"). While minors_rwsem is intended to protect the hidraw_table itself, by instead acquiring the lock for writing, the reference counter is also protected. This is symmetrical to hidraw_release().
Link: https://github.com/systemd/systemd/issues/27947
Fixes: 8590222e4b02 ("HID: hidraw: Replace hidraw device table mutex with a rwsem")
Cc: stable@vger.kernel.org
Signed-off-by: Ludvig Michaelsson ludvig.michaelsson@yubico.com
Link: https://lore.kernel.org/r/20230621-hidraw-race-v1-1-a58e6ac69bab@yubico.com
Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/hidraw.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/drivers/hid/hidraw.c
+++ b/drivers/hid/hidraw.c
@@ -272,7 +272,12 @@ static int hidraw_open(struct inode *ino
 		goto out;
 	}
 
-	down_read(&minors_rwsem);
+	/*
+	 * Technically not writing to the hidraw_table but a write lock is
+	 * required to protect the device refcount. This is symmetrical to
+	 * hidraw_release().
+	 */
+	down_write(&minors_rwsem);
 	if (!hidraw_table[minor] || !hidraw_table[minor]->exist) {
 		err = -ENODEV;
 		goto out_unlock;
@@ -301,7 +306,7 @@ static int hidraw_open(struct inode *ino
 	spin_unlock_irqrestore(&hidraw_table[minor]->list_lock, flags);
 	file->private_data = list;
 out_unlock:
-	up_read(&minors_rwsem);
+	up_write(&minors_rwsem);
 out:
 	if (err < 0)
 		kfree(list);
From: Mike Hommey mh@glandium.org
commit 5fe251112646d8626818ea90f7af325bab243efa upstream.
commit 498ba2069035 ("HID: logitech-hidpp: Don't restart communication if not necessary") put restarting communication behind that flag, and this was apparently necessary on the T651, but the flag was not set for it.
Fixes: 498ba2069035 ("HID: logitech-hidpp: Don't restart communication if not necessary")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Hommey mh@glandium.org
Link: https://lore.kernel.org/r/20230617230957.6mx73th4blv7owqk@glandium.org
Signed-off-by: Benjamin Tissoires benjamin.tissoires@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/hid/hid-logitech-hidpp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hid/hid-logitech-hidpp.c
+++ b/drivers/hid/hid-logitech-hidpp.c
@@ -4553,7 +4553,7 @@ static const struct hid_device_id hidpp_
 	{ /* wireless touchpad T651 */
 	  HID_BLUETOOTH_DEVICE(USB_VENDOR_ID_LOGITECH,
 		USB_DEVICE_ID_LOGITECH_T651),
-	  .driver_data = HIDPP_QUIRK_CLASS_WTP },
+	  .driver_data = HIDPP_QUIRK_CLASS_WTP | HIDPP_QUIRK_DELAYED_INIT },
 	{ /* Mouse Logitech Anywhere MX */
 	  LDJ_DEVICE(0x1017), .driver_data = HIDPP_QUIRK_HI_RES_SCROLL_1P0 },
 	{ /* Mouse logitech M560 */
From: Ricardo Cañuelo ricardo.canuelo@collabora.com
commit 86edac7d3888c715fe3a81bd61f3617ecfe2e1dd upstream.
This reverts commit f05c7b7d9ea9477fcc388476c6f4ade8c66d2d26.
That change was causing a regression in the generic-adc-thermal-probed bootrr test as reported in the kernelci-results list [1]. A proper rework will take longer, so revert it for now.
[1] https://groups.io/g/kernelci-results/message/42660
Fixes: f05c7b7d9ea9 ("thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe")
Signed-off-by: Ricardo Cañuelo ricardo.canuelo@collabora.com
Suggested-by: AngeloGioacchino Del Regno angelogioacchino.delregno@collabora.com
Reviewed-by: AngeloGioacchino Del Regno angelogioacchino.delregno@collabora.com
Signed-off-by: Daniel Lezcano daniel.lezcano@linaro.org
Link: https://lore.kernel.org/r/20230525121811.3360268-1-ricardo.canuelo@collabora...
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/thermal/mediatek/auxadc_thermal.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

--- a/drivers/thermal/mediatek/auxadc_thermal.c
+++ b/drivers/thermal/mediatek/auxadc_thermal.c
@@ -1222,12 +1222,7 @@ static int mtk_thermal_probe(struct plat
 		return -ENODEV;
 	}
 
-	auxadc_base = devm_of_iomap(&pdev->dev, auxadc, 0, NULL);
-	if (IS_ERR(auxadc_base)) {
-		of_node_put(auxadc);
-		return PTR_ERR(auxadc_base);
-	}
-
+	auxadc_base = of_iomap(auxadc, 0);
 	auxadc_phys_base = of_get_phys_base(auxadc);
 
 	of_node_put(auxadc);
@@ -1243,12 +1238,7 @@ static int mtk_thermal_probe(struct plat
 		return -ENODEV;
 	}
 
-	apmixed_base = devm_of_iomap(&pdev->dev, apmixedsys, 0, NULL);
-	if (IS_ERR(apmixed_base)) {
-		of_node_put(apmixedsys);
-		return PTR_ERR(apmixed_base);
-	}
-
+	apmixed_base = of_iomap(apmixedsys, 0);
 	apmixed_phys_base = of_get_phys_base(apmixedsys);
 
 	of_node_put(apmixedsys);
On Fri, 30 Jun 2023 at 00:18, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 6.4.1 release. There are 28 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Results from Linaro’s test farm.
Following build regression noticed on Linux stable-rc 6.4 and also noticed on Linux mainline master.
Regressions found on Parisc and Sparc build failed:
 - build/gcc-11-defconfig

Reported-by: Linux Kernel Functional Testing lkft@linaro.org

Parisc Build log:
=============
arch/parisc/mm/fault.c: In function 'do_page_fault':
arch/parisc/mm/fault.c:292:22: error: 'prev' undeclared (first use in this function)
  292 |                 if (!prev || !(prev->vm_flags & VM_GROWSUP))
      |                      ^~~~
arch/parisc/mm/fault.c:292:22: note: each undeclared identifier is reported only once for each function it appears in

sparc Build log:
===========
<stdin>:1519:2: warning: #warning syscall clone3 not implemented [-Wcpp]
arch/sparc/mm/fault_32.c: In function 'force_user_fault':
arch/sparc/mm/fault_32.c:315:49: error: 'regs' undeclared (first use in this function)
  315 |         vma = lock_mm_and_find_vma(mm, address, regs);
      |                                                 ^~~~
arch/sparc/mm/fault_32.c:315:49: note: each undeclared identifier is reported only once for each function it appears in

Links:
 - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.4.y/build/v6.4-29...
 - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.4.y/build/v6.4-29...
 - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.4.y/build/v6.4-29...
 - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.4.y/build/v6.4-29...

Both build failures noticed on mainline and sparc build have been fixed yesterday.
 - https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.4-8542-g82...

Following patch that got fixed
---
From 0b26eadbf200abf6c97c6d870286c73219cdac65 Mon Sep 17 00:00:00 2001
From: Linus Torvalds torvalds@linux-foundation.org
Date: Thu, 29 Jun 2023 20:41:24 -0700
Subject: sparc32: fix lock_mm_and_find_vma() conversion
The sparc32 conversion to lock_mm_and_find_vma() in commit a050ba1e7422 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()") missed the fact that we didn't actually have a 'regs' pointer available in the 'force_user_fault()' case.
It's there in the regular page fault path ("do_sparc_fault()"), but not the window underflow/overflow paths.
Which is all fine - we can just pass in a NULL pointer. The register state is only used to avoid deadlock with kernel faults, which is not the case for any of these register window faults.
Reported-by: Stephen Rothwell sfr@canb.auug.org.au
Fixes: a050ba1e7422 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()")
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org

--
Linaro LKFT
https://lkft.linaro.org
On Fri, Jun 30, 2023 at 11:00:51AM +0530, Naresh Kamboju wrote:
Both build failures noticed on mainline and sparc build have been fixed yesterday.
Following patch that got fixed
From 0b26eadbf200abf6c97c6d870286c73219cdac65 Mon Sep 17 00:00:00 2001
From: Linus Torvalds torvalds@linux-foundation.org
Date: Thu, 29 Jun 2023 20:41:24 -0700
Subject: sparc32: fix lock_mm_and_find_vma() conversion
The sparc32 conversion to lock_mm_and_find_vma() in commit a050ba1e7422 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()") missed the fact that we didn't actually have a 'regs' pointer available in the 'force_user_fault()' case.
It's there in the regular page fault path ("do_sparc_fault()"), but not the window underflow/overflow paths.
Which is all fine - we can just pass in a NULL pointer. The register state is only used to avoid deadlock with kernel faults, which is not the case for any of these register window faults.
Reported-by: Stephen Rothwell sfr@canb.auug.org.au
Fixes: a050ba1e7422 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()")
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Thanks! That saves me having to dig. I'll go push out updates with this in them...
greg k-h
On Thu, 29 Jun 2023 at 22:31, Naresh Kamboju naresh.kamboju@linaro.org wrote:
arch/parisc/mm/fault.c: In function 'do_page_fault':
arch/parisc/mm/fault.c:292:22: error: 'prev' undeclared (first use in this function)
  292 |                 if (!prev || !(prev->vm_flags & VM_GROWSUP))
Bah. "prev" should be "prev_vma" here.
I've pushed out the fix. Greg, apologies. It's
ea3f8272876f parisc: fix expand_stack() conversion
and Naresh already pointed to the similarly silly sparc32 fix.
Linus
On Thu, Jun 29, 2023 at 11:16:21PM -0700, Linus Torvalds wrote:
Bah. "prev" should be "prev_vma" here.
I've pushed out the fix. Greg, apologies. It's
ea3f8272876f parisc: fix expand_stack() conversion
and Naresh already pointed to the similarly silly sparc32 fix.
Ah, I saw it hit your repo before your email here, sorry about that. Now picked up.
greg k-h
On 6/30/23 08:29, Greg Kroah-Hartman wrote:
Ah, I saw it hit your repo before your email here, sorry about that. Now picked up.
I've just cherry-picked ea3f8272876f on top of -rc2, built and run-tested it, and everything is OK on parisc.
Thanks! Helge
Hi Linus,
On 6/30/23 08:56, Helge Deller wrote:
I've just cherry-picked ea3f8272876f on top of -rc2, built and run-tested it, and everything is OK on parisc.
Actually, your changes seem to trigger...:

root@debian:~# /usr/bin/ls /usr/bin/*
-bash: /usr/bin/ls: Argument list too long

or with a long gcc argument list:
gcc: fatal error: cannot execute '/usr/lib/gcc/hppa-linux-gnu/12/cc1': execv: Argument list too long
I'm trying to understand what's missing, but maybe you have some idea?
Helge
On Sun, 2 Jul 2023 at 14:33, Helge Deller deller@gmx.de wrote:
Actually, your changes seem to trigger...:

root@debian:~# /usr/bin/ls /usr/bin/*
-bash: /usr/bin/ls: Argument list too long
So this only happens with _fairly_ long argument lists, right? Maybe your config has a 64kB page size, and normal programs never expand beyond a single page?
I bet it is because of f313c51d26aa ("execve: expand new process stack manually ahead of time"), but I don't see exactly why.
But pa-risc is the only architecture with CONFIG_STACK_GROWSUP, and while I really thought that commit should do the exact same thing as the old
#ifdef CONFIG_STACK_GROWSUP
special case, I must clearly have been wrong.
Would you mind just verifying that yes, that commit on mainline is broken for you, and the previous one works?
Linus
On Sun, 2 Jul 2023 at 15:45, Linus Torvalds torvalds@linux-foundation.org wrote:
Would you mind just verifying that yes, that commit on mainline is broken for you, and the previous one works?
Also, while I looked at it again, and still didn't understand why parisc would be different here, I *did* realize that because parisc has a stack that grows up, the debug warning I added for GUP won't trigger.
So if I got that execve() logic wrong for STACK_GROWSUP (which I clearly must have), then exactly because it's grows-up, a GUP failure wouldn't warn about not expanding the stack.
IOW, would you mind applying something like this on top of the current kernel, and let me know if it warns?
.. and here I thought ia64 would be the pain-point. Silly me.
Linus
On 7/2/23 16:30, Linus Torvalds wrote:
IOW, would you mind applying something like this on top of the current kernel, and let me know if it warns?
I can reproduce the problem in qemu. However, I do not see a warning after applying your patch.
Guenter
On Sun, 2 Jul 2023 at 20:23, Guenter Roeck linux@roeck-us.net wrote:
I can reproduce the problem in qemu. However, I do not see a warning after applying your patch.
Funky, funky.
I'm assuming it's the
		page = get_arg_page(bprm, pos, 1);
		if (!page) {
			ret = -E2BIG;
			goto out;
		}
in copy_strings() that causes this. Or possibly, the version in copy_string_kernel().
Does *this* get that "pr_warn()" printout (and a stack trace once, just for good measure)?
Linus
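The path Linus suspects here ties the shell error to an errno value: when no argument page can be obtained, the copy aborts with -E2BIG. A minimal userspace sketch of that shape follows; `toy_copy_one_arg()` is a hypothetical stand-in, not the kernel's copy_strings().

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the quoted check: if no page is available for
 * the argument data, give up with -E2BIG; otherwise report success.
 */
int toy_copy_one_arg(const void *page)
{
	if (!page)
		return -E2BIG;
	return 0;
}
```

E2BIG is the errno whose message text is "Argument list too long", which matches what the shell and gcc printed earlier in the thread.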
On 7/2/23 21:22, Linus Torvalds wrote:
Does *this* get that "pr_warn()" printout (and a stack trace once, just for good measure)?
Sorry, you lost me. Isn't that the same patch as before ? Or is it just time for me to go to bed ?
Guenter
On Sun, 2 Jul 2023 at 21:46, Guenter Roeck linux@roeck-us.net wrote:
Sorry, you lost me. Isn't that the same patch as before ? Or is it just time for me to go to bed ?
No, I think it's time for *me* to go to bed.
Let's get the right diff this time.
Linus
On 7/2/23 21:49, Linus Torvalds wrote:
Let's get the right diff this time.
Here you are:
[ 31.188688] stack expand failed: ffeff000-fff00000 (ffefeff2)
[ 31.189131] ------------[ cut here ]------------
[ 31.189259] WARNING: CPU: 0 PID: 472 at fs/exec.c:217 get_arg_page+0x1e8/0x1f4
[ 31.189827] Modules linked in:
[ 31.190083] CPU: 0 PID: 472 Comm: sh Tainted: G N 6.4.0-32bit+ #1
[ 31.190213] Hardware name: 9000/778/B160L
[ 31.190347]
[ 31.190407] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[ 31.190496] PSW: 00000000000001001011111100001111 Tainted: G N
[ 31.190625] r00-03 0004bf0f 11026240 1034a3ec 12bb41c0
[ 31.190741] r04-07 127ec400 00000001 12b725a4 12b72530
[ 31.190821] r08-11 129e6708 ffefeff2 2ff9d000 ffeff000
[ 31.190895] r12-15 127ec400 10e463f0 10e34348 12a4d1a0
[ 31.190962] r16-19 00000002 00001000 ffefe000 12a4d1a0
[ 31.191033] r20-23 0000000f 00001a46 013ae000 12bb4498
[ 31.191103] r24-27 11542330 00000000 115430a0 10ed98d8
[ 31.191173] r28-31 00000031 00000310 12bb4240 0000000f
[ 31.191251] sr00-03 00000000 00000000 00000000 000000a0
[ 31.191332] sr04-07 00000000 00000000 00000000 00000000
[ 31.191407]
[ 31.191443] IASQ: 00000000 00000000 IAOQ: 1034a3ec 1034a3f0
[ 31.191522] IIR: 03ffe01f ISR: 00000000 IOR: 1065d424
[ 31.191593] CPU: 0 CR30: 12a4d1a0 CR31: 00000000
[ 31.192675] ORIG_R28: 12a4d1a0
[ 31.192770] IAOQ[0]: get_arg_page+0x1e8/0x1f4
[ 31.192851] IAOQ[1]: get_arg_page+0x1ec/0x1f4
[ 31.192922] RP(r2): get_arg_page+0x1e8/0x1f4
[ 31.193007] Backtrace:
[ 31.193085] [<1034a9cc>] copy_strings+0x148/0x3d8
[ 31.193214] [<1034ad94>] do_execveat_common+0x138/0x21c
[ 31.193302] [<1034bcc4>] sys_execve+0x3c/0x54
[ 31.193400] [<101af1b4>] syscall_exit+0x0/0x10
[ 31.193562]
[ 31.193698] ---[ end trace 0000000000000000 ]---
[ 31.200551] stack expand failed: ffeff000-fff00000 (ffefefee)
/bin/sh: ls: Argument list too long
Guenter
On Sun, 2 Jul 2023 at 22:33, Guenter Roeck linux@roeck-us.net wrote:
Here you are:
[ 31.188688] stack expand failed: ffeff000-fff00000 (ffefeff2)
Ahhah!
I think the problem is actually ridiculously simple.
The thing is, the parisc stack expands upwards. That's obvious. I've mentioned it several times in just this thread as being the thing that makes parisc special.
But it's *so* obvious that I didn't even think about what it really implies.
And part of all the changes was this part in expand_downwards():
	if (!(vma->vm_flags & VM_GROWSDOWN))
		return -EFAULT;
and that will *always* fail on parisc, because - as said multiple times - the parisc stack expands upwards. It doesn't have VM_GROWSDOWN set.
What a dum-dum I am.
And I did it that way because the *normal* stack expansion obviously wants it that way and putting the check there not only made sense, but simplified other code.
But fs/exec.c is special - and only special for parisc - in that it really wants to expand a normally upwards-growing stack downwards unconditionally.
Anyway, I think that new check in expand_downwards() is the right thing to do, and the real fix here is to simply make vm_flags reflect reality.
Because during execve, that stack that will _eventually_ grow upwards, does in fact grow downwards. Let's make it reflect that.
We already do magical extra setup for the stack flags during setup (VM_STACK_INCOMPLETE_SETUP), so extending that logic to contain VM_GROWSDOWN seems sane and the right thing to do.
IOW, I think a patch like the attached will fix the problem for real.
It needs a good commit log and maybe a code comment or two, but before I bother to do that, let's verify that yes, it does actually fix things.
In the meantime, I will actually go to bed, but I'm pretty sure this is it.
Linus
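The failure mode Linus describes can be modelled in a few lines: a check that insists on a grows-down flag rejects the temporary exec stack on a grows-up architecture unless that flag is added for the duration of setup. This is only a sketch under assumptions. The `TOY_` flag values and both function names are hypothetical, and the real patch is the attachment referred to above, which is not reproduced in this thread.

```c
#include <errno.h>

/* Toy flag bits; the values are arbitrary and not the kernel's. */
#define TOY_VM_GROWSDOWN 0x1u
#define TOY_VM_GROWSUP   0x2u

/* Mirrors the quoted check: refuse to grow downwards without the flag. */
int toy_expand_downwards(unsigned int vm_flags)
{
	if (!(vm_flags & TOY_VM_GROWSDOWN))
		return -EFAULT;
	return 0;
}

/*
 * Flags of the temporary stack during exec: the architecture's normal
 * growth direction plus an optional "early" bit used only during setup.
 */
unsigned int toy_exec_stack_flags(unsigned int arch_flag, unsigned int early_flag)
{
	return arch_flag | early_flag;
}
```

In this model a grows-up stack with no early bit fails with -EFAULT, the same stack with the early bit set to the grows-down flag succeeds, and a grows-down stack needs no early bit at all.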
Hi Linus,
On 7/3/23 08:20, Linus Torvalds wrote:
IOW, I think a patch like the attached will fix the problem for real.
It needs a good commit log and maybe a code comment or two, but before I bother to do that, let's verify that yes, it does actually fix things.
In the meantime, I will actually go to bed, but I'm pretty sure this is it.
Great, that patch fixes it!
I wonder if you want to #define VM_STACK_EARLY VM_GROWSDOWN even for the case where the stack grows down too (instead of 0), just to make clear that in both cases the stack goes downwards initially.
Helge
On Mon, 3 Jul 2023 at 00:08, Helge Deller deller@gmx.de wrote:
Great, that patch fixes it!
Yeah, I was pretty sure this was it, but it's good to have it confirmed. Committed.
I wonder if you want to #define VM_STACK_EARLY VM_GROWSDOWN even for the case where the stack grows down too (instead of 0), just to make clear that in both cases the stack goes downwards initially.
No, that wouldn't work for the simple reason that the special bits in VM_STACK_INCOMPLETE_SETUP are always cleared after the stack setup is done.
So if we added VM_GROWSDOWN to those early bits in general, the bit would then be cleared even when that wasn't the intent.
Yes, yes, we could change the VM_STACK_INCOMPLETE_SETUP logic to only clear some of the bits in the end, but the end result would be practically the same: we'd still have to do different things for grows-up vs grows-down cases, so the difference might as well be here in the VM_STACK_EARLY bit.
Linus
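The flag lifecycle discussed above can be modelled in user space. What follows is only a sketch under stated assumptions, not kernel code: the bit values, `struct vma_model`, `VM_SETUP_TMP` and every `*_model` or lowercase helper are hypothetical stand-ins invented for illustration. Only the names VM_GROWSDOWN, VM_GROWSUP, VM_STACK_EARLY and VM_STACK_INCOMPLETE_SETUP, and the shape of the expand_downwards() check, come from the thread. It shows why a grows-up stack fails the new check without the early bit, and why putting VM_GROWSDOWN into the always-cleared bits on a grows-down architecture would strip the flag at the end of setup.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative bit values only; the kernel's real constants differ. */
#define VM_GROWSDOWN  0x1u
#define VM_GROWSUP    0x2u
#define VM_SETUP_TMP  0x4u	/* hypothetical stand-in for the other temporary setup bits */

/* Hypothetical model of a VMA: only the flags matter here. */
struct vma_model {
	unsigned int vm_flags;
};

/*
 * The check quoted in the thread: refuse to grow a VMA downwards
 * unless it is marked VM_GROWSDOWN.
 */
int expand_downwards_model(const struct vma_model *vma)
{
	if (!(vma->vm_flags & VM_GROWSDOWN))
		return -EFAULT;
	return 0;
}

/*
 * Extra bit set only while execve is still building the stack.
 * grows_up != 0 models a parisc-like architecture.
 */
unsigned int vm_stack_early(int grows_up)
{
	return grows_up ? VM_GROWSDOWN : 0;
}

/* Bits that are set during setup and always cleared afterwards. */
unsigned int vm_stack_incomplete_setup(int grows_up)
{
	return VM_SETUP_TMP | vm_stack_early(grows_up);
}

/* Permanent growth direction of the finished stack. */
unsigned int vm_stack_direction(int grows_up)
{
	return grows_up ? VM_GROWSUP : VM_GROWSDOWN;
}

/* End of stack setup: drop every temporary bit in one go. */
unsigned int finish_stack_setup(unsigned int flags, unsigned int incomplete)
{
	return flags & ~incomplete;
}

void stack_flags_walkthrough(void)
{
	struct vma_model vma;

	/* Grows-up stack without the early bit: the new check rejects it. */
	vma.vm_flags = VM_GROWSUP | VM_SETUP_TMP;
	assert(expand_downwards_model(&vma) == -EFAULT);

	/* Same stack with the early bit folded into the setup bits. */
	vma.vm_flags = vm_stack_direction(1) | vm_stack_incomplete_setup(1);
	assert(expand_downwards_model(&vma) == 0);
	vma.vm_flags = finish_stack_setup(vma.vm_flags,
					  vm_stack_incomplete_setup(1));
	assert(vma.vm_flags == VM_GROWSUP);

	/*
	 * The suggested variant on a grows-down stack: VM_GROWSDOWN would
	 * be among the cleared bits, so the finished stack loses it.
	 */
	vma.vm_flags = VM_GROWSDOWN | VM_SETUP_TMP;
	vma.vm_flags = finish_stack_setup(vma.vm_flags,
					  VM_SETUP_TMP | VM_GROWSDOWN);
	assert(vma.vm_flags == 0);
}
```

Calling stack_flags_walkthrough() from a small main() exercises all three cases. The last one is the reason the early bit stays 0 on grows-down architectures in this model: the permanent direction bit must not be part of the set that gets cleared.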
On 7/3/23 09:49, Linus Torvalds wrote:
On Mon, 3 Jul 2023 at 00:08, Helge Deller deller@gmx.de wrote:
Great, that patch fixes it!
Yeah, I was pretty sure this was it, but it's good to have it confirmed. Committed.
FWIW, my qemu boot tests didn't find any problems with other architectures.
Guenter
On Mon, 3 Jul 2023 at 10:19, Guenter Roeck linux@roeck-us.net wrote:
FWIW, my qemu boot tests didn't find any problems with other architectures.
Thanks. This whole "let's get the stack expansion locking right" wasn't exactly buttery smooth, but given all our crazy architectures it was not entirely unexpected.
Let's hope it really is all done now,
Linus
On 7/3/23 18:49, Linus Torvalds wrote:
On Mon, 3 Jul 2023 at 00:08, Helge Deller deller@gmx.de wrote:
Great, that patch fixes it!
Yeah, I was pretty sure this was it, but it's good to have it confirmed. Committed.
Thank you!
Nice to see that Greg picked up the patch for stable that fast as well!
I wonder if you want to #define VM_STACK_EARLY VM_GROWSDOWN even for the case where the stack grows down too (instead of 0), just to make clear that in both cases the stack goes downwards initially.
No, that wouldn't work for the simple reason that the special bits in VM_STACK_INCOMPLETE_SETUP are always cleared after the stack setup is done.
So if we added VM_GROWSDOWN to those early bits in general, the bit would then be cleared even when that wasn't the intent.
Yes, yes, we could change the VM_STACK_INCOMPLETE_SETUP logic to only clear some of the bits in the end, but the end result would be practically the same: we'd still have to do different things for grows-up vs grows-down cases, so the difference might as well be here in the VM_STACK_EARLY bit.
Ok, thanks for explaining!
Helge
Helge Deller deller@gmx.de writes:
On 7/3/23 18:49, Linus Torvalds wrote:
On Mon, 3 Jul 2023 at 00:08, Helge Deller deller@gmx.de wrote:
Great, that patch fixes it!
Yeah, I was pretty sure this was it, but it's good to have it confirmed. Committed.
Thank you!
Nice to see that Greg picked up the patch for stable that fast as well!
Sorry, where? I was just about to check if it was marked for backporting but I can't see it in Greg's trees yet.
We need it for 6.1, 6.3, 6.4.
(Apologies if I'm missing it somewhere obvious.)
Sam James sam@gentoo.org writes:
Helge Deller deller@gmx.de writes:
On 7/3/23 18:49, Linus Torvalds wrote:
On Mon, 3 Jul 2023 at 00:08, Helge Deller deller@gmx.de wrote:
Great, that patch fixes it!
Yeah, I was pretty sure this was it, but it's good to have it confirmed. Committed.
Thank you!
Nice to see that Greg picked up the patch for stable that fast as well!
Sorry, where? I was just about to check if it was marked for backporting but I can't see it in Greg's trees yet.
We need it for 6.1, 6.3, 6.4.
(Apologies if I'm missing it somewhere obvious.)
.. and I was. I see it now, sorry!
On 7/2/23 23:20, Linus Torvalds wrote:
On Sun, 2 Jul 2023 at 22:33, Guenter Roeck linux@roeck-us.net wrote:
Here you are:
[ 31.188688] stack expand failed: ffeff000-fff00000 (ffefeff2)
Ahhah!
I think the problem is actually ridiculously simple.
The thing is, the parisc stack expands upwards. That's obvious. I've mentioned it several times in just this thread as being the thing that makes parisc special.
But it's *so* obvious that I didn't even think about what it really implies.
And part of all the changes was this part in expand_downwards():
	if (!(vma->vm_flags & VM_GROWSDOWN))
		return -EFAULT;
and that will *always* fail on parisc, because - as said multiple times - the parisc stack expands upwards. It doesn't have VM_GROWSDOWN set.
What a dum-dum I am.
And I did it that way because the *normal* stack expansion obviously wants it that way and putting the check there not only made sense, but simplified other code.
But fs/execve.c is special - and only special for parisc - in that it really wants to expand a normally upwards-growing stack downwards unconditionally.
Anyway, I think that new check in expand_downwards() is the right thing to do, and the real fix here is to simply make vm_flags reflect reality.
Because during execve, that stack that will _eventually_ grow upwards, does in fact grow downwards. Let's make it reflect that.
We already do magical extra setup for the stack flags during setup (VM_STACK_INCOMPLETE_SETUP), so extending that logic to contain VM_GROWSDOWN seems sane and the right thing to do.
IOW, I think a patch like the attached will fix the problem for real.
It needs a good commit log and maybe a code comment or two, but before I bother to do that, let's verify that yes, it does actually fix things.
Yes, it does. I'll run a complete qemu test with it applied to be sure there is no impact on other architectures (yes, I know, that should not be the case, but better safe than sorry). I'll even apply https://lore.kernel.org/all/20230609075528.9390-12-bhe@redhat.com/raw to be able to test sh4.
Guenter
On 7/3/23 05:59, Guenter Roeck wrote:
On 7/2/23 23:20, Linus Torvalds wrote:
On Sun, 2 Jul 2023 at 22:33, Guenter Roeck linux@roeck-us.net wrote:
Here you are:
[ 31.188688] stack expand failed: ffeff000-fff00000 (ffefeff2)
Ahhah!
I think the problem is actually ridiculously simple.
The thing is, the parisc stack expands upwards. That's obvious. I've mentioned it several times in just this thread as being the thing that makes parisc special.
But it's *so* obvious that I didn't even think about what it really implies.
And part of all the changes was this part in expand_downwards():
	if (!(vma->vm_flags & VM_GROWSDOWN))
		return -EFAULT;
and that will *always* fail on parisc, because - as said multiple times - the parisc stack expands upwards. It doesn't have VM_GROWSDOWN set.
What a dum-dum I am.
And I did it that way because the *normal* stack expansion obviously wants it that way and putting the check there not only made sense, but simplified other code.
But fs/execve.c is special - and only special for parisc - in that it really wants to expand a normally upwards-growing stack downwards unconditionally.
Anyway, I think that new check in expand_downwards() is the right thing to do, and the real fix here is to simply make vm_flags reflect reality.
Because during execve, that stack that will _eventually_ grow upwards, does in fact grow downwards. Let's make it reflect that.
We already do magical extra setup for the stack flags during setup (VM_STACK_INCOMPLETE_SETUP), so extending that logic to contain VM_GROWSDOWN seems sane and the right thing to do.
IOW, I think a patch like the attached will fix the problem for real.
It needs a good commit log and maybe a code comment or two, but before I bother to do that, let's verify that yes, it does actually fix things.
Yes, it does. I'll run a complete qemu test with it applied to be sure there is no impact on other architectures (yes, I know, that should not be the case, but better safe than sorry). I'll even apply https://lore.kernel.org/all/20230609075528.9390-12-bhe@redhat.com/raw to be able to test sh4.
Meh, should have figured. That fixes one problem with sh4 builds and creates another. Should have figured.
Guenter
On 6/29/23 23:16, Linus Torvalds wrote:
On Thu, 29 Jun 2023 at 22:31, Naresh Kamboju naresh.kamboju@linaro.org wrote:
arch/parisc/mm/fault.c: In function 'do_page_fault':
arch/parisc/mm/fault.c:292:22: error: 'prev' undeclared (first use in this function)
  292 |                 if (!prev || !(prev->vm_flags & VM_GROWSUP))
Bah. "prev" should be "prev_vma" here.
I've pushed out the fix. Greg, apologies. It's
ea3f8272876f parisc: fix expand_stack() conversion
and Naresh already pointed to the similarly silly sparc32 fix.
Linus
Did you see that one (in mainline) ?
Building csky:defconfig ... failed
--------------
Error log:
arch/csky/mm/fault.c: In function 'do_page_fault':
arch/csky/mm/fault.c:240:40: error: 'address' undeclared (first use in this function); did you mean 'addr'?
  240 |         vma = lock_mm_and_find_vma(mm, address, regs);
Guenter
On 6/29/23 23:29, Guenter Roeck wrote:
On 6/29/23 23:16, Linus Torvalds wrote:
On Thu, 29 Jun 2023 at 22:31, Naresh Kamboju naresh.kamboju@linaro.org wrote:
arch/parisc/mm/fault.c: In function 'do_page_fault':
arch/parisc/mm/fault.c:292:22: error: 'prev' undeclared (first use in this function)
  292 |                 if (!prev || !(prev->vm_flags & VM_GROWSUP))
Bah. "prev" should be "prev_vma" here.
I've pushed out the fix. Greg, apologies. It's
ea3f8272876f parisc: fix expand_stack() conversion
and Naresh already pointed to the similarly silly sparc32 fix.
Linus
Did you see that one (in mainline) ?
Building csky:defconfig ... failed
Error log:
arch/csky/mm/fault.c: In function 'do_page_fault':
arch/csky/mm/fault.c:240:40: error: 'address' undeclared (first use in this function); did you mean 'addr'?
  240 |         vma = lock_mm_and_find_vma(mm, address, regs);
This is also in {6.1,6.3,6.4}-rc unless I am missing something.
Guenter
On Thu, 29 Jun 2023 at 23:29, Guenter Roeck linux@roeck-us.net wrote:
Did you see that one (in mainline) ?
Building csky:defconfig ... failed
Nope. Thanks. Obvious fix: 'address' is called 'addr' here.
I knew we had all these tiny little mazes that looked the same but were just _subtly_ different, but I still ended up doing too much cut-and-paste.
And I only ended up cross-compiling the fairly small set that I had existing cross-build environments for. Which was less than half our ~24 different architectures.
Oh well. We'll get them all. Eventually. Let me go fix up that csky case.
Linus
On Thu, 29 Jun 2023 at 23:33, Linus Torvalds torvalds@linux-foundation.org wrote:
Oh well. We'll get them all. Eventually. Let me go fix up that csky case.
It's commit e55e5df193d2 ("csky: fix up lock_mm_and_find_vma() conversion").
Let's hope all the problems are these kinds of silly - but obvious - naming differences between different architectures.
Because as long as they cause build errors, they may be embarrassing, but easy to find and notice.
I may not have cared enough about some of these architectures, and it shows. sparc32. parisc. csky...
Linus
On 6/29/23 23:47, Linus Torvalds wrote:
On Thu, 29 Jun 2023 at 23:33, Linus Torvalds torvalds@linux-foundation.org wrote:
Oh well. We'll get them all. Eventually. Let me go fix up that csky case.
It's commit e55e5df193d2 ("csky: fix up lock_mm_and_find_vma() conversion").
Let's hope all the problems are these kinds of silly - but obvious - naming differences between different architectures.
Because as long as they cause build errors, they may be embarrassing, but easy to find and notice.
I may not have cared enough about some of these architectures, and it shows. sparc32. parisc. csky...
There is one more, unfortunately.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... failed
------------
Error log:
arch/xtensa/mm/fault.c: In function ‘do_page_fault’:
arch/xtensa/mm/fault.c:133:8: error: implicit declaration of function ‘lock_mm_and_find_vma’
This affects all stable release candidates as well as mainline. mmu builds are fine, and indeed lock_mm_and_find_vma() is only declared for CONFIG_MMU. I don't know if this needs a dummy or some other fix for the nommu case.
Guenter
On Fri, 30 Jun 2023 at 15:51, Guenter Roeck linux@roeck-us.net wrote:
There is one more, unfortunately.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... failed
Heh. I didn't even realize that anybody would ever do lock_mm_and_find_vma() code on a nommu platform.
With nommu, handle_mm_fault() will just BUG(), so it's kind of pointless to do any of this at all, and I didn't expect anybody to have this page faulting path that just causes that BUG() for any faults.
But it turns out xtensa has a notion of protection faults even for NOMMU configs:
config PFAULT
	bool "Handle protection faults" if EXPERT && !MMU
	default y
	help
	  Handle protection faults. MMU configurations must enable it.
	  noMMU configurations may disable it if used memory map never
	  generates protection faults or faults are always fatal.

	  If unsure, say Y.
which is why it violated my expectations so badly.
I'm not sure if that protection fault handling really ever gets quite this far (it certainly should *not* make it to the BUG() in handle_mm_fault()), but I think the attached patch is likely the right thing to do.
Can you check if it fixes that xtensa case? It looks ObviouslyCorrect(tm) to me, but considering that I clearly missed this case existing AT ALL, it might be best to double-check.
Linus
On Fri, Jun 30, 2023 at 06:24:49PM -0700, Linus Torvalds wrote:
On Fri, 30 Jun 2023 at 15:51, Guenter Roeck linux@roeck-us.net wrote:
There is one more, unfortunately.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... failed
Heh. I didn't even realize that anybody would ever do lock_mm_and_find_vma() code on a nommu platform.
With nommu, handle_mm_fault() will just BUG(), so it's kind of pointless to do any of this at all, and I didn't expect anybody to have this page faulting path that just causes that BUG() for any faults.
But it turns out xtensa has a notion of protection faults even for NOMMU configs:
config PFAULT
	bool "Handle protection faults" if EXPERT && !MMU
	default y
	help
	  Handle protection faults. MMU configurations must enable it.
	  noMMU configurations may disable it if used memory map never
	  generates protection faults or faults are always fatal.

	  If unsure, say Y.
which is why it violated my expectations so badly.
I'm not sure if that protection fault handling really ever gets quite this far (it certainly should *not* make it to the BUG() in handle_mm_fault()), but I think the attached patch is likely the right thing to do.
Can you check if it fixes that xtensa case? It looks ObviouslyCorrect(tm) to me, but considering that I clearly missed this case existing AT ALL, it might be best to double-check.
Linus
Yes, the patch below fixes the problem.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... running ......... passed
Thanks, Guenter
 include/linux/mm.h |  5 +++--
 mm/nommu.c         | 11 +++++++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 39aa409e84d5..4f2c33c273eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2323,6 +2323,9 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
 
+struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
+		unsigned long address, struct pt_regs *regs);
+
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 				  unsigned long address, unsigned int flags,
@@ -2334,8 +2337,6 @@ void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
-struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
-		unsigned long address, struct pt_regs *regs);
 #else
 static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 					 unsigned long address, unsigned int flags,
diff --git a/mm/nommu.c b/mm/nommu.c
index 37d0b03143f1..fdc392735ec6 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -630,6 +630,17 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 }
 EXPORT_SYMBOL(find_vma);
 
+/*
+ * At least xtensa ends up having protection faults even with no
+ * MMU.. No stack expansion, at least.
+ */
+struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
+			unsigned long addr, struct pt_regs *regs)
+{
+	mmap_read_lock(mm);
+	return vma_lookup(mm, addr);
+}
+
 /*
  * expand a stack to a given address
  * - not supported under NOMMU conditions
On Fri, 30 Jun 2023 at 19:50, Guenter Roeck linux@roeck-us.net wrote:
Yes, the patch below fixes the problem.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... running ......... passed
Thanks. Committed as
d85a143b69ab ("xtensa: fix NOMMU build with lock_mm_and_find_vma() conversion")
and pushed out.
Linus
On Fri, Jun 30, 2023 at 09:22:45PM -0700, Linus Torvalds wrote:
On Fri, 30 Jun 2023 at 19:50, Guenter Roeck linux@roeck-us.net wrote:
Yes, the patch below fixes the problem.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... running ......... passed
Thanks. Committed as
d85a143b69ab ("xtensa: fix NOMMU build with lock_mm_and_find_vma() conversion")
and pushed out.
Thanks, now queued up.
greg k-h
Hi Linus,
On Fri, Jun 30, 2023 at 9:23 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, 30 Jun 2023 at 19:50, Guenter Roeck linux@roeck-us.net wrote:
Yes, the patch below fixes the problem.
Building xtensa:de212:kc705-nommu:nommu_kc705_defconfig ... running ......... passed
Thanks. Committed as
d85a143b69ab ("xtensa: fix NOMMU build with lock_mm_and_find_vma() conversion")
and pushed out.
Thanks for the build fix. Unfortunately despite being obviously correct it doesn't release the mm lock in case VMA is not found, so it results in a runtime hang. I've posted a fix for that.
On Sat, 1 Jul 2023 at 03:32, Max Filippov jcmvbkbc@gmail.com wrote:
Thanks for the build fix. Unfortunately despite being obviously correct it doesn't release the mm lock in case VMA is not found, so it results in a runtime hang. I've posted a fix for that.
Heh. I woke up this morning to that feeling of "Duh!" about this, and find you already had fixed it. Patch applied.
Linus
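The locking contract Max describes can be illustrated with a small user-space model. This is a sketch only, and it is not the fix that was actually posted (that patch is not shown in this thread): `struct mm_model`, `struct vma_model`, the lock depth counter and all `*_model` functions are hypothetical stand-ins for the kernel's mm_struct, vm_area_struct, mmap_read_lock()/mmap_read_unlock() and vma_lookup(). The point it demonstrates is that the helper must return with the lock held only when it returns a VMA, because a caller that gets NULL has nothing to unlock through.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a VMA covering [vm_start, vm_end). */
struct vma_model {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Hypothetical model of an mm: a lock depth counter plus a VMA table. */
struct mm_model {
	int read_lock_depth;		/* stand-in for the mmap lock */
	const struct vma_model *vmas;
	size_t nr_vmas;
};

void mmap_read_lock_model(struct mm_model *mm)
{
	mm->read_lock_depth++;
}

void mmap_read_unlock_model(struct mm_model *mm)
{
	assert(mm->read_lock_depth > 0);
	mm->read_lock_depth--;
}

/* Return the VMA containing addr, or NULL if no VMA covers it. */
const struct vma_model *vma_lookup_model(const struct mm_model *mm,
					 unsigned long addr)
{
	size_t i;

	for (i = 0; i < mm->nr_vmas; i++) {
		if (addr >= mm->vmas[i].vm_start && addr < mm->vmas[i].vm_end)
			return &mm->vmas[i];
	}
	return NULL;
}

/*
 * Take the read lock and look up addr. On success the lock stays held
 * for the caller; on failure it is dropped before returning NULL.
 */
const struct vma_model *lock_mm_and_find_vma_model(struct mm_model *mm,
						   unsigned long addr)
{
	const struct vma_model *vma;

	mmap_read_lock_model(mm);
	vma = vma_lookup_model(mm, addr);
	if (!vma)
		mmap_read_unlock_model(mm);
	return vma;
}
```

In this model, a lookup that misses leaves read_lock_depth at 0, while a hit leaves it at 1 until the caller unlocks. The version in the build-fix patch corresponds to dropping the `if (!vma)` branch, which leaves the lock held on the failure path.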
On Thu, Jun 29, 2023 at 11:33:45PM -0700, Linus Torvalds wrote:
On Thu, 29 Jun 2023 at 23:29, Guenter Roeck linux@roeck-us.net wrote:
Did you see that one (in mainline) ?
Building csky:defconfig ... failed
Nope. Thanks. Obvious fix: 'address' is called 'addr' here.
I knew we had all these tiny little mazes that looked the same but were just _subtly_ different, but I still ended up doing too much cut-and-paste.
And I only ended up cross-compiling the fairly small set that I had existing cross-build environments for. Which was less than half our ~24 different architectures.
Oh well. We'll get them all. Eventually. Let me go fix up that csky case.
Thanks, I've picked that up now as well.
greg k-h
On Fri, Jun 30, 2023 at 11:00:51AM +0530, Naresh Kamboju wrote:
On Fri, 30 Jun 2023 at 00:18, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 6.4.1 release. There are 28 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sat, 01 Jul 2023 18:41:39 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.1-rc1.g... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm.
The following build regression was noticed on Linux stable-rc 6.4 and also on Linux mainline master.
The parisc and sparc builds failed:
- build/gcc-11-defconfig
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Parisc Build log:
arch/parisc/mm/fault.c: In function 'do_page_fault':
arch/parisc/mm/fault.c:292:22: error: 'prev' undeclared (first use in this function)
  292 |                 if (!prev || !(prev->vm_flags & VM_GROWSUP))
      |                      ^~~~
arch/parisc/mm/fault.c:292:22: note: each undeclared identifier is reported only once for each function it appears in
This is now fixed in Linus's tree with ea3f8272876f ("parisc: fix expand_stack() conversion"), so I'll queue it up and push out yet-another-rc...
thanks,
greg k-h