July 2024 - Linux-stable-mirror

+ mm-mglru-fix-overshooting-shrinker-memory.patch added to mm-unstable branch

by Andrew Morton

The patch titled Subject: mm/mglru: fix overshooting shrinker memory has been added to the -mm mm-unstable branch. Its filename is mm-mglru-fix-overshooting-shrinker-memory.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Yu Zhao <yuzhao(a)google.com> Subject: mm/mglru: fix overshooting shrinker memory Date: Thu, 11 Jul 2024 13:19:57 -0600 set_initial_priority() tries to jump-start global reclaim by estimating the priority based on cold/hot LRU pages. The estimation does not account for shrinker objects, and it cannot do so because their sizes can be in different units other than page. If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where ZFS ARC can use almost all system memory, set_initial_priority() can vastly underestimate how much memory ARC shrinker can evict and assign extreme low values to scan_control->priority, resulting in overshoots of shrinker objects. To reproduce the problem, using TrueNAS SCALE 24.04.0 with 32GB DRAM, a test ZFS pool and the following commands: fio --name=mglru.file --numjobs=36 --ioengine=io_uring \ --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \ --rw=randread --random_distribution=random \ --time_based --runtime=1h & for ((i = 0; i < 20; i++)) do sleep 120 fio --name=mglru.anon --numjobs=16 --ioengine=mmap \ --filename=/dev/zero --size=1024m --fadvise_hint=0 \ --rw=randrw --random_distribution=random \ --time_based --runtime=1m done To fix the problem: 1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent the jump-start from being overly aggressive. 2. Account for the progress from mm_account_reclaimed_pages(), to prevent kswapd_shrink_node() from raising the priority unnecessarily. Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao(a)google.com> Reported-by: Alexander Motin <mav(a)ixsystems.com> Cc: Wei Xu <weixugc(a)google.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/vmscan.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) --- a/mm/vmscan.c~mm-mglru-fix-overshooting-shrinker-memory +++ a/mm/vmscan.c @@ -4930,7 +4930,11 @@ static void set_initial_priority(struct /* round down reclaimable and round up sc->nr_to_reclaim */ priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1); - sc->priority = clamp(priority, 0, DEF_PRIORITY); + /* + * The estimation is based on LRU pages only, so cap it to prevent + * overshoots of shrinker objects by large margins. + */ + sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY); } static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc) @@ -6754,6 +6758,7 @@ static bool kswapd_shrink_node(pg_data_t { struct zone *zone; int z; + unsigned long nr_reclaimed = sc->nr_reclaimed; /* Reclaim a number of pages proportional to the number of zones */ sc->nr_to_reclaim = 0; @@ -6781,7 +6786,8 @@ static bool kswapd_shrink_node(pg_data_t if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) sc->order = 0; - return sc->nr_scanned >= sc->nr_to_reclaim; + /* account for progress from mm_account_reclaimed_pages() */ + return max(sc->nr_scanned, sc->nr_reclaimed - nr_reclaimed) >= sc->nr_to_reclaim; } /* Page allocator PCP high watermark is lowered if reclaim is active. */ _ Patches currently in -mm which might be from yuzhao(a)google.com are mm-truncate-batch-clear-shadow-entries.patch mm-truncate-batch-clear-shadow-entries-v2.patch mm-mglru-fix-div-by-zero-in-vmpressure_calc_level.patch mm-mglru-fix-overshooting-shrinker-memory.patch

1 year

1
0
0 0

+ mm-mglru-fix-div-by-zero-in-vmpressure_calc_level.patch added to mm-unstable branch

by Andrew Morton

The patch titled Subject: mm/mglru: fix div-by-zero in vmpressure_calc_level() has been added to the -mm mm-unstable branch. Its filename is mm-mglru-fix-div-by-zero-in-vmpressure_calc_level.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Yu Zhao <yuzhao(a)google.com> Subject: mm/mglru: fix div-by-zero in vmpressure_calc_level() Date: Thu, 11 Jul 2024 13:19:56 -0600 evict_folios() uses a second pass to reclaim folios that have gone through page writeback and become clean before it finishes the first pass, since folio_rotate_reclaimable() cannot handle those folios due to the isolation. The second pass tries to avoid potential double counting by deducting scan_control->nr_scanned. However, this can result in underflow of nr_scanned, under a condition where shrink_folio_list() does not increment nr_scanned, i.e., when folio_trylock() fails. The underflow can cause the divisor, i.e., scale=scanned+reclaimed in vmpressure_calc_level(), to become zero, resulting in the following crash: [exception RIP: vmpressure_work_fn+101] process_one_work at ffffffffa3313f2b Since scan_control->nr_scanned has no established semantics, the potential double counting has minimal risks. Therefore, fix the problem by not deducting scan_control->nr_scanned in evict_folios(). Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated") Reported-by: Wei Xu <weixugc(a)google.com> Signed-off-by: Yu Zhao <yuzhao(a)google.com> Cc: Alexander Motin <mav(a)ixsystems.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/vmscan.c | 1 - 1 file changed, 1 deletion(-) --- a/mm/vmscan.c~mm-mglru-fix-div-by-zero-in-vmpressure_calc_level +++ a/mm/vmscan.c @@ -4597,7 +4597,6 @@ retry: /* retry folios that may have missed folio_rotate_reclaimable() */ list_move(&folio->lru, &clean); - sc->nr_scanned -= folio_nr_pages(folio); } spin_lock_irq(&lruvec->lru_lock); _ Patches currently in -mm which might be from yuzhao(a)google.com are mm-truncate-batch-clear-shadow-entries.patch mm-truncate-batch-clear-shadow-entries-v2.patch mm-mglru-fix-div-by-zero-in-vmpressure_calc_level.patch mm-mglru-fix-overshooting-shrinker-memory.patch

1 year

1
0
0 0

[PATCH mm-unstable v1 1/2] mm/mglru: fix div-by-zero in vmpressure_calc_level()

by Yu Zhao

evict_folios() uses a second pass to reclaim folios that have gone through page writeback and become clean before it finishes the first pass, since folio_rotate_reclaimable() cannot handle those folios due to the isolation. The second pass tries to avoid potential double counting by deducting scan_control->nr_scanned. However, this can result in underflow of nr_scanned, under a condition where shrink_folio_list() does not increment nr_scanned, i.e., when folio_trylock() fails. The underflow can cause the divisor, i.e., scale=scanned+reclaimed in vmpressure_calc_level(), to become zero, resulting in the following crash: [exception RIP: vmpressure_work_fn+101] process_one_work at ffffffffa3313f2b Since scan_control->nr_scanned has no established semantics, the potential double counting has minimal risks. Therefore, fix the problem by not deducting scan_control->nr_scanned in evict_folios(). Reported-by: Wei Xu <weixugc(a)google.com> Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated") Cc: stable(a)vger.kernel.org Signed-off-by: Yu Zhao <yuzhao(a)google.com> --- mm/vmscan.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 0761f91b407f..6403038c776e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4597,7 +4597,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap /* retry folios that may have missed folio_rotate_reclaimable() */ list_move(&folio->lru, &clean); - sc->nr_scanned -= folio_nr_pages(folio); } spin_lock_irq(&lruvec->lru_lock); -- 2.45.2.993.g49e7a77208-goog

1 year

1
1
0 0

+ crash-fix-x86_32-memory-reserve-dead-loop-retry-bug.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: crash: fix x86_32 memory reserve dead loop retry bug has been added to the -mm mm-hotfixes-unstable branch. Its filename is crash-fix-x86_32-memory-reserve-dead-loop-retry-bug.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Jinjie Ruan <ruanjinjie(a)huawei.com> Subject: crash: fix x86_32 memory reserve dead loop retry bug Date: Thu, 11 Jul 2024 15:31:18 +0800 On x86_32 Qemu machine with 1GB memory, the cmdline "crashkernel=1G,high" will cause system stall as below: ACPI: Reserving FACP table memory at [mem 0x3ffe18b8-0x3ffe192b] ACPI: Reserving DSDT table memory at [mem 0x3ffe0040-0x3ffe18b7] ACPI: Reserving FACS table memory at [mem 0x3ffe0000-0x3ffe003f] ACPI: Reserving APIC table memory at [mem 0x3ffe192c-0x3ffe19bb] ACPI: Reserving HPET table memory at [mem 0x3ffe19bc-0x3ffe19f3] ACPI: Reserving WAET table memory at [mem 0x3ffe19f4-0x3ffe1a1b] 143MB HIGHMEM available. 879MB LOWMEM available. mapped low ram: 0 - 36ffe000 low ram: 0 - 36ffe000 (stall here) The reason is that the CRASH_ADDR_LOW_MAX is equal to CRASH_ADDR_HIGH_MAX on x86_32, the first high crash kernel memory reservation will fail, then go into the "retry" loop and never came out as below. -> reserve_crashkernel_generic() and high is true -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX). Fix it by changing the out check condition. After this patch, it prints: cannot allocate crashkernel (size:0x40000000) Link: https://lkml.kernel.org/r/20240711073118.1289866-1-ruanjinjie@huawei.com Fixes: 9c08a2a139fe ("x86: kdump: use generic interface to simplify crashkernel reservation code") Signed-off-by: Jinjie Ruan <ruanjinjie(a)huawei.com> Cc: Baoquan He <bhe(a)redhat.com> Cc: Dave Young <dyoung(a)redhat.com> Cc: Vivek Goyal <vgoyal(a)redhat.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- kernel/crash_reserve.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/kernel/crash_reserve.c~crash-fix-x86_32-memory-reserve-dead-loop-retry-bug +++ a/kernel/crash_reserve.c @@ -421,7 +421,7 @@ retry: * For crashkernel=size[KMG],high, if the first attempt was * for high memory, fall back to low memory. */ - if (high && search_end == CRASH_ADDR_HIGH_MAX) { + if (high && search_base == CRASH_ADDR_LOW_MAX) { search_end = CRASH_ADDR_LOW_MAX; search_base = 0; goto retry; _ Patches currently in -mm which might be from ruanjinjie(a)huawei.com are crash-fix-x86_32-memory-reserve-dead-loop-retry-bug.patch

1 year

1
0
0 0

[PATCH v4] x86/entry_32: Use stack segment selector for VERW operand

by Pawan Gupta

Robert Gill reported below #GP when dosemu software was executing vm86() system call: general protection fault: 0000 [#1] PREEMPT SMP CPU: 4 PID: 4610 Comm: dosemu.bin Not tainted 6.6.21-gentoo-x86 #1 Hardware name: Dell Inc. PowerEdge 1950/0H723K, BIOS 2.7.0 10/30/2010 EIP: restore_all_switch_stack+0xbe/0xcf EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: 00000000 EBP: 00000000 ESP: ff8affdc DS: 0000 ES: 0000 FS: 0000 GS: 0033 SS: 0068 EFLAGS: 00010046 CR0: 80050033 CR2: 00c2101c CR3: 04b6d000 CR4: 000406d0 Call Trace: show_regs+0x70/0x78 die_addr+0x29/0x70 exc_general_protection+0x13c/0x348 exc_bounds+0x98/0x98 handle_exception+0x14d/0x14d exc_bounds+0x98/0x98 restore_all_switch_stack+0xbe/0xcf exc_bounds+0x98/0x98 restore_all_switch_stack+0xbe/0xcf This only happens when VERW based mitigations like MDS/RFDS are enabled. This is because segment registers with an arbitrary user value can result in #GP when executing VERW. Intel SDM vol. 2C documents the following behavior for VERW instruction: #GP(0) - If a memory operand effective address is outside the CS, DS, ES, FS, or GS segment limit. CLEAR_CPU_BUFFERS macro executes VERW instruction before returning to user space. Replace CLEAR_CPU_BUFFERS with a safer version that uses %ss to refer VERW operand mds_verw_sel. This ensures VERW will not #GP for an arbitrary user %ds. Also, in NMI return path, move VERW to after RESTORE_ALL_NMI that touches GPRs. For clarity, below are the locations where the new CLEAR_CPU_BUFFERS_SAFE version is being used: * entry_INT80_32(), entry_SYSENTER_32() and interrupts (via handle_exception_return) do: restore_all_switch_stack: [...] mov %esi,%esi verw %ss:0xc0fc92c0 <------------- iret * Opportunistic SYSEXIT: [...] verw %ss:0xc0fc92c0 <------------- btrl $0x9,(%esp) popf pop %eax sti sysexit * nmi_return and nmi_from_espfix: mov %esi,%esi verw %ss:0xc0fc92c0 <------------- jmp .Lirq_return Fixes: a0e2dab44d22 ("x86/entry_32: Add VERW just before userspace transition") Cc: stable(a)vger.kernel.org # 5.10+ Reported-by: Robert Gill <rtgill82(a)gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218707 Closes: https://lore.kernel.org/all/8c77ccfd-d561-45a1-8ed5-6b75212c7a58@leemhuis.i… Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com> Suggested-by: Brian Gerst <brgerst(a)gmail.com> # Use %ss Signed-off-by: Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com> --- v4: - Further simplify the patch by using %ss for all VERW calls in 32-bit mode (Brian). - In NMI exit path move VERW after RESTORE_ALL_NMI that touches GPRs (Dave). v3: https://lore.kernel.org/r/20240701-fix-dosemu-vm86-v3-1-b1969532c75a@linux.… - Simplify CLEAR_CPU_BUFFERS_SAFE by using %ss instead of %ds (Brian). - Do verw before popf in SYSEXIT path (Jari). v2: https://lore.kernel.org/r/20240627-fix-dosemu-vm86-v2-1-d5579f698e77@linux.… - Safe guard against any other system calls like vm86() that might change %ds (Dave). v1: https://lore.kernel.org/r/20240426-fix-dosemu-vm86-v1-1-88c826a3f378@linux.… --- --- arch/x86/entry/entry_32.S | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index d3a814efbff6..d54f6002e5a0 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -253,6 +253,16 @@ .Lend_\@: .endm +/* + * Safer version of CLEAR_CPU_BUFFERS that uses %ss to reference VERW operand + * mds_verw_sel. This ensures VERW will not #GP for an arbitrary user %ds. + */ +.macro CLEAR_CPU_BUFFERS_SAFE + ALTERNATIVE "jmp .Lskip_verw\@", "", X86_FEATURE_CLEAR_CPU_BUF + verw %ss:_ASM_RIP(mds_verw_sel) +.Lskip_verw\@: +.endm + .macro RESTORE_INT_REGS popl %ebx popl %ecx @@ -871,6 +881,8 @@ SYM_FUNC_START(entry_SYSENTER_32) /* Now ready to switch the cr3 */ SWITCH_TO_USER_CR3 scratch_reg=%eax + /* Clobbers ZF */ + CLEAR_CPU_BUFFERS_SAFE /* * Restore all flags except IF. (We restore IF separately because @@ -881,7 +893,6 @@ SYM_FUNC_START(entry_SYSENTER_32) BUG_IF_WRONG_CR3 no_user_check=1 popfl popl %eax - CLEAR_CPU_BUFFERS /* * Return back to the vDSO, which will pop ecx and edx. @@ -951,7 +962,7 @@ restore_all_switch_stack: /* Restore user state */ RESTORE_REGS pop=4 # skip orig_eax/error_code - CLEAR_CPU_BUFFERS + CLEAR_CPU_BUFFERS_SAFE .Lirq_return: /* * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization @@ -1144,7 +1155,6 @@ SYM_CODE_START(asm_exc_nmi) /* Not on SYSENTER stack. */ call exc_nmi - CLEAR_CPU_BUFFERS jmp .Lnmi_return .Lnmi_from_sysenter_stack: @@ -1165,6 +1175,7 @@ SYM_CODE_START(asm_exc_nmi) CHECK_AND_APPLY_ESPFIX RESTORE_ALL_NMI cr3_reg=%edi pop=4 + CLEAR_CPU_BUFFERS_SAFE jmp .Lirq_return #ifdef CONFIG_X86_ESPFIX32 @@ -1206,6 +1217,7 @@ SYM_CODE_START(asm_exc_nmi) * 1 - orig_ax */ lss (1+5+6)*4(%esp), %esp # back to espfix stack + CLEAR_CPU_BUFFERS_SAFE jmp .Lirq_return #endif SYM_CODE_END(asm_exc_nmi) --- base-commit: f2661062f16b2de5d7b6a5c42a9a5c96326b8454 change-id: 20240426-fix-dosemu-vm86-dd111a01737e

1 year

5
9
0 0

[PATCH fs/bfs 0/2] bfs: fix null-ptr-deref and possible warning in bfs_move_block() func

by kovalev＠altlinux.org

https://syzkaller.appspot.com/bug?extid=d98fd19acd08b36ff422 [PATCH fs/bfs 1/2] bfs: fix null-ptr-deref in bfs_move_block [PATCH fs/bfs 2/2] bfs: add buffer_uptodate check before mark_buffer_dirty

1 year

4
6
0 0

[PATCH] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF

by kan.liang＠linux.intel.com

From: Kan Liang <kan.liang(a)linux.intel.com> The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves. To tell the availability of the sub-leaf 1 (enumerate the counter mask), perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1). The error is not user-visible on bare metal. Because the sub-leaf 0 and the sub-leaf 1 are always available. However, it may bring issues in a virtualization environment when a VMM only enumerates the sub-leaf 0. Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf") Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com> Cc: stable(a)vger.kernel.org --- arch/x86/events/intel/core.c | 4 ++-- arch/x86/include/asm/perf_event.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index cd8f2db6cdf6..3fb81f7b618c 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4842,8 +4842,8 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu) if (ebx & ARCH_PERFMON_EXT_EQ) pmu->config_mask |= ARCH_PERFMON_EVENTSEL_EQ; - if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF_BIT) { - cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF, + if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF) { + cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF_BIT, &eax, &ebx, &ecx, &edx); pmu->cntr_mask64 = eax; pmu->fixed_cntr_mask64 = ebx; diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 91b73571412f..41ace8431e01 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -190,7 +190,7 @@ union cpuid10_edx { #define ARCH_PERFMON_EXT_UMASK2 0x1 #define ARCH_PERFMON_EXT_EQ 0x2 #define ARCH_PERFMON_NUM_COUNTER_LEAF_BIT 0x1 -#define ARCH_PERFMON_NUM_COUNTER_LEAF 0x1 +#define ARCH_PERFMON_NUM_COUNTER_LEAF BIT(ARCH_PERFMON_NUM_COUNTER_LEAF_BIT) /* * Intel Architectural LBR CPUID detection/enumeration details: -- 2.38.1

1 year

1
0
0 0

[tip: timers/core] tick/broadcast: Make takeover of broadcast hrtimer reliable

by tip-bot2 for Yu Liao

The following commit has been merged into the timers/core branch of tip: Commit-ID: f7d43dd206e7e18c182f200e67a8db8c209907fa Gitweb: https://git.kernel.org/tip/f7d43dd206e7e18c182f200e67a8db8c209907fa Author: Yu Liao <liaoyu15(a)huawei.com> AuthorDate: Thu, 11 Jul 2024 20:48:43 +08:00 Committer: Thomas Gleixner <tglx(a)linutronix.de> CommitterDate: Thu, 11 Jul 2024 18:00:24 +02:00 tick/broadcast: Make takeover of broadcast hrtimer reliable Running the LTP hotplug stress test on a aarch64 machine results in rcu_sched stall warnings when the broadcast hrtimer was owned by the un-plugged CPU. The issue is the following: CPU1 (owns the broadcast hrtimer) CPU2 tick_broadcast_enter() // shutdown local timer device broadcast_shutdown_local() ... tick_broadcast_exit() clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT) // timer device is not programmed cpumask_set_cpu(cpu, tick_broadcast_force_mask) initiates offlining of CPU1 take_cpu_down() /* * CPU1 shuts down and does not * send broadcast IPI anymore */ takedown_cpu() hotplug_cpu__broadcast_tick_pull() // move broadcast hrtimer to this CPU clockevents_program_event() bc_set_next() hrtimer_start() /* * timer device is not programmed * because only the first expiring * timer will trigger clockevent * device reprogramming */ What happens is that CPU2 exits broadcast mode with force bit set, then the local timer device is not reprogrammed and CPU2 expects to receive the expired event by the broadcast IPI. But this does not happen because CPU1 is offlined by CPU2. CPU switches the clockevent device to ONESHOT state, but does not reprogram the device. The subsequent reprogramming of the hrtimer broadcast device does not program the clockevent device of CPU2 either because the pending expiry time is already in the past and the CPU expects the event to be delivered. As a consequence all CPUs which wait for a broadcast event to be delivered are stuck forever. Fix this issue by reprogramming the local timer device if the broadcast force bit of the CPU is set so that the broadcast hrtimer is delivered. [ tglx: Massage comment and change log. Add Fixes tag ] Fixes: 989dcb645ca7 ("tick: Handle broadcast wakeup of multiple cpus") Signed-off-by: Yu Liao <liaoyu15(a)huawei.com> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de> Cc: stable(a)vger.kernel.org Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com --- kernel/time/tick-broadcast.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 771d1e0..b484309 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -1141,6 +1141,7 @@ void tick_broadcast_switch_to_oneshot(void) #ifdef CONFIG_HOTPLUG_CPU void hotplug_cpu__broadcast_tick_pull(int deadcpu) { + struct tick_device *td = this_cpu_ptr(&tick_cpu_device); struct clock_event_device *bc; unsigned long flags; @@ -1148,6 +1149,28 @@ void hotplug_cpu__broadcast_tick_pull(int deadcpu) bc = tick_broadcast_device.evtdev; if (bc && broadcast_needs_cpu(bc, deadcpu)) { + /* + * If the broadcast force bit of the current CPU is set, + * then the current CPU has not yet reprogrammed the local + * timer device to avoid a ping-pong race. See + * ___tick_broadcast_oneshot_control(). + * + * If the broadcast device is hrtimer based then + * programming the broadcast event below does not have any + * effect because the local clockevent device is not + * running and not programmed because the broadcast event + * is not earlier than the pending event of the local clock + * event device. As a consequence all CPUs waiting for a + * broadcast event are stuck forever. + * + * Detect this condition and reprogram the cpu local timer + * device to avoid the starvation. + */ + if (tick_check_broadcast_expired()) { + cpumask_clear_cpu(smp_processor_id(), tick_broadcast_force_mask); + tick_program_event(td->evtdev->next_event, 1); + } + /* This moves the broadcast assignment to this CPU: */ clockevents_program_event(bc, bc->next_event, 1); }

1 year

1
0
0 0

[PATCH 1/2] mmc: davinci_mmc: Prevent transmitted data size from exceeding sgm's length

by Bastien Curutchet

No check is done on the size of the data to be transmiited. This causes a kernel panic when this size exceeds the sg_miter's length. Limit the number of transmitted bytes to sgm->length. Cc: stable(a)vger.kernel.org Fixes: ed01d210fd91 ("mmc: davinci_mmc: Use sg_miter for PIO") Signed-off-by: Bastien Curutchet <bastien.curutchet(a)bootlin.com> --- drivers/mmc/host/davinci_mmc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/mmc/host/davinci_mmc.c b/drivers/mmc/host/davinci_mmc.c index d7427894e0bc..c302eb380e42 100644 --- a/drivers/mmc/host/davinci_mmc.c +++ b/drivers/mmc/host/davinci_mmc.c @@ -224,6 +224,9 @@ static void davinci_fifo_data_trans(struct mmc_davinci_host *host, } p = sgm->addr; + if (n > sgm->length) + n = sgm->length; + /* NOTE: we never transfer more than rw_threshold bytes * to/from the fifo here; there's no I/O overlap. * This also assumes that access width( i.e. ACCWD) is 4 bytes -- 2.45.0

1 year

2
1
0 0

[tip: irq/core] irqchip/imx-irqsteer: Handle runtime power management correctly

by tip-bot2 for Shenwei Wang

The following commit has been merged into the irq/core branch of tip: Commit-ID: cd0406fab8d4a16ec4f617c0a4b405bf70543b5b Gitweb: https://git.kernel.org/tip/cd0406fab8d4a16ec4f617c0a4b405bf70543b5b Author: Shenwei Wang <shenwei.wang(a)nxp.com> AuthorDate: Wed, 03 Jul 2024 11:32:50 -05:00 Committer: Thomas Gleixner <tglx(a)linutronix.de> CommitterDate: Thu, 11 Jul 2024 17:06:02 +02:00 irqchip/imx-irqsteer: Handle runtime power management correctly The power domain is automatically activated from clk_prepare(). However, on certain platforms like i.MX8QM and i.MX8QXP, the power-on handling invokes sleeping functions, which triggers the 'scheduling while atomic' bug in the context switch path during device probing: BUG: scheduling while atomic: kworker/u13:1/48/0x00000002 Call trace: __schedule_bug+0x54/0x6c __schedule+0x7f0/0xa94 schedule+0x5c/0xc4 schedule_preempt_disabled+0x24/0x40 __mutex_lock.constprop.0+0x2c0/0x540 __mutex_lock_slowpath+0x14/0x20 mutex_lock+0x48/0x54 clk_prepare_lock+0x44/0xa0 clk_prepare+0x20/0x44 imx_irqsteer_resume+0x28/0xe0 pm_generic_runtime_resume+0x2c/0x44 __genpd_runtime_resume+0x30/0x80 genpd_runtime_resume+0xc8/0x2c0 __rpm_callback+0x48/0x1d8 rpm_callback+0x6c/0x78 rpm_resume+0x490/0x6b4 __pm_runtime_resume+0x50/0x94 irq_chip_pm_get+0x2c/0xa0 __irq_do_set_handler+0x178/0x24c irq_set_chained_handler_and_data+0x60/0xa4 mxc_gpio_probe+0x160/0x4b0 Cure this by implementing the irq_bus_lock/sync_unlock() interrupt chip callbacks and handle power management in them as they are invoked from non-atomic context. [ tglx: Rewrote change log, added Fixes tag ] Fixes: 0136afa08967 ("irqchip: Add driver for imx-irqsteer controller") Signed-off-by: Shenwei Wang <shenwei.wang(a)nxp.com> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de> Cc: stable(a)vger.kernel.org Link: https://lore.kernel.org/r/20240703163250.47887-1-shenwei.wang@nxp.com --- drivers/irqchip/irq-imx-irqsteer.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/drivers/irqchip/irq-imx-irqsteer.c b/drivers/irqchip/irq-imx-irqsteer.c index 20cf7a9..75a0e98 100644 --- a/drivers/irqchip/irq-imx-irqsteer.c +++ b/drivers/irqchip/irq-imx-irqsteer.c @@ -36,6 +36,7 @@ struct irqsteer_data { int channel; struct irq_domain *domain; u32 *saved_reg; + struct device *dev; }; static int imx_irqsteer_get_reg_index(struct irqsteer_data *data, @@ -72,10 +73,26 @@ static void imx_irqsteer_irq_mask(struct irq_data *d) raw_spin_unlock_irqrestore(&data->lock, flags); } +static void imx_irqsteer_irq_bus_lock(struct irq_data *d) +{ + struct irqsteer_data *data = d->chip_data; + + pm_runtime_get_sync(data->dev); +} + +static void imx_irqsteer_irq_bus_sync_unlock(struct irq_data *d) +{ + struct irqsteer_data *data = d->chip_data; + + pm_runtime_put_autosuspend(data->dev); +} + static const struct irq_chip imx_irqsteer_irq_chip = { - .name = "irqsteer", - .irq_mask = imx_irqsteer_irq_mask, - .irq_unmask = imx_irqsteer_irq_unmask, + .name = "irqsteer", + .irq_mask = imx_irqsteer_irq_mask, + .irq_unmask = imx_irqsteer_irq_unmask, + .irq_bus_lock = imx_irqsteer_irq_bus_lock, + .irq_bus_sync_unlock = imx_irqsteer_irq_bus_sync_unlock, }; static int imx_irqsteer_irq_map(struct irq_domain *h, unsigned int irq, @@ -150,6 +167,7 @@ static int imx_irqsteer_probe(struct platform_device *pdev) if (!data) return -ENOMEM; + data->dev = &pdev->dev; data->regs = devm_platform_ioremap_resource(pdev, 0); if (IS_ERR(data->regs)) { dev_err(&pdev->dev, "failed to initialize reg\n");

1 year

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror July 2024