evict_folios() uses a second pass to reclaim folios that have gone
through page writeback and become clean before it finishes the first
pass, since folio_rotate_reclaimable() cannot handle those folios due
to the isolation.
The second pass tries to avoid potential double counting by deducting
scan_control->nr_scanned. However, this can result in underflow of
nr_scanned, under a condition where shrink_folio_list() does not
increment nr_scanned, i.e., when folio_trylock() fails.
The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
vmpressure_calc_level(), to become zero, resulting in the following
crash:
[exception RIP: vmpressure_work_fn+101]
process_one_work at ffffffffa3313f2b
Since scan_control->nr_scanned has no established semantics, the
potential double counting has minimal risks. Therefore, fix the
problem by not deducting scan_control->nr_scanned in evict_folios().
Reported-by: Wei Xu <weixugc(a)google.com>
Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated")
Cc: stable(a)vger.kernel.org
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
---
mm/vmscan.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0761f91b407f..6403038c776e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4597,7 +4597,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
/* retry folios that may have missed folio_rotate_reclaimable() */
list_move(&folio->lru, &clean);
- sc->nr_scanned -= folio_nr_pages(folio);
}
spin_lock_irq(&lruvec->lru_lock);
--
2.45.2.993.g49e7a77208-goog
The patch titled
Subject: crash: fix x86_32 memory reserve dead loop retry bug
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
crash-fix-x86_32-memory-reserve-dead-loop-retry-bug.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Jinjie Ruan <ruanjinjie(a)huawei.com>
Subject: crash: fix x86_32 memory reserve dead loop retry bug
Date: Thu, 11 Jul 2024 15:31:18 +0800
On x86_32 Qemu machine with 1GB memory, the cmdline "crashkernel=1G,high"
will cause system stall as below:
ACPI: Reserving FACP table memory at [mem 0x3ffe18b8-0x3ffe192b]
ACPI: Reserving DSDT table memory at [mem 0x3ffe0040-0x3ffe18b7]
ACPI: Reserving FACS table memory at [mem 0x3ffe0000-0x3ffe003f]
ACPI: Reserving APIC table memory at [mem 0x3ffe192c-0x3ffe19bb]
ACPI: Reserving HPET table memory at [mem 0x3ffe19bc-0x3ffe19f3]
ACPI: Reserving WAET table memory at [mem 0x3ffe19f4-0x3ffe1a1b]
143MB HIGHMEM available.
879MB LOWMEM available.
mapped low ram: 0 - 36ffe000
low ram: 0 - 36ffe000
(stall here)
The reason is that the CRASH_ADDR_LOW_MAX is equal to CRASH_ADDR_HIGH_MAX
on x86_32, the first high crash kernel memory reservation will fail, then
go into the "retry" loop and never came out as below.
-> reserve_crashkernel_generic() and high is true
-> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
-> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
(because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).
Fix it by changing the out check condition.
After this patch, it prints:
cannot allocate crashkernel (size:0x40000000)
Link: https://lkml.kernel.org/r/20240711073118.1289866-1-ruanjinjie@huawei.com
Fixes: 9c08a2a139fe ("x86: kdump: use generic interface to simplify crashkernel reservation code")
Signed-off-by: Jinjie Ruan <ruanjinjie(a)huawei.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/crash_reserve.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/crash_reserve.c~crash-fix-x86_32-memory-reserve-dead-loop-retry-bug
+++ a/kernel/crash_reserve.c
@@ -421,7 +421,7 @@ retry:
* For crashkernel=size[KMG],high, if the first attempt was
* for high memory, fall back to low memory.
*/
- if (high && search_end == CRASH_ADDR_HIGH_MAX) {
+ if (high && search_base == CRASH_ADDR_LOW_MAX) {
search_end = CRASH_ADDR_LOW_MAX;
search_base = 0;
goto retry;
_
Patches currently in -mm which might be from ruanjinjie(a)huawei.com are
crash-fix-x86_32-memory-reserve-dead-loop-retry-bug.patch
Robert Gill reported below #GP when dosemu software was executing vm86()
system call:
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 4 PID: 4610 Comm: dosemu.bin Not tainted 6.6.21-gentoo-x86 #1
Hardware name: Dell Inc. PowerEdge 1950/0H723K, BIOS 2.7.0 10/30/2010
EIP: restore_all_switch_stack+0xbe/0xcf
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: 00000000 ESP: ff8affdc
DS: 0000 ES: 0000 FS: 0000 GS: 0033 SS: 0068 EFLAGS: 00010046
CR0: 80050033 CR2: 00c2101c CR3: 04b6d000 CR4: 000406d0
Call Trace:
show_regs+0x70/0x78
die_addr+0x29/0x70
exc_general_protection+0x13c/0x348
exc_bounds+0x98/0x98
handle_exception+0x14d/0x14d
exc_bounds+0x98/0x98
restore_all_switch_stack+0xbe/0xcf
exc_bounds+0x98/0x98
restore_all_switch_stack+0xbe/0xcf
This only happens when VERW based mitigations like MDS/RFDS are enabled.
This is because segment registers with an arbitrary user value can result
in #GP when executing VERW. Intel SDM vol. 2C documents the following
behavior for VERW instruction:
#GP(0) - If a memory operand effective address is outside the CS, DS, ES,
FS, or GS segment limit.
CLEAR_CPU_BUFFERS macro executes VERW instruction before returning to user
space. Replace CLEAR_CPU_BUFFERS with a safer version that uses %ss to
refer VERW operand mds_verw_sel. This ensures VERW will not #GP for an
arbitrary user %ds. Also, in NMI return path, move VERW to after
RESTORE_ALL_NMI that touches GPRs.
For clarity, below are the locations where the new CLEAR_CPU_BUFFERS_SAFE
version is being used:
* entry_INT80_32(), entry_SYSENTER_32() and interrupts (via
handle_exception_return) do:
restore_all_switch_stack:
[...]
mov %esi,%esi
verw %ss:0xc0fc92c0 <-------------
iret
* Opportunistic SYSEXIT:
[...]
verw %ss:0xc0fc92c0 <-------------
btrl $0x9,(%esp)
popf
pop %eax
sti
sysexit
* nmi_return and nmi_from_espfix:
mov %esi,%esi
verw %ss:0xc0fc92c0 <-------------
jmp .Lirq_return
Fixes: a0e2dab44d22 ("x86/entry_32: Add VERW just before userspace transition")
Cc: stable(a)vger.kernel.org # 5.10+
Reported-by: Robert Gill <rtgill82(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218707
Closes: https://lore.kernel.org/all/8c77ccfd-d561-45a1-8ed5-6b75212c7a58@leemhuis.i…
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Suggested-by: Brian Gerst <brgerst(a)gmail.com> # Use %ss
Signed-off-by: Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
---
v4:
- Further simplify the patch by using %ss for all VERW calls in 32-bit mode (Brian).
- In NMI exit path move VERW after RESTORE_ALL_NMI that touches GPRs (Dave).
v3: https://lore.kernel.org/r/20240701-fix-dosemu-vm86-v3-1-b1969532c75a@linux.…
- Simplify CLEAR_CPU_BUFFERS_SAFE by using %ss instead of %ds (Brian).
- Do verw before popf in SYSEXIT path (Jari).
v2: https://lore.kernel.org/r/20240627-fix-dosemu-vm86-v2-1-d5579f698e77@linux.…
- Safe guard against any other system calls like vm86() that might change %ds (Dave).
v1: https://lore.kernel.org/r/20240426-fix-dosemu-vm86-v1-1-88c826a3f378@linux.…
---
---
arch/x86/entry/entry_32.S | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d3a814efbff6..d54f6002e5a0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -253,6 +253,16 @@
.Lend_\@:
.endm
+/*
+ * Safer version of CLEAR_CPU_BUFFERS that uses %ss to reference VERW operand
+ * mds_verw_sel. This ensures VERW will not #GP for an arbitrary user %ds.
+ */
+.macro CLEAR_CPU_BUFFERS_SAFE
+ ALTERNATIVE "jmp .Lskip_verw\@", "", X86_FEATURE_CLEAR_CPU_BUF
+ verw %ss:_ASM_RIP(mds_verw_sel)
+.Lskip_verw\@:
+.endm
+
.macro RESTORE_INT_REGS
popl %ebx
popl %ecx
@@ -871,6 +881,8 @@ SYM_FUNC_START(entry_SYSENTER_32)
/* Now ready to switch the cr3 */
SWITCH_TO_USER_CR3 scratch_reg=%eax
+ /* Clobbers ZF */
+ CLEAR_CPU_BUFFERS_SAFE
/*
* Restore all flags except IF. (We restore IF separately because
@@ -881,7 +893,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
BUG_IF_WRONG_CR3 no_user_check=1
popfl
popl %eax
- CLEAR_CPU_BUFFERS
/*
* Return back to the vDSO, which will pop ecx and edx.
@@ -951,7 +962,7 @@ restore_all_switch_stack:
/* Restore user state */
RESTORE_REGS pop=4 # skip orig_eax/error_code
- CLEAR_CPU_BUFFERS
+ CLEAR_CPU_BUFFERS_SAFE
.Lirq_return:
/*
* ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
@@ -1144,7 +1155,6 @@ SYM_CODE_START(asm_exc_nmi)
/* Not on SYSENTER stack. */
call exc_nmi
- CLEAR_CPU_BUFFERS
jmp .Lnmi_return
.Lnmi_from_sysenter_stack:
@@ -1165,6 +1175,7 @@ SYM_CODE_START(asm_exc_nmi)
CHECK_AND_APPLY_ESPFIX
RESTORE_ALL_NMI cr3_reg=%edi pop=4
+ CLEAR_CPU_BUFFERS_SAFE
jmp .Lirq_return
#ifdef CONFIG_X86_ESPFIX32
@@ -1206,6 +1217,7 @@ SYM_CODE_START(asm_exc_nmi)
* 1 - orig_ax
*/
lss (1+5+6)*4(%esp), %esp # back to espfix stack
+ CLEAR_CPU_BUFFERS_SAFE
jmp .Lirq_return
#endif
SYM_CODE_END(asm_exc_nmi)
---
base-commit: f2661062f16b2de5d7b6a5c42a9a5c96326b8454
change-id: 20240426-fix-dosemu-vm86-dd111a01737e
From: Kan Liang <kan.liang(a)linux.intel.com>
The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves.
To tell the availability of the sub-leaf 1 (enumerate the counter mask),
perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1).
The error is not user-visible on bare metal. Because the sub-leaf 0 and
the sub-leaf 1 are always available. However, it may bring issues in a
virtualization environment when a VMM only enumerates the sub-leaf 0.
Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf")
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/events/intel/core.c | 4 ++--
arch/x86/include/asm/perf_event.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index cd8f2db6cdf6..3fb81f7b618c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4842,8 +4842,8 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
if (ebx & ARCH_PERFMON_EXT_EQ)
pmu->config_mask |= ARCH_PERFMON_EVENTSEL_EQ;
- if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF_BIT) {
- cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
+ if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF) {
+ cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF_BIT,
&eax, &ebx, &ecx, &edx);
pmu->cntr_mask64 = eax;
pmu->fixed_cntr_mask64 = ebx;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 91b73571412f..41ace8431e01 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -190,7 +190,7 @@ union cpuid10_edx {
#define ARCH_PERFMON_EXT_UMASK2 0x1
#define ARCH_PERFMON_EXT_EQ 0x2
#define ARCH_PERFMON_NUM_COUNTER_LEAF_BIT 0x1
-#define ARCH_PERFMON_NUM_COUNTER_LEAF 0x1
+#define ARCH_PERFMON_NUM_COUNTER_LEAF BIT(ARCH_PERFMON_NUM_COUNTER_LEAF_BIT)
/*
* Intel Architectural LBR CPUID detection/enumeration details:
--
2.38.1
The following commit has been merged into the timers/core branch of tip:
Commit-ID: f7d43dd206e7e18c182f200e67a8db8c209907fa
Gitweb: https://git.kernel.org/tip/f7d43dd206e7e18c182f200e67a8db8c209907fa
Author: Yu Liao <liaoyu15(a)huawei.com>
AuthorDate: Thu, 11 Jul 2024 20:48:43 +08:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Thu, 11 Jul 2024 18:00:24 +02:00
tick/broadcast: Make takeover of broadcast hrtimer reliable
Running the LTP hotplug stress test on a aarch64 machine results in
rcu_sched stall warnings when the broadcast hrtimer was owned by the
un-plugged CPU. The issue is the following:
CPU1 (owns the broadcast hrtimer) CPU2
tick_broadcast_enter()
// shutdown local timer device
broadcast_shutdown_local()
...
tick_broadcast_exit()
clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)
// timer device is not programmed
cpumask_set_cpu(cpu, tick_broadcast_force_mask)
initiates offlining of CPU1
take_cpu_down()
/*
* CPU1 shuts down and does not
* send broadcast IPI anymore
*/
takedown_cpu()
hotplug_cpu__broadcast_tick_pull()
// move broadcast hrtimer to this CPU
clockevents_program_event()
bc_set_next()
hrtimer_start()
/*
* timer device is not programmed
* because only the first expiring
* timer will trigger clockevent
* device reprogramming
*/
What happens is that CPU2 exits broadcast mode with force bit set, then the
local timer device is not reprogrammed and CPU2 expects to receive the
expired event by the broadcast IPI. But this does not happen because CPU1
is offlined by CPU2. CPU switches the clockevent device to ONESHOT state,
but does not reprogram the device.
The subsequent reprogramming of the hrtimer broadcast device does not
program the clockevent device of CPU2 either because the pending expiry
time is already in the past and the CPU expects the event to be delivered.
As a consequence all CPUs which wait for a broadcast event to be delivered
are stuck forever.
Fix this issue by reprogramming the local timer device if the broadcast
force bit of the CPU is set so that the broadcast hrtimer is delivered.
[ tglx: Massage comment and change log. Add Fixes tag ]
Fixes: 989dcb645ca7 ("tick: Handle broadcast wakeup of multiple cpus")
Signed-off-by: Yu Liao <liaoyu15(a)huawei.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
---
kernel/time/tick-broadcast.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 771d1e0..b484309 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -1141,6 +1141,7 @@ void tick_broadcast_switch_to_oneshot(void)
#ifdef CONFIG_HOTPLUG_CPU
void hotplug_cpu__broadcast_tick_pull(int deadcpu)
{
+ struct tick_device *td = this_cpu_ptr(&tick_cpu_device);
struct clock_event_device *bc;
unsigned long flags;
@@ -1148,6 +1149,28 @@ void hotplug_cpu__broadcast_tick_pull(int deadcpu)
bc = tick_broadcast_device.evtdev;
if (bc && broadcast_needs_cpu(bc, deadcpu)) {
+ /*
+ * If the broadcast force bit of the current CPU is set,
+ * then the current CPU has not yet reprogrammed the local
+ * timer device to avoid a ping-pong race. See
+ * ___tick_broadcast_oneshot_control().
+ *
+ * If the broadcast device is hrtimer based then
+ * programming the broadcast event below does not have any
+ * effect because the local clockevent device is not
+ * running and not programmed because the broadcast event
+ * is not earlier than the pending event of the local clock
+ * event device. As a consequence all CPUs waiting for a
+ * broadcast event are stuck forever.
+ *
+ * Detect this condition and reprogram the cpu local timer
+ * device to avoid the starvation.
+ */
+ if (tick_check_broadcast_expired()) {
+ cpumask_clear_cpu(smp_processor_id(), tick_broadcast_force_mask);
+ tick_program_event(td->evtdev->next_event, 1);
+ }
+
/* This moves the broadcast assignment to this CPU: */
clockevents_program_event(bc, bc->next_event, 1);
}
No check is done on the size of the data to be transmiited. This causes
a kernel panic when this size exceeds the sg_miter's length.
Limit the number of transmitted bytes to sgm->length.
Cc: stable(a)vger.kernel.org
Fixes: ed01d210fd91 ("mmc: davinci_mmc: Use sg_miter for PIO")
Signed-off-by: Bastien Curutchet <bastien.curutchet(a)bootlin.com>
---
drivers/mmc/host/davinci_mmc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/mmc/host/davinci_mmc.c b/drivers/mmc/host/davinci_mmc.c
index d7427894e0bc..c302eb380e42 100644
--- a/drivers/mmc/host/davinci_mmc.c
+++ b/drivers/mmc/host/davinci_mmc.c
@@ -224,6 +224,9 @@ static void davinci_fifo_data_trans(struct mmc_davinci_host *host,
}
p = sgm->addr;
+ if (n > sgm->length)
+ n = sgm->length;
+
/* NOTE: we never transfer more than rw_threshold bytes
* to/from the fifo here; there's no I/O overlap.
* This also assumes that access width( i.e. ACCWD) is 4 bytes
--
2.45.0