From: Quang Le <quanglex97(a)gmail.com>
When packet_set_ring() releases po->bind_lock, another thread can
run packet_notifier() and process a NETDEV_UP event.
This race and the fix are both similar to those of commit 15fe076edea7
("net/packet: fix a race in packet_bind() and packet_notifier()").
There, too, a packet_notifier() NETDEV_UP event managed to run while a
po->bind_lock critical section was temporarily released, and the fix
was likewise to temporarily set po->num to zero to keep the socket
unhooked until the lock is retaken.
The po->bind_lock usage in packet_set_ring() and packet_notifier()
predates the start of git history.
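For clarity, the two critical sections in packet_set_ring() read as
follows after this change (a sketch abridged from the hunks below; the
comments are editorial):

        spin_lock(&po->bind_lock);
        was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING);
        num = po->num;

        /*
         * Clear po->num before dropping the lock, so that a concurrent
         * packet_notifier() NETDEV_UP cannot re-hook the socket while
         * the ring is being swapped.
         */
        WRITE_ONCE(po->num, 0);
        if (was_running)
                __unregister_prot_hook(sk, false);

        spin_unlock(&po->bind_lock);
        synchronize_net();

        /* ... swap the ring; bind_lock is not held here ... */

        spin_lock(&po->bind_lock);

        /* Restore po->num only once the lock is held again. */
        WRITE_ONCE(po->num, num);
        if (was_running)
                register_prot_hook(sk);

        spin_unlock(&po->bind_lock);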
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable(a)vger.kernel.org
Signed-off-by: Quang Le <quanglex97(a)gmail.com>
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
v1->v2:
- fix author attribution (From: at the top)
v1: https://lore.kernel.org/netdev/20250731175132.2592130-1-willemdebruijn.kern…
---
net/packet/af_packet.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index bc438d0d96a7..a7017d7f0927 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4573,10 +4573,10 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
spin_lock(&po->bind_lock);
was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING);
num = po->num;
- if (was_running) {
- WRITE_ONCE(po->num, 0);
+ WRITE_ONCE(po->num, 0);
+ if (was_running)
__unregister_prot_hook(sk, false);
- }
+
spin_unlock(&po->bind_lock);
synchronize_net();
@@ -4608,10 +4608,10 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
mutex_unlock(&po->pg_vec_lock);
spin_lock(&po->bind_lock);
- if (was_running) {
- WRITE_ONCE(po->num, num);
+ WRITE_ONCE(po->num, num);
+ if (was_running)
register_prot_hook(sk);
- }
+
spin_unlock(&po->bind_lock);
if (pg_vec && (po->tp_version > TPACKET_V2)) {
/* Because we don't support block-based V3 on tx-ring */
--
2.50.1.565.gc32cd1483b-goog
On systems using the hash MMU, there is a software SLB preload cache
that mirrors the entries loaded into the hardware SLB buffer. This
preload cache is subject to periodic eviction, typically after every
256 context switches, to remove old entries.
To optimize performance, the kernel skips switch_mmu_context() in
switch_mm_irqs_off() when the prev and next mm_struct are the same.
However, on hash MMU systems, this can lead to inconsistencies between
the hardware SLB and the software preload cache.
If an SLB entry for a process is evicted from the software cache on one
CPU, and the same process later runs on another CPU without executing
switch_mmu_context(), the hardware SLB may retain stale entries. If the
kernel then attempts to reload that entry, it can trigger an SLB
multi-hit error.
The following timeline shows how stale SLB entries are created and can
cause a multi-hit error when a process moves between CPUs without an
MMU context switch.
CPU 0                                   CPU 1
-----                                   -----
Process P
 exec                                   swapper/1
  load_elf_binary
   begin_new_exec
    activate_mm
     switch_mm_irqs_off
      switch_mmu_context
       switch_slb
       /*
        * This invalidates all
        * the entries in the HW
        * SLB and sets up the new
        * HW SLB entries as per
        * the preload cache.
        */
context_switch
sched_migrate_task migrates process P to cpu-1

Process swapper/0                       context switch (to process P)
(uses mm_struct of Process P)           switch_mm_irqs_off()
                                         switch_slb
                                          load_slb++
                                          /*
                                           * load_slb becomes 0 here and
                                           * we evict an entry from the
                                           * preload cache with
                                           * preload_age(). The HW SLB
                                           * and the preload cache stay
                                           * in sync, because switch_slb
                                           * evicts all HW SLB entries
                                           * anyway (SLBIA) and then adds
                                           * back only those entries
                                           * still present in the preload
                                           * cache after eviction.
                                           */
                                        load_elf_binary continues...
                                         setup_new_exec()
                                          slb_setup_new_exec()

                                        sched_switch event
                                        sched_migrate_task migrates
                                        process P to cpu-0

context_switch from swapper/0 to Process P
 switch_mm_irqs_off()
 /*
  * Since the prev and next mm_struct are the same, we don't call
  * switch_mmu_context(). This lets the HW SLB and the SW preload cache
  * go out of sync in preload_new_slb_context, because an SLB entry was
  * evicted from both the HW SLB and the preload cache on cpu-1. Later,
  * in preload_new_slb_context(), when we try to add the same preload
  * entry again, we add it to the SW preload cache and then to the HW
  * SLB. Since this entry was never invalidated on cpu-0, adding it to
  * the HW SLB causes an SLB multi-hit error.
  */
load_elf_binary continues...
 START_THREAD
  start_thread
   preload_new_slb_context
   /*
    * This tries to add a new EA to the preload cache that was earlier
    * evicted from both the cpu-1 HW SLB and its preload cache. The HW
    * SLB of cpu-0 is now out of sync with the SW preload cache. When we
    * context switched back on cpu-0, we should ideally have called
    * switch_mmu_context(), which would have brought the HW SLB entries
    * on cpu-0 in sync with the SW preload cache entries by setting up
    * the MMU context properly. But we didn't, because the prev
    * mm_struct running on cpu-0 was the same as the next mm_struct
    * (which is true for swapper / kernel threads). So when we now try
    * to add this entry to the HW SLB of cpu-0, we hit an SLB multi-hit
    * error.
    */
WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62
assert_slb_presence+0x2c/0x50
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12
VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected)
0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP: c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700 Not tainted (6.16.0-rc3-dirty)
MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28888482 XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
0x7fffceb5ffff (unreliable)
preload_new_slb_context+0x100/0x1a0
start_thread+0x26c/0x420
load_elf_binary+0x1b04/0x1c40
bprm_execve+0x358/0x680
do_execveat_common+0x1f8/0x240
sys_execve+0x58/0x70
system_call_exception+0x114/0x300
system_call_common+0x160/0x2c4
Fix this by always switching the MMU context on hash MMU if the SLB
preload cache has aged. With this change, the SLB multi-hit error no
longer occurs.
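For reference, the combined logic after this change reads roughly as
follows (a sketch abridged from the two hunks below; the comments are
editorial):

        /* arch/powerpc/mm/book3s64/slb.c: switch_slb() */
        tsk->thread.load_slb++;
        if (tsk->thread.load_slb == U8_MAX) {   /* was: !tsk->thread.load_slb */
                unsigned long pc = KSTK_EIP(tsk);

                preload_age(ti);        /* evict the oldest preload entry */
                /* ... */
        }

        /* arch/powerpc/mm/mmu_context.c: switch_mm_irqs_off() */
        /*
         * load_slb == U8_MAX means the preload cache has just aged, so
         * do not skip switch_mmu_context() even when prev == next; this
         * resynchronizes the HW SLB with the SW preload cache.
         */
        if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
                return;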
Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
Cc: stable(a)vger.kernel.org
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Suggested-by: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Signed-off-by: Donet Tom <donettom(a)linux.ibm.com>
---
v1 -> v2 : Changed commit message and added a comment in
switch_mm_irqs_off()
v1 - https://lore.kernel.org/all/20250731161027.966196-1-donettom@linux.ibm.com/
---
arch/powerpc/mm/book3s64/slb.c | 2 +-
arch/powerpc/mm/mmu_context.c | 7 +++++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index 6b783552403c..08daac3f978c 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
* SLB preload cache.
*/
tsk->thread.load_slb++;
- if (!tsk->thread.load_slb) {
+ if (tsk->thread.load_slb == U8_MAX) {
unsigned long pc = KSTK_EIP(tsk);
preload_age(ti);
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 3e3af29b4523..95455d787288 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -83,8 +83,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/* Some subarchs need to track the PGD elsewhere */
switch_mm_pgdir(tsk, next);
- /* Nothing else to do if we aren't actually switching */
- if (prev == next)
+ /*
+ * Nothing else to do if we aren't actually switching and
+ * the preload slb cache has not aged
+ */
+ if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
return;
/*
--
2.50.1
When handling non-swap entries in move_pages_pte(), the error paths for
entries that are NOT migration entries fail to unmap the page table
entries before jumping to the error handling label.
This results in a kmap/kunmap imbalance which on CONFIG_HIGHPTE systems
triggers a WARNING in kunmap_local_indexed() because the kmap stack is
corrupted.
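For background: with CONFIG_HIGHPTE, page-table pages may live in
highmem, so pte_offset_map() takes a kmap_local-style mapping that
pte_unmap() must release. A minimal sketch of the invariant (purely
illustrative, not the move_pages_pte() code; the names are hypothetical):

        pte_t *src_pte = pte_offset_map(src_pmd, src_addr); /* kmap */
        pte_t *dst_pte = pte_offset_map(dst_pmd, dst_addr); /* kmap */

        /*
         * kmap_local mappings form a per-CPU stack and must be released
         * in reverse (LIFO) order. Any path that leaves the function
         * with either mapping still in place leaves a stale slot on
         * that stack, and a later kunmap_local() trips the WARNING in
         * kunmap_local_indexed().
         */
        pte_unmap(dst_pte);
        pte_unmap(src_pte);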
Example call trace on ARM32 (CONFIG_HIGHPTE enabled):
WARNING: CPU: 1 PID: 633 at mm/highmem.c:622 kunmap_local_indexed+0x178/0x17c
Call trace:
kunmap_local_indexed from move_pages+0x964/0x19f4
move_pages from userfaultfd_ioctl+0x129c/0x2144
userfaultfd_ioctl from sys_ioctl+0x558/0xd24
The issue was introduced with the UFFDIO_MOVE feature but became more
frequent with the addition of guard pages (commit 7c53dfbdb024 ("mm: add
PTE_MARKER_GUARD PTE marker")), which made the non-migration-entry code
path more commonly executed during userfaultfd operations.
Fix this by ensuring PTEs are properly unmapped in all non-swap entry
paths before jumping to the error handling label, not just for migration
entries.
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
mm/userfaultfd.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 8253978ee0fb1..7c298e9cbc18f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1384,14 +1384,15 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
entry = pte_to_swp_entry(orig_src_pte);
if (non_swap_entry(entry)) {
+ pte_unmap(src_pte);
+ pte_unmap(dst_pte);
+ src_pte = dst_pte = NULL;
if (is_migration_entry(entry)) {
- pte_unmap(src_pte);
- pte_unmap(dst_pte);
- src_pte = dst_pte = NULL;
migration_entry_wait(mm, src_pmd, src_addr);
err = -EAGAIN;
- } else
+ } else {
err = -EFAULT;
+ }
goto out;
}
--
2.39.5
Hi,

I have found that the issue I originally fixed with
commit 8ff4fb276e23 ("pinctrl: amd: Clear GPIO debounce for suspend")
will be a problem on more platforms, as more designs switch to power
buttons controlled this way.

I think it would be a good idea to take it back to stable kernels.

Thanks!
Hi Greg,
The two patches below are needed on linux-5.15.y and linux-6.1.y;
please help add them to the stable tree.
b7a62611fab7 usb: chipidea: add USB PHY event
87ed257acb09 usb: phy: mxs: disconnect line when USB charger is attached
They are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git branch usb-testing
Thanks,
Xu Yang
+ stable
+ regressions
New subject
Great news.
Greg, Sasha,
Can you please pull these 3 commits into 6.6.y to fix a regression in
6.6.y that was reported by Morgan:
commit 12753d71e8c5 ("ACPI: CPPC: Add helper to get the highest
performance value")
commit ed429c686b79 ("cpufreq: amd-pstate: Enable amd-pstate preferred
core support")
commit 3d291fe47fe1 ("cpufreq: amd-pstate: fix the highest frequency
issue which limits performance")
Further details are below.
Thanks!
On 9/5/2024 16:09, Jones, Morgan wrote:
> Mario,
>
> Confirmed. Thank you for the help! Slightly different refs on my end:
>
> Remotes:
>
> next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (fetch)
> next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (push)
> origin git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (fetch)
> origin git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git (push)
> superm1 https://git.kernel.org/pub/scm/linux/kernel/git/superm1/linux.git/ (fetch)
> superm1 https://git.kernel.org/pub/scm/linux/kernel/git/superm1/linux.git/ (push)
> torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (fetch)
> torvalds git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (push)
>
> Patches:
>
> git format-patch 12753d71e8c5^..12753d71e8c5
> git format-patch f3a052391822b772b4e27f2594526cf1eb103cab^..f3a052391822b772b4e27f2594526cf1eb103cab
> git format-patch bf202e654bfa57fb8cf9d93d4c6855890b70b9c4^..bf202e654bfa57fb8cf9d93d4c6855890b70b9c4
>
> Results:
>
> Linux redact 6.6.48 #1-NixOS SMP PREEMPT_DYNAMIC Tue Jan 1 00:00:00 UTC 1980 x86_64 GNU/Linux
>
> analyzing CPU 56:
> driver: amd-pstate-epp
> CPUs which run at the same hardware frequency: 56
> CPUs which need to have their frequency coordinated by software: 56
> maximum transition latency: Cannot determine or is not supported.
> hardware limits: 400 MHz - 3.35 GHz
> available cpufreq governors: performance powersave
> current policy: frequency should be within 400 MHz and 3.35 GHz.
> The governor "performance" may decide which speed to use
> within this range.
> current CPU frequency: Unable to call hardware
> current CPU frequency: 2.09 GHz (asserted by call to kernel)
> boost state support:
> Supported: yes
> Active: yes
> AMD PSTATE Highest Performance: 255. Maximum Frequency: 3.35 GHz.
> AMD PSTATE Nominal Performance: 152. Nominal Frequency: 2.00 GHz.
> AMD PSTATE Lowest Non-linear Performance: 115. Lowest Non-linear Frequency: 1.51 GHz.
> AMD PSTATE Lowest Performance: 31. Lowest Frequency: 400 MHz.
>
> And our builds are back to being fast with `amd_pstate=active amd_prefcore=enable amd_pstate.shared_mem=1`.
>
> Morgan
>
> -----Original Message-----
> From: Mario Limonciello <mario.limonciello(a)amd.com>
> Sent: Thursday, September 5, 2024 8:12 AM
> To: Jones, Morgan <Morgan.Jones(a)viasat.com>
> Cc: linux-pm(a)vger.kernel.org; linux-kernel(a)vger.kernel.org; David Arcari <darcari(a)redhat.com>; Dhananjay Ugwekar <Dhananjay.Ugwekar(a)amd.com>; rafael(a)kernel.org; viresh.kumar(a)linaro.org; gautham.shenoy(a)amd.com; perry.yuan(a)amd.com; skhan(a)linuxfoundation.org; li.meng(a)amd.com; ray.huang(a)amd.com
> Subject: Re: [EXTERNAL] Re: [PATCH v2 2/2] cpufreq/amd-pstate: Fix the scaling_max_freq setting on shared memory CPPC systems
>
> Hi Morgan,
>
> Please apply these 3 commits:
>
> commit 12753d71e8c5 ("ACPI: CPPC: Add helper to get the highest performance value")
> commit ed429c686b79 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
> commit 3d291fe47fe1 ("cpufreq: amd-pstate: fix the highest frequency issue which limits performance")
>
> The first two should help your system, the third will prevent introducing a regression on a different one.
>
> Assuming that works we should ask @stable to pull all 3 in to fix this regression.
>
> Thanks,
>
> On 9/4/2024 08:57, Mario Limonciello wrote:
>> Morgan,
>>
>> I was referring specifically to the version that landed in Linus' tree:
>> https://git.kernel.org/torvalds/c/8164f7433264
>>
>> But yeah it's effectively the same thing. In any case, it's not the
>> solution.
>>
>> We had some internal discussion and suspect this is due to missing
>> prefcore patches in 6.6 as that feature landed in 6.9. We'll try to
>> reproduce this on a Rome system and come back with our findings and
>> suggestions what to do.
>>
>> Thanks,
>>
>