There is a race between the laundromat's handling of revoked delegations
and a client sending a free_stateid operation. The laundromat thread
finds that a delegation has expired and needs to be revoked, so it marks
the delegation stid revoked and puts it on a reaper list, but then it
unlocks the state lock and the actual delegation revocation happens
without the lock. Once the stid is marked revoked, a racing free_stateid
processing thread does the following: (1) it calls list_del_init(),
which removes the delegation from the reaper list, and (2) it frees the
delegation stid structure. The laundromat thread ends up never calling
revoke_delegation() for this particular delegation, which means it never
releases the lease that exists on the file.
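A rough interleaving of the two threads, as described above (a sketch of
the sequence, not the exact code):
/*
 *  laundromat thread                        free_stateid thread
 *  -----------------                        -------------------
 *  spin_lock(&state_lock);
 *  unhash_delegation_locked(dp, SC_STATUS_REVOKED);
 *  list_add(&dp->dl_recall_lru, &reaplist);
 *  spin_unlock(&state_lock);
 *                                           sees SC_STATUS_REVOKED
 *                                           list_del_init(&dp->dl_recall_lru);
 *                                           nfs4_put_stid(s);  <- frees dp
 *  reaplist is now empty, so revoke_delegation(dp)
 *  is never called and the lease on the file is
 *  never released
 */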
Now, a new open for this file comes in and finds that the lease list
isn't empty, so it calls nfsd_breaker_owns_lease(), which ends up
dereferencing the freed delegation stateid, leading to the following
use-after-free KASAN report:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: leases_conflict+0x68/0x370
kernel: __break_lease+0x204/0xc38
kernel: nfsd_open_break_lease+0x8c/0xf0 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
Fix this by having the laundromat thread hold the state_lock from
marking through revoking the delegation, and by making free_stateid
acquire the state_lock before accessing the reaper list. This makes sure
that revoke_delegation() (i.e. the kernel_setlease() unlock) is called
for every delegation that was revoked and added to the reaper list.
CC: stable(a)vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev(a)redhat.com>
---
I can't figure out the Fixes: tag. The laundromat's behaviour has been
like that forever, but the free_stateid bits won't apply before
1e3577a4521e ("SUNRPC: discard sv_refcnt, and svc_get/svc_put"), and we
already used that Fixes: tag with a previous fix for a different
problem.
---
fs/nfsd/nfs4state.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 9c2b1d251ab3..c97907d7fb38 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -6605,13 +6605,13 @@ nfs4_laundromat(struct nfsd_net *nn)
unhash_delegation_locked(dp, SC_STATUS_REVOKED);
list_add(&dp->dl_recall_lru, &reaplist);
}
- spin_unlock(&state_lock);
while (!list_empty(&reaplist)) {
dp = list_first_entry(&reaplist, struct nfs4_delegation,
dl_recall_lru);
list_del_init(&dp->dl_recall_lru);
revoke_delegation(dp);
}
+ spin_unlock(&state_lock);
spin_lock(&nn->client_lock);
while (!list_empty(&nn->close_lru)) {
@@ -7213,7 +7213,9 @@ nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if (s->sc_status & SC_STATUS_REVOKED) {
spin_unlock(&s->sc_lock);
dp = delegstateid(s);
+ spin_lock(&state_lock);
list_del_init(&dp->dl_recall_lru);
+ spin_unlock(&state_lock);
spin_unlock(&cl->cl_lock);
nfs4_put_stid(s);
ret = nfs_ok;
--
2.43.5
As explained in the comment block this change adds, we can't tell what
userspace's intent is when the stack grows towards an inaccessible VMA.
We should ensure that, as long as code is compiled with something like
-fstack-check, a stack overflow in such code can never cause the main stack
to overflow into adjacent heap memory - so the bottom of a stack should
never be directly adjacent to an accessible VMA.
As suggested by Lorenzo, enforce this by blocking attempts to:
- make an inaccessible VMA accessible with mprotect() when it is too close
to a stack
- replace an inaccessible VMA with another VMA using MAP_FIXED when it is
too close to a stack
I have a (highly contrived) C testcase for 32-bit x86 userspace with glibc
that mixes malloc(), pthread creation, and recursion in just the right way
such that the main stack overflows into malloc() arena memory; see the
linked mailing list post.
I don't know of any specific scenario where this is actually exploitable,
but it seems like it could be a security problem for sufficiently unlucky
userspace.
Link: https://lore.kernel.org/r/CAG48ez2v=r9-37JADA5DgnZdMLCjcbVxAjLt5eH5uoBohRdq…
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Fixes: 561b5e0709e4 ("mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jann Horn <jannh(a)google.com>
---
This is an attempt at the alternate fix approach suggested by Lorenzo.
This turned into more code than I would prefer to have for a scenario
like this...
Also, the way the gap is enforced in my patch for MAP_FIXED_NOREPLACE is
a bit ugly. In the existing code, __get_unmapped_area() normally already
enforces the stack gap even when it is called with a hint; but when
MAP_FIXED_NOREPLACE is used, we kinda lie to __get_unmapped_area() and
tell it we'll do a MAP_FIXED mapping (introduced in commit
a4ff8e8620d3f when MAP_FIXED_NOREPLACE was created), then afterwards
manually reject overlapping mappings.
So I ended up also doing the gap check separately for
MAP_FIXED_NOREPLACE.
The following test program exercises scenarios that could lead to the
stack becoming directly adjacent to another accessible VMA,
and passes with this patch applied:
<<<
#define _GNU_SOURCE
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumed helper, not part of the original snippet: read (roughly) the
 * current stack pointer; __builtin_frame_address(0) is close enough here. */
#define STACK_POINTER() ((unsigned long)__builtin_frame_address(0))

int main(void) {
setbuf(stdout, NULL);
char *ptr = (char*)( (unsigned long)(STACK_POINTER() - (1024*1024*4)/*4MiB*/) & ~0xfffUL );
if (mmap(ptr, 0x1000, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0) != ptr)
err(1, "mmap distant-from-stack");
*(volatile char *)(ptr + 0x1000); /* expand stack */
system("echo;cat /proc/$PPID/maps;echo");
/* test transforming PROT_NONE mapping adjacent to stack */
if (mprotect(ptr, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC) == 0)
errx(1, "mprotect adjacent to stack allowed");
if (mmap(ptr, 0x1000, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) != MAP_FAILED)
errx(1, "MAP_FIXED adjacent to stack allowed");
if (munmap(ptr, 0x1000))
err(1, "munmap failed???");
/* test creating new mapping adjacent to stack */
if (mmap(ptr, 0x1000, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0) != MAP_FAILED)
errx(1, "MAP_FIXED_NOREPLACE adjacent to stack allowed");
if (mmap(ptr, 0x1000, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0) == ptr)
errx(1, "mmap hint adjacent to stack accepted");
printf("all tests passed\n");
}
>>>
---
Changes in v2:
- Entirely new approach (suggested by Lorenzo)
- Link to v1: https://lore.kernel.org/r/20241008-stack-gap-inaccessible-v1-1-848d4d891f21…
---
include/linux/mm.h | 1 +
mm/mmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/mprotect.c | 6 +++++
3 files changed, 72 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecf63d2b0582..ecd4afc304ca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3520,6 +3520,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
struct vm_area_struct *find_extend_vma_locked(struct mm_struct *,
unsigned long addr);
+bool overlaps_stack_gap(struct mm_struct *mm, unsigned long addr, unsigned long len);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/mmap.c b/mm/mmap.c
index dd4b35a25aeb..937361be3c48 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -359,6 +359,20 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return -EEXIST;
}
+ /*
+ * This does two things:
+ *
+ * 1. Disallow MAP_FIXED replacing a PROT_NONE VMA adjacent to a stack
+ * with an accessible VMA.
+ * 2. Disallow MAP_FIXED_NOREPLACE creating a new accessible VMA
+ * adjacent to a stack.
+ */
+ if ((flags & (MAP_FIXED_NOREPLACE | MAP_FIXED)) &&
+ (prot & (PROT_READ | PROT_WRITE | PROT_EXEC)) &&
+ !(vm_flags & (VM_GROWSUP|VM_GROWSDOWN)) &&
+ overlaps_stack_gap(mm, addr, len))
+ return (flags & MAP_FIXED) ? -ENOMEM : -EEXIST;
+
if (flags & MAP_LOCKED)
if (!can_do_mlock())
return -EPERM;
@@ -1341,6 +1355,57 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
return vma;
}
+/*
+ * Does the specified VA range overlap the stack gap of a preceding or following
+ * stack VMA?
+ * Overlapping stack VMAs are ignored - so if someone deliberately creates a
+ * MAP_FIXED mapping in the middle of a stack or such, we let that go through.
+ *
+ * This is needed partly because userspace's intent when making PROT_NONE
+ * mappings is unclear; there are two different reasons for creating PROT_NONE
+ * mappings:
+ *
+ * A) Userspace wants to create its own guard mapping, for example for stacks.
+ * According to
+ * <https://lore.kernel.org/all/1499126133.2707.20.camel@decadent.org.uk/T/>,
+ * some Rust/Java programs do this with the main stack.
+ * Enforcing the kernel's stack gap between these userspace guard mappings and
+ * the main stack breaks stuff.
+ *
+ * B) Userspace wants to reserve some virtual address space for later mappings.
+ * This is done by memory allocators.
+ * In this case, we want to enforce a stack gap between the mapping and the
+ * stack.
+ *
+ * Because we can't tell these cases apart when a PROT_NONE mapping is created,
+ * we instead enforce the stack gap when a PROT_NONE mapping is made accessible
+ * (using mprotect()) or replaced with an accessible one (using MAP_FIXED).
+ */
+bool overlaps_stack_gap(struct mm_struct *mm, unsigned long addr, unsigned long len)
+{
+
+ struct vm_area_struct *vma, *prev_vma;
+
+ /* step 1: search for a non-overlapping following stack VMA */
+ vma = find_vma(mm, addr+len);
+ if (vma && vma->vm_start >= addr+len) {
+ /* is it too close? */
+ if (vma->vm_start - (addr+len) < stack_guard_start_gap(vma))
+ return true;
+ }
+
+ /* step 2: search for a non-overlapping preceding stack VMA */
+ if (!IS_ENABLED(CONFIG_STACK_GROWSUP))
+ return false;
+ vma = find_vma_prev(mm, addr, &prev_vma);
+ /* don't handle cases where the VA start overlaps a VMA */
+ if (vma && vma->vm_start < addr)
+ return false;
+ if (!prev_vma || !(prev_vma->vm_flags & VM_GROWSUP))
+ return false;
+ return addr - prev_vma->vm_end < stack_guard_gap;
+}
+
/* do_munmap() - Wrapper function for non-maple tree aware do_munmap() calls.
* @mm: The mm_struct
* @start: The start address to munmap
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..2300e2eff956 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -772,6 +772,12 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
}
}
+ error = -ENOMEM;
+ if ((prot & (PROT_READ | PROT_WRITE | PROT_EXEC)) &&
+ !(vma->vm_flags & (VM_GROWSUP|VM_GROWSDOWN)) &&
+ overlaps_stack_gap(current->mm, start, end - start))
+ goto out;
+
prev = vma_prev(&vmi);
if (start > vma->vm_start)
prev = vma;
---
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
change-id: 20241008-stack-gap-inaccessible-c7319f7d4b1b
--
Jann Horn <jannh(a)google.com>
A problem was introduced with commit f69759be251d ("x86/CPU/AMD: Move
Zenbleed check to the Zen2 init function") where a bit in the DE_CFG MSR
is getting set after a microcode late load.
The problem seems to be that the microcode late load path calls into
amd_check_microcode() and subsequently zen2_zenbleed_check(). Since the
patch removes the cpu_has_amd_erratum() check from zen2_zenbleed_check(),
this will cause all non-Zen2 CPUs to go through the function and set
the bit in the DE_CFG MSR.
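Condensed, the path described above looks roughly like this (a sketch
based on this description, not the exact code):
/*
 * microcode late load
 *   -> amd_check_microcode()
 *        -> on_each_cpu(zenbleed_check_cpu, NULL, 1)   runs on every CPU
 *             -> zen2_zenbleed_check()                 sets the DE_CFG bit,
 *                                                      now even on non-Zen2 parts
 */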
Call into the zenbleed fix path on Zen2 CPUs only.
Fixes: f69759be251d ("x86/CPU/AMD: Move Zenbleed check to the Zen2 init function")
Cc: <stable(a)vger.kernel.org>
Acked-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Signed-off-by: John Allen <john.allen(a)amd.com>
---
v2:
- Clean up commit description
---
arch/x86/kernel/cpu/amd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 015971adadfc..368344e1394b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1202,5 +1202,6 @@ void amd_check_microcode(void)
if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
return;
- on_each_cpu(zenbleed_check_cpu, NULL, 1);
+ if (boot_cpu_has(X86_FEATURE_ZEN2))
+ on_each_cpu(zenbleed_check_cpu, NULL, 1);
}
--
2.34.1
'new_map' is allocated using devm_*, which takes care of freeing the
allocated data on device removal. The
.dt_free_map = pinconf_generic_dt_free_map
callback then frees the map a second time, because
pinconf_generic_dt_free_map() calls pinctrl_utils_free_map().
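A simplified sketch of the two free paths, assuming the generic pinconf
helpers behave as described above (not the exact driver code):
/*
 * allocation (before this patch):
 *   new_map = devm_kcalloc(pctldev->dev, map_num, sizeof(*new_map), GFP_KERNEL);
 *       -> devres will kfree() this buffer again on device removal
 *
 * map teardown by the pinctrl core:
 *   ops->dt_free_map == pinconf_generic_dt_free_map
 *       -> pinctrl_utils_free_map(pctldev, map, num_maps)
 *           -> kfree(map)                      first free
 *
 * device removal:
 *   devres release                             second free of the same buffer
 */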
Fix this by using kcalloc() instead of auto-managed devm_kcalloc().
Cc: stable(a)vger.kernel.org
Fixes: f805e356313b ("pinctrl: nuvoton: Add ma35d1 pinctrl and GPIO driver")
Reported-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli(a)oracle.com>
---
This is based on static analysis and reading the code; it is only compile-tested.
Added the stable tag, as the commit in Fixes: is also in 6.11.y.
---
drivers/pinctrl/nuvoton/pinctrl-ma35.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pinctrl/nuvoton/pinctrl-ma35.c b/drivers/pinctrl/nuvoton/pinctrl-ma35.c
index 1fa00a23534a..59c4e7c6cdde 100644
--- a/drivers/pinctrl/nuvoton/pinctrl-ma35.c
+++ b/drivers/pinctrl/nuvoton/pinctrl-ma35.c
@@ -218,7 +218,7 @@ static int ma35_pinctrl_dt_node_to_map_func(struct pinctrl_dev *pctldev,
}
map_num += grp->npins;
- new_map = devm_kcalloc(pctldev->dev, map_num, sizeof(*new_map), GFP_KERNEL);
+ new_map = kcalloc(map_num, sizeof(*new_map), GFP_KERNEL);
if (!new_map)
return -ENOMEM;
--
2.39.3
Hi,
The following commit is needed to make the amd_pmc and amd_pmf drivers
work properly on AMD family 1Ah model 60h processors:
59c34008d3bd x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h
Please add this commit to the 6.11.y stable kernel. Thanks!
Regards,
Richard
If USB virtualization is enabled, USB2 ports are shared between all
Virtual Functions. The number of USB2 ports owned by a Virtual
Function's USB2 root hub may be less than the total number of USB2 PHYs
supported by the Tegra XUSB controller.
Using the total USB2 PHY count as the port count when checking all
PORTSC values can therefore cause an invalid memory access.
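Roughly, the mismatch looks like this (a sketch; the counts are
illustrative, the names come from the driver):
/*
 * tegra->num_usb_phys        total USB2 PHYs on the controller (e.g. 4)
 * xhci->usb2_rhub.num_ports  USB2 ports owned by this VF's root hub (e.g. 2)
 *
 * for (i = 0; i < tegra->num_usb_phys; i++)        old loop bound
 *         portsc = readl(xhci->usb2_rhub.ports[i]->addr);
 *                                                  reads past ports[] once
 *                                                  i >= num_ports
 */
The out-of-bounds access shows up as the following crash: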
[ 116.923438] Unable to handle kernel paging request at virtual address 006c622f7665642f
...
[ 117.213640] Call trace:
[ 117.216783] tegra_xusb_enter_elpg+0x23c/0x658
[ 117.222021] tegra_xusb_runtime_suspend+0x40/0x68
[ 117.227260] pm_generic_runtime_suspend+0x30/0x50
[ 117.232847] __rpm_callback+0x84/0x3c0
[ 117.237038] rpm_suspend+0x2dc/0x740
[ 117.241229] pm_runtime_work+0xa0/0xb8
[ 117.245769] process_scheduled_works+0x24c/0x478
[ 117.251007] worker_thread+0x23c/0x328
[ 117.255547] kthread+0x104/0x1b0
[ 117.259389] ret_from_fork+0x10/0x20
[ 117.263582] Code: 54000222 f9461ae8 f8747908 b4ffff48 (f9400100)
Cc: <stable(a)vger.kernel.org> # v6.3+
Fixes: a30951d31b25 ("xhci: tegra: USB2 pad power controls")
Signed-off-by: Henry Lin <henryl(a)nvidia.com>
---
drivers/usb/host/xhci-tegra.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-tegra.c b/drivers/usb/host/xhci-tegra.c
index 6246d5ad1468..76f228e7443c 100644
--- a/drivers/usb/host/xhci-tegra.c
+++ b/drivers/usb/host/xhci-tegra.c
@@ -2183,7 +2183,7 @@ static int tegra_xusb_enter_elpg(struct tegra_xusb *tegra, bool runtime)
goto out;
}
- for (i = 0; i < tegra->num_usb_phys; i++) {
+ for (i = 0; i < xhci->usb2_rhub.num_ports; i++) {
if (!xhci->usb2_rhub.ports[i])
continue;
portsc = readl(xhci->usb2_rhub.ports[i]->addr);
--
2.25.1
Until the TSADC, thermal zones, thermal trips and cooling maps are defined
in the RK3308 SoC dtsi, none of the CPU OPPs except the slowest one may be
enabled under any circumstances. Allowing DVFS to scale the CPU cores up
without even the critical CPU thermal trip in place can rather easily
result in thermal runaway and damaged SoCs.
Thus, leave only the lowest available CPU OPP enabled for now.
Fixes: 6913c45239fd ("arm64: dts: rockchip: Add core dts for RK3308 SOC")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dragan Simic <dsimic(a)manjaro.org>
---
arch/arm64/boot/dts/rockchip/rk3308.dtsi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/boot/dts/rockchip/rk3308.dtsi b/arch/arm64/boot/dts/rockchip/rk3308.dtsi
index 31c25de2d689..a7698e1f6b9e 100644
--- a/arch/arm64/boot/dts/rockchip/rk3308.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3308.dtsi
@@ -120,16 +120,19 @@ opp-600000000 {
opp-hz = /bits/ 64 <600000000>;
opp-microvolt = <950000 950000 1340000>;
clock-latency-ns = <40000>;
+ status = "disabled";
};
opp-816000000 {
opp-hz = /bits/ 64 <816000000>;
opp-microvolt = <1025000 1025000 1340000>;
clock-latency-ns = <40000>;
+ status = "disabled";
};
opp-1008000000 {
opp-hz = /bits/ 64 <1008000000>;
opp-microvolt = <1125000 1125000 1340000>;
clock-latency-ns = <40000>;
+ status = "disabled";
};
};