Since commit 25f39d3dcb48 ("s390/pci: Ignore RID for isolated VFs") PFs
which are not initially configured but in standby are considered
isolated. That is, they create only a single-function PCI domain. Due to
the PCI domains being created on discovery, this means that even if they
are configured later on, sibling PFs and their child VFs will not be
added to their PCI domain, breaking SR-IOV expectations.

The reason the referenced commit ignored standby PFs for the creation of
multi-function PCI subhierarchies was to work around a PCI domain
renumbering scenario on reboot. The renumbering would occur after
removing a PF previously in standby, whose domain number is used for its
configured sibling PFs and their child VFs, but which itself remained in
standby. When this is followed by a reboot, the sibling PF is used
instead to determine the PCI domain number of itself and its child VFs.

In principle it is not possible to know which standby PFs will be
configured later and which may be removed. The PCI domain and root bus
are prerequisites for hotplug slots, so the decision of which functions
belong to which domain cannot be postponed. With the renumbering
occurring only in rare circumstances and being generally benign, accept
it as an oddity and fix SR-IOV for initially standby PFs simply by
allowing them to create PCI domains.

Cc: stable(a)vger.kernel.org
Reviewed-by: Gerd Bayer <gbayer(a)linux.ibm.com>
Fixes: 25f39d3dcb48 ("s390/pci: Ignore RID for isolated VFs")
Signed-off-by: Niklas Schnelle <schnelle(a)linux.ibm.com>
---
Changes in v3:
- Add R-b from Gerd
- Add Cc: stable…
- Add commas (Sandy)
Changes in v2:
- Reword commit message
---
arch/s390/pci/pci_bus.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/s390/pci/pci_bus.c b/arch/s390/pci/pci_bus.c
index d5ace00d10f04285f899284481f1e426187d4ff4..857afbc4828f0c677f88cc80dd4a5fff104a615a 100644
--- a/arch/s390/pci/pci_bus.c
+++ b/arch/s390/pci/pci_bus.c
@@ -171,7 +171,6 @@ void zpci_bus_scan_busses(void)
static bool zpci_bus_is_multifunction_root(struct zpci_dev *zdev)
{
return !s390_pci_no_rid && zdev->rid_available &&
- zpci_is_device_configured(zdev) &&
!zdev->vfn;
}
---
base-commit: 6b7afe1a2b6905e42fe45bd7015f20baa856e28e
change-id: 20250116-fix_standby_pf-e1d51394e9b3
Best regards,
--
Niklas Schnelle
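
For readers who want to see the effect of the one-line change in
isolation, here is a minimal user-space sketch of the predicate before
and after the patch. The struct and the function names are hypothetical
stand-ins (only the fields read by zpci_bus_is_multifunction_root() are
modelled), and s390_pci_no_rid is passed in as a plain parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few struct zpci_dev fields that matter. */
struct model_zdev {
	bool rid_available;	/* function reports a usable RID */
	bool configured;	/* false while the function is in standby */
	unsigned int vfn;	/* 0 for a PF, non-zero for a VF */
};

/* Before the patch: a standby PF could never be a multifunction root. */
static bool is_multifunction_root_old(bool no_rid, const struct model_zdev *zdev)
{
	assert(zdev != NULL);
	return !no_rid && zdev->rid_available && zdev->configured && !zdev->vfn;
}

/* After the patch: the configuration state is no longer consulted. */
static bool is_multifunction_root_new(bool no_rid, const struct model_zdev *zdev)
{
	assert(zdev != NULL);
	return !no_rid && zdev->rid_available && !zdev->vfn;
}
```

With a standby PF (configured == false, vfn == 0) the old predicate is
false and the new one is true, which is what lets the PF create a
multi-function domain that its siblings and VFs can join later.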
On pSeries, when a user attempts to use a vfio container that is already
used by a different iommu group, spapr_tce_set_window() returns -EPERM
and the subsequent cleanup leads to the crash below.
Kernel attempted to read user page (308) - exploit attempt?
BUG: Kernel NULL pointer dereference on read at 0x00000308
Faulting instruction address: 0xc0000000001ce358
Oops: Kernel access of bad area, sig: 11 [#1]
NIP: c0000000001ce358 LR: c0000000001ce05c CTR: c00000000005add0
<snip>
NIP [c0000000001ce358] spapr_tce_unset_window+0x3b8/0x510
LR [c0000000001ce05c] spapr_tce_unset_window+0xbc/0x510
Call Trace:
spapr_tce_unset_window+0xbc/0x510 (unreliable)
tce_iommu_attach_group+0x24c/0x340 [vfio_iommu_spapr_tce]
vfio_container_attach_group+0xec/0x240 [vfio]
vfio_group_fops_unl_ioctl+0x548/0xb00 [vfio]
sys_ioctl+0x754/0x1580
system_call_exception+0x13c/0x330
system_call_vectored_common+0x15c/0x2ec
<snip>
--- interrupt: 3000
Fix this by adding a NULL check for the tbl passed to
spapr_tce_unset_window().
Fixes: f431a8cde7f1 ("powerpc/iommu: Reimplement the iommu_table_group_ops for pSeries")
Cc: stable(a)vger.kernel.org
Reported-by: Vaishnavi Bhat <vaish123(a)in.ibm.com>
Signed-off-by: Shivaprasad G Bhat <sbhat(a)linux.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 534cd159e9ab..78b895b568b3 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -2205,6 +2205,9 @@ static long spapr_tce_unset_window(struct iommu_table_group *table_group, int nu
const char *win_name;
int ret = -ENODEV;
+ if (!tbl) /* The table was never created OR window was never opened */
+ return 0;
+
mutex_lock(&dma_win_init_mutex);
if ((num == 0) && is_default_window_table(table_group, tbl))
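
The pattern used by the fix, a teardown helper that treats "nothing was
set up" as success so that error paths can call it unconditionally, can
be sketched in user space as follows. All types and names here are
hypothetical and heavily reduced; this is not the pseries code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct model_table {
	int in_use;
};

struct model_group {
	struct model_table *tables[2];	/* NULL until a window is created */
};

/*
 * Tear down window 'num'. Returns 0 when there is nothing to undo, so
 * an error path may call it even if the matching setup never ran.
 */
static long model_unset_window(struct model_group *group, int num)
{
	struct model_table *tbl;

	assert(group != NULL && num >= 0 && num < 2);
	tbl = group->tables[num];
	if (!tbl)	/* never created: nothing to release */
		return 0;

	tbl->in_use = 0;	/* would fault here if tbl were NULL */
	group->tables[num] = NULL;
	return 0;
}
```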
The PE Reset State "0" returned by RTAS calls
"ibm_read_slot_reset_[state|state2]" indicates that the reset is
deactivated and the PE is in a state where MMIO and DMA are allowed.
However, the current implementation of "pseries_eeh_get_state()" does
not reflect this, causing drivers to incorrectly assume that MMIO and
DMA operations cannot be resumed.
Userspace drivers that perform EEH recovery through VFIO ioctls fail to
detect when the recovery process is complete. The VFIO_EEH_PE_GET_STATE
ioctl does not report the expected EEH_PE_STATE_NORMAL state, preventing
userspace drivers from functioning properly on pseries systems.
The patch addresses this issue by updating 'pseries_eeh_get_state()'
to include "EEH_STATE_MMIO_ENABLED" and "EEH_STATE_DMA_ENABLED" in
the result mask for PE Reset State "0". This ensures correct state
reporting to the callers, aligning the behavior with the PAPR specification
and fixing the bug in EEH recovery for VFIO user workflows.
Fixes: 00ba05a12b3c ("powerpc/pseries: Cleanup on pseries_eeh_get_state()")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Signed-off-by: Narayana Murty N <nnmlinux(a)linux.ibm.com>
---
Changelog:
V1:https://lore.kernel.org/all/20241107042027.338065-1-nnmlinux@linux.ibm.c…
--added Fixes tag for "powerpc/pseries: Cleanup on
pseries_eeh_get_state()".
V2:https://lore.kernel.org/stable/20241212075044.10563-1-nnmlinux%40linux.i…
--Updated the patch description to include it in the stable kernel tree.
V3:https://lore.kernel.org/all/87v7vm8pwz.fsf@gmail.com/
--Updated commit description.
---
arch/powerpc/platforms/pseries/eeh_pseries.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 1893f66371fa..b12ef382fec7 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -580,8 +580,10 @@ static int pseries_eeh_get_state(struct eeh_pe *pe, int *delay)
switch(rets[0]) {
case 0:
- result = EEH_STATE_MMIO_ACTIVE |
- EEH_STATE_DMA_ACTIVE;
+ result = EEH_STATE_MMIO_ACTIVE |
+ EEH_STATE_DMA_ACTIVE |
+ EEH_STATE_MMIO_ENABLED |
+ EEH_STATE_DMA_ENABLED;
break;
case 1:
result = EEH_STATE_RESET_ACTIVE |
--
2.47.1
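
To illustrate why the two extra bits matter, here is a small user-space
sketch. It assumes that the caller only reports a "normal" PE when both
ENABLED bits are present in the result mask; the bit values, the enum
and the classify function are hypothetical and are not the kernel's
EEH_STATE_* definitions or its actual caller logic.

```c
#include <assert.h>

/* Hypothetical bit values; the kernel's EEH_STATE_* constants may differ. */
enum {
	MODEL_MMIO_ACTIVE  = 1 << 0,
	MODEL_DMA_ACTIVE   = 1 << 1,
	MODEL_MMIO_ENABLED = 1 << 2,
	MODEL_DMA_ENABLED  = 1 << 3,
	MODEL_RESET_ACTIVE = 1 << 4,
};

enum model_pe_state { MODEL_PE_NORMAL, MODEL_PE_RESET, MODEL_PE_STOPPED };

/* Result mask for PE Reset State 0 before the patch. */
static int model_state0_old(void)
{
	return MODEL_MMIO_ACTIVE | MODEL_DMA_ACTIVE;
}

/* Result mask for PE Reset State 0 after the patch. */
static int model_state0_new(void)
{
	return MODEL_MMIO_ACTIVE | MODEL_DMA_ACTIVE |
	       MODEL_MMIO_ENABLED | MODEL_DMA_ENABLED;
}

/* Assumed caller logic: "normal" requires both ENABLED bits. */
static enum model_pe_state model_classify(int result)
{
	assert(result >= 0);
	if (result & MODEL_RESET_ACTIVE)
		return MODEL_PE_RESET;
	if ((result & MODEL_MMIO_ENABLED) && (result & MODEL_DMA_ENABLED))
		return MODEL_PE_NORMAL;
	return MODEL_PE_STOPPED;
}
```

Under that assumption the old mask is classified as stopped even though
the reset has been deactivated, while the new mask is classified as
normal.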
There is a period of time after returning from a KVM_RUN ioctl where
userspace may use SVE without trapping, but the kernel can unexpectedly
discard the live SVE state. Eric Auger has observed this causing QEMU
crashes where SVE is used by memmove():
https://issues.redhat.com/browse/RHEL-68997
The only state discarded is the user SVE state of the task which issued
the KVM_RUN ioctl. Other tasks are unaffected, plain FPSIMD state is
unaffected, and kernel state is unaffected.
This happens because fpsimd_kvm_prepare() incorrectly manipulates the
FPSIMD/SVE state. When the vCPU is loaded, fpsimd_kvm_prepare()
unconditionally clears TIF_SVE but does not reconfigure CPACR_EL1.ZEN to
trap userspace SVE usage. If the vCPU does not use FPSIMD/SVE and hyp
does not save the host's FPSIMD/SVE state, the kernel may return to
userspace with TIF_SVE clear while SVE is still enabled in
CPACR_EL1.ZEN. Subsequent userspace usage of SVE will not be trapped,
and the next save of userspace FPSIMD/SVE state will only store the
FPSIMD portion due to TIF_SVE being clear, discarding any SVE state.
The broken logic was originally introduced in commit:
93ae6b01bafee8fa ("KVM: arm64: Discard any SVE state when entering KVM guests")
... though at the time fp_user_discard() would reconfigure CPACR_EL1.ZEN
to trap subsequent SVE usage, masking the issue until that logic was
removed in commit:
8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Avoid this issue by reconfiguring CPACR_EL1.ZEN when clearing
TIF_SVE. At the same time, add a comment to explain why
current->thread.fp_type must be set even though the FPSIMD state is not
foreign. A similar issue exists when SME is enabled, and will require
further rework. As SME currently depends on BROKEN, a BUILD_BUG_ON() and
comment are added for now, and this issue will need to be fixed properly
in a follow-up patch.
Commit 93ae6b01bafee8fa also introduced an unintended ptrace ABI change.
Unconditionally clearing TIF_SVE regardless of whether the state is
foreign discards saved SVE state created by ptrace after syscall entry.
Avoid this by only clearing TIF_SVE when the FPSIMD/SVE state is not
foreign. When the state is foreign, KVM hyp code does not need to save
any host state, and so this will not affect KVM.
There appear to be further issues with unintentional SVE state
discarding, largely impacting ptrace and signal handling, which will
need to be addressed in separate patches.
Reported-by: Eric Auger <eauger(a)redhat.com>
Reported-by: Wilco Dijkstra <wilco.dijkstra(a)arm.com>
Cc: stable(a)vger.kernel.org
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Jeremy Linton <jeremy.linton(a)arm.com>
Cc: Marc Zyngier <maz(a)kernel.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Oliver Upton <oliver.upton(a)linux.dev>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Will Deacon <will(a)kernel.org>
Signed-off-by: Mark Rutland <mark.rutland(a)arm.com>
---
arch/arm64/kernel/fpsimd.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
I believe there are some other issues in this area, but I'm sending this
out on its own because I believe the other issues are more complex while
this is self-contained, and people are actively hitting this case in
production.
I intend to follow up with fixes for the other cases I mention in the
commit message, and for the SME case with the BUILD_BUG_ON().
Mark.
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 8c4c1a2186cc5..e4053a90ed240 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1711,8 +1711,24 @@ void fpsimd_kvm_prepare(void)
*/
get_cpu_fpsimd_context();
- if (test_and_clear_thread_flag(TIF_SVE)) {
- sve_to_fpsimd(current);
+ if (!test_thread_flag(TIF_FOREIGN_FPSTATE) &&
+ test_and_clear_thread_flag(TIF_SVE)) {
+ sve_user_disable();
+
+ /*
+ * The KVM hyp code doesn't set fp_type when saving the host's
+ * FPSIMD state. Set fp_type here in case the hyp code saves
+ * the host state.
+ *
+ * If hyp code does not save the host state, then the host
+ * state remains live on the CPU and saved fp_type is
+ * irrelevant until it is overwritten by a later call to
+ * fpsimd_save_user_state().
+ *
+ * This is *NOT* sufficient when CONFIG_ARM64_SME=y, where
+ * fp_type can be FP_STATE_SVE regardless of TIF_SVE.
+ */
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_ARM64_SME));
current->thread.fp_type = FP_STATE_FPSIMD;
}
--
2.30.2
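
The failure sequence described in the commit message can be followed in
a toy user-space model. This is only a sketch: the struct, its fields
and the functions are hypothetical, a boolean stands in for the
CPACR_EL1.ZEN trap configuration, and the TIF_FOREIGN_FPSTATE condition
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	bool tif_sve;		/* a save will include the SVE registers */
	bool sve_trap;		/* userspace SVE use traps to the kernel */
	bool live_sve_data;	/* userspace has data in SVE-only state */
	bool saved_sve_data;	/* SVE data survived the last save */
};

/* Userspace executes an SVE instruction. */
static void model_user_uses_sve(struct model_task *t)
{
	assert(t != NULL);
	if (t->sve_trap) {
		/* The trap handler enables SVE and sets TIF_SVE. */
		t->sve_trap = false;
		t->tif_sve = true;
	}
	t->live_sve_data = true;
}

/* A later save only keeps SVE data when TIF_SVE is set. */
static void model_save_state(struct model_task *t)
{
	assert(t != NULL);
	t->saved_sve_data = t->tif_sve && t->live_sve_data;
}

/* vCPU load: 'fixed' selects the behaviour after the patch. */
static void model_kvm_prepare(struct model_task *t, bool fixed)
{
	assert(t != NULL);
	if (t->tif_sve) {
		t->tif_sve = false;
		if (fixed)
			t->sve_trap = true;	/* models sve_user_disable() */
	}
}
```

Starting from tif_sve == true and sve_trap == false, the unfixed
sequence prepare, use SVE, save ends with saved_sve_data == false; with
the fix, the SVE use traps, TIF_SVE is set again and the data is saved.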
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 2ca06a2f65310aeef30bb69b7405437a14766e4d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025012037-siesta-sulfite-8b05@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2ca06a2f65310aeef30bb69b7405437a14766e4d Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Mon, 13 Jan 2025 16:44:56 +0100
Subject: [PATCH] mptcp: be sure to send ack when mptcp-level window re-opens
mptcp_cleanup_rbuf() is responsible for sending acks when the user-space
reads enough data to update the receive windows significantly.
It tries hard to avoid acquiring the subflow sockets locks by checking
conditions similar to the ones implemented at the TCP level.
To avoid too much code duplication - the MPTCP protocol can't reuse the
TCP helpers as part of the relevant status is maintained in the msk
socket - and multiple costly window size computations, mptcp_cleanup_rbuf()
uses a rough estimate for the most recently advertised window size:
the MPTCP receive free space, as recorded at last-ack time.
Unfortunately the above does not allow mptcp_cleanup_rbuf() to detect
a zero to non-zero win change in some corner cases, skipping the
tcp_cleanup_rbuf call and leaving the peer stuck.
After commit ea66758c1795 ("tcp: allow MPTCP to update the announced
window"), MPTCP actually has cheap access to the announced window value.
Use it in mptcp_cleanup_rbuf() for a more accurate ack generation.
Fixes: e3859603ba13 ("mptcp: better msk receive window updates")
Cc: stable(a)vger.kernel.org
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Closes: https://lore.kernel.org/20250107131845.5e5de3c5@kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Link: https://patch.msgid.link/20250113-net-mptcp-connect-st-flakes-v1-1-0d986ee7…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index a62bc874bf1e..123f3f297284 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -607,7 +607,6 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
}
opts->ext_copy.use_ack = 1;
opts->suboptions = OPTION_MPTCP_DSS;
- WRITE_ONCE(msk->old_wspace, __mptcp_space((struct sock *)msk));
/* Add kind/length/subtype/flag overhead if mapping is not populated */
if (dss_size == 0)
@@ -1288,7 +1287,7 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
}
MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICT);
}
- return;
+ goto update_wspace;
}
if (rcv_wnd_new != rcv_wnd_old) {
@@ -1313,6 +1312,9 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
th->window = htons(new_win);
MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDSHARED);
}
+
+update_wspace:
+ WRITE_ONCE(msk->old_wspace, tp->rcv_wnd);
}
__sum16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __wsum sum)
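
As a rough illustration of why the recorded value matters, here is a
user-space sketch. It assumes the ack decision is a TCP-like doubling
test (ack once the free space is non-zero and at least twice the window
the peer last heard about); the function name and the numbers are made
up for the example and are not taken from the MPTCP code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed heuristic: send an ack once the free space is non-zero and
 * at least twice the window the peer was last told about.
 */
static bool model_should_ack(int space, int last_window)
{
	assert(space >= 0 && last_window >= 0);
	return space > 0 && space >= 2 * last_window;
}
```

If the peer was last sent a zero window but the recorded estimate is,
say, 4000 bytes of free space, then with 6000 bytes free the test fails
against the estimate (6000 < 8000) and no ack is sent, while it succeeds
against the window that was really announced (6000 >= 0).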
Hi,
We noticed that patch 0f022d32c3ec should probably be ported to the 6.1
and 6.6 LTS branches, based on its bug-introducing commit,
3bcb846ca4cf. It applies to the latest version of both branches without
conflicts.

According to our manual analysis, the vulnerability is a deadlock caused
by recursive locking of the qdisc lock (`sch->q.lock`) when packets are
redirected in a loop (e.g., mirroring or redirecting packets to the same
device): the same CPU tries to acquire the same qdisc lock more than
once. Commit 3bcb846ca4cf replaces the `spin_trylock()` in
`net_tx_action()` with `spin_lock()`. The non-blocking `spin_trylock()`
failed if the lock was already held, which prevented recursive locking;
`spin_lock()` instead waits for the lock, so a CPU that already holds it
deadlocks when it tries to take it again.

The patch adds an `owner` field to the `Qdisc` structure to track the
CPU that currently owns the qdisc. Before enqueueing a packet to the
qdisc, it checks whether the current CPU is the owner and, if so, drops
the packet, so the same CPU never attempts to acquire the lock
recursively.
--
Yours sincerely,
Xingyu
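
The owner check described above can be sketched in user space as
follows. This is only a model under stated assumptions: the struct, the
MODEL_NO_OWNER value and the function are hypothetical, the CPU number
is passed in as a parameter, and the lock itself is reduced to a
comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_OWNER (-1)

struct model_qdisc {
	int owner;	/* CPU currently running this qdisc, or MODEL_NO_OWNER */
	int queued;	/* packets accepted */
	int dropped;	/* packets dropped by the owner check */
};

/*
 * Transmit from 'cpu'. 'redirects_left' models a redirect rule that
 * sends the packet back to the same device. Returns false on drop.
 */
static bool model_xmit(struct model_qdisc *q, int cpu, int redirects_left)
{
	assert(q != NULL);
	if (q->owner == cpu) {
		/* Same CPU re-entered: taking the lock would self-deadlock. */
		q->dropped++;
		return false;
	}

	/* A real implementation would take the qdisc lock here. */
	q->owner = cpu;
	q->queued++;
	if (redirects_left > 0)
		model_xmit(q, cpu, redirects_left - 1);
	q->owner = MODEL_NO_OWNER;
	return true;
}
```

With one redirect back to the same device, the outer call queues its
packet and the nested call is dropped instead of spinning on a lock that
the same CPU already holds.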
Hi,
We noticed that patch 11a4d6f67cf5 should be ported to the 5.10 and
5.15 LTS branches, based on its bug-introducing commit, f25dcc7687d4.
It applies to the latest version of both branches without conflicts.

The kernel warning and stack trace indicate a problem when sending a SYN
message in TIPC (Transparent Inter-Process Communication). The issue
arises because `copy_from_iter()` is called with an uninitialized
`iov_iter` structure, leading to invalid memory operations. Commit
f25dcc7687d4 introduces the vulnerability by replacing the old data
copying mechanisms with the new `copy_from_iter()` function without
ensuring that the `iov_iter` structure is properly initialized in all
code paths.

The patch adds initialization of `iov_iter` with
"iov_iter_kvec(&m.msg_iter, ITER_SOURCE, NULL, 0, 0);", which ensures
that even when there's no data to send, the `iov_iter` is correctly
set up, preventing the kernel warning/crash when `copy_from_iter()` is
called.
--
Yours sincerely,
Xingyu