This is the start of the stable review cycle for the 6.6.85 release.
There are 77 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 27 Mar 2025 12:21:27 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.6.85-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.6.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.6.85-rc1
Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
netfilter: nft_counter: Use u64_stats_t for statistic.
Benjamin Berg <benjamin.berg(a)intel.com>
wifi: iwlwifi: mvm: ensure offloading TID queue exists
Miri Korenblit <miriam.rachel.korenblit(a)intel.com>
wifi: iwlwifi: support BIOS override for 5G9 in CA also in LARI version 8
Shravya KN <shravya.k-n(a)broadcom.com>
bnxt_en: Fix receive ring space parameters when XDP is active
Josef Bacik <josef(a)toxicpanda.com>
btrfs: make sure that WRITTEN is set on all metadata blocks
Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
Justin Klaassen <justin(a)tidylabs.net>
arm64: dts: rockchip: fix u2phy1_host status for NanoPi R4S
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Eagerly switch ZCR_EL{1,2}
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Mark some header functions as inline
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Refactor exit handlers
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Remove host FPSIMD saving for non-protected KVM
Mark Rutland <mark.rutland(a)arm.com>
KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
Fuad Tabba <tabba(a)google.com>
KVM: arm64: Calculate cptr_el2 traps on activating traps
Arthur Mongodin <amongodin(a)randorisec.fr>
mptcp: Fix data stream corruption in the address announcement
Namjae Jeon <linkinjeon(a)kernel.org>
ksmbd: fix incorrect validation for num_aces field of smb_acl
Mario Limonciello <mario.limonciello(a)amd.com>
drm/amd/display: Use HW lock mgr for PSR1 when only one eDP
Martin Tsai <martin.tsai(a)amd.com>
drm/amd/display: should support dmub hw lock on Replay
David Rosca <david.rosca(a)amd.com>
drm/amdgpu: Fix JPEG video caps max size for navi1x and raven
David Rosca <david.rosca(a)amd.com>
drm/amdgpu: Fix MPEG2, MPEG4 and VC1 video caps max size
qianyi liu <liuqianyi125(a)gmail.com>
drm/sched: Fix fence reference count leak
Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
drm/radeon: fix uninitialized size issue in radeon_vce_cs_parse()
Saranya R <quic_sarar(a)quicinc.com>
soc: qcom: pdr: Fix the potential deadlock
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Ignore own maximum aggregation size during RX
Gavrilov Ilia <Ilia.Gavrilov(a)infotecs.ru>
xsk: fix an integer overflow in xp_create_and_assign_umem()
Ard Biesheuvel <ardb(a)kernel.org>
efi/libstub: Avoid physical address 0x0 when doing random allocation
Geert Uytterhoeven <geert+renesas(a)glider.be>
ARM: shmobile: smp: Enforce shmobile_smp_* alignment
Stefan Eichenberger <stefan.eichenberger(a)toradex.com>
ARM: dts: imx6qdl-apalis: Fix poweroff on Apalis iMX6
Shakeel Butt <shakeel.butt(a)linux.dev>
memcg: drain obj stock on cpu hotplug teardown
Ye Bin <yebin10(a)huawei.com>
proc: fix UAF in proc_get_inode()
Zi Yan <ziy(a)nvidia.com>
mm/migrate: fix shmem xarray update during migration
Raphael S. Carvalho <raphaelsc(a)scylladb.com>
mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
Gu Bowen <gubowen5(a)huawei.com>
mmc: atmel-mci: Add missing clk_disable_unprepare()
Kamal Dasu <kamal.dasu(a)broadcom.com>
mmc: sdhci-brcmstb: add cqhci suspend/resume to PM ops
Dragan Simic <dsimic(a)manjaro.org>
arm64: dts: rockchip: Add missing PCIe supplies to RockPro64 board dtsi
Quentin Schulz <quentin.schulz(a)cherry.de>
arm64: dts: rockchip: fix pinmux of UART0 for PX30 Ringneck on Haikou
Stefan Eichenberger <stefan.eichenberger(a)toradex.com>
arm64: dts: freescale: imx8mm-verdin-dahlia: add Microphone Jack to sound card
Stefan Eichenberger <stefan.eichenberger(a)toradex.com>
arm64: dts: freescale: imx8mp-verdin-dahlia: add Microphone Jack to sound card
Dan Carpenter <dan.carpenter(a)linaro.org>
accel/qaic: Fix integer overflow in qaic_validate_req()
Christian Eggers <ceggers(a)arri.de>
regulator: check that dummy regulator has been probed before using it
Christian Eggers <ceggers(a)arri.de>
regulator: dummy: force synchronous probing
E Shattow <e(a)freeshell.de>
riscv: dts: starfive: Fix a typo in StarFive JH7110 pin function definitions
Maíra Canal <mcanal(a)igalia.com>
drm/v3d: Don't run jobs that have errors flagged in its fence
Haibo Chen <haibo.chen(a)nxp.com>
can: flexcan: disable transceiver during system PM
Haibo Chen <haibo.chen(a)nxp.com>
can: flexcan: only change CAN state when link up in system PM
Vincent Mailhol <mailhol.vincent(a)wanadoo.fr>
can: ucan: fix out of bound read in strscpy() source
Biju Das <biju.das.jz(a)bp.renesas.com>
can: rcar_canfd: Fix page entries in the AFL list
Andreas Kemnade <andreas(a)kemnade.info>
i2c: omap: fix IRQ storms
Guillaume Nault <gnault(a)redhat.com>
Revert "gre: Fix IPv6 link-local address generation."
Lin Ma <linma(a)zju.edu.cn>
net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTES
Justin Iurman <justin.iurman(a)uliege.be>
net: lwtunnel: fix recursion loops
Dan Carpenter <dan.carpenter(a)linaro.org>
net: atm: fix use after free in lec_send()
Kuniyuki Iwashima <kuniyu(a)amazon.com>
ipv6: Set errno after ip_fib_metrics_init() in ip6_route_info_create().
Kuniyuki Iwashima <kuniyu(a)amazon.com>
ipv6: Fix memleak of nhc_pcpu_rth_output in fib_check_nh_v6_gw().
David Lechner <dlechner(a)baylibre.com>
ARM: davinci: da850: fix selecting ARCH_DAVINCI_DA8XX
Jeffrey Hugo <quic_jhugo(a)quicinc.com>
accel/qaic: Fix possible data corruption in BOs > 2G
Arkadiusz Bokowy <arkadiusz.bokowy(a)gmail.com>
Bluetooth: hci_event: Fix connection regression between LE and non-LE adapters
Dan Carpenter <dan.carpenter(a)linaro.org>
Bluetooth: Fix error code in chan_alloc_skb_cb()
Junxian Huang <huangjunxian6(a)hisilicon.com>
RDMA/hns: Fix wrong value of max_sge_rd
Junxian Huang <huangjunxian6(a)hisilicon.com>
RDMA/hns: Fix a missing rollback in error path of hns_roce_create_qp_common()
Junxian Huang <huangjunxian6(a)hisilicon.com>
RDMA/hns: Fix unmatched condition in error path of alloc_user_qp_db()
Junxian Huang <huangjunxian6(a)hisilicon.com>
RDMA/hns: Fix soft lockup during bt pages loop
Saravanan Vajravel <saravanan.vajravel(a)broadcom.com>
RDMA/bnxt_re: Avoid clearing VLAN_ID mask in modify qp path
Phil Elwell <phil(a)raspberrypi.com>
ARM: dts: bcm2711: Don't mark timer regs unconfigured
Arnd Bergmann <arnd(a)arndb.de>
ARM: OMAP1: select CONFIG_GENERIC_IRQ_CHIP
Qasim Ijaz <qasdev00(a)gmail.com>
RDMA/mlx5: Handle errors returned from mlx5r_ib_rate()
Kashyap Desai <kashyap.desai(a)broadcom.com>
RDMA/bnxt_re: Add missing paranthesis in map_qp_id_to_tbl_indx
Yao Zi <ziyao(a)disroot.org>
arm64: dts: rockchip: Remove undocumented sdmmc property from lubancat-1
Phil Elwell <phil(a)raspberrypi.com>
ARM: dts: bcm2711: PL011 UARTs are actually r1p5
Peng Fan <peng.fan(a)nxp.com>
soc: imx8m: Unregister cpufreq and soc dev in cleanup path
Marek Vasut <marex(a)denx.de>
soc: imx8m: Use devm_* to simplify probe failure handling
Marek Vasut <marex(a)denx.de>
soc: imx8m: Remove global soc_uid
Cosmin Ratiu <cratiu(a)nvidia.com>
xfrm_output: Force software GSO only in tunnel mode
Alexandre Cassen <acassen(a)corp.free.fr>
xfrm: fix tunnel mode TX datapath in packet offload mode
Alexander Stein <alexander.stein(a)ew.tq-group.com>
arm64: dts: freescale: tqma8mpql: Fix vqmmc-supply
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
firmware: imx-scu: fix OF node leak in .probe()
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/broadcom/bcm2711.dtsi | 11 +-
arch/arm/boot/dts/nxp/imx/imx6qdl-apalis.dtsi | 10 +-
arch/arm/mach-davinci/Kconfig | 1 +
arch/arm/mach-omap1/Kconfig | 1 +
arch/arm/mach-shmobile/headsmp.S | 1 +
.../boot/dts/freescale/imx8mm-verdin-dahlia.dtsi | 6 +-
.../arm64/boot/dts/freescale/imx8mp-tqma8mpql.dtsi | 16 +--
.../boot/dts/freescale/imx8mp-verdin-dahlia.dtsi | 6 +-
.../boot/dts/rockchip/px30-ringneck-haikou.dts | 2 +
arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts | 2 +-
arch/arm64/boot/dts/rockchip/rk3399-rockpro64.dtsi | 2 +
arch/arm64/boot/dts/rockchip/rk3566-lubancat-1.dts | 1 -
arch/arm64/include/asm/kvm_host.h | 7 +-
arch/arm64/include/asm/kvm_hyp.h | 1 +
arch/arm64/kernel/fpsimd.c | 25 ----
arch/arm64/kvm/arm.c | 1 -
arch/arm64/kvm/fpsimd.c | 89 +++---------
arch/arm64/kvm/hyp/entry.S | 5 +
arch/arm64/kvm/hyp/include/hyp/switch.h | 106 ++++++++++-----
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 15 +-
arch/arm64/kvm/hyp/nvhe/pkvm.c | 29 +---
arch/arm64/kvm/hyp/nvhe/switch.c | 112 ++++++++++-----
arch/arm64/kvm/hyp/vhe/switch.c | 13 +-
arch/arm64/kvm/reset.c | 3 +
arch/riscv/boot/dts/starfive/jh7110-pinfunc.h | 2 +-
drivers/accel/qaic/qaic_data.c | 9 +-
drivers/firmware/efi/libstub/randomalloc.c | 4 +
drivers/firmware/imx/imx-scu.c | 1 +
drivers/gpu/drm/amd/amdgpu/nv.c | 20 +--
drivers/gpu/drm/amd/amdgpu/soc15.c | 20 +--
drivers/gpu/drm/amd/amdgpu/vi.c | 36 ++---
.../gpu/drm/amd/display/dc/dce/dmub_hw_lock_mgr.c | 15 ++
drivers/gpu/drm/radeon/radeon_vce.c | 2 +-
drivers/gpu/drm/scheduler/sched_entity.c | 11 +-
drivers/gpu/drm/v3d/v3d_sched.c | 9 +-
drivers/i2c/busses/i2c-omap.c | 26 +---
drivers/infiniband/hw/bnxt_re/qplib_fp.c | 2 -
drivers/infiniband/hw/bnxt_re/qplib_rcfw.h | 3 +-
drivers/infiniband/hw/hns/hns_roce_hem.c | 16 ++-
drivers/infiniband/hw/hns/hns_roce_main.c | 2 +-
drivers/infiniband/hw/hns/hns_roce_qp.c | 10 +-
drivers/infiniband/hw/mlx5/ah.c | 14 +-
drivers/mmc/host/atmel-mci.c | 4 +-
drivers/mmc/host/sdhci-brcmstb.c | 10 ++
drivers/net/can/flexcan/flexcan-core.c | 18 ++-
drivers/net/can/rcar/rcar_canfd.c | 28 ++--
drivers/net/can/usb/ucan.c | 43 +++---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 10 +-
drivers/net/wireless/intel/iwlwifi/fw/file.h | 4 +-
drivers/net/wireless/intel/iwlwifi/mvm/d3.c | 9 +-
drivers/net/wireless/intel/iwlwifi/mvm/fw.c | 37 ++++-
drivers/net/wireless/intel/iwlwifi/mvm/sta.c | 28 ++++
drivers/net/wireless/intel/iwlwifi/mvm/sta.h | 3 +-
drivers/regulator/core.c | 12 +-
drivers/regulator/dummy.c | 2 +-
drivers/soc/imx/soc-imx8m.c | 151 ++++++++++-----------
drivers/soc/qcom/pdr_interface.c | 8 +-
fs/btrfs/tree-checker.c | 30 ++--
fs/btrfs/tree-checker.h | 1 +
fs/proc/generic.c | 10 +-
fs/proc/inode.c | 6 +-
fs/proc/internal.h | 14 ++
fs/smb/server/smbacl.c | 5 +-
include/linux/proc_fs.h | 7 +-
include/net/bluetooth/hci.h | 2 +-
kernel/sched/core.c | 22 +--
mm/filemap.c | 13 +-
mm/memcontrol.c | 9 ++
mm/migrate.c | 10 +-
net/atm/lec.c | 3 +-
net/batman-adv/bat_iv_ogm.c | 3 +-
net/batman-adv/bat_v_ogm.c | 3 +-
net/bluetooth/6lowpan.c | 7 +-
net/core/lwtunnel.c | 65 +++++++--
net/core/neighbour.c | 1 +
net/ipv6/addrconf.c | 15 +-
net/ipv6/route.c | 5 +-
net/mptcp/options.c | 6 +-
net/netfilter/nft_counter.c | 90 ++++++------
net/xdp/xsk_buff_pool.c | 2 +-
net/xfrm/xfrm_output.c | 43 +++++-
82 files changed, 810 insertions(+), 600 deletions(-)
Once of_device_register() failed, we should call put_device() to
decrement reference count for cleanup. Or it could cause memory leak.
So fix this by calling put_device(), then the name can be freed in
kobject_cleanup().
As comment of device_add() says, 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: cf44bbc26cf1 ("[SPARC]: Beginnings of generic of_device framework.")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
arch/sparc/kernel/of_device_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/sparc/kernel/of_device_64.c b/arch/sparc/kernel/of_device_64.c
index f98c2901f335..4272746d7166 100644
--- a/arch/sparc/kernel/of_device_64.c
+++ b/arch/sparc/kernel/of_device_64.c
@@ -677,7 +677,7 @@ static struct platform_device * __init scan_one_device(struct device_node *dp,
if (of_device_register(op)) {
printk("%pOF: Could not register of device.\n", dp);
- kfree(op);
+ put_device(&op->dev);
op = NULL;
}
--
2.25.1
The OHCI controller (rev 0x02) under LS7A PCI host has a hardware flaw.
MMIO register with offset 0x60/0x64 is treated as legacy PS2-compatible
keyboard/mouse interface, which confuse the OHCI controller. Since OHCI
only use a 4KB BAR resource indeed, the LS7A OHCI controller's 32KB BAR
is wrapped around (the second 4KB BAR space is the same as the first 4KB
internally). So we can add an 4KB offset (0x1000) to the OHCI registers
(from the PCI BAR resource) as a quirk.
Cc: stable(a)vger.kernel.org
Suggested-by: Bjorn Helgaas <bhelgaas(a)google.com>
Tested-by: Mingcong Bai <jeffbai(a)aosc.io>
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
---
drivers/usb/host/ohci-pci.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/usb/host/ohci-pci.c b/drivers/usb/host/ohci-pci.c
index 900ea0d368e0..38e535aa09fe 100644
--- a/drivers/usb/host/ohci-pci.c
+++ b/drivers/usb/host/ohci-pci.c
@@ -165,6 +165,15 @@ static int ohci_quirk_amd700(struct usb_hcd *hcd)
return 0;
}
+static int ohci_quirk_loongson(struct usb_hcd *hcd)
+{
+ struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
+
+ hcd->regs += (pdev->revision == 0x2) ? 0x1000 : 0x0;
+
+ return 0;
+}
+
static int ohci_quirk_qemu(struct usb_hcd *hcd)
{
struct ohci_hcd *ohci = hcd_to_ohci(hcd);
@@ -224,6 +233,10 @@ static const struct pci_device_id ohci_pci_quirks[] = {
PCI_DEVICE(PCI_VENDOR_ID_ATI, 0x4399),
.driver_data = (unsigned long)ohci_quirk_amd700,
},
+ {
+ PCI_DEVICE(PCI_VENDOR_ID_LOONGSON, 0x7a24),
+ .driver_data = (unsigned long)ohci_quirk_loongson,
+ },
{
.vendor = PCI_VENDOR_ID_APPLE,
.device = 0x003f,
--
2.47.1
From: Douglas Raillard <douglas.raillard(a)arm.com>
The printk format for synth event uses "%.*s" to print string fields,
but then only passes the pointer part as var arg.
Replace %.*s with %s as the C string is guaranteed to be null-terminated.
The output in print fmt should never have been updated as __get_str()
handles the string limit because it can access the length of the string in
the string meta data that is saved in the ring buffer.
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Fixes: 8db4d6bfbbf92 ("tracing: Change synthetic event string format to limit printed length")
Link: https://lore.kernel.org/20250325165202.541088-1-douglas.raillard@arm.com
Signed-off-by: Douglas Raillard <douglas.raillard(a)arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events_synth.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index a5c5f34c207a..6d592cbc38e4 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -305,7 +305,7 @@ static const char *synth_field_fmt(char *type)
else if (strcmp(type, "gfp_t") == 0)
fmt = "%x";
else if (synth_field_is_string(type))
- fmt = "%.*s";
+ fmt = "%s";
else if (synth_field_is_stack(type))
fmt = "%s";
--
2.47.2
Hi ,
Planning to get the ICS West 2025 attendee list?
Expo Name: International Security Conference & Exposition West 2025
Total Number of records: 23,000 records
List includes: Company Name, Contact Name, Job Title, Mailing Address, Phone, Emails, etc.
Interested in moving forward with these leads? Let me know, and I'll share the price.
Can't wait for your reply
Regards
Jessica
Marketing Manager
Campaign Data Leads.,
Please reply with REMOVE if you don't wish to receive further emails
Hi all,
A recent LLVM change [1] introduces a call to wcslen() in
fs/smb/client/smb2pdu.c through UniStrcat() via
alloc_path_with_tree_prefix(). Similar to the bcmp() and stpcpy()
additions that happened in 5f074f3e192f and 1e1b6d63d634, add wcslen()
to fix the linkage failure.
[1]: https://github.com/llvm/llvm-project/commit/9694844d7e36fd5e01011ab56b64f27…
---
Changes in v2:
- Refactor typedefs from nls.h into nls_types.h to make it safe to
include in string.h, which may be included in many places throughout
the kernel that may not like the other stuff nls.h brings in:
https://lore.kernel.org/202503260611.MDurOUhF-lkp@intel.com/
- Drop libstub change due to the above change, as it is no longer
necessary.
- Move prototype shuffle of patch 2 into the patch that adds wcslen()
(Andy)
- Use new nls_types.h in string.{c,h}
- Link to v1: https://lore.kernel.org/r/20250325-string-add-wcslen-for-llvm-opt-v1-0-b8f1…
---
Nathan Chancellor (2):
include: Move typedefs in nls.h to their own header
lib/string.c: Add wcslen()
include/linux/nls.h | 19 +------------------
include/linux/nls_types.h | 25 +++++++++++++++++++++++++
include/linux/string.h | 2 ++
lib/string.c | 11 +++++++++++
4 files changed, 39 insertions(+), 18 deletions(-)
---
base-commit: 78ab93c78fb31c5dfe207318aa2b7bd4e41f8dba
change-id: 20250324-string-add-wcslen-for-llvm-opt-705791db92c0
Best regards,
--
Nathan Chancellor <nathan(a)kernel.org>
struct rdma_cm_id has member "struct work_struct net_work"
that is reused for enqueuing cma_netevent_work_handler()s
onto cma_wq.
Below crash[1] can occur if more than one call to
cma_netevent_callback() occurs in quick succession,
which further enqueues cma_netevent_work_handler()s for the
same rdma_cm_id, overwriting any previously queued work-item(s)
that was just scheduled to run i.e. there is no guarantee
the queued work item may run between two successive calls
to cma_netevent_callback() and the 2nd INIT_WORK would overwrite
the 1st work item (for the same rdma_cm_id), despite grabbing
id_table_lock during enqueue.
Also drgn analysis [2] indicates the work item was likely overwritten.
Fix this by moving the INIT_WORK() to __rdma_create_id(),
so that it doesn't race with any existing queue_work() or
its worker thread.
[1] Trimmed crash stack:
=============================================
BUG: kernel NULL pointer dereference, address: 0000000000000008
kworker/u256:6 ... 6.12.0-0...
Workqueue: cma_netevent_work_handler [rdma_cm] (rdma_cm)
RIP: 0010:process_one_work+0xba/0x31a
Call Trace:
worker_thread+0x266/0x3a0
kthread+0xcf/0x100
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1a/0x30
=============================================
[2] drgn crash analysis:
>>> trace = prog.crashed_thread().stack_trace()
>>> trace
(0) crash_setup_regs (./arch/x86/include/asm/kexec.h:111:15)
(1) __crash_kexec (kernel/crash_core.c:122:4)
(2) panic (kernel/panic.c:399:3)
(3) oops_end (arch/x86/kernel/dumpstack.c:382:3)
...
(8) process_one_work (kernel/workqueue.c:3168:2)
(9) process_scheduled_works (kernel/workqueue.c:3310:3)
(10) worker_thread (kernel/workqueue.c:3391:4)
(11) kthread (kernel/kthread.c:389:9)
Line workqueue.c:3168 for this kernel version is in process_one_work():
3168 strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);
>>> trace[8]["work"]
*(struct work_struct *)0xffff92577d0a21d8 = {
.data = (atomic_long_t){
.counter = (s64)536870912, <=== Note
},
.entry = (struct list_head){
.next = (struct list_head *)0xffff924d075924c0,
.prev = (struct list_head *)0xffff924d075924c0,
},
.func = (work_func_t)cma_netevent_work_handler+0x0 = 0xffffffffc2cec280,
}
Suspicion is that pwq is NULL:
>>> trace[8]["pwq"]
(struct pool_workqueue *)<absent>
In process_one_work(), pwq is assigned from:
struct pool_workqueue *pwq = get_work_pwq(work);
and get_work_pwq() is:
static struct pool_workqueue *get_work_pwq(struct work_struct *work)
{
unsigned long data = atomic_long_read(&work->data);
if (data & WORK_STRUCT_PWQ)
return work_struct_pwq(data);
else
return NULL;
}
WORK_STRUCT_PWQ is 0x4:
>>> print(repr(prog['WORK_STRUCT_PWQ']))
Object(prog, 'enum work_flags', value=4)
But work->data is 536870912 which is 0x20000000.
So, get_work_pwq() returns NULL and we crash in process_one_work():
3168 strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);
=============================================
Fixes: 925d046e7e52 ("RDMA/core: Add a netevent notifier to cma")
Co-developed-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge(a)oracle.com>
Signed-off-by: Sharath Srinivasan <sharath.srinivasan(a)oracle.com>
---
drivers/infiniband/core/cma.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 91db10515d74..176d0b3e4488 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -72,6 +72,8 @@ static const char * const cma_events[] = {
static void cma_iboe_set_mgid(struct sockaddr *addr, union ib_gid *mgid,
enum ib_gid_type gid_type);
+static void cma_netevent_work_handler(struct work_struct *_work);
+
const char *__attribute_const__ rdma_event_msg(enum rdma_cm_event_type event)
{
size_t index = event;
@@ -1033,6 +1035,7 @@ __rdma_create_id(struct net *net, rdma_cm_event_handler event_handler,
get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
id_priv->id.route.addr.dev_addr.net = get_net(net);
id_priv->seq_num &= 0x00ffffff;
+ INIT_WORK(&id_priv->id.net_work, cma_netevent_work_handler);
rdma_restrack_new(&id_priv->res, RDMA_RESTRACK_CM_ID);
if (parent)
@@ -5227,7 +5230,6 @@ static int cma_netevent_callback(struct notifier_block *self,
if (!memcmp(current_id->id.route.addr.dev_addr.dst_dev_addr,
neigh->ha, ETH_ALEN))
continue;
- INIT_WORK(¤t_id->id.net_work, cma_netevent_work_handler);
cma_id_get(current_id);
queue_work(cma_wq, ¤t_id->id.net_work);
}
--
2.39.5 (Apple Git-154)
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue then it will call find_lock_lowest_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors all
coming from scheduler itself.
Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
Call Trace:
? __die_body+0x1a/0x60
? die+0x2a/0x50
? do_trap+0x85/0x100
? pick_next_task_rt+0x6e/0x1d0
? do_error_trap+0x64/0xa0
? pick_next_task_rt+0x6e/0x1d0
? exc_invalid_op+0x4c/0x60
? pick_next_task_rt+0x6e/0x1d0
? asm_exc_invalid_op+0x12/0x20
? pick_next_task_rt+0x6e/0x1d0
__schedule+0x5cb/0x790
? update_ts_time_stats+0x55/0x70
schedule_idle+0x1e/0x40
do_idle+0x15e/0x200
cpu_startup_entry+0x19/0x20
start_secondary+0x117/0x160
secondary_startup_64_no_verify+0xb0/0xbb
-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
Call Trace:
? __die_body+0x1a/0x60
? no_context+0x183/0x350
? __warn+0x8a/0xe0
? exc_page_fault+0x3d6/0x520
? asm_exc_page_fault+0x1e/0x30
? pick_next_task_rt+0xb5/0x1d0
? pick_next_task_rt+0x8c/0x1d0
__schedule+0x583/0x7e0
? update_ts_time_stats+0x55/0x70
schedule_idle+0x1e/0x40
do_idle+0x15e/0x200
cpu_startup_entry+0x19/0x20
start_secondary+0x117/0x160
secondary_startup_64_no_verify+0xb0/0xbb
-> BUG: unable to handle page fault for address: ffff9464daea5900
kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))
-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
Call Trace:
? __die_body+0x1a/0x60
? die+0x2a/0x50
? do_trap+0x85/0x100
? dequeue_top_rt_rq+0xa2/0xb0
? do_error_trap+0x64/0xa0
? dequeue_top_rt_rq+0xa2/0xb0
? exc_invalid_op+0x4c/0x60
? dequeue_top_rt_rq+0xa2/0xb0
? asm_exc_invalid_op+0x12/0x20
? dequeue_top_rt_rq+0xa2/0xb0
dequeue_rt_entity+0x1f/0x70
dequeue_task_rt+0x2d/0x70
__schedule+0x1a8/0x7e0
? blk_finish_plug+0x25/0x40
schedule+0x3c/0xb0
futex_wait_queue_me+0xb6/0x120
futex_wait+0xd9/0x240
do_futex+0x344/0xa90
? get_mm_exe_file+0x30/0x60
? audit_exe_compare+0x58/0x70
? audit_filter_rules.constprop.26+0x65e/0x1220
__x64_sys_futex+0x148/0x1f0
do_syscall_64+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x62/0xc7
-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
Call Trace:
? __die_body+0x1a/0x60
? no_context+0x183/0x350
? spurious_kernel_fault+0x171/0x1c0
? exc_page_fault+0x3b6/0x520
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? asm_exc_page_fault+0x1e/0x30
? _cond_resched+0x15/0x30
? futex_wait_queue_me+0xc8/0x120
? futex_wait+0xd9/0x240
? try_to_wake_up+0x1b8/0x490
? futex_wake+0x78/0x160
? do_futex+0xcd/0xa90
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? plist_del+0x6a/0xd0
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? dequeue_pushable_task+0x20/0x70
? __schedule+0x382/0x7e0
? asm_sysvec_reschedule_ipi+0xa/0x20
? schedule+0x3c/0xb0
? exit_to_user_mode_prepare+0x9e/0x150
? irqentry_exit_to_user_mode+0x5/0x30
? asm_sysvec_reschedule_ipi+0x12/0x20
Above are some of the common examples of the crashes that were observed
due to this issue.
Details
=======
Let's look at the following scenario to understand this race.
1) CPU A enters push_rt_task
a) CPU A has chosen next_task = task p.
b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
c) CPU A identifies CPU X as a destination CPU (X < Z).
d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
locked CPU X’s rq, and thus, CPU A must wait.
2) At CPU Z
a) Previous task has completed execution and thus, CPU Z enters
schedule, locks its own rq after CPU A releases it.
b) CPU Z dequeues previous task and begins executing task p.
c) CPU Z unlocks its rq.
d) Task p yields the CPU (ex. by doing IO or waiting to acquire a
lock) which triggers the schedule function on CPU Z.
e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.
3) At CPU B
a) CPU B enters try_to_wake_up with input task p.
b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
B.state = WAKING.
c) CPU B via select_task_rq determines CPU Y as the target CPU.
4) The race
a) CPU A acquires CPU X’s lock and relocks CPU Z.
b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
still on CPU Z.
c) CPU A failed to notice task p had been dequeued from CPU Z while
CPU A was waiting for locks in double_lock_balance. If CPU A knew
that task p had been dequeued, it would return NULL forcing
push_rt_task to give up the task p's migration.
d) CPU B updates task p.cpu = Y and calls ttwu_queue.
e) CPU B locks Ys rq. CPU B enqueues task p onto Y and sets task
p.on_rq = 1.
f) CPU B unlocks CPU Y, triggering memory synchronization.
g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
has not migrated.
h) CPU A decides to migrate p to CPU X.
This leads to A dequeuing p from Y's queue and various crashes down the
line.
Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is anyway not suitable for
being pushed out.
Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this
are hit every couple of days. A fix similar to this was deployed on such
cluster and was stable for more than 30 days.
Co-developed-by: Jon Kohler <jon(a)nutanix.com>
Signed-off-by: Jon Kohler <jon(a)nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan(a)nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan(a)nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru(a)nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru(a)nutanix.com>
Signed-off-by: Harshit Agarwal <harshit(a)nutanix.com>
Tested-by: Will Ton <william.ton(a)nutanix.com>
---
Changes in v2:
- As per Steve's suggestion, removed some checks that are done after
obtaining the lock that are no longer needed with the addition of new
check.
- Moved up is_migration_disabled check.
- Link to v1:
https://lore.kernel.org/lkml/20250211054646.23987-1-harshit@nutanix.com/
---
kernel/sched/rt.c | 54 +++++++++++++++++++++++------------------------
1 file changed, 26 insertions(+), 28 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4762dd3f50c5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
return -1;
}
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->rt.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ BUG_ON(rq->cpu != task_cpu(p));
+ BUG_ON(task_current(rq, p));
+ BUG_ON(task_current_donor(rq, p));
+ BUG_ON(p->nr_cpus_allowed <= 1);
+
+ BUG_ON(!task_on_rq_queued(p));
+ BUG_ON(!rt_task(p));
+
+ return p;
+}
+
/* Will lock the rq it finds */
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
{
@@ -1915,18 +1936,16 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
/*
* We had to unlock the run queue. In
* the mean time, task could have
- * migrated already or had its affinity changed.
- * Also make sure that it wasn't scheduled on its rq.
+ * migrated already or had its affinity changed,
+ * therefore check if the task is still at the
+ * head of the pushable tasks list.
* It is possible the task was scheduled, set
* "migrate_disabled" and then got preempted, so we must
* check the task migration disable flag here too.
*/
- if (unlikely(task_rq(task) != rq ||
+ if (unlikely(is_migration_disabled(task) ||
!cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
- task_on_cpu(rq, task) ||
- !rt_task(task) ||
- is_migration_disabled(task) ||
- !task_on_rq_queued(task))) {
+ task != pick_next_pushable_task(rq))) {
double_unlock_balance(rq, lowest_rq);
lowest_rq = NULL;
@@ -1946,27 +1965,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
return lowest_rq;
}
-static struct task_struct *pick_next_pushable_task(struct rq *rq)
-{
- struct task_struct *p;
-
- if (!has_pushable_tasks(rq))
- return NULL;
-
- p = plist_first_entry(&rq->rt.pushable_tasks,
- struct task_struct, pushable_tasks);
-
- BUG_ON(rq->cpu != task_cpu(p));
- BUG_ON(task_current(rq, p));
- BUG_ON(task_current_donor(rq, p));
- BUG_ON(p->nr_cpus_allowed <= 1);
-
- BUG_ON(!task_on_rq_queued(p));
- BUG_ON(!rt_task(p));
-
- return p;
-}
-
/*
* If the current CPU has more than one RT task, see if the non
* running task can migrate over to a CPU that is running a task
--
2.22.3