From: Max Filippov <jcmvbkbc@gmail.com>
Subject: highmem: fix checks in __kmap_local_sched_{in,out}
When CONFIG_DEBUG_KMAP_LOCAL is enabled, __kmap_local_sched_{in,out} check
that the even slots in tsk->kmap_ctrl.pteval are unmapped. The slots are
initialized with the value 0, but the check is done with pte_none(). A zero
pte, however, does not necessarily mean that pte_none() will return true;
on xtensa, for example, it returns false, resulting in the following
runtime warnings:
WARNING: CPU: 0 PID: 101 at mm/highmem.c:627 __kmap_local_sched_out+0x51/0x108
CPU: 0 PID: 101 Comm: touch Not tainted 5.17.0-rc7-00010-gd3a1cdde80d2-dirty #13
Call Trace:
dump_stack+0xc/0x40
__warn+0x8f/0x174
warn_slowpath_fmt+0x48/0xac
__kmap_local_sched_out+0x51/0x108
__schedule+0x71a/0x9c4
preempt_schedule_irq+0xa0/0xe0
common_exception_return+0x5c/0x93
do_wp_page+0x30e/0x330
handle_mm_fault+0xa70/0xc3c
do_page_fault+0x1d8/0x3c4
common_exception+0x7f/0x7f
WARNING: CPU: 0 PID: 101 at mm/highmem.c:664 __kmap_local_sched_in+0x50/0xe0
CPU: 0 PID: 101 Comm: touch Tainted: G W 5.17.0-rc7-00010-gd3a1cdde80d2-dirty #13
Call Trace:
dump_stack+0xc/0x40
__warn+0x8f/0x174
warn_slowpath_fmt+0x48/0xac
__kmap_local_sched_in+0x50/0xe0
finish_task_switch$isra$0+0x1ce/0x2f8
__schedule+0x86e/0x9c4
preempt_schedule_irq+0xa0/0xe0
common_exception_return+0x5c/0x93
do_wp_page+0x30e/0x330
handle_mm_fault+0xa70/0xc3c
do_page_fault+0x1d8/0x3c4
common_exception+0x7f/0x7f
Fix it by replacing !pte_none(pteval) with pte_val(pteval) != 0.
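To illustrate (a minimal userspace sketch, not part of the patch; the
macros below are simplified stand-ins loosely modeled on xtensa's
encoding, where an empty pte is a specific nonzero pattern):

  #include <stdio.h>

  /* Hypothetical pte encoding: "no mapping" is a nonzero pattern,
   * so an all-zero pte is NOT pte_none(). */
  #define _PAGE_CA_INVALID 0x0cUL
  #define _PAGE_USER       0x01UL

  typedef unsigned long pte_t;
  #define pte_val(p)  (p)
  #define pte_none(p) (pte_val(p) == (_PAGE_CA_INVALID | _PAGE_USER))

  int main(void)
  {
          pte_t pteval = 0; /* zero-initialized kmap_ctrl.pteval slot */

          /* Old debug check: fires, because pte_none(0) is false here. */
          printf("!pte_none(pteval)    -> %d\n", !pte_none(pteval));
          /* New check: quiet, a zero slot really is unused. */
          printf("pte_val(pteval) != 0 -> %d\n", (int)(pte_val(pteval) != 0));
          return 0;
  }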
Link: https://lkml.kernel.org/r/20220403235159.3498065-1-jcmvbkbc@gmail.com
Fixes: 5fbda3ecd14a ("sched: highmem: Store local kmaps in task struct")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/highmem.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/highmem.c~highmem-fix-checks-in-__kmap_local_sched_inout
+++ a/mm/highmem.c
@@ -624,7 +624,7 @@ void __kmap_local_sched_out(void)
/* With debug all even slots are unmapped and act as guard */
if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL) && !(i & 0x01)) {
- WARN_ON_ONCE(!pte_none(pteval));
+ WARN_ON_ONCE(pte_val(pteval) != 0);
continue;
}
if (WARN_ON_ONCE(pte_none(pteval)))
@@ -661,7 +661,7 @@ void __kmap_local_sched_in(void)
/* With debug all even slots are unmapped and act as guard */
if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL) && !(i & 0x01)) {
- WARN_ON_ONCE(!pte_none(pteval));
+ WARN_ON_ONCE(pte_val(pteval) != 0);
continue;
}
if (WARN_ON_ONCE(pte_none(pteval)))
_
Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.
cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.
As a result, build_sched_domain() will BUG() once per CPU with:
BUG: arch topology borken
the CLS domain not a subset of the MC domain
The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency-sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max systems.
Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER
is enabled to avoid also changing the topology (MC) when
CONFIG_SCHED_CLUSTER is disabled.
This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the CLUSTER topology is above a populated MC topology, but these are not
considered today and can be addressed separately if and when they
appear.
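As a worked example (values assumed from the description above, for CPU0
on Altra): with no shared CPU-side LLC, core_mask starts as {0}, while
cluster_sibling is {0,1}. Since {0} is a subset of {0,1}, core_mask is
extended to {0,1}; MC then equals CLS, and the sched domain builder
degenerates MC away as redundant instead of hitting the BUG.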
The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
For CPU0:
CONFIG_SCHED_CLUSTER=y
CLS [0-1]
DIE [0-79]
NUMA [0-159]
CONFIG_SCHED_CLUSTER is not set
DIE [0-79]
NUMA [0-159]
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: Carl Worth <carl@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
---
v1: Drop MC level if coregroup weight == 1
v2: New sd topo in arch/arm64/kernel/smp.c
v3: No new topo, extend core_mask to cluster_siblings
v4: Rebase on 5.18-rc1 for GregKH to pull. Add IS_ENABLED(CONFIG_SCHED_CLUSTER).
drivers/base/arch_topology.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..5497c5ab7318 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
core_mask = &cpu_topology[cpu].llc_sibling;
}
+ /*
+ * For systems with no shared cpu-side LLC but with clusters defined,
+ * extend core_mask to cluster_siblings. The sched domain builder will
+ * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled.
+ */
+ if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
+ cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
+ core_mask = &cpu_topology[cpu].cluster_sibling;
+
return core_mask;
}
--
2.31.1
From: Douglas Miller <doug.miller@cornelisnetworks.com>
Under certain conditions, such as MPI_Abort, the hfi1 cleanup
code may represent the last reference held on the task mm.
hfi1_mmu_rb_unregister() then drops the last reference and the mm is
freed before the final use in hfi1_release_user_pages(). A new task
may allocate the mm structure while it is still being used, resulting in
problems. One manifestation is corruption of the mmap_sem counter leading
to a hang in down_write(). Another is corruption of an mm struct that
is in use by another task.
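The shape of the race and of the fix, as a minimal userspace analogue
(assumed names, plain C standing in for the kernel's mmgrab()/mmdrop()):

  #include <stdio.h>
  #include <stdlib.h>

  /* Refcounted stand-in for struct mm_struct. */
  struct mm { int refcnt; };

  static void mm_grab(struct mm *mm) { mm->refcnt++; }

  static void mm_drop(struct mm *mm)
  {
          if (--mm->refcnt == 0) {
                  printf("mm freed\n");
                  free(mm);
          }
  }

  static void unregister(struct mm *mm)
  {
          /* Like the fix below: pin the mm first, so the reference
           * dropped during teardown cannot be the last one. */
          mm_grab(mm);
          mm_drop(mm);  /* teardown drops what may be the last outside ref */
          printf("still valid: refcnt=%d\n", mm->refcnt); /* no UAF */
          mm_drop(mm);  /* now the mm may be freed */
  }

  int main(void)
  {
          struct mm *mm = malloc(sizeof(*mm));

          mm->refcnt = 1;
          unregister(mm);
          return 0;
  }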
Fixes: 3d2a9d642512 ("IB/hfi1: Ensure correct mm is used at all times")
Cc: <stable@vger.kernel.org>
Signed-off-by: Douglas Miller <doug.miller@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
---
drivers/infiniband/hw/hfi1/mmu_rb.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 876cc78..7333646 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -80,6 +80,9 @@ void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
unsigned long flags;
struct list_head del_list;
+ /* Prevent freeing of mm until we are completely finished. */
+ mmgrab(handler->mn.mm);
+
/* Unregister first so we don't get any more notifications. */
mmu_notifier_unregister(&handler->mn, handler->mn.mm);
@@ -102,6 +105,9 @@ void hfi1_mmu_rb_unregister(struct mmu_rb_handler *handler)
do_remove(handler, &del_list);
+ /* Now the mm may be freed. */
+ mmdrop(handler->mn.mm);
+
kfree(handler);
}
The bug is here:
if (!rdev || rdev->desc_nr != nr) {
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each_rcu(), so it is incorrect to assume that the
iterator value will be NULL if the list is empty or no element is
found (in fact, it will be a bogus pointer computed from the list
HEAD, not a valid struct object). Relying on it can bypass the
check and lead to an invalid memory access.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.
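A minimal userspace sketch of the pitfall (a hand-expanded loop standing
in for rdev_for_each_rcu(); all names here are illustrative):

  #include <stdio.h>
  #include <stddef.h>

  struct list_head { struct list_head *next; };
  struct item { int desc_nr; struct list_head node; };

  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  int main(void)
  {
          struct list_head head;
          struct item r1 = { .desc_nr = 1 }, r2 = { .desc_nr = 2 };
          struct item *rdev, *found = NULL;

          /* Circular list: head -> r1 -> r2 -> head */
          head.next = &r1.node;
          r1.node.next = &r2.node;
          r2.node.next = &head;

          /* Hand-expanded list_for_each_entry-style loop, looking
           * for a desc_nr that is not on the list. */
          for (rdev = container_of(head.next, struct item, node);
               &rdev->node != &head;
               rdev = container_of(rdev->node.next, struct item, node)) {
                  if (rdev->desc_nr == 5) {
                          found = rdev;
                          break;
                  }
          }

          /* On no match the iterator ends up as container_of(&head),
           * a bogus but non-NULL pointer, so 'if (!rdev)' cannot
           * detect the miss; the dedicated 'found' pointer can. */
          printf("rdev  = %p (bogus, never NULL)\n", (void *)rdev);
          printf("found = %p (NULL => not found)\n", (void *)found);
          return 0;
  }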
Cc: stable@vger.kernel.org
Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
---
changes from v2:
- fix typo (Song Liu)
changes from v1:
- rephrase the subject (Guoqing Jiang)
v2: https://lore.kernel.org/lkml/20220328080559.25984-1-xiam0nd.tong@gmail.c…
v1: https://lore.kernel.org/lkml/20220327080111.12028-1-xiam0nd.tong@gmail.c…
---
drivers/md/md.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7476fc204172..f156678c08bc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9794,16 +9794,18 @@ static int read_rdev(struct mddev *mddev, struct md_rdev *rdev)
void md_reload_sb(struct mddev *mddev, int nr)
{
- struct md_rdev *rdev;
+ struct md_rdev *rdev = NULL, *iter;
int err;
/* Find the rdev */
- rdev_for_each_rcu(rdev, mddev) {
- if (rdev->desc_nr == nr)
+ rdev_for_each_rcu(iter, mddev) {
+ if (iter->desc_nr == nr) {
+ rdev = iter;
break;
+ }
}
- if (!rdev || rdev->desc_nr != nr) {
+ if (!rdev) {
pr_warn("%s: %d Could not find rdev with nr %d\n", __func__, __LINE__, nr);
return;
}
--
2.17.1
The bug is here:
if (!rdev)
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element is found.
Relying on it can bypass the NULL check and lead to an invalid
memory access.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org
Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update")
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
---
changes since v2:
- fix typo (Song Liu)
changes since v1:
- rephrase the subject (Guoqing Jiang)
- add Acked-by: for Guoqing Jiang
v2: https://lore.kernel.org/lkml/20220328081127.26148-1-xiam0nd.tong@gmail.c…
v1: https://lore.kernel.org/lkml/20220327080002.11923-1-xiam0nd.tong@gmail.c…
---
drivers/md/md.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4d38bd7dadd6..7476fc204172 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2629,14 +2629,16 @@ static void sync_sbs(struct mddev *mddev, int nospares)
static bool does_sb_need_changing(struct mddev *mddev)
{
- struct md_rdev *rdev;
+ struct md_rdev *rdev = NULL, *iter;
struct mdp_superblock_1 *sb;
int role;
/* Find a good rdev */
- rdev_for_each(rdev, mddev)
- if ((rdev->raid_disk >= 0) && !test_bit(Faulty, &rdev->flags))
+ rdev_for_each(iter, mddev)
+ if ((iter->raid_disk >= 0) && !test_bit(Faulty, &iter->flags)) {
+ rdev = iter;
break;
+ }
/* No good device found. */
if (!rdev)
--
2.17.1
The first U3 wake signal by the host may be lost if the USB 3 connection is
tunneled over USB4, with a runtime-suspended USB4 host and a
firmware-implemented connection manager.
Specs state the host must wait 100ms (tU3WakeupRetryDelay) before resending
a U3 wake signal if the device doesn't respond, leading to U3 -> U0 link
transition times around 270ms in the tunneled case. Increase the U3 exit
completion timeout in xhci_hub_control() from 100ms to 500ms so that the
retried wake signal is covered.
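As a rough timing budget (an inference from the figures above, not a
spec-mandated breakdown): the first wake is lost, the host waits 100ms
(tU3WakeupRetryDelay), resends, and the link then trains to U0, which
accounts for the observed ~270ms total. The old 100ms completion timeout
expired well before this; 500ms leaves headroom for the retry path.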
Cc: stable@vger.kernel.org
Fixes: 0200b9f790b0 ("xhci: Wait until link state trainsits to U0 after setting USB_SS_PORT_LS_U0")
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
drivers/usb/host/xhci-hub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 1e7dc130c39a..f65f1ba2b592 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -1434,7 +1434,7 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
}
spin_unlock_irqrestore(&xhci->lock, flags);
if (!wait_for_completion_timeout(&bus_state->u3exit_done[wIndex],
- msecs_to_jiffies(100)))
+ msecs_to_jiffies(500)))
xhci_dbg(xhci, "missing U0 port change event for port %d-%d\n",
hcd->self.busnum, wIndex + 1);
spin_lock_irqsave(&xhci->lock, flags);
--
2.25.1
From: Henry Lin <henryl@nvidia.com>
While rebooting, the XHCI controller and its bus device are shut down
in order by their .shutdown callbacks. Stopping roothub polling in
xhci_shutdown() prevents the XHCI driver from accessing port status
after its bus device has shut down.
Take a PCIe XHCI controller as an example: if the XHCI driver doesn't
stop roothub polling, it may access PCIe BAR registers for port status
after the parent PCIe root port driver has shut down, causing a PCIe
bus error.
[check shared hcd exist before stopping its roothub polling -Mathias]
Cc: stable@vger.kernel.org
Signed-off-by: Henry Lin <henryl@nvidia.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
drivers/usb/host/xhci.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 642610c78f58..25b87e99b4dd 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -781,6 +781,17 @@ void xhci_shutdown(struct usb_hcd *hcd)
if (xhci->quirks & XHCI_SPURIOUS_REBOOT)
usb_disable_xhci_ports(to_pci_dev(hcd->self.sysdev));
+ /* Don't poll the roothubs after shutdown. */
+ xhci_dbg(xhci, "%s: stopping usb%d port polling.\n",
+ __func__, hcd->self.busnum);
+ clear_bit(HCD_FLAG_POLL_RH, &hcd->flags);
+ del_timer_sync(&hcd->rh_timer);
+
+ if (xhci->shared_hcd) {
+ clear_bit(HCD_FLAG_POLL_RH, &xhci->shared_hcd->flags);
+ del_timer_sync(&xhci->shared_hcd->rh_timer);
+ }
+
spin_lock_irq(&xhci->lock);
xhci_halt(xhci);
/* Workaround for spurious wakeups at shutdown with HSW */
--
2.25.1