Dear Linux folks,
Could you please apply commit 0c25422d34b4 (scsi: mpt3sas: Remove
scsi_dma_map() error messages) to the 5.15.y series?
commit 0c25422d34b4726b2707d5f38560943155a91b80
Author: Sreekanth Reddy <sreekanth.reddy(a)broadcom.com>
Date: Thu Mar 3 19:32:03 2022 +0530
scsi: mpt3sas: Remove scsi_dma_map() error messages
When scsi_dma_map() fails by returning a sges_left value less than
zero,
the amount of logging produced can be extremely high. In a recent
end-user
environment, 1200 messages per second were being sent to the log
buffer.
This eventually overwhelmed the system and it stalled.
These error messages are not needed. Remove them.
Link:
https://lore.kernel.org/r/20220303140203.12642-1-sreekanth.reddy@broadcom.c…
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Sreekanth Reddy <sreekanth.reddy(a)broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
We see this regression after upgrading from Linux 5.10 to 5.15 on our
file servers with Broadcom/LSI SAS3008 PCI-Express Fusion-MPT SAS-3
(mpt3sas) – though luckily our systems do not stall/crash.
The commit message does not say anything about, what commit caused these
error to be appearing – the log statements have been there since
v4.20-rc1, if I am not mistaken, so it must be something else –, and
also do not mention, why these log messages are not needed, but the new
error condition is actually expected.
In the Canonical/Ubuntu bug tracker I found the explanation below [2].
> 2. mpt3sas: Remove scsi_dma_map errors messages:
> When driver set the DMA mask to 32bit then we observe that the
> SWIOTLB bounce buffers are getting exhausted quickly. For most of the
> IOs driver observe that scsi_dma_map() API returned with failure
> status and hence driver was printing below error message. Since this
> error message is getting printed per IO and if user issues heavy IOs
> then we observe that kernel overwhelmed with this error message. Also
> we will observe the kernel panic when the serial console is enabled.
> So to limit this issue, we removed this error message though this
> patch.
> "scsi_dma_map failed: request for 1310720 bytes!"
The Launchpad issue was created in March 2022, and the fixed Linux
kernel package 5.15.0-53.59 for Ubuntu 22.04 released on November 15th,
2022.
Sreekanth, looking again, you are the patch author, one of the Broadcom
maintainers (LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI)) and created the
Launchpad bug report. I am surprised you didn’t get it backported upstream.
Kind regards,
Paul
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
[2]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1965927
"[Ubuntu 22.04] mpt3sas: Request to include latest bug fix patches"
This patch is fix for Linux kernel v2.6.33 or later.
For request subaction to IEC 61883-1 FCP region, Linux FireWire subsystem
have had an issue of use-after-free. The subsystem allows multiple
user space listeners to the region, while data of the payload was likely
released before the listeners execute read(2) to access to it for copying
to user space.
The issue was fixed by a commit 281e20323ab7 ("firewire: core: fix
use-after-free regression in FCP handler"). The object of payload is
duplicated in kernel space for each listener. When the listener executes
ioctl(2) with FW_CDEV_IOC_SEND_RESPONSE request, the object is going to
be released.
However, it causes memory leak since the commit relies on call of
release_request() in drivers/firewire/core-cdev.c. Against the
expectation, the function is never called due to the design of
release_client_resource(). The function delegates release task
to caller when called with non-NULL fourth argument. The implementation
of ioctl_send_response() is the case. It should release the object
explicitly.
This commit fixes the bug.
Cc: <stable(a)vger.kernel.org>
Fixes: 281e20323ab7 ("firewire: core: fix use-after-free regression in FCP handler")
Signed-off-by: Takashi Sakamoto <o-takashi(a)sakamocchi.jp>
---
drivers/firewire/core-cdev.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
index 9c89f7d53e99..958aa4662ccb 100644
--- a/drivers/firewire/core-cdev.c
+++ b/drivers/firewire/core-cdev.c
@@ -819,8 +819,10 @@ static int ioctl_send_response(struct client *client, union ioctl_arg *arg)
r = container_of(resource, struct inbound_transaction_resource,
resource);
- if (is_fcp_request(r->request))
+ if (is_fcp_request(r->request)) {
+ kfree(r->data);
goto out;
+ }
if (a->length != fw_get_response_length(r->request)) {
ret = -EINVAL;
--
2.37.2
From: Chris Wilson <chris.p.wilson(a)linux.intel.com>
Make sure that upon error after we have acquired the wakeref we do
release it again.
Fixes: 027c38b4121e ("drm/i915/selftests: Grab the runtime pm in shrink_thp")
Reviewed-by: Matthew Auld <matthew.auld(a)intel.com>
Signed-off-by: Chris Wilson <chris.p.wilson(a)linux.intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.0+
---
drivers/gpu/drm/i915/gem/selftests/huge_pages.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index c281b0ec9e05..295d6f2cc4ff 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -1855,7 +1855,7 @@ static int igt_shrink_thp(void *arg)
I915_SHRINK_ACTIVE);
i915_vma_unpin(vma);
if (err)
- goto out_put;
+ goto out_wf;
/*
* Now that the pages are *unpinned* shrinking should invoke
@@ -1871,7 +1871,7 @@ static int igt_shrink_thp(void *arg)
pr_err("unexpected pages mismatch, should_swap=%s\n",
str_yes_no(should_swap));
err = -EINVAL;
- goto out_put;
+ goto out_wf;
}
if (should_swap == (obj->mm.page_sizes.sg || obj->mm.page_sizes.phys)) {
@@ -1883,7 +1883,7 @@ static int igt_shrink_thp(void *arg)
err = i915_vma_pin(vma, 0, 0, flags);
if (err)
- goto out_put;
+ goto out_wf;
while (n--) {
err = cpu_check(obj, n, 0xdeadbeaf);
--
2.39.0
If we split a bio marked with REQ_NOWAIT, then we can trigger spurious
EAGAIN if constituent parts of that split bio end up failing request
allocations. Parts will complete just fine, but just a single failure
in one of the chained bios will yield an EAGAIN final result for the
parent bio.
Return EAGAIN early if we end up needing to split such a bio, which
allows for saner recovery handling.
Cc: stable(a)vger.kernel.org # 5.15+
Link: https://github.com/axboe/liburing/issues/766
Reported-by: Michael Kelley <mikelley(a)microsoft.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
---
block/blk-merge.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 071c5f8cf0cf..b7c193d67185 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,6 +309,16 @@ static struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
*segs = nsegs;
return NULL;
split:
+ /*
+ * We can't sanely support splitting for a REQ_NOWAIT bio. End it
+ * with EAGAIN if splitting is required and return an error pointer.
+ */
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio->bi_status = BLK_STS_AGAIN;
+ bio_endio(bio);
+ return ERR_PTR(-EAGAIN);
+ }
+
*segs = nsegs;
/*
--
2.39.0
The patch titled
Subject: mm: multi-gen LRU: fix crash during cgroup migration
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-multi-gen-lru-fix-crash-during-cgroup-migration.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm: multi-gen LRU: fix crash during cgroup migration
Date: Sun, 15 Jan 2023 20:44:05 -0700
lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself. This
isn't true for the following scenario:
CPU 1 CPU 2
clone()
cgroup_can_fork()
cgroup_procs_write()
cgroup_post_fork()
task_lock()
lru_gen_migrate_mm()
task_unlock()
task_lock()
lru_gen_add_mm()
task_unlock()
And when the above happens, kernel crashes because of linked list
corruption (mm_struct->lru_gen.list).
Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/
Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: msizanoen <msizanoen(a)qtmlabs.xyz>
Tested-by: msizanoen <msizanoen(a)qtmlabs.xyz>
Cc: <stable(a)vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/vmscan.c~mm-multi-gen-lru-fix-crash-during-cgroup-migration
+++ a/mm/vmscan.c
@@ -3323,13 +3323,16 @@ void lru_gen_migrate_mm(struct mm_struct
if (mem_cgroup_disabled())
return;
+ /* migration can happen before addition */
+ if (!mm->lru_gen.memcg)
+ return;
+
rcu_read_lock();
memcg = mem_cgroup_from_task(task);
rcu_read_unlock();
if (memcg == mm->lru_gen.memcg)
return;
- VM_WARN_ON_ONCE(!mm->lru_gen.memcg);
VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
lru_gen_del_mm(mm);
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-multi-gen-lru-fix-crash-during-cgroup-migration.patch
mm-multi-gen-lru-rename-lru_gen_struct-to-lru_gen_folio.patch
mm-multi-gen-lru-rename-lrugen-lists-to-lrugen-folios.patch
mm-multi-gen-lru-remove-eviction-fairness-safeguard.patch
mm-multi-gen-lru-remove-aging-fairness-safeguard.patch
mm-multi-gen-lru-shuffle-should_run_aging.patch
mm-multi-gen-lru-per-node-lru_gen_folio-lists.patch
mm-multi-gen-lru-clarify-scan_control-flags.patch
mm-multi-gen-lru-simplify-arch_has_hw_pte_young-check.patch
mm-add-vma_has_recency.patch
mm-support-posix_fadv_noreuse.patch
There is a lock inversion and rwsem read-lock recursion in the devfreq
target callback which can lead to deadlocks.
Specifically, ufshcd_devfreq_scale() already holds a clk_scaling_lock
read lock when toggling the write booster, which involves taking the
dev_cmd mutex before taking another clk_scaling_lock read lock.
This can lead to a deadlock if another thread:
1) tries to acquire the dev_cmd and clk_scaling locks in the correct
order, or
2) takes a clk_scaling write lock before the attempt to take the
clk_scaling read lock a second time.
Fix this by dropping the clk_scaling_lock before toggling the write
booster as was done before commit 0e9d4ca43ba8 ("scsi: ufs: Protect some
contexts from unexpected clock scaling").
While the devfreq callbacks are already serialised, add a second
serialising mutex to handle the unlikely case where a callback triggered
through the devfreq sysfs interface is racing with a request to disable
clock scaling through the UFS controller 'clkscale_enable' sysfs
attribute. This could otherwise lead to the write booster being left
disabled after having disabled clock scaling.
Also take the new mutex in ufshcd_clk_scaling_allow() to make sure that
any pending write booster update has completed on return.
Note that this currently only affects Qualcomm platforms since commit
87bd05016a64 ("scsi: ufs: core: Allow host driver to disable wb toggling
during clock scaling").
The lock inversion (i.e. 1 above) was reported by lockdep as:
======================================================
WARNING: possible circular locking dependency detected
6.1.0-next-20221216 #211 Not tainted
------------------------------------------------------
kworker/u16:2/71 is trying to acquire lock:
ffff076280ba98a0 (&hba->dev_cmd.lock){+.+.}-{3:3}, at: ufshcd_query_flag+0x50/0x1c0
but task is already holding lock:
ffff076280ba9cf0 (&hba->clk_scaling_lock){++++}-{3:3}, at: ufshcd_devfreq_scale+0x2b8/0x380
which lock already depends on the new lock.
[ +0.011606]
the existing dependency chain (in reverse order) is:
-> #1 (&hba->clk_scaling_lock){++++}-{3:3}:
lock_acquire+0x68/0x90
down_read+0x58/0x80
ufshcd_exec_dev_cmd+0x70/0x2c0
ufshcd_verify_dev_init+0x68/0x170
ufshcd_probe_hba+0x398/0x1180
ufshcd_async_scan+0x30/0x320
async_run_entry_fn+0x34/0x150
process_one_work+0x288/0x6c0
worker_thread+0x74/0x450
kthread+0x118/0x120
ret_from_fork+0x10/0x20
-> #0 (&hba->dev_cmd.lock){+.+.}-{3:3}:
__lock_acquire+0x12a0/0x2240
lock_acquire.part.0+0xcc/0x220
lock_acquire+0x68/0x90
__mutex_lock+0x98/0x430
mutex_lock_nested+0x2c/0x40
ufshcd_query_flag+0x50/0x1c0
ufshcd_query_flag_retry+0x64/0x100
ufshcd_wb_toggle+0x5c/0x120
ufshcd_devfreq_scale+0x2c4/0x380
ufshcd_devfreq_target+0xf4/0x230
devfreq_set_target+0x84/0x2f0
devfreq_update_target+0xc4/0xf0
devfreq_monitor+0x38/0x1f0
process_one_work+0x288/0x6c0
worker_thread+0x74/0x450
kthread+0x118/0x120
ret_from_fork+0x10/0x20
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&hba->clk_scaling_lock);
lock(&hba->dev_cmd.lock);
lock(&hba->clk_scaling_lock);
lock(&hba->dev_cmd.lock);
*** DEADLOCK ***
Fixes: 0e9d4ca43ba8 ("scsi: ufs: Protect some contexts from unexpected clock scaling")
Cc: stable(a)vger.kernel.org # 5.12
Cc: Can Guo <quic_cang(a)quicinc.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
This issue has apparently been known for over a year [1] without anyone
bothering to fix it. Aside from the potential deadlocks, this also leads
to developers using Qualcomm platforms not being able to use lockdep to
prevent further issues like this from being introduced in other places.
Johan
[1] https://lore.kernel.org/lkml/1631843521-2863-1-git-send-email-cang@codeauro…
drivers/ufs/core/ufshcd.c | 29 +++++++++++++++--------------
include/ufs/ufshcd.h | 2 ++
2 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index bda61be5f035..5c3821b2fcf8 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -1234,12 +1234,14 @@ static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
* clock scaling is in progress
*/
ufshcd_scsi_block_requests(hba);
+ mutex_lock(&hba->wb_mutex);
down_write(&hba->clk_scaling_lock);
if (!hba->clk_scaling.is_allowed ||
ufshcd_wait_for_doorbell_clr(hba, DOORBELL_CLR_TOUT_US)) {
ret = -EBUSY;
up_write(&hba->clk_scaling_lock);
+ mutex_unlock(&hba->wb_mutex);
ufshcd_scsi_unblock_requests(hba);
goto out;
}
@@ -1251,12 +1253,16 @@ static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
return ret;
}
-static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba, bool writelock)
+static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba, bool scale_up)
{
- if (writelock)
- up_write(&hba->clk_scaling_lock);
- else
- up_read(&hba->clk_scaling_lock);
+ up_write(&hba->clk_scaling_lock);
+
+ /* Enable Write Booster if we have scaled up else disable it */
+ if (ufshcd_enable_wb_if_scaling_up(hba))
+ ufshcd_wb_toggle(hba, scale_up);
+
+ mutex_unlock(&hba->wb_mutex);
+
ufshcd_scsi_unblock_requests(hba);
ufshcd_release(hba);
}
@@ -1273,7 +1279,6 @@ static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba, bool writelock)
static int ufshcd_devfreq_scale(struct ufs_hba *hba, bool scale_up)
{
int ret = 0;
- bool is_writelock = true;
ret = ufshcd_clock_scaling_prepare(hba);
if (ret)
@@ -1302,15 +1307,8 @@ static int ufshcd_devfreq_scale(struct ufs_hba *hba, bool scale_up)
}
}
- /* Enable Write Booster if we have scaled up else disable it */
- if (ufshcd_enable_wb_if_scaling_up(hba)) {
- downgrade_write(&hba->clk_scaling_lock);
- is_writelock = false;
- ufshcd_wb_toggle(hba, scale_up);
- }
-
out_unprepare:
- ufshcd_clock_scaling_unprepare(hba, is_writelock);
+ ufshcd_clock_scaling_unprepare(hba, scale_up);
return ret;
}
@@ -6066,9 +6064,11 @@ static void ufshcd_force_error_recovery(struct ufs_hba *hba)
static void ufshcd_clk_scaling_allow(struct ufs_hba *hba, bool allow)
{
+ mutex_lock(&hba->wb_mutex);
down_write(&hba->clk_scaling_lock);
hba->clk_scaling.is_allowed = allow;
up_write(&hba->clk_scaling_lock);
+ mutex_unlock(&hba->wb_mutex);
}
static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
@@ -9793,6 +9793,7 @@ int ufshcd_init(struct ufs_hba *hba, void __iomem *mmio_base, unsigned int irq)
/* Initialize mutex for exception event control */
mutex_init(&hba->ee_ctrl_mutex);
+ mutex_init(&hba->wb_mutex);
init_rwsem(&hba->clk_scaling_lock);
ufshcd_init_clk_gating(hba);
diff --git a/include/ufs/ufshcd.h b/include/ufs/ufshcd.h
index 5cf81dff60aa..727084cd79be 100644
--- a/include/ufs/ufshcd.h
+++ b/include/ufs/ufshcd.h
@@ -808,6 +808,7 @@ struct ufs_hba_monitor {
* @urgent_bkops_lvl: keeps track of urgent bkops level for device
* @is_urgent_bkops_lvl_checked: keeps track if the urgent bkops level for
* device is known or not.
+ * @wb_mutex: used to serialize devfreq and sysfs write booster toggling
* @clk_scaling_lock: used to serialize device commands and clock scaling
* @desc_size: descriptor sizes reported by device
* @scsi_block_reqs_cnt: reference counting for scsi block requests
@@ -951,6 +952,7 @@ struct ufs_hba {
enum bkops_status urgent_bkops_lvl;
bool is_urgent_bkops_lvl_checked;
+ struct mutex wb_mutex;
struct rw_semaphore clk_scaling_lock;
unsigned char desc_size[QUERY_DESC_IDN_MAX];
atomic_t scsi_block_reqs_cnt;
--
2.37.4
[Public]
Hi,
A patch was introduced into 6.2 that will allow outputting what GPIO is active at a given time for pinctrl-amd with dynamic debugging. This patch is useful particularly for debugging the wakeup source when it says that it's the GPIO controller.
I've asked at least 4 people to apply it on 6.1.y kernels and turn on those statements in /sys/kernel/debug/dynamic_debug/control to debug issues of spurious wakeups which has me realizing it would make sense to just generally being in 6.1.y.
Would you please cherry-pick this commit?
1d66e379731f ("pinctrl: amd: Add dynamic debugging for active GPIOs")
Thanks,
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
544d163d659d ("io_uring: lock overflowing for IOPOLL")
a8cf95f93610 ("io_uring: fix overflow handling regression")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
52120f0fadcb ("io_uring: add allow_overflow to io_post_aux_cqe")
e6130eba8a84 ("io_uring: add support for passing fixed file descriptors")
253993210bd8 ("io_uring: introduce locking helpers for CQE posting")
305bef988708 ("io_uring: hide eventfd assumptions in eventfd paths")
d9dee4302a7c ("io_uring: remove ->flush_cqes optimisation")
a830ffd28780 ("io_uring: move io_eventfd_signal()")
9046c6415be6 ("io_uring: reshuffle io_uring/io_uring.h")
d142c3ec8d16 ("io_uring: remove extra io_commit_cqring()")
68494a65d0e2 ("io_uring: introduce io_req_cqe_overflow()")
faf88dde060f ("io_uring: don't inline __io_get_cqe()")
d245bca6375b ("io_uring: don't expose io_fill_cqe_aux()")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 544d163d659d45a206d8929370d5a2984e546cb7 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 12 Jan 2023 13:08:56 +0000
Subject: [PATCH] io_uring: lock overflowing for IOPOLL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8227af2e1c0f..9c3ddd46a1ad 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1062,7 +1062,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
continue;
req->cqe.flags = io_put_kbuf(req, 0);
- io_fill_cqe_req(req->ctx, req);
+ if (unlikely(!__io_fill_cqe_req(ctx, req))) {
+ spin_lock(&ctx->completion_lock);
+ io_req_cqe_overflow(req);
+ spin_unlock(&ctx->completion_lock);
+ }
}
if (unlikely(!nr_events))
On Mon, Jan 16, 2023 at 06:12:12AM +0530, mkv22(a)cantab.net wrote:
> Apologies that I am unable to attach more detailed information from dmesg
> or logs as the update was in a production laptop and had to be rolled back
> to 6.1.5 immediately and lost the dmesg output.
Any chance to run 'git bisect' to find the offending change?
thanks,
greg k-h
USB3 ports on xHC hosts may have retimers that cause too long
exit latency to work with native USB3 U1/U2 link power management states.
For now only use usb_acpi_port_lpm_incapable() to evaluate if port lpm
should be disabled while setting up the USB3 roothub.
Other ways to identify lpm incapable ports can be added here later if
ACPI _DSM does not exist.
Limit this to Intel hosts for now, this is to my knowledge only
an Intel issue.
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-pci.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index b5016709b26f..fb988e4ea924 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -355,8 +355,38 @@ static void xhci_pme_acpi_rtd3_enable(struct pci_dev *dev)
NULL);
ACPI_FREE(obj);
}
+
+static void xhci_find_lpm_incapable_ports(struct usb_hcd *hcd, struct usb_device *hdev)
+{
+ struct xhci_hcd *xhci = hcd_to_xhci(hcd);
+ struct xhci_hub *rhub = &xhci->usb3_rhub;
+ int ret;
+ int i;
+
+ /* This is not the usb3 roothub we are looking for */
+ if (hcd != rhub->hcd)
+ return;
+
+ if (hdev->maxchild > rhub->num_ports) {
+ dev_err(&hdev->dev, "USB3 roothub port number mismatch\n");
+ return;
+ }
+
+ for (i = 0; i < hdev->maxchild; i++) {
+ ret = usb_acpi_port_lpm_incapable(hdev, i);
+
+ dev_dbg(&hdev->dev, "port-%d disable U1/U2 _DSM: %d\n", i + 1, ret);
+
+ if (ret >= 0) {
+ rhub->ports[i]->lpm_incapable = ret;
+ continue;
+ }
+ }
+}
+
#else
static void xhci_pme_acpi_rtd3_enable(struct pci_dev *dev) { }
+static void xhci_find_lpm_incapable_ports(struct usb_hcd *hcd, struct usb_device *hdev) { }
#endif /* CONFIG_ACPI */
/* called during probe() after chip reset completes */
@@ -392,6 +422,10 @@ static int xhci_pci_setup(struct usb_hcd *hcd)
static int xhci_pci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
struct usb_tt *tt, gfp_t mem_flags)
{
+ /* Check if acpi claims some USB3 roothub ports are lpm incapable */
+ if (!hdev->parent)
+ xhci_find_lpm_incapable_ports(hcd, hdev);
+
return xhci_update_hub_device(hcd, hdev, tt, mem_flags);
}
--
2.25.1
Add a helper to evaluate ACPI usb device specific method (_DSM) provided
in case the USB3 port shouldn't enter U1 and U2 link states.
This _DSM was added as port specific retimer configuration may lead to
exit latencies growing beyond U1/U2 exit limits, and OS needs a way to
find which ports can't support U1/U2 link power management states.
This _DSM is also used by windows:
Link: https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/usb-devic…
Some patch issues found in testing resolved by Ron Lee
Cc: stable(a)vger.kernel.org
Tested-by: Ron Lee <ron.lee(a)intel.com>
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/core/usb-acpi.c | 65 +++++++++++++++++++++++++++++++++++++
include/linux/usb.h | 3 ++
2 files changed, 68 insertions(+)
diff --git a/drivers/usb/core/usb-acpi.c b/drivers/usb/core/usb-acpi.c
index 6d93428432f1..533baa85083c 100644
--- a/drivers/usb/core/usb-acpi.c
+++ b/drivers/usb/core/usb-acpi.c
@@ -37,6 +37,71 @@ bool usb_acpi_power_manageable(struct usb_device *hdev, int index)
}
EXPORT_SYMBOL_GPL(usb_acpi_power_manageable);
+#define UUID_USB_CONTROLLER_DSM "ce2ee385-00e6-48cb-9f05-2edb927c4899"
+#define USB_DSM_DISABLE_U1_U2_FOR_PORT 5
+
+/**
+ * usb_acpi_port_lpm_incapable - check if lpm should be disabled for a port.
+ * @hdev: USB device belonging to the usb hub
+ * @index: zero based port index
+ *
+ * Some USB3 ports may not support USB3 link power management U1/U2 states
+ * due to different retimer setup. ACPI provides _DSM method which returns 0x01
+ * if U1 and U2 states should be disabled. Evaluate _DSM with:
+ * Arg0: UUID = ce2ee385-00e6-48cb-9f05-2edb927c4899
+ * Arg1: Revision ID = 0
+ * Arg2: Function Index = 5
+ * Arg3: (empty)
+ *
+ * Return 1 if USB3 port is LPM incapable, negative on error, otherwise 0
+ */
+
+int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index)
+{
+ union acpi_object *obj;
+ acpi_handle port_handle;
+ int port1 = index + 1;
+ guid_t guid;
+ int ret;
+
+ ret = guid_parse(UUID_USB_CONTROLLER_DSM, &guid);
+ if (ret)
+ return ret;
+
+ port_handle = usb_get_hub_port_acpi_handle(hdev, port1);
+ if (!port_handle) {
+ dev_dbg(&hdev->dev, "port-%d no acpi handle\n", port1);
+ return -ENODEV;
+ }
+
+ if (!acpi_check_dsm(port_handle, &guid, 0,
+ BIT(USB_DSM_DISABLE_U1_U2_FOR_PORT))) {
+ dev_dbg(&hdev->dev, "port-%d no _DSM function %d\n",
+ port1, USB_DSM_DISABLE_U1_U2_FOR_PORT);
+ return -ENODEV;
+ }
+
+ obj = acpi_evaluate_dsm(port_handle, &guid, 0,
+ USB_DSM_DISABLE_U1_U2_FOR_PORT, NULL);
+
+ if (!obj)
+ return -ENODEV;
+
+ if (obj->type != ACPI_TYPE_INTEGER) {
+ dev_dbg(&hdev->dev, "evaluate port-%d _DSM failed\n", port1);
+ ACPI_FREE(obj);
+ return -EINVAL;
+ }
+
+ if (obj->integer.value == 0x01)
+ ret = 1;
+
+ ACPI_FREE(obj);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(usb_acpi_port_lpm_incapable);
+
/**
* usb_acpi_set_power_state - control usb port's power via acpi power
* resource
diff --git a/include/linux/usb.h b/include/linux/usb.h
index 7d5325d47c45..04a7e94fb772 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -774,11 +774,14 @@ extern struct device *usb_intf_get_dma_device(struct usb_interface *intf);
extern int usb_acpi_set_power_state(struct usb_device *hdev, int index,
bool enable);
extern bool usb_acpi_power_manageable(struct usb_device *hdev, int index);
+extern int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index);
#else
static inline int usb_acpi_set_power_state(struct usb_device *hdev, int index,
bool enable) { return 0; }
static inline bool usb_acpi_power_manageable(struct usb_device *hdev, int index)
{ return true; }
+static inline int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index)
+ { return 0; }
#endif
/* USB autosuspend and autoresume */
--
2.25.1
One USB3 roothub port may support link power management, while another
root port on the same xHC can't due to different retimers used for
the ports.
This is the case with Intel Alder Lake, and possible future platforms
where retimers used for USB4 ports cause too long exit latecy to
enable native USB3 lpm U1 and U2 states.
Add a flag in the xhci port structure to indicate if the port is
lpm_incapable, and check it while calculating exit latency.
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci.c | 8 ++++++++
drivers/usb/host/xhci.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 89f92fc78bb1..2b280beb0011 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -5049,6 +5049,7 @@ static int xhci_enable_usb3_lpm_timeout(struct usb_hcd *hcd,
struct usb_device *udev, enum usb3_link_state state)
{
struct xhci_hcd *xhci;
+ struct xhci_port *port;
u16 hub_encoded_timeout;
int mel;
int ret;
@@ -5065,6 +5066,13 @@ static int xhci_enable_usb3_lpm_timeout(struct usb_hcd *hcd,
if (xhci_check_tier_policy(xhci, udev, state) < 0)
return USB3_LPM_DISABLED;
+ /* If connected to root port then check port can handle lpm */
+ if (udev->parent && !udev->parent->parent) {
+ port = xhci->usb3_rhub.ports[udev->portnum - 1];
+ if (port->lpm_incapable)
+ return USB3_LPM_DISABLED;
+ }
+
hub_encoded_timeout = xhci_calculate_lpm_timeout(hcd, udev, state);
mel = calculate_max_exit_latency(udev, state, hub_encoded_timeout);
if (mel < 0) {
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 3edfacb93817..dcee7f3207ad 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1735,6 +1735,7 @@ struct xhci_port {
int hcd_portnum;
struct xhci_hub *rhub;
struct xhci_port_cap *port_cap;
+ unsigned int lpm_incapable:1;
};
struct xhci_hub {
--
2.25.1
Allow PCI hosts to check and tune roothub and port settings
before the hub is up and running.
This override is needed to turn off U1 and U2 LPM for some ports
based on per port ACPI _DSM, _UPC, or possibly vendor specific mmio
values for Intel xHC hosts.
Usb core calls the host update_hub_device once it creates a hub.
Entering U1 or U2 link power save state on ports with this limitation
will cause link to fail, turning the usb device unusable in that setup.
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-pci.c | 9 +++++++++
drivers/usb/host/xhci.c | 5 ++++-
drivers/usb/host/xhci.h | 4 ++++
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 2c0d7038f040..b5016709b26f 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -78,9 +78,12 @@ static const char hcd_name[] = "xhci_hcd";
static struct hc_driver __read_mostly xhci_pci_hc_driver;
static int xhci_pci_setup(struct usb_hcd *hcd);
+static int xhci_pci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags);
static const struct xhci_driver_overrides xhci_pci_overrides __initconst = {
.reset = xhci_pci_setup,
+ .update_hub_device = xhci_pci_update_hub_device,
};
/* called after powerup, by probe or system-pm "wakeup" */
@@ -386,6 +389,12 @@ static int xhci_pci_setup(struct usb_hcd *hcd)
return xhci_pci_reinit(xhci, pdev);
}
+static int xhci_pci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags)
+{
+ return xhci_update_hub_device(hcd, hdev, tt, mem_flags);
+}
+
/*
* We need to register our own PCI probe function (instead of the USB core's
* function) in order to create a second roothub under xHCI.
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 50b41213e827..89f92fc78bb1 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -5124,7 +5124,7 @@ static int xhci_disable_usb3_lpm_timeout(struct usb_hcd *hcd,
/* Once a hub descriptor is fetched for a device, we need to update the xHC's
* internal data structures for the device.
*/
-static int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
struct usb_tt *tt, gfp_t mem_flags)
{
struct xhci_hcd *xhci = hcd_to_xhci(hcd);
@@ -5224,6 +5224,7 @@ static int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
xhci_free_command(xhci, config_cmd);
return ret;
}
+EXPORT_SYMBOL_GPL(xhci_update_hub_device);
static int xhci_get_frame(struct usb_hcd *hcd)
{
@@ -5507,6 +5508,8 @@ void xhci_init_driver(struct hc_driver *drv,
drv->check_bandwidth = over->check_bandwidth;
if (over->reset_bandwidth)
drv->reset_bandwidth = over->reset_bandwidth;
+ if (over->update_hub_device)
+ drv->update_hub_device = over->update_hub_device;
}
}
EXPORT_SYMBOL_GPL(xhci_init_driver);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index c9f06c5e4e9d..3edfacb93817 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1943,6 +1943,8 @@ struct xhci_driver_overrides {
struct usb_host_endpoint *ep);
int (*check_bandwidth)(struct usb_hcd *, struct usb_device *);
void (*reset_bandwidth)(struct usb_hcd *, struct usb_device *);
+ int (*update_hub_device)(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags);
};
#define XHCI_CFC_DELAY 10
@@ -2122,6 +2124,8 @@ int xhci_drop_endpoint(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep);
int xhci_check_bandwidth(struct usb_hcd *hcd, struct usb_device *udev);
void xhci_reset_bandwidth(struct usb_hcd *hcd, struct usb_device *udev);
+int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags);
int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id);
int xhci_ext_cap_init(struct xhci_hcd *xhci);
--
2.25.1
Make sure xhci_free_dev() and xhci_kill_endpoint_urbs() do not race
and cause null pointer dereference when host suddenly dies.
Usb core may call xhci_free_dev() which frees the xhci->devs[slot_id]
virt device at the same time that xhci_kill_endpoint_urbs() tries to
loop through all the device's endpoints, checking if there are any
cancelled urbs left to give back.
hold the xhci spinlock while freeing the virt device
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 79d7931c048a..50b41213e827 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -3974,6 +3974,7 @@ static void xhci_free_dev(struct usb_hcd *hcd, struct usb_device *udev)
struct xhci_hcd *xhci = hcd_to_xhci(hcd);
struct xhci_virt_device *virt_dev;
struct xhci_slot_ctx *slot_ctx;
+ unsigned long flags;
int i, ret;
/*
@@ -4000,7 +4001,11 @@ static void xhci_free_dev(struct usb_hcd *hcd, struct usb_device *udev)
virt_dev->eps[i].ep_state &= ~EP_STOP_CMD_PENDING;
virt_dev->udev = NULL;
xhci_disable_slot(xhci, udev->slot_id);
+
+ spin_lock_irqsave(&xhci->lock, flags);
xhci_free_virt_device(xhci, udev->slot_id);
+ spin_unlock_irqrestore(&xhci->lock, flags);
+
}
int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
--
2.25.1
Set the framebuffer info for drivers that support VGA switcheroo. Only
affects the amdgpu and nouveau drivers, which use VGA switcheroo and
generic fbdev emulation. For other drivers, this does nothing.
This fixes a potential regression in the console code. Both, amdgpu and
nouveau, invoked vga_switcheroo_client_fb_set() from their internal fbdev
code. But the call got lost when the drivers switched to the generic
emulation.
Fixes: 087451f372bf ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Fixes: 4a16dd9d18a0 ("drm/nouveau/kms: switch to drm fbdev helpers")
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Ben Skeggs <bskeggs(a)redhat.com>
Cc: Karol Herbst <kherbst(a)redhat.com>
Cc: Lyude Paul <lyude(a)redhat.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Javier Martinez Canillas <javierm(a)redhat.com>
Cc: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: Dave Airlie <airlied(a)redhat.com>
Cc: Evan Quan <evan.quan(a)amd.com>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Hawking Zhang <Hawking.Zhang(a)amd.com>
Cc: Likun Gao <Likun.Gao(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Stanley Yang <Stanley.Yang(a)amd.com>
Cc: "Tianci.Yin" <tianci.yin(a)amd.com>
Cc: Xiaojian Du <Xiaojian.Du(a)amd.com>
Cc: Andrey Grodzovsky <andrey.grodzovsky(a)amd.com>
Cc: YiPeng Chai <YiPeng.Chai(a)amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram(a)amd.com>
Cc: Bokun Zhang <Bokun.Zhang(a)amd.com>
Cc: Guchun Chen <guchun.chen(a)amd.com>
Cc: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Solomon Chiu <solomon.chiu(a)amd.com>
Cc: Kai-Heng Feng <kai.heng.feng(a)canonical.com>
Cc: Felix Kuehling <Felix.Kuehling(a)amd.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: "Marek Olšák" <marek.olsak(a)amd.com>
Cc: Sam Ravnborg <sam(a)ravnborg.org>
Cc: Hans de Goede <hdegoede(a)redhat.com>
Cc: "Ville Syrjälä" <ville.syrjala(a)linux.intel.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: nouveau(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v5.17+
---
drivers/gpu/drm/drm_fb_helper.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index 367fb8b2d5fa..c5c13e192b64 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -30,7 +30,9 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/console.h>
+#include <linux/pci.h>
#include <linux/sysrq.h>
+#include <linux/vga_switcheroo.h>
#include <drm/drm_atomic.h>
#include <drm/drm_drv.h>
@@ -1924,6 +1926,7 @@ static int drm_fb_helper_single_fb_probe(struct drm_fb_helper *fb_helper,
int preferred_bpp)
{
struct drm_client_dev *client = &fb_helper->client;
+ struct drm_device *dev = fb_helper->dev;
struct drm_fb_helper_surface_size sizes;
int ret;
@@ -1945,6 +1948,11 @@ static int drm_fb_helper_single_fb_probe(struct drm_fb_helper *fb_helper,
return ret;
strcpy(fb_helper->fb->comm, "[fbcon]");
+
+ /* Set the fb info for vgaswitcheroo clients. Does nothing otherwise. */
+ if (dev_is_pci(dev->dev))
+ vga_switcheroo_client_fb_set(to_pci_dev(dev->dev), fb_helper->info);
+
return 0;
}
--
2.39.0
Always allow switching away via vga-switcheroo if the display is
uninitalized. Instead prevent switching to i915 if the device has
not been initialized.
This issue was introduced by commit 5df7bd130818 ("drm/i915: skip
display initialization when there is no display") protected, which
protects code paths from being executed on uninitialized devices.
In the case of vga-switcheroo, we want to allow a switch away from
i915's device. So run vga_switcheroo_process_delayed_switch() and
test in the switcheroo callbacks if the i915 device is available.
Fixes: 5df7bd130818 ("drm/i915: skip display initialization when there is no display")
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Radhakrishna Sripada <radhakrishna.sripada(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: José Roberto de Souza <jose.souza(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)linux.intel.com>
Cc: "Ville Syrjälä" <ville.syrjala(a)linux.intel.com>
Cc: Manasi Navare <manasi.d.navare(a)intel.com>
Cc: Stanislav Lisovskiy <stanislav.lisovskiy(a)intel.com>
Cc: Imre Deak <imre.deak(a)intel.com>
Cc: "Jouni Högander" <jouni.hogander(a)intel.com>
Cc: Uma Shankar <uma.shankar(a)intel.com>
Cc: Ankit Nautiyal <ankit.k.nautiyal(a)intel.com>
Cc: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Cc: Matt Roper <matthew.d.roper(a)intel.com>
Cc: Ramalingam C <ramalingam.c(a)intel.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Andi Shyti <andi.shyti(a)linux.intel.com>
Cc: Andrzej Hajda <andrzej.hajda(a)intel.com>
Cc: "José Roberto de Souza" <jose.souza(a)intel.com>
Cc: Julia Lawall <Julia.Lawall(a)inria.fr>
Cc: intel-gfx(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v5.14+
---
drivers/gpu/drm/i915/i915_driver.c | 3 +--
drivers/gpu/drm/i915/i915_switcheroo.c | 6 +++++-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index c1e427ba57ae..33e231b120c1 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1075,8 +1075,7 @@ static void i915_driver_lastclose(struct drm_device *dev)
intel_fbdev_restore_mode(dev);
- if (HAS_DISPLAY(i915))
- vga_switcheroo_process_delayed_switch();
+ vga_switcheroo_process_delayed_switch();
}
static void i915_driver_postclose(struct drm_device *dev, struct drm_file *file)
diff --git a/drivers/gpu/drm/i915/i915_switcheroo.c b/drivers/gpu/drm/i915/i915_switcheroo.c
index 23777d500cdf..f45bd6b6cede 100644
--- a/drivers/gpu/drm/i915/i915_switcheroo.c
+++ b/drivers/gpu/drm/i915/i915_switcheroo.c
@@ -19,6 +19,10 @@ static void i915_switcheroo_set_state(struct pci_dev *pdev,
dev_err(&pdev->dev, "DRM not initialized, aborting switch.\n");
return;
}
+ if (!HAS_DISPLAY(i915)) {
+ dev_err(&pdev->dev, "Device state not initialized, aborting switch.\n");
+ return;
+ }
if (state == VGA_SWITCHEROO_ON) {
drm_info(&i915->drm, "switched on\n");
@@ -44,7 +48,7 @@ static bool i915_switcheroo_can_switch(struct pci_dev *pdev)
* locking inversion with the driver load path. And the access here is
* completely racy anyway. So don't bother with locking for now.
*/
- return i915 && atomic_read(&i915->drm.open_count) == 0;
+ return i915 && HAS_DISPLAY(i915) && atomic_read(&i915->drm.open_count) == 0;
}
static const struct vga_switcheroo_client_ops i915_switcheroo_ops = {
--
2.39.0
The introduction of support for Apple board types inadvertently changed
the precedence order, causing hybrid SMBIOS+DT platforms to look up the
firmware using the DMI information instead of the device tree compatible
to generate the board type. Revert back to the old behavior,
as affected platforms use firmwares named after the DT compatible.
Fixes: 7682de8b3351 ("wifi: brcmfmac: of: Fetch Apple properties")
[1] https://bugzilla.opensuse.org/show_bug.cgi?id=1206697#c13
Cc: stable(a)vger.kernel.org
Signed-off-by: Ivan T. Ivanov <iivanov(a)suse.de>
Reviewed-by: Hector Martin <marcan(a)marcan.st>
---
Changes since v1
Rewrite commit message according feedback.
https://lore.kernel.org/all/20230106072746.29516-1-iivanov@suse.de/
drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c
index a83699de01ec..fdd0c9abc1a1 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c
@@ -79,7 +79,8 @@ void brcmf_of_probe(struct device *dev, enum brcmf_bus_type bus_type,
/* Apple ARM64 platforms have their own idea of board type, passed in
* via the device tree. They also have an antenna SKU parameter
*/
- if (!of_property_read_string(np, "brcm,board-type", &prop))
+ err = of_property_read_string(np, "brcm,board-type", &prop);
+ if (!err)
settings->board_type = prop;
if (!of_property_read_string(np, "apple,antenna-sku", &prop))
@@ -87,7 +88,7 @@ void brcmf_of_probe(struct device *dev, enum brcmf_bus_type bus_type,
/* Set board-type to the first string of the machine compatible prop */
root = of_find_node_by_path("/");
- if (root && !settings->board_type) {
+ if (root && err) {
char *board_type;
const char *tmp;
--
2.35.3
This bug is marked as fixed by commit:
net: core: netlink: add helper refcount dec and lock function
net: sched: add helper function to take reference to Qdisc
net: sched: extend Qdisc with rcu
net: sched: rename qdisc_destroy() to qdisc_put()
net: sched: use Qdisc rcu API instead of relying on rtnl lock
But I can't find it in the tested trees[1] for more than 90 days.
Is it a correct commit? Please update it by replying:
#syz fix: exact-commit-title
Until then the bug is still considered open and new crashes with
the same signature are ignored.
Kernel: Linux 4.19
Dashboard link: https://syzkaller.appspot.com/bug?extid=5f229e48cccc804062c0
---
[1] I expect the commit to be present in:
1. linux-4.19.y branch of
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
After doorbell DMA fetches the TRB. If during dequeuing request
driver changes NORMAL TRB to LINK TRB but doesn't delete it from
controller cache then controller will handle cached TRB and packet
can be lost.
The example scenario for this issue looks like:
1. queue request - set doorbell
2. dequeue request
3. send OUT data packet from host
4. Device will accept this packet which is unexpected
5. queue new request - set doorbell
6. Device lost the expected packet.
By setting DFLUSH controller clears DRDY bit and stop DMA transfer.
Fixes: 7733f6c32e36 ("usb: cdns3: Add Cadence USB3 DRD Driver")
cc: <stable(a)vger.kernel.org>
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
drivers/usb/cdns3/cdns3-gadget.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/usb/cdns3/cdns3-gadget.c b/drivers/usb/cdns3/cdns3-gadget.c
index 5adcb349718c..ccfaebca6faa 100644
--- a/drivers/usb/cdns3/cdns3-gadget.c
+++ b/drivers/usb/cdns3/cdns3-gadget.c
@@ -2614,6 +2614,7 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
u8 req_on_hw_ring = 0;
unsigned long flags;
int ret = 0;
+ int val;
if (!ep || !request || !ep->desc)
return -EINVAL;
@@ -2649,6 +2650,13 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
/* Update ring only if removed request is on pending_req_list list */
if (req_on_hw_ring && link_trb) {
+ /* Stop DMA */
+ writel(EP_CMD_DFLUSH, &priv_dev->regs->ep_cmd);
+
+ /* wait for DFLUSH cleared */
+ readl_poll_timeout_atomic(&priv_dev->regs->ep_cmd, val,
+ !(val & EP_CMD_DFLUSH), 1, 1000);
+
link_trb->buffer = cpu_to_le32(TRB_BUFFER(priv_ep->trb_pool_dma +
((priv_req->end_trb + 1) * TRB_SIZE)));
link_trb->control = cpu_to_le32((le32_to_cpu(link_trb->control) & TRB_CYCLE) |
@@ -2660,6 +2668,10 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
cdns3_gadget_giveback(priv_ep, priv_req, -ECONNRESET);
+ req = cdns3_next_request(&priv_ep->pending_req_list);
+ if (req)
+ cdns3_rearm_transfer(priv_ep, 1);
+
not_found:
spin_unlock_irqrestore(&priv_dev->lock, flags);
return ret;
--
2.25.1
The patch titled
Subject: Revert "mm/compaction: fix set skip in fast_find_migrateblock"
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
revert-mm-compaction-fix-set-skip-in-fast_find_migrateblock.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: Revert "mm/compaction: fix set skip in fast_find_migrateblock"
Date: Fri, 13 Jan 2023 18:33:45 +0100
This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.
We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
stalling CPU for long periods of time. Investigation of tracepoint data
shows that compaction is stuck in repeating fast_find_migrateblock() based
migrate page isolation, and then fails to migrate all isolated pages.
Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") was suspected as it was merged in 6.1 and in
theory can indeed remove a termination condition for
fast_find_migrateblock() under certain conditions, as it removes a place
that always marks a scanned pageblock from being re-scanned. There are
other such places, but those can be skipped under certain conditions,
which seems to match the tracepoint data.
Testing of revert also appears to have resolved the issue, thus revert the
commit until a more robust solution for the original problem is developed.
It's also likely this will fix qemu stalls with 6.1 kernel reported in
Link 2, but that is not yet confirmed.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/b8017e09-f336-3035-8344-c549086c2340@kernel.org/
Link: https://lkml.kernel.org/r/20230113173345.9692-1-vbabka@suse.cz
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: Chuyi Zhou <zhouchuyi(a)bytedance.com>
Cc: Jiri Slaby <jirislaby(a)kernel.org>
Cc: Maxim Levitsky <mlevitsk(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Thorsten Leemhuis <regressions(a)leemhuis.info>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/compaction.c~revert-mm-compaction-fix-set-skip-in-fast_find_migrateblock
+++ a/mm/compaction.c
@@ -1839,6 +1839,7 @@ static unsigned long fast_find_migratebl
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
+ set_pageblock_skip(freepage);
break;
}
}
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
revert-mm-compaction-fix-set-skip-in-fast_find_migrateblock.patch
This is the start of the stable review cycle for the 6.1.6 release.
There are 10 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 14 Jan 2023 13:53:18 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.6-rc1.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.6-rc1
Frederick Lawler <fred(a)cloudflare.com>
net: sched: disallow noqueue for qdisc classes
Linus Torvalds <torvalds(a)linux-foundation.org>
gcc: disable -Warray-bounds for gcc-11 too
Chuck Lever <chuck.lever(a)oracle.com>
Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
Kyle Huey <me(a)kylehuey.com>
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Allow PKRU to be (once again) written by ptrace.
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
Helge Deller <deller(a)gmx.de>
parisc: Align parisc MADV_XXX constants with all other architectures
-------------
Diffstat:
Makefile | 4 +-
arch/parisc/include/uapi/asm/mman.h | 29 +++---
arch/parisc/kernel/sys_parisc.c | 28 ++++++
arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
arch/x86/kernel/fpu/core.c | 19 ++--
arch/x86/kernel/fpu/regset.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 52 ++++++++++-
arch/x86/kernel/fpu/xstate.h | 4 +-
fs/nfsd/nfs4proc.c | 7 +-
fs/nfsd/nfs4xdr.c | 2 +-
init/Kconfig | 6 +-
net/sched/sch_api.c | 5 +
net/sunrpc/auth_gss/svcauth_gss.c | 4 +-
net/sunrpc/svc.c | 6 +-
net/sunrpc/svc_xprt.c | 2 +-
net/sunrpc/svcsock.c | 8 +-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
tools/arch/parisc/include/uapi/asm/mman.h | 12 +--
tools/perf/bench/bench.h | 12 ---
tools/testing/selftests/vm/pkey-x86.h | 12 +++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
22 files changed, 276 insertions(+), 75 deletions(-)
This reverts commit 9ccd11718d76b95c69aa773f2abedef560776037.
The original commit 16fb4dca95daa ("drm/amdgpu: getting fan speed pwm for vega10 properly")
was reverted in commit 4545ae2ed3f2 ("drm/amdgpu: Revert "drm/amdgpu: getting fan speed pwm for vega10 properly"").
but the test that resulted in the revert was wrong and was fixed so the
revert was reverted in commit 30b8e7b8ee3b ("Revert "drm/amdgpu: Revert "drm/amdgpu: getting fan speed pwm for vega10 properly""").
That should have been the end of it, but then Sasha picked up the
original revert again and it was committed as 9ccd11718d76. So drop
that commit so we get back to where we need to be.
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Sasha Levin <sashal(a)kernel.org>
Cc: stable(a)vger.kernel.org # 6.1.x
Cc: Yury Zhuravlev <stalkerg(a)gmail.com>
Cc: Guchun Chen <guchun.chen(a)amd.com>
Cc: Asher Song <Asher.Song(a)amd.com>
---
.../amd/pm/powerplay/hwmgr/vega10_thermal.c | 25 +++++++++----------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c
index dad3e3741a4e..190af79f3236 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c
@@ -67,22 +67,21 @@ int vega10_fan_ctrl_get_fan_speed_info(struct pp_hwmgr *hwmgr,
int vega10_fan_ctrl_get_fan_speed_pwm(struct pp_hwmgr *hwmgr,
uint32_t *speed)
{
- uint32_t current_rpm;
- uint32_t percent = 0;
-
- if (hwmgr->thermal_controller.fanInfo.bNoFan)
- return 0;
+ struct amdgpu_device *adev = hwmgr->adev;
+ uint32_t duty100, duty;
+ uint64_t tmp64;
- if (vega10_get_current_rpm(hwmgr, ¤t_rpm))
- return -1;
+ duty100 = REG_GET_FIELD(RREG32_SOC15(THM, 0, mmCG_FDO_CTRL1),
+ CG_FDO_CTRL1, FMAX_DUTY100);
+ duty = REG_GET_FIELD(RREG32_SOC15(THM, 0, mmCG_THERMAL_STATUS),
+ CG_THERMAL_STATUS, FDO_PWM_DUTY);
- if (hwmgr->thermal_controller.
- advanceFanControlParameters.usMaxFanRPM != 0)
- percent = current_rpm * 255 /
- hwmgr->thermal_controller.
- advanceFanControlParameters.usMaxFanRPM;
+ if (!duty100)
+ return -EINVAL;
- *speed = MIN(percent, 255);
+ tmp64 = (uint64_t)duty * 255;
+ do_div(tmp64, duty100);
+ *speed = MIN((uint32_t)tmp64, 255);
return 0;
}
--
2.39.0
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
758999246965 ("x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak")
fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
91989c707884 ("task_work: cleanup notification modes")
216578e55ac9 ("io_uring: fix REQ_F_COMP_LOCKED by killing it")
4edf20f99902 ("io_uring: dig out COMP_LOCK from deep call chain")
b1b74cfc1967 ("io_uring: don't unnecessarily clear F_LINK_TIMEOUT")
6ad4bf6ea160 ("Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fe1f0714385fbcf76b0cbceb02b7277d842014fc Mon Sep 17 00:00:00 2001
From: Peter Newman <peternewman(a)google.com>
Date: Tue, 20 Dec 2022 17:11:23 +0100
Subject: [PATCH] x86/resctrl: Fix task CLOSID/RMID update race
When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.
x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update then it can start running with the old
CLOSID/RMID until it is switched again because __rdtgroup_move_task()
failed to determine that it needs to be interrupted to obtain the new
CLOSID/RMID.
Refer to the diagram below:
CPU 0 CPU 1
----- -----
__rdtgroup_move_task():
curr <- t1->cpu->rq->curr
__schedule():
rq->curr <- t1
resctrl_sched_in():
t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Signed-off-by: Peter Newman <peternewman(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Babu Moger <babu.moger(a)amd.com>
Cc: <stable(a)kernel.org>
Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
+ * This pairs with the full barrier between the rq->curr update and
+ * resctrl_sched_in() during context switch.
*/
- barrier();
+ smp_mb();
/*
* By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and resctrl_sched_in()
+ * during context switch.
+ */
+ smp_mb();
+
/*
* If the task is on a CPU, set the CPU in the mask.
* The detection is inaccurate as tasks might move or
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
758999246965 ("x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak")
fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
91989c707884 ("task_work: cleanup notification modes")
216578e55ac9 ("io_uring: fix REQ_F_COMP_LOCKED by killing it")
4edf20f99902 ("io_uring: dig out COMP_LOCK from deep call chain")
b1b74cfc1967 ("io_uring: don't unnecessarily clear F_LINK_TIMEOUT")
6ad4bf6ea160 ("Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fe1f0714385fbcf76b0cbceb02b7277d842014fc Mon Sep 17 00:00:00 2001
From: Peter Newman <peternewman(a)google.com>
Date: Tue, 20 Dec 2022 17:11:23 +0100
Subject: [PATCH] x86/resctrl: Fix task CLOSID/RMID update race
When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.
x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update then it can start running with the old
CLOSID/RMID until it is switched again because __rdtgroup_move_task()
failed to determine that it needs to be interrupted to obtain the new
CLOSID/RMID.
Refer to the diagram below:
CPU 0 CPU 1
----- -----
__rdtgroup_move_task():
curr <- t1->cpu->rq->curr
__schedule():
rq->curr <- t1
resctrl_sched_in():
t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Signed-off-by: Peter Newman <peternewman(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Babu Moger <babu.moger(a)amd.com>
Cc: <stable(a)kernel.org>
Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
+ * This pairs with the full barrier between the rq->curr update and
+ * resctrl_sched_in() during context switch.
*/
- barrier();
+ smp_mb();
/*
* By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and resctrl_sched_in()
+ * during context switch.
+ */
+ smp_mb();
+
/*
* If the task is on a CPU, set the CPU in the mask.
* The detection is inaccurate as tasks might move or
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
758999246965 ("x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak")
fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
91989c707884 ("task_work: cleanup notification modes")
216578e55ac9 ("io_uring: fix REQ_F_COMP_LOCKED by killing it")
4edf20f99902 ("io_uring: dig out COMP_LOCK from deep call chain")
b1b74cfc1967 ("io_uring: don't unnecessarily clear F_LINK_TIMEOUT")
6ad4bf6ea160 ("Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fe1f0714385fbcf76b0cbceb02b7277d842014fc Mon Sep 17 00:00:00 2001
From: Peter Newman <peternewman(a)google.com>
Date: Tue, 20 Dec 2022 17:11:23 +0100
Subject: [PATCH] x86/resctrl: Fix task CLOSID/RMID update race
When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.
x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update then it can start running with the old
CLOSID/RMID until it is switched again because __rdtgroup_move_task()
failed to determine that it needs to be interrupted to obtain the new
CLOSID/RMID.
Refer to the diagram below:
CPU 0 CPU 1
----- -----
__rdtgroup_move_task():
curr <- t1->cpu->rq->curr
__schedule():
rq->curr <- t1
resctrl_sched_in():
t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Signed-off-by: Peter Newman <peternewman(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Babu Moger <babu.moger(a)amd.com>
Cc: <stable(a)kernel.org>
Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
+ * This pairs with the full barrier between the rq->curr update and
+ * resctrl_sched_in() during context switch.
*/
- barrier();
+ smp_mb();
/*
* By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and resctrl_sched_in()
+ * during context switch.
+ */
+ smp_mb();
+
/*
* If the task is on a CPU, set the CPU in the mask.
* The detection is inaccurate as tasks might move or
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fe1f0714385fbcf76b0cbceb02b7277d842014fc Mon Sep 17 00:00:00 2001
From: Peter Newman <peternewman(a)google.com>
Date: Tue, 20 Dec 2022 17:11:23 +0100
Subject: [PATCH] x86/resctrl: Fix task CLOSID/RMID update race
When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.
x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update then it can start running with the old
CLOSID/RMID until it is switched again because __rdtgroup_move_task()
failed to determine that it needs to be interrupted to obtain the new
CLOSID/RMID.
Refer to the diagram below:
CPU 0 CPU 1
----- -----
__rdtgroup_move_task():
curr <- t1->cpu->rq->curr
__schedule():
rq->curr <- t1
resctrl_sched_in():
t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Signed-off-by: Peter Newman <peternewman(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Babu Moger <babu.moger(a)amd.com>
Cc: <stable(a)kernel.org>
Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
+ * This pairs with the full barrier between the rq->curr update and
+ * resctrl_sched_in() during context switch.
*/
- barrier();
+ smp_mb();
/*
* By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and resctrl_sched_in()
+ * during context switch.
+ */
+ smp_mb();
+
/*
* If the task is on a CPU, set the CPU in the mask.
* The detection is inaccurate as tasks might move or
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
76d588dddc45 ("powerpc/imc-pmu: Fix use of mutex in IRQs disabled section")
a36e8ba60b99 ("powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.")
012ae244845f ("powerpc/perf: Trace imc PMU functions")
72c69dcddce1 ("powerpc/perf: Trace imc events detection and cpuhotplug")
dd50cf7cbc7b ("powerpc/perf: Rearrange setting of ldbar for thread-imc")
4851f75098bc ("powerpc/perf: Declare static identifier a such")
5e2d059b52e3 ("Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d588dddc459fefa1da96e0a081a397c5c8e216 Mon Sep 17 00:00:00 2001
From: Kajol Jain <kjain(a)linux.ibm.com>
Date: Fri, 6 Jan 2023 12:21:57 +0530
Subject: [PATCH] powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
Current imc-pmu code triggers a WARNING with CONFIG_DEBUG_ATOMIC_SLEEP
and CONFIG_PROVE_LOCKING enabled, while running a thread_imc event.
Command to trigger the warning:
# perf stat -e thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/ sleep 5
Performance counter stats for 'sleep 5':
0 thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/
5.002117947 seconds time elapsed
0.000131000 seconds user
0.001063000 seconds sys
Below is snippet of the warning in dmesg:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2869, name: perf-exec
preempt_count: 2, expected: 0
4 locks held by perf-exec/2869:
#0: c00000004325c540 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x64/0xa90
#1: c00000004325c5d8 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x460/0xef0
#2: c0000003fa99d4e0 (&cpuctx_lock){-...}-{2:2}, at: perf_event_exec+0x290/0x510
#3: c000000017ab8418 (&ctx->lock){....}-{2:2}, at: perf_event_exec+0x29c/0x510
irq event stamp: 4806
hardirqs last enabled at (4805): [<c000000000f65b94>] _raw_spin_unlock_irqrestore+0x94/0xd0
hardirqs last disabled at (4806): [<c0000000003fae44>] perf_event_exec+0x394/0x510
softirqs last enabled at (0): [<c00000000013c404>] copy_process+0xc34/0x1ff0
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 36 PID: 2869 Comm: perf-exec Not tainted 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
Call Trace:
dump_stack_lvl+0x98/0xe0 (unreliable)
__might_resched+0x2f8/0x310
__mutex_lock+0x6c/0x13f0
thread_imc_event_add+0xf4/0x1b0
event_sched_in+0xe0/0x210
merge_sched_in+0x1f0/0x600
visit_groups_merge.isra.92.constprop.166+0x2bc/0x6c0
ctx_flexible_sched_in+0xcc/0x140
ctx_sched_in+0x20c/0x2a0
ctx_resched+0x104/0x1c0
perf_event_exec+0x340/0x510
begin_new_exec+0x730/0xef0
load_elf_binary+0x3f8/0x1e10
...
do not call blocking ops when !TASK_RUNNING; state=2001 set at [<00000000fd63e7cf>] do_nanosleep+0x60/0x1a0
WARNING: CPU: 36 PID: 2869 at kernel/sched/core.c:9912 __might_sleep+0x9c/0xb0
CPU: 36 PID: 2869 Comm: sleep Tainted: G W 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
NIP: c000000000194a1c LR: c000000000194a18 CTR: c000000000a78670
REGS: c00000004d2134e0 TRAP: 0700 Tainted: G W (6.2.0-rc2-00011-g1247637727f2)
MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 48002824 XER: 00000000
CFAR: c00000000013fb64 IRQMASK: 1
The above warning triggered because the current imc-pmu code uses mutex
lock in interrupt disabled sections. The function mutex_lock()
internally calls __might_resched(), which will check if IRQs are
disabled and in case IRQs are disabled, it will trigger the warning.
Fix the issue by changing the mutex lock to spinlock.
Fixes: 8f95faaac56c ("powerpc/powernv: Detect and create IMC device")
Reported-by: Michael Petlan <mpetlan(a)redhat.com>
Reported-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Kajol Jain <kjain(a)linux.ibm.com>
[mpe: Fix comments, trim oops in change log, add reported-by tags]
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20230106065157.182648-1-kjain@linux.ibm.com
diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 4f897993b710..699a88584ae1 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -137,7 +137,7 @@ struct imc_pmu {
* are inited.
*/
struct imc_pmu_ref {
- struct mutex lock;
+ spinlock_t lock;
unsigned int id;
int refc;
};
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d517aba94d1b..100e97daf76b 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -14,6 +14,7 @@
#include <asm/cputhreads.h>
#include <asm/smp.h>
#include <linux/string.h>
+#include <linux/spinlock.h>
/* Nest IMC data structures and variables */
@@ -21,7 +22,7 @@
* Used to avoid races in counting the nest-pmu units during hotplug
* register and unregister
*/
-static DEFINE_MUTEX(nest_init_lock);
+static DEFINE_SPINLOCK(nest_init_lock);
static DEFINE_PER_CPU(struct imc_pmu_ref *, local_nest_imc_refc);
static struct imc_pmu **per_nest_pmu_arr;
static cpumask_t nest_imc_cpumask;
@@ -50,7 +51,7 @@ static int trace_imc_mem_size;
* core and trace-imc
*/
static struct imc_pmu_ref imc_global_refc = {
- .lock = __MUTEX_INITIALIZER(imc_global_refc.lock),
+ .lock = __SPIN_LOCK_INITIALIZER(imc_global_refc.lock),
.id = 0,
.refc = 0,
};
@@ -400,7 +401,7 @@ static int ppc_nest_imc_cpu_offline(unsigned int cpu)
get_hard_smp_processor_id(cpu));
/*
* If this is the last cpu in this chip then, skip the reference
- * count mutex lock and make the reference count on this chip zero.
+ * count lock and make the reference count on this chip zero.
*/
ref = get_nest_pmu_ref(cpu);
if (!ref)
@@ -462,15 +463,15 @@ static void nest_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the nest PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the nest counters.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return;
- /* Take the mutex lock for this node and then decrement the reference count */
- mutex_lock(&ref->lock);
+ /* Take the lock for this node and then decrement the reference count */
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -482,7 +483,7 @@ static void nest_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that node.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -490,7 +491,7 @@ static void nest_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to stop the counters for core %d\n", node_id);
return;
}
@@ -498,7 +499,7 @@ static void nest_imc_counters_release(struct perf_event *event)
WARN(1, "nest-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
}
static int nest_imc_event_init(struct perf_event *event)
@@ -557,26 +558,25 @@ static int nest_imc_event_init(struct perf_event *event)
/*
* Get the imc_pmu_ref struct for this node.
- * Take the mutex lock and then increment the count of nest pmu events
- * inited.
+ * Take the lock and then increment the count of nest pmu events inited.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to start the counters for node %d\n",
node_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
event->destroy = nest_imc_counters_release;
return 0;
@@ -612,9 +612,8 @@ static int core_imc_mem_init(int cpu, int size)
return -ENOMEM;
mem_info->vbase = page_address(page);
- /* Init the mutex */
core_imc_refc[core_id].id = core_id;
- mutex_init(&core_imc_refc[core_id].lock);
+ spin_lock_init(&core_imc_refc[core_id].lock);
rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_CORE,
__pa((void *)mem_info->vbase),
@@ -703,9 +702,8 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
perf_pmu_migrate_context(&core_imc_pmu->pmu, cpu, ncpu);
} else {
/*
- * If this is the last cpu in this core then, skip taking refernce
- * count mutex lock for this core and directly zero "refc" for
- * this core.
+ * If this is the last cpu in this core then skip taking reference
+ * count lock for this core and directly zero "refc" for this core.
*/
opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(cpu));
@@ -720,11 +718,11 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
* last cpu in this core and core-imc event running
* in this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_CORE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
return 0;
}
@@ -739,7 +737,7 @@ static int core_imc_pmu_cpumask_init(void)
static void reset_global_refc(struct perf_event *event)
{
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
imc_global_refc.refc--;
/*
@@ -751,7 +749,7 @@ static void reset_global_refc(struct perf_event *event)
imc_global_refc.refc = 0;
imc_global_refc.id = 0;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
static void core_imc_counters_release(struct perf_event *event)
@@ -764,17 +762,17 @@ static void core_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the IMC PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the core counters.
*/
core_id = event->cpu / threads_per_core;
- /* Take the mutex lock and decrement the refernce count for this core */
+ /* Take the lock and decrement the refernce count for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -786,7 +784,7 @@ static void core_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that core.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -794,7 +792,7 @@ static void core_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("IMC: Unable to stop the counters for core %d\n", core_id);
return;
}
@@ -802,7 +800,7 @@ static void core_imc_counters_release(struct perf_event *event)
WARN(1, "core-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
reset_global_refc(event);
}
@@ -840,7 +838,6 @@ static int core_imc_event_init(struct perf_event *event)
if ((!pcmi->vbase))
return -ENODEV;
- /* Get the core_imc mutex for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
@@ -848,22 +845,22 @@ static int core_imc_event_init(struct perf_event *event)
/*
* Core pmu units are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the core counters.
+ * If yes, take the lock and enable the core counters.
* If not, just increment the count in core_imc_refc struct.
*/
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("core-imc: Unable to start the counters for core %d\n",
core_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/*
* Since the system can run either in accumulation or trace-mode
@@ -874,7 +871,7 @@ static int core_imc_event_init(struct perf_event *event)
* to know whether any other trace/thread imc
* events are running.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_CORE) {
/*
* No other trace/thread imc events are running in
@@ -883,10 +880,10 @@ static int core_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_CORE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.event_base = (u64)pcmi->vbase + (config & IMC_EVENT_OFFSET_MASK);
event->destroy = core_imc_counters_release;
@@ -958,10 +955,10 @@ static int ppc_thread_imc_cpu_offline(unsigned int cpu)
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
/* Reduce the refc if thread-imc event running on this cpu */
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_THREAD)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1001,7 +998,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (!target)
return -EINVAL;
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
/*
* Check if any other trace/core imc events are running in the
* system, if not set the global id to thread-imc.
@@ -1010,10 +1007,10 @@ static int thread_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_THREAD;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->pmu->task_ctx_nr = perf_sw_context;
event->destroy = reset_global_refc;
@@ -1135,25 +1132,25 @@ static int thread_imc_event_add(struct perf_event *event, int flags)
/*
* imc pmus are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the counters.
+ * If yes, take the lock and enable the counters.
* If not, just increment the count in ref count struct.
*/
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to start the counter\
for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1170,12 +1167,12 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to stop the counters\
for core %d\n", core_id);
return;
@@ -1183,7 +1180,7 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/* Set bit 0 of LDBAR to zero, to stop posting updates to memory */
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
@@ -1224,9 +1221,8 @@ static int trace_imc_mem_alloc(int cpu_id, int size)
}
}
- /* Init the mutex, if not already */
trace_imc_refc[core_id].id = core_id;
- mutex_init(&trace_imc_refc[core_id].lock);
+ spin_lock_init(&trace_imc_refc[core_id].lock);
mtspr(SPRN_LDBAR, 0);
return 0;
@@ -1246,10 +1242,10 @@ static int ppc_trace_imc_cpu_offline(unsigned int cpu)
* Reduce the refc if any trace-imc event running
* on this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_TRACE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1371,17 +1367,17 @@ static int trace_imc_event_add(struct perf_event *event, int flags)
}
mtspr(SPRN_LDBAR, ldbar_value);
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to start the counters for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1414,19 +1410,19 @@ static void trace_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to stop the counters for core %d\n", core_id);
return;
}
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
trace_imc_event_stop(event, flags);
}
@@ -1448,7 +1444,7 @@ static int trace_imc_event_init(struct perf_event *event)
* no other thread is running any core/thread imc
* events
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_TRACE) {
/*
* No core/thread imc events are running in the
@@ -1457,10 +1453,10 @@ static int trace_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_TRACE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.idx = -1;
@@ -1533,10 +1529,10 @@ static int init_nest_pmu_ref(void)
i = 0;
for_each_node(nid) {
/*
- * Mutex lock to avoid races while tracking the number of
+ * Take the lock to avoid races while tracking the number of
* sessions using the chip's nest pmu units.
*/
- mutex_init(&nest_imc_refc[i].lock);
+ spin_lock_init(&nest_imc_refc[i].lock);
/*
* Loop to init the "id" with the node_id. Variable "i" initialized to
@@ -1633,7 +1629,7 @@ static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
{
if (pmu_ptr->domain == IMC_DOMAIN_NEST) {
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 1) {
cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);
@@ -1643,7 +1639,7 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
if (nest_pmus > 0)
nest_pmus--;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
}
/* Free core_imc memory */
@@ -1800,11 +1796,11 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
* rest. To handle the cpuhotplug callback unregister, we track
* the number of nest pmus in "nest_pmus".
*/
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 0) {
ret = init_nest_pmu_ref();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
goto err_free_mem;
@@ -1812,7 +1808,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
/* Register for cpu hotplug notification. */
ret = nest_pmu_cpumask_init();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(nest_imc_refc);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
@@ -1820,7 +1816,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
}
}
nest_pmus++;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
break;
case IMC_DOMAIN_CORE:
ret = core_imc_pmu_cpumask_init();
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
76d588dddc45 ("powerpc/imc-pmu: Fix use of mutex in IRQs disabled section")
a36e8ba60b99 ("powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.")
012ae244845f ("powerpc/perf: Trace imc PMU functions")
72c69dcddce1 ("powerpc/perf: Trace imc events detection and cpuhotplug")
dd50cf7cbc7b ("powerpc/perf: Rearrange setting of ldbar for thread-imc")
4851f75098bc ("powerpc/perf: Declare static identifier a such")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d588dddc459fefa1da96e0a081a397c5c8e216 Mon Sep 17 00:00:00 2001
From: Kajol Jain <kjain(a)linux.ibm.com>
Date: Fri, 6 Jan 2023 12:21:57 +0530
Subject: [PATCH] powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
Current imc-pmu code triggers a WARNING with CONFIG_DEBUG_ATOMIC_SLEEP
and CONFIG_PROVE_LOCKING enabled, while running a thread_imc event.
Command to trigger the warning:
# perf stat -e thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/ sleep 5
Performance counter stats for 'sleep 5':
0 thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/
5.002117947 seconds time elapsed
0.000131000 seconds user
0.001063000 seconds sys
Below is snippet of the warning in dmesg:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2869, name: perf-exec
preempt_count: 2, expected: 0
4 locks held by perf-exec/2869:
#0: c00000004325c540 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x64/0xa90
#1: c00000004325c5d8 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x460/0xef0
#2: c0000003fa99d4e0 (&cpuctx_lock){-...}-{2:2}, at: perf_event_exec+0x290/0x510
#3: c000000017ab8418 (&ctx->lock){....}-{2:2}, at: perf_event_exec+0x29c/0x510
irq event stamp: 4806
hardirqs last enabled at (4805): [<c000000000f65b94>] _raw_spin_unlock_irqrestore+0x94/0xd0
hardirqs last disabled at (4806): [<c0000000003fae44>] perf_event_exec+0x394/0x510
softirqs last enabled at (0): [<c00000000013c404>] copy_process+0xc34/0x1ff0
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 36 PID: 2869 Comm: perf-exec Not tainted 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
Call Trace:
dump_stack_lvl+0x98/0xe0 (unreliable)
__might_resched+0x2f8/0x310
__mutex_lock+0x6c/0x13f0
thread_imc_event_add+0xf4/0x1b0
event_sched_in+0xe0/0x210
merge_sched_in+0x1f0/0x600
visit_groups_merge.isra.92.constprop.166+0x2bc/0x6c0
ctx_flexible_sched_in+0xcc/0x140
ctx_sched_in+0x20c/0x2a0
ctx_resched+0x104/0x1c0
perf_event_exec+0x340/0x510
begin_new_exec+0x730/0xef0
load_elf_binary+0x3f8/0x1e10
...
do not call blocking ops when !TASK_RUNNING; state=2001 set at [<00000000fd63e7cf>] do_nanosleep+0x60/0x1a0
WARNING: CPU: 36 PID: 2869 at kernel/sched/core.c:9912 __might_sleep+0x9c/0xb0
CPU: 36 PID: 2869 Comm: sleep Tainted: G W 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
NIP: c000000000194a1c LR: c000000000194a18 CTR: c000000000a78670
REGS: c00000004d2134e0 TRAP: 0700 Tainted: G W (6.2.0-rc2-00011-g1247637727f2)
MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 48002824 XER: 00000000
CFAR: c00000000013fb64 IRQMASK: 1
The above warning triggered because the current imc-pmu code uses mutex
lock in interrupt disabled sections. The function mutex_lock()
internally calls __might_resched(), which will check if IRQs are
disabled and in case IRQs are disabled, it will trigger the warning.
Fix the issue by changing the mutex lock to spinlock.
Fixes: 8f95faaac56c ("powerpc/powernv: Detect and create IMC device")
Reported-by: Michael Petlan <mpetlan(a)redhat.com>
Reported-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Kajol Jain <kjain(a)linux.ibm.com>
[mpe: Fix comments, trim oops in change log, add reported-by tags]
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20230106065157.182648-1-kjain@linux.ibm.com
diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 4f897993b710..699a88584ae1 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -137,7 +137,7 @@ struct imc_pmu {
* are inited.
*/
struct imc_pmu_ref {
- struct mutex lock;
+ spinlock_t lock;
unsigned int id;
int refc;
};
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d517aba94d1b..100e97daf76b 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -14,6 +14,7 @@
#include <asm/cputhreads.h>
#include <asm/smp.h>
#include <linux/string.h>
+#include <linux/spinlock.h>
/* Nest IMC data structures and variables */
@@ -21,7 +22,7 @@
* Used to avoid races in counting the nest-pmu units during hotplug
* register and unregister
*/
-static DEFINE_MUTEX(nest_init_lock);
+static DEFINE_SPINLOCK(nest_init_lock);
static DEFINE_PER_CPU(struct imc_pmu_ref *, local_nest_imc_refc);
static struct imc_pmu **per_nest_pmu_arr;
static cpumask_t nest_imc_cpumask;
@@ -50,7 +51,7 @@ static int trace_imc_mem_size;
* core and trace-imc
*/
static struct imc_pmu_ref imc_global_refc = {
- .lock = __MUTEX_INITIALIZER(imc_global_refc.lock),
+ .lock = __SPIN_LOCK_INITIALIZER(imc_global_refc.lock),
.id = 0,
.refc = 0,
};
@@ -400,7 +401,7 @@ static int ppc_nest_imc_cpu_offline(unsigned int cpu)
get_hard_smp_processor_id(cpu));
/*
* If this is the last cpu in this chip then, skip the reference
- * count mutex lock and make the reference count on this chip zero.
+ * count lock and make the reference count on this chip zero.
*/
ref = get_nest_pmu_ref(cpu);
if (!ref)
@@ -462,15 +463,15 @@ static void nest_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the nest PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the nest counters.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return;
- /* Take the mutex lock for this node and then decrement the reference count */
- mutex_lock(&ref->lock);
+ /* Take the lock for this node and then decrement the reference count */
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -482,7 +483,7 @@ static void nest_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that node.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -490,7 +491,7 @@ static void nest_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to stop the counters for core %d\n", node_id);
return;
}
@@ -498,7 +499,7 @@ static void nest_imc_counters_release(struct perf_event *event)
WARN(1, "nest-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
}
static int nest_imc_event_init(struct perf_event *event)
@@ -557,26 +558,25 @@ static int nest_imc_event_init(struct perf_event *event)
/*
* Get the imc_pmu_ref struct for this node.
- * Take the mutex lock and then increment the count of nest pmu events
- * inited.
+ * Take the lock and then increment the count of nest pmu events inited.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to start the counters for node %d\n",
node_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
event->destroy = nest_imc_counters_release;
return 0;
@@ -612,9 +612,8 @@ static int core_imc_mem_init(int cpu, int size)
return -ENOMEM;
mem_info->vbase = page_address(page);
- /* Init the mutex */
core_imc_refc[core_id].id = core_id;
- mutex_init(&core_imc_refc[core_id].lock);
+ spin_lock_init(&core_imc_refc[core_id].lock);
rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_CORE,
__pa((void *)mem_info->vbase),
@@ -703,9 +702,8 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
perf_pmu_migrate_context(&core_imc_pmu->pmu, cpu, ncpu);
} else {
/*
- * If this is the last cpu in this core then, skip taking refernce
- * count mutex lock for this core and directly zero "refc" for
- * this core.
+ * If this is the last cpu in this core then skip taking reference
+ * count lock for this core and directly zero "refc" for this core.
*/
opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(cpu));
@@ -720,11 +718,11 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
* last cpu in this core and core-imc event running
* in this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_CORE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
return 0;
}
@@ -739,7 +737,7 @@ static int core_imc_pmu_cpumask_init(void)
static void reset_global_refc(struct perf_event *event)
{
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
imc_global_refc.refc--;
/*
@@ -751,7 +749,7 @@ static void reset_global_refc(struct perf_event *event)
imc_global_refc.refc = 0;
imc_global_refc.id = 0;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
static void core_imc_counters_release(struct perf_event *event)
@@ -764,17 +762,17 @@ static void core_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the IMC PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the core counters.
*/
core_id = event->cpu / threads_per_core;
- /* Take the mutex lock and decrement the refernce count for this core */
+ /* Take the lock and decrement the refernce count for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -786,7 +784,7 @@ static void core_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that core.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -794,7 +792,7 @@ static void core_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("IMC: Unable to stop the counters for core %d\n", core_id);
return;
}
@@ -802,7 +800,7 @@ static void core_imc_counters_release(struct perf_event *event)
WARN(1, "core-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
reset_global_refc(event);
}
@@ -840,7 +838,6 @@ static int core_imc_event_init(struct perf_event *event)
if ((!pcmi->vbase))
return -ENODEV;
- /* Get the core_imc mutex for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
@@ -848,22 +845,22 @@ static int core_imc_event_init(struct perf_event *event)
/*
* Core pmu units are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the core counters.
+ * If yes, take the lock and enable the core counters.
* If not, just increment the count in core_imc_refc struct.
*/
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("core-imc: Unable to start the counters for core %d\n",
core_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/*
* Since the system can run either in accumulation or trace-mode
@@ -874,7 +871,7 @@ static int core_imc_event_init(struct perf_event *event)
* to know whether any other trace/thread imc
* events are running.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_CORE) {
/*
* No other trace/thread imc events are running in
@@ -883,10 +880,10 @@ static int core_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_CORE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.event_base = (u64)pcmi->vbase + (config & IMC_EVENT_OFFSET_MASK);
event->destroy = core_imc_counters_release;
@@ -958,10 +955,10 @@ static int ppc_thread_imc_cpu_offline(unsigned int cpu)
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
/* Reduce the refc if thread-imc event running on this cpu */
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_THREAD)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1001,7 +998,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (!target)
return -EINVAL;
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
/*
* Check if any other trace/core imc events are running in the
* system, if not set the global id to thread-imc.
@@ -1010,10 +1007,10 @@ static int thread_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_THREAD;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->pmu->task_ctx_nr = perf_sw_context;
event->destroy = reset_global_refc;
@@ -1135,25 +1132,25 @@ static int thread_imc_event_add(struct perf_event *event, int flags)
/*
* imc pmus are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the counters.
+ * If yes, take the lock and enable the counters.
* If not, just increment the count in ref count struct.
*/
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to start the counter\
for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1170,12 +1167,12 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to stop the counters\
for core %d\n", core_id);
return;
@@ -1183,7 +1180,7 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/* Set bit 0 of LDBAR to zero, to stop posting updates to memory */
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
@@ -1224,9 +1221,8 @@ static int trace_imc_mem_alloc(int cpu_id, int size)
}
}
- /* Init the mutex, if not already */
trace_imc_refc[core_id].id = core_id;
- mutex_init(&trace_imc_refc[core_id].lock);
+ spin_lock_init(&trace_imc_refc[core_id].lock);
mtspr(SPRN_LDBAR, 0);
return 0;
@@ -1246,10 +1242,10 @@ static int ppc_trace_imc_cpu_offline(unsigned int cpu)
* Reduce the refc if any trace-imc event running
* on this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_TRACE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1371,17 +1367,17 @@ static int trace_imc_event_add(struct perf_event *event, int flags)
}
mtspr(SPRN_LDBAR, ldbar_value);
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to start the counters for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1414,19 +1410,19 @@ static void trace_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to stop the counters for core %d\n", core_id);
return;
}
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
trace_imc_event_stop(event, flags);
}
@@ -1448,7 +1444,7 @@ static int trace_imc_event_init(struct perf_event *event)
* no other thread is running any core/thread imc
* events
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_TRACE) {
/*
* No core/thread imc events are running in the
@@ -1457,10 +1453,10 @@ static int trace_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_TRACE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.idx = -1;
@@ -1533,10 +1529,10 @@ static int init_nest_pmu_ref(void)
i = 0;
for_each_node(nid) {
/*
- * Mutex lock to avoid races while tracking the number of
+ * Take the lock to avoid races while tracking the number of
* sessions using the chip's nest pmu units.
*/
- mutex_init(&nest_imc_refc[i].lock);
+ spin_lock_init(&nest_imc_refc[i].lock);
/*
* Loop to init the "id" with the node_id. Variable "i" initialized to
@@ -1633,7 +1629,7 @@ static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
{
if (pmu_ptr->domain == IMC_DOMAIN_NEST) {
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 1) {
cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);
@@ -1643,7 +1639,7 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
if (nest_pmus > 0)
nest_pmus--;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
}
/* Free core_imc memory */
@@ -1800,11 +1796,11 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
* rest. To handle the cpuhotplug callback unregister, we track
* the number of nest pmus in "nest_pmus".
*/
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 0) {
ret = init_nest_pmu_ref();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
goto err_free_mem;
@@ -1812,7 +1808,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
/* Register for cpu hotplug notification. */
ret = nest_pmu_cpumask_init();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(nest_imc_refc);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
@@ -1820,7 +1816,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
}
}
nest_pmus++;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
break;
case IMC_DOMAIN_CORE:
ret = core_imc_pmu_cpumask_init();
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
87ca4f9efbd7 ("sched/core: Fix use-after-free bug in dup_user_cpus_ptr()")
8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
713a2e21a513 ("sched: Introduce affinity_context")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87ca4f9efbd7cc649ff43b87970888f2812945b8 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman(a)redhat.com>
Date: Fri, 30 Dec 2022 23:11:19 -0500
Subject: [PATCH] sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be
restricted on asymmetric systems"), the setting and clearing of
user_cpus_ptr are done under pi_lock for arm64 architecture. However,
dup_user_cpus_ptr() accesses user_cpus_ptr without any lock
protection. Since sched_setaffinity() can be invoked from another
process, the process being modified may be undergoing fork() at
the same time. When racing with the clearing of user_cpus_ptr in
__set_cpus_allowed_ptr_locked(), it can lead to user-after-free and
possibly double-free in arm64 kernel.
Commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask") fixes this problem as user_cpus_ptr, once set, will never
be cleared in a task's lifetime. However, this bug was re-introduced
in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in
do_set_cpus_allowed(). This time, it will affect all arches.
Fix this bug by always clearing the user_cpus_ptr of the newly
cloned/forked task before the copying process starts and check the
user_cpus_ptr state of the source task under pi_lock.
Note to stable, this patch won't be applicable to stable releases.
Just copy the new dup_user_cpus_ptr() function over.
Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems")
Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: David Wang 王标 <wangbiao3(a)xiaomi.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965d813c28ad..f9f6e5413dcf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2612,19 +2612,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
int node)
{
+ cpumask_t *user_mask;
unsigned long flags;
- if (!src->user_cpus_ptr)
+ /*
+ * Always clear dst->user_cpus_ptr first as their user_cpus_ptr's
+ * may differ by now due to racing.
+ */
+ dst->user_cpus_ptr = NULL;
+
+ /*
+ * This check is racy and losing the race is a valid situation.
+ * It is not worth the extra overhead of taking the pi_lock on
+ * every fork/clone.
+ */
+ if (data_race(!src->user_cpus_ptr))
return 0;
- dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
- if (!dst->user_cpus_ptr)
+ user_mask = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
+ if (!user_mask)
return -ENOMEM;
- /* Use pi_lock to protect content of user_cpus_ptr */
+ /*
+ * Use pi_lock to protect content of user_cpus_ptr
+ *
+ * Though unlikely, user_cpus_ptr can be reset to NULL by a concurrent
+ * do_set_cpus_allowed().
+ */
raw_spin_lock_irqsave(&src->pi_lock, flags);
- cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ if (src->user_cpus_ptr) {
+ swap(dst->user_cpus_ptr, user_mask);
+ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ }
raw_spin_unlock_irqrestore(&src->pi_lock, flags);
+
+ if (unlikely(user_mask))
+ kfree(user_mask);
+
return 0;
}
Syzkaller reports warning in submit_bio_checks in 5.10 stable releases.
The issue has been fixed by the following patch which can be cleanly
applied to 5.10 branch.
bio->bi_bdev from original patch is not compatible with 5.10 so adapted
the print message to work with 5.10 cleanly.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
142e821f68cf ("iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()")
ac304c070c54 ("iommu/mediatek-v1: Add error handle for mtk_iommu_probe")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 142e821f68cf5da79ce722cb9c1323afae30e185 Mon Sep 17 00:00:00 2001
From: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Date: Mon, 19 Dec 2022 19:06:22 +0100
Subject: [PATCH] iommu/mediatek-v1: Fix an error handling path in
mtk_iommu_v1_probe()
A clk, prepared and enabled in mtk_iommu_v1_hw_init(), is not released in
the error handling path of mtk_iommu_v1_probe().
Add the corresponding clk_disable_unprepare(), as already done in the
remove function.
Fixes: b17336c55d89 ("iommu/mediatek: add support for mtk iommu generation one HW")
Signed-off-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Reviewed-by: Yong Wu <yong.wu(a)mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg(a)gmail.com>
Link: https://lore.kernel.org/r/593e7b7d97c6e064b29716b091a9d4fd122241fb.16714731…
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 69682ee068d2..ca581ff1c769 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -683,7 +683,7 @@ static int mtk_iommu_v1_probe(struct platform_device *pdev)
ret = iommu_device_sysfs_add(&data->iommu, &pdev->dev, NULL,
dev_name(&pdev->dev));
if (ret)
- return ret;
+ goto out_clk_unprepare;
ret = iommu_device_register(&data->iommu, &mtk_iommu_v1_ops, dev);
if (ret)
@@ -698,6 +698,8 @@ out_dev_unreg:
iommu_device_unregister(&data->iommu);
out_sysfs_remove:
iommu_device_sysfs_remove(&data->iommu);
+out_clk_unprepare:
+ clk_disable_unprepare(data->bclk);
return ret;
}
Hello,
I'm running Ubuntu 20.10 and experienced a regression when trying
kernel 6.1.5 that was not present in previous kernels such as 6.0.19
or 6.0.9
I have two monitors attached using DisplayPort MST (daisy chaining the
displays).
Normally, this works without issue and both monitors are detected and
display properly.
With kernel 6.1.5, the second monitor (the monitor that's not directly
connected to the computer) was detected but would not display. I
didn't spend much time troubleshooting the issue and instead reverted
to 6.0.19 where it again worked without issue. The first monitor
worked without issue.
Let me know what steps I can take to help identify the root cause.
-Matt