The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b52be557e24c47286738276121177a41f54e3b83 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Mon, 5 Dec 2022 17:59:27 +0100
Subject: [PATCH] ipc/sem: Fix dangling sem_array access in semtimedop race
When __do_semtimedop() goes to sleep because it has to wait for a
semaphore value becoming zero or becoming bigger than some threshold, it
links the on-stack sem_queue to the sem_array, then goes to sleep
without holding a reference on the sem_array.
When __do_semtimedop() comes back out of sleep, one of two things must
happen:
a) We prove that the on-stack sem_queue has been disconnected from the
(possibly freed) sem_array, making it safe to return from the stack
frame that the sem_queue exists in.
b) We stabilize our reference to the sem_array, lock the sem_array, and
detach the sem_queue from the sem_array ourselves.
sem_array has RCU lifetime, so for case (b), the reference can be
stabilized inside an RCU read-side critical section by locklessly
checking whether the sem_queue is still connected to the sem_array.
However, the current code does the lockless check on sem_queue before
starting an RCU read-side critical section, so the result of the
lockless check immediately becomes useless.
Fix it by doing rcu_read_lock() before the lockless check. Now RCU
ensures that if we observe the object being on our queue, the object
can't be freed until rcu_read_unlock().
This bug is only hittable on kernel builds with full preemption support
(either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full).
Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/ipc/sem.c b/ipc/sem.c
index c8496f98b139..00f88aa01ac5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2179,14 +2179,15 @@ long __do_semtimedop(int semid, struct sembuf *sops,
* scenarios where we were awakened externally, during the
* window between wake_q_add() and wake_up_q().
*/
+ rcu_read_lock();
error = READ_ONCE(queue.status);
if (error != -EINTR) {
/* see SEM_BARRIER_2 for purpose/pairing */
smp_acquire__after_ctrl_dep();
+ rcu_read_unlock();
goto out;
}
- rcu_read_lock();
locknum = sem_lock(sma, sops, nsops);
if (!ipc_valid_object(&sma->sem_perm))
--
Hello,
I tried e-mailing you more than twice but my email bounced back
failure, Note this, soonest you receive this email revert to me before
I deliver the message it's importunate, pressing, crucial. Await your
response.
Best regards
Dr. Mimi Aminu
After calling napi_complete_done(), the NAPIF_STATE_SCHED bit may be
cleared, and another CPU can start napi thread and access per-CQ variable,
cq->work_done. If the other thread (for example, from busy_poll) sets
it to a value >= budget, this thread will continue to run when it should
stop, and cause memory corruption and panic.
To fix this issue, save the per-CQ work_done variable in a local variable
before napi_complete_done(), so it won't be corrupted by a possible
concurrent thread after napi_complete_done().
Also, add a flag bit to advertise to the NIC firmware: the NAPI work_done
variable race is fixed, so the driver is able to reliably support features
like busy_poll.
Cc: stable(a)vger.kernel.org
Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ")
Signed-off-by: Haiyang Zhang <haiyangz(a)microsoft.com>
---
drivers/net/ethernet/microsoft/mana/gdma.h | 9 ++++++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 16 +++++++++++-----
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index 4a6efe6ada08..65c24ee49efd 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -498,7 +498,14 @@ enum {
#define GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT BIT(0)
-#define GDMA_DRV_CAP_FLAGS1 GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT
+/* Advertise to the NIC firmware: the NAPI work_done variable race is fixed,
+ * so the driver is able to reliably support features like busy_poll.
+ */
+#define GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX BIT(2)
+
+#define GDMA_DRV_CAP_FLAGS1 \
+ (GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
+ GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX)
#define GDMA_DRV_CAP_FLAGS2 0
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9259a74eca40..27a0f3af8aab 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1303,10 +1303,11 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
xdp_do_flush();
}
-static void mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
+static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
{
struct mana_cq *cq = context;
u8 arm_bit;
+ int w;
WARN_ON_ONCE(cq->gdma_cq != gdma_queue);
@@ -1315,26 +1316,31 @@ static void mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
else
mana_poll_tx_cq(cq);
- if (cq->work_done < cq->budget &&
- napi_complete_done(&cq->napi, cq->work_done)) {
+ w = cq->work_done;
+
+ if (w < cq->budget &&
+ napi_complete_done(&cq->napi, w)) {
arm_bit = SET_ARM_BIT;
} else {
arm_bit = 0;
}
mana_gd_ring_cq(gdma_queue, arm_bit);
+
+ return w;
}
static int mana_poll(struct napi_struct *napi, int budget)
{
struct mana_cq *cq = container_of(napi, struct mana_cq, napi);
+ int w;
cq->work_done = 0;
cq->budget = budget;
- mana_cq_handler(cq, cq->gdma_cq);
+ w = mana_cq_handler(cq, cq->gdma_cq);
- return min(cq->work_done, budget);
+ return min(w, budget);
}
static void mana_schedule_napi(void *context, struct gdma_queue *gdma_queue)
--
2.25.1
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 97eea946b93961fffd29448dcda7398d0d51c4b2 ]
The bounds checks in snd_soc_put_volsw_sx() are only being applied to the
first channel, meaning it is possible to write out of bounds values to the
second channel in stereo controls. Add appropriate checks.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20220511134137.169575-2-broonie@kernel.org
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/soc-ops.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/sound/soc/soc-ops.c b/sound/soc/soc-ops.c
index 4fda8c24be29..7129bb685b63 100644
--- a/sound/soc/soc-ops.c
+++ b/sound/soc/soc-ops.c
@@ -465,6 +465,12 @@ int snd_soc_put_volsw_sx(struct snd_kcontrol *kcontrol,
if (snd_soc_volsw_is_stereo(mc)) {
val_mask = mask << rshift;
val2 = (ucontrol->value.integer.value[1] + min) & mask;
+
+ if (mc->platform_max && val2 > mc->platform_max)
+ return -EINVAL;
+ if (val2 > max)
+ return -EINVAL;
+
val2 = val2 << rshift;
err = snd_soc_component_update_bits(component, reg2, val_mask,
--
2.35.1
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 97eea946b93961fffd29448dcda7398d0d51c4b2 ]
The bounds checks in snd_soc_put_volsw_sx() are only being applied to the
first channel, meaning it is possible to write out of bounds values to the
second channel in stereo controls. Add appropriate checks.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20220511134137.169575-2-broonie@kernel.org
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/soc-ops.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/sound/soc/soc-ops.c b/sound/soc/soc-ops.c
index 81c9ecfa7c7f..63c0e61b1754 100644
--- a/sound/soc/soc-ops.c
+++ b/sound/soc/soc-ops.c
@@ -465,6 +465,12 @@ int snd_soc_put_volsw_sx(struct snd_kcontrol *kcontrol,
if (snd_soc_volsw_is_stereo(mc)) {
val_mask = mask << rshift;
val2 = (ucontrol->value.integer.value[1] + min) & mask;
+
+ if (mc->platform_max && val2 > mc->platform_max)
+ return -EINVAL;
+ if (val2 > max)
+ return -EINVAL;
+
val2 = val2 << rshift;
err = snd_soc_component_update_bits(component, reg2, val_mask,
--
2.35.1
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 97eea946b93961fffd29448dcda7398d0d51c4b2 ]
The bounds checks in snd_soc_put_volsw_sx() are only being applied to the
first channel, meaning it is possible to write out of bounds values to the
second channel in stereo controls. Add appropriate checks.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20220511134137.169575-2-broonie@kernel.org
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/soc-ops.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/sound/soc/soc-ops.c b/sound/soc/soc-ops.c
index 453b61b42dd9..00e6a6e46fe5 100644
--- a/sound/soc/soc-ops.c
+++ b/sound/soc/soc-ops.c
@@ -460,6 +460,12 @@ int snd_soc_put_volsw_sx(struct snd_kcontrol *kcontrol,
if (snd_soc_volsw_is_stereo(mc)) {
val_mask = mask << rshift;
val2 = (ucontrol->value.integer.value[1] + min) & mask;
+
+ if (mc->platform_max && val2 > mc->platform_max)
+ return -EINVAL;
+ if (val2 > max)
+ return -EINVAL;
+
val2 = val2 << rshift;
err = snd_soc_component_update_bits(component, reg2, val_mask,
--
2.35.1
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 97eea946b93961fffd29448dcda7398d0d51c4b2 ]
The bounds checks in snd_soc_put_volsw_sx() are only being applied to the
first channel, meaning it is possible to write out of bounds values to the
second channel in stereo controls. Add appropriate checks.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20220511134137.169575-2-broonie@kernel.org
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/soc-ops.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/sound/soc/soc-ops.c b/sound/soc/soc-ops.c
index 453b61b42dd9..00e6a6e46fe5 100644
--- a/sound/soc/soc-ops.c
+++ b/sound/soc/soc-ops.c
@@ -460,6 +460,12 @@ int snd_soc_put_volsw_sx(struct snd_kcontrol *kcontrol,
if (snd_soc_volsw_is_stereo(mc)) {
val_mask = mask << rshift;
val2 = (ucontrol->value.integer.value[1] + min) & mask;
+
+ if (mc->platform_max && val2 > mc->platform_max)
+ return -EINVAL;
+ if (val2 > max)
+ return -EINVAL;
+
val2 = val2 << rshift;
err = snd_soc_component_update_bits(component, reg2, val_mask,
--
2.35.1
From: Shengjiu Wang <shengjiu.wang(a)nxp.com>
[ Upstream commit 292709b9cf3ba470af94b62c9bb60284cc581b79 ]
SRES is self-cleared bit, but REG_MICFIL_CTRL1 is defined as
non volatile register, it still remain in regmap cache after set,
then every update of REG_MICFIL_CTRL1, software reset happens.
to avoid this, clear it explicitly.
Signed-off-by: Shengjiu Wang <shengjiu.wang(a)nxp.com>
Link: https://lore.kernel.org/r/1651925654-32060-1-git-send-email-shengjiu.wang@n…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/fsl/fsl_micfil.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/sound/soc/fsl/fsl_micfil.c b/sound/soc/fsl/fsl_micfil.c
index efc5daf53bba..ead4bfa13561 100644
--- a/sound/soc/fsl/fsl_micfil.c
+++ b/sound/soc/fsl/fsl_micfil.c
@@ -190,6 +190,17 @@ static int fsl_micfil_reset(struct device *dev)
return ret;
}
+ /*
+ * SRES is self-cleared bit, but REG_MICFIL_CTRL1 is defined
+ * as non-volatile register, so SRES still remain in regmap
+ * cache after set, that every update of REG_MICFIL_CTRL1,
+ * software reset happens. so clear it explicitly.
+ */
+ ret = regmap_clear_bits(micfil->regmap, REG_MICFIL_CTRL1,
+ MICFIL_CTRL1_SRES);
+ if (ret)
+ return ret;
+
return 0;
}
--
2.35.1
From: Shengjiu Wang <shengjiu.wang(a)nxp.com>
[ Upstream commit 292709b9cf3ba470af94b62c9bb60284cc581b79 ]
SRES is self-cleared bit, but REG_MICFIL_CTRL1 is defined as
non volatile register, it still remain in regmap cache after set,
then every update of REG_MICFIL_CTRL1, software reset happens.
to avoid this, clear it explicitly.
Signed-off-by: Shengjiu Wang <shengjiu.wang(a)nxp.com>
Link: https://lore.kernel.org/r/1651925654-32060-1-git-send-email-shengjiu.wang@n…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
sound/soc/fsl/fsl_micfil.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/sound/soc/fsl/fsl_micfil.c b/sound/soc/fsl/fsl_micfil.c
index 9f90989ac59a..cb84d95c3aac 100644
--- a/sound/soc/fsl/fsl_micfil.c
+++ b/sound/soc/fsl/fsl_micfil.c
@@ -191,6 +191,17 @@ static int fsl_micfil_reset(struct device *dev)
return ret;
}
+ /*
+ * SRES is self-cleared bit, but REG_MICFIL_CTRL1 is defined
+ * as non-volatile register, so SRES still remain in regmap
+ * cache after set, that every update of REG_MICFIL_CTRL1,
+ * software reset happens. so clear it explicitly.
+ */
+ ret = regmap_clear_bits(micfil->regmap, REG_MICFIL_CTRL1,
+ MICFIL_CTRL1_SRES);
+ if (ret)
+ return ret;
+
return 0;
}
--
2.35.1
[commit 6647e76ab623b2b3fb2efe03a86e9c9046c52c33 upstream]
The V4L2_MEMORY_USERPTR interface is long deprecated and shouldn't be
used (and is discouraged for any modern v4l drivers). And Seth Jenkins
points out that the fallback to VM_PFNMAP/VM_IO is fundamentally racy
and dangerous.
Note that it's not even a case that should trigger, since any normal
user pointer logic ends up just using the pin_user_pages_fast() call
that does the proper page reference counting. That's not the problem
case, only if you try to use special device mappings do you have any
issues.
Normally I'd just remove this during the merge window, but since Seth
pointed out the problem cases, we really want to know as soon as
possible if there are actually any users of this odd special case of a
legacy interface. Neither Hans nor Mauro seem to think that such
mis-uses of the old legacy interface should exist. As Mauro says:
"See, V4L2 has actually 4 streaming APIs:
- Kernel-allocated mmap (usually referred simply as just mmap);
- USERPTR mmap;
- read();
- dmabuf;
The USERPTR is one of the oldest way to use it, coming from V4L
version 1 times, and by far the least used one"
And Hans chimed in on the USERPTR interface:
"To be honest, I wouldn't mind if it goes away completely, but that's a
bit of a pipe dream right now"
but while removing this legacy interface entirely may be a pipe dream we
can at least try to remove the unlikely (and actively broken) case of
using special device mappings for USERPTR accesses.
This replaces it with a WARN_ONCE() that we can remove once we've
hopefully confirmed that no actual users exist.
NOTE! Longer term, this means that a 'struct frame_vector' only ever
contains proper page pointers, and all the games we have with converting
them to pages can go away (grep for 'frame_vector_to_pages()' and the
uses of 'vec->is_pfns'). But this is just the first step, to verify
that this code really is all dead, and do so as quickly as possible.
Reported-by: Seth Jenkins <sethjenkins(a)google.com>
Acked-by: Hans Verkuil <hverkuil(a)xs4all.nl>
Acked-by: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jan Kara <jack(a)suse.cz>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
CC: stable(a)vger.kernel.org # 4.9
Signed-off-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
---
mm/frame_vector.c | 31 ++++++-------------------------
1 file changed, 6 insertions(+), 25 deletions(-)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index d73eed0443f6..aa5526e62c5e 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -36,7 +36,6 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int ret = 0;
- int err;
int locked;
if (nr_frames == 0)
@@ -71,32 +70,14 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
vec->is_pfns = false;
ret = get_user_pages_locked(start, nr_frames,
gup_flags, (struct page **)(vec->ptrs), &locked);
- goto out;
+ if (likely(ret > 0))
+ goto out;
}
- vec->got_ref = false;
- vec->is_pfns = true;
- do {
- unsigned long *nums = frame_vector_pfns(vec);
-
- while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
- err = follow_pfn(vma, start, &nums[ret]);
- if (err) {
- if (ret == 0)
- ret = err;
- goto out;
- }
- start += PAGE_SIZE;
- ret++;
- }
- /*
- * We stop if we have enough pages or if VMA doesn't completely
- * cover the tail page.
- */
- if (ret >= nr_frames || start < vma->vm_end)
- break;
- vma = find_vma_intersection(mm, start, start + 1);
- } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
+ /* This used to (racily) return non-refcounted pfns. Let people know */
+ WARN_ONCE(1, "get_vaddr_frames() cannot follow VM_IO mapping");
+ vec->nr_frames = 0;
+
out:
if (locked)
up_read(&mm->mmap_sem);
--
2.39.0.rc0.267.gcb52ba06e7-goog
Hello friend,
I am writing to you on behalf of my client Mr. Yusuf Habib who is currently being held by the Saudi authorities. My name is Lukas, I am an investment portfolio Manager, and my client (Mr. Yusuf Habib) has a large sum of money and he is looking for someone to help him manage the funds.
The Saudi government filed charges against my-client Mr. Yusuf Habib with the aim of keeping him in captivity indefinitely. A variety of local and foreign politicians, civil activists, and journalists consider the process leading to the detention of Mr. Yusuf to be politically motivated. My client's involvement and financial support for Jamal Ahmad Khashoggi posed the most challenge ever to Mohammed bin Salman Al Saud who happens to be the current Crown Prince of Saudi Arabia. The money is currently deposited in the name of an existing Investment entity.
My client Mr. Yusuf Habib has presented a subtle offer that will need the help of a foreign partner like you to complete successfully. Mr. Yusuf Habib is in a difficult situation, and he must immediately relocate certain sums of money and this must be done in such a way that it must not be tied to Mr. Yusuf Habib. The money is currently deposited in the name of an existing Investment entity.
Your role will be to:
[1]. Act as the original beneficiary of the funds.
[2]. Receive the funds into a business / private bank account.
[3]. Invest / Manage the funds outside of Turkey
[4]. Value of funds: 35 million US Dollars.
See the website below to understand better the problem Mr. Yusuf Habib faced all these past years:
https://www.foxnews.com/world/
saudi-government-to-
confiscate-800-billion-from-
alleged-corrupt-individuals.
Everything will be done legally to ensure the rights to the funds are transferred to you. If you agree to partner with Mr. Yusuf Habib in this partnership business proposal, he will compensate you with 35% percent of the total sum.
Terms will be discussed when you show interest and if you aren't interested and you know of someone looking for an investor, please give him / her my contact.
Should you prefer I re-contact you with more express facts. Then make your interest known.
Sincerely,
Lukas.
Hello there,
I am writing to you on behalf of Mr. Yusuf Habib. My name is Lukas, I am
an investment portfolio Manager at MetLife, and my client (Mr. Yusuf
Habib) has a large sum of money he is looking for someone to help him
manage the funds.
The Saudi government filed charges against my-client Mr. Yusuf Habib
with the aim of keeping him in prison indefinitely. A variety of local
and foreign politicians, civil activists, and journalists consider the
process leading to the imprisonment of Mr. Yusuf to be politically
motivated. My client&#039;s involvement and financial support for Jamal Ahmad
Khashoggi posed the most challenge ever to Mohammed bin Salman Al Saud
who happens to be the current Crown Prince of Saudi Arabia the money is
currently deposited in the name of an existing Investment entity.
My client Mr. Yusuf Habib has presented a subtle offer that will need
the help of a partner like you to complete successfully. Mr. Yusuf Habib
is in a difficult situation, and he must immediately relocate certain
sums of money and this must be done in such a way that it must not be
tied to Mr. Yusuf Habib. The money is currently deposited in the name of
an existing Investment entity. Your role will be to:
[1]. Act as the original beneficiary of the funds.
[2]. Receive the funds into a business / private bank account.
[3]. Invest / Manage the funds outside of Turkey
[4]. Value of funds: 35 million US Dollars.
Everything will be done legally to ensure the rights to the funds are
transferred to you. If you agree to partner with Mr. Yusuf Habib in this
partnership business proposal, he will compensate you with 35% percent
of the total sum. Terms will be discussed when you show interest and if
you aren&#039;t interested and you know of someone looking for an investor,
please give him / her my contact.
Should you prefer I re-contact you with more express facts. Then make
your interest known.
Sincerely,
Lukas.
Since commit 0f0101719138 ("usb: dwc3: Don't switch OTG -> peripheral
if extcon is present"), Dual Role support on Intel Merrifield platform
broke due to rearranging the call to dwc3_get_extcon().
It appears to be caused by ulpi_read_id() masking the timeout on the first
test write. In the past dwc3 probe continued by calling dwc3_core_soft_reset()
followed by dwc3_get_extcon() which happend to return -EPROBE_DEFER.
On deferred probe ulpi_read_id() finally succeeded. Due to above mentioned
rearranging -EPROBE_DEFER is not returned and probe completes without phy.
On Intel Merrifield the timeout on the first test write issue is reproducible
but it is difficult to find the root cause. Using a mainline kernel and
rootfs with buildroot ulpi_read_id() succeeds. As soon as adding
ftrace / bootconfig to find out why, ulpi_read_id() fails and we can't
analyze the flow. Using another rootfs ulpi_read_id() fails even without
adding ftrace. We suspect the issue is some kind of timing / race, but
merely retrying ulpi_read_id() does not resolve the issue.
As we now changed ulpi_read_id() to return -ETIMEDOUT in this case, we
need to handle the error by calling dwc3_core_soft_reset() and request
-EPROBE_DEFER. On deferred probe ulpi_read_id() is retried and succeeds.
Fixes: ef6a7bcfb01c ("usb: ulpi: Support device discovery via DT")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ferry Toth <ftoth(a)exalondelft.nl>
---
drivers/usb/dwc3/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 648f1c570021..2779f17bffaf 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1106,8 +1106,13 @@ static int dwc3_core_init(struct dwc3 *dwc)
if (!dwc->ulpi_ready) {
ret = dwc3_core_ulpi_init(dwc);
- if (ret)
+ if (ret) {
+ if (ret == -ETIMEDOUT) {
+ dwc3_core_soft_reset(dwc);
+ ret = -EPROBE_DEFER;
+ }
goto err0;
+ }
dwc->ulpi_ready = true;
}
--
2.37.2
The quilt patch titled
Subject: mm/mempolicy: failed to disable numa balancing
has been removed from the -mm tree. Its filename was
mm-mempolicy-failed-to-disable-numa-balancing.patch
This patch was dropped because it was nacked
------------------------------------------------------
From: tzm <tcm1030(a)163.com>
Subject: mm/mempolicy: failed to disable numa balancing
Date: Fri, 2 Dec 2022 22:16:30 +0800
The kernel fails to disable numa balancing policy permanently when the
user passes <numa_balancing=disable> to the boot cmdline parameters. The
numabalancing_override variable is 1 for enable -1 for disable. So,
!numabalancing_override will always be true, which causes this bug.
Link: https://lkml.kernel.org/r/20221202141630.41220-1-tcm1030@163.com
Signed-off-by: tzm <tcm1030(a)163.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/mempolicy.c~mm-mempolicy-failed-to-disable-numa-balancing
+++ a/mm/mempolicy.c
@@ -2865,7 +2865,7 @@ static void __init check_numabalancing_e
if (numabalancing_override)
set_numabalancing_state(numabalancing_override == 1);
- if (num_online_nodes() > 1 && !numabalancing_override) {
+ if (num_online_nodes() > 1 && (numabalancing_override == 1)) {
pr_info("%s automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl\n",
numabalancing_default ? "Enabling" : "Disabling");
set_numabalancing_state(numabalancing_default);
_
Patches currently in -mm which might be from tcm1030(a)163.com are
The following commit has been merged into the x86/microcode branch of tip:
Commit-ID: be1b670f61443aa5d0d01782e9b8ea0ee825d018
Gitweb: https://git.kernel.org/tip/be1b670f61443aa5d0d01782e9b8ea0ee825d018
Author: Ashok Raj <ashok.raj(a)intel.com>
AuthorDate: Tue, 29 Nov 2022 13:08:27 -08:00
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Mon, 05 Dec 2022 21:22:21 +01:00
x86/microcode/intel: Do not retry microcode reloading on the APs
The retries in load_ucode_intel_ap() were in place to support systems
with mixed steppings. Mixed steppings are no longer supported and there is
only one microcode image at a time. Any retries will simply reattempt to
apply the same image over and over without making progress.
[ bp: Zap the circumstantial reasoning from the commit message. ]
Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading")
Signed-off-by: Ashok Raj <ashok.raj(a)intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221129210832.107850-3-ashok.raj@intel.com
---
arch/x86/kernel/cpu/microcode/intel.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 4f93875..2dba4b5 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -495,7 +495,6 @@ void load_ucode_intel_ap(void)
else
iup = &intel_ucode_patch;
-reget:
if (!*iup) {
patch = __load_ucode_intel(&uci);
if (!patch)
@@ -506,12 +505,7 @@ reget:
uci.mc = *iup;
- if (apply_microcode_early(&uci, true)) {
- /* Mixed-silicon system? Try to refetch the proper patch: */
- *iup = NULL;
-
- goto reget;
- }
+ apply_microcode_early(&uci, true);
}
static struct microcode_intel *find_patch(struct ucode_cpu_info *uci)
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
ecfbd733878d ("hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer")
131a79b474e9 ("hugetlb: fix vma lock handling during split vma and range unmapping")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
763ecb035029 ("mm: remove the vma linked list")
8220543df148 ("nommu: remove uses of VMA linked list")
11f9a21ab655 ("mm/mmap: reorganize munmap to use maple states")
e99668a56430 ("mm/mmap: move mmap_region() below do_munmap()")
7964cf8caa4d ("mm: remove vmacache")
4dd1b84140c1 ("mm/mmap: use advanced maple tree API for mmap_region()")
abdba2dda0c4 ("mm: use maple tree operations for find_vma_intersection()")
2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 04ada095dcfc4ae359418053c0be94453bdf1e84 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Mon, 14 Nov 2022 15:55:06 -0800
Subject: [PATCH] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
processing
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table. This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.
Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
call from unmap_vmas(). This is used to indicate the 'final' unmapping of
a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and
the vm_lock is not deleted.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6J…
Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reported-by: Wei Chen <harperchen1110(a)gmail.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mina Almasry <almasrymina(a)google.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cbfb489d381c..974ccca609d2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1868,6 +1868,8 @@ struct zap_details {
* default, the flag is not set.
*/
#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
+/* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
+#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f1385c3b6c96..e36ca75311a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5206,17 +5206,22 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
- /*
- * Unlock and free the vma lock before releasing i_mmap_rwsem. When
- * the vma_lock is freed, this makes the vma ineligible for pmd
- * sharing. And, i_mmap_rwsem is required to set up pmd sharing.
- * This is important as page tables for this unmapped range will
- * be asynchrously deleted. If the page tables are shared, there
- * will be issues when accessed by someone else.
- */
- __hugetlb_vma_unlock_write_free(vma);
-
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ if (zap_flags & ZAP_FLAG_UNMAP) { /* final unmap */
+ /*
+ * Unlock and free the vma lock before releasing i_mmap_rwsem.
+ * When the vma_lock is freed, this makes the vma ineligible
+ * for pmd sharing. And, i_mmap_rwsem is required to set up
+ * pmd sharing. This is important as page tables for this
+ * unmapped range will be asynchrously deleted. If the page
+ * tables are shared, there will be issues when accessed by
+ * someone else.
+ */
+ __hugetlb_vma_unlock_write_free(vma);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ } else {
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
+ }
}
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/memory.c b/mm/memory.c
index 9bc5edc35725..8c8420934d60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1711,7 +1711,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
{
struct mmu_notifier_range range;
struct zap_details details = {
- .zap_flags = ZAP_FLAG_DROP_MARKER,
+ .zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
/* Careful - we need to zap private pages too! */
.even_cows = true,
};
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
763ecb035029 ("mm: remove the vma linked list")
8220543df148 ("nommu: remove uses of VMA linked list")
11f9a21ab655 ("mm/mmap: reorganize munmap to use maple states")
e99668a56430 ("mm/mmap: move mmap_region() below do_munmap()")
7964cf8caa4d ("mm: remove vmacache")
4dd1b84140c1 ("mm/mmap: use advanced maple tree API for mmap_region()")
abdba2dda0c4 ("mm: use maple tree operations for find_vma_intersection()")
2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
3b0e81a1cdc9 ("mmap: change zeroing of maple tree in __vma_adjust()")
524e00b36e8c ("mm: remove rb tree.")
c9dbe82cb99d ("kernel/fork: use maple tree for dup_mmap() during forking")
3499a13168da ("mm/mmap: use maple tree for unmapped_area{_topdown}")
be8432e7166e ("mm/mmap: use the maple tree in find_vma() instead of the rbtree.")
2e3af1db1744 ("mmap: use the VMA iterator in count_vma_pages_range()")
f39af05949a4 ("mm: add VMA iterator")
d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
54a611b60590 ("Maple Tree: add new data structure")
bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 21b85b09527c28e242db55c1b751f7f7549b830c Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Mon, 14 Nov 2022 15:55:05 -0800
Subject: [PATCH] madvise: use zap_page_range_single for madvise dontneed
This series addresses the issue first reported in [1], and fully described
in patch 2. Patches 1 and 2 address the user visible issue and are tagged
for stable backports.
While exploring solutions to this issue, related problems with mmu
notification calls were discovered. This is addressed in the patch
"hugetlb: remove duplicate mmu notifications:". Since there are no user
visible effects, this third is not tagged for stable backports.
Previous discussions suggested further cleanup by removing the
routine zap_page_range. This is possible because zap_page_range_single
is now exported, and all callers of zap_page_range pass ranges entirely
within a single vma. This work will be done in a later patch so as not
to distract from this bug fix.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6J…
This patch (of 2):
Expose the routine zap_page_range_single to zap a range within a single
vma. The madvise routine madvise_dontneed_single_vma can use this routine
as it explicitly operates on a single vma. Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account. This is required as MADV_DONTNEED supports hugetlb vmas.
Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reported-by: Wei Chen <harperchen1110(a)gmail.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mina Almasry <almasrymina(a)google.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..cbfb489d381c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1852,6 +1852,23 @@ static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodem
__show_free_areas(flags, nodemask, MAX_NR_ZONES - 1);
}
+/*
+ * Parameter block passed down to zap_pte_range in exceptional cases.
+ */
+struct zap_details {
+ struct folio *single_folio; /* Locked folio to be unmapped */
+ bool even_cows; /* Zap COWed private pages too? */
+ zap_flags_t zap_flags; /* Extra flags for zapping */
+};
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory. This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma. By
+ * default, the flag is not set.
+ */
+#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
+
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
@@ -1869,6 +1886,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size, struct zap_details *details);
void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
unsigned long end);
@@ -3467,12 +3486,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif
-/*
- * Whether to drop the pte markers, for example, the uffd-wp information for
- * file-backed memory. This should only be specified when we will completely
- * drop the page in the mm, either by truncation or unmapping of the vma. By
- * default, the flag is not set.
- */
-#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
-
#endif /* _LINUX_MM_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index c7105ec6d08c..b913ba6efc10 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -772,8 +772,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
- * zap_page_range call sets things up for shrink_active_list to actually free
- * these pages later if no one else has touched them in the meantime,
+ * zap_page_range_single call sets things up for shrink_active_list to actually
+ * free these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
*
@@ -790,7 +790,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range(vma, start, end - start);
+ zap_page_range_single(vma, start, end - start, NULL);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 8a6d5c823f91..9bc5edc35725 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1341,15 +1341,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}
-/*
- * Parameter block passed down to zap_pte_range in exceptional cases.
- */
-struct zap_details {
- struct folio *single_folio; /* Locked folio to be unmapped */
- bool even_cows; /* Zap COWed private pages too? */
- zap_flags_t zap_flags; /* Extra flags for zapping */
-};
-
/* Whether we should zap all COWed (private) pages too */
static inline bool should_zap_cows(struct zap_details *details)
{
@@ -1774,19 +1765,27 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
*
* The range must fit into one VMA.
*/
-static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
+ const unsigned long end = address + size;
struct mmu_notifier_range range;
struct mmu_gather tlb;
lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
- address, address + size);
+ address, end);
+ if (is_vm_hugetlb_page(vma))
+ adjust_range_if_pmd_sharing_possible(vma, &range.start,
+ &range.end);
tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
- unmap_single_vma(&tlb, vma, address, range.end, details);
+ /*
+ * unmap 'address-end' not 'range.start-range.end' as range
+ * could have been expanded for hugetlb pmd sharing.
+ */
+ unmap_single_vma(&tlb, vma, address, end, details);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}
We are a procuring Branch of Reitan Groups. A Retail and Trading
company located in Norway and newly now in USA. This is one of
our procuring services, hence will like to order your Product.
Please provide us with your offers, catalog so we can choose
precisely the needed products required by our client.
Advise if you can as well Ship to USA and Europe.
Your prompt response will be well appreciated.
Thank You,
Kriston Jentes
Oversea Marketing Manager
REMA 1000 REMA 1000 Norge
+47 24 09 85 00Gladengveien 2, 0661, Oslo
Postboks 6428 Etterstad, 0605 Oslo
https://www.reitangruppen.no
v5.11 changes the blkdev lookup mechanism completely since commit
22ae8ce8b892 ("block: simplify bdev/disk lookup in blkdev_get"),
and small part of the change is to unhash part bdev inode when
deleting partition. Turns out this kind of change does fix one
nasty issue in case of BLOCK_EXT_MAJOR:
1) when one partition is deleted & closed, disk_put_part() is always
called before bdput(bdev), see blkdev_put(); so the part's devt can
be freed & re-used before the inode is dropped
2) then new partition with same devt can be created just before the
inode in 1) is dropped, then the old inode/bdev structurein 1) is
re-used for this new partition, this way causes use-after-free and
kernel panic.
It isn't possible to backport the whole big patchset of "merge struct
block_device and struct hd_struct v4" for addressing this issue.
https://lore.kernel.org/linux-block/20201128161510.347752-1-hch@lst.de/
So fixes it by unhashing part bdev in delete_partition(), and this way
is actually aligned with v5.11+'s behavior.
Reported-by: Shiwei Cui <cuishw(a)inspur.com>
Tested-by: Shiwei Cui <cuishw(a)inspur.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Jan Kara <jack(a)suse.cz>
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
---
V2:
- fix one typo and Shiwei's email format
block/partitions/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/block/partitions/core.c b/block/partitions/core.c
index a02e22411594..e3d61ec4a5a6 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -329,6 +329,7 @@ void delete_partition(struct hd_struct *part)
struct gendisk *disk = part_to_disk(part);
struct disk_part_tbl *ptbl =
rcu_dereference_protected(disk->part_tbl, 1);
+ struct block_device *bdev;
/*
* ->part_tbl is referenced in this part's release handler, so
@@ -346,6 +347,12 @@ void delete_partition(struct hd_struct *part)
* "in-use" until we really free the gendisk.
*/
blk_invalidate_devt(part_devt(part));
+
+ bdev = bdget_part(part);
+ if (bdev) {
+ remove_inode_hash(bdev->bd_inode);
+ bdput(bdev);
+ }
percpu_ref_kill(&part->ref);
}
--
2.37.3
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
c981cdfb9925 ("mmc: sdhci: Fix voltage switch delay")
fa0910107a9f ("mmc: sdhci: use FIELD_GET for preset value bit masks")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c981cdfb9925f64a364f13c2b4f98f877308a408 Mon Sep 17 00:00:00 2001
From: Adrian Hunter <adrian.hunter(a)intel.com>
Date: Mon, 28 Nov 2022 15:32:56 +0200
Subject: [PATCH] mmc: sdhci: Fix voltage switch delay
Commit 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
removed voltage switch delays from sdhci because mmc core had been
enhanced to support them. However that assumed that sdhci_set_ios()
did a single clock change, which it did not, and so the delays in mmc
core, which should have come after the first clock change, were not
effective.
Fix by avoiding re-configuring UHS and preset settings when the clock
is turning on and the settings have not changed. That then also avoids
the associated clock changes, so that then sdhci_set_ios() does a single
clock change when voltage switching, and the mmc core delays become
effective.
To do that has meant keeping track of driver strength (host->drv_type),
and cases of reinitialization (host->reinit_uhs).
Note also, the 'turning_on_clk' restriction should not be necessary
but is done to minimize the impact of the change on stable kernels.
Fixes: 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20221128133259.38305-2-adrian.hunter@intel.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index fef03de85b99..c7ad32a75b57 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -373,6 +373,7 @@ static void sdhci_init(struct sdhci_host *host, int soft)
if (soft) {
/* force clock reconfiguration */
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
}
}
@@ -2293,11 +2294,46 @@ void sdhci_set_uhs_signaling(struct sdhci_host *host, unsigned timing)
}
EXPORT_SYMBOL_GPL(sdhci_set_uhs_signaling);
+static bool sdhci_timing_has_preset(unsigned char timing)
+{
+ switch (timing) {
+ case MMC_TIMING_UHS_SDR12:
+ case MMC_TIMING_UHS_SDR25:
+ case MMC_TIMING_UHS_SDR50:
+ case MMC_TIMING_UHS_SDR104:
+ case MMC_TIMING_UHS_DDR50:
+ case MMC_TIMING_MMC_DDR52:
+ return true;
+ };
+ return false;
+}
+
+static bool sdhci_preset_needed(struct sdhci_host *host, unsigned char timing)
+{
+ return !(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
+ sdhci_timing_has_preset(timing);
+}
+
+static bool sdhci_presetable_values_change(struct sdhci_host *host, struct mmc_ios *ios)
+{
+ /*
+ * Preset Values are: Driver Strength, Clock Generator and SDCLK/RCLK
+ * Frequency. Check if preset values need to be enabled, or the Driver
+ * Strength needs updating. Note, clock changes are handled separately.
+ */
+ return !host->preset_enabled &&
+ (sdhci_preset_needed(host, ios->timing) || host->drv_type != ios->drv_type);
+}
+
void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
{
struct sdhci_host *host = mmc_priv(mmc);
+ bool reinit_uhs = host->reinit_uhs;
+ bool turning_on_clk = false;
u8 ctrl;
+ host->reinit_uhs = false;
+
if (ios->power_mode == MMC_POWER_UNDEFINED)
return;
@@ -2323,6 +2359,8 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
sdhci_enable_preset_value(host, false);
if (!ios->clock || ios->clock != host->clock) {
+ turning_on_clk = ios->clock && !host->clock;
+
host->ops->set_clock(host, ios->clock);
host->clock = ios->clock;
@@ -2349,6 +2387,17 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_bus_width(host, ios->bus_width);
+ /*
+ * Special case to avoid multiple clock changes during voltage
+ * switching.
+ */
+ if (!reinit_uhs &&
+ turning_on_clk &&
+ host->timing == ios->timing &&
+ host->version >= SDHCI_SPEC_300 &&
+ !sdhci_presetable_values_change(host, ios))
+ return;
+
ctrl = sdhci_readb(host, SDHCI_HOST_CONTROL);
if (!(host->quirks & SDHCI_QUIRK_NO_HISPD_BIT)) {
@@ -2392,6 +2441,7 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
}
sdhci_writew(host, ctrl_2, SDHCI_HOST_CONTROL2);
+ host->drv_type = ios->drv_type;
} else {
/*
* According to SDHC Spec v3.00, if the Preset Value
@@ -2419,19 +2469,14 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_uhs_signaling(host, ios->timing);
host->timing = ios->timing;
- if (!(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
- ((ios->timing == MMC_TIMING_UHS_SDR12) ||
- (ios->timing == MMC_TIMING_UHS_SDR25) ||
- (ios->timing == MMC_TIMING_UHS_SDR50) ||
- (ios->timing == MMC_TIMING_UHS_SDR104) ||
- (ios->timing == MMC_TIMING_UHS_DDR50) ||
- (ios->timing == MMC_TIMING_MMC_DDR52))) {
+ if (sdhci_preset_needed(host, ios->timing)) {
u16 preset;
sdhci_enable_preset_value(host, true);
preset = sdhci_get_preset_value(host);
ios->drv_type = FIELD_GET(SDHCI_PRESET_DRV_MASK,
preset);
+ host->drv_type = ios->drv_type;
}
/* Re-enable SD Clock */
@@ -3768,6 +3813,7 @@ int sdhci_resume_host(struct sdhci_host *host)
sdhci_init(host, 0);
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
} else {
sdhci_init(host, (mmc->pm_flags & MMC_PM_KEEP_POWER));
@@ -3830,6 +3876,7 @@ int sdhci_runtime_resume_host(struct sdhci_host *host, int soft_reset)
/* Force clock and power re-program */
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->start_signal_voltage_switch(mmc, &mmc->ios);
mmc->ops->set_ios(mmc, &mmc->ios);
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index d750c464bd1e..87a3aaa07438 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -524,6 +524,8 @@ struct sdhci_host {
unsigned int clock; /* Current clock (MHz) */
u8 pwr; /* Current voltage */
+ u8 drv_type; /* Current UHS-I driver type */
+ bool reinit_uhs; /* Force UHS-related re-initialization */
bool runtime_suspended; /* Host is runtime suspended */
bool bus_on; /* Bus power prevents runtime suspend */
v5.11 changes the blkdev lookup mechanism completely since commit
22ae8ce8b892 ("block: simplify bdev/disk lookup in blkdev_get"),
and small part of the change is to unhash part bdev inode when
deleting partition. Turns out this kind of change does fix one
nasty issue in case of BLOCK_EXT_MAJOR:
1) when one partition is deleted & closed, disk_put_part() is always
called before bdput(bdev), see blkdev_put(); so the part's devt can
be freed & re-used before the inode is dropped
2) then new partition with same devt can be created just before the
inode in 1) is dropped, then the old inode/bdev structurein 1) is
re-used for this new partition, this way causes use-after-free and
kernel panic.
It isn't possible to backport the whole fbig patchset of "merge struct
block_device and struct hd_struct v4" for addressing this issue.
https://lore.kernel.org/linux-block/20201128161510.347752-1-hch@lst.de/
So fixes it by unhashing part bdev in delete_partition(), and this way
is actually aligned with v5.11+'s behavior.
Reported-by: cuishw(a)inspur.com
Tested-by: cuishw(a)inspur.com
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Jan Kara <jack(a)suse.cz>
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
---
block/partitions/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/block/partitions/core.c b/block/partitions/core.c
index a02e22411594..e3d61ec4a5a6 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -329,6 +329,7 @@ void delete_partition(struct hd_struct *part)
struct gendisk *disk = part_to_disk(part);
struct disk_part_tbl *ptbl =
rcu_dereference_protected(disk->part_tbl, 1);
+ struct block_device *bdev;
/*
* ->part_tbl is referenced in this part's release handler, so
@@ -346,6 +347,12 @@ void delete_partition(struct hd_struct *part)
* "in-use" until we really free the gendisk.
*/
blk_invalidate_devt(part_devt(part));
+
+ bdev = bdget_part(part);
+ if (bdev) {
+ remove_inode_hash(bdev->bd_inode);
+ bdput(bdev);
+ }
percpu_ref_kill(&part->ref);
}
--
2.37.3
This bug is marked as fixed by commit:
net: core: netlink: add helper refcount dec and lock function
net: sched: add helper function to take reference to Qdisc
net: sched: extend Qdisc with rcu
net: sched: rename qdisc_destroy() to qdisc_put()
net: sched: use Qdisc rcu API instead of relying on rtnl lock
But I can't find it in any tested tree for more than 90 days.
Is it a correct commit? Please update it by replying:
#syz fix: exact-commit-title
Until then the bug is still considered open and
new crashes with the same signature are ignored.
FOND MONETAR INTERNAȚIONAL (HQ1)
700 19th Street, N.W., Washington, D.C. 20431.
Fondul nerevendicat al loteriei Mo 755.000,00 USD
detectat de proprietarul de e-mail al fondului.
REF:-XVGN 82022.
Numărul câștigător; 5-6-14-29-35
Stimate proprietar de e-mail al fondului,
Ți-am trimis această scrisoare, acum o lună, dar nu am auzit de la
tine, nu sunt sigur dacă ai primit-o și de aceea o trimit din nou.
Acesta este pentru a te informa despre informații foarte importante
care vor fi de mare ajutor pentru a vă răscumpăra de toate
dificultățile pe care le-ați avut în obținerea plății restante din
cauza cererii excesive de bani de la dumneavoastră atât de către
funcționari corupți funcționari ai băncii, cât și de către companiile
de curierat după care fondul dumneavoastră rămâne neachitat.
Sunt doamna Kristalina Georgieva, un înalt funcționar la Fondul
Monetar Internațional (FMI). Te-ar putea interesa să știi că prin
atâtea corespondențe au ajuns la biroul nostru rapoarte despre felul
tulburător în care oamenii ca tine sunt tratați de către Diverse
Bănci. După investigația noastră, am aflat că adresa dvs. de e-mail a
fost unul dintre norocoșii câștigători ai selecției de loterie Mo în
anul 2020, dar din cauza unor bancheri corupți, aceștia încearcă să vă
redirecționeze fondurile în contul lor privat.
Toate prostatale guvernamentale și neguvernamentale, ONG-urile,
companiile financiare, băncile, companiile de securitate și companiile
de curierat care au fost în contact cu dvs. în ultimul timp au primit
instrucțiuni să vă retragă tranzacția și vi s-a sfătuit să NU
răspundeți de atunci. Fondul Monetar Internațional (FMI) este acum
direct responsabil pentru plata dvs., pe care o puteți primi în contul
dvs. bancar cu ajutorul Băncii Europene de Investiții.
Numele dvs. a apărut pe lista noastră de program de plăți a
beneficiarilor care își vor primi fondurile Loteriei Mo în această
plată din primul trimestru al anului 2022, deoarece transferăm fonduri
doar de două ori pe an, conform reglementărilor noastre bancare. Ne
cerem scuze pentru întârzierea plății și vă rugăm să nu mai comunicați
cu orice birou acum și să acordați atenție biroului nostru în
consecință.
Acum noua dvs. plată, Nr. de aprobare a Națiunilor Unite; UN 5685 P,
Nr. aprobat la Casa Alba: WH44CV, Nr. referinta 35460021, Nr. alocare:
674632 Nr. parola: 339331, nr. Cod PIN: 55674 și certificatul
dumneavoastră de plată de merit: Nr.: 103, Nr. cod emis: 0763;
Confirmare imediată (IMF) Telex Nr: -1114433; Cod secret nr: XXTN013.
Fondul dvs. de moștenire pentru plată parțială este de 755.000 USD
după ce ați primit aceste plăți vitale. numere câștigătoare;
5-6-14-29-35, data câștigării; 05 octombrie 2020, sunteți acum
calificat să primiți și să confirmați plata cu Regiunea Africană a
Fondului Monetar Internațional (FMI) imediat în următoarele 168 de
ore. Vă asigurăm că plata vă va ajunge atâta timp cât urmați
directivele și instrucțiunile mele. Am decis să vă oferim un COD,
CODUL ESTE: 601. Vă rugăm, de fiecare dată când primiți un mail cu
numele doamnei Kristalina Georgieva, verificați dacă există COD (601)
dacă codul nu este scris, vă rugăm să ștergeți mesaj din inbox-ul tău!
Sunteți sfătuit prin prezenta să NU mai trimiteți nicio plată către
nicio instituție cu privire la tranzacția dvs., deoarece fondul
dumneavoastră vă va fi transferat direct din sursa noastră. Singura
plată necesară la bancă este doar taxa pentru certificatul de
lichidare FMI, astfel încât banca va putea elibera întreaga sumă de
bani în contul dvs. bancar.
Sper că acest lucru este clar. Orice acțiune contrară acestei
instrucțiuni este pe propriul risc. REȚINEȚI CĂ TREBUIE PLĂȚIȚI
COMISIONALĂ DE LIMITARE A CERTIFICATULUI FMI CĂTRE BANCĂ ȘI FĂRĂ TAXĂ
SUPLIMENTARĂ.
Am depus deja fondul dvs. total la banca plătitoare corespunzătoare
Banca Europeană de Investiții, vreau să contactați Approval pentru a
primi fondul dvs. total în stare bună, fără alte întârzieri sau taxe
suplimentare.
Răspundeți la acest e-mail bancar:
(europeaninvestmentbankeib6(a)gmail.com)
Și fondul tău va fi eliberat în contul tău bancar odată ce vei contacta banca.
Contactați numele managerului băncii; Dr. Wilson Taylor
E-mail bancar:
(europeaninvestmentbankeib6(a)gmail.com)
Cu respect
Salutari,
Doamna Kristalina Georgieva, (I.M.F)(601)
Additionally, remove it from .ndo_stop().
This ensures that the worker is not called after being freed, and that
the UART TX queue remains active to send final commands when the netdev
is stopped.
Thanks to Jiri Slaby for finding this in slcan:
https://lore.kernel.org/linux-can/20221201073426.17328-1-jirislaby@kernel.o…
A variant of this patch for slcan, with the flush in .ndo_stop() still
present, has been tested successfully on physical hardware:
https://bugzilla.suse.com/show_bug.cgi?id=1205597
Fixes: 43da2f07622f ("can: can327: CAN/ldisc driver for ELM327 based OBD-II adapters")
Cc: "Jiri Slaby (SUSE)" <jirislaby(a)kernel.org>
Cc: Max Staudt <max(a)enpas.org>
Cc: Wolfgang Grandegger <wg(a)grandegger.com>
Cc: Marc Kleine-Budde <mkl(a)pengutronix.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: linux-can(a)vger.kernel.org
Cc: netdev(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Signed-off-by: Max Staudt <max(a)enpas.org>
---
drivers/net/can/can327.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/net/can/can327.c b/drivers/net/can/can327.c
index ed3d0b8989a0..dc7192ecb001 100644
--- a/drivers/net/can/can327.c
+++ b/drivers/net/can/can327.c
@@ -796,9 +796,9 @@ static int can327_netdev_close(struct net_device *dev)
netif_stop_queue(dev);
- /* Give UART one final chance to flush. */
- clear_bit(TTY_DO_WRITE_WAKEUP, &elm->tty->flags);
- flush_work(&elm->tx_work);
+ /* We don't flush the UART TX queue here, as we want final stop
+ * commands (like the above dummy char) to be flushed out.
+ */
can_rx_offload_disable(&elm->offload);
elm->can.state = CAN_STATE_STOPPED;
@@ -1069,12 +1069,15 @@ static void can327_ldisc_close(struct tty_struct *tty)
{
struct can327 *elm = (struct can327 *)tty->disc_data;
- /* unregister_netdev() calls .ndo_stop() so we don't have to.
- * Our .ndo_stop() also flushes the TTY write wakeup handler,
- * so we can safely set elm->tty = NULL after this.
- */
+ /* unregister_netdev() calls .ndo_stop() so we don't have to. */
unregister_candev(elm->dev);
+ /* Give UART one final chance to flush.
+ * No need to clear TTY_DO_WRITE_WAKEUP since .write_wakeup() is
+ * serialised against .close() and will not be called once we return.
+ */
+ flush_work(&elm->tx_work);
+
/* Mark channel as dead */
spin_lock_bh(&elm->lock);
tty->disc_data = NULL;
--
2.30.2
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
89d21e259a94 ("powerpc/bpf/32: Fix Oops on tail call tests")
49c3af43e65f ("powerpc/bpf: Simplify bpf_to_ppc() and adopt it for powerpc64")
3a3fc9bf1039 ("powerpc64/bpf: Store temp registers' bpf to ppc mapping")
036d559c0bde ("powerpc/bpf: Use _Rn macros for GPRs")
576a6c3a00c1 ("powerpc/bpf: Move bpf_jit64.h into bpf_jit_comp64.c")
7b187dcdb5d3 ("powerpc/bpf: Cleanup bpf_jit.h")
794abc08d75e ("powerpc64/bpf: Get rid of PPC_BPF_[LL|STL|STLU] macros")
391c271f4deb ("powerpc64/bpf: Convert some of the uses of PPC_BPF_[LL|STL] to PPC_BPF_[LD|STD]")
feb6307289d8 ("powerpc64/bpf: Optimize instruction sequence used for function calls")
43d636f8b4fd ("powerpc64/bpf elfv1: Do not load TOC before calling functions")
b10cb163c4b3 ("powerpc64/bpf elfv2: Setup kernel TOC in r2 on entry")
1d4866d5652f ("powerpc64/bpf: Use r12 for constant blinding")
c2067f7f8883 ("powerpc64/bpf: Do not save/restore LR on each call to bpf_stf_barrier()")
0ffdbce6f4a8 ("powerpc/bpf: Handle large branch ranges with BPF_EXIT")
bafb5898de5d ("powerpc/bpf: Emit a single branch instruction for known short branch ranges")
a8936569a07b ("powerpc/bpf: Always reallocate BPF_REG_5, BPF_REG_AX and TMP_REG when possible")
3f5f766d5f7f ("powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06")
f9320c49993c ("powerpc/bpf: Update ldimm64 instructions during extra pass")
29ec39fcf11e ("Merge tag 'powerpc-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89d21e259a94f7d5582ec675aa445f5a79f347e4 Mon Sep 17 00:00:00 2001
From: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Date: Thu, 24 Nov 2022 09:37:27 +0100
Subject: [PATCH] powerpc/bpf/32: Fix Oops on tail call tests
test_bpf tail call tests end up as:
test_bpf: #0 Tail call leaf jited:1 85 PASS
test_bpf: #1 Tail call 2 jited:1 111 PASS
test_bpf: #2 Tail call 3 jited:1 145 PASS
test_bpf: #3 Tail call 4 jited:1 170 PASS
test_bpf: #4 Tail call load/store leaf jited:1 190 PASS
test_bpf: #5 Tail call load/store jited:1
BUG: Unable to handle kernel data access on write at 0xf1b4e000
Faulting instruction address: 0xbe86b710
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K MMU=Hash PowerMac
Modules linked in: test_bpf(+)
CPU: 0 PID: 97 Comm: insmod Not tainted 6.1.0-rc4+ #195
Hardware name: PowerMac3,1 750CL 0x87210 PowerMac
NIP: be86b710 LR: be857e88 CTR: be86b704
REGS: f1b4df20 TRAP: 0300 Not tainted (6.1.0-rc4+)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 28008242 XER: 00000000
DAR: f1b4e000 DSISR: 42000000
GPR00: 00000001 f1b4dfe0 c11d2280 00000000 00000000 00000000 00000002 00000000
GPR08: f1b4e000 be86b704 f1b4e000 00000000 00000000 100d816a f2440000 fe73baa8
GPR16: f2458000 00000000 c1941ae4 f1fe2248 00000045 c0de0000 f2458030 00000000
GPR24: 000003e8 0000000f f2458000 f1b4dc90 3e584b46 00000000 f24466a0 c1941a00
NIP [be86b710] 0xbe86b710
LR [be857e88] __run_one+0xec/0x264 [test_bpf]
Call Trace:
[f1b4dfe0] [00000002] 0x2 (unreliable)
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 0000000000000000 ]---
This is a tentative to write above the stack. The problem is encoutered
with tests added by commit 38608ee7b690 ("bpf, tests: Add load store
test case for tail call")
This happens because tail call is done to a BPF prog with a different
stack_depth. At the time being, the stack is kept as is when the caller
tail calls its callee. But at exit, the callee restores the stack based
on its own properties. Therefore here, at each run, r1 is erroneously
increased by 32 - 16 = 16 bytes.
This was done that way in order to pass the tail call count from caller
to callee through the stack. As powerpc32 doesn't have a red zone in
the stack, it was necessary the maintain the stack as is for the tail
call. But it was not anticipated that the BPF frame size could be
different.
Let's take a new approach. Use register r4 to carry the tail call count
during the tail call, and save it into the stack at function entry if
required. This means the input parameter must be in r3, which is more
correct as it is a 32 bits parameter, then tail call better match with
normal BPF function entry, the down side being that we move that input
parameter back and forth between r3 and r4. That can be optimised later.
Doing that also has the advantage of maximising the common parts between
tail calls and a normal function exit.
With the fix, tail call tests are now successfull:
test_bpf: #0 Tail call leaf jited:1 53 PASS
test_bpf: #1 Tail call 2 jited:1 115 PASS
test_bpf: #2 Tail call 3 jited:1 154 PASS
test_bpf: #3 Tail call 4 jited:1 165 PASS
test_bpf: #4 Tail call load/store leaf jited:1 101 PASS
test_bpf: #5 Tail call load/store jited:1 141 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 994 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 140975 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 110 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 69 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]
Suggested-by: Naveen N. Rao <naveen.n.rao(a)linux.vnet.ibm.com>
Fixes: 51c66ad849a7 ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: stable(a)vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Tested-by: Naveen N. Rao <naveen.n.rao(a)linux.vnet.ibm.com
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/757acccb7fbfc78efa42dcf3c974b46678198905.16692788…
diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
index 43f1c76d48ce..a379b0ce19ff 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -113,23 +113,19 @@ void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx)
{
int i;
- /* First arg comes in as a 32 bits pointer. */
- EMIT(PPC_RAW_MR(bpf_to_ppc(BPF_REG_1), _R3));
- EMIT(PPC_RAW_LI(bpf_to_ppc(BPF_REG_1) - 1, 0));
+ /* Initialize tail_call_cnt, to be skipped if we do tail calls. */
+ EMIT(PPC_RAW_LI(_R4, 0));
+
+#define BPF_TAILCALL_PROLOGUE_SIZE 4
+
EMIT(PPC_RAW_STWU(_R1, _R1, -BPF_PPC_STACKFRAME(ctx)));
- /*
- * Initialize tail_call_cnt in stack frame if we do tail calls.
- * Otherwise, put in NOPs so that it can be skipped when we are
- * invoked through a tail call.
- */
if (ctx->seen & SEEN_TAILCALL)
- EMIT(PPC_RAW_STW(bpf_to_ppc(BPF_REG_1) - 1, _R1,
- bpf_jit_stack_offsetof(ctx, BPF_PPC_TC)));
- else
- EMIT(PPC_RAW_NOP());
+ EMIT(PPC_RAW_STW(_R4, _R1, bpf_jit_stack_offsetof(ctx, BPF_PPC_TC)));
-#define BPF_TAILCALL_PROLOGUE_SIZE 16
+ /* First arg comes in as a 32 bits pointer. */
+ EMIT(PPC_RAW_MR(bpf_to_ppc(BPF_REG_1), _R3));
+ EMIT(PPC_RAW_LI(bpf_to_ppc(BPF_REG_1) - 1, 0));
/*
* We need a stack frame, but we don't necessarily need to
@@ -170,24 +166,24 @@ static void bpf_jit_emit_common_epilogue(u32 *image, struct codegen_context *ctx
for (i = BPF_PPC_NVR_MIN; i <= 31; i++)
if (bpf_is_seen_register(ctx, i))
EMIT(PPC_RAW_LWZ(i, _R1, bpf_jit_stack_offsetof(ctx, i)));
-}
-
-void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
-{
- EMIT(PPC_RAW_MR(_R3, bpf_to_ppc(BPF_REG_0)));
-
- bpf_jit_emit_common_epilogue(image, ctx);
-
- /* Tear down our stack frame */
if (ctx->seen & SEEN_FUNC)
EMIT(PPC_RAW_LWZ(_R0, _R1, BPF_PPC_STACKFRAME(ctx) + PPC_LR_STKOFF));
+ /* Tear down our stack frame */
EMIT(PPC_RAW_ADDI(_R1, _R1, BPF_PPC_STACKFRAME(ctx)));
if (ctx->seen & SEEN_FUNC)
EMIT(PPC_RAW_MTLR(_R0));
+}
+
+void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
+{
+ EMIT(PPC_RAW_MR(_R3, bpf_to_ppc(BPF_REG_0)));
+
+ bpf_jit_emit_common_epilogue(image, ctx);
+
EMIT(PPC_RAW_BLR());
}
@@ -244,7 +240,6 @@ static int bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32 o
EMIT(PPC_RAW_RLWINM(_R3, b2p_index, 2, 0, 29));
EMIT(PPC_RAW_ADD(_R3, _R3, b2p_bpf_array));
EMIT(PPC_RAW_LWZ(_R3, _R3, offsetof(struct bpf_array, ptrs)));
- EMIT(PPC_RAW_STW(_R0, _R1, bpf_jit_stack_offsetof(ctx, BPF_PPC_TC)));
/*
* if (prog == NULL)
@@ -255,19 +250,14 @@ static int bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32 o
/* goto *(prog->bpf_func + prologue_size); */
EMIT(PPC_RAW_LWZ(_R3, _R3, offsetof(struct bpf_prog, bpf_func)));
-
- if (ctx->seen & SEEN_FUNC)
- EMIT(PPC_RAW_LWZ(_R0, _R1, BPF_PPC_STACKFRAME(ctx) + PPC_LR_STKOFF));
-
EMIT(PPC_RAW_ADDIC(_R3, _R3, BPF_TAILCALL_PROLOGUE_SIZE));
-
- if (ctx->seen & SEEN_FUNC)
- EMIT(PPC_RAW_MTLR(_R0));
-
EMIT(PPC_RAW_MTCTR(_R3));
EMIT(PPC_RAW_MR(_R3, bpf_to_ppc(BPF_REG_1)));
+ /* Put tail_call_cnt in r4 */
+ EMIT(PPC_RAW_MR(_R4, _R0));
+
/* tear restore NVRs, ... */
bpf_jit_emit_common_epilogue(image, ctx);
> On 3 Dec 2022, at 09:26, Sasha Levin <sashal(a)kernel.org> wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
>
> to the 5.15-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> kbuild-fix-wimplicit-function-declaration-in-license.patch
> and it can be found in the queue-5.15 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 3f07ef5743bd434f3689f9ceed7a24128dc6b5ae
> Author: Sam James <sam(a)gentoo.org>
> Date: Wed Nov 16 18:26:34 2022 +0000
>
> kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
>
> [ Upstream commit 50c697215a8cc22f0e58c88f06f2716c05a26e85 ]
Hi Sasha,
Please yank this commit as it's been reverted upstream, it doesn't work
for cross as-is.
Best,
sam
The patch titled
Subject: tmpfs: fix data loss from failed fallocate
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
tmpfs-fix-data-loss-from-failed-fallocate.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: tmpfs: fix data loss from failed fallocate
Date: Sun, 4 Dec 2022 16:51:50 -0800 (PST)
Fix tmpfs data loss when the fallocate system call is interrupted by a
signal, or fails for some other reason. The partial folio handling in
shmem_undo_range() forgot to consider this unfalloc case, and was liable
to erase or truncate out data which had already been committed earlier.
It turns out that none of the partial folio handling there is appropriate
for the unfalloc case, which just wants to proceed to removal of whole
folios: which find_get_entries() provides, even when partially covered.
Original patch by Rui Wang.
Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguo…
Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com
Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reported-by: Guoqi Chen <chenguoqic(a)163.com>
Link: https://lore.kernel.org/all/20221101032248.819360-1-kernel@hev.cc/
Cc: Rui Wang <kernel(a)hev.cc>
Cc: Huacai Chen <chenhuacai(a)loongson.cn>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 11 +++++++++++
1 file changed, 11 insertions(+)
--- a/mm/shmem.c~tmpfs-fix-data-loss-from-failed-fallocate
+++ a/mm/shmem.c
@@ -948,6 +948,15 @@ static void shmem_undo_range(struct inod
index++;
}
+ /*
+ * When undoing a failed fallocate, we want none of the partial folio
+ * zeroing and splitting below, but shall want to truncate the whole
+ * folio when !uptodate indicates that it was added by this fallocate,
+ * even when [lstart, lend] covers only a part of the folio.
+ */
+ if (unfalloc)
+ goto whole_folios;
+
same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
folio = shmem_get_partial_folio(inode, lstart >> PAGE_SHIFT);
if (folio) {
@@ -973,6 +982,8 @@ static void shmem_undo_range(struct inod
folio_put(folio);
}
+whole_folios:
+
index = start;
while (index < end) {
cond_resched();
_
Patches currently in -mm which might be from hughd(a)google.com are
tmpfs-fix-data-loss-from-failed-fallocate.patch
Here's a backported version of upstream commit ID
4dbd6a3e90e03130973688fd79e19425f720d999 that will work with
5.4, 4.19, 4.14, and 4.9.
Michael
-------------------------------------------------------------------------------------------
6f2e8eba629d29ffae0fc71c0cfd6b694ac4a5ec Mon Sep 17 00:00:00 2001
From: Michael Kelley <mikelley(a)microsoft.com>
Date: Sun, 4 Dec 2022 13:52:01 -0800
Subject: [PATCH v4 1/1] x86/ioremap: Fix page aligned size calculation in
__ioremap_caller()
Current code re-calculates the size after aligning the starting and
ending physical addresses on a page boundary. But the re-calculation
also embeds the masking of high order bits that exceed the size of
the physical address space (via PHYSICAL_PAGE_MASK). If the masking
removes any high order bits, the size calculation results in a huge
value that is likely to immediately fail.
Fix this by re-calculating the page-aligned size first. Then mask any
high order bits using PHYSICAL_PAGE_MASK.
Fixes: ffa71f33a820 ("x86, ioremap: Fix incorrect physical address handling in PAE mode")
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Michael Kelley <mikelley(a)microsoft.com>
---
arch/x86/mm/ioremap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index ecae9ac..696fd6f 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -126,9 +126,15 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
* Mappings have to be page-aligned
*/
offset = phys_addr & ~PAGE_MASK;
- phys_addr &= PHYSICAL_PAGE_MASK;
+ phys_addr &= PAGE_MASK;
size = PAGE_ALIGN(last_addr+1) - phys_addr;
+ /*
+ * Mask out any bits not part of the actual physical
+ * address, like memory encryption bits.
+ */
+ phys_addr &= PHYSICAL_PAGE_MASK;
+
retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
pcm, &new_pcm);
if (retval) {
--
1.8.3.1
Hello,
commit 711f8c3fb3db ("Bluetooth: L2CAP: Fix accepting connection request for
invalid SPSM") did not apply to 5.4-stable tree previously.
One of the notable dependencies is commit 15f02b910562 ("Bluetooth: L2CAP:
Add initial code for Enhanced Credit Based Mode") and that doesn't apply to
5.4-stable either due to a mismatch on `l2cap_sock_setsockopt_old` in
l2cap_sock.c.
After doing a comparison between upstream and older revisions, I merged the
changes and backported 15f02b910562 to 5.4-stable.
During compilation, I discovered another dependency commit 145720963b6c
("Bluetooth: L2CAP: Add definitions for Enhanced Credit Based Mode") and
added that to patchset.
All those combined will hopefully allow us to have the fix for CVE-2022-42896
in 5.4-stable.
Thank you.
Luiz Augusto von Dentz (3):
Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode
Bluetooth: L2CAP: Add definitions for Enhanced Credit Based Mode
Bluetooth: L2CAP: Fix accepting connection request for invalid SPSM
include/net/bluetooth/l2cap.h | 43 +++
net/bluetooth/l2cap_core.c | 570 +++++++++++++++++++++++++++++++++-
net/bluetooth/l2cap_sock.c | 24 +-
3 files changed, 617 insertions(+), 20 deletions(-)
--
2.37.2
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
48d4180939e1 ("ACPI: HMAT: Fix initiator registration for single-initiator systems")
14f16d47561b ("ACPI: HMAT: remove unnecessary variable initialization")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 48d4180939e12c4bd2846f984436d895bb9699ed Mon Sep 17 00:00:00 2001
From: Vishal Verma <vishal.l.verma(a)intel.com>
Date: Wed, 16 Nov 2022 16:37:37 -0700
Subject: [PATCH] ACPI: HMAT: Fix initiator registration for single-initiator
systems
In a system with a single initiator node, and one or more memory-only
'target' nodes, the memory-only node(s) would fail to register their
initiator node correctly. i.e. in sysfs:
# ls /sys/devices/system/node/node0/access0/targets/
node0
Where as the correct behavior should be:
# ls /sys/devices/system/node/node0/access0/targets/
node0 node1
This happened because hmat_register_target_initiators() uses list_sort()
to sort the initiator list, but the sort comparision function
(initiator_cmp()) is overloaded to also set the node mask's bits.
In a system with a single initiator, the list is singular, and list_sort
elides the comparision helper call. Thus the node mask never gets set,
and the subsequent search for the best initiator comes up empty.
Add a new helper to consume the sorted initiator list, and generate the
nodemask, decoupling it from the overloaded initiator_cmp() comparision
callback. This prevents the singular list corner case naturally, and
makes the code easier to follow as well.
Cc: <stable(a)vger.kernel.org>
Cc: Rafael J. Wysocki <rafael(a)kernel.org>
Cc: Liu Shixin <liushixin2(a)huawei.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Chris Piper <chris.d.piper(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Link: https://lore.kernel.org/r/20221116-acpi_hmat_fix-v2-2-3712569be691@intel.com
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 144a84f429ed..6cceca64a6bc 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -562,17 +562,26 @@ static int initiator_cmp(void *priv, const struct list_head *a,
{
struct memory_initiator *ia;
struct memory_initiator *ib;
- unsigned long *p_nodes = priv;
ia = list_entry(a, struct memory_initiator, node);
ib = list_entry(b, struct memory_initiator, node);
- set_bit(ia->processor_pxm, p_nodes);
- set_bit(ib->processor_pxm, p_nodes);
-
return ia->processor_pxm - ib->processor_pxm;
}
+static int initiators_to_nodemask(unsigned long *p_nodes)
+{
+ struct memory_initiator *initiator;
+
+ if (list_empty(&initiators))
+ return -ENXIO;
+
+ list_for_each_entry(initiator, &initiators, node)
+ set_bit(initiator->processor_pxm, p_nodes);
+
+ return 0;
+}
+
static void hmat_register_target_initiators(struct memory_target *target)
{
static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
@@ -609,7 +618,10 @@ static void hmat_register_target_initiators(struct memory_target *target)
* initiators.
*/
bitmap_zero(p_nodes, MAX_NUMNODES);
- list_sort(p_nodes, &initiators, initiator_cmp);
+ list_sort(NULL, &initiators, initiator_cmp);
+ if (initiators_to_nodemask(p_nodes) < 0)
+ return;
+
if (!access0done) {
for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
loc = localities_types[i];
@@ -643,7 +655,9 @@ static void hmat_register_target_initiators(struct memory_target *target)
/* Access 1 ignores Generic Initiators */
bitmap_zero(p_nodes, MAX_NUMNODES);
- list_sort(p_nodes, &initiators, initiator_cmp);
+ if (initiators_to_nodemask(p_nodes) < 0)
+ return;
+
for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
loc = localities_types[i];
if (!loc)
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
48d4180939e1 ("ACPI: HMAT: Fix initiator registration for single-initiator systems")
14f16d47561b ("ACPI: HMAT: remove unnecessary variable initialization")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 48d4180939e12c4bd2846f984436d895bb9699ed Mon Sep 17 00:00:00 2001
From: Vishal Verma <vishal.l.verma(a)intel.com>
Date: Wed, 16 Nov 2022 16:37:37 -0700
Subject: [PATCH] ACPI: HMAT: Fix initiator registration for single-initiator
systems
In a system with a single initiator node, and one or more memory-only
'target' nodes, the memory-only node(s) would fail to register their
initiator node correctly. i.e. in sysfs:
# ls /sys/devices/system/node/node0/access0/targets/
node0
Where as the correct behavior should be:
# ls /sys/devices/system/node/node0/access0/targets/
node0 node1
This happened because hmat_register_target_initiators() uses list_sort()
to sort the initiator list, but the sort comparision function
(initiator_cmp()) is overloaded to also set the node mask's bits.
In a system with a single initiator, the list is singular, and list_sort
elides the comparision helper call. Thus the node mask never gets set,
and the subsequent search for the best initiator comes up empty.
Add a new helper to consume the sorted initiator list, and generate the
nodemask, decoupling it from the overloaded initiator_cmp() comparision
callback. This prevents the singular list corner case naturally, and
makes the code easier to follow as well.
Cc: <stable(a)vger.kernel.org>
Cc: Rafael J. Wysocki <rafael(a)kernel.org>
Cc: Liu Shixin <liushixin2(a)huawei.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reported-by: Chris Piper <chris.d.piper(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Link: https://lore.kernel.org/r/20221116-acpi_hmat_fix-v2-2-3712569be691@intel.com
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 144a84f429ed..6cceca64a6bc 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -562,17 +562,26 @@ static int initiator_cmp(void *priv, const struct list_head *a,
{
struct memory_initiator *ia;
struct memory_initiator *ib;
- unsigned long *p_nodes = priv;
ia = list_entry(a, struct memory_initiator, node);
ib = list_entry(b, struct memory_initiator, node);
- set_bit(ia->processor_pxm, p_nodes);
- set_bit(ib->processor_pxm, p_nodes);
-
return ia->processor_pxm - ib->processor_pxm;
}
+static int initiators_to_nodemask(unsigned long *p_nodes)
+{
+ struct memory_initiator *initiator;
+
+ if (list_empty(&initiators))
+ return -ENXIO;
+
+ list_for_each_entry(initiator, &initiators, node)
+ set_bit(initiator->processor_pxm, p_nodes);
+
+ return 0;
+}
+
static void hmat_register_target_initiators(struct memory_target *target)
{
static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
@@ -609,7 +618,10 @@ static void hmat_register_target_initiators(struct memory_target *target)
* initiators.
*/
bitmap_zero(p_nodes, MAX_NUMNODES);
- list_sort(p_nodes, &initiators, initiator_cmp);
+ list_sort(NULL, &initiators, initiator_cmp);
+ if (initiators_to_nodemask(p_nodes) < 0)
+ return;
+
if (!access0done) {
for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
loc = localities_types[i];
@@ -643,7 +655,9 @@ static void hmat_register_target_initiators(struct memory_target *target)
/* Access 1 ignores Generic Initiators */
bitmap_zero(p_nodes, MAX_NUMNODES);
- list_sort(p_nodes, &initiators, initiator_cmp);
+ if (initiators_to_nodemask(p_nodes) < 0)
+ return;
+
for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
loc = localities_types[i];
if (!loc)
Portions of upstream commit a4055888629b ("mm/memcg: warning on !memcg
after readahead page charged") were backported as commit cfe575954ddd
("mm: add VM_WARN_ON_ONCE_PAGE() macro"). Unfortunately, the backport
did not account for the lack of commit 33def8498fdd ("treewide: Convert
macro and uses of __section(foo) to __section("foo")") in kernels prior
to 5.10, resulting in the following orphan section warnings on PowerPC
clang builds with CONFIG_DEBUG_VM=y:
powerpc64le-linux-gnu-ld: warning: orphan section `".data.once"' from `mm/huge_memory.o' being placed in section `".data.once"'
powerpc64le-linux-gnu-ld: warning: orphan section `".data.once"' from `mm/huge_memory.o' being placed in section `".data.once"'
powerpc64le-linux-gnu-ld: warning: orphan section `".data.once"' from `mm/huge_memory.o' being placed in section `".data.once"'
This is a difference between how clang and gcc handle macro
stringification, which was resolved for the kernel by not stringifying
the argument to the __section() macro. Since that change was deemed not
suitable for the stable kernels by commit 59f89518f510 ("once: fix
section mismatch on clang builds"), do that same thing as that change
and remove the quotes from the argument to __section().
Fixes: cfe575954ddd ("mm: add VM_WARN_ON_ONCE_PAGE() macro")
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
---
As far as I can tell, this should be applied to 5.4 and earlier. It
should apply cleanly but let me know if not.
include/linux/mmdebug.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 5d0767cb424a..4ed52879ce55 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -38,7 +38,7 @@ void dump_mm(const struct mm_struct *mm);
} \
} while (0)
#define VM_WARN_ON_ONCE_PAGE(cond, page) ({ \
- static bool __section(".data.once") __warned; \
+ static bool __section(.data.once) __warned; \
int __ret_warn_once = !!(cond); \
\
if (unlikely(__ret_warn_once && !__warned)) { \
base-commit: 4d2a309b5c28a2edc0900542d22fec3a5a22243b
--
2.38.1
From: Keith Busch <kbusch(a)kernel.org>
commit 23e085b2dead13b51fe86d27069895b740f749c0 upstream.
The passthrough commands already have this restriction, but the other
operations do not. Require the same capabilities for all users as all of
these operations, which include resets and rescans, can be disruptive.
Signed-off-by: Keith Busch <kbusch(a)kernel.org>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Ovidiu Panait <ovidiu.panait(a)windriver.com>
---
These backports are for CVE-2022-3169.
drivers/nvme/host/core.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6627fb531f33..3b5e5fb158be 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3027,11 +3027,17 @@ static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
case NVME_IOCTL_IO_CMD:
return nvme_dev_user_cmd(ctrl, argp);
case NVME_IOCTL_RESET:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
dev_warn(ctrl->device, "resetting controller\n");
return nvme_reset_ctrl_sync(ctrl);
case NVME_IOCTL_SUBSYS_RESET:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
return nvme_reset_subsystem(ctrl);
case NVME_IOCTL_RESCAN:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
nvme_queue_scan(ctrl);
return 0;
default:
--
2.38.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark")
7e9fbbb1b776 ("ring-buffer: Add ring_buffer_wake_waiters()")
13292494379f ("tracing: Make struct ring_buffer less ambiguous")
1c5eb4481e01 ("tracing: Rename trace_buffer to array_buffer")
2d6425af6116 ("tracing: Declare newly exported APIs in include/linux/trace.h")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 42fb0a1e84ff525ebe560e2baf9451ab69127e2b Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Thu, 20 Oct 2022 23:14:27 -0400
Subject: [PATCH] tracing/ring-buffer: Have polling block on watermark
Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.
That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.
Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.
Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel(a)vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Primiano Tucci <primiano(a)google.com>
Cc: stable(a)vger.kernel.org
Fixes: 1e0d6714aceb7 ("ring-buffer: Do not wake up a splice waiter when page is not full")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 2504df9a0453..3c7d295746f6 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,7 +100,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
__poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
- struct file *filp, poll_table *poll_table);
+ struct file *filp, poll_table *poll_table, int full);
void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
#define RING_BUFFER_ALL_CPUS -1
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9712083832f4..089b1ec9cb3b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -907,6 +907,21 @@ size_t ring_buffer_nr_dirty_pages(struct trace_buffer *buffer, int cpu)
return cnt - read;
}
+static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int full)
+{
+ struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+ size_t nr_pages;
+ size_t dirty;
+
+ nr_pages = cpu_buffer->nr_pages;
+ if (!nr_pages || !full)
+ return true;
+
+ dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
+
+ return (dirty * 100) > (full * nr_pages);
+}
+
/*
* rb_wake_up_waiters - wake up tasks waiting for ring buffer input
*
@@ -1046,22 +1061,20 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
!ring_buffer_empty_cpu(buffer, cpu)) {
unsigned long flags;
bool pagebusy;
- size_t nr_pages;
- size_t dirty;
+ bool done;
if (!full)
break;
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
- nr_pages = cpu_buffer->nr_pages;
- dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
+ done = !pagebusy && full_hit(buffer, cpu, full);
+
if (!cpu_buffer->shortest_full ||
cpu_buffer->shortest_full > full)
cpu_buffer->shortest_full = full;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
- if (!pagebusy &&
- (!nr_pages || (dirty * 100) > full * nr_pages))
+ if (done)
break;
}
@@ -1087,6 +1100,7 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
* @cpu: the cpu buffer to wait on
* @filp: the file descriptor
* @poll_table: The poll descriptor
+ * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS
*
* If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
* as data is added to any of the @buffer's cpu buffers. Otherwise
@@ -1096,14 +1110,15 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
* zero otherwise.
*/
__poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
- struct file *filp, poll_table *poll_table)
+ struct file *filp, poll_table *poll_table, int full)
{
struct ring_buffer_per_cpu *cpu_buffer;
struct rb_irq_work *work;
- if (cpu == RING_BUFFER_ALL_CPUS)
+ if (cpu == RING_BUFFER_ALL_CPUS) {
work = &buffer->irq_work;
- else {
+ full = 0;
+ } else {
if (!cpumask_test_cpu(cpu, buffer->cpumask))
return -EINVAL;
@@ -1111,8 +1126,14 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
work = &cpu_buffer->irq_work;
}
- poll_wait(filp, &work->waiters, poll_table);
- work->waiters_pending = true;
+ if (full) {
+ poll_wait(filp, &work->full_waiters, poll_table);
+ work->full_waiters_pending = true;
+ } else {
+ poll_wait(filp, &work->waiters, poll_table);
+ work->waiters_pending = true;
+ }
+
/*
* There's a tight race between setting the waiters_pending and
* checking if the ring buffer is empty. Once the waiters_pending bit
@@ -1128,6 +1149,9 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
*/
smp_mb();
+ if (full)
+ return full_hit(buffer, cpu, full) ? EPOLLIN | EPOLLRDNORM : 0;
+
if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
(cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu)))
return EPOLLIN | EPOLLRDNORM;
@@ -3155,10 +3179,6 @@ static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
static __always_inline void
rb_wakeups(struct trace_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer)
{
- size_t nr_pages;
- size_t dirty;
- size_t full;
-
if (buffer->irq_work.waiters_pending) {
buffer->irq_work.waiters_pending = false;
/* irq_work_queue() supplies it's own memory barriers */
@@ -3182,10 +3202,7 @@ rb_wakeups(struct trace_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->last_pages_touch = local_read(&cpu_buffer->pages_touched);
- full = cpu_buffer->shortest_full;
- nr_pages = cpu_buffer->nr_pages;
- dirty = ring_buffer_nr_dirty_pages(buffer, cpu_buffer->cpu);
- if (full && nr_pages && (dirty * 100) <= full * nr_pages)
+ if (!full_hit(buffer, cpu_buffer->cpu, cpu_buffer->shortest_full))
return;
cpu_buffer->irq_work.wakeup_full = true;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 47a44b055a1d..c6c7a0af3ed2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6681,7 +6681,7 @@ trace_poll(struct trace_iterator *iter, struct file *filp, poll_table *poll_tabl
return EPOLLIN | EPOLLRDNORM;
else
return ring_buffer_poll_wait(iter->array_buffer->buffer, iter->cpu_file,
- filp, poll_table);
+ filp, poll_table, iter->tr->buffer_percent);
}
static __poll_t
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
d5082d386eee ("ipv4: Fix route deletion when nexthop info is not specified")
61b91eb33a69 ("ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d5082d386eee7e8ec46fa8581932c81a4961dcef Mon Sep 17 00:00:00 2001
From: Ido Schimmel <idosch(a)nvidia.com>
Date: Thu, 24 Nov 2022 23:09:32 +0200
Subject: [PATCH] ipv4: Fix route deletion when nexthop info is not specified
When the kernel receives a route deletion request from user space it
tries to delete a route that matches the route attributes specified in
the request.
If only prefix information is specified in the request, the kernel
should delete the first matching FIB alias regardless of its associated
FIB info. However, an error is currently returned when the FIB info is
backed by a nexthop object:
# ip nexthop add id 1 via 192.0.2.2 dev dummy10
# ip route add 198.51.100.0/24 nhid 1
# ip route del 198.51.100.0/24
RTNETLINK answers: No such process
Fix by matching on such a FIB info when legacy nexthop attributes are
not specified in the request. An earlier check already covers the case
where a nexthop ID is specified in the request.
Add tests that cover these flows. Before the fix:
# ./fib_nexthops.sh -t ipv4_fcnal
...
TEST: Delete route when not specifying nexthop attributes [FAIL]
Tests passed: 11
Tests failed: 1
After the fix:
# ./fib_nexthops.sh -t ipv4_fcnal
...
TEST: Delete route when not specifying nexthop attributes [ OK ]
Tests passed: 12
Tests failed: 0
No regressions in other tests:
# ./fib_nexthops.sh
...
Tests passed: 228
Tests failed: 0
# ./fib_tests.sh
...
Tests passed: 186
Tests failed: 0
Cc: stable(a)vger.kernel.org
Reported-by: Jonas Gorski <jonas.gorski(a)gmail.com>
Tested-by: Jonas Gorski <jonas.gorski(a)gmail.com>
Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects")
Fixes: 6bf92d70e690 ("net: ipv4: fix route with nexthop object delete warning")
Fixes: 61b91eb33a69 ("ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference")
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor(a)blackwall.org>
Reviewed-by: David Ahern <dsahern(a)kernel.org>
Link: https://lore.kernel.org/r/20221124210932.2470010-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f721c308248b..19a662003eef 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -888,9 +888,11 @@ int fib_nh_match(struct net *net, struct fib_config *cfg, struct fib_info *fi,
return 1;
}
- /* cannot match on nexthop object attributes */
- if (fi->nh)
- return 1;
+ if (fi->nh) {
+ if (cfg->fc_oif || cfg->fc_gw_family || cfg->fc_mp)
+ return 1;
+ return 0;
+ }
if (cfg->fc_oif || cfg->fc_gw_family) {
struct fib_nh *nh;
diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index ee5e98204d3d..a47b26ab48f2 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -1228,6 +1228,17 @@ ipv4_fcnal()
run_cmd "$IP ro add 172.16.101.0/24 nhid 21"
run_cmd "$IP ro del 172.16.101.0/24 nexthop via 172.16.1.7 dev veth1 nexthop via 172.16.1.8 dev veth1"
log_test $? 2 "Delete multipath route with only nh id based entry"
+
+ run_cmd "$IP nexthop add id 22 via 172.16.1.6 dev veth1"
+ run_cmd "$IP ro add 172.16.102.0/24 nhid 22"
+ run_cmd "$IP ro del 172.16.102.0/24 dev veth1"
+ log_test $? 2 "Delete route when specifying only nexthop device"
+
+ run_cmd "$IP ro del 172.16.102.0/24 via 172.16.1.6"
+ log_test $? 2 "Delete route when specifying only gateway"
+
+ run_cmd "$IP ro del 172.16.102.0/24"
+ log_test $? 0 "Delete route when not specifying nexthop attributes"
}
ipv4_grp_fcnal()
Greetings,
I am Danish Asad a gold miner from Burkina Faso. I have a
Mutual/Beneficial Business that would be beneficial to both of us. I
only have one question to ask of you,
Can you be honest?
Please note that the deal requires high level of maturity, honesty
and secrecy. This will involve moving some bars of golds, lots of
jewelries ranging from neck less to rings and bracelets under my care,
on trust to your hand also note that i will do everything to make sure
that the bars of golds and jewelries is moved as a purely legitimate,
so you will not be exposed to any risk.
I request for your full co-operation. I will give you details and
procedure when I receive your reply, to commence this business, I
require you to immediately indicate your interest by a return reply. I
will be waiting for your response in a timely manner.
Please contact my private email address.
danishasad074(a)gmail.com
Yours sincerely,
Mr.Danish Asad
danishasad074(a)gmail.com
+22667440112
NOTE: Please treat it genuinely.
Commit be36f9e7517e ("efi: READ_ONCE rng seed size before munmap")
added a READ_ONCE() and also changed the call to
add_bootloader_randomness() to use the local size variable. Neither
of these changes was actually needed and this was not backported to
the 4.14 stable branch.
Commit 161a438d730d ("efi: random: reduce seed size to 32 bytes")
reverted the addition of READ_ONCE() and added a limit to the value of
size. This depends on the earlier commit, because size can now differ
from seed->size, but it was wrongly backported to the 4.14 stable
branch by itself.
Apply the missing change to the add_bootloader_randomness() parameter
(except that here we are still using add_device_randomness()).
Fixes: 700485f70e50 ("efi: random: reduce seed size to 32 bytes")
Signed-off-by: Ben Hutchings <ben(a)decadent.org.uk>
---
drivers/firmware/efi/efi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index ed981f5e29ae..cc64869d8420 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -541,7 +541,7 @@ int __init efi_config_parse_tables(void *config_tables, int count, int sz,
seed = early_memremap(efi.rng_seed,
sizeof(*seed) + size);
if (seed != NULL) {
- add_device_randomness(seed->bits, seed->size);
+ add_device_randomness(seed->bits, size);
early_memunmap(seed, sizeof(*seed) + size);
pr_notice("seeding entropy pool\n");
} else {
Hello!
Would the stable maintainers please consider backporting the following
series of patches?:
https://lore.kernel.org/lkml/20220801013834.156015-1-andres@anarazel.de/
Branches where this is needed are:
* 5.4
* 5.10
* 5.15
Branch 6.0.y is fine, as this series is present there.
Failure is seen in this form:
-----8<----------8<----------8<-----
util/annotate.c: In function 'symbol__disassemble_bpf':
util/annotate.c:1739:9: error: too few arguments to function
'init_disassemble_info'
1739 | init_disassemble_info(&info, s,
| ^~~~~~~~~~~~~~~~~~~~~
In file included from util/annotate.c:1692:
/usr/include/dis-asm.h:472:13: note: declared here
472 | extern void init_disassemble_info (struct disassemble_info
*dinfo, void *stream,
| ^~~~~~~~~~~~~~~~~~~~~
make[4]: *** [/builds/linux/tools/build/Makefile.build:96:
/home/tuxbuild/.cache/tuxmake/builds/1/build/util/annotate.o] Error 1
----->8---------->8---------->8-----
The 5.15 backport is almost straight-forward, with patches 7 and 8
requiring some small modifications. I could not get the 5.10 backport
to build, and for 5.4 it was even more difficult to adapt.
Thanks and greetings!
Daniel Díaz
daniel.diaz(a)linaro.org
Hello!
The first patch fixes an issue reported by Sami, where linux panic()s
when bringing secondary CPUs online. The problem was the Spectre
workarounds trying to allocate a new slot for mitigating KVM when
those pages are no longer writeable.
While debugging that issue, I spotted the Spectre-BHB KVM mitigation was
over-riding the Spectre-v2 KVM Mitigation. It's supposed to happen the
other way round.
The backports aren't the same as mainline because the spectre mitigation code
was totally rewritten for v5.10, and prior to that the KVM infrastructure
is very different.
Thanks,
James Morse (2):
arm64: Fix panic() when Spectre-v2 causes Spectre-BHB to re-allocate
KVM vectors
arm64: errata: Fix KVM Spectre-v2 mitigation selection for
Cortex-A57/A72
arch/arm64/kernel/cpu_errata.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
--
2.30.2
commit 47b0c2e4c220f2251fd8dcfbb44479819c715e15 upstream.
make_mmu_pages_available() must be called with mmu_lock held for write.
However, if the TDP MMU is used, it will be called with mmu_lock held for
read.
This function does nothing unless shadow pages are used, so there is no
race unless nested TDP is used.
Since nested TDP uses shadow pages, old shadow pages may be zapped by this
function even when the TDP MMU is enabled.
Since shadow pages are never allocated by kvm_tdp_mmu_map(), a race
condition can be avoided by not calling make_mmu_pages_available() if the
TDP MMU is currently in use.
I encountered this when repeatedly starting and stopping nested VM.
It can be artificially caused by allocating a large number of nested TDP
SPTEs.
For example, the following BUG and general protection fault are caused in
the host kernel.
pte_list_remove: 00000000cd54fc10 many->many
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/mmu.c:963!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:pte_list_remove.cold+0x16/0x48 [kvm]
Call Trace:
<TASK>
drop_spte+0xe0/0x180 [kvm]
mmu_page_zap_pte+0x4f/0x140 [kvm]
__kvm_mmu_prepare_zap_page+0x62/0x3e0 [kvm]
kvm_mmu_zap_oldest_mmu_pages+0x7d/0xf0 [kvm]
direct_page_fault+0x3cb/0x9b0 [kvm]
kvm_tdp_page_fault+0x2c/0xa0 [kvm]
kvm_mmu_page_fault+0x207/0x930 [kvm]
npf_interception+0x47/0xb0 [kvm_amd]
svm_invoke_exit_handler+0x13c/0x1a0 [kvm_amd]
svm_handle_exit+0xfc/0x2c0 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0xa79/0x1780 [kvm]
kvm_vcpu_ioctl+0x29b/0x6f0 [kvm]
__x64_sys_ioctl+0x95/0xd0
do_syscall_64+0x5c/0x90
general protection fault, probably for non-canonical address
0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:kvm_mmu_commit_zap_page.part.0+0x4b/0xe0 [kvm]
Call Trace:
<TASK>
kvm_mmu_zap_oldest_mmu_pages+0xae/0xf0 [kvm]
direct_page_fault+0x3cb/0x9b0 [kvm]
kvm_tdp_page_fault+0x2c/0xa0 [kvm]
kvm_mmu_page_fault+0x207/0x930 [kvm]
npf_interception+0x47/0xb0 [kvm_amd]
CVE: CVE-2022-45869
Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Signed-off-by: Kazuki Takiguchi <takiguchi.kazuki171(a)gmail.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba1749a770eb..4724289c8a7f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2357,6 +2357,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
{
bool list_unstable;
+ lockdep_assert_held_write(&kvm->mmu_lock);
trace_kvm_mmu_prepare_zap_page(sp);
++kvm->stat.mmu_shadow_zapped;
*nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
@@ -4007,16 +4008,17 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva))
goto out_unlock;
- r = make_mmu_pages_available(vcpu);
- if (r)
- goto out_unlock;
- if (is_tdp_mmu_fault)
+ if (is_tdp_mmu_fault) {
r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
pfn, prefault);
- else
+ } else {
+ r = make_mmu_pages_available(vcpu);
+ if (r)
+ goto out_unlock;
r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
prefault, is_tdp);
+ }
out_unlock:
if (is_tdp_mmu_fault)
--
2.34.1
Commit fbd567e56942ecc4da906c4f3f3652c94773af5b upstream.
The member void *data in the structure devfreq can be overwrite
by governor_userspace. For example:
1. The device driver assigned the devfreq governor to simple_ondemand
by the function devfreq_add_device() and init the devfreq member
void *data to a pointer of a static structure devfreq_simple_ondemand_data
by the function devfreq_add_device().
2. The user changed the devfreq governor to userspace by the command
"echo userspace > /sys/class/devfreq/.../governor".
3. The governor userspace alloced a dynamic memory for the struct
userspace_data and assigend the member void *data of devfreq to
this memory by the function userspace_init().
4. The user changed the devfreq governor back to simple_ondemand
by the command "echo simple_ondemand > /sys/class/devfreq/.../governor".
5. The governor userspace exited and assigned the member void *data
in the structure devfreq to NULL by the function userspace_exit().
6. The governor simple_ondemand fetched the static information of
devfreq_simple_ondemand_data in the function
devfreq_simple_ondemand_func() but the member void *data of devfreq was
assigned to NULL by the function userspace_exit().
7. The information of upthreshold and downdifferential is lost
and the governor simple_ondemand can't work correctly.
The member void *data in the structure devfreq is designed for
a static pointer used in a governor and inited by the function
devfreq_add_device(). This patch add an element named governor_data
in the devfreq structure which can be used by a governor(E.g userspace)
who want to assign a private data to do some private things.
Fixes: ce26c5bb9569 ("PM / devfreq: Add basic governors")
Cc: stable(a)vger.kernel.org # 5.4-
Signed-off-by: Kant Fan <kant(a)allwinnertech.com>
---
drivers/devfreq/devfreq.c | 6 ++----
drivers/devfreq/governor_userspace.c | 12 ++++++------
include/linux/devfreq.h | 7 ++++---
3 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index c79652ee94be..93efaf69d08e 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -603,8 +603,7 @@ static void devfreq_dev_release(struct device *dev)
* @dev: the device to add devfreq feature.
* @profile: device-specific profile to run devfreq.
* @governor_name: name of the policy to choose frequency.
- * @data: private data for the governor. The devfreq framework does not
- * touch this value.
+ * @data: devfreq driver pass to governors, governor should not change it.
*/
struct devfreq *devfreq_add_device(struct device *dev,
struct devfreq_dev_profile *profile,
@@ -788,8 +787,7 @@ static void devm_devfreq_dev_release(struct device *dev, void *res)
* @dev: the device to add devfreq feature.
* @profile: device-specific profile to run devfreq.
* @governor_name: name of the policy to choose frequency.
- * @data: private data for the governor. The devfreq framework does not
- * touch this value.
+ * @data: devfreq driver pass to governors, governor should not change it.
*
* This function manages automatically the memory of devfreq device using device
* resource management and simplify the free operation for memory of devfreq
diff --git a/drivers/devfreq/governor_userspace.c b/drivers/devfreq/governor_userspace.c
index af94942fcf95..a3ae4dc4668b 100644
--- a/drivers/devfreq/governor_userspace.c
+++ b/drivers/devfreq/governor_userspace.c
@@ -21,7 +21,7 @@ struct userspace_data {
static int devfreq_userspace_func(struct devfreq *df, unsigned long *freq)
{
- struct userspace_data *data = df->data;
+ struct userspace_data *data = df->governor_data;
if (data->valid)
*freq = data->user_frequency;
@@ -40,7 +40,7 @@ static ssize_t store_freq(struct device *dev, struct device_attribute *attr,
int err = 0;
mutex_lock(&devfreq->lock);
- data = devfreq->data;
+ data = devfreq->governor_data;
sscanf(buf, "%lu", &wanted);
data->user_frequency = wanted;
@@ -60,7 +60,7 @@ static ssize_t show_freq(struct device *dev, struct device_attribute *attr,
int err = 0;
mutex_lock(&devfreq->lock);
- data = devfreq->data;
+ data = devfreq->governor_data;
if (data->valid)
err = sprintf(buf, "%lu\n", data->user_frequency);
@@ -91,7 +91,7 @@ static int userspace_init(struct devfreq *devfreq)
goto out;
}
data->valid = false;
- devfreq->data = data;
+ devfreq->governor_data = data;
err = sysfs_create_group(&devfreq->dev.kobj, &dev_attr_group);
out:
@@ -107,8 +107,8 @@ static void userspace_exit(struct devfreq *devfreq)
if (devfreq->dev.kobj.sd)
sysfs_remove_group(&devfreq->dev.kobj, &dev_attr_group);
- kfree(devfreq->data);
- devfreq->data = NULL;
+ kfree(devfreq->governor_data);
+ devfreq->governor_data = NULL;
}
static int devfreq_userspace_handler(struct devfreq *devfreq,
diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
index 2bae9ed3c783..6cbc6d1ae32f 100644
--- a/include/linux/devfreq.h
+++ b/include/linux/devfreq.h
@@ -121,8 +121,8 @@ struct devfreq_dev_profile {
* devfreq.nb to the corresponding register notifier call chain.
* @work: delayed work for load monitoring.
* @previous_freq: previously configured frequency value.
- * @data: Private data of the governor. The devfreq framework does not
- * touch this.
+ * @data: devfreq driver pass to governors, governor should not change it.
+ * @governor_data: private data for governors, devfreq core doesn't touch it.
* @min_freq: Limit minimum frequency requested by user (0: none)
* @max_freq: Limit maximum frequency requested by user (0: none)
* @scaling_min_freq: Limit minimum frequency requested by OPP interface
@@ -159,7 +159,8 @@ struct devfreq {
unsigned long previous_freq;
struct devfreq_dev_status last_status;
- void *data; /* private data for governors */
+ void *data;
+ void *governor_data;
unsigned long min_freq;
unsigned long max_freq;
--
2.29.0
Recent patches for io_uring polling.
Lin Ma (1):
io_uring/poll: fix poll_refs race with cancelation
Pavel Begunkov (4):
io_uring: update res mask in io_poll_check_events
io_uring: fix tw losing poll events
io_uring: cmpxchg for poll arm refs release
io_uring: make poll refs more robust
fs/io_uring.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 50 insertions(+), 7 deletions(-)
--
2.38.1
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
3c1ea6a5f4f5 ("drm/i915: Remove non-existent pipes from bigjoiner pipe mask")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3c1ea6a5f4f55d4e376675dda16945eb5d9bb4de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ville=20Syrj=C3=A4l=C3=A4?= <ville.syrjala(a)linux.intel.com>
Date: Fri, 18 Nov 2022 20:52:01 +0200
Subject: [PATCH] drm/i915: Remove non-existent pipes from bigjoiner pipe mask
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
bigjoiner_pipes() doesn't consider that:
- RKL only has three pipes
- some pipes may be fused off
This means that intel_atomic_check_bigjoiner() won't reject
all configurations that would need a non-existent pipe.
Instead we just keep on rolling witout actually having
reserved the slave pipe we need.
It's possible that we don't outright explode anywhere due to
this since eg. for_each_intel_crtc_in_pipe_mask() will only
walk the crtcs we've registered even though the passed in
pipe_mask asks for more of them. But clearly the thing won't
do what is expected of it when the required pipes are not
present.
Fix the problem by consulting the device info pipe_mask already
in bigjoiner_pipes().
Cc: stable(a)vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221118185201.10469-1-ville.…
Reviewed-by: Arun R Murthy <arun.r.murthy(a)intel.com>
(cherry picked from commit f1c87a94a1087a26f41007ee83264033007421b5)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index 461c62c88413..de77054195c6 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -3723,12 +3723,16 @@ static bool ilk_get_pipe_config(struct intel_crtc *crtc,
static u8 bigjoiner_pipes(struct drm_i915_private *i915)
{
+ u8 pipes;
+
if (DISPLAY_VER(i915) >= 12)
- return BIT(PIPE_A) | BIT(PIPE_B) | BIT(PIPE_C) | BIT(PIPE_D);
+ pipes = BIT(PIPE_A) | BIT(PIPE_B) | BIT(PIPE_C) | BIT(PIPE_D);
else if (DISPLAY_VER(i915) >= 11)
- return BIT(PIPE_B) | BIT(PIPE_C);
+ pipes = BIT(PIPE_B) | BIT(PIPE_C);
else
- return 0;
+ pipes = 0;
+
+ return pipes & RUNTIME_INFO(i915)->pipe_mask;
}
static bool transcoder_ddi_func_is_enabled(struct drm_i915_private *dev_priv,
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
95bc35f9bee5 ("mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 95bc35f9bee5220dad4e8567654ab3288a181639 Mon Sep 17 00:00:00 2001
From: SeongJae Park <sj(a)kernel.org>
Date: Tue, 22 Nov 2022 19:48:31 +0000
Subject: [PATCH] mm/damon/sysfs: fix wrong empty schemes assumption under
online tuning in damon_sysfs_set_schemes()
Commit da87878010e5 ("mm/damon/sysfs: support online inputs update") made
'damon_sysfs_set_schemes()' to be called for running DAMON context, which
could have schemes. In the case, DAMON sysfs interface is supposed to
update, remove, or add schemes to reflect the sysfs files. However, the
code is assuming the DAMON context wouldn't have schemes at all, and
therefore creates and adds new schemes. As a result, the code doesn't
work as intended for online schemes tuning and could have more than
expected memory footprint. The schemes are all in the DAMON context, so
it doesn't leak the memory, though.
Remove the wrong asssumption (the DAMON context wouldn't have schemes) in
'damon_sysfs_set_schemes()' to fix the bug.
Link: https://lkml.kernel.org/r/20221122194831.3472-1-sj@kernel.org
Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [5.19+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 5ce403378c20..07e5f1bdf025 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -2283,12 +2283,54 @@ static struct damos *damon_sysfs_mk_scheme(
&wmarks);
}
+static void damon_sysfs_update_scheme(struct damos *scheme,
+ struct damon_sysfs_scheme *sysfs_scheme)
+{
+ struct damon_sysfs_access_pattern *access_pattern =
+ sysfs_scheme->access_pattern;
+ struct damon_sysfs_quotas *sysfs_quotas = sysfs_scheme->quotas;
+ struct damon_sysfs_weights *sysfs_weights = sysfs_quotas->weights;
+ struct damon_sysfs_watermarks *sysfs_wmarks = sysfs_scheme->watermarks;
+
+ scheme->pattern.min_sz_region = access_pattern->sz->min;
+ scheme->pattern.max_sz_region = access_pattern->sz->max;
+ scheme->pattern.min_nr_accesses = access_pattern->nr_accesses->min;
+ scheme->pattern.max_nr_accesses = access_pattern->nr_accesses->max;
+ scheme->pattern.min_age_region = access_pattern->age->min;
+ scheme->pattern.max_age_region = access_pattern->age->max;
+
+ scheme->action = sysfs_scheme->action;
+
+ scheme->quota.ms = sysfs_quotas->ms;
+ scheme->quota.sz = sysfs_quotas->sz;
+ scheme->quota.reset_interval = sysfs_quotas->reset_interval_ms;
+ scheme->quota.weight_sz = sysfs_weights->sz;
+ scheme->quota.weight_nr_accesses = sysfs_weights->nr_accesses;
+ scheme->quota.weight_age = sysfs_weights->age;
+
+ scheme->wmarks.metric = sysfs_wmarks->metric;
+ scheme->wmarks.interval = sysfs_wmarks->interval_us;
+ scheme->wmarks.high = sysfs_wmarks->high;
+ scheme->wmarks.mid = sysfs_wmarks->mid;
+ scheme->wmarks.low = sysfs_wmarks->low;
+}
+
static int damon_sysfs_set_schemes(struct damon_ctx *ctx,
struct damon_sysfs_schemes *sysfs_schemes)
{
- int i;
+ struct damos *scheme, *next;
+ int i = 0;
+
+ damon_for_each_scheme_safe(scheme, next, ctx) {
+ if (i < sysfs_schemes->nr)
+ damon_sysfs_update_scheme(scheme,
+ sysfs_schemes->schemes_arr[i]);
+ else
+ damon_destroy_scheme(scheme);
+ i++;
+ }
- for (i = 0; i < sysfs_schemes->nr; i++) {
+ for (; i < sysfs_schemes->nr; i++) {
struct damos *scheme, *next;
scheme = damon_sysfs_mk_scheme(sysfs_schemes->schemes_arr[i]);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f268f6cf875f3220afc77bdd0bf1bb136eb54db9 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:14 +0100
Subject: [PATCH] mm/khugepaged: invoke MMU notifiers in shmem/file collapse
paths
Any codepath that zaps page table entries must invoke MMU notifiers to
ensure that secondary MMUs (like KVM) don't keep accessing pages which
aren't mapped anymore. Secondary MMUs don't hold their own references to
pages that are mirrored over, so failing to notify them can lead to page
use-after-free.
I'm marking this as addressing an issue introduced in commit f3f0e1d2150b
("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
the security impact of this only came in commit 27e1f8273113 ("khugepaged:
enable collapse pmd for pte-mapped THP"), which actually omitted flushes
for the removal of present PTEs, not just for the removal of empty page
tables.
Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Jann Horn <jannh(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 294cb75d9c22..3703a56571c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1399,6 +1399,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
unsigned long addr, pmd_t *pmdp)
{
pmd_t pmd;
+ struct mmu_notifier_range range;
mmap_assert_write_locked(mm);
if (vma->vm_file)
@@ -1410,8 +1411,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
if (vma->anon_vma)
lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm, addr,
+ addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
pmd = pmdp_collapse_flush(vma, addr, pmdp);
tlb_remove_table_sync_one();
+ mmu_notifier_invalidate_range_end(&range);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
4fa6893faeaa ("mm: thp: consolidate vma size check to transhuge_vma_suitable")
66137fb34a4b ("mm: khugepaged: check THP flag in hugepage_vma_check()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8d3c106e19e8d251da31ff4cc7462e4565d65084 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:12 +0100
Subject: [PATCH] mm/khugepaged: take the right locks for page table retraction
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a8d5ef2a77d2..0a11e132ad6b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1379,16 +1379,37 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
return SCAN_SUCCEED;
}
+/*
+ * A note about locking:
+ * Trying to take the page table spinlocks would be useless here because those
+ * are only used to synchronize:
+ *
+ * - modifying terminal entries (ones that point to a data page, not to another
+ * page table)
+ * - installing *new* non-terminal entries
+ *
+ * Instead, we need roughly the same kind of protection as free_pgtables() or
+ * mm_take_all_locks() (but only for a single VMA):
+ * The mmap lock together with this VMA's rmap locks covers all paths towards
+ * the page table entries we're messing with here, except for hardware page
+ * table walks and lockless_pages_from_mm().
+ */
static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- spinlock_t *ptl;
pmd_t pmd;
mmap_assert_write_locked(mm);
- ptl = pmd_lock(vma->vm_mm, pmdp);
+ if (vma->vm_file)
+ lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
+ /*
+ * All anon_vmas attached to the VMA have the same root and are
+ * therefore locked by the same lock.
+ */
+ if (vma->anon_vma)
+ lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+
pmd = pmdp_collapse_flush(vma, addr, pmdp);
- spin_unlock(ptl);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
@@ -1439,6 +1460,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
return SCAN_VMA_CHECK;
+ /*
+ * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+ * that got written to. Without this, we'd have to also lock the
+ * anon_vma if one exists.
+ */
+ if (vma->anon_vma)
+ return SCAN_VMA_CHECK;
+
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
if (userfaultfd_wp(vma))
return SCAN_PTE_UFFD_WP;
@@ -1472,6 +1501,20 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}
+ /*
+ * We need to lock the mapping so that from here on, only GUP-fast and
+ * hardware page walks can access the parts of the page tables that
+ * we're operating on.
+ * See collapse_and_free_pmd().
+ */
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+
+ /*
+ * This spinlock should be unnecessary: Nobody else should be accessing
+ * the page tables under spinlock protection here, only
+ * lockless_pages_from_mm() and the hardware page walker can access page
+ * tables while all the high-level locks are held in write mode.
+ */
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
@@ -1526,6 +1569,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
/* step 4: remove pte entries */
collapse_and_free_pmd(mm, vma, haddr, pmd);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+
maybe_install_pmd:
/* step 5: install pmd entry */
result = install_pmd
@@ -1539,6 +1584,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
abort:
pte_unmap_unlock(start_pte, ptl);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
@@ -1595,7 +1641,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* An alternative would be drop the check, but check that page
* table is clear before calling pmdp_collapse_flush() under
* ptl. It has higher chance to recover THP for the VMA, but
- * has higher cost too.
+ * has higher cost too. It would also probably require locking
+ * the anon_vma.
*/
if (vma->anon_vma) {
result = SCAN_PAGE_ANON;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
4fa6893faeaa ("mm: thp: consolidate vma size check to transhuge_vma_suitable")
66137fb34a4b ("mm: khugepaged: check THP flag in hugepage_vma_check()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8d3c106e19e8d251da31ff4cc7462e4565d65084 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:12 +0100
Subject: [PATCH] mm/khugepaged: take the right locks for page table retraction
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a8d5ef2a77d2..0a11e132ad6b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1379,16 +1379,37 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
return SCAN_SUCCEED;
}
+/*
+ * A note about locking:
+ * Trying to take the page table spinlocks would be useless here because those
+ * are only used to synchronize:
+ *
+ * - modifying terminal entries (ones that point to a data page, not to another
+ * page table)
+ * - installing *new* non-terminal entries
+ *
+ * Instead, we need roughly the same kind of protection as free_pgtables() or
+ * mm_take_all_locks() (but only for a single VMA):
+ * The mmap lock together with this VMA's rmap locks covers all paths towards
+ * the page table entries we're messing with here, except for hardware page
+ * table walks and lockless_pages_from_mm().
+ */
static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- spinlock_t *ptl;
pmd_t pmd;
mmap_assert_write_locked(mm);
- ptl = pmd_lock(vma->vm_mm, pmdp);
+ if (vma->vm_file)
+ lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
+ /*
+ * All anon_vmas attached to the VMA have the same root and are
+ * therefore locked by the same lock.
+ */
+ if (vma->anon_vma)
+ lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+
pmd = pmdp_collapse_flush(vma, addr, pmdp);
- spin_unlock(ptl);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
@@ -1439,6 +1460,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
return SCAN_VMA_CHECK;
+ /*
+ * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+ * that got written to. Without this, we'd have to also lock the
+ * anon_vma if one exists.
+ */
+ if (vma->anon_vma)
+ return SCAN_VMA_CHECK;
+
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
if (userfaultfd_wp(vma))
return SCAN_PTE_UFFD_WP;
@@ -1472,6 +1501,20 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}
+ /*
+ * We need to lock the mapping so that from here on, only GUP-fast and
+ * hardware page walks can access the parts of the page tables that
+ * we're operating on.
+ * See collapse_and_free_pmd().
+ */
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+
+ /*
+ * This spinlock should be unnecessary: Nobody else should be accessing
+ * the page tables under spinlock protection here, only
+ * lockless_pages_from_mm() and the hardware page walker can access page
+ * tables while all the high-level locks are held in write mode.
+ */
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
@@ -1526,6 +1569,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
/* step 4: remove pte entries */
collapse_and_free_pmd(mm, vma, haddr, pmd);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+
maybe_install_pmd:
/* step 5: install pmd entry */
result = install_pmd
@@ -1539,6 +1584,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
abort:
pte_unmap_unlock(start_pte, ptl);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
@@ -1595,7 +1641,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* An alternative would be drop the check, but check that page
* table is clear before calling pmdp_collapse_flush() under
* ptl. It has higher chance to recover THP for the VMA, but
- * has higher cost too.
+ * has higher cost too. It would also probably require locking
+ * the anon_vma.
*/
if (vma->anon_vma) {
result = SCAN_PAGE_ANON;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
1064026bab9f ("mm: khugepaged: reorg some khugepaged helpers")
7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
f707fa493784 ("mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate")
4fa6893faeaa ("mm: thp: consolidate vma size check to transhuge_vma_suitable")
66137fb34a4b ("mm: khugepaged: check THP flag in hugepage_vma_check()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8d3c106e19e8d251da31ff4cc7462e4565d65084 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:12 +0100
Subject: [PATCH] mm/khugepaged: take the right locks for page table retraction
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a8d5ef2a77d2..0a11e132ad6b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1379,16 +1379,37 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
return SCAN_SUCCEED;
}
+/*
+ * A note about locking:
+ * Trying to take the page table spinlocks would be useless here because those
+ * are only used to synchronize:
+ *
+ * - modifying terminal entries (ones that point to a data page, not to another
+ * page table)
+ * - installing *new* non-terminal entries
+ *
+ * Instead, we need roughly the same kind of protection as free_pgtables() or
+ * mm_take_all_locks() (but only for a single VMA):
+ * The mmap lock together with this VMA's rmap locks covers all paths towards
+ * the page table entries we're messing with here, except for hardware page
+ * table walks and lockless_pages_from_mm().
+ */
static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- spinlock_t *ptl;
pmd_t pmd;
mmap_assert_write_locked(mm);
- ptl = pmd_lock(vma->vm_mm, pmdp);
+ if (vma->vm_file)
+ lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
+ /*
+ * All anon_vmas attached to the VMA have the same root and are
+ * therefore locked by the same lock.
+ */
+ if (vma->anon_vma)
+ lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+
pmd = pmdp_collapse_flush(vma, addr, pmdp);
- spin_unlock(ptl);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
@@ -1439,6 +1460,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
return SCAN_VMA_CHECK;
+ /*
+ * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+ * that got written to. Without this, we'd have to also lock the
+ * anon_vma if one exists.
+ */
+ if (vma->anon_vma)
+ return SCAN_VMA_CHECK;
+
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
if (userfaultfd_wp(vma))
return SCAN_PTE_UFFD_WP;
@@ -1472,6 +1501,20 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}
+ /*
+ * We need to lock the mapping so that from here on, only GUP-fast and
+ * hardware page walks can access the parts of the page tables that
+ * we're operating on.
+ * See collapse_and_free_pmd().
+ */
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+
+ /*
+ * This spinlock should be unnecessary: Nobody else should be accessing
+ * the page tables under spinlock protection here, only
+ * lockless_pages_from_mm() and the hardware page walker can access page
+ * tables while all the high-level locks are held in write mode.
+ */
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
@@ -1526,6 +1569,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
/* step 4: remove pte entries */
collapse_and_free_pmd(mm, vma, haddr, pmd);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+
maybe_install_pmd:
/* step 5: install pmd entry */
result = install_pmd
@@ -1539,6 +1584,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
abort:
pte_unmap_unlock(start_pte, ptl);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
@@ -1595,7 +1641,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* An alternative would be drop the check, but check that page
* table is clear before calling pmdp_collapse_flush() under
* ptl. It has higher chance to recover THP for the VMA, but
- * has higher cost too.
+ * has higher cost too. It would also probably require locking
+ * the anon_vma.
*/
if (vma->anon_vma) {
result = SCAN_PAGE_ANON;
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
b26e27015ec9 ("mm: thp: convert to use common struct mm_slot")
685405020b9f ("mm/khugepaged: stop using vma linked list")
7d2c4385c341 ("mm/khugepaged: rename prefix of shared collapse functions")
7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
507228044236 ("mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage")
a7f4e6e4c47c ("mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()")
50ad2f24b3b4 ("mm/khugepaged: propagate enum scan_result codes back to callers")
9710a78ab2ae ("mm/khugepaged: dedup and simplify hugepage alloc and charging")
34d6b470ab9c ("mm/khugepaged: add struct collapse_control")
c6a7f445a272 ("mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8d3c106e19e8d251da31ff4cc7462e4565d65084 Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Fri, 25 Nov 2022 22:37:12 +0100
Subject: [PATCH] mm/khugepaged: take the right locks for page table retraction
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a8d5ef2a77d2..0a11e132ad6b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1379,16 +1379,37 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
return SCAN_SUCCEED;
}
+/*
+ * A note about locking:
+ * Trying to take the page table spinlocks would be useless here because those
+ * are only used to synchronize:
+ *
+ * - modifying terminal entries (ones that point to a data page, not to another
+ * page table)
+ * - installing *new* non-terminal entries
+ *
+ * Instead, we need roughly the same kind of protection as free_pgtables() or
+ * mm_take_all_locks() (but only for a single VMA):
+ * The mmap lock together with this VMA's rmap locks covers all paths towards
+ * the page table entries we're messing with here, except for hardware page
+ * table walks and lockless_pages_from_mm().
+ */
static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- spinlock_t *ptl;
pmd_t pmd;
mmap_assert_write_locked(mm);
- ptl = pmd_lock(vma->vm_mm, pmdp);
+ if (vma->vm_file)
+ lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
+ /*
+ * All anon_vmas attached to the VMA have the same root and are
+ * therefore locked by the same lock.
+ */
+ if (vma->anon_vma)
+ lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+
pmd = pmdp_collapse_flush(vma, addr, pmdp);
- spin_unlock(ptl);
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pmd);
pte_free(mm, pmd_pgtable(pmd));
@@ -1439,6 +1460,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
return SCAN_VMA_CHECK;
+ /*
+ * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+ * that got written to. Without this, we'd have to also lock the
+ * anon_vma if one exists.
+ */
+ if (vma->anon_vma)
+ return SCAN_VMA_CHECK;
+
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
if (userfaultfd_wp(vma))
return SCAN_PTE_UFFD_WP;
@@ -1472,6 +1501,20 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}
+ /*
+ * We need to lock the mapping so that from here on, only GUP-fast and
+ * hardware page walks can access the parts of the page tables that
+ * we're operating on.
+ * See collapse_and_free_pmd().
+ */
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+
+ /*
+ * This spinlock should be unnecessary: Nobody else should be accessing
+ * the page tables under spinlock protection here, only
+ * lockless_pages_from_mm() and the hardware page walker can access page
+ * tables while all the high-level locks are held in write mode.
+ */
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
@@ -1526,6 +1569,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
/* step 4: remove pte entries */
collapse_and_free_pmd(mm, vma, haddr, pmd);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+
maybe_install_pmd:
/* step 5: install pmd entry */
result = install_pmd
@@ -1539,6 +1584,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
abort:
pte_unmap_unlock(start_pte, ptl);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
@@ -1595,7 +1641,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
* An alternative would be drop the check, but check that page
* table is clear before calling pmdp_collapse_flush() under
* ptl. It has higher chance to recover THP for the VMA, but
- * has higher cost too.
+ * has higher cost too. It would also probably require locking
+ * the anon_vma.
*/
if (vma->anon_vma) {
result = SCAN_PAGE_ANON;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
4313e5a61304 ("tracing: Free buffers when a used dynamic event is removed")
5448d44c3855 ("tracing: Add unified dynamic event framework")
6b0b7551428e ("perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS")
1d9995771fcb ("s390: update defconfigs")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4313e5a613049dfc1819a6dfb5f94cf2caff9452 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Wed, 23 Nov 2022 17:14:34 -0500
Subject: [PATCH] tracing: Free buffers when a used dynamic event is removed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.
The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.
To show how this can be an issue, the following can crash the kernel:
# cd /sys/kernel/tracing
# for i in `seq 65536`; do
echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
# done
For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.
Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.
# echo 1 > kprobes/foo/enable
# cat /etc/passwd > /dev/null
# cat trace
cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
# echo 0 > kprobes/foo/enable
Now if we delete the kprobe and create a new one that reads a string:
# echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events
And now we can the trace:
# cat trace
sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1= cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
bash-1515 [007] ..... 534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U
And dmesg has:
==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049
CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
print_report+0x17f/0x47b
kasan_report+0xad/0x130
string+0xd4/0x1c0
vsnprintf+0x500/0x840
seq_buf_vprintf+0x62/0xc0
trace_seq_printf+0x10e/0x1e0
print_type_string+0x90/0xa0
print_kprobe_event+0x16b/0x290
print_trace_line+0x451/0x8e0
s_show+0x72/0x1f0
seq_read_iter+0x58e/0x750
seq_read+0x115/0x160
vfs_read+0x11d/0x460
ksys_read+0xa9/0x130
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc2e972ade2
Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
</TASK>
The buggy address belongs to the physical page:
page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).
As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.
If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.
Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.co…
Cc: stable(a)vger.kernel.org
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Depends-on: e18eb8783ec49 ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on: 5448d44c38557 ("tracing: Add unified dynamic event framework")
Depends-on: 6212dd29683ee ("tracing/kprobes: Use dyn_event framework for kprobe events")
Depends-on: 065e63f951432 ("tracing: Only have rmmod clear buffers that its events were active in")
Depends-on: 575380da8b469 ("tracing: Only clear trace buffer on module unload if event was traced")
Fixes: 77b44d1b7c283 ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event")
Reported-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Reported-by: Yujie Liu <yujie.liu(a)intel.com>
Reported-by: kernel test robot <yujie.liu(a)intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c
index 154996684fb5..4376887e0d8a 100644
--- a/kernel/trace/trace_dynevent.c
+++ b/kernel/trace/trace_dynevent.c
@@ -118,6 +118,7 @@ int dyn_event_release(const char *raw_command, struct dyn_event_operations *type
if (ret)
break;
}
+ tracing_reset_all_online_cpus();
mutex_unlock(&event_mutex);
out:
argv_free(argv);
@@ -214,6 +215,7 @@ int dyn_events_release_all(struct dyn_event_operations *type)
break;
}
out:
+ tracing_reset_all_online_cpus();
mutex_unlock(&event_mutex);
return ret;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 78cd19e31dba..f71ea6e79b3c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2880,7 +2880,10 @@ static int probe_remove_event_call(struct trace_event_call *call)
* TRACE_REG_UNREGISTER.
*/
if (file->flags & EVENT_FILE_FL_ENABLED)
- return -EBUSY;
+ goto busy;
+
+ if (file->flags & EVENT_FILE_FL_WAS_ENABLED)
+ tr->clear_trace = true;
/*
* The do_for_each_event_file_safe() is
* a double loop. After finding the call for this
@@ -2893,6 +2896,12 @@ static int probe_remove_event_call(struct trace_event_call *call)
__trace_remove_event_call(call);
return 0;
+ busy:
+ /* No need to clear the trace now */
+ list_for_each_entry(tr, &ftrace_trace_arrays, list) {
+ tr->clear_trace = false;
+ }
+ return -EBUSY;
}
/* Remove an event_call */
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
4313e5a61304 ("tracing: Free buffers when a used dynamic event is removed")
5448d44c3855 ("tracing: Add unified dynamic event framework")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4313e5a613049dfc1819a6dfb5f94cf2caff9452 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Wed, 23 Nov 2022 17:14:34 -0500
Subject: [PATCH] tracing: Free buffers when a used dynamic event is removed
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.
The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.
To show how this can be an issue, the following can crash the kernel:
# cd /sys/kernel/tracing
# for i in `seq 65536`; do
echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
# done
For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.
Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.
# echo 1 > kprobes/foo/enable
# cat /etc/passwd > /dev/null
# cat trace
cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
# echo 0 > kprobes/foo/enable
Now if we delete the kprobe and create a new one that reads a string:
# echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events
And now we can the trace:
# cat trace
sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1= cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
bash-1515 [007] ..... 534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U
And dmesg has:
==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049
CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
print_report+0x17f/0x47b
kasan_report+0xad/0x130
string+0xd4/0x1c0
vsnprintf+0x500/0x840
seq_buf_vprintf+0x62/0xc0
trace_seq_printf+0x10e/0x1e0
print_type_string+0x90/0xa0
print_kprobe_event+0x16b/0x290
print_trace_line+0x451/0x8e0
s_show+0x72/0x1f0
seq_read_iter+0x58e/0x750
seq_read+0x115/0x160
vfs_read+0x11d/0x460
ksys_read+0xa9/0x130
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc2e972ade2
Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
</TASK>
The buggy address belongs to the physical page:
page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).
As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.
If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.
Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.co…
Cc: stable(a)vger.kernel.org
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Depends-on: e18eb8783ec49 ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on: 5448d44c38557 ("tracing: Add unified dynamic event framework")
Depends-on: 6212dd29683ee ("tracing/kprobes: Use dyn_event framework for kprobe events")
Depends-on: 065e63f951432 ("tracing: Only have rmmod clear buffers that its events were active in")
Depends-on: 575380da8b469 ("tracing: Only clear trace buffer on module unload if event was traced")
Fixes: 77b44d1b7c283 ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event")
Reported-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Reported-by: Yujie Liu <yujie.liu(a)intel.com>
Reported-by: kernel test robot <yujie.liu(a)intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_dynevent.c b/kernel/trace/trace_dynevent.c
index 154996684fb5..4376887e0d8a 100644
--- a/kernel/trace/trace_dynevent.c
+++ b/kernel/trace/trace_dynevent.c
@@ -118,6 +118,7 @@ int dyn_event_release(const char *raw_command, struct dyn_event_operations *type
if (ret)
break;
}
+ tracing_reset_all_online_cpus();
mutex_unlock(&event_mutex);
out:
argv_free(argv);
@@ -214,6 +215,7 @@ int dyn_events_release_all(struct dyn_event_operations *type)
break;
}
out:
+ tracing_reset_all_online_cpus();
mutex_unlock(&event_mutex);
return ret;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 78cd19e31dba..f71ea6e79b3c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2880,7 +2880,10 @@ static int probe_remove_event_call(struct trace_event_call *call)
* TRACE_REG_UNREGISTER.
*/
if (file->flags & EVENT_FILE_FL_ENABLED)
- return -EBUSY;
+ goto busy;
+
+ if (file->flags & EVENT_FILE_FL_WAS_ENABLED)
+ tr->clear_trace = true;
/*
* The do_for_each_event_file_safe() is
* a double loop. After finding the call for this
@@ -2893,6 +2896,12 @@ static int probe_remove_event_call(struct trace_event_call *call)
__trace_remove_event_call(call);
return 0;
+ busy:
+ /* No need to clear the trace now */
+ list_for_each_entry(tr, &ftrace_trace_arrays, list) {
+ tr->clear_trace = false;
+ }
+ return -EBUSY;
}
/* Remove an event_call */
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
c61bfb1cb63d ("mmc: mtk-sd: Fix missing clk_disable_unprepare in msdc_of_clock_parse()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c61bfb1cb63ddab52b31cf5f1924688917e61fad Mon Sep 17 00:00:00 2001
From: Gaosheng Cui <cuigaosheng1(a)huawei.com>
Date: Fri, 25 Nov 2022 17:01:41 +0800
Subject: [PATCH] mmc: mtk-sd: Fix missing clk_disable_unprepare in
msdc_of_clock_parse()
The clk_disable_unprepare() should be called in the error handling
of devm_clk_bulk_get_optional, fix it by replacing devm_clk_get_optional
and clk_prepare_enable by devm_clk_get_optional_enabled.
Fixes: f5eccd94b63f ("mmc: mediatek: Add subsys clock control for MT8192 msdc")
Signed-off-by: Gaosheng Cui <cuigaosheng1(a)huawei.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221125090141.3626747-1-cuigaosheng1@huawei.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c
index df941438aef5..26bc59b5a7cc 100644
--- a/drivers/mmc/host/mtk-sd.c
+++ b/drivers/mmc/host/mtk-sd.c
@@ -2588,13 +2588,11 @@ static int msdc_of_clock_parse(struct platform_device *pdev,
return PTR_ERR(host->src_clk_cg);
}
- host->sys_clk_cg = devm_clk_get_optional(&pdev->dev, "sys_cg");
+ /* If present, always enable for this clock gate */
+ host->sys_clk_cg = devm_clk_get_optional_enabled(&pdev->dev, "sys_cg");
if (IS_ERR(host->sys_clk_cg))
host->sys_clk_cg = NULL;
- /* If present, always enable for this clock gate */
- clk_prepare_enable(host->sys_clk_cg);
-
host->bulk_clks[0].id = "pclk_cg";
host->bulk_clks[1].id = "axi_cg";
host->bulk_clks[2].id = "ahb_cg";
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
8d192bec534b ("parisc: Increase FRAME_WARN to 2048 bytes on parisc")
867050247e29 ("xtensa: increase size of gcc stack frame check")
55b70eed81cb ("parisc: Increase size of gcc stack frame check")
432654df90f2 ("parisc: Fix too large frame size warnings")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 152fe65f300e1819d59b80477d3e0999b4d5d7d2 Mon Sep 17 00:00:00 2001
From: Lee Jones <lee(a)kernel.org>
Date: Fri, 25 Nov 2022 12:07:50 +0000
Subject: [PATCH] Kconfig.debug: provide a little extra FRAME_WARN leeway when
KASAN is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few
over the current threshold. This can mainly be seen on 32-bit
architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee(a)kernel.org>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Leo Li <sunpeng.li(a)amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: "Pan, Xinhui" <Xinhui.Pan(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tom Rix <trix(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a1005415f0f4..580e453e284e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -399,6 +399,7 @@ config FRAME_WARN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
+ default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
8d192bec534b ("parisc: Increase FRAME_WARN to 2048 bytes on parisc")
867050247e29 ("xtensa: increase size of gcc stack frame check")
55b70eed81cb ("parisc: Increase size of gcc stack frame check")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 152fe65f300e1819d59b80477d3e0999b4d5d7d2 Mon Sep 17 00:00:00 2001
From: Lee Jones <lee(a)kernel.org>
Date: Fri, 25 Nov 2022 12:07:50 +0000
Subject: [PATCH] Kconfig.debug: provide a little extra FRAME_WARN leeway when
KASAN is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few
over the current threshold. This can mainly be seen on 32-bit
architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee(a)kernel.org>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Leo Li <sunpeng.li(a)amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: "Pan, Xinhui" <Xinhui.Pan(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tom Rix <trix(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a1005415f0f4..580e453e284e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -399,6 +399,7 @@ config FRAME_WARN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
+ default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
8d192bec534b ("parisc: Increase FRAME_WARN to 2048 bytes on parisc")
867050247e29 ("xtensa: increase size of gcc stack frame check")
55b70eed81cb ("parisc: Increase size of gcc stack frame check")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 152fe65f300e1819d59b80477d3e0999b4d5d7d2 Mon Sep 17 00:00:00 2001
From: Lee Jones <lee(a)kernel.org>
Date: Fri, 25 Nov 2022 12:07:50 +0000
Subject: [PATCH] Kconfig.debug: provide a little extra FRAME_WARN leeway when
KASAN is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few
over the current threshold. This can mainly be seen on 32-bit
architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee(a)kernel.org>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Leo Li <sunpeng.li(a)amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: "Pan, Xinhui" <Xinhui.Pan(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tom Rix <trix(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a1005415f0f4..580e453e284e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -399,6 +399,7 @@ config FRAME_WARN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
+ default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
8d192bec534b ("parisc: Increase FRAME_WARN to 2048 bytes on parisc")
867050247e29 ("xtensa: increase size of gcc stack frame check")
55b70eed81cb ("parisc: Increase size of gcc stack frame check")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 152fe65f300e1819d59b80477d3e0999b4d5d7d2 Mon Sep 17 00:00:00 2001
From: Lee Jones <lee(a)kernel.org>
Date: Fri, 25 Nov 2022 12:07:50 +0000
Subject: [PATCH] Kconfig.debug: provide a little extra FRAME_WARN leeway when
KASAN is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few
over the current threshold. This can mainly be seen on 32-bit
architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee(a)kernel.org>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Leo Li <sunpeng.li(a)amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: "Pan, Xinhui" <Xinhui.Pan(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tom Rix <trix(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a1005415f0f4..580e453e284e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -399,6 +399,7 @@ config FRAME_WARN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
+ default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
8d192bec534b ("parisc: Increase FRAME_WARN to 2048 bytes on parisc")
867050247e29 ("xtensa: increase size of gcc stack frame check")
55b70eed81cb ("parisc: Increase size of gcc stack frame check")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 152fe65f300e1819d59b80477d3e0999b4d5d7d2 Mon Sep 17 00:00:00 2001
From: Lee Jones <lee(a)kernel.org>
Date: Fri, 25 Nov 2022 12:07:50 +0000
Subject: [PATCH] Kconfig.debug: provide a little extra FRAME_WARN leeway when
KASAN is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few
over the current threshold. This can mainly be seen on 32-bit
architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee(a)kernel.org>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Leo Li <sunpeng.li(a)amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: "Pan, Xinhui" <Xinhui.Pan(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tom Rix <trix(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a1005415f0f4..580e453e284e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -399,6 +399,7 @@ config FRAME_WARN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
+ default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
8d192bec534b ("parisc: Increase FRAME_WARN to 2048 bytes on parisc")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 152fe65f300e1819d59b80477d3e0999b4d5d7d2 Mon Sep 17 00:00:00 2001
From: Lee Jones <lee(a)kernel.org>
Date: Fri, 25 Nov 2022 12:07:50 +0000
Subject: [PATCH] Kconfig.debug: provide a little extra FRAME_WARN leeway when
KASAN is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few
over the current threshold. This can mainly be seen on 32-bit
architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org
Signed-off-by: Lee Jones <lee(a)kernel.org>
Acked-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Leo Li <sunpeng.li(a)amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: "Pan, Xinhui" <Xinhui.Pan(a)amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tom Rix <trix(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a1005415f0f4..580e453e284e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -399,6 +399,7 @@ config FRAME_WARN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 2048 if PARISC
default 1536 if (!64BIT && XTENSA)
+ default 1280 if KASAN && !64BIT
default 1024 if !64BIT
default 2048 if 64BIT
help
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
c981cdfb9925 ("mmc: sdhci: Fix voltage switch delay")
fa0910107a9f ("mmc: sdhci: use FIELD_GET for preset value bit masks")
6a6d4ceb7be0 ("mmc: sdhci: Export sdhci_set_ios() from sdhci.c")
d1e4f74f911d ("mmc: sdhci: Do not use spin lock in set_ios paths")
5a436cc0af62 ("mmc: sdhci: Optimize delay loops")
e2ebfb2142ac ("mmc: sdhci: Do not disable interrupts while waiting for clock")
2ce0c7b65505 ("mmc: sdhci-of-at91: Support external regulators")
db9bd1638115 ("mmc: sdhci-msm: Factor out sdhci_msm_hs400")
0fb8a3d46b03 ("mmc: sdhci-msm: Factor out function to set/get msm clock rate")
b54aaa8a4fd8 ("mmc: sdhci-msm: Factor out sdhci_msm_hc_select_mode")
84ec048ba133 ("mmc: sdhci: Fix to handle MMC_POWER_UNDEFINED")
85a882c2e91d ("mmc: sdhci: export sdhci_execute_tuning()")
6b11e70bb72c ("mmc: sdhci: Tidy tuning loop")
da4bc4f2851e ("mmc: sdhci: Factor out tuning helper functions")
d0c3ab59105d ("mmc: sdhci: Use mmc_abort_tuning()")
0760c355525c ("mmc: sdhci: Always allow tuning to fall back to fixed sampling")
5ef5203b95c5 ("mmc: sdhci: Fix tuning reset after exhausting the maximum number of loops")
61e53bd0047d ("mmc: sdhci: Fix recovery from tuning timeout")
2ca71c27eeae ("Revert "mmc: sdhci: Reset cmd and data circuits after tuning failure"")
02e4293dc013 ("sdhci: sdhci-msm: update dll configuration")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c981cdfb9925f64a364f13c2b4f98f877308a408 Mon Sep 17 00:00:00 2001
From: Adrian Hunter <adrian.hunter(a)intel.com>
Date: Mon, 28 Nov 2022 15:32:56 +0200
Subject: [PATCH] mmc: sdhci: Fix voltage switch delay
Commit 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
removed voltage switch delays from sdhci because mmc core had been
enhanced to support them. However that assumed that sdhci_set_ios()
did a single clock change, which it did not, and so the delays in mmc
core, which should have come after the first clock change, were not
effective.
Fix by avoiding re-configuring UHS and preset settings when the clock
is turning on and the settings have not changed. That then also avoids
the associated clock changes, so that then sdhci_set_ios() does a single
clock change when voltage switching, and the mmc core delays become
effective.
To do that has meant keeping track of driver strength (host->drv_type),
and cases of reinitialization (host->reinit_uhs).
Note also, the 'turning_on_clk' restriction should not be necessary
but is done to minimize the impact of the change on stable kernels.
Fixes: 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20221128133259.38305-2-adrian.hunter@intel.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index fef03de85b99..c7ad32a75b57 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -373,6 +373,7 @@ static void sdhci_init(struct sdhci_host *host, int soft)
if (soft) {
/* force clock reconfiguration */
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
}
}
@@ -2293,11 +2294,46 @@ void sdhci_set_uhs_signaling(struct sdhci_host *host, unsigned timing)
}
EXPORT_SYMBOL_GPL(sdhci_set_uhs_signaling);
+static bool sdhci_timing_has_preset(unsigned char timing)
+{
+ switch (timing) {
+ case MMC_TIMING_UHS_SDR12:
+ case MMC_TIMING_UHS_SDR25:
+ case MMC_TIMING_UHS_SDR50:
+ case MMC_TIMING_UHS_SDR104:
+ case MMC_TIMING_UHS_DDR50:
+ case MMC_TIMING_MMC_DDR52:
+ return true;
+ };
+ return false;
+}
+
+static bool sdhci_preset_needed(struct sdhci_host *host, unsigned char timing)
+{
+ return !(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
+ sdhci_timing_has_preset(timing);
+}
+
+static bool sdhci_presetable_values_change(struct sdhci_host *host, struct mmc_ios *ios)
+{
+ /*
+ * Preset Values are: Driver Strength, Clock Generator and SDCLK/RCLK
+ * Frequency. Check if preset values need to be enabled, or the Driver
+ * Strength needs updating. Note, clock changes are handled separately.
+ */
+ return !host->preset_enabled &&
+ (sdhci_preset_needed(host, ios->timing) || host->drv_type != ios->drv_type);
+}
+
void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
{
struct sdhci_host *host = mmc_priv(mmc);
+ bool reinit_uhs = host->reinit_uhs;
+ bool turning_on_clk = false;
u8 ctrl;
+ host->reinit_uhs = false;
+
if (ios->power_mode == MMC_POWER_UNDEFINED)
return;
@@ -2323,6 +2359,8 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
sdhci_enable_preset_value(host, false);
if (!ios->clock || ios->clock != host->clock) {
+ turning_on_clk = ios->clock && !host->clock;
+
host->ops->set_clock(host, ios->clock);
host->clock = ios->clock;
@@ -2349,6 +2387,17 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_bus_width(host, ios->bus_width);
+ /*
+ * Special case to avoid multiple clock changes during voltage
+ * switching.
+ */
+ if (!reinit_uhs &&
+ turning_on_clk &&
+ host->timing == ios->timing &&
+ host->version >= SDHCI_SPEC_300 &&
+ !sdhci_presetable_values_change(host, ios))
+ return;
+
ctrl = sdhci_readb(host, SDHCI_HOST_CONTROL);
if (!(host->quirks & SDHCI_QUIRK_NO_HISPD_BIT)) {
@@ -2392,6 +2441,7 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
}
sdhci_writew(host, ctrl_2, SDHCI_HOST_CONTROL2);
+ host->drv_type = ios->drv_type;
} else {
/*
* According to SDHC Spec v3.00, if the Preset Value
@@ -2419,19 +2469,14 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_uhs_signaling(host, ios->timing);
host->timing = ios->timing;
- if (!(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
- ((ios->timing == MMC_TIMING_UHS_SDR12) ||
- (ios->timing == MMC_TIMING_UHS_SDR25) ||
- (ios->timing == MMC_TIMING_UHS_SDR50) ||
- (ios->timing == MMC_TIMING_UHS_SDR104) ||
- (ios->timing == MMC_TIMING_UHS_DDR50) ||
- (ios->timing == MMC_TIMING_MMC_DDR52))) {
+ if (sdhci_preset_needed(host, ios->timing)) {
u16 preset;
sdhci_enable_preset_value(host, true);
preset = sdhci_get_preset_value(host);
ios->drv_type = FIELD_GET(SDHCI_PRESET_DRV_MASK,
preset);
+ host->drv_type = ios->drv_type;
}
/* Re-enable SD Clock */
@@ -3768,6 +3813,7 @@ int sdhci_resume_host(struct sdhci_host *host)
sdhci_init(host, 0);
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
} else {
sdhci_init(host, (mmc->pm_flags & MMC_PM_KEEP_POWER));
@@ -3830,6 +3876,7 @@ int sdhci_runtime_resume_host(struct sdhci_host *host, int soft_reset)
/* Force clock and power re-program */
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->start_signal_voltage_switch(mmc, &mmc->ios);
mmc->ops->set_ios(mmc, &mmc->ios);
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index d750c464bd1e..87a3aaa07438 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -524,6 +524,8 @@ struct sdhci_host {
unsigned int clock; /* Current clock (MHz) */
u8 pwr; /* Current voltage */
+ u8 drv_type; /* Current UHS-I driver type */
+ bool reinit_uhs; /* Force UHS-related re-initialization */
bool runtime_suspended; /* Host is runtime suspended */
bool bus_on; /* Bus power prevents runtime suspend */
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
c981cdfb9925 ("mmc: sdhci: Fix voltage switch delay")
fa0910107a9f ("mmc: sdhci: use FIELD_GET for preset value bit masks")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c981cdfb9925f64a364f13c2b4f98f877308a408 Mon Sep 17 00:00:00 2001
From: Adrian Hunter <adrian.hunter(a)intel.com>
Date: Mon, 28 Nov 2022 15:32:56 +0200
Subject: [PATCH] mmc: sdhci: Fix voltage switch delay
Commit 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
removed voltage switch delays from sdhci because mmc core had been
enhanced to support them. However that assumed that sdhci_set_ios()
did a single clock change, which it did not, and so the delays in mmc
core, which should have come after the first clock change, were not
effective.
Fix by avoiding re-configuring UHS and preset settings when the clock
is turning on and the settings have not changed. That then also avoids
the associated clock changes, so that then sdhci_set_ios() does a single
clock change when voltage switching, and the mmc core delays become
effective.
To do that has meant keeping track of driver strength (host->drv_type),
and cases of reinitialization (host->reinit_uhs).
Note also, the 'turning_on_clk' restriction should not be necessary
but is done to minimize the impact of the change on stable kernels.
Fixes: 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20221128133259.38305-2-adrian.hunter@intel.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index fef03de85b99..c7ad32a75b57 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -373,6 +373,7 @@ static void sdhci_init(struct sdhci_host *host, int soft)
if (soft) {
/* force clock reconfiguration */
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
}
}
@@ -2293,11 +2294,46 @@ void sdhci_set_uhs_signaling(struct sdhci_host *host, unsigned timing)
}
EXPORT_SYMBOL_GPL(sdhci_set_uhs_signaling);
+static bool sdhci_timing_has_preset(unsigned char timing)
+{
+ switch (timing) {
+ case MMC_TIMING_UHS_SDR12:
+ case MMC_TIMING_UHS_SDR25:
+ case MMC_TIMING_UHS_SDR50:
+ case MMC_TIMING_UHS_SDR104:
+ case MMC_TIMING_UHS_DDR50:
+ case MMC_TIMING_MMC_DDR52:
+ return true;
+ };
+ return false;
+}
+
+static bool sdhci_preset_needed(struct sdhci_host *host, unsigned char timing)
+{
+ return !(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
+ sdhci_timing_has_preset(timing);
+}
+
+static bool sdhci_presetable_values_change(struct sdhci_host *host, struct mmc_ios *ios)
+{
+ /*
+ * Preset Values are: Driver Strength, Clock Generator and SDCLK/RCLK
+ * Frequency. Check if preset values need to be enabled, or the Driver
+ * Strength needs updating. Note, clock changes are handled separately.
+ */
+ return !host->preset_enabled &&
+ (sdhci_preset_needed(host, ios->timing) || host->drv_type != ios->drv_type);
+}
+
void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
{
struct sdhci_host *host = mmc_priv(mmc);
+ bool reinit_uhs = host->reinit_uhs;
+ bool turning_on_clk = false;
u8 ctrl;
+ host->reinit_uhs = false;
+
if (ios->power_mode == MMC_POWER_UNDEFINED)
return;
@@ -2323,6 +2359,8 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
sdhci_enable_preset_value(host, false);
if (!ios->clock || ios->clock != host->clock) {
+ turning_on_clk = ios->clock && !host->clock;
+
host->ops->set_clock(host, ios->clock);
host->clock = ios->clock;
@@ -2349,6 +2387,17 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_bus_width(host, ios->bus_width);
+ /*
+ * Special case to avoid multiple clock changes during voltage
+ * switching.
+ */
+ if (!reinit_uhs &&
+ turning_on_clk &&
+ host->timing == ios->timing &&
+ host->version >= SDHCI_SPEC_300 &&
+ !sdhci_presetable_values_change(host, ios))
+ return;
+
ctrl = sdhci_readb(host, SDHCI_HOST_CONTROL);
if (!(host->quirks & SDHCI_QUIRK_NO_HISPD_BIT)) {
@@ -2392,6 +2441,7 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
}
sdhci_writew(host, ctrl_2, SDHCI_HOST_CONTROL2);
+ host->drv_type = ios->drv_type;
} else {
/*
* According to SDHC Spec v3.00, if the Preset Value
@@ -2419,19 +2469,14 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_uhs_signaling(host, ios->timing);
host->timing = ios->timing;
- if (!(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
- ((ios->timing == MMC_TIMING_UHS_SDR12) ||
- (ios->timing == MMC_TIMING_UHS_SDR25) ||
- (ios->timing == MMC_TIMING_UHS_SDR50) ||
- (ios->timing == MMC_TIMING_UHS_SDR104) ||
- (ios->timing == MMC_TIMING_UHS_DDR50) ||
- (ios->timing == MMC_TIMING_MMC_DDR52))) {
+ if (sdhci_preset_needed(host, ios->timing)) {
u16 preset;
sdhci_enable_preset_value(host, true);
preset = sdhci_get_preset_value(host);
ios->drv_type = FIELD_GET(SDHCI_PRESET_DRV_MASK,
preset);
+ host->drv_type = ios->drv_type;
}
/* Re-enable SD Clock */
@@ -3768,6 +3813,7 @@ int sdhci_resume_host(struct sdhci_host *host)
sdhci_init(host, 0);
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
} else {
sdhci_init(host, (mmc->pm_flags & MMC_PM_KEEP_POWER));
@@ -3830,6 +3876,7 @@ int sdhci_runtime_resume_host(struct sdhci_host *host, int soft_reset)
/* Force clock and power re-program */
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->start_signal_voltage_switch(mmc, &mmc->ios);
mmc->ops->set_ios(mmc, &mmc->ios);
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index d750c464bd1e..87a3aaa07438 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -524,6 +524,8 @@ struct sdhci_host {
unsigned int clock; /* Current clock (MHz) */
u8 pwr; /* Current voltage */
+ u8 drv_type; /* Current UHS-I driver type */
+ bool reinit_uhs; /* Force UHS-related re-initialization */
bool runtime_suspended; /* Host is runtime suspended */
bool bus_on; /* Bus power prevents runtime suspend */
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
c981cdfb9925 ("mmc: sdhci: Fix voltage switch delay")
fa0910107a9f ("mmc: sdhci: use FIELD_GET for preset value bit masks")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c981cdfb9925f64a364f13c2b4f98f877308a408 Mon Sep 17 00:00:00 2001
From: Adrian Hunter <adrian.hunter(a)intel.com>
Date: Mon, 28 Nov 2022 15:32:56 +0200
Subject: [PATCH] mmc: sdhci: Fix voltage switch delay
Commit 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
removed voltage switch delays from sdhci because mmc core had been
enhanced to support them. However that assumed that sdhci_set_ios()
did a single clock change, which it did not, and so the delays in mmc
core, which should have come after the first clock change, were not
effective.
Fix by avoiding re-configuring UHS and preset settings when the clock
is turning on and the settings have not changed. That then also avoids
the associated clock changes, so that then sdhci_set_ios() does a single
clock change when voltage switching, and the mmc core delays become
effective.
To do that has meant keeping track of driver strength (host->drv_type),
and cases of reinitialization (host->reinit_uhs).
Note also, the 'turning_on_clk' restriction should not be necessary
but is done to minimize the impact of the change on stable kernels.
Fixes: 20b92a30b561 ("mmc: sdhci: update signal voltage switch code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20221128133259.38305-2-adrian.hunter@intel.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index fef03de85b99..c7ad32a75b57 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -373,6 +373,7 @@ static void sdhci_init(struct sdhci_host *host, int soft)
if (soft) {
/* force clock reconfiguration */
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
}
}
@@ -2293,11 +2294,46 @@ void sdhci_set_uhs_signaling(struct sdhci_host *host, unsigned timing)
}
EXPORT_SYMBOL_GPL(sdhci_set_uhs_signaling);
+static bool sdhci_timing_has_preset(unsigned char timing)
+{
+ switch (timing) {
+ case MMC_TIMING_UHS_SDR12:
+ case MMC_TIMING_UHS_SDR25:
+ case MMC_TIMING_UHS_SDR50:
+ case MMC_TIMING_UHS_SDR104:
+ case MMC_TIMING_UHS_DDR50:
+ case MMC_TIMING_MMC_DDR52:
+ return true;
+ };
+ return false;
+}
+
+static bool sdhci_preset_needed(struct sdhci_host *host, unsigned char timing)
+{
+ return !(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
+ sdhci_timing_has_preset(timing);
+}
+
+static bool sdhci_presetable_values_change(struct sdhci_host *host, struct mmc_ios *ios)
+{
+ /*
+ * Preset Values are: Driver Strength, Clock Generator and SDCLK/RCLK
+ * Frequency. Check if preset values need to be enabled, or the Driver
+ * Strength needs updating. Note, clock changes are handled separately.
+ */
+ return !host->preset_enabled &&
+ (sdhci_preset_needed(host, ios->timing) || host->drv_type != ios->drv_type);
+}
+
void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
{
struct sdhci_host *host = mmc_priv(mmc);
+ bool reinit_uhs = host->reinit_uhs;
+ bool turning_on_clk = false;
u8 ctrl;
+ host->reinit_uhs = false;
+
if (ios->power_mode == MMC_POWER_UNDEFINED)
return;
@@ -2323,6 +2359,8 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
sdhci_enable_preset_value(host, false);
if (!ios->clock || ios->clock != host->clock) {
+ turning_on_clk = ios->clock && !host->clock;
+
host->ops->set_clock(host, ios->clock);
host->clock = ios->clock;
@@ -2349,6 +2387,17 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_bus_width(host, ios->bus_width);
+ /*
+ * Special case to avoid multiple clock changes during voltage
+ * switching.
+ */
+ if (!reinit_uhs &&
+ turning_on_clk &&
+ host->timing == ios->timing &&
+ host->version >= SDHCI_SPEC_300 &&
+ !sdhci_presetable_values_change(host, ios))
+ return;
+
ctrl = sdhci_readb(host, SDHCI_HOST_CONTROL);
if (!(host->quirks & SDHCI_QUIRK_NO_HISPD_BIT)) {
@@ -2392,6 +2441,7 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
}
sdhci_writew(host, ctrl_2, SDHCI_HOST_CONTROL2);
+ host->drv_type = ios->drv_type;
} else {
/*
* According to SDHC Spec v3.00, if the Preset Value
@@ -2419,19 +2469,14 @@ void sdhci_set_ios(struct mmc_host *mmc, struct mmc_ios *ios)
host->ops->set_uhs_signaling(host, ios->timing);
host->timing = ios->timing;
- if (!(host->quirks2 & SDHCI_QUIRK2_PRESET_VALUE_BROKEN) &&
- ((ios->timing == MMC_TIMING_UHS_SDR12) ||
- (ios->timing == MMC_TIMING_UHS_SDR25) ||
- (ios->timing == MMC_TIMING_UHS_SDR50) ||
- (ios->timing == MMC_TIMING_UHS_SDR104) ||
- (ios->timing == MMC_TIMING_UHS_DDR50) ||
- (ios->timing == MMC_TIMING_MMC_DDR52))) {
+ if (sdhci_preset_needed(host, ios->timing)) {
u16 preset;
sdhci_enable_preset_value(host, true);
preset = sdhci_get_preset_value(host);
ios->drv_type = FIELD_GET(SDHCI_PRESET_DRV_MASK,
preset);
+ host->drv_type = ios->drv_type;
}
/* Re-enable SD Clock */
@@ -3768,6 +3813,7 @@ int sdhci_resume_host(struct sdhci_host *host)
sdhci_init(host, 0);
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->set_ios(mmc, &mmc->ios);
} else {
sdhci_init(host, (mmc->pm_flags & MMC_PM_KEEP_POWER));
@@ -3830,6 +3876,7 @@ int sdhci_runtime_resume_host(struct sdhci_host *host, int soft_reset)
/* Force clock and power re-program */
host->pwr = 0;
host->clock = 0;
+ host->reinit_uhs = true;
mmc->ops->start_signal_voltage_switch(mmc, &mmc->ios);
mmc->ops->set_ios(mmc, &mmc->ios);
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index d750c464bd1e..87a3aaa07438 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -524,6 +524,8 @@ struct sdhci_host {
unsigned int clock; /* Current clock (MHz) */
u8 pwr; /* Current voltage */
+ u8 drv_type; /* Current UHS-I driver type */
+ bool reinit_uhs; /* Force UHS-related re-initialization */
bool runtime_suspended; /* Host is runtime suspended */
bool bus_on; /* Bus power prevents runtime suspend */
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
489d144563f2 ("mmc: core: Fix ambiguous TRIM and DISCARD arg")
bc47e2f6f9e2 ("mmc: core: Add discard support to sd")
01904ff77676 ("mmc: core: Calculate the discard arg only once")
06c9ccb78e68 ("mmc: core: add proper be32 annotation")
c0a3e080f929 ("mmc: core: Move erase/trim/discard defines from public core.h to mmc.h")
55244c5659b5 ("mmc: core: Move public functions from core.h to private headers")
066185d69063 ("mmc: core: First step in cleaning up private mmc header files")
164b50b35390 ("mmc: block: Replace "goto retry" by a proper do / while loop")
e85baa8868b0 ("mmc: sd: Meet alignment requirements for raw_ssr DMA")
f397c8d80a5e ("mmc: block: Move files to core")
e711f0309109 ("mmc: mmc: Introduce mmc_abort_tuning()")
aa33ce3c411a ("mmc: core: Enable __mmc_switch() to change bus speed timing for the host")
625228fa3e01 ("mmc: core: Rename ignore_crc to retry_crc_err to reflect its purpose")
437590a123b6 ("mmc: core: Retry instead of ignore at CRC errors when polling for busy")
8e8b3f514c12 ("mmc: core: use enum mmc_blk_status properly")
70562644f4ee ("mmc: core: Don't use ->card_busy() and CMD13 in combination when polling")
716bdb8953c7 ("mmc: core: Factor out code related to polling in __mmc_switch()")
cb26ce069ffa ("mmc: core: Clarify code which deals with polling in __mmc_switch()")
20348d1981da ("mmc: core: Make mmc_switch_status() available for mmc core")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 489d144563f23911262a652234b80c70c89c978b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20L=C3=B6hle?= <CLoehle(a)hyperstone.com>
Date: Thu, 17 Nov 2022 14:42:09 +0000
Subject: [PATCH] mmc: core: Fix ambiguous TRIM and DISCARD arg
Clean up the MMC_TRIM_ARGS define that became ambiguous with DISCARD
introduction. While at it, let's fix one usage where MMC_TRIM_ARGS falsely
included DISCARD too.
Fixes: b3bf915308ca ("mmc: core: new discard feature support at eMMC v4.5")
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/11376b5714964345908f3990f17e0701@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index c5de202f530a..de1cc9e1ae57 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1484,6 +1484,11 @@ void mmc_init_erase(struct mmc_card *card)
card->pref_erase = 0;
}
+static bool is_trim_arg(unsigned int arg)
+{
+ return (arg & MMC_TRIM_OR_DISCARD_ARGS) && arg != MMC_DISCARD_ARG;
+}
+
static unsigned int mmc_mmc_erase_timeout(struct mmc_card *card,
unsigned int arg, unsigned int qty)
{
@@ -1766,7 +1771,7 @@ int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
!(card->ext_csd.sec_feature_support & EXT_CSD_SEC_ER_EN))
return -EOPNOTSUPP;
- if (mmc_card_mmc(card) && (arg & MMC_TRIM_ARGS) &&
+ if (mmc_card_mmc(card) && is_trim_arg(arg) &&
!(card->ext_csd.sec_feature_support & EXT_CSD_SEC_GB_CL_EN))
return -EOPNOTSUPP;
@@ -1796,7 +1801,7 @@ int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
* identified by the card->eg_boundary flag.
*/
rem = card->erase_size - (from % card->erase_size);
- if ((arg & MMC_TRIM_ARGS) && (card->eg_boundary) && (nr > rem)) {
+ if ((arg & MMC_TRIM_OR_DISCARD_ARGS) && card->eg_boundary && nr > rem) {
err = mmc_do_erase(card, from, from + rem - 1, arg);
from += rem;
if ((err) || (to <= from))
diff --git a/include/linux/mmc/mmc.h b/include/linux/mmc/mmc.h
index 9c50bc40f8ff..6f7993803ee7 100644
--- a/include/linux/mmc/mmc.h
+++ b/include/linux/mmc/mmc.h
@@ -451,7 +451,7 @@ static inline bool mmc_ready_for_data(u32 status)
#define MMC_SECURE_TRIM1_ARG 0x80000001
#define MMC_SECURE_TRIM2_ARG 0x80008000
#define MMC_SECURE_ARGS 0x80000000
-#define MMC_TRIM_ARGS 0x00008001
+#define MMC_TRIM_OR_DISCARD_ARGS 0x00008003
#define mmc_driver_type_mask(n) (1 << (n))
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
489d144563f2 ("mmc: core: Fix ambiguous TRIM and DISCARD arg")
bc47e2f6f9e2 ("mmc: core: Add discard support to sd")
01904ff77676 ("mmc: core: Calculate the discard arg only once")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 489d144563f23911262a652234b80c70c89c978b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20L=C3=B6hle?= <CLoehle(a)hyperstone.com>
Date: Thu, 17 Nov 2022 14:42:09 +0000
Subject: [PATCH] mmc: core: Fix ambiguous TRIM and DISCARD arg
Clean up the MMC_TRIM_ARGS define that became ambiguous with DISCARD
introduction. While at it, let's fix one usage where MMC_TRIM_ARGS falsely
included DISCARD too.
Fixes: b3bf915308ca ("mmc: core: new discard feature support at eMMC v4.5")
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/11376b5714964345908f3990f17e0701@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index c5de202f530a..de1cc9e1ae57 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1484,6 +1484,11 @@ void mmc_init_erase(struct mmc_card *card)
card->pref_erase = 0;
}
+static bool is_trim_arg(unsigned int arg)
+{
+ return (arg & MMC_TRIM_OR_DISCARD_ARGS) && arg != MMC_DISCARD_ARG;
+}
+
static unsigned int mmc_mmc_erase_timeout(struct mmc_card *card,
unsigned int arg, unsigned int qty)
{
@@ -1766,7 +1771,7 @@ int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
!(card->ext_csd.sec_feature_support & EXT_CSD_SEC_ER_EN))
return -EOPNOTSUPP;
- if (mmc_card_mmc(card) && (arg & MMC_TRIM_ARGS) &&
+ if (mmc_card_mmc(card) && is_trim_arg(arg) &&
!(card->ext_csd.sec_feature_support & EXT_CSD_SEC_GB_CL_EN))
return -EOPNOTSUPP;
@@ -1796,7 +1801,7 @@ int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
* identified by the card->eg_boundary flag.
*/
rem = card->erase_size - (from % card->erase_size);
- if ((arg & MMC_TRIM_ARGS) && (card->eg_boundary) && (nr > rem)) {
+ if ((arg & MMC_TRIM_OR_DISCARD_ARGS) && card->eg_boundary && nr > rem) {
err = mmc_do_erase(card, from, from + rem - 1, arg);
from += rem;
if ((err) || (to <= from))
diff --git a/include/linux/mmc/mmc.h b/include/linux/mmc/mmc.h
index 9c50bc40f8ff..6f7993803ee7 100644
--- a/include/linux/mmc/mmc.h
+++ b/include/linux/mmc/mmc.h
@@ -451,7 +451,7 @@ static inline bool mmc_ready_for_data(u32 status)
#define MMC_SECURE_TRIM1_ARG 0x80000001
#define MMC_SECURE_TRIM2_ARG 0x80008000
#define MMC_SECURE_ARGS 0x80000000
-#define MMC_TRIM_ARGS 0x00008001
+#define MMC_TRIM_OR_DISCARD_ARGS 0x00008003
#define mmc_driver_type_mask(n) (1 << (n))
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
489d144563f2 ("mmc: core: Fix ambiguous TRIM and DISCARD arg")
bc47e2f6f9e2 ("mmc: core: Add discard support to sd")
01904ff77676 ("mmc: core: Calculate the discard arg only once")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 489d144563f23911262a652234b80c70c89c978b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20L=C3=B6hle?= <CLoehle(a)hyperstone.com>
Date: Thu, 17 Nov 2022 14:42:09 +0000
Subject: [PATCH] mmc: core: Fix ambiguous TRIM and DISCARD arg
Clean up the MMC_TRIM_ARGS define that became ambiguous with DISCARD
introduction. While at it, let's fix one usage where MMC_TRIM_ARGS falsely
included DISCARD too.
Fixes: b3bf915308ca ("mmc: core: new discard feature support at eMMC v4.5")
Signed-off-by: Christian Loehle <cloehle(a)hyperstone.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/11376b5714964345908f3990f17e0701@hyperstone.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index c5de202f530a..de1cc9e1ae57 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1484,6 +1484,11 @@ void mmc_init_erase(struct mmc_card *card)
card->pref_erase = 0;
}
+static bool is_trim_arg(unsigned int arg)
+{
+ return (arg & MMC_TRIM_OR_DISCARD_ARGS) && arg != MMC_DISCARD_ARG;
+}
+
static unsigned int mmc_mmc_erase_timeout(struct mmc_card *card,
unsigned int arg, unsigned int qty)
{
@@ -1766,7 +1771,7 @@ int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
!(card->ext_csd.sec_feature_support & EXT_CSD_SEC_ER_EN))
return -EOPNOTSUPP;
- if (mmc_card_mmc(card) && (arg & MMC_TRIM_ARGS) &&
+ if (mmc_card_mmc(card) && is_trim_arg(arg) &&
!(card->ext_csd.sec_feature_support & EXT_CSD_SEC_GB_CL_EN))
return -EOPNOTSUPP;
@@ -1796,7 +1801,7 @@ int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
* identified by the card->eg_boundary flag.
*/
rem = card->erase_size - (from % card->erase_size);
- if ((arg & MMC_TRIM_ARGS) && (card->eg_boundary) && (nr > rem)) {
+ if ((arg & MMC_TRIM_OR_DISCARD_ARGS) && card->eg_boundary && nr > rem) {
err = mmc_do_erase(card, from, from + rem - 1, arg);
from += rem;
if ((err) || (to <= from))
diff --git a/include/linux/mmc/mmc.h b/include/linux/mmc/mmc.h
index 9c50bc40f8ff..6f7993803ee7 100644
--- a/include/linux/mmc/mmc.h
+++ b/include/linux/mmc/mmc.h
@@ -451,7 +451,7 @@ static inline bool mmc_ready_for_data(u32 status)
#define MMC_SECURE_TRIM1_ARG 0x80000001
#define MMC_SECURE_TRIM2_ARG 0x80008000
#define MMC_SECURE_ARGS 0x80000000
-#define MMC_TRIM_ARGS 0x00008001
+#define MMC_TRIM_OR_DISCARD_ARGS 0x00008003
#define mmc_driver_type_mask(n) (1 << (n))
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
829ae0f81ce0 ("mm: migrate: fix THP's mapcount on isolation")
89f6c88a6ab4 ("mm: __isolate_lru_page_prepare() in isolate_migratepages_block()")
c2135f7c570b ("mm/vmscan: __isolate_lru_page_prepare() cleanup")
9df41314390b ("mm/compaction: do page isolation first in compaction")
d25b5bd8a8f4 ("mm/lru: introduce TestClearPageLRU()")
13805a88a9bd ("mm/mlock: remove __munlock_isolate_lru_page()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 829ae0f81ce093d674ff2256f66a714753e9ce32 Mon Sep 17 00:00:00 2001
From: Gavin Shan <gshan(a)redhat.com>
Date: Thu, 24 Nov 2022 17:55:23 +0800
Subject: [PATCH] mm: migrate: fix THP's mapcount on isolation
The issue is reported when removing memory through virtio_mem device. The
transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page can't
be migrated and the corresponding memory block can't be put into offline
state.
Fix it by replacing page_mapcount() with total_mapcount(). With this, the
transparent huge page can be isolated and migrated, and the memory block
can be put into offline state. Besides, The page's refcount is increased
a bit earlier to avoid the page is released when the check is executed.
Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Gavin Shan <gshan(a)redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha(a)redhat.com>
Tested-by: Zhenyu Zhang <zhenyzha(a)redhat.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org> [5.7+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/compaction.c b/mm/compaction.c
index c51f7f545afe..1f6da31dd9a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
goto isolate_fail;
}
+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto isolate_fail;
+
/*
* Migration will fail if an anonymous page is pinned in memory,
* so avoid taking lru_lock and isolating it unnecessarily in an
* admittedly racy check.
*/
mapping = page_mapping(page);
- if (!mapping && page_count(page) > page_mapcount(page))
- goto isolate_fail;
+ if (!mapping && (page_count(page) - 1) > total_mapcount(page))
+ goto isolate_fail_put;
/*
* Only allow to migrate anonymous pages in GFP_NOFS context
* because those do not depend on fs locks.
*/
if (!(cc->gfp_mask & __GFP_FS) && mapping)
- goto isolate_fail;
-
- /*
- * Be careful not to clear PageLRU until after we're
- * sure the page is not being freed elsewhere -- the
- * page release code relies on it.
- */
- if (unlikely(!get_page_unless_zero(page)))
- goto isolate_fail;
+ goto isolate_fail_put;
/* Only take pages on LRU: a check now makes later tests safe */
if (!PageLRU(page))
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
829ae0f81ce0 ("mm: migrate: fix THP's mapcount on isolation")
89f6c88a6ab4 ("mm: __isolate_lru_page_prepare() in isolate_migratepages_block()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 829ae0f81ce093d674ff2256f66a714753e9ce32 Mon Sep 17 00:00:00 2001
From: Gavin Shan <gshan(a)redhat.com>
Date: Thu, 24 Nov 2022 17:55:23 +0800
Subject: [PATCH] mm: migrate: fix THP's mapcount on isolation
The issue is reported when removing memory through virtio_mem device. The
transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page can't
be migrated and the corresponding memory block can't be put into offline
state.
Fix it by replacing page_mapcount() with total_mapcount(). With this, the
transparent huge page can be isolated and migrated, and the memory block
can be put into offline state. Besides, The page's refcount is increased
a bit earlier to avoid the page is released when the check is executed.
Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Gavin Shan <gshan(a)redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha(a)redhat.com>
Tested-by: Zhenyu Zhang <zhenyzha(a)redhat.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org> [5.7+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/compaction.c b/mm/compaction.c
index c51f7f545afe..1f6da31dd9a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -984,29 +984,29 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
goto isolate_fail;
}
+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto isolate_fail;
+
/*
* Migration will fail if an anonymous page is pinned in memory,
* so avoid taking lru_lock and isolating it unnecessarily in an
* admittedly racy check.
*/
mapping = page_mapping(page);
- if (!mapping && page_count(page) > page_mapcount(page))
- goto isolate_fail;
+ if (!mapping && (page_count(page) - 1) > total_mapcount(page))
+ goto isolate_fail_put;
/*
* Only allow to migrate anonymous pages in GFP_NOFS context
* because those do not depend on fs locks.
*/
if (!(cc->gfp_mask & __GFP_FS) && mapping)
- goto isolate_fail;
-
- /*
- * Be careful not to clear PageLRU until after we're
- * sure the page is not being freed elsewhere -- the
- * page release code relies on it.
- */
- if (unlikely(!get_page_unless_zero(page)))
- goto isolate_fail;
+ goto isolate_fail_put;
/* Only take pages on LRU: a check now makes later tests safe */
if (!PageLRU(page))
This is incremental to Dave's recent "[PATCH v7 00/20] Introduce
security commands for CXL pmem device" [1], starting after patch 17 [2].
I.e. I want to drop patch 18, 19, and 20 from that series and replace
them with these. It was prompted by Davidlohr's concerns about
cxl_invalidate_memregion().
The insight is that now that cpu_cache_invalidate_memregion() has a
default implementation for all architectures, the cache management can
move from the intel-pmem-specific security operations to the generic
NVDIMM core. This relieves the new CXL security ops from needing to
open-code their own cache flushing.
Also prompted by Davidlohr's concerns is what do about cache flushing
for scenarios outside of the PMEM security operations. For that "[PATCH
5/5] cxl/region: Manage CPU caches relative to DPA invalidation events"
proposes to handle that management at region activation time. This does
mean that dynamic CXL region provisioning is limited to environments
where cpu_cache_has_invalidate_memregion() is true. A new
CONFIG_CXL_REGION_INVALIDATION_TEST is added to bypass that data
integrity enforcement.
Lastly this includes some fixups, one for the fact that
cxl_region_probe() was ignoring some errors, another to enforce that
PMEM security operations originate through LIBNVDIMM, and lastly a
cleanup to move a string formatting failure condition from runtime to
compile-time in cxl_nvdimm_alloc().
[1]: http://lore.kernel.org/r/166983606451.2734609.4050644229630259452.stgit@dji…
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.2/…
---
Dan Williams (4):
cxl/region: Fix missing probe failure
cxl/pmem: Enforce keyctl ABI for PMEM security
nvdimm/region: Move cache management to the region driver
cxl/region: Manage CPU caches relative to DPA invalidation events
Dave Jiang (1):
cxl: add dimm_id support for __nvdimm_create()
drivers/acpi/nfit/intel.c | 25 ---------------------
drivers/cxl/Kconfig | 18 +++++++++++++++
drivers/cxl/core/mbox.c | 10 +++++++++
drivers/cxl/core/pmem.c | 7 ++++++
drivers/cxl/core/region.c | 34 +++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 11 +++++++++
drivers/cxl/pmem.c | 3 ++-
drivers/cxl/security.c | 14 ------------
drivers/nvdimm/region.c | 11 +++++++++
drivers/nvdimm/region_devs.c | 49 +++++++++++++++++++++++++++++++++++++++++-
drivers/nvdimm/security.c | 6 +++++
include/linux/libnvdimm.h | 5 ++++
12 files changed, 152 insertions(+), 41 deletions(-)
base-commit: 15a8348707ffd2a37516db9bede88cc0bb467e0b