[ Upstream commit ccf16413e520164eb718cf8b22a30438da80ff23 ]
kernel ulong and compat_ulong_t may not be the same width. Use the compat
type directly to eliminate the mismatch.
Without this, large disks in 32-bit compat mode would have their size
silently truncated rather than returning -EFBIG.
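For reference, the mismatch reduced to a minimal user-space sketch,
assuming a 64-bit host where unsigned long is 64 bits and using uint32_t
as a stand-in for compat_ulong_t (illustrative only, not kernel code):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t compat_ulong_t;	/* stand-in for the kernel type */

int main(void)
{
	/* a 3 TiB disk, expressed in 512-byte sectors */
	uint64_t sectors = ((uint64_t)3 << 40) >> 9;

	/* Old check: ~0UL is 64 bits of ones here, so this never fires
	 * and compat_put_ulong() silently truncates the sector count. */
	printf("old check fires: %d\n", sectors > ~0UL);

	/* New check: compare against the 32-bit maximum a compat caller
	 * can actually receive, so oversized disks get -EFBIG instead. */
	printf("new check fires: %d\n",
	       sectors > (uint64_t)~(compat_ulong_t)0);
	return 0;
}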
Reviewed-by: Bart Van Assche <bvanassche(a)acm.org>
Signed-off-by: Khazhismel Kumykov <khazhy(a)google.com>
Reviewed-by: Chaitanya Kulkarni <kch(a)nvidia.com>
Link: https://lore.kernel.org/r/20220414224056.2875681-1-khazhy@google.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
[compat_ioctl is its own file in 5.4-stable and earlier]
---
The original commit should apply to the newer stable trees; this one
should apply to all the older stable trees.
block/compat_ioctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index 7f053468b50d..d490ac220ba8 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -393,7 +393,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
return 0;
case BLKGETSIZE:
size = i_size_read(bdev->bd_inode);
- if ((size >> 9) > ~0UL)
+ if ((size >> 9) > ~(compat_ulong_t)0)
return -EFBIG;
return compat_put_ulong(arg, size >> 9);
--
2.36.0.rc0.470.gd361397f0d-goog
Hi Daniel,
On Mon, Apr 25, 2022 at 9:00 PM Daniel Harding <dharding(a)living180.net> wrote:
> The commit "Restrict usage of GPIO chip irq members before initialization" breaks suspend on a
> Dell Inspiron 5515 laptop in a very severe way.
Does this commit, which is already upstream in Torvalds' tree, solve the issue?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/d…
Yours,
Linus Walleij
In function dvb_register_device() -> dvb_register_media_device() ->
dvb_create_media_entity(), dvbdev->entity is allocated and initialized. If
the initialization fails, it frees dvbdev->entity and returns an error
code. The caller takes the error code and handles the error by calling
dvb_media_device_free(), which unregisters the entity and frees the
field again if it is not NULL. As dvbdev->entity is not NULLed in
dvb_create_media_entity() when the allocation of dvbdev->pads fails, a
double free may occur. This may also cause a use-after-free in
media_device_unregister_entity().
Fix this by storing NULL to dvbdev->entity when it is freed.
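The pattern, reduced to a minimal user-space sketch (generic struct and
function names, not the dvbdev code itself):

#include <stdlib.h>

struct obj { void *entity; };

static void cleanup(struct obj *o)
{
	/* Frees the field unconditionally: a double free unless the
	 * error path cleared the pointer after its own free. */
	free(o->entity);
}

static int create(struct obj *o)
{
	o->entity = malloc(32);
	if (!o->entity)
		return -1;
	/* ... a later allocation fails ... */
	free(o->entity);
	o->entity = NULL;	/* the fix: cleanup()'s free(NULL) is a no-op */
	return -1;
}

int main(void)
{
	struct obj o = { 0 };

	if (create(&o))
		cleanup(&o);	/* safe only because create() cleared the pointer */
	return 0;
}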
Fixes: fcd5ce4b3936 ("media: dvb-core: fix a memory leak bug")
Cc: stable(a)vger.kernel.org
Cc: Wenwen Wang <wenwen(a)cs.uga.edu>
Signed-off-by: Keita Suzuki <keitasuzuki.park(a)sslab.ics.keio.ac.jp>
---
drivers/media/dvb-core/dvbdev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/media/dvb-core/dvbdev.c b/drivers/media/dvb-core/dvbdev.c
index 675d877a67b2..4597af108f4d 100644
--- a/drivers/media/dvb-core/dvbdev.c
+++ b/drivers/media/dvb-core/dvbdev.c
@@ -332,6 +332,7 @@ static int dvb_create_media_entity(struct dvb_device *dvbdev,
GFP_KERNEL);
if (!dvbdev->pads) {
kfree(dvbdev->entity);
+ dvbdev->entity = NULL;
return -ENOMEM;
}
}
--
2.25.1
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because O_TMPFILE misses umask handling in the vfs, it will not honor
umask settings. Fix this by adding the missing umask handling.
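A minimal user-space sketch of the intended semantics, assuming a process
umask of 022 and a requested mode of 0666:

#include <stdio.h>

int main(void)
{
	unsigned int mode = 0666;	/* mode requested by the caller */
	unsigned int umask_bits = 0022;	/* assumed process umask */

	mode &= ~umask_bits;	/* mirrors: mode &= ~current_umask(); */
	printf("effective mode: %o\n", mode);	/* prints 644 */
	return 0;
}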
Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable(a)vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner(a)kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner(a)kernel.org>
Reviewed-by: Darrick J. Wong <djwong(a)kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy(a)fujitsu.com>
---
fs/namei.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/namei.c b/fs/namei.c
index 509657fdf4f5..73646e28fae0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3521,6 +3521,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
child = d_alloc(dentry, &slash_name);
if (unlikely(!child))
goto out_err;
+ if (!IS_POSIXACL(dir))
+ mode &= ~current_umask();
error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
if (error)
goto out_err;
--
2.27.0
[Apologies for the original HTML email]
The commit "Restrict usage of GPIO chip irq members before
initialization" breaks suspend on a Dell Inspiron 5515 laptop in a very
severe way. Suspending with this commit present causes the machine to
lock up hard. The only way to recover is to disconnect mains power,
open up the case, disconnect the battery, and hold down the power
button. Bisecting pointed to 2c1fa3614795e2b24da1ba95de0b27b8f6ea4537
in 5.16.20. Testing with the source commit,
5467801f1fcbdc46bc7298a84dbf3ca1ff2a7320 confirmed that it was the one
that introduced the problem. Unfortunately, this commit was backported
to multiple stable kernels: 5.17.3, 5.16.20, 5.15.34, and 5.10.111.
I have not yet done any debugging to determine exactly why this commit
causes things to break, but am happy to try out any fixes over the next
couple of days until I put my laptop back together properly.
Regards,
Daniel Harding
Linus,
This patch is being sent directly to you because there has been
a regression in 5.18 that I identified and sent a fix for. The fix has
been reviewed/tested/acked for nearly a week, but the current subsystem
maintainer (Bartosz) hasn't picked it up to send to you.
It's a severe problem; anyone who hits it:
1) Power button doesn't work anymore
2) Can't resume their laptop from S3 or s2idle
Because the original patch was cc stable@, it landed in stable releases
and has been breaking people left and right as distros track the stable
channels. The patch is well tested. Would you please consider picking
this up directly to fix that regression?
Thanks,
Mario Limonciello (1):
gpio: Request interrupts after IRQ is initialized
drivers/gpio/gpiolib.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--
2.34.1
Supply an additional check so that the kmap reference count cannot be
incremented past INT_MAX, where it would overflow and lead to unexpected
results.
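The guard, reduced to a minimal user-space sketch (simplified names, not
the ion code itself):

#include <limits.h>
#include <stdio.h>

/* Stand-in for ion_buffer_kmap_get(): refuse to bump a counter that is
 * already saturated, since the increment would wrap to negative. */
static int kmap_get(int *kmap_cnt)
{
	if (*kmap_cnt == INT_MAX)
		return -1;	/* mirrors returning ERR_PTR(-EOVERFLOW) */
	(*kmap_cnt)++;
	return 0;
}

int main(void)
{
	int cnt = INT_MAX - 1;

	printf("%d\n", kmap_get(&cnt));	/* 0: increments to INT_MAX */
	printf("%d\n", kmap_get(&cnt));	/* -1: refuses to overflow */
	return 0;
}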
Fixes: b892bf75b2034 ("ion: Switch ion to use dma-buf")
Suggested-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Signed-off-by: Lee Jones <lee.jones(a)linaro.org>
---
This is a forward-port from linux-4.4.y and linux-4.9.y.
It has never been upstream.
Please apply to v4.14 through v5.10.
drivers/staging/android/ion/ion.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
index e1fe03ceb7f13..e6d4a3ee6cda5 100644
--- a/drivers/staging/android/ion/ion.c
+++ b/drivers/staging/android/ion/ion.c
@@ -114,6 +114,9 @@ static void *ion_buffer_kmap_get(struct ion_buffer *buffer)
void *vaddr;
if (buffer->kmap_cnt) {
+ if (buffer->kmap_cnt == INT_MAX)
+ return ERR_PTR(-EOVERFLOW);
+
buffer->kmap_cnt++;
return buffer->vaddr;
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c0713540f6d55c53dca65baaead55a5a8b20552d Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Sun, 17 Apr 2022 10:10:34 +0100
Subject: [PATCH] io_uring: fix leaks on IOPOLL and CQE_SKIP
If all completed requests in io_do_iopoll() were marked with
REQ_F_CQE_SKIP, we'll not only skip CQE posting but also skip
io_free_batch_list(), leaking memory and resources.
Move the @nr_events increment before the REQ_F_CQE_SKIP check. We'll
potentially return a value greater than the real one, but iopolling will
deal with it and userspace will re-iopoll if needed. In any case, I don't
think there are many use cases for REQ_F_CQE_SKIP + IOPOLL.
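A minimal user-space sketch of the counting change (toy loop, not the
io_uring code; all three requests are assumed to be flagged CQE_SKIP):

#include <stdio.h>

int main(void)
{
	int skip[] = { 1, 1, 1 };	/* every request flagged CQE_SKIP */
	int nr_events = 0;

	for (int i = 0; i < 3; i++) {
		nr_events++;	/* counted even when the CQE is skipped */
		if (skip[i])
			continue;
		/* __io_fill_cqe_req(...) would run here */
	}

	/* Before the fix nr_events stayed 0 here, so the later
	 * io_free_batch_list() call was skipped and requests leaked. */
	printf("nr_events = %d\n", nr_events);
	return 0;
}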
Fixes: 83a13a4181b0e ("io_uring: tweak iopoll CQE_SKIP event counting")
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/5072fc8693fbfd595f89e5d4305bfcfd5d2f0a64.16501866…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 24409dd07239..7625b29153b9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2797,11 +2797,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
/* order with io_complete_rw_iopoll(), e.g. ->result updates */
if (!smp_load_acquire(&req->iopoll_completed))
break;
+ nr_events++;
if (unlikely(req->flags & REQ_F_CQE_SKIP))
continue;
-
__io_fill_cqe_req(req, req->result, io_put_kbuf(req, 0));
- nr_events++;
}
if (unlikely(!nr_events))
In create_var_ref(), init_var_ref() is called to initialize the fields
of variable ref_field, which is allocated in the previous function call
to create_hist_field(). Function init_var_ref() allocates the
corresponding fields such as ref_field->system, but frees these fields
when the function encounters an error. The caller later calls
destroy_hist_field() to conduct error handling, which frees the fields
and the variable itself. This results in a double free of the fields
which were already freed in the previous function.
Fix this by storing NULL to the corresponding fields when they are freed
in init_var_ref().
Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
CC: stable(a)vger.kernel.org
Signed-off-by: Keita Suzuki <keitasuzuki.park(a)sslab.ics.keio.ac.jp>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
---
kernel/trace/trace_events_hist.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 44db5ba9cabb..a0e41906d9ce 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -2093,8 +2093,11 @@ static int init_var_ref(struct hist_field *ref_field,
return err;
free:
kfree(ref_field->system);
+ ref_field->system = NULL;
kfree(ref_field->event_name);
+ ref_field->event_name = NULL;
kfree(ref_field->name);
+ ref_field->name = NULL;
goto out;
}
--
2.25.1
The imx412/imx577 sensor has a reset line that is active low not active
high. Currently the logic for this is inverted.
The right way to define the reset line is to declare it active low in the
DTS and invert the logic currently contained in the driver.
The DTS should represent what the hardware does, i.e. reset is active low.
So:
+ reset-gpios = <&tlmm 78 GPIO_ACTIVE_LOW>;
not:
- reset-gpios = <&tlmm 78 GPIO_ACTIVE_HIGH>;
I was a bit hesitant about changing this logic since I thought it might
negatively impact @intel.com users. Googling a bit, though, I believe this
sensor is used on "Keem Bay", which is clearly a DTS-based system and is
not upstream yet.
Fixes: 9214e86c0cc1 ("media: i2c: Add imx412 camera sensor driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
drivers/media/i2c/imx412.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/media/i2c/imx412.c b/drivers/media/i2c/imx412.c
index be3f6ea55559..e6be6b4250f5 100644
--- a/drivers/media/i2c/imx412.c
+++ b/drivers/media/i2c/imx412.c
@@ -1011,7 +1011,7 @@ static int imx412_power_on(struct device *dev)
struct imx412 *imx412 = to_imx412(sd);
int ret;
- gpiod_set_value_cansleep(imx412->reset_gpio, 1);
+ gpiod_set_value_cansleep(imx412->reset_gpio, 0);
ret = clk_prepare_enable(imx412->inclk);
if (ret) {
@@ -1024,7 +1024,7 @@ static int imx412_power_on(struct device *dev)
return 0;
error_reset:
- gpiod_set_value_cansleep(imx412->reset_gpio, 0);
+ gpiod_set_value_cansleep(imx412->reset_gpio, 1);
return ret;
}
@@ -1040,7 +1040,7 @@ static int imx412_power_off(struct device *dev)
struct v4l2_subdev *sd = dev_get_drvdata(dev);
struct imx412 *imx412 = to_imx412(sd);
- gpiod_set_value_cansleep(imx412->reset_gpio, 0);
+ gpiod_set_value_cansleep(imx412->reset_gpio, 1);
clk_disable_unprepare(imx412->inclk);
--
2.35.1
[ Upstream commit d73497081710c876c3c61444445512989e102152 ]
The first attempt to fix the 'impossible' WARN_ON_ONCE(1) in
isotp_tx_timer_handler() focused on the identical CAN IDs created by
the syzbot reproducer and led to upstream fix/commit 3ea566422cbd
("can: isotp: sanitize CAN ID checks in isotp_bind()"). But this did
not catch the root cause of the wrong tx.state in the tx_timer handler.
In the isotp 'first frame' case a timeout monitoring needs to be started
before the 'first frame' is sent. But when this send fails, the timeout
monitoring for this specific frame has to be disabled too.
Otherwise the tx_timer fires with the 'warn me' tx.state of ISOTP_IDLE.
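The arm/cancel pairing, reduced to a minimal user-space sketch with a
boolean standing in for the hrtimer (illustrative only, not the isotp
code):

#include <stdbool.h>
#include <stdio.h>

struct tx { bool timer_armed; };

static int can_send(void)
{
	return -1;	/* pretend the send fails */
}

static int send_first_frame(struct tx *tx)
{
	tx->timer_armed = true;	/* start the FC timeout before sending */
	if (can_send() < 0) {
		/* no transmission -> no timeout monitoring */
		if (tx->timer_armed)
			tx->timer_armed = false;	/* mirrors hrtimer_cancel() */
		return -1;
	}
	return 0;
}

int main(void)
{
	struct tx tx = { false };

	send_first_frame(&tx);
	printf("timer still armed after failed send: %d\n", tx.timer_armed);
	return 0;
}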
Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol")
Link: https://lore.kernel.org/all/20220405175112.2682-1-socketcan@hartkopp.net
Reported-by: syzbot+2339c27f5c66c652843e(a)syzkaller.appspotmail.com
Signed-off-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
net/can/isotp.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/net/can/isotp.c b/net/can/isotp.c
index 9a4a9c5a9f24..c515bbd46c67 100644
--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -862,10 +862,11 @@ static int isotp_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
struct sk_buff *skb;
struct net_device *dev;
struct canfd_frame *cf;
int ae = (so->opt.flags & CAN_ISOTP_EXTEND_ADDR) ? 1 : 0;
int wait_tx_done = (so->opt.flags & CAN_ISOTP_WAIT_TX_DONE) ? 1 : 0;
+ s64 hrtimer_sec = 0;
int off;
int err;
if (!so->bound)
return -EADDRNOTAVAIL;
@@ -960,11 +961,13 @@ static int isotp_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
/* send first frame and wait for FC */
isotp_create_fframe(cf, so, ae);
/* start timeout for FC */
- hrtimer_start(&so->txtimer, ktime_set(1, 0), HRTIMER_MODE_REL_SOFT);
+ hrtimer_sec = 1;
+ hrtimer_start(&so->txtimer, ktime_set(hrtimer_sec, 0),
+ HRTIMER_MODE_REL_SOFT);
}
/* send the first or only CAN frame */
cf->flags = so->ll.tx_flags;
@@ -973,10 +976,15 @@ static int isotp_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
err = can_send(skb, 1);
dev_put(dev);
if (err) {
pr_notice_once("can-isotp: %s: can_send_ret %d\n",
__func__, err);
+
+ /* no transmission -> no timeout monitoring */
+ if (hrtimer_sec)
+ hrtimer_cancel(&so->txtimer);
+
goto err_out_drop;
}
if (wait_tx_done) {
/* wait for complete transmission of current pdu */
--
2.30.2
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 423ecfea77dda83823c71b0fad1c2ddb2af1e5fc Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Wed, 20 Apr 2022 01:37:31 +0000
Subject: [PATCH] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to
fix a race
Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
in-kernel local APIC and APICv enabled at the module level. Consuming
kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
before the vCPU is fully onlined, i.e. it won't get the request made to
"all" vCPUs. If APICv is globally inhibited between setting apicv_active
and onlining the vCPU, the vCPU will end up running with APICv enabled
and trigger KVM's sanity check.
Mark APICv as active during vCPU creation if APICv is enabled at the
module level, both to be optimistic about its final state, e.g. to avoid
additional VMWRITEs on VMX, and because there are likely bugs lurking
since KVM checks apicv_active in multiple vCPU creation paths. While
keeping the current behavior of consuming kvm_apicv_activated() is
arguably safer from a regression perspective, force apicv_active so that
vCPU creation runs with deterministic state and so that if there are bugs,
they are found sooner rather than later, i.e. not when some crazy race condition
is hit.
WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
Modules linked in:
CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
Call Trace:
<TASK>
vcpu_run arch/x86/kvm/x86.c:10039 [inline]
kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a
call to KVM_SET_GUEST_DEBUG.
r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
ioctl$KVM_RUN(r2, 0xae80, 0x0)
Reported-by: Gaoning Pan <pgn(a)zju.edu.cn>
Reported-by: Yongkang Jia <kangel(a)zju.edu.cn>
Fixes: 8df14af42f00 ("kvm: x86: Add support for dynamic APICv activation")
Cc: stable(a)vger.kernel.org
Cc: Maxim Levitsky <mlevitsk(a)redhat.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Reviewed-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20220420013732.3308816-4-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d54d4a67b226..9c02217c1e47 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11189,8 +11189,21 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
if (r < 0)
goto fail_mmu_destroy;
- if (kvm_apicv_activated(vcpu->kvm))
+
+ /*
+ * Defer evaluating inhibits until the vCPU is first run, as
+ * this vCPU will not get notified of any changes until this
+ * vCPU is visible to other vCPUs (marked online and added to
+ * the set of vCPUs). Opportunistically mark APICv active as
+ * VMX in particularly is highly unlikely to have inhibits.
+ * Ignore the current per-VM APICv state so that vCPU creation
+ * is guaranteed to run with a deterministic value, the request
+ * will ensure the vCPU gets the correct state before VM-Entry.
+ */
+ if (enable_apicv) {
vcpu->arch.apicv_active = true;
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+ }
} else
static_branch_inc(&kvm_has_noapic_vcpu);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4bbef7e8eb8c2c7dabf57d97decfd2b4f48aaf02 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Thu, 21 Apr 2022 03:14:05 +0000
Subject: [PATCH] KVM: SVM: Simplify and harden helper to flush SEV guest
page(s)
Rework sev_flush_guest_memory() to explicitly handle only a single page,
and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page
flushing is currently used only to flush the VMSA, and in its current
form, the helper is completely broken with respect to flushing actual
guest memory, i.e. won't work correctly for an arbitrary memory range.
VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
walks, i.e. will fault if the address is not present in the host page
tables or does not have the correct permissions. Current AMD CPUs also
do not honor SMAP overrides (undocumented in kernel versions of the APM),
so passing in a userspace address is completely out of the question. In
other words, KVM would need to manually walk the host page tables to get
the pfn, ensure the pfn is stable, and then use the direct map to invoke
VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is
particularly evil/clever and backs the guest with Secret Memory (which
unmaps memory from the direct map).
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Fixes: add5e2f04541 ("KVM: SVM: Add support for the SEV-ES VMSA")
Reported-by: Mingwei Zhang <mizhang(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Mingwei Zhang <mizhang(a)google.com>
Message-Id: <20220421031407.2516575-2-mizhang(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 537aaddc852f..b77b3913e2d9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2226,9 +2226,18 @@ int sev_cpu_init(struct svm_cpu_data *sd)
* Pages used by hardware to hold guest encrypted state must be flushed before
* returning them to the system.
*/
-static void sev_flush_guest_memory(struct vcpu_svm *svm, void *va,
- unsigned long len)
+static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
{
+ int asid = to_kvm_svm(vcpu->kvm)->sev_info.asid;
+
+ /*
+ * Note! The address must be a kernel address, as regular page walk
+ * checks are performed by VM_PAGE_FLUSH, i.e. operating on a user
+ * address is non-deterministic and unsafe. This function deliberately
+ * takes a pointer to deter passing in a user address.
+ */
+ unsigned long addr = (unsigned long)va;
+
/*
* If hardware enforced cache coherency for encrypted mappings of the
* same physical page is supported, nothing to do.
@@ -2237,40 +2246,16 @@ static void sev_flush_guest_memory(struct vcpu_svm *svm, void *va,
return;
/*
- * If the VM Page Flush MSR is supported, use it to flush the page
- * (using the page virtual address and the guest ASID).
+ * VM Page Flush takes a host virtual address and a guest ASID. Fall
+ * back to WBINVD if this faults so as not to make any problems worse
+ * by leaving stale encrypted data in the cache.
*/
- if (boot_cpu_has(X86_FEATURE_VM_PAGE_FLUSH)) {
- struct kvm_sev_info *sev;
- unsigned long va_start;
- u64 start, stop;
+ if (WARN_ON_ONCE(wrmsrl_safe(MSR_AMD64_VM_PAGE_FLUSH, addr | asid)))
+ goto do_wbinvd;
- /* Align start and stop to page boundaries. */
- va_start = (unsigned long)va;
- start = (u64)va_start & PAGE_MASK;
- stop = PAGE_ALIGN((u64)va_start + len);
-
- if (start < stop) {
- sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
-
- while (start < stop) {
- wrmsrl(MSR_AMD64_VM_PAGE_FLUSH,
- start | sev->asid);
-
- start += PAGE_SIZE;
- }
+ return;
- return;
- }
-
- WARN(1, "Address overflow, using WBINVD\n");
- }
-
- /*
- * Hardware should always have one of the above features,
- * but if not, use WBINVD and issue a warning.
- */
- WARN_ONCE(1, "Using WBINVD to flush guest memory\n");
+do_wbinvd:
wbinvd_on_all_cpus();
}
@@ -2284,7 +2269,8 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
svm = to_svm(vcpu);
if (vcpu->arch.guest_state_protected)
- sev_flush_guest_memory(svm, svm->sev_es.vmsa, PAGE_SIZE);
+ sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);
+
__free_page(virt_to_page(svm->sev_es.vmsa));
if (svm->sev_es.ghcb_sa_free)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c69661e225cc484fbf44a0b99b56714a5241ae3 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Wed, 20 Apr 2022 01:37:30 +0000
Subject: [PATCH] KVM: nVMX: Defer APICv updates while L2 is active until L1 is
active
Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever update
vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
becomes uninhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.
In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).
IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.
Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.
Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form. But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs. If userspace restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.
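A minimal user-space sketch of the defer-and-replay pattern (simplified:
the real code makes a KVM_REQ_APICV_UPDATE request on nested VM-Exit
rather than calling the refresh directly):

#include <stdbool.h>
#include <stdio.h>

struct vcpu { bool in_guest_mode; bool update_vmcs01_apicv_status; };

static void refresh_apicv(struct vcpu *v)
{
	if (v->in_guest_mode) {
		v->update_vmcs01_apicv_status = true;	/* defer to VM-Exit */
		return;
	}
	printf("applying APICv update to vmcs01\n");
}

static void nested_vmexit(struct vcpu *v)
{
	v->in_guest_mode = false;	/* L1 regains control */
	if (v->update_vmcs01_apicv_status) {
		v->update_vmcs01_apicv_status = false;
		refresh_apicv(v);	/* replayed now that L1 is active */
	}
}

int main(void)
{
	struct vcpu v = { .in_guest_mode = true };

	refresh_apicv(&v);	/* arrives while L2 is active: deferred */
	nested_vmexit(&v);	/* prints the deferred update */
	return 0;
}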
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220420013732.3308816-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f18744f7ff82..856c87563883 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4618,6 +4618,11 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
}
+ if (vmx->nested.update_vmcs01_apicv_status) {
+ vmx->nested.update_vmcs01_apicv_status = false;
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+ }
+
if ((vm_exit_reason != -1) &&
(enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)))
vmx->nested.need_vmcs12_to_shadow_sync = true;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 04d170c4b61e..d58b763df855 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4174,6 +4174,11 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ if (is_guest_mode(vcpu)) {
+ vmx->nested.update_vmcs01_apicv_status = true;
+ return;
+ }
+
pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
if (cpu_has_secondary_exec_ctrls()) {
if (kvm_vcpu_apicv_active(vcpu))
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9c6bfcd84008..b98c7e96697a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -183,6 +183,7 @@ struct nested_vmx {
bool change_vmcs01_virtual_apic_mode;
bool reload_vmcs01_apic_access_page;
bool update_vmcs01_cpu_dirty_logging;
+ bool update_vmcs01_apicv_status;
/*
* Enlightened VMCS has been enabled. It does not mean that L1 has to
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c69661e225cc484fbf44a0b99b56714a5241ae3 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Wed, 20 Apr 2022 01:37:30 +0000
Subject: [PATCH] KVM: nVMX: Defer APICv updates while L2 is active until L1 is
active
Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever update
vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
becomes uninhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.
In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).
IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.
Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.
Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form. But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs. If userspace restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20220420013732.3308816-3-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f18744f7ff82..856c87563883 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4618,6 +4618,11 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
}
+ if (vmx->nested.update_vmcs01_apicv_status) {
+ vmx->nested.update_vmcs01_apicv_status = false;
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+ }
+
if ((vm_exit_reason != -1) &&
(enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)))
vmx->nested.need_vmcs12_to_shadow_sync = true;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 04d170c4b61e..d58b763df855 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4174,6 +4174,11 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ if (is_guest_mode(vcpu)) {
+ vmx->nested.update_vmcs01_apicv_status = true;
+ return;
+ }
+
pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
if (cpu_has_secondary_exec_ctrls()) {
if (kvm_vcpu_apicv_active(vcpu))
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9c6bfcd84008..b98c7e96697a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -183,6 +183,7 @@ struct nested_vmx {
bool change_vmcs01_virtual_apic_mode;
bool reload_vmcs01_apic_access_page;
bool update_vmcs01_cpu_dirty_logging;
+ bool update_vmcs01_apicv_status;
/*
* Enlightened VMCS has been enabled. It does not mean that L1 has to
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 75189d1de1b377e580ebd2d2c55914631eac9c64 Mon Sep 17 00:00:00 2001
From: Like Xu <likexu(a)tencent.com>
Date: Sat, 9 Apr 2022 09:52:26 +0800
Subject: [PATCH] KVM: x86/pmu: Update AMD PMC sample period to fix guest
NMI-watchdog
NMI-watchdog is one of the favorite features of kernel developers,
but it does not work in an AMD guest even with vPMU enabled and, worse,
the system misrepresents this capability via /proc.
This is a PMC emulation error. KVM does not pass the latest valid
value to perf_event in time when guest NMI-watchdog is running, thus
the perf_event corresponding to the watchdog counter will enter the
old state at some point after the first guest NMI injection, forcing
the hardware register PMC0 to be constantly written to 0x800000000001.
Meanwhile, the running counter should accurately reflect its new value
based on the latest coordinated pmc->counter (from vPMC's point of view)
rather than the value written directly by the guest.
Fixes: 168d918f2643 ("KVM: x86: Adjust counter sample period after a wrmsr")
Reported-by: Dongli Cao <caodongli(a)kingsoft.com>
Signed-off-by: Like Xu <likexu(a)tencent.com>
Reviewed-by: Yanan Wang <wangyanan55(a)huawei.com>
Tested-by: Yanan Wang <wangyanan55(a)huawei.com>
Reviewed-by: Jim Mattson <jmattson(a)google.com>
Message-Id: <20220409015226.38619-1-likexu(a)tencent.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 9e66fba1d6a3..22992b049d38 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -138,6 +138,15 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
return sample_period;
}
+static inline void pmc_update_sample_period(struct kvm_pmc *pmc)
+{
+ if (!pmc->perf_event || pmc->is_paused)
+ return;
+
+ perf_event_period(pmc->perf_event,
+ get_sample_period(pmc, pmc->counter));
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 24eb935b6f85..b14860863c39 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -257,6 +257,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
if (pmc) {
pmc->counter += data - pmc_read_counter(pmc);
+ pmc_update_sample_period(pmc);
return 0;
}
/* MSR_EVNTSELn */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index bc3f8512bb64..b82b6709d7a8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -431,15 +431,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
!(msr & MSR_PMC_FULL_WIDTH_BIT))
data = (s64)(s32)data;
pmc->counter += data - pmc_read_counter(pmc);
- if (pmc->perf_event && !pmc->is_paused)
- perf_event_period(pmc->perf_event,
- get_sample_period(pmc, data));
+ pmc_update_sample_period(pmc);
return 0;
} else if ((pmc = get_fixed_pmc(pmu, msr))) {
pmc->counter += data - pmc_read_counter(pmc);
- if (pmc->perf_event && !pmc->is_paused)
- perf_event_period(pmc->perf_event,
- get_sample_period(pmc, data));
+ pmc_update_sample_period(pmc);
return 0;
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
if (data == pmc->eventsel)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c8618d65007ba68d7891130642d73e89372101e8 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Sun, 27 Mar 2022 16:10:02 +0800
Subject: [PATCH] ASoC: rt5682: fix an incorrect NULL check on list iterator
The bug is here:
if (!dai) {
The list iterator value 'dai' will *always* be set and non-NULL
by for_each_component_dais(), so it is incorrect to assume that
the iterator value will be NULL if the list is empty or no element
is found (in fact, it will be a bogus pointer to an invalid struct
object containing the HEAD). The code therefore bypasses the check
'if (!dai) {' (never calls dev_err() and never returns -ENODEV)
and leads to an invalid memory access later when calling
'rt5682_set_bclk1_ratio(dai, factor);'.
To fix the bug, just return rt5682_set_bclk1_ratio(dai, factor);
when the 'dai' is found; otherwise call dev_err() and return -ENODEV.
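A minimal user-space sketch of the pitfall, hand-expanding a kernel-style
list iterator over an empty list (simplified types, not the rt5682 code):

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head { struct list_head *next; };
struct dai { int id; struct list_head list; };

int main(void)
{
	struct list_head head = { &head };	/* empty circular list */
	struct dai *dai;

	/* Hand-expanded for_each-style loop: when the list is exhausted,
	 * the cursor is computed from the head itself... */
	for (dai = container_of(head.next, struct dai, list);
	     &dai->list != &head;
	     dai = container_of(dai->list.next, struct dai, list))
		if (dai->id == 1)
			break;

	/* ...so it is a bogus non-NULL pointer and 'if (!dai)' never fires. */
	printf("dai = %p (bogus, never NULL)\n", (void *)dai);
	return 0;
}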
Cc: stable(a)vger.kernel.org
Fixes: ebbfabc16d23d ("ASoC: rt5682: Add CCF usage for providing I2S clks")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Link: https://lore.kernel.org/r/20220327081002.12684-1-xiam0nd.tong@gmail.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/codecs/rt5682.c b/sound/soc/codecs/rt5682.c
index be68d573a490..c9ff9c89adf7 100644
--- a/sound/soc/codecs/rt5682.c
+++ b/sound/soc/codecs/rt5682.c
@@ -2822,14 +2822,11 @@ static int rt5682_bclk_set_rate(struct clk_hw *hw, unsigned long rate,
for_each_component_dais(component, dai)
if (dai->id == RT5682_AIF1)
- break;
- if (!dai) {
- dev_err(rt5682->i2c_dev, "dai %d not found in component\n",
- RT5682_AIF1);
- return -ENODEV;
- }
+ return rt5682_set_bclk1_ratio(dai, factor);
- return rt5682_set_bclk1_ratio(dai, factor);
+ dev_err(rt5682->i2c_dev, "dai %d not found in component\n",
+ RT5682_AIF1);
+ return -ENODEV;
}
static const struct clk_ops rt5682_dai_clk_ops[RT5682_DAI_NUM_CLKS] = {
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c8618d65007ba68d7891130642d73e89372101e8 Mon Sep 17 00:00:00 2001
From: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Date: Sun, 27 Mar 2022 16:10:02 +0800
Subject: [PATCH] ASoC: rt5682: fix an incorrect NULL check on list iterator
The bug is here:
if (!dai) {
The list iterator value 'dai' will *always* be set and non-NULL
by for_each_component_dais(), so it is incorrect to assume that
the iterator value will be NULL if the list is empty or no element
is found (in fact, it will be a bogus pointer to an invalid struct
object containing the HEAD). The code therefore bypasses the check
'if (!dai) {' (never calls dev_err() and never returns -ENODEV)
and leads to an invalid memory access later when calling
'rt5682_set_bclk1_ratio(dai, factor);'.
To fix the bug, just return rt5682_set_bclk1_ratio(dai, factor);
when the 'dai' is found; otherwise call dev_err() and return -ENODEV.
Cc: stable(a)vger.kernel.org
Fixes: ebbfabc16d23d ("ASoC: rt5682: Add CCF usage for providing I2S clks")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
Link: https://lore.kernel.org/r/20220327081002.12684-1-xiam0nd.tong@gmail.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/codecs/rt5682.c b/sound/soc/codecs/rt5682.c
index be68d573a490..c9ff9c89adf7 100644
--- a/sound/soc/codecs/rt5682.c
+++ b/sound/soc/codecs/rt5682.c
@@ -2822,14 +2822,11 @@ static int rt5682_bclk_set_rate(struct clk_hw *hw, unsigned long rate,
for_each_component_dais(component, dai)
if (dai->id == RT5682_AIF1)
- break;
- if (!dai) {
- dev_err(rt5682->i2c_dev, "dai %d not found in component\n",
- RT5682_AIF1);
- return -ENODEV;
- }
+ return rt5682_set_bclk1_ratio(dai, factor);
- return rt5682_set_bclk1_ratio(dai, factor);
+ dev_err(rt5682->i2c_dev, "dai %d not found in component\n",
+ RT5682_AIF1);
+ return -ENODEV;
}
static const struct clk_ops rt5682_dai_clk_ops[RT5682_DAI_NUM_CLKS] = {
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 839769c35477d4acc2369e45000ca7b0b6af39a7 Mon Sep 17 00:00:00 2001
From: Max Filippov <jcmvbkbc(a)gmail.com>
Date: Wed, 13 Apr 2022 22:44:36 -0700
Subject: [PATCH] xtensa: fix a7 clobbering in coprocessor context load/store
Fast coprocessor exception handler saves a3..a6, but coprocessor context
load/store code uses a4..a7 as temporaries, potentially clobbering a7.
'Potentially' because coprocessor state load/store macros may not use
all four temporary registers (and neither FPU nor HiFi macros do).
Use a3..a6 as intended.
Cc: stable(a)vger.kernel.org
Fixes: c658eac628aa ("[XTENSA] Add support for configurable registers and coprocessors")
Signed-off-by: Max Filippov <jcmvbkbc(a)gmail.com>
diff --git a/arch/xtensa/kernel/coprocessor.S b/arch/xtensa/kernel/coprocessor.S
index 45cc0ae0af6f..c7b9f12896f2 100644
--- a/arch/xtensa/kernel/coprocessor.S
+++ b/arch/xtensa/kernel/coprocessor.S
@@ -29,7 +29,7 @@
.if XTENSA_HAVE_COPROCESSOR(x); \
.align 4; \
.Lsave_cp_regs_cp##x: \
- xchal_cp##x##_store a2 a4 a5 a6 a7; \
+ xchal_cp##x##_store a2 a3 a4 a5 a6; \
jx a0; \
.endif
@@ -46,7 +46,7 @@
.if XTENSA_HAVE_COPROCESSOR(x); \
.align 4; \
.Lload_cp_regs_cp##x: \
- xchal_cp##x##_load a2 a4 a5 a6 a7; \
+ xchal_cp##x##_load a2 a3 a4 a5 a6; \
jx a0; \
.endif
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 839769c35477d4acc2369e45000ca7b0b6af39a7 Mon Sep 17 00:00:00 2001
From: Max Filippov <jcmvbkbc(a)gmail.com>
Date: Wed, 13 Apr 2022 22:44:36 -0700
Subject: [PATCH] xtensa: fix a7 clobbering in coprocessor context load/store
Fast coprocessor exception handler saves a3..a6, but coprocessor context
load/store code uses a4..a7 as temporaries, potentially clobbering a7.
'Potentially' because coprocessor state load/store macros may not use
all four temporary registers (and neither FPU nor HiFi macros do).
Use a3..a6 as intended.
Cc: stable(a)vger.kernel.org
Fixes: c658eac628aa ("[XTENSA] Add support for configurable registers and coprocessors")
Signed-off-by: Max Filippov <jcmvbkbc(a)gmail.com>
diff --git a/arch/xtensa/kernel/coprocessor.S b/arch/xtensa/kernel/coprocessor.S
index 45cc0ae0af6f..c7b9f12896f2 100644
--- a/arch/xtensa/kernel/coprocessor.S
+++ b/arch/xtensa/kernel/coprocessor.S
@@ -29,7 +29,7 @@
.if XTENSA_HAVE_COPROCESSOR(x); \
.align 4; \
.Lsave_cp_regs_cp##x: \
- xchal_cp##x##_store a2 a4 a5 a6 a7; \
+ xchal_cp##x##_store a2 a3 a4 a5 a6; \
jx a0; \
.endif
@@ -46,7 +46,7 @@
.if XTENSA_HAVE_COPROCESSOR(x); \
.align 4; \
.Lload_cp_regs_cp##x: \
- xchal_cp##x##_load a2 a4 a5 a6 a7; \
+ xchal_cp##x##_load a2 a3 a4 a5 a6; \
jx a0; \
.endif
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 839769c35477d4acc2369e45000ca7b0b6af39a7 Mon Sep 17 00:00:00 2001
From: Max Filippov <jcmvbkbc(a)gmail.com>
Date: Wed, 13 Apr 2022 22:44:36 -0700
Subject: [PATCH] xtensa: fix a7 clobbering in coprocessor context load/store
Fast coprocessor exception handler saves a3..a6, but coprocessor context
load/store code uses a4..a7 as temporaries, potentially clobbering a7.
'Potentially' because coprocessor state load/store macros may not use
all four temporary registers (and neither FPU nor HiFi macros do).
Use a3..a6 as intended.
Cc: stable(a)vger.kernel.org
Fixes: c658eac628aa ("[XTENSA] Add support for configurable registers and coprocessors")
Signed-off-by: Max Filippov <jcmvbkbc(a)gmail.com>
diff --git a/arch/xtensa/kernel/coprocessor.S b/arch/xtensa/kernel/coprocessor.S
index 45cc0ae0af6f..c7b9f12896f2 100644
--- a/arch/xtensa/kernel/coprocessor.S
+++ b/arch/xtensa/kernel/coprocessor.S
@@ -29,7 +29,7 @@
.if XTENSA_HAVE_COPROCESSOR(x); \
.align 4; \
.Lsave_cp_regs_cp##x: \
- xchal_cp##x##_store a2 a4 a5 a6 a7; \
+ xchal_cp##x##_store a2 a3 a4 a5 a6; \
jx a0; \
.endif
@@ -46,7 +46,7 @@
.if XTENSA_HAVE_COPROCESSOR(x); \
.align 4; \
.Lload_cp_regs_cp##x: \
- xchal_cp##x##_load a2 a4 a5 a6 a7; \
+ xchal_cp##x##_load a2 a3 a4 a5 a6; \
jx a0; \
.endif
On 4/22/22 08:47, Damien Le Moal wrote:
> On 4/21/22 10:39, Zheyu Ma wrote:
>> Before detecting the cable type on the dma bar, the driver should check
>> whether the 'bmdma_addr' is zero, which means the adapter does not
>> support DMA; otherwise we will get the following error:
>>
>> [ 5.146634] Bad IO access at port 0x1 (return inb(port))
>> [ 5.147206] WARNING: CPU: 2 PID: 303 at lib/iomap.c:44 ioread8+0x4a/0x60
>> [ 5.150856] RIP: 0010:ioread8+0x4a/0x60
>> [ 5.160238] Call Trace:
>> [ 5.160470] <TASK>
>> [ 5.160674] marvell_cable_detect+0x6e/0xc0 [pata_marvell]
>> [ 5.161728] ata_eh_recover+0x3520/0x6cc0
>> [ 5.168075] ata_do_eh+0x49/0x3c0
>>
>> Signed-off-by: Zheyu Ma <zheyuma97(a)gmail.com>
>> ---Changes in v2:
>> - Delete the useless 'else'
>
> Note for future contributions: The change log should be placed *after*
> the "---" that comes before the "diff" line below. Otherwise, the change
> log pollutes the commit message.
>
> I fixed that and applied to for-5.18-fixes. Thanks.
I completely overlooked that this needs a CC stable...
Greg,
Could you please pick up this commit for stable?
In Linus' tree (rc4), it is:
aafa9f958342 ("ata: pata_marvell: Check the 'bmdma_addr' beforing reading")
Thanks!
>
>> ---
>> drivers/ata/pata_marvell.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/ata/pata_marvell.c b/drivers/ata/pata_marvell.c
>> index 0c5a51970fbf..014ccb0f45dc 100644
>> --- a/drivers/ata/pata_marvell.c
>> +++ b/drivers/ata/pata_marvell.c
>> @@ -77,6 +77,8 @@ static int marvell_cable_detect(struct ata_port *ap)
>> switch(ap->port_no)
>> {
>> case 0:
>> + if (!ap->ioaddr.bmdma_addr)
>> + return ATA_CBL_PATA_UNK;
>> if (ioread8(ap->ioaddr.bmdma_addr + 1) & 1)
>> return ATA_CBL_PATA40;
>> return ATA_CBL_PATA80;
>
>
--
Damien Le Moal
Western Digital Research
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e4a38402c36e42df28eb1a5394be87e6571fb48a Mon Sep 17 00:00:00 2001
From: Nico Pache <npache(a)redhat.com>
Date: Thu, 21 Apr 2022 16:36:01 -0700
Subject: [PATCH] oom_kill.c: futex: delay the OOM reaper to allow time for
proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper. This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.
A race can occur between exit_mm and the oom reaper that allows the oom
reaper to free the memory of the futex robust list before the exit path
has handled the futex death:
CPU1                                   CPU2
-------------------------------------------------------------------
page_fault
do_exit "signal"
wake_oom_reaper
                                       oom_reaper
                                       oom_reap_task_mm (invalidates mm)
exit_mm
exit_mm_release
futex_exit_release
futex_cleanup
exit_robust_list
get_user (EFAULT- can't access memory)
If the get_user EFAULT's, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.
Delay the OOM reaper, allowing more time for the exit path to perform
the futex cleanup.
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Based on a patch by Michal Hocko.
Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz(a)redhat.com>
Signed-off-by: Nico Pache <npache(a)redhat.com>
Co-developed-by: Joel Savitz <jsavitz(a)redhat.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Waiman Long <longman(a)redhat.com>
Cc: Herton R. Krzesinski <herton(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Joel Savitz <jsavitz(a)redhat.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d5e3c00b74e1..a8911b1f35aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1443,6 +1443,7 @@ struct task_struct {
int pagefault_disabled;
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
+ struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..49d7df39b02d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,7 +632,7 @@ done:
*/
set_bit(MMF_OOM_SKIP, &mm->flags);
- /* Drop a reference taken by wake_oom_reaper */
+ /* Drop a reference taken by queue_oom_reaper */
put_task_struct(tsk);
}
@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
struct task_struct *tsk = NULL;
wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
- spin_lock(&oom_reaper_lock);
+ spin_lock_irq(&oom_reaper_lock);
if (oom_reaper_list != NULL) {
tsk = oom_reaper_list;
oom_reaper_list = tsk->oom_reaper_list;
}
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irq(&oom_reaper_lock);
if (tsk)
oom_reap_task(tsk);
@@ -658,22 +658,48 @@ static int oom_reaper(void *unused)
return 0;
}
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct timer_list *timer)
{
- /* mm is already queued? */
- if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
- return;
+ struct task_struct *tsk = container_of(timer, struct task_struct,
+ oom_reaper_timer);
+ struct mm_struct *mm = tsk->signal->oom_mm;
+ unsigned long flags;
- get_task_struct(tsk);
+ /* The victim managed to terminate on its own - see exit_mmap */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ put_task_struct(tsk);
+ return;
+ }
- spin_lock(&oom_reaper_lock);
+ spin_lock_irqsave(&oom_reaper_lock, flags);
tsk->oom_reaper_list = oom_reaper_list;
oom_reaper_list = tsk;
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irqrestore(&oom_reaper_lock, flags);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
}
+/*
+ * Give the OOM victim time to exit naturally before invoking the oom_reaping.
+ * The timers timeout is arbitrary... the longer it is, the longer the worst
+ * case scenario for the OOM can take. If it is too small, the oom_reaper can
+ * get in the way and release resources needed by the process exit path.
+ * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
+ * before the exit path is able to wake the futex waiters.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_reaper(struct task_struct *tsk)
+{
+ /* mm is already queued? */
+ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ return;
+
+ get_task_struct(tsk);
+ timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
+ tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+ add_timer(&tsk->oom_reaper_timer);
+}
+
static int __init oom_init(void)
{
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +707,7 @@ static int __init oom_init(void)
}
subsys_initcall(oom_init)
#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk)
{
}
#endif /* CONFIG_MMU */
@@ -932,7 +958,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
rcu_read_unlock();
if (can_oom_reap)
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
mmdrop(mm);
put_task_struct(victim);
@@ -968,7 +994,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
task_lock(victim);
if (task_will_free_mem(victim)) {
mark_oom_victim(victim);
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
task_unlock(victim);
put_task_struct(victim);
return;
@@ -1067,7 +1093,7 @@ bool out_of_memory(struct oom_control *oc)
*/
if (task_will_free_mem(current)) {
mark_oom_victim(current);
- wake_oom_reaper(current);
+ queue_oom_reaper(current);
return true;
}
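The wake_oom_reaper() timer callback in the diff above recovers the
task_struct from its embedded timer_list with container_of(). A
self-contained userspace rendition of that pattern (the struct names here
are stand-ins, not kernel code):

#include <stddef.h>
#include <stdio.h>

/* userspace rendition of the kernel's container_of() */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct timer { long expires; };

struct task {
        int pid;
        struct timer oom_reaper_timer;  /* embedded member */
};

static void wake(struct timer *t)
{
        /* recover the enclosing struct from the member pointer */
        struct task *tsk = container_of(t, struct task, oom_reaper_timer);

        printf("reaping pid %d\n", tsk->pid);
}

int main(void)
{
        struct task tsk = { .pid = 42, .oom_reaper_timer = { 0 } };

        wake(&tsk.oom_reaper_timer);
        return 0;
}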
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5f24d5a579d1eace79d505b148808a850b417d4c Mon Sep 17 00:00:00 2001
From: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Date: Thu, 21 Apr 2022 16:35:46 -0700
Subject: [PATCH] mm, hugetlb: allow for "high" userspace addresses
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out
of mm/mmap.c into include/linux/sched/mm.h.
If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
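The default-plus-override idiom works as sketched below; the numeric values
and the arm64-style override are illustrative assumptions for this sketch,
not quotes from kernel headers:

#include <stdio.h>

#define TASK_SIZE               (1UL << 47)     /* stand-in values */
#define DEFAULT_MAP_WINDOW      (1UL << 40)

/* "arch header": comment this out and the generic default applies */
#define arch_get_mmap_end(addr) \
        (((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)

/* "generic header": what the patch adds to <linux/sched/mm.h> */
#ifndef arch_get_mmap_end
#define arch_get_mmap_end(addr) (TASK_SIZE)
#endif

int main(void)
{
        printf("low hint  -> limit %#lx\n", arch_get_mmap_end(4096UL));
        printf("high hint -> limit %#lx\n", arch_get_mmap_end(1UL << 45));
        return 0;
}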
Catalin (ARM64) said
"We should have fixed hugetlb_get_unmapped_area() as well when we added
support for 52-bit VA. The reason for commit f6795053dac8 was to
prevent normal mmap() from returning addresses above 48-bit by default
as some user-space had hard assumptions about this.
It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
but I doubt anyone would notice. It's more likely that the current
behaviour would cause issues, so I'd rather have them consistent.
Basically when arm64 gained support for 52-bit addresses we did not
want user-space calling mmap() to suddenly get such high addresses,
otherwise we could have inadvertently broken some programs (similar
behaviour to x86 here). Hence we added commit f6795053dac8. But we
missed hugetlbfs which could still get such high mmap() addresses. So
in theory that's a potential regression that should have been addressed
at the same time as commit f6795053dac8 (and before arm64 enabled
52-bit addresses)"
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.16500337…
Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Steve Capper <steve.capper(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: <stable(a)vger.kernel.org> [5.0.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 99c7477cee5c..dd3a088db11d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -206,7 +206,7 @@ hugetlb_get_unmapped_area_bottomup(struct file *file, unsigned long addr,
info.flags = 0;
info.length = len;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
@@ -222,7 +222,7 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
- info.high_limit = current->mm->mmap_base;
+ info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -237,7 +237,7 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
addr = vm_unmapped_area(&info);
}
@@ -251,6 +251,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct hstate *h = hstate_file(file);
+ const unsigned long mmap_end = arch_get_mmap_end(addr);
if (len & ~huge_page_mask(h))
return -EINVAL;
@@ -266,7 +267,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
if (addr) {
addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
+ if (mmap_end - len >= addr &&
(!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index a80356e9dc69..1ad1f4bfa025 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -136,6 +136,14 @@ static inline void mm_update_next_owner(struct mm_struct *mm)
#endif /* CONFIG_MEMCG */
#ifdef CONFIG_MMU
+#ifndef arch_get_mmap_end
+#define arch_get_mmap_end(addr) (TASK_SIZE)
+#endif
+
+#ifndef arch_get_mmap_base
+#define arch_get_mmap_base(addr, base) (base)
+#endif
+
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
extern unsigned long
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..313b57d55a63 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2117,14 +2117,6 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
return addr;
}
-#ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr) (TASK_SIZE)
-#endif
-
-#ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
-#endif
-
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
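As a usage-level illustration (an assumption about the runtime environment,
not part of the patch), the hint mechanism amounts to passing mmap() an
address above the default window; with this fix a hugetlb mapping can
honour such a hint on arm64 the same way a normal mapping does. Success
depends on 52-bit VA support and reserved hugepages; elsewhere mmap()
simply returns a lower address or fails:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL * 1024 * 1024;         /* one 2MB hugepage */
        void *hint = (void *)(1UL << 50);       /* above the 48-bit window */
        void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");                 /* e.g. no hugepages reserved */
                return 1;
        }
        printf("hugetlb mapping at %p\n", p);   /* high address only post-fix */
        munmap(p, len);
        return 0;
}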
The patch below does not apply to the 5.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 405ce051236cc65b30bbfe490b28ce60ae6aed85 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 21 Apr 2022 16:35:33 -0700
Subject: [PATCH] mm/hwpoison: fix race between hugetlb free/demotion and
memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion which can cause the PageHWPoison flag to be set on the wrong
page. One simple result is that the wrong process can be killed; a more
serious one is that the actual error is left unhandled, so nothing prevents
later access to it, which might lead to worse outcomes such as consuming
corrupted data.
Think about the below race window:
    CPU 1                                   CPU 2
    memory_failure_hugetlb
    struct page *head = compound_head(p);
                                            hugetlb page might be freed to
                                            buddy, or even changed to another
                                            compound page.
    get_hwpoison_page -- page is not what we want now...
The current code first does rough prechecks and then reconfirms them after
taking a refcount, but this was found to make the code overly complicated,
so move the prechecks into a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.
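The shape of the fix, reduced to plain C with pthreads (illustrative only;
the kernel code uses hugetlb_lock and page flags, not a mutex): every
precheck and the reference grab sit inside one critical section, so the
page cannot change identity between check and pin.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
        pthread_mutex_t lock;
        int refcount;
        bool is_huge;   /* stand-in for PageHeadHuge()/HPageFreed() checks */
};

/* Take a reference only if the object still satisfies the prechecks
 * at the moment the reference is taken -- the shape of
 * __get_huge_page_for_hwpoison() under hugetlb_lock. */
static bool get_obj_checked(struct obj *o)
{
        bool ok = false;

        pthread_mutex_lock(&o->lock);
        if (o->is_huge && o->refcount > 0) {
                o->refcount++;          /* check and pin, same section */
                ok = true;
        }
        pthread_mutex_unlock(&o->lock);
        return ok;
}

int main(void)
{
        struct obj o = { PTHREAD_MUTEX_INITIALIZER, 1, true };

        printf("got ref: %d\n", get_obj_checked(&o));
        return 0;
}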
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reported-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53c1b6082a4c..ac2a1d758a80 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -169,6 +169,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -378,6 +379,11 @@ static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
return 0;
}
+static inline int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
+
static inline void putback_active_hugepage(struct page *page)
{
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..9f44254af8ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3197,6 +3197,14 @@ extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p);
extern atomic_long_t num_poisoned_pages __read_mostly;
extern int soft_offline_page(unsigned long pfn, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
+#else
+static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
+#endif
#ifndef arch_memory_failure
static inline int arch_memory_failure(unsigned long pfn, int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f8ca7cca3c1a..3fc721789743 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6785,6 +6785,16 @@ int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
return ret;
}
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ int ret;
+
+ spin_lock_irq(&hugetlb_lock);
+ ret = __get_huge_page_for_hwpoison(pfn, flags);
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dcb6bb9cf731..2020944398c9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1498,50 +1498,113 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
return 0;
}
-static int memory_failure_hugetlb(unsigned long pfn, int flags)
+/*
+ * Called from hugetlb code with hugetlb_lock held.
+ *
+ * Return values:
+ * 0 - free hugepage
+ * 1 - in-use hugepage
+ * 2 - not a hugepage
+ * -EBUSY - the hugepage is busy (try to retry)
+ * -EHWPOISON - the hugepage is already hwpoisoned
+ */
+int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *head = compound_head(page);
+ int ret = 2; /* fallback to normal page handling */
+ bool count_increased = false;
+
+ if (!PageHeadHuge(head))
+ goto out;
+
+ if (flags & MF_COUNT_INCREASED) {
+ ret = 1;
+ count_increased = true;
+ } else if (HPageFreed(head) || HPageMigratable(head)) {
+ ret = get_page_unless_zero(head);
+ if (ret)
+ count_increased = true;
+ } else {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (TestSetPageHWPoison(head)) {
+ ret = -EHWPOISON;
+ goto out;
+ }
+
+ return ret;
+out:
+ if (count_increased)
+ put_page(head);
+ return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Taking refcount of hugetlb pages needs extra care about race conditions
+ * with basic operations like hugepage allocation/free/demotion.
+ * So some of prechecks for hwpoison (pinning, and testing/setting
+ * PageHWPoison) should be done in single hugetlb_lock range.
+ */
+static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
{
- struct page *p = pfn_to_page(pfn);
- struct page *head = compound_head(p);
int res;
+ struct page *p = pfn_to_page(pfn);
+ struct page *head;
unsigned long page_flags;
+ bool retry = true;
- if (TestSetPageHWPoison(head)) {
- pr_err("Memory failure: %#lx: already hardware poisoned\n",
- pfn);
- res = -EHWPOISON;
- if (flags & MF_ACTION_REQUIRED)
+ *hugetlb = 1;
+retry:
+ res = get_huge_page_for_hwpoison(pfn, flags);
+ if (res == 2) { /* fallback to normal page handling */
+ *hugetlb = 0;
+ return 0;
+ } else if (res == -EHWPOISON) {
+ pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
+ if (flags & MF_ACTION_REQUIRED) {
+ head = compound_head(p);
res = kill_accessing_process(current, page_to_pfn(head), flags);
+ }
return res;
+ } else if (res == -EBUSY) {
+ if (retry) {
+ retry = false;
+ goto retry;
+ }
+ action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+ return res;
+ }
+
+ head = compound_head(p);
+ lock_page(head);
+
+ if (hwpoison_filter(p)) {
+ ClearPageHWPoison(head);
+ res = -EOPNOTSUPP;
+ goto out;
}
num_poisoned_pages_inc();
- if (!(flags & MF_COUNT_INCREASED)) {
- res = get_hwpoison_page(p, flags);
- if (!res) {
- lock_page(head);
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- unlock_page(head);
- return -EOPNOTSUPP;
- }
- unlock_page(head);
- res = MF_FAILED;
- if (__page_handle_poison(p)) {
- page_ref_inc(p);
- res = MF_RECOVERED;
- }
- action_result(pfn, MF_MSG_FREE_HUGE, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
- } else if (res < 0) {
- action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
- return -EBUSY;
+ /*
+ * Handling free hugepage. The possible race with hugepage allocation
+ * or demotion can be prevented by PageHWPoison flag.
+ */
+ if (res == 0) {
+ unlock_page(head);
+ res = MF_FAILED;
+ if (__page_handle_poison(p)) {
+ page_ref_inc(p);
+ res = MF_RECOVERED;
}
+ action_result(pfn, MF_MSG_FREE_HUGE, res);
+ return res == MF_RECOVERED ? 0 : -EBUSY;
}
- lock_page(head);
-
/*
* The page could have changed compound pages due to race window.
* If this happens just bail out.
@@ -1554,14 +1617,6 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
page_flags = head->flags;
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- put_page(p);
- res = -EOPNOTSUPP;
- goto out;
- }
-
/*
* TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
* simply disable it. In order to make it work properly, we need
@@ -1588,6 +1643,12 @@ out:
unlock_page(head);
return res;
}
+#else
+static inline int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
+{
+ return 0;
+}
+#endif
static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
@@ -1712,6 +1773,7 @@ int memory_failure(unsigned long pfn, int flags)
int res = 0;
unsigned long page_flags;
bool retry = true;
+ int hugetlb = 0;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1739,10 +1801,9 @@ int memory_failure(unsigned long pfn, int flags)
}
try_again:
- if (PageHuge(p)) {
- res = memory_failure_hugetlb(pfn, flags);
+ res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
+ if (hugetlb)
goto unlock_mutex;
- }
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
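For anyone wanting to exercise this path, a hedged test sketch:
MADV_HWPOISON (needs CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN, and should
only be run in a throwaway VM) injects a poisoned page straight into
memory_failure(); the fallback #define is the uapi value from
<asm-generic/mman-common.h>.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100       /* uapi value, for older libc headers */
#endif

int main(void)
{
        size_t len = 2UL * 1024 * 1024;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        if (madvise(p, 4096, MADV_HWPOISON))    /* poison one base page */
                perror("madvise");
        return 0;
}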
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
- return res == MF_RECOVERED ? 0 : -EBUSY;
- } else if (res < 0) {
- action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
- return -EBUSY;
+ /*
+ * Handling free hugepage. The possible race with hugepage allocation
+ * or demotion can be prevented by PageHWPoison flag.
+ */
+ if (res == 0) {
+ unlock_page(head);
+ res = MF_FAILED;
+ if (__page_handle_poison(p)) {
+ page_ref_inc(p);
+ res = MF_RECOVERED;
}
+ action_result(pfn, MF_MSG_FREE_HUGE, res);
+ return res == MF_RECOVERED ? 0 : -EBUSY;
}
- lock_page(head);
-
/*
* The page could have changed compound pages due to race window.
* If this happens just bail out.
@@ -1554,14 +1617,6 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
page_flags = head->flags;
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- put_page(p);
- res = -EOPNOTSUPP;
- goto out;
- }
-
/*
* TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
* simply disable it. In order to make it work properly, we need
@@ -1588,6 +1643,12 @@ out:
unlock_page(head);
return res;
}
+#else
+static inline int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
+{
+ return 0;
+}
+#endif
static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
@@ -1712,6 +1773,7 @@ int memory_failure(unsigned long pfn, int flags)
int res = 0;
unsigned long page_flags;
bool retry = true;
+ int hugetlb = 0;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1739,10 +1801,9 @@ int memory_failure(unsigned long pfn, int flags)
}
try_again:
- if (PageHuge(p)) {
- res = memory_failure_hugetlb(pfn, flags);
+ res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
+ if (hugetlb)
goto unlock_mutex;
- }
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
--
Hi Dear
My name is Lisa Williams, and I am from the United States of America. It's
my pleasure to contact you for a new and special friendship; I will be glad
to see your reply so we can get to know each other better.
Yours
Lisa
This bug is marked as fixed by commit:
net: core: netlink: add helper refcount dec and lock function
net: sched: add helper function to take reference to Qdisc
net: sched: extend Qdisc with rcu
net: sched: rename qdisc_destroy() to qdisc_put()
net: sched: use Qdisc rcu API instead of relying on rtnl lock
But I can't find it in any tested tree for more than 90 days.
Is it a correct commit? Please update it by replying:
#syz fix: exact-commit-title
Until then the bug is still considered open and
new crashes with the same signature are ignored.
--
Dear Winner,
My name is Warren E. Buffett; I am an American business magnate,
investor, and philanthropist. I am the most successful investor in
the world. I deeply believe in the principle of "giving while living".
I have one idea that has never changed in my mind: that you should use
your wealth to help people, and I have decided to donate
{ 3,500,000.00 Euro } Three Million Five Hundred Thousand Euro to
randomly selected people around the world. When you receive this
e-mail, you should count yourself lucky, because your e-mail address
was selected online during a random search.
Please get back to me quickly so that I know your e-mail address
is valid.
Visit this page: https://en.wikipedia.org/wiki/Warren_Buffett or
search my name on Google for more information:
(Warren E. Buffett).
I look forward to your reply.
Sincerely,
Mr. Warren E. Buffett
Chief Executive Officer: Berkshire Hathaway
http://www.berkshirehathaway.com/
bFLT binaries are usually created using elf2flt.
The linker script used by elf2flt has defined the .data section like the
following for the last 19 years:
.data : {
        _sdata = . ;
        __data_start = . ;
        data_start = . ;
        *(.got.plt)
        *(.got)
        FILL(0) ;
        . = ALIGN(0x20) ;
        LONG(-1)
        . = ALIGN(0x20) ;
        ...
}
It places the .got.plt input section before the .got input section.
The same is true for the default linker script (ld --verbose) on most
architectures except x86/x86-64.
The binfmt_flat loader should relocate all GOT entries until it encounters
a -1 (the LONG(-1) in the linker script).
The problem is that the .got.plt input section starts with a GOTPLT header
(which has size 16 bytes on elf64-riscv and 8 bytes on elf32-riscv), where
the first word is set to -1. See the binutils implementation for riscv [1].
This causes the binfmt_flat loader to stop relocating GOT entries
prematurely and thus causes the application to crash when running.
Fix this by skipping the whole GOTPLT header, since it is reserved for the
dynamic linker.
The GOTPLT header will only be skipped for bFLT binaries with flag
FLAT_FLAG_GOTPIC set. This flag is unconditionally set by elf2flt if the
supplied ELF binary has the symbol _GLOBAL_OFFSET_TABLE_ defined.
ELF binaries without a .got input section should thus remain unaffected.
Tested on RISC-V Canaan Kendryte K210 and RISC-V QEMU nommu_virt_defconfig.
[1] https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elfnn-riscv.c;h…
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Niklas Cassel <niklas.cassel(a)wdc.com>
---
Changes since v1:
-Incorporated review comments from Eric Biederman.
RISC-V elf2flt patches are still not merged, they can be found here:
https://github.com/floatious/elf2flt/tree/riscv
buildroot branch for k210 nommu (including this patch and elf2flt patches):
https://github.com/floatious/buildroot/tree/k210-v14
fs/binfmt_flat.c | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index 626898150011..e5e2a03b39c1 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -440,6 +440,30 @@ static void old_reloc(unsigned long rl)
/****************************************************************************/
+static inline u32 __user *skip_got_header(u32 __user *rp)
+{
+ if (IS_ENABLED(CONFIG_RISCV)) {
+ /*
+ * RISC-V has a 16 byte GOT PLT header for elf64-riscv
+ * and 8 byte GOT PLT header for elf32-riscv.
+ * Skip the whole GOT PLT header, since it is reserved
+ * for the dynamic linker (ld.so).
+ */
+ u32 rp_val0, rp_val1;
+
+ if (get_user(rp_val0, rp))
+ return rp;
+ if (get_user(rp_val1, rp + 1))
+ return rp;
+
+ if (rp_val0 == 0xffffffff && rp_val1 == 0xffffffff)
+ rp += 4;
+ else if (rp_val0 == 0xffffffff)
+ rp += 2;
+ }
+ return rp;
+}
+
static int load_flat_file(struct linux_binprm *bprm,
struct lib_info *libinfo, int id, unsigned long *extra_stack)
{
@@ -789,7 +813,8 @@ static int load_flat_file(struct linux_binprm *bprm,
* image.
*/
if (flags & FLAT_FLAG_GOTPIC) {
- for (rp = (u32 __user *)datapos; ; rp++) {
+ rp = skip_got_header((u32 * __user) datapos);
+ for (; ; rp++) {
u32 addr, rp_val;
if (get_user(rp_val, rp))
return -EFAULT;
--
2.35.1
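As a rough illustration of the loader behaviour being fixed, here is a hedged sketch — an in-memory model with hypothetical names, not the load_flat_file() code, which walks the GOT through get_user():

#include <stdint.h>
#include <stddef.h>

/*
 * Hedged sketch of the bFLT GOT relocation loop. On RISC-V the GOTPLT
 * header is 4 u32 words (elf64) or 2 (elf32), and its first word is -1,
 * which is exactly the loop terminator.
 */
static void relocate_got_sketch(uint32_t *got, uint32_t load_addr,
				size_t gotplt_header_words)
{
	got += gotplt_header_words;	/* the skip this patch introduces */

	for (; *got != 0xffffffffu; got++)
		*got += load_addr;	/* relocate one GOT entry */
}

Without the skip, the leading -1 word of the GOTPLT header terminates the loop before the first real GOT entry, so the program runs with an unrelocated GOT and crashes — the symptom described above.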
The ALSA fireworks driver has a bug in its initial state: when handling a
response frame for an Echo Audio Fireworks transaction, it returns a count
to userspace applications that is shorter than expected by 4 bytes. This is
due to a missing addition of the size of the event type in the ALSA
firewire stack.
Fixes: 555e8a8f7f14 ("ALSA: fireworks: Add command/response functionality into hwdep interface")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Takashi Sakamoto <o-takashi(a)sakamocchi.jp>
---
sound/firewire/fireworks/fireworks_hwdep.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/sound/firewire/fireworks/fireworks_hwdep.c b/sound/firewire/fireworks/fireworks_hwdep.c
index 626c0c34b0b6..3a53914277d3 100644
--- a/sound/firewire/fireworks/fireworks_hwdep.c
+++ b/sound/firewire/fireworks/fireworks_hwdep.c
@@ -34,6 +34,7 @@ hwdep_read_resp_buf(struct snd_efw *efw, char __user *buf, long remained,
type = SNDRV_FIREWIRE_EVENT_EFW_RESPONSE;
if (copy_to_user(buf, &type, sizeof(type)))
return -EFAULT;
+ count += sizeof(type);
remained -= sizeof(type);
buf += sizeof(type);
--
2.34.1
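The bug class generalizes to any read()-style handler that prefixes its payload with a header: every chunk copied to the user buffer must be added to the returned count. A hedged, generic sketch (all names hypothetical, not the ALSA code):

#include <string.h>
#include <stdint.h>

/* Hedged sketch of a read handler that prefixes payloads with a type word. */
static long read_resp_sketch(char *buf, const char *payload, long payload_len)
{
	uint32_t type = 1;	/* stand-in for the event type constant */
	long count = 0;

	memcpy(buf, &type, sizeof(type));
	count += sizeof(type);	/* the line the fix adds: count the header too */
	buf += sizeof(type);

	memcpy(buf, payload, payload_len);
	count += payload_len;

	return count;		/* short by 4 bytes if the header isn't counted */
}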
From: Mike Rapoport <rppt(a)linux.ibm.com>
The semantics of pfn_valid() is to check presence of the memory map for a
PFN and not whether a PFN is covered by the linear map. The memory map may
be present for NOMAP memory regions, but they won't be mapped in the linear
mapping. Accessing such regions via __va() when they are memremap()'ed
will cause a crash.
On v5.4.y the crash happens on qemu-arm with UEFI [1]:
<1>[ 0.084476] 8<--- cut here ---
<1>[ 0.084595] Unable to handle kernel paging request at virtual address dfb76000
<1>[ 0.084938] pgd = (ptrval)
<1>[ 0.085038] [dfb76000] *pgd=5f7fe801, *pte=00000000, *ppte=00000000
...
<4>[ 0.093923] [<c0ed6ce8>] (memcpy) from [<c16a06f8>] (dmi_setup+0x60/0x418)
<4>[ 0.094204] [<c16a06f8>] (dmi_setup) from [<c16a38d4>] (arm_dmi_init+0x8/0x10)
<4>[ 0.094408] [<c16a38d4>] (arm_dmi_init) from [<c0302e9c>] (do_one_initcall+0x50/0x228)
<4>[ 0.094619] [<c0302e9c>] (do_one_initcall) from [<c16011e4>] (kernel_init_freeable+0x15c/0x1f8)
<4>[ 0.094841] [<c16011e4>] (kernel_init_freeable) from [<c0f028cc>] (kernel_init+0x8/0x10c)
<4>[ 0.095057] [<c0f028cc>] (kernel_init) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
On kernels v5.10.y and newer the same crash won't reproduce on ARM because
commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with
for_each_mem_range()") changed the way memory regions are registered in the
resource tree, but that merely covers up the problem.
On ARM64, memory resources are registered in yet another way, and there the
issue of wrong usage of pfn_valid() to ensure availability of the linear
map is covered up as well.
Implement arch_memremap_can_ram_remap() on ARM and ARM64 to prevent access
to NOMAP regions via the linear mapping in memremap().
Link: https://lore.kernel.org/all/Yl65zxGgFzF1Okac@sirena.org.uk
Reported-by: "kernelci.org bot" <bot(a)kernelci.org>
Tested-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org # 5.4+
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
---
arch/arm/include/asm/io.h | 4 ++++
arch/arm/mm/ioremap.c | 9 ++++++++-
arch/arm64/include/asm/io.h | 4 ++++
arch/arm64/mm/ioremap.c | 8 ++++++++
kernel/iomem.c | 2 +-
5 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index 0c70eb688a00..fbb2eeea7285 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -145,6 +145,10 @@ extern void __iomem * (*arch_ioremap_caller)(phys_addr_t, size_t,
unsigned int, void *);
extern void (*arch_iounmap)(volatile void __iomem *);
+extern bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+ unsigned long flags);
+#define arch_memremap_can_ram_remap arch_memremap_can_ram_remap
+
/*
* Bad read/write accesses...
*/
diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index aa08bcb72db9..6eb1ad24544d 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -43,7 +43,6 @@
#include <asm/mach/pci.h>
#include "mm.h"
-
LIST_HEAD(static_vmlist);
static struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
@@ -493,3 +492,11 @@ void __init early_ioremap_init(void)
{
early_ioremap_setup();
}
+
+bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+ unsigned long flags)
+{
+ unsigned long pfn = PHYS_PFN(offset);
+
+ return memblock_is_map_memory(pfn);
+}
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 7fd836bea7eb..3995652daf81 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -192,4 +192,8 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size);
extern int valid_phys_addr_range(phys_addr_t addr, size_t size);
extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t size);
+extern bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+ unsigned long flags);
+#define arch_memremap_can_ram_remap arch_memremap_can_ram_remap
+
#endif /* __ASM_IO_H */
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b7c81dacabf0..b21f91cd830d 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -99,3 +99,11 @@ void __init early_ioremap_init(void)
{
early_ioremap_setup();
}
+
+bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+ unsigned long flags)
+{
+ unsigned long pfn = PHYS_PFN(offset);
+
+ return pfn_is_map_memory(pfn);
+}
diff --git a/kernel/iomem.c b/kernel/iomem.c
index 62c92e43aa0d..e85bed24c0a9 100644
--- a/kernel/iomem.c
+++ b/kernel/iomem.c
@@ -33,7 +33,7 @@ static void *try_ram_remap(resource_size_t offset, size_t size,
unsigned long pfn = PHYS_PFN(offset);
/* In the simple case just return the existing linear address */
- if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
+ if (!PageHighMem(pfn_to_page(pfn)) &&
arch_memremap_can_ram_remap(offset, size, flags))
return __va(offset);
base-commit: b2d229d4ddb17db541098b83524d901257e93845
--
2.28.0
fw_dnld_over() is a reentrant function in the nfcmrvl driver; it can be
called by nfcmrvl_fw_dnld_start(), nfcmrvl_nci_recv_frame() and
nfcmrvl_nci_unregister_dev() without synchronization. As a result, the
firmware struct can be deallocated more than once, which leads to
double-free or invalid-free bugs.
The first situation that causes the bug is shown below:
(Thread 1)                      | (Thread 2)
nfcmrvl_fw_dnld_start           |
...                             | nfcmrvl_nci_unregister_dev
release_firmware()              | nfcmrvl_fw_dnld_abort
kfree(fw) //(1)                 | fw_dnld_over
                                | release_firmware
...                             | kfree(fw) //(2)
                                | ...
The second situation that causes the bug is shown below:
(Thread 1)                      | (Thread 2)
nfcmrvl_fw_dnld_start           |
...                             |
mod_timer                       |
(wait a time)                   |
fw_dnld_timeout                 | nfcmrvl_nci_unregister_dev
fw_dnld_over                    | nfcmrvl_fw_dnld_abort
release_firmware                | fw_dnld_over
kfree(fw) //(1)                 | release_firmware
...                             | kfree(fw) //(2)
The third situation that causes the bug is shown below:
(Thread 1)                      | (Thread 2)
nfcmrvl_nci_recv_frame          |
if(..->fw_download_in_progress) |
nfcmrvl_fw_dnld_recv_frame      |
queue_work                      |
                                |
fw_dnld_rx_work                 | nfcmrvl_nci_unregister_dev
fw_dnld_over                    | nfcmrvl_fw_dnld_abort
release_firmware                | fw_dnld_over
kfree(fw) //(1)                 | release_firmware
                                | kfree(fw) //(2)
The firmware struct is deallocated at position (1) and then deallocated
again at position (2).
The crash trace triggered by POC is like below:
[ 122.640457] BUG: KASAN: double-free or invalid-free in fw_dnld_over+0x28/0xf0
[ 122.640457] Call Trace:
[ 122.640457] <TASK>
[ 122.640457] kfree+0xb0/0x330
[ 122.640457] fw_dnld_over+0x28/0xf0
[ 122.640457] nfcmrvl_nci_unregister_dev+0x61/0x70
[ 122.640457] nci_uart_tty_close+0x87/0xd0
[ 122.640457] tty_ldisc_kill+0x3e/0x80
[ 122.640457] tty_ldisc_hangup+0x1b2/0x2c0
[ 122.640457] __tty_hangup.part.0+0x316/0x520
[ 122.640457] tty_release+0x200/0x670
[ 122.640457] __fput+0x110/0x410
[ 122.640457] task_work_run+0x86/0xd0
[ 122.640457] exit_to_user_mode_prepare+0x1aa/0x1b0
[ 122.640457] syscall_exit_to_user_mode+0x19/0x50
[ 122.640457] do_syscall_64+0x48/0x90
[ 122.640457] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 122.640457] RIP: 0033:0x7f68433f6beb
What's more, there are also use-after-free and null-ptr-deref bugs
in nfcmrvl_fw_dnld_start(). If we deallocate the firmware struct or
set the members of priv->fw_dnld to NULL in fw_dnld_over(), and then
dereference the firmware or those members in nfcmrvl_fw_dnld_start(),
UAF or NPD bugs will happen.
This patch reorders nfcmrvl_fw_dnld_abort() after nci_unregister_device()
to avoid the double-free, UAF and NPD bugs, as nci_unregister_device()
is well synchronized and won't return while a routine is still running.
Fixes: 3194c6870158 ("NFC: nfcmrvl: add firmware download support")
Signed-off-by: Duoming Zhou <duoming(a)zju.edu.cn>
Cc: stable(a)vger.kernel.org
---
drivers/nfc/nfcmrvl/main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nfc/nfcmrvl/main.c b/drivers/nfc/nfcmrvl/main.c
index 2fcf545012b..1a5284de434 100644
--- a/drivers/nfc/nfcmrvl/main.c
+++ b/drivers/nfc/nfcmrvl/main.c
@@ -183,6 +183,7 @@ void nfcmrvl_nci_unregister_dev(struct nfcmrvl_private *priv)
{
struct nci_dev *ndev = priv->ndev;
+ nci_unregister_device(ndev);
if (priv->ndev->nfc_dev->fw_download_in_progress)
nfcmrvl_fw_dnld_abort(priv);
@@ -191,7 +192,6 @@ void nfcmrvl_nci_unregister_dev(struct nfcmrvl_private *priv)
if (gpio_is_valid(priv->config.reset_n_io))
gpio_free(priv->config.reset_n_io);
- nci_unregister_device(ndev);
nci_free_device(ndev);
kfree(priv);
}
--
2.17.1
There are double-free bugs in the nfcmrvl driver. The root cause of these
bugs is that the functions that call release_firmware() in the nfcmrvl
driver are not well synchronized.
The double-free bugs are between nfcmrvl_fw_dnld_start() and
nfcmrvl_nci_unregister_dev(). Both of these functions will call
release_firmware(). As a result, the firmware struct will be
deallocated more than once.
One double-free bug is shown below:
(Thread 1)                      | (Thread 2)
nfcmrvl_fw_dnld_start           |
...                             | nfcmrvl_nci_unregister_dev
release_firmware()              | nfcmrvl_fw_dnld_abort
kfree(fw) //(1)                 | fw_dnld_over
                                | release_firmware
                                | kfree(fw) //(2)
...                             | ...
Another double-free bug is shown below:
(Thread 1)                      | (Thread 2)
nfcmrvl_fw_dnld_start           |
...                             |
mod_timer                       |
(wait a time)                   |
fw_dnld_timeout                 | nfcmrvl_nci_unregister_dev
fw_dnld_over                    | nfcmrvl_fw_dnld_abort
release_firmware                | fw_dnld_over
kfree(fw) //(1)                 | release_firmware
...                             | kfree(fw) //(2)
The firmware struct is deallocated at position (1) and then deallocated
again at position (2).
The crash trace triggered by POC is like below:
[ 122.640457] BUG: KASAN: double-free or invalid-free in fw_dnld_over+0x28/0xf0
[ 122.640457] Call Trace:
[ 122.640457] <TASK>
[ 122.640457] kfree+0xb0/0x330
[ 122.640457] fw_dnld_over+0x28/0xf0
[ 122.640457] nfcmrvl_nci_unregister_dev+0x61/0x70
[ 122.640457] nci_uart_tty_close+0x87/0xd0
[ 122.640457] tty_ldisc_kill+0x3e/0x80
[ 122.640457] tty_ldisc_hangup+0x1b2/0x2c0
[ 122.640457] __tty_hangup.part.0+0x316/0x520
[ 122.640457] tty_release+0x200/0x670
[ 122.640457] __fput+0x110/0x410
[ 122.640457] task_work_run+0x86/0xd0
[ 122.640457] exit_to_user_mode_prepare+0x1aa/0x1b0
[ 122.640457] syscall_exit_to_user_mode+0x19/0x50
[ 122.640457] do_syscall_64+0x48/0x90
[ 122.640457] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 122.640457] RIP: 0033:0x7f68433f6beb
What's more, there are also use-after-free and null-ptr-deref bugs
in nfcmrvl_fw_dnld_start(). If we deallocate the firmware struct or
set the members of priv->fw_dnld to NULL in nfcmrvl_nci_unregister_dev(),
and then dereference the firmware or those members in
nfcmrvl_fw_dnld_start(), UAF or NPD bugs will happen.
This patch reorders nfcmrvl_fw_dnld_abort() after nci_unregister_device()
to avoid the double-free, UAF and NPD bugs, as nci_unregister_device()
is well synchronized and won't return while a routine is still running.
Fixes: 3194c6870158 ("NFC: nfcmrvl: add firmware download support")
Signed-off-by: Duoming Zhou <duoming(a)zju.edu.cn>
Cc: stable(a)vger.kernel.org
---
drivers/nfc/nfcmrvl/main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nfc/nfcmrvl/main.c b/drivers/nfc/nfcmrvl/main.c
index 2fcf545012b..1a5284de434 100644
--- a/drivers/nfc/nfcmrvl/main.c
+++ b/drivers/nfc/nfcmrvl/main.c
@@ -183,6 +183,7 @@ void nfcmrvl_nci_unregister_dev(struct nfcmrvl_private *priv)
{
struct nci_dev *ndev = priv->ndev;
+ nci_unregister_device(ndev);
if (priv->ndev->nfc_dev->fw_download_in_progress)
nfcmrvl_fw_dnld_abort(priv);
@@ -191,7 +192,6 @@ void nfcmrvl_nci_unregister_dev(struct nfcmrvl_private *priv)
if (gpio_is_valid(priv->config.reset_n_io))
gpio_free(priv->config.reset_n_io);
- nci_unregister_device(ndev);
nci_free_device(ndev);
kfree(priv);
}
--
2.17.1
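Both versions of the patch apply the same teardown rule: first quiesce every path that can still reach the cleanup routine, then release the shared state. A hedged userspace sketch of that ordering — all names are hypothetical, and stop_and_wait() merely models nci_unregister_device(), which does not return while a handler is running:

#include <stdlib.h>

struct fw_ctx {
	void *fw;	/* stand-in for the firmware blob */
};

/* Models fw_dnld_over(): reachable from several concurrent paths. */
static void dnld_over_sketch(struct fw_ctx *ctx)
{
	free(ctx->fw);	/* double-free if two paths race to here */
	ctx->fw = NULL;
}

/* Models the fixed unregister path: quiesce first, free second. */
static void unregister_sketch(struct fw_ctx *ctx, void (*stop_and_wait)(void))
{
	stop_and_wait();		/* no concurrent callers after this */
	if (ctx->fw)
		dnld_over_sketch(ctx);	/* now provably the only caller */
}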
[BUG]
When running generic/475 with 64K page size and 4K sector size, there is a
very high chance (almost 100%) of hanging, with most data pages left locked
but no one left to unlock them.
[CAUSE]
With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads"), if we fail to look up a checksum due to a metadata IO error, we
return an error from btrfs_submit_data_bio().
This will cause the page to be unlocked twice in btrfs_do_readpage():
btrfs_do_readpage()
|- submit_extent_page()
|  |- submit_one_bio()
|     |- btrfs_submit_data_bio()
|        |- if (ret) {
|        |-     bio->bi_status = ret;
|        |-     bio_endio(bio); }
|        In the endio function, we will call end_page_read()
|        and unlock_extent() to cleanup the subpage range.
|
|- if (ret) {
|-     unlock_extent(); end_page_read() }
   Here we unlock the extent and cleanup the subpage range
   again.
For unlock_extent(), a double unlock is mostly safe.
But for end_page_read() it is not, especially in the subpage case: there we
call btrfs_subpage_end_reader() to decrease the reader count, and use that
count to determine whether the full page needs to be unlocked.
If double accounted, the count can underflow and leave the page locked
with no one to unlock it.
[FIX]
The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine; it's our existing code that does not
properly handle errors from the bio submission hook.
This patch makes submit_one_bio() return void, so callers can never attempt
their own cleanup when the bio submission hook fails.
CC: stable(a)vger.kernel.org # 5.18+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/extent_io.c | 114 ++++++++++++++-----------------------------
fs/btrfs/extent_io.h | 3 +-
fs/btrfs/inode.c | 13 +++--
3 files changed, 44 insertions(+), 86 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 15a429e28174..163aa6dee987 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -166,24 +166,27 @@ static int add_extent_changeset(struct extent_state *state, u32 bits,
return ret;
}
-int __must_check submit_one_bio(struct bio *bio, int mirror_num,
- unsigned long bio_flags)
+void submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_flags)
{
- blk_status_t ret = 0;
struct extent_io_tree *tree = bio->bi_private;
bio->bi_private = NULL;
/* Caller should ensure the bio has at least some range added */
ASSERT(bio->bi_iter.bi_size);
+
if (is_data_inode(tree->private_data))
- ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
+ btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
bio_flags);
else
- ret = btrfs_submit_metadata_bio(tree->private_data, bio,
+ btrfs_submit_metadata_bio(tree->private_data, bio,
mirror_num, bio_flags);
-
- return blk_status_to_errno(ret);
+ /*
+ * Above submission hooks will handle the error by ending the bio,
+ * which will do the cleanup properly.
+ * So here we should not return any error, or the caller of
+ * submit_extent_page() will do cleanup again, causing problems.
+ */
}
/* Cleanup unsubmitted bios */
@@ -204,13 +207,12 @@ static void end_write_bio(struct extent_page_data *epd, int ret)
* Return 0 if everything is OK.
* Return <0 for error.
*/
-static int __must_check flush_write_bio(struct extent_page_data *epd)
+static void flush_write_bio(struct extent_page_data *epd)
{
- int ret = 0;
struct bio *bio = epd->bio_ctrl.bio;
if (bio) {
- ret = submit_one_bio(bio, 0, 0);
+ submit_one_bio(bio, 0, 0);
/*
* Clean up of epd->bio is handled by its endio function.
* And endio is either triggered by successful bio execution
@@ -220,7 +222,6 @@ static int __must_check flush_write_bio(struct extent_page_data *epd)
*/
epd->bio_ctrl.bio = NULL;
}
- return ret;
}
int __init extent_state_cache_init(void)
@@ -3441,10 +3442,8 @@ static int submit_extent_page(unsigned int opf,
ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
pg_offset + size <= PAGE_SIZE);
if (force_bio_submit && bio_ctrl->bio) {
- ret = submit_one_bio(bio_ctrl->bio, mirror_num, bio_ctrl->bio_flags);
+ submit_one_bio(bio_ctrl->bio, mirror_num, bio_ctrl->bio_flags);
bio_ctrl->bio = NULL;
- if (ret < 0)
- return ret;
}
while (cur < pg_offset + size) {
@@ -3485,11 +3484,9 @@ static int submit_extent_page(unsigned int opf,
if (added < size - offset) {
/* The bio should contain some page(s) */
ASSERT(bio_ctrl->bio->bi_iter.bi_size);
- ret = submit_one_bio(bio_ctrl->bio, mirror_num,
+ submit_one_bio(bio_ctrl->bio, mirror_num,
bio_ctrl->bio_flags);
bio_ctrl->bio = NULL;
- if (ret < 0)
- return ret;
}
cur += added;
}
@@ -4237,14 +4234,12 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb
struct extent_page_data *epd)
{
struct btrfs_fs_info *fs_info = eb->fs_info;
- int i, num_pages, failed_page_nr;
+ int i, num_pages;
int flush = 0;
int ret = 0;
if (!btrfs_try_tree_write_lock(eb)) {
- ret = flush_write_bio(epd);
- if (ret < 0)
- return ret;
+ flush_write_bio(epd);
flush = 1;
btrfs_tree_lock(eb);
}
@@ -4254,9 +4249,7 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb
if (!epd->sync_io)
return 0;
if (!flush) {
- ret = flush_write_bio(epd);
- if (ret < 0)
- return ret;
+ flush_write_bio(epd);
flush = 1;
}
while (1) {
@@ -4303,39 +4296,13 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb
if (!trylock_page(p)) {
if (!flush) {
- int err;
-
- err = flush_write_bio(epd);
- if (err < 0) {
- ret = err;
- failed_page_nr = i;
- goto err_unlock;
- }
+ flush_write_bio(epd);
flush = 1;
}
lock_page(p);
}
}
- return ret;
-err_unlock:
- /* Unlock already locked pages */
- for (i = 0; i < failed_page_nr; i++)
- unlock_page(eb->pages[i]);
- /*
- * Clear EXTENT_BUFFER_WRITEBACK and wake up anyone waiting on it.
- * Also set back EXTENT_BUFFER_DIRTY so future attempts to this eb can
- * be made and undo everything done before.
- */
- btrfs_tree_lock(eb);
- spin_lock(&eb->refs_lock);
- set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
- end_extent_buffer_writeback(eb);
- spin_unlock(&eb->refs_lock);
- percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, eb->len,
- fs_info->dirty_metadata_batch);
- btrfs_clear_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
- btrfs_tree_unlock(eb);
return ret;
}
@@ -4957,13 +4924,19 @@ int btree_write_cache_pages(struct address_space *mapping,
* if the fs already has error.
*/
if (!BTRFS_FS_ERROR(fs_info)) {
- ret = flush_write_bio(&epd);
+ flush_write_bio(&epd);
} else {
ret = -EROFS;
end_write_bio(&epd, ret);
}
out:
btrfs_zoned_meta_io_unlock(fs_info);
+ /*
+ * We can get ret > 0 from submit_extent_page() indicating how many ebs
+ * were submitted. Reset it to 0 to avoid false alerts for the caller.
+ */
+ if (ret > 0)
+ ret = 0;
return ret;
}
@@ -5065,8 +5038,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
* tmpfs file mapping
*/
if (!trylock_page(page)) {
- ret = flush_write_bio(epd);
- BUG_ON(ret < 0);
+ flush_write_bio(epd);
lock_page(page);
}
@@ -5076,10 +5048,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
}
if (wbc->sync_mode != WB_SYNC_NONE) {
- if (PageWriteback(page)) {
- ret = flush_write_bio(epd);
- BUG_ON(ret < 0);
- }
+ if (PageWriteback(page))
+ flush_write_bio(epd);
wait_on_page_writeback(page);
}
@@ -5119,9 +5089,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
* page in our current bio, and thus deadlock, so flush the
* write bio here.
*/
- ret = flush_write_bio(epd);
- if (!ret)
- goto retry;
+ flush_write_bio(epd);
+ goto retry;
}
if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
@@ -5147,8 +5116,7 @@ int extent_write_full_page(struct page *page, struct writeback_control *wbc)
return ret;
}
- ret = flush_write_bio(&epd);
- ASSERT(ret <= 0);
+ flush_write_bio(&epd);
return ret;
}
@@ -5210,7 +5178,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end)
}
if (!found_error)
- ret = flush_write_bio(&epd);
+ flush_write_bio(&epd);
else
end_write_bio(&epd, ret);
@@ -5243,7 +5211,7 @@ int extent_writepages(struct address_space *mapping,
end_write_bio(&epd, ret);
return ret;
}
- ret = flush_write_bio(&epd);
+ flush_write_bio(&epd);
return ret;
}
@@ -5266,10 +5234,8 @@ void extent_readahead(struct readahead_control *rac)
if (em_cached)
free_extent_map(em_cached);
- if (bio_ctrl.bio) {
- if (submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags))
- return;
- }
+ if (bio_ctrl.bio)
+ submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags);
}
/*
@@ -6649,12 +6615,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
atomic_dec(&eb->io_pages);
}
if (bio_ctrl.bio) {
- int tmp;
-
- tmp = submit_one_bio(bio_ctrl.bio, mirror_num, 0);
+ submit_one_bio(bio_ctrl.bio, mirror_num, 0);
bio_ctrl.bio = NULL;
- if (tmp < 0)
- return tmp;
}
if (ret || wait != WAIT_COMPLETE)
return ret;
@@ -6767,10 +6729,8 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
}
if (bio_ctrl.bio) {
- err = submit_one_bio(bio_ctrl.bio, mirror_num, bio_ctrl.bio_flags);
+ submit_one_bio(bio_ctrl.bio, mirror_num, bio_ctrl.bio_flags);
bio_ctrl.bio = NULL;
- if (err)
- return err;
}
if (ret || wait != WAIT_COMPLETE)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 05253612ce7b..9a283b2358b8 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -178,8 +178,7 @@ typedef struct extent_map *(get_extent_t)(struct btrfs_inode *inode,
int try_release_extent_mapping(struct page *page, gfp_t mask);
int try_release_extent_buffer(struct page *page);
-int __must_check submit_one_bio(struct bio *bio, int mirror_num,
- unsigned long bio_flags);
+void submit_one_bio(struct bio *bio, int mirror_num, unsigned long bio_flags);
int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
struct btrfs_bio_ctrl *bio_ctrl,
unsigned int read_flags, u64 *prev_em_start);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03fa33db1640..62fdcb7576ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8158,13 +8158,12 @@ int btrfs_readpage(struct file *file, struct page *page)
btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
ret = btrfs_do_readpage(page, NULL, &bio_ctrl, 0, NULL);
- if (bio_ctrl.bio) {
- int ret2;
-
- ret2 = submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags);
- if (ret == 0)
- ret = ret2;
- }
+ /*
+ * If btrfs_do_readpage() failed we will want to submit the assembled
+ * bio to do the cleanup.
+ */
+ if (bio_ctrl.bio)
+ submit_one_bio(bio_ctrl.bio, 0, bio_ctrl.bio_flags);
return ret;
}
--
2.35.1
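To see why the void return matters, here is a hedged sketch of the double-completion hazard in plain C — the names echo the btrfs objects (the subpage reader count, end_page_read()) but none of this is btrfs code:

#include <assert.h>
#include <stdbool.h>

static int readers = 1;			/* models the subpage reader count */
static bool page_locked = true;

/* Models end_page_read(): the last reader unlocks the page. */
static void end_read_sketch(void)
{
	assert(readers > 0);		/* trips on a double completion */
	if (--readers == 0)
		page_locked = false;	/* models unlock_page() */
}

/* Models the endio path: on submission failure the bio is ended here,
 * so the cleanup already happens inside — and nothing is returned. */
static void submit_sketch(bool fail)
{
	if (fail)
		end_read_sketch();
}

static void caller_sketch(void)
{
	submit_sketch(true);
	/* The old code also ran end_read_sketch() here on error — the
	 * second completion that underflowed the count and left the
	 * page locked. With a void return, that path no longer exists. */
}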
From: Guo Ren <guoren(a)linux.alibaba.com>
These patch_text implementations use the stop_machine_cpuslocked
infrastructure with an atomic cpu_count. The original idea: when the
master CPU runs patch_text, the others should wait for it. But the
current implementation uses the first CPU as the master, which cannot
guarantee that the remaining CPUs are already waiting. This patch makes
the last CPU the master to eliminate the potential risk.
Signed-off-by: Guo Ren <guoren(a)linux.alibaba.com>
Signed-off-by: Guo Ren <guoren(a)kernel.org>
Acked-by: Palmer Dabbelt <palmer(a)rivosinc.com>
Reviewed-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
---
arch/riscv/kernel/patch.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 0b552873a577..765004b60513 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -104,7 +104,7 @@ static int patch_text_cb(void *data)
struct patch_insn *patch = data;
int ret = 0;
- if (atomic_inc_return(&patch->cpu_count) == 1) {
+ if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
ret =
patch_text_nosync(patch->addr, &patch->insn,
GET_INSN_LENGTH(patch->insn));
--
2.25.1
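A hedged model of the rendezvous using C11 atomics — not the kernel code; cpu_relax() and the memory barriers are omitted. With the first arriver as master, patching could start while some CPU is still on its way into the wait loop; with the last arriver as master, every other CPU is provably already spinning:

#include <stdatomic.h>

static atomic_int cpu_count;

/* Hedged sketch of patch_text_cb(): each online CPU runs this. */
static void rendezvous_sketch(int num_online, void (*do_patch)(void))
{
	if (atomic_fetch_add(&cpu_count, 1) + 1 == num_online) {
		/* Last CPU to arrive is the master: all others are
		 * already spinning in the else branch below. */
		do_patch();
		atomic_fetch_add(&cpu_count, 1);	/* release the waiters */
	} else {
		/* Wait until the master signals completion. */
		while (atomic_load(&cpu_count) <= num_online)
			;	/* cpu_relax() in the kernel */
	}
}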
On Sun, Apr 24, 2022 at 02:02:22PM +0800, 张广辉 wrote:
>
> Hi all
>
<snip>
Hi,
This is the friendly patch-bot of Greg Kroah-Hartman. You have sent him
a patch that has triggered this response. He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created. Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.
You are receiving this message because of the following common error(s)
as indicated below:
- Your patch is malformed (tabs converted to spaces, linewrapped, etc.)
and can not be applied. Please read the file,
Documentation/email-clients.txt in order to fix this.
If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.
thanks,
greg k-h's patch email bot
--
Greetings,
I'm Mr. Jibri loubda. How are you doing? I hope you are in good health.
The Board Director tried to reach you on the phone several times;
meanwhile, your number was not connecting, so he asked me to send you an
email to hear from you, if you are fine. I hope to hear you are in good
health.
Thanks,
Mr. Jibri loubda.
In ufs_qcom_dev_ref_clk_ctrl(), it was noted that the ref_clk needs to be
stable for at least 1us. Even though there is wmb() to make sure the write
gets "completed", there is no guarantee that the write actually reached
the UFS device. There is a good chance that the write could be stored in
a Write Buffer (WB). In that case, even though the CPU waits for 1us, the
ref_clk might not be stable for that period.
So let's do a readl() to make sure that the previous write has reached the
UFS device before udelay().
Also, the wmb() after writel_relaxed is not really needed. Both writel and
readl are ordered on all architectures and the CPU won't speculate
instructions after readl() due to the in-built control dependency with
read value on weakly ordered architectures. So it can be safely removed.
Cc: stable(a)vger.kernel.org
Fixes: f06fcc7155dc ("scsi: ufs-qcom: add QUniPro hardware support and power optimizations")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
drivers/scsi/ufs/ufs-qcom.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c
index 6ee33cc0ad09..f47a16b7cff5 100644
--- a/drivers/scsi/ufs/ufs-qcom.c
+++ b/drivers/scsi/ufs/ufs-qcom.c
@@ -687,8 +687,11 @@ static void ufs_qcom_dev_ref_clk_ctrl(struct ufs_qcom_host *host, bool enable)
writel_relaxed(temp, host->dev_ref_clk_ctrl_mmio);
- /* ensure that ref_clk is enabled/disabled before we return */
- wmb();
+ /*
+ * Make sure the write to ref_clk reaches the destination and
+ * not stored in a Write Buffer (WB).
+ */
+ readl(host->dev_ref_clk_ctrl_mmio);
/*
* If we call hibern8 exit after this, we need to make sure that
--
2.25.1
In ufs_qcom_dev_ref_clk_ctrl(), it was noted that the ref_clk needs to be
stable for at least 1us. Even though there is a wmb() to make sure the
write gets "completed", there is no guarantee that the write actually
reached the UFS device. There is a good chance that the write could be
stored in a Write Buffer (WB). In that case, even though the CPU waits for
1us, the ref_clk might not be stable for that period.
So let's do a readl() to make sure that the previous write has reached the
UFS device before udelay().
Cc: stable(a)vger.kernel.org
Fixes: f06fcc7155dc ("scsi: ufs-qcom: add QUniPro hardware support and power optimizations")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
drivers/scsi/ufs/ufs-qcom.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c
index 5f0a8f646eb5..5b9986c63eed 100644
--- a/drivers/scsi/ufs/ufs-qcom.c
+++ b/drivers/scsi/ufs/ufs-qcom.c
@@ -690,6 +690,12 @@ static void ufs_qcom_dev_ref_clk_ctrl(struct ufs_qcom_host *host, bool enable)
/* ensure that ref_clk is enabled/disabled before we return */
wmb();
+ /*
+ * Make sure the write to ref_clk reaches the destination and
+ * not stored in a Write Buffer (WB).
+ */
+ readl(host->dev_ref_clk_ctrl_mmio);
+
/*
* If we call hibern8 exit after this, we need to make sure that
* device ref_clk is stable for at least 1us before the hibern8
--
2.25.1
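Both revisions above rely on the same MMIO idiom: a register write can be posted in buffers between the CPU and the device, so a delay after the write alone guarantees nothing, while a read from the same device is non-posted and forces the earlier write to complete first. A hedged sketch of the idiom — plain volatile accesses stand in for writel()/readl(), and all names are hypothetical:

#include <stdint.h>

static inline void mmio_write32(volatile uint32_t *reg, uint32_t val)
{
	*reg = val;
}

static inline uint32_t mmio_read32(volatile uint32_t *reg)
{
	return *reg;
}

/* Hedged sketch of the flush-by-readback idiom. */
static void enable_ref_clk_sketch(volatile uint32_t *ctrl_reg,
				  void (*delay_us)(unsigned int))
{
	mmio_write32(ctrl_reg, 1);	/* may sit in a write buffer */
	(void)mmio_read32(ctrl_reg);	/* non-posted read flushes it */
	delay_us(1);			/* the 1us now starts at the device */
}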
If the user sets the usb_request's no_interrupt, then there will be no
completion event for the request. Currently the driver incorrectly uses
the event status of a different request to report the status for a
request with no_interrupt. The dwc3 driver needs to check the TRB status
associated with the request when reporting its status.
Note: this is only applicable to the missed_isoc TRB completion status, but
the other statuses are also listed for completeness/documentation.
Fixes: 6d8a019614f3 ("usb: dwc3: gadget: check for Missed Isoc from event status")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
drivers/usb/dwc3/gadget.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index ab725d2262d6..0b9c2493844a 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -3274,6 +3274,7 @@ static int dwc3_gadget_ep_cleanup_completed_request(struct dwc3_ep *dep,
const struct dwc3_event_depevt *event,
struct dwc3_request *req, int status)
{
+ int request_status;
int ret;
if (req->request.num_mapped_sgs)
@@ -3294,7 +3295,35 @@ static int dwc3_gadget_ep_cleanup_completed_request(struct dwc3_ep *dep,
req->needs_extra_trb = false;
}
- dwc3_gadget_giveback(dep, req, status);
+ /*
+ * The event status only reflects the status of the TRB with IOC set.
+ * For the requests that don't set interrupt on completion, the driver
+ * needs to check and return the status of the completed TRBs associated
+ * with the request. Use the status of the last TRB of the request.
+ */
+ if (req->request.no_interrupt) {
+ struct dwc3_trb *trb;
+
+ trb = dwc3_ep_prev_trb(dep, dep->trb_dequeue);
+ switch (DWC3_TRB_SIZE_TRBSTS(trb->size)) {
+ case DWC3_TRBSTS_MISSED_ISOC:
+ /* Isoc endpoint only */
+ request_status = -EXDEV;
+ break;
+ case DWC3_TRB_STS_XFER_IN_PROG:
+ /* Applicable when End Transfer with ForceRM=0 */
+ case DWC3_TRBSTS_SETUP_PENDING:
+ /* Control endpoint only */
+ case DWC3_TRBSTS_OK:
+ default:
+ request_status = 0;
+ break;
+ }
+ } else {
+ request_status = status;
+ }
+
+ dwc3_gadget_giveback(dep, req, request_status);
out:
return ret;
base-commit: f4fd84ae0765a80494b28c43b756a95100351a94
--
2.28.0
I made a mistake with commit a6aaa0032424 ("net: ethernet: stmmac:
fix altr_tse_pcs function when using a fixed-link"). I should have
tested against both scenarios: with an SGMII interface and without one.
Without the SGMII PCS TSE adapter, the sgmii_adapter_base address is
NULL, so a write to this address will fail.
Fixes: a6aaa0032424 ("net: ethernet: stmmac: fix altr_tse_pcs function when using a fixed-link")
Cc: linux-stable <stable(a)vger.kernel.org>
Signed-off-by: Dinh Nguyen <dinguyen(a)kernel.org>
---
drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
index ac9e6c7a33b5..6b447d8f0bd8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
@@ -65,8 +65,9 @@ static void socfpga_dwmac_fix_mac_speed(void *priv, unsigned int speed)
struct phy_device *phy_dev = ndev->phydev;
u32 val;
- writew(SGMII_ADAPTER_DISABLE,
- sgmii_adapter_base + SGMII_ADAPTER_CTRL_REG);
+ if (sgmii_adapter_base)
+ writew(SGMII_ADAPTER_DISABLE,
+ sgmii_adapter_base + SGMII_ADAPTER_CTRL_REG);
if (splitter_base) {
val = readl(splitter_base + EMAC_SPLITTER_CTRL_REG);
@@ -88,10 +89,11 @@ static void socfpga_dwmac_fix_mac_speed(void *priv, unsigned int speed)
writel(val, splitter_base + EMAC_SPLITTER_CTRL_REG);
}
- writew(SGMII_ADAPTER_ENABLE,
- sgmii_adapter_base + SGMII_ADAPTER_CTRL_REG);
- if (phy_dev)
+ if (phy_dev && sgmii_adapter_base) {
+ writew(SGMII_ADAPTER_ENABLE,
+ sgmii_adapter_base + SGMII_ADAPTER_CTRL_REG);
tse_pcs_fix_mac_speed(&dwmac->pcs, phy_dev, speed);
+ }
}
static int socfpga_dwmac_parse_data(struct socfpga_dwmac *dwmac, struct device *dev)
--
2.25.1
--
Hello Dear
I am a dying woman here in the hospital; I was diagnosed as a
Coronavirus patient over 2 months ago. I am a business woman dealing
with gold exportation. I am 71 years old, from California, USA. I have
a charitable and unfulfilled project that I am about to hand over to
you; if you are interested to know more about this project, please
reply to me.
Hope to hear from you
Best Regard
Mrs. Margaret Christopher
The patch titled
Subject: mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
has been removed from the -mm tree. Its filename was
mm-mmu_notifierc-fix-race-in-mmu_interval_notifier_remove.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Alistair Popple <apopple(a)nvidia.com>
Subject: mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use. Consider the following sequence:
CPU0 - mn_tree_inv_end()                 CPU1 - mmu_interval_notifier_remove()
-----------------------------------      ------------------------------------
                                         spin_lock(subscriptions->lock);
                                         seq = subscriptions->invalidate_seq;
spin_lock(subscriptions->lock);          spin_unlock(subscriptions->lock);
subscriptions->invalidate_seq++;
                                         wait_event(invalidate_seq != seq);
                                         return;
interval_tree_remove(interval_sub);      kfree(interval_sub);
spin_unlock(subscriptions->lock);
wake_up_all();
As the wait_event() condition is true it will return immediately. This
can lead to use-after-free type errors if the caller frees the data
structure containing the interval notifier subscription while it is still
on a deferred list. Fix this by taking the appropriate lock when reading
invalidate_seq to ensure proper synchronisation.
I observed this whilst running stress testing during some development.
You do have to be pretty unlucky, but it leads to the usual problems of
use-after-free (memory corruption, kernel crash, difficult to diagnose
WARN_ON, etc).
Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier")
Signed-off-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Christian K��nig <christian.koenig(a)amd.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmu_notifier.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
--- a/mm/mmu_notifier.c~mm-mmu_notifierc-fix-race-in-mmu_interval_notifier_remove
+++ a/mm/mmu_notifier.c
@@ -1036,6 +1036,18 @@ int mmu_interval_notifier_insert_locked(
}
EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
+static bool
+mmu_interval_seq_released(struct mmu_notifier_subscriptions *subscriptions,
+ unsigned long seq)
+{
+ bool ret;
+
+ spin_lock(&subscriptions->lock);
+ ret = subscriptions->invalidate_seq != seq;
+ spin_unlock(&subscriptions->lock);
+ return ret;
+}
+
/**
* mmu_interval_notifier_remove - Remove a interval notifier
* @interval_sub: Interval subscription to unregister
@@ -1083,7 +1095,7 @@ void mmu_interval_notifier_remove(struct
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
if (seq)
wait_event(subscriptions->wq,
- READ_ONCE(subscriptions->invalidate_seq) != seq);
+ mmu_interval_seq_released(subscriptions, seq));
/* pairs with mmgrab in mmu_interval_notifier_insert() */
mmdrop(mm);
_
Patches currently in -mm which might be from apopple(a)nvidia.com are
mm-add-selftests-for-migration-entries.patch
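The pattern generalizes to any wait/wake protocol where the waker updates the condition and then keeps using shared state under a lock: if the waiter samples the condition without that lock, it can observe the update early and free memory the waker still touches. A hedged pthread sketch of the fixed scheme — hypothetical names, with pthread primitives standing in for the kernel's spinlock and wait queue:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static unsigned long invalidate_seq;

/* Waiter: mirrors mmu_interval_seq_released() — sample seq under the lock. */
static void wait_seq_released_sketch(unsigned long seq)
{
	pthread_mutex_lock(&lock);
	while (invalidate_seq == seq)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	/* only now is it safe to free the waiter-owned structure */
}

/* Waker: bumps the sequence and finishes its teardown inside the lock. */
static void inv_end_sketch(void)
{
	pthread_mutex_lock(&lock);
	invalidate_seq++;
	/* ... remove subscriptions from the deferred list ... */
	pthread_mutex_unlock(&lock);
	pthread_cond_broadcast(&cond);
}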
The patch titled
Subject: oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
has been removed from the -mm tree. Its filename was
oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Nico Pache <npache(a)redhat.com>
Subject: oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list head; the kernel does not keep a copy of the robust list and
instead references a userspace address to maintain the robustness during a
process death. A race can occur between exit_mm and the oom reaper that
allows the oom reaper to free the memory of the futex robust list before
the exit path has handled the futex death:
CPU1                                    CPU2
------------------------------------------------------------------------
page_fault
do_exit "signal"
wake_oom_reaper
                                        oom_reaper
                                        oom_reap_task_mm (invalidates mm)
exit_mm
exit_mm_release
futex_exit_release
futex_cleanup
exit_robust_list
get_user (EFAULT- can't access memory)
If the get_user EFAULT's, the kernel will be unable to recover the waiters
on the robust_list, leaving userspace mutexes hung indefinitely.
Delay the OOM reaper, allowing more time for the exit path to perform the
futex cleanup.
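As a rough sketch of the mechanism the patch uses, shown with
hypothetical struct and helper names (only the timer API itself is the
kernel's): reaping is queued from a timer callback instead of
immediately.

#include <linux/timer.h>
#include <linux/jiffies.h>

#define REAP_DELAY (2 * HZ)	/* mirrors OOM_REAPER_DELAY in the patch */

struct victim {
	struct timer_list timer;
};

static void hand_off_to_reaper(struct victim *v);	/* hypothetical */

static void reap_timer_fn(struct timer_list *timer)
{
	struct victim *v = from_timer(v, timer, timer);

	hand_off_to_reaper(v);	/* only now does the reaper see the victim */
}

static void queue_delayed_reap(struct victim *v)
{
	timer_setup(&v->timer, reap_timer_fn, 0);
	v->timer.expires = jiffies + REAP_DELAY;	/* head start for the exit path */
	add_timer(&v->timer);
}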
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Based on a patch by Michal Hocko.
[1] https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz(a)redhat.com>
Signed-off-by: Nico Pache <npache(a)redhat.com>
Co-developed-by: Joel Savitz <jsavitz(a)redhat.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Waiman Long <longman(a)redhat.com>
Cc: Herton R. Krzesinski <herton(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Joel Savitz <jsavitz(a)redhat.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/sched.h | 1
mm/oom_kill.c | 54 +++++++++++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h~oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup
+++ a/include/linux/sched.h
@@ -1443,6 +1443,7 @@ struct task_struct {
int pagefault_disabled;
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
+ struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
--- a/mm/oom_kill.c~oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup
+++ a/mm/oom_kill.c
@@ -632,7 +632,7 @@ done:
*/
set_bit(MMF_OOM_SKIP, &mm->flags);
- /* Drop a reference taken by wake_oom_reaper */
+ /* Drop a reference taken by queue_oom_reaper */
put_task_struct(tsk);
}
@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
struct task_struct *tsk = NULL;
wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
- spin_lock(&oom_reaper_lock);
+ spin_lock_irq(&oom_reaper_lock);
if (oom_reaper_list != NULL) {
tsk = oom_reaper_list;
oom_reaper_list = tsk->oom_reaper_list;
}
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irq(&oom_reaper_lock);
if (tsk)
oom_reap_task(tsk);
@@ -658,22 +658,48 @@ static int oom_reaper(void *unused)
return 0;
}
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct timer_list *timer)
{
- /* mm is already queued? */
- if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ struct task_struct *tsk = container_of(timer, struct task_struct,
+ oom_reaper_timer);
+ struct mm_struct *mm = tsk->signal->oom_mm;
+ unsigned long flags;
+
+ /* The victim managed to terminate on its own - see exit_mmap */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ put_task_struct(tsk);
return;
+ }
- get_task_struct(tsk);
-
- spin_lock(&oom_reaper_lock);
+ spin_lock_irqsave(&oom_reaper_lock, flags);
tsk->oom_reaper_list = oom_reaper_list;
oom_reaper_list = tsk;
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irqrestore(&oom_reaper_lock, flags);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
}
+/*
+ * Give the OOM victim time to exit naturally before invoking the oom_reaping.
+ * The timers timeout is arbitrary... the longer it is, the longer the worst
+ * case scenario for the OOM can take. If it is too small, the oom_reaper can
+ * get in the way and release resources needed by the process exit path.
+ * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
+ * before the exit path is able to wake the futex waiters.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_reaper(struct task_struct *tsk)
+{
+ /* mm is already queued? */
+ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ return;
+
+ get_task_struct(tsk);
+ timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
+ tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+ add_timer(&tsk->oom_reaper_timer);
+}
+
static int __init oom_init(void)
{
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +707,7 @@ static int __init oom_init(void)
}
subsys_initcall(oom_init)
#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk)
{
}
#endif /* CONFIG_MMU */
@@ -932,7 +958,7 @@ static void __oom_kill_process(struct ta
rcu_read_unlock();
if (can_oom_reap)
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
mmdrop(mm);
put_task_struct(victim);
@@ -968,7 +994,7 @@ static void oom_kill_process(struct oom_
task_lock(victim);
if (task_will_free_mem(victim)) {
mark_oom_victim(victim);
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
task_unlock(victim);
put_task_struct(victim);
return;
@@ -1067,7 +1093,7 @@ bool out_of_memory(struct oom_control *o
*/
if (task_will_free_mem(current)) {
mark_oom_victim(current);
- wake_oom_reaper(current);
+ queue_oom_reaper(current);
return true;
}
_
Patches currently in -mm which might be from npache(a)redhat.com are
The patch titled
Subject: mm, hugetlb: allow for "high" userspace addresses
has been removed from the -mm tree. Its filename was
mm-hugetlbfs-allow-for-high-userspace-addresses.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Subject: mm, hugetlb: allow for "high" userspace addresses
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are optionally
supported on the system and have to be requested via a hint mechanism
("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out of
mm/mmap.c into include/linux/sched/mm.h
If these macros are not defined in architectural code then they default to
(TASK_SIZE) and (base) so should not introduce any behavioural changes to
architectures that do not define them.
For the time being, only ARM64 is affected by this change.
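For illustration, an override in the style of arm64's 52-bit VA handling
(treat the exact macro body as an assumption rather than verbatim arch
code): the full address space is only exposed when the hint address is
already above the default window.

/* Hypothetical arch override, modelled on arm64. */
#define arch_get_mmap_end(addr) \
	(((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)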
Catalin (ARM64) said
: We should have fixed hugetlb_get_unmapped_area() as well when we added
: support for 52-bit VA. The reason for commit f6795053dac8 was to prevent
: normal mmap() from returning addresses above 48-bit by default as some
: user-space had hard assumptions about this.
:
: It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
: but I doubt anyone would notice. It's more likely that the current
: behaviour would cause issues, so I'd rather have them consistent.
:
: Basically when arm64 gained support for 52-bit addresses we did not
: want user-space calling mmap() to suddenly get such high addresses,
: otherwise we could have inadvertently broken some programs (similar
: behaviour to x86 here). Hence we added commit f6795053dac8. But we
: missed hugetlbfs which could still get such high mmap() addresses. So
: in theory that's a potential regression that should have been addressed
: at the same time as commit f6795053dac8 (and before arm64 enabled
: 52-bit addresses).
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.16500337…
Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Steve Capper <steve.capper(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: <stable(a)vger.kernel.org> [5.0.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 9 +++++----
include/linux/sched/mm.h | 8 ++++++++
mm/mmap.c | 8 --------
3 files changed, 13 insertions(+), 12 deletions(-)
--- a/fs/hugetlbfs/inode.c~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/fs/hugetlbfs/inode.c
@@ -206,7 +206,7 @@ hugetlb_get_unmapped_area_bottomup(struc
info.flags = 0;
info.length = len;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
@@ -222,7 +222,7 @@ hugetlb_get_unmapped_area_topdown(struct
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
- info.high_limit = current->mm->mmap_base;
+ info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -237,7 +237,7 @@ hugetlb_get_unmapped_area_topdown(struct
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
addr = vm_unmapped_area(&info);
}
@@ -251,6 +251,7 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct hstate *h = hstate_file(file);
+ const unsigned long mmap_end = arch_get_mmap_end(addr);
if (len & ~huge_page_mask(h))
return -EINVAL;
@@ -266,7 +267,7 @@ hugetlb_get_unmapped_area(struct file *f
if (addr) {
addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
+ if (mmap_end - len >= addr &&
(!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
--- a/include/linux/sched/mm.h~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/include/linux/sched/mm.h
@@ -136,6 +136,14 @@ static inline void mm_update_next_owner(
#endif /* CONFIG_MEMCG */
#ifdef CONFIG_MMU
+#ifndef arch_get_mmap_end
+#define arch_get_mmap_end(addr) (TASK_SIZE)
+#endif
+
+#ifndef arch_get_mmap_base
+#define arch_get_mmap_base(addr, base) (base)
+#endif
+
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
extern unsigned long
--- a/mm/mmap.c~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/mm/mmap.c
@@ -2117,14 +2117,6 @@ unsigned long vm_unmapped_area(struct vm
return addr;
}
-#ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr) (TASK_SIZE)
-#endif
-
-#ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
-#endif
-
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
_
Patches currently in -mm which might be from christophe.leroy(a)csgroup.eu are
The patch titled
Subject: memcg: sync flush only if periodic flush is delayed
has been removed from the -mm tree. Its filename was
memcg-sync-flush-only-if-periodic-flush-is-delayed.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: memcg: sync flush only if periodic flush is delayed
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems there are workloads which genuinely
do more stat updates than the batch value within a short amount of time.
Since the rstat flush can happen in performance-critical codepaths like
page faults, such workloads can suffer greatly.
This patch fixes this regression by making the rstat flushing conditional
in the performance critical codepaths. More specifically, the kernel
relies on the async periodic rstat flusher to flush the stats and only if
the periodic flusher is delayed by more than twice the amount of its
normal time window then the kernel allows rstat flushing from the
performance critical codepaths.
Now the question: what are the side-effects of this change? The worst
that can happen is that the refault codepath sees lruvec stats that are
up to 4 seconds old, and may cause false (or missed) activations of the
refaulted page, which may under- or over-estimate the workingset size.
That is not very concerning, as the kernel can already miss or do false
activations.
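For concreteness: in the patch below, FLUSH_TIME is 2*HZ (2 seconds) and
the flusher arms flush_next_time = jiffies_64 + 2*FLUSH_TIME, so the
synchronous path in mem_cgroup_flush_stats_delayed() only fires once the
periodic flush is more than about 4 seconds behind; that is where the
4-second staleness bound above comes from.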
There are two more codepaths whose flushing behavior is not changed by
this patch, and we may need to revisit them in the future. One is the
writeback stats used by dirty throttling, and the second is the
deactivation heuristic in reclaim. For now we are keeping an eye on
them; if there are reports of regressions due to these codepaths, we
will reevaluate.
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndg… [1]
Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Daniel Dao <dqminh(a)cloudflare.com>
Tested-by: Ivan Babrou <ivan(a)cloudflare.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Koutný <mkoutny(a)suse.com>
Cc: Frank Hofmann <fhofmann(a)cloudflare.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 12 +++++++++++-
mm/workingset.c | 2 +-
3 files changed, 17 insertions(+), 2 deletions(-)
--- a/include/linux/memcontrol.h~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/include/linux/memcontrol.h
@@ -1012,6 +1012,7 @@ static inline unsigned long lruvec_page_
}
void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_delayed(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
@@ -1455,6 +1456,10 @@ static inline void mem_cgroup_flush_stat
{
}
+static inline void mem_cgroup_flush_stats_delayed(void)
+{
+}
+
static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
--- a/mm/memcontrol.c~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/mm/memcontrol.c
@@ -587,6 +587,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flu
static DEFINE_SPINLOCK(stats_flush_lock);
static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+static u64 flush_next_time;
+
+#define FLUSH_TIME (2UL*HZ)
/*
* Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
@@ -637,6 +640,7 @@ static void __mem_cgroup_flush_stats(voi
if (!spin_trylock_irqsave(&stats_flush_lock, flag))
return;
+ flush_next_time = jiffies_64 + 2*FLUSH_TIME;
cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_threshold, 0);
spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -648,10 +652,16 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
+void mem_cgroup_flush_stats_delayed(void)
+{
+ if (time_after64(jiffies_64, flush_next_time))
+ mem_cgroup_flush_stats();
+}
+
static void flush_memcg_stats_dwork(struct work_struct *w)
{
__mem_cgroup_flush_stats();
- queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+ queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}
/**
--- a/mm/workingset.c~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/mm/workingset.c
@@ -355,7 +355,7 @@ void workingset_refault(struct folio *fo
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_delayed();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
_
Patches currently in -mm which might be from shakeelb(a)google.com are
The patch titled
Subject: mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
has been removed from the -mm tree. Its filename was
mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Subject: mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion which causes the PageHWPoison flag to be set on the wrong
page. One simple result is that the wrong processes can be killed, but a
more serious one is that the actual error is left unhandled, so nothing
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.
Think about the below race window:
CPU 1                                   CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
                                        hugetlb page might be freed to
                                        buddy, or even changed to another
                                        compound page.
get_hwpoison_page -- page is not what we want now...
The current code first does rough prechecks and then reconfirms after
taking the refcount, but it was found that this makes the code overly
complicated, so move the prechecks into a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so it should not be a big problem.
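To make the race concrete, a simplified fragment (not the patch code;
'head' is a hugetlb head page as above) showing why the check and the
action must share one lock section:

/* BROKEN: the page can be freed or demoted between check and action. */
if (PageHeadHuge(head))
	TestSetPageHWPoison(head);	/* 'head' may no longer be the error page */

/* FIXED: do both steps under hugetlb_lock, which free/demotion also take. */
spin_lock_irq(&hugetlb_lock);
if (PageHeadHuge(head))
	TestSetPageHWPoison(head);
spin_unlock_irq(&hugetlb_lock);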
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reported-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 6 +
include/linux/mm.h | 8 ++
mm/hugetlb.c | 10 ++
mm/memory-failure.c | 145 ++++++++++++++++++++++++++------------
4 files changed, 127 insertions(+), 42 deletions(-)
--- a/include/linux/hugetlb.h~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/include/linux/hugetlb.h
@@ -169,6 +169,7 @@ long hugetlb_unreserve_pages(struct inod
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -377,6 +378,11 @@ static inline int get_hwpoison_huge_page
{
return 0;
}
+
+static inline int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
static inline void putback_active_hugepage(struct page *page)
{
--- a/include/linux/mm.h~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/include/linux/mm.h
@@ -3197,6 +3197,14 @@ extern int sysctl_memory_failure_recover
extern void shake_page(struct page *p);
extern atomic_long_t num_poisoned_pages __read_mostly;
extern int soft_offline_page(unsigned long pfn, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
+#else
+static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
+#endif
#ifndef arch_memory_failure
static inline int arch_memory_failure(unsigned long pfn, int flags)
--- a/mm/hugetlb.c~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/mm/hugetlb.c
@@ -6785,6 +6785,16 @@ int get_hwpoison_huge_page(struct page *
return ret;
}
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ int ret;
+
+ spin_lock_irq(&hugetlb_lock);
+ ret = __get_huge_page_for_hwpoison(pfn, flags);
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
--- a/mm/memory-failure.c~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/mm/memory-failure.c
@@ -1498,50 +1498,113 @@ static int try_to_split_thp_page(struct
return 0;
}
-static int memory_failure_hugetlb(unsigned long pfn, int flags)
+/*
+ * Called from hugetlb code with hugetlb_lock held.
+ *
+ * Return values:
+ * 0 - free hugepage
+ * 1 - in-use hugepage
+ * 2 - not a hugepage
+ * -EBUSY - the hugepage is busy (try to retry)
+ * -EHWPOISON - the hugepage is already hwpoisoned
+ */
+int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *head = compound_head(page);
+ int ret = 2; /* fallback to normal page handling */
+ bool count_increased = false;
+
+ if (!PageHeadHuge(head))
+ goto out;
+
+ if (flags & MF_COUNT_INCREASED) {
+ ret = 1;
+ count_increased = true;
+ } else if (HPageFreed(head) || HPageMigratable(head)) {
+ ret = get_page_unless_zero(head);
+ if (ret)
+ count_increased = true;
+ } else {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (TestSetPageHWPoison(head)) {
+ ret = -EHWPOISON;
+ goto out;
+ }
+
+ return ret;
+out:
+ if (count_increased)
+ put_page(head);
+ return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Taking refcount of hugetlb pages needs extra care about race conditions
+ * with basic operations like hugepage allocation/free/demotion.
+ * So some of prechecks for hwpoison (pinning, and testing/setting
+ * PageHWPoison) should be done in single hugetlb_lock range.
+ */
+static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
{
- struct page *p = pfn_to_page(pfn);
- struct page *head = compound_head(p);
int res;
+ struct page *p = pfn_to_page(pfn);
+ struct page *head;
unsigned long page_flags;
+ bool retry = true;
- if (TestSetPageHWPoison(head)) {
- pr_err("Memory failure: %#lx: already hardware poisoned\n",
- pfn);
- res = -EHWPOISON;
- if (flags & MF_ACTION_REQUIRED)
+ *hugetlb = 1;
+retry:
+ res = get_huge_page_for_hwpoison(pfn, flags);
+ if (res == 2) { /* fallback to normal page handling */
+ *hugetlb = 0;
+ return 0;
+ } else if (res == -EHWPOISON) {
+ pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
+ if (flags & MF_ACTION_REQUIRED) {
+ head = compound_head(p);
res = kill_accessing_process(current, page_to_pfn(head), flags);
+ }
return res;
+ } else if (res == -EBUSY) {
+ if (retry) {
+ retry = false;
+ goto retry;
+ }
+ action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+ return res;
+ }
+
+ head = compound_head(p);
+ lock_page(head);
+
+ if (hwpoison_filter(p)) {
+ ClearPageHWPoison(head);
+ res = -EOPNOTSUPP;
+ goto out;
}
num_poisoned_pages_inc();
- if (!(flags & MF_COUNT_INCREASED)) {
- res = get_hwpoison_page(p, flags);
- if (!res) {
- lock_page(head);
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- unlock_page(head);
- return -EOPNOTSUPP;
- }
- unlock_page(head);
- res = MF_FAILED;
- if (__page_handle_poison(p)) {
- page_ref_inc(p);
- res = MF_RECOVERED;
- }
- action_result(pfn, MF_MSG_FREE_HUGE, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
- } else if (res < 0) {
- action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
- return -EBUSY;
+ /*
+ * Handling free hugepage. The possible race with hugepage allocation
+ * or demotion can be prevented by PageHWPoison flag.
+ */
+ if (res == 0) {
+ unlock_page(head);
+ res = MF_FAILED;
+ if (__page_handle_poison(p)) {
+ page_ref_inc(p);
+ res = MF_RECOVERED;
}
+ action_result(pfn, MF_MSG_FREE_HUGE, res);
+ return res == MF_RECOVERED ? 0 : -EBUSY;
}
- lock_page(head);
-
/*
* The page could have changed compound pages due to race window.
* If this happens just bail out.
@@ -1554,14 +1617,6 @@ static int memory_failure_hugetlb(unsign
page_flags = head->flags;
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- put_page(p);
- res = -EOPNOTSUPP;
- goto out;
- }
-
/*
* TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
* simply disable it. In order to make it work properly, we need
@@ -1588,6 +1643,12 @@ out:
unlock_page(head);
return res;
}
+#else
+static inline int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
+{
+ return 0;
+}
+#endif
static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
@@ -1712,6 +1773,7 @@ int memory_failure(unsigned long pfn, in
int res = 0;
unsigned long page_flags;
bool retry = true;
+ int hugetlb = 0;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1739,10 +1801,9 @@ int memory_failure(unsigned long pfn, in
}
try_again:
- if (PageHuge(p)) {
- res = memory_failure_hugetlb(pfn, flags);
+ res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
+ if (hugetlb)
goto unlock_mutex;
- }
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
_
Patches currently in -mm which might be from naoya.horiguchi(a)nec.com are
mm-hwpoison-put-page-in-already-hwpoisoned-case-with-mf_count_increased.patch
revert-mm-memory-failurec-fix-race-with-changing-page-compound-again.patch
mm-hugetlb-hwpoison-separate-branch-for-free-and-in-use-hugepage.patch
Since linux-stable 5.10.112 commit:
1ff5359afa5e ("net: micrel: fix KS8851_MLL Kconfig")
it has not been possible to select the KS8851_MLL symbol.
This is because the commit adds a dependency on the Kconfig symbol
PTP_1588_CLOCK_OPTIONAL,
which was added in the Linux upstream commit:
e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies")
and that commit is not part of stable 5.10.112.
This is a note to let you know that I've just added the patch titled
usb: dwc3: core: Only handle soft-reset in DCTL
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From f4fd84ae0765a80494b28c43b756a95100351a94 Mon Sep 17 00:00:00 2001
From: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Date: Thu, 21 Apr 2022 19:33:56 -0700
Subject: usb: dwc3: core: Only handle soft-reset in DCTL
Make sure not to set the run_stop bit or a link state change request while
initiating soft-reset. A register read-modify-write operation may
unintentionally start the controller with its previous DCTL value before
the initialization completes, which can cause initialization failure.
Fixes: f59dcab17629 ("usb: dwc3: core: improve reset sequence")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Link: https://lore.kernel.org/r/6aecbd78328f102003d40ccf18ceeebd411d3703.16505947…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 1ca9dae57855..d28cd1a6709b 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -274,7 +274,8 @@ int dwc3_core_soft_reset(struct dwc3 *dwc)
reg = dwc3_readl(dwc->regs, DWC3_DCTL);
reg |= DWC3_DCTL_CSFTRST;
- dwc3_writel(dwc->regs, DWC3_DCTL, reg);
+ reg &= ~DWC3_DCTL_RUN_STOP;
+ dwc3_gadget_dctl_write_safe(dwc, reg);
/*
* For DWC_usb31 controller 1.90a and later, the DCTL.CSFRST bit
--
2.36.0
commit 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members before
initialization") attempted to fix a race condition that lead to a NULL
pointer, but in the process caused a regression for _AEI/_EVT declared
GPIOs. This manifests in messages showing deferred probing while trying
to allocate IRQs like so:
[ 0.688318] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0000 to IRQ, err -517
[ 0.688337] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x002C to IRQ, err -517
[ 0.688348] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003D to IRQ, err -517
[ 0.688359] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003E to IRQ, err -517
[ 0.688369] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003A to IRQ, err -517
[ 0.688379] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003B to IRQ, err -517
[ 0.688389] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0002 to IRQ, err -517
[ 0.688399] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0011 to IRQ, err -517
[ 0.688410] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0012 to IRQ, err -517
[ 0.688420] amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0007 to IRQ, err -517
The code for walking _AEI doesn't handle deferred probing and so this leads
to non-functional GPIO interrupts.
Fix this issue by moving the call to `acpi_gpiochip_request_interrupts` to
occur after gc->irq.initialized is set.
Cc: Shreeya Patel <shreeya.patel(a)collabora.com>
Cc: stable(a)vger.kernel.org
Fixes: 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members before initialization")
Reported-by: Mario Limonciello <mario.limonciello(a)amd.com>
Link: https://lore.kernel.org/linux-gpio/BL1PR12MB51577A77F000A008AA694675E2EF9@B…
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
drivers/gpio/gpiolib.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index 085348e08986..b7694171655c 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1601,8 +1601,6 @@ static int gpiochip_add_irqchip(struct gpio_chip *gc,
gpiochip_set_irq_hooks(gc);
- acpi_gpiochip_request_interrupts(gc);
-
/*
* Using barrier() here to prevent compiler from reordering
* gc->irq.initialized before initialization of above
@@ -1612,6 +1610,8 @@ static int gpiochip_add_irqchip(struct gpio_chip *gc,
gc->irq.initialized = true;
+ acpi_gpiochip_request_interrupts(gc);
+
return 0;
}
--
2.34.1
The bug is here:
if (&req->req == u_req) {
The list iterator 'req' will point to a bogus position containing
HEAD if the list is empty or no element is found. This case must
be checked before any use of the iterator; otherwise the
'if (&req->req == u_req) {' check could in theory be bypassed, if the
'*u_req' object happens to be allocated at the same address as '&req->req'.
To fix this bug, move all the handling inside the loop and return 0
there; otherwise return an error.
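For illustration, a minimal sketch of the safe pattern with hypothetical
types and names (not from the driver): act on the match inside the loop
and never compare against or dereference the iterator afterwards.

#include <linux/list.h>

struct item {
	int id;
	struct list_head node;
};

/* 'pos' is only meaningful inside the loop body; after the loop it is
 * container_of(head, ...), a bogus pointer that must not be used. */
static struct item *find_item(struct list_head *head, int id)
{
	struct item *pos;

	list_for_each_entry(pos, head, node) {
		if (pos->id == id)
			return pos;	/* act on the match here */
	}
	return NULL;		/* signal "not found" without touching 'pos' */
}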
Cc: stable(a)vger.kernel.org
Fixes: 7ecca2a4080cb ("usb/gadget: Add driver for Aspeed SoC virtual hub")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong(a)gmail.com>
---
drivers/usb/gadget/udc/aspeed-vhub/epn.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/drivers/usb/gadget/udc/aspeed-vhub/epn.c b/drivers/usb/gadget/udc/aspeed-vhub/epn.c
index 917892ca8753..aae4ce3e1029 100644
--- a/drivers/usb/gadget/udc/aspeed-vhub/epn.c
+++ b/drivers/usb/gadget/udc/aspeed-vhub/epn.c
@@ -468,27 +468,24 @@ static int ast_vhub_epn_dequeue(struct usb_ep* u_ep, struct usb_request *u_req)
struct ast_vhub *vhub = ep->vhub;
struct ast_vhub_req *req;
unsigned long flags;
- int rc = -EINVAL;
spin_lock_irqsave(&vhub->lock, flags);
/* Make sure it's actually queued on this endpoint */
list_for_each_entry (req, &ep->queue, queue) {
- if (&req->req == u_req)
- break;
- }
-
- if (&req->req == u_req) {
- EPVDBG(ep, "dequeue req @%p active=%d\n",
- req, req->active);
- if (req->active)
- ast_vhub_stop_active_req(ep, true);
- ast_vhub_done(ep, req, -ECONNRESET);
- rc = 0;
+ if (&req->req == u_req) {
+ EPVDBG(ep, "dequeue req @%p active=%d\n",
+ req, req->active);
+ if (req->active)
+ ast_vhub_stop_active_req(ep, true);
+ ast_vhub_done(ep, req, -ECONNRESET);
+ spin_unlock_irqrestore(&vhub->lock, flags);
+ return 0;
+ }
}
spin_unlock_irqrestore(&vhub->lock, flags);
- return rc;
+ return -EINVAL;
}
void ast_vhub_update_epn_stall(struct ast_vhub_ep *ep)
--
2.17.1
On Fri, Apr 22, 2022 at 07:09:34AM -0400, Joshua Freedman wrote:
> The kernel I was using as good for audio is now missing; It was
> 5.16.11-76051611-generic But the only release avail now is
> 5.16.11-051611-generic and audio does not work.
Those look like distro-specific kernels, please contact your distro for
support of this, there is nothing we can do here about them.
good luck!
greg k-h
On Sun, Apr 17, 2022 at 02:32:03PM -0700, KernelCI bot wrote:
The KernelCI bisection bot found that commit 6026d4032dbbe3 ("arm:
extend pfn_valid to take into account freed memory map alignment")
triggered a regression in v5.4.x on 32 bit ARM with a qemu platform
booting UEFI firmware. We try to dereference an invalid pointer parsing
the DMI tables:
<1>[ 0.084476] 8<--- cut here ---
<1>[ 0.084595] Unable to handle kernel paging request at virtual address dfb76000
<1>[ 0.084938] pgd = (ptrval)
<1>[ 0.085038] [dfb76000] *pgd=5f7fe801, *pte=00000000, *ppte=00000000
...
<4>[ 0.093923] [<c0ed6ce8>] (memcpy) from [<c16a06f8>] (dmi_setup+0x60/0x418)
<4>[ 0.094204] [<c16a06f8>] (dmi_setup) from [<c16a38d4>] (arm_dmi_init+0x8/0x10)
<4>[ 0.094408] [<c16a38d4>] (arm_dmi_init) from [<c0302e9c>] (do_one_initcall+0x50/0x228)
<4>[ 0.094619] [<c0302e9c>] (do_one_initcall) from [<c16011e4>] (kernel_init_freeable+0x15c/0x1f8)
<4>[ 0.094841] [<c16011e4>] (kernel_init_freeable) from [<c0f028cc>] (kernel_init+0x8/0x10c)
<4>[ 0.095057] [<c0f028cc>] (kernel_init) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
This particular bisect is from GICv2 but GICv3 shows the same issue, and
it persists in the latest stable -rc:
https://linux.kernelci.org/test/job/stable-rc/branch/linux-5.4.y/kernel/v5.…
A quick check seems to show that other stable branches are unaffected.
I've left all the context from the report (including full boot logs and
a Reported-by tag) below:
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> * This automated bisection report was sent to you on the basis *
> * that you may be involved with the breaking commit it has *
> * found. No manual investigation has been done to verify it, *
> * and the root cause of the problem may be somewhere else. *
> * *
> * If you do send a fix, please include this trailer: *
> * Reported-by: "kernelci.org bot" <bot(a)kernelci.org> *
> * *
> * Hope this helps! *
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
>
> stable-rc/linux-5.4.y bisection: baseline.login on qemu_arm-virt-gicv2-uefi
>
> Summary:
> Start: e7f5213d755bc Linux 5.4.189
> Plain log: https://storage.kernelci.org/stable-rc/linux-5.4.y/v5.4.189/arm/multi_v7_de…
> HTML log: https://storage.kernelci.org/stable-rc/linux-5.4.y/v5.4.189/arm/multi_v7_de…
> Result: 6026d4032dbbe arm: extend pfn_valid to take into account freed memory map alignment
>
> Checks:
> revert: PASS
> verify: PASS
>
> Parameters:
> Tree: stable-rc
> URL: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> Branch: linux-5.4.y
> Target: qemu_arm-virt-gicv2-uefi
> CPU arch: arm
> Lab: lab-baylibre
> Compiler: gcc-10
> Config: multi_v7_defconfig
> Test case: baseline.login
>
> Breaking commit found:
>
> -------------------------------------------------------------------------------
> commit 6026d4032dbbe3d7f4ac2c8daa923fe74dcf41c4
> Author: Mike Rapoport <rppt(a)linux.ibm.com>
> Date: Mon Dec 13 16:57:09 2021 +0800
>
> arm: extend pfn_valid to take into account freed memory map alignment
>
> commit a4d5613c4dc6d413e0733e37db9d116a2a36b9f3 upstream.
>
> When unused memory map is freed the preserved part of the memory map is
> extended to match pageblock boundaries because lots of core mm
> functionality relies on homogeneity of the memory map within pageblock
> boundaries.
>
> Since pfn_valid() is used to check whether there is a valid memory map
> entry for a PFN, make it return true also for PFNs that have memory map
> entries even if there is no actual memory populated there.
>
> Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
> Tested-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
> Tested-by: Tony Lindgren <tony(a)atomide.com>
> Link: https://lore.kernel.org/lkml/20210630071211.21011-1-rppt@kernel.org/
> Signed-off-by: Mark-PK Tsai <mark-pk.tsai(a)mediatek.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>
> diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
> index 5635bcc419af8..ff2cd985d20e0 100644
> --- a/arch/arm/mm/init.c
> +++ b/arch/arm/mm/init.c
> @@ -176,11 +176,22 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max_low,
> int pfn_valid(unsigned long pfn)
> {
> phys_addr_t addr = __pfn_to_phys(pfn);
> + unsigned long pageblock_size = PAGE_SIZE * pageblock_nr_pages;
>
> if (__phys_to_pfn(addr) != pfn)
> return 0;
>
> - return memblock_is_map_memory(__pfn_to_phys(pfn));
> + /*
> + * If address less than pageblock_size bytes away from a present
> + * memory chunk there still will be a memory map entry for it
> + * because we round freed memory map to the pageblock boundaries.
> + */
> + if (memblock_overlaps_region(&memblock.memory,
> + ALIGN_DOWN(addr, pageblock_size),
> + pageblock_size))
> + return 1;
> +
> + return 0;
> }
> EXPORT_SYMBOL(pfn_valid);
> #endif
> -------------------------------------------------------------------------------
>
>
> Git bisection log:
>
> -------------------------------------------------------------------------------
> git bisect start
> # good: [7f70428f0109470aa9177d1a9e5ce02de736f480] Linux 5.4.165
> git bisect good 7f70428f0109470aa9177d1a9e5ce02de736f480
> # bad: [e7f5213d755bc34f366d36f08825c0b446117d96] Linux 5.4.189
> git bisect bad e7f5213d755bc34f366d36f08825c0b446117d96
> # bad: [902528183f4d94945a0c1ed6048d4a5d4e1e712e] mmc: block: fix read single on recovery logic
> git bisect bad 902528183f4d94945a0c1ed6048d4a5d4e1e712e
> # bad: [c7e4004b38aa7ad482fc46ab76e28879f84ec77e] batman-adv: allow netlink usage in unprivileged containers
> git bisect bad c7e4004b38aa7ad482fc46ab76e28879f84ec77e
> # bad: [db0c834abbc186bda56b1e13b4eb61f7126c12c5] rndis_host: support Hytera digital radios
> git bisect bad db0c834abbc186bda56b1e13b4eb61f7126c12c5
> # bad: [0b01c51c4f47f59ad7eb1ea5bac47fab14b188a5] qlcnic: potential dereference null pointer of rx_queue->page_ring
> git bisect bad 0b01c51c4f47f59ad7eb1ea5bac47fab14b188a5
> # bad: [e7660f9535ade84ea57aed1c55d102bfb23dd2ff] mac80211: fix lookup when adding AddBA extension element
> git bisect bad e7660f9535ade84ea57aed1c55d102bfb23dd2ff
> # bad: [802a1a8501563714a5fe8824f4ed27fec04a0719] firmware: arm_scpi: Fix string overflow in SCPI genpd driver
> git bisect bad 802a1a8501563714a5fe8824f4ed27fec04a0719
> # good: [2fb8e4267c47d69d6bada6310607ea3762f6c962] KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req
> git bisect good 2fb8e4267c47d69d6bada6310607ea3762f6c962
> # good: [492f4d3cde95aadcd1d070db5dd4796ae8019165] memblock: ensure there is no overflow in memblock_overlaps_region()
> git bisect good 492f4d3cde95aadcd1d070db5dd4796ae8019165
> # bad: [e8ef940326efd17ca7fdd3cb8791c29a24b04f28] Linux 5.4.167
> git bisect bad e8ef940326efd17ca7fdd3cb8791c29a24b04f28
> # bad: [c97579584fa88df65ff6e4653b175acba154862d] arm: ioremap: don't abuse pfn_valid() to check if pfn is in RAM
> git bisect bad c97579584fa88df65ff6e4653b175acba154862d
> # bad: [6026d4032dbbe3d7f4ac2c8daa923fe74dcf41c4] arm: extend pfn_valid to take into account freed memory map alignment
> git bisect bad 6026d4032dbbe3d7f4ac2c8daa923fe74dcf41c4
> # first bad commit: [6026d4032dbbe3d7f4ac2c8daa923fe74dcf41c4] arm: extend pfn_valid to take into account freed memory map alignment
> -------------------------------------------------------------------------------
>
>
From: Hongyu Xie <xiehongyu1(a)kylinos.cn>
pl2303.c doesn't have a reset_resume handler for hibernation.
So needs_binding will be set to 1 during hibernation,
usb_forced_unbind_intf will be called, and the port minor
will be released (the x in ttyUSBx).
This works fine if you have only one USB-to-serial device.
Now assume you have two USB-to-serial devices, named A and B.
A gets the smaller minor (ttyUSB0), B gets the bigger one.
Start to hibernate. While your PC is in hibernation,
unplug device A. Then wake up your PC by pressing the
power button. After the whole system has woken up, device
B gets ttyUSB0. This will cause a problem if you were
using those two ports (e.g. had two minicom processes open)
before hibernation.
So the reset_resume member is needed in the usb_serial_driver
pl2303_device.
The code in pl2303_reset_resume is borrowed from pl2303_open.
As a matter of fact, all drivers under drivers/usb/serial
have the same problem except ch341.c.
Cc: stable(a)vger.kernel.org
Signed-off-by: Hongyu Xie <xiehongyu1(a)kylinos.cn>
Reported-by: sheng.huang <sheng.huang(a)ecastech.com>
---
drivers/usb/serial/pl2303.c | 48 +++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
index 88b284d61681..7cc05123b88c 100644
--- a/drivers/usb/serial/pl2303.c
+++ b/drivers/usb/serial/pl2303.c
@@ -1218,6 +1218,53 @@ static void pl2303_process_read_urb(struct urb *urb)
tty_flip_buffer_push(&port->port);
}
+static int pl2303_configure(struct usb_serial *serial, struct pl2303_serial_private *priv)
+{
+ struct usb_serial_port *port = serial->port[0];
+
+ if (priv->quirks & PL2303_QUIRK_LEGACY) {
+ usb_clear_halt(serial->dev, port->write_urb->pipe);
+ usb_clear_halt(serial->dev, port->read_urb->pipe);
+ } else {
+ /* reset upstream data pipes */
+ if (priv->type == &pl2303_type_data[TYPE_HXN])
+ pl2303_vendor_write(serial, PL2303_HXN_RESET_REG,
+ PL2303_HXN_RESET_UPSTREAM_PIPE |
+ PL2303_HXN_RESET_DOWNSTREAM_PIPE);
+ else {
+ pl2303_vendor_write(serial, 8, 0);
+ pl2303_vendor_write(serial, 9, 0);
+ }
+ }
+ return 0;
+}
+
+static int pl2303_reset_resume(struct usb_serial *serial)
+{
+ struct usb_serial_port *port = serial->port[0];
+ struct pl2303_serial_private *priv = usb_get_serial_port_data(port);
+ struct tty_struct *tty = tty_port_tty_get(&port->port);
+ int ret;
+
+ /* reconfigure pl2303 serial port after bus-reset */
+ pl2303_configure(serial, priv);
+
+ /* Setup termios */
+ if (tty)
+ pl2303_set_termios(tty, port, NULL);
+
+ if (tty_port_initialized(&port->port)) {
+ ret = usb_submit_urb(port->interrupt_in_urb, GFP_NOIO);
+ if (ret) {
+ dev_err(&port->dev, "failed to submit interrupt urb: %d\n",
+ ret);
+ return ret;
+ }
+ }
+
+ return usb_serial_generic_resume(serial);
+}
+
static struct usb_serial_driver pl2303_device = {
.driver = {
.owner = THIS_MODULE,
@@ -1246,6 +1293,7 @@ static struct usb_serial_driver pl2303_device = {
.release = pl2303_release,
.port_probe = pl2303_port_probe,
.port_remove = pl2303_port_remove,
+ .reset_resume = pl2303_reset_resume,
};
static struct usb_serial_driver * const serial_drivers[] = {
--
2.25.1
Make sure not to set the run_stop bit or a link state change request while
initiating soft-reset. A register read-modify-write operation may
unintentionally start the controller with its previous DCTL value before
the initialization completes, which can cause initialization failure.
Fixes: f59dcab17629 ("usb: dwc3: core: improve reset sequence")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
drivers/usb/dwc3/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 1ca9dae57855..d28cd1a6709b 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -274,7 +274,8 @@ int dwc3_core_soft_reset(struct dwc3 *dwc)
reg = dwc3_readl(dwc->regs, DWC3_DCTL);
reg |= DWC3_DCTL_CSFTRST;
- dwc3_writel(dwc->regs, DWC3_DCTL, reg);
+ reg &= ~DWC3_DCTL_RUN_STOP;
+ dwc3_gadget_dctl_write_safe(dwc, reg);
/*
* For DWC_usb31 controller 1.90a and later, the DCTL.CSFRST bit
base-commit: bf95c4d4630c7a2c16e7b424fdea5177d9ce0864
--
2.28.0
If the file preallocated blocks and was fsync'ed, we should not truncate
them during roll-forward recovery, which will recover i_size correctly.
Fixes: d4dd19ec1ea0 ("f2fs: do not expose unwritten blocks to user by DIO")
Cc: <stable(a)vger.kernel.org> # 5.17+
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
---
fs/f2fs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 71f232dcf3c2..83639238a1fe 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -550,7 +550,8 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
}
f2fs_set_inode_flags(inode);
- if (file_should_truncate(inode)) {
+ if (file_should_truncate(inode) &&
+ !is_sbi_flag_set(sbi, SBI_POR_DOING)) {
ret = f2fs_truncate(inode);
if (ret)
goto bad_inode;
--
2.36.0.rc2.479.g8af0fa9b8e-goog
From: Alistair Popple <apopple(a)nvidia.com>
Subject: mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use. Consider the following sequence:
CPU0 - mn_tree_inv_end() CPU1 - mmu_interval_notifier_remove()
----------------------------------- ------------------------------------
spin_lock(subscriptions->lock);
seq = subscriptions->invalidate_seq;
spin_lock(subscriptions->lock); spin_unlock(subscriptions->lock);
subscriptions->invalidate_seq++;
wait_event(invalidate_seq != seq);
return;
interval_tree_remove(interval_sub); kfree(interval_sub);
spin_unlock(subscriptions->lock);
wake_up_all();
As the wait_event() condition is true it will return immediately. This
can lead to use-after-free type errors if the caller frees the data
structure containing the interval notifier subscription while it is still
on a deferred list. Fix this by taking the appropriate lock when reading
invalidate_seq to ensure proper synchronisation.
I observed this whilst running stress testing during some development.
You do have to be pretty unlucky, but it leads to the usual problems of
use-after-free (memory corruption, kernel crash, difficult to diagnose
WARN_ON, etc).
Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier")
Signed-off-by: Alistair Popple <apopple(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Christian K��nig <christian.koenig(a)amd.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmu_notifier.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
--- a/mm/mmu_notifier.c~mm-mmu_notifierc-fix-race-in-mmu_interval_notifier_remove
+++ a/mm/mmu_notifier.c
@@ -1036,6 +1036,18 @@ int mmu_interval_notifier_insert_locked(
}
EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
+static bool
+mmu_interval_seq_released(struct mmu_notifier_subscriptions *subscriptions,
+ unsigned long seq)
+{
+ bool ret;
+
+ spin_lock(&subscriptions->lock);
+ ret = subscriptions->invalidate_seq != seq;
+ spin_unlock(&subscriptions->lock);
+ return ret;
+}
+
/**
* mmu_interval_notifier_remove - Remove a interval notifier
* @interval_sub: Interval subscription to unregister
@@ -1083,7 +1095,7 @@ void mmu_interval_notifier_remove(struct
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
if (seq)
wait_event(subscriptions->wq,
- READ_ONCE(subscriptions->invalidate_seq) != seq);
+ mmu_interval_seq_released(subscriptions, seq));
/* pairs with mmgrab in mmu_interval_notifier_insert() */
mmdrop(mm);
_
From: Nico Pache <npache(a)redhat.com>
Subject: oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list head; the kernel does not keep a copy of the robust list and
instead references a userspace address to maintain the robustness during a
process death. A race can occur between exit_mm and the oom reaper that
allows the oom reaper to free the memory of the futex robust list before
the exit path has handled the futex death:
CPU1 CPU2
------------------------------------------------------------------------
page_fault
do_exit "signal"
wake_oom_reaper
oom_reaper
oom_reap_task_mm (invalidates mm)
exit_mm
exit_mm_release
futex_exit_release
futex_cleanup
exit_robust_list
get_user (EFAULT- can't access memory)
If the get_user EFAULT's, the kernel will be unable to recover the waiters
on the robust_list, leaving userspace mutexes hung indefinitely.
Delay the OOM reaper, allowing more time for the exit path to perform the
futex cleanup.
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Based on a patch by Michal Hocko.
[1] https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz(a)redhat.com>
Signed-off-by: Nico Pache <npache(a)redhat.com>
Co-developed-by: Joel Savitz <jsavitz(a)redhat.com>
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Rafael Aquini <aquini(a)redhat.com>
Cc: Waiman Long <longman(a)redhat.com>
Cc: Herton R. Krzesinski <herton(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Joel Savitz <jsavitz(a)redhat.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/sched.h | 1
mm/oom_kill.c | 54 +++++++++++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h~oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup
+++ a/include/linux/sched.h
@@ -1443,6 +1443,7 @@ struct task_struct {
int pagefault_disabled;
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
+ struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
--- a/mm/oom_kill.c~oom_killc-futex-delay-the-oom-reaper-to-allow-time-for-proper-futex-cleanup
+++ a/mm/oom_kill.c
@@ -632,7 +632,7 @@ done:
*/
set_bit(MMF_OOM_SKIP, &mm->flags);
- /* Drop a reference taken by wake_oom_reaper */
+ /* Drop a reference taken by queue_oom_reaper */
put_task_struct(tsk);
}
@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
struct task_struct *tsk = NULL;
wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
- spin_lock(&oom_reaper_lock);
+ spin_lock_irq(&oom_reaper_lock);
if (oom_reaper_list != NULL) {
tsk = oom_reaper_list;
oom_reaper_list = tsk->oom_reaper_list;
}
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irq(&oom_reaper_lock);
if (tsk)
oom_reap_task(tsk);
@@ -658,22 +658,48 @@ static int oom_reaper(void *unused)
return 0;
}
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct timer_list *timer)
{
- /* mm is already queued? */
- if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ struct task_struct *tsk = container_of(timer, struct task_struct,
+ oom_reaper_timer);
+ struct mm_struct *mm = tsk->signal->oom_mm;
+ unsigned long flags;
+
+ /* The victim managed to terminate on its own - see exit_mmap */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ put_task_struct(tsk);
return;
+ }
- get_task_struct(tsk);
-
- spin_lock(&oom_reaper_lock);
+ spin_lock_irqsave(&oom_reaper_lock, flags);
tsk->oom_reaper_list = oom_reaper_list;
oom_reaper_list = tsk;
- spin_unlock(&oom_reaper_lock);
+ spin_unlock_irqrestore(&oom_reaper_lock, flags);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
}
+/*
+ * Give the OOM victim time to exit naturally before invoking the oom_reaping.
+ * The timers timeout is arbitrary... the longer it is, the longer the worst
+ * case scenario for the OOM can take. If it is too small, the oom_reaper can
+ * get in the way and release resources needed by the process exit path.
+ * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
+ * before the exit path is able to wake the futex waiters.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_reaper(struct task_struct *tsk)
+{
+ /* mm is already queued? */
+ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+ return;
+
+ get_task_struct(tsk);
+ timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
+ tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+ add_timer(&tsk->oom_reaper_timer);
+}
+
static int __init oom_init(void)
{
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +707,7 @@ static int __init oom_init(void)
}
subsys_initcall(oom_init)
#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk)
{
}
#endif /* CONFIG_MMU */
@@ -932,7 +958,7 @@ static void __oom_kill_process(struct ta
rcu_read_unlock();
if (can_oom_reap)
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
mmdrop(mm);
put_task_struct(victim);
@@ -968,7 +994,7 @@ static void oom_kill_process(struct oom_
task_lock(victim);
if (task_will_free_mem(victim)) {
mark_oom_victim(victim);
- wake_oom_reaper(victim);
+ queue_oom_reaper(victim);
task_unlock(victim);
put_task_struct(victim);
return;
@@ -1067,7 +1093,7 @@ bool out_of_memory(struct oom_control *o
*/
if (task_will_free_mem(current)) {
mark_oom_victim(current);
- wake_oom_reaper(current);
+ queue_oom_reaper(current);
return true;
}
_
From: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Subject: mm, hugetlb: allow for "high" userspace addresses
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are optionally
supported on the system and have to be requested via a hint mechanism
("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out of
mm/mmap.c into include/linux/sched/mm.h
If these macros are not defined in architectural code, they default to
(TASK_SIZE) and (base), so they should not introduce any behavioural
changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
Catalin (ARM64) said
: We should have fixed hugetlb_get_unmapped_area() as well when we added
: support for 52-bit VA. The reason for commit f6795053dac8 was to prevent
: normal mmap() from returning addresses above 48-bit by default as some
: user-space had hard assumptions about this.
:
: It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
: but I doubt anyone would notice. It's more likely that the current
: behaviour would cause issues, so I'd rather have them consistent.
:
: Basically when arm64 gained support for 52-bit addresses we did not
: want user-space calling mmap() to suddenly get such high addresses,
: otherwise we could have inadvertently broken some programs (similar
: behaviour to x86 here). Hence we added commit f6795053dac8. But we
: missed hugetlbfs which could still get such high mmap() addresses. So
: in theory that's a potential regression that should have been addressed
: at the same time as commit f6795053dac8 (and before arm64 enabled
: 52-bit addresses).
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.16500337…
Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses")
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Steve Capper <steve.capper(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: <stable(a)vger.kernel.org> [5.0.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 9 +++++----
include/linux/sched/mm.h | 8 ++++++++
mm/mmap.c | 8 --------
3 files changed, 13 insertions(+), 12 deletions(-)
--- a/fs/hugetlbfs/inode.c~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/fs/hugetlbfs/inode.c
@@ -206,7 +206,7 @@ hugetlb_get_unmapped_area_bottomup(struc
info.flags = 0;
info.length = len;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
return vm_unmapped_area(&info);
@@ -222,7 +222,7 @@ hugetlb_get_unmapped_area_topdown(struct
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
- info.high_limit = current->mm->mmap_base;
+ info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
addr = vm_unmapped_area(&info);
@@ -237,7 +237,7 @@ hugetlb_get_unmapped_area_topdown(struct
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
info.low_limit = current->mm->mmap_base;
- info.high_limit = TASK_SIZE;
+ info.high_limit = arch_get_mmap_end(addr);
addr = vm_unmapped_area(&info);
}
@@ -251,6 +251,7 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct hstate *h = hstate_file(file);
+ const unsigned long mmap_end = arch_get_mmap_end(addr);
if (len & ~huge_page_mask(h))
return -EINVAL;
@@ -266,7 +267,7 @@ hugetlb_get_unmapped_area(struct file *f
if (addr) {
addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
+ if (mmap_end - len >= addr &&
(!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
--- a/include/linux/sched/mm.h~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/include/linux/sched/mm.h
@@ -136,6 +136,14 @@ static inline void mm_update_next_owner(
#endif /* CONFIG_MEMCG */
#ifdef CONFIG_MMU
+#ifndef arch_get_mmap_end
+#define arch_get_mmap_end(addr) (TASK_SIZE)
+#endif
+
+#ifndef arch_get_mmap_base
+#define arch_get_mmap_base(addr, base) (base)
+#endif
+
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
extern unsigned long
--- a/mm/mmap.c~mm-hugetlbfs-allow-for-high-userspace-addresses
+++ a/mm/mmap.c
@@ -2117,14 +2117,6 @@ unsigned long vm_unmapped_area(struct vm
return addr;
}
-#ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr) (TASK_SIZE)
-#endif
-
-#ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
-#endif
-
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
_
From: Shakeel Butt <shakeelb(a)google.com>
Subject: memcg: sync flush only if periodic flush is delayed
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems there are workloads which genuinely
do stat updates larger than the batch value within a short amount of
time. Since the rstat flush can happen in performance-critical codepaths
like page faults, such workloads can suffer greatly.
This patch fixes this regression by making the rstat flushing conditional
in the performance critical codepaths. More specifically, the kernel
relies on the async periodic rstat flusher to flush the stats and only if
the periodic flusher is delayed by more than twice the amount of its
normal time window then the kernel allows rstat flushing from the
performance critical codepaths.
Now the question: what are the side-effects of this change? The worst
that can happen is that the refault codepath will see 4-second-old lruvec
stats and may cause false (or missed) activations of the refaulted page,
which may under- or overestimate the workingset size. That is not very
concerning, as the kernel can already miss or do false activations.
There are two more codepaths whose flushing behavior is not changed by
this patch and which we may need to return to in the future. One is the
writeback stats used by dirty throttling and the second is the
deactivation heuristic in reclaim. For now we are keeping an eye on them,
and if regressions are reported in these codepaths, we will reevaluate.
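The gating logic, gathered from the hunks below into one sketch (the
spinlock and trylock in the real flusher are elided):

#define FLUSH_TIME (2UL*HZ)
static u64 flush_next_time;

static void __mem_cgroup_flush_stats(void)
{
	/* tolerate one missed periodic flush before hot paths step in */
	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
}

/* called from performance-critical paths such as workingset_refault() */
void mem_cgroup_flush_stats_delayed(void)
{
	/* flush synchronously only if the periodic flusher is >4s late */
	if (time_after64(jiffies_64, flush_next_time))
		mem_cgroup_flush_stats();
}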
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndg… [1]
Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Daniel Dao <dqminh(a)cloudflare.com>
Tested-by: Ivan Babrou <ivan(a)cloudflare.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Koutný <mkoutny(a)suse.com>
Cc: Frank Hofmann <fhofmann(a)cloudflare.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 12 +++++++++++-
mm/workingset.c | 2 +-
3 files changed, 17 insertions(+), 2 deletions(-)
--- a/include/linux/memcontrol.h~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/include/linux/memcontrol.h
@@ -1012,6 +1012,7 @@ static inline unsigned long lruvec_page_
}
void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_delayed(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
@@ -1455,6 +1456,10 @@ static inline void mem_cgroup_flush_stat
{
}
+static inline void mem_cgroup_flush_stats_delayed(void)
+{
+}
+
static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
--- a/mm/memcontrol.c~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/mm/memcontrol.c
@@ -587,6 +587,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flu
static DEFINE_SPINLOCK(stats_flush_lock);
static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+static u64 flush_next_time;
+
+#define FLUSH_TIME (2UL*HZ)
/*
* Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
@@ -637,6 +640,7 @@ static void __mem_cgroup_flush_stats(voi
if (!spin_trylock_irqsave(&stats_flush_lock, flag))
return;
+ flush_next_time = jiffies_64 + 2*FLUSH_TIME;
cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_threshold, 0);
spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -648,10 +652,16 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
+void mem_cgroup_flush_stats_delayed(void)
+{
+ if (time_after64(jiffies_64, flush_next_time))
+ mem_cgroup_flush_stats();
+}
+
static void flush_memcg_stats_dwork(struct work_struct *w)
{
__mem_cgroup_flush_stats();
- queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+ queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}
/**
--- a/mm/workingset.c~memcg-sync-flush-only-if-periodic-flush-is-delayed
+++ a/mm/workingset.c
@@ -355,7 +355,7 @@ void workingset_refault(struct folio *fo
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_delayed();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
_
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Subject: mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which can cause the PageHWPoison flag to be set on the
wrong page. One simple result is that the wrong processes can be killed,
but another (more serious) one is that the actual error is left
unhandled, so nothing prevents later access to it, and that might lead
to more serious results like consuming corrupted data.
Think about the below race window:
CPU 1 CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
hugetlb page might be freed to
buddy, or even changed to another
compound page.
get_hwpoison_page -- page is not what we want now...
The current code first does rough prechecks and then reconfirms after
taking a refcount, but this makes the code overly complicated, so move
the prechecks into a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so it should not be a big problem.
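The shape of the fix, condensed from the diff below: the refcount pin
and all prechecks happen inside one hugetlb_lock critical section, so
the page cannot be freed or demoted between the PageHeadHuge() test and
setting PageHWPoison:

int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
{
	int ret;

	spin_lock_irq(&hugetlb_lock);
	/* checks PageHeadHuge(), pins the page, sets PageHWPoison */
	ret = __get_huge_page_for_hwpoison(pfn, flags);
	spin_unlock_irq(&hugetlb_lock);
	return ret;
}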
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reported-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 6 +
include/linux/mm.h | 8 ++
mm/hugetlb.c | 10 ++
mm/memory-failure.c | 145 ++++++++++++++++++++++++++------------
4 files changed, 127 insertions(+), 42 deletions(-)
--- a/include/linux/hugetlb.h~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/include/linux/hugetlb.h
@@ -169,6 +169,7 @@ long hugetlb_unreserve_pages(struct inod
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -377,6 +378,11 @@ static inline int get_hwpoison_huge_page
{
return 0;
}
+
+static inline int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
static inline void putback_active_hugepage(struct page *page)
{
--- a/include/linux/mm.h~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/include/linux/mm.h
@@ -3197,6 +3197,14 @@ extern int sysctl_memory_failure_recover
extern void shake_page(struct page *p);
extern atomic_long_t num_poisoned_pages __read_mostly;
extern int soft_offline_page(unsigned long pfn, int flags);
+#ifdef CONFIG_MEMORY_FAILURE
+extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
+#else
+static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ return 0;
+}
+#endif
#ifndef arch_memory_failure
static inline int arch_memory_failure(unsigned long pfn, int flags)
--- a/mm/hugetlb.c~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/mm/hugetlb.c
@@ -6785,6 +6785,16 @@ int get_hwpoison_huge_page(struct page *
return ret;
}
+int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ int ret;
+
+ spin_lock_irq(&hugetlb_lock);
+ ret = __get_huge_page_for_hwpoison(pfn, flags);
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
--- a/mm/memory-failure.c~mm-hwpoison-fix-race-between-hugetlb-free-demotion-and-memory_failure_hugetlb
+++ a/mm/memory-failure.c
@@ -1498,50 +1498,113 @@ static int try_to_split_thp_page(struct
return 0;
}
-static int memory_failure_hugetlb(unsigned long pfn, int flags)
+/*
+ * Called from hugetlb code with hugetlb_lock held.
+ *
+ * Return values:
+ * 0 - free hugepage
+ * 1 - in-use hugepage
+ * 2 - not a hugepage
+ * -EBUSY - the hugepage is busy (try to retry)
+ * -EHWPOISON - the hugepage is already hwpoisoned
+ */
+int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *head = compound_head(page);
+ int ret = 2; /* fallback to normal page handling */
+ bool count_increased = false;
+
+ if (!PageHeadHuge(head))
+ goto out;
+
+ if (flags & MF_COUNT_INCREASED) {
+ ret = 1;
+ count_increased = true;
+ } else if (HPageFreed(head) || HPageMigratable(head)) {
+ ret = get_page_unless_zero(head);
+ if (ret)
+ count_increased = true;
+ } else {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (TestSetPageHWPoison(head)) {
+ ret = -EHWPOISON;
+ goto out;
+ }
+
+ return ret;
+out:
+ if (count_increased)
+ put_page(head);
+ return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Taking refcount of hugetlb pages needs extra care about race conditions
+ * with basic operations like hugepage allocation/free/demotion.
+ * So some of prechecks for hwpoison (pinning, and testing/setting
+ * PageHWPoison) should be done in single hugetlb_lock range.
+ */
+static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
{
- struct page *p = pfn_to_page(pfn);
- struct page *head = compound_head(p);
int res;
+ struct page *p = pfn_to_page(pfn);
+ struct page *head;
unsigned long page_flags;
+ bool retry = true;
- if (TestSetPageHWPoison(head)) {
- pr_err("Memory failure: %#lx: already hardware poisoned\n",
- pfn);
- res = -EHWPOISON;
- if (flags & MF_ACTION_REQUIRED)
+ *hugetlb = 1;
+retry:
+ res = get_huge_page_for_hwpoison(pfn, flags);
+ if (res == 2) { /* fallback to normal page handling */
+ *hugetlb = 0;
+ return 0;
+ } else if (res == -EHWPOISON) {
+ pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
+ if (flags & MF_ACTION_REQUIRED) {
+ head = compound_head(p);
res = kill_accessing_process(current, page_to_pfn(head), flags);
+ }
return res;
+ } else if (res == -EBUSY) {
+ if (retry) {
+ retry = false;
+ goto retry;
+ }
+ action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+ return res;
+ }
+
+ head = compound_head(p);
+ lock_page(head);
+
+ if (hwpoison_filter(p)) {
+ ClearPageHWPoison(head);
+ res = -EOPNOTSUPP;
+ goto out;
}
num_poisoned_pages_inc();
- if (!(flags & MF_COUNT_INCREASED)) {
- res = get_hwpoison_page(p, flags);
- if (!res) {
- lock_page(head);
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- unlock_page(head);
- return -EOPNOTSUPP;
- }
- unlock_page(head);
- res = MF_FAILED;
- if (__page_handle_poison(p)) {
- page_ref_inc(p);
- res = MF_RECOVERED;
- }
- action_result(pfn, MF_MSG_FREE_HUGE, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
- } else if (res < 0) {
- action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
- return -EBUSY;
+ /*
+ * Handling free hugepage. The possible race with hugepage allocation
+ * or demotion can be prevented by PageHWPoison flag.
+ */
+ if (res == 0) {
+ unlock_page(head);
+ res = MF_FAILED;
+ if (__page_handle_poison(p)) {
+ page_ref_inc(p);
+ res = MF_RECOVERED;
}
+ action_result(pfn, MF_MSG_FREE_HUGE, res);
+ return res == MF_RECOVERED ? 0 : -EBUSY;
}
- lock_page(head);
-
/*
* The page could have changed compound pages due to race window.
* If this happens just bail out.
@@ -1554,14 +1617,6 @@ static int memory_failure_hugetlb(unsign
page_flags = head->flags;
- if (hwpoison_filter(p)) {
- if (TestClearPageHWPoison(head))
- num_poisoned_pages_dec();
- put_page(p);
- res = -EOPNOTSUPP;
- goto out;
- }
-
/*
* TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
* simply disable it. In order to make it work properly, we need
@@ -1588,6 +1643,12 @@ out:
unlock_page(head);
return res;
}
+#else
+static inline int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb)
+{
+ return 0;
+}
+#endif
static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
@@ -1712,6 +1773,7 @@ int memory_failure(unsigned long pfn, in
int res = 0;
unsigned long page_flags;
bool retry = true;
+ int hugetlb = 0;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1739,10 +1801,9 @@ int memory_failure(unsigned long pfn, in
}
try_again:
- if (PageHuge(p)) {
- res = memory_failure_hugetlb(pfn, flags);
+ res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
+ if (hugetlb)
goto unlock_mutex;
- }
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
_
Hi
This is backport of patches d208b89401e0 ("dm: fix mempool NULL pointer
race when completing IO") and 9f6dc6337610 ("dm: interlock pending dm_io
and dm_wait_for_bios_completion") for the kernel 4.19.
The bugs fixed by these patches can cause random crashing when reloading
dm table, so it is eligible for stable backport.
This patch is different from the upstream patches because the code
diverged significantly.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
---
drivers/md/dm.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-stable/drivers/md/dm.c
===================================================================
--- linux-stable.orig/drivers/md/dm.c 2022-04-19 16:28:24.000000000 +0200
+++ linux-stable/drivers/md/dm.c 2022-04-19 16:32:16.000000000 +0200
@@ -647,6 +647,8 @@ static void end_io_acct(struct dm_io *io
bio->bi_iter.bi_sector, bio_sectors(bio),
true, duration, &io->stats_aux);
+ free_io(md, io);
+
/*
* After this is decremented the bio must not be touched if it is
* a flush.
@@ -899,7 +901,6 @@ static void dec_pending(struct dm_io *io
io_error = io->status;
bio = io->orig_bio;
end_io_acct(io);
- free_io(md, io);
if (io_error == BLK_STS_DM_REQUEUE)
return;
@@ -2472,6 +2473,8 @@ static int dm_wait_for_completion(struct
}
finish_wait(&md->wait, &wait);
+ smp_rmb();
+
return r;
}
Hi
This is backport of patches d208b89401e0 ("dm: fix mempool NULL pointer
race when completing IO") and 9f6dc6337610 ("dm: interlock pending dm_io
and dm_wait_for_bios_completion") for the kernel 4.14.
The bugs fixed by these patches can cause random crashing when reloading
dm table, so it is eligible for stable backport.
This patch is different from the upstream patches because the code
diverged significantly.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
drivers/md/dm.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-stable/drivers/md/dm.c
===================================================================
--- linux-stable.orig/drivers/md/dm.c 2022-04-19 16:33:39.000000000 +0200
+++ linux-stable/drivers/md/dm.c 2022-04-19 16:33:39.000000000 +0200
@@ -543,6 +543,8 @@ static void end_io_acct(struct dm_io *io
bio->bi_iter.bi_sector, bio_sectors(bio),
true, duration, &io->stats_aux);
+ free_io(md, io);
+
/*
* After this is decremented the bio must not be touched if it is
* a flush.
@@ -802,7 +804,6 @@ static void dec_pending(struct dm_io *io
io_error = io->status;
bio = io->bio;
end_io_acct(io);
- free_io(md, io);
if (io_error == BLK_STS_DM_REQUEUE)
return;
@@ -2227,6 +2228,8 @@ static int dm_wait_for_completion(struct
}
finish_wait(&md->wait, &wait);
+ smp_rmb();
+
return r;
}
Hi
This is backport of patches d208b89401e0 ("dm: fix mempool NULL pointer
race when completing IO") and 9f6dc6337610 ("dm: interlock pending dm_io
and dm_wait_for_bios_completion") for the kernel 4.9.
The bugs fixed by these patches can cause random crashing when reloading
dm table, so it is eligible for stable backport.
This patch is different from the upstream patches because the code
diverged significantly.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
---
drivers/md/dm.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-stable/drivers/md/dm.c
===================================================================
--- linux-stable.orig/drivers/md/dm.c 2022-04-19 16:35:22.000000000 +0200
+++ linux-stable/drivers/md/dm.c 2022-04-19 16:35:22.000000000 +0200
@@ -539,6 +539,8 @@ static void end_io_acct(struct dm_io *io
bio->bi_iter.bi_sector, bio_sectors(bio),
true, duration, &io->stats_aux);
+ free_io(md, io);
+
/*
* After this is decremented the bio must not be touched if it is
* a flush.
@@ -794,7 +796,6 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;
end_io_acct(io);
- free_io(md, io);
if (io_error == DM_ENDIO_REQUEUE)
return;
@@ -2024,6 +2025,8 @@ static int dm_wait_for_completion(struct
}
finish_wait(&md->wait, &wait);
+ smp_rmb();
+
return r;
}
This is a note to let you know that I've just added the patch titled
usb: dwc3: core: Fix tx/rx threshold settings
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From f28ad9069363dec7deb88032b70612755eed9ee6 Mon Sep 17 00:00:00 2001
From: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Date: Mon, 11 Apr 2022 18:33:47 -0700
Subject: usb: dwc3: core: Fix tx/rx threshold settings
The current driver logic checks against 0 to determine whether the
periodic tx/rx threshold settings are set, but we may get bogus values
from uninitialized variables if no device property is set. Properly
default these variables to 0.
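A sketch of the failure mode (the property name and read helper here
are illustrative assumptions, not taken from the patch):

	u8 rx_thr_num_pkt_prd;	/* stack garbage without an initializer */

	/* property-read helpers leave the output untouched on failure,
	 * so with no DT property set the later check reads garbage: */
	device_property_read_u8(dev, "snps,rx-thr-num-pkt-prd",
				&rx_thr_num_pkt_prd);
	if (rx_thr_num_pkt_prd)
		; /* may wrongly program a periodic RX threshold */

With the explicit '= 0' defaults, the checks are only true when the
properties were actually present.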
Fixes: 938a5ad1d305 ("usb: dwc3: Check for ESS TX/RX threshold config")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
Link: https://lore.kernel.org/r/cccfce990b11b730b0dae42f9d217dc6fb988c90.16497271…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/core.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 5bfd3e88af35..1ca9dae57855 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1377,10 +1377,10 @@ static void dwc3_get_properties(struct dwc3 *dwc)
u8 lpm_nyet_threshold;
u8 tx_de_emphasis;
u8 hird_threshold;
- u8 rx_thr_num_pkt_prd;
- u8 rx_max_burst_prd;
- u8 tx_thr_num_pkt_prd;
- u8 tx_max_burst_prd;
+ u8 rx_thr_num_pkt_prd = 0;
+ u8 rx_max_burst_prd = 0;
+ u8 tx_thr_num_pkt_prd = 0;
+ u8 tx_max_burst_prd = 0;
u8 tx_fifo_resize_max_num;
const char *usb_psy_name;
int ret;
--
2.36.0
This is a note to let you know that I've just added the patch titled
usb: xhci: tegra: Fix PM usage reference leak of tegra_xusb_unpowergate_partitions
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 8771039482d965bdc8cefd972bcabac2b76944a8 Mon Sep 17 00:00:00 2001
From: zhangqilong <zhangqilong3(a)huawei.com>
Date: Sat, 19 Mar 2022 10:38:22 +0800
Subject: usb: xhci: tegra: Fix PM usage reference leak of
 tegra_xusb_unpowergate_partitions
pm_runtime_get_sync() will increment the PM usage counter
even when it fails. Forgetting the corresponding put
operation will result in a reference leak here. Fix it by
replacing it with pm_runtime_resume_and_get() to keep the
usage counter balanced.
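For reference, the difference between the two helpers (a minimal
sketch; dev stands for either power-domain device):

	/* pm_runtime_get_sync() bumps the usage count even on failure,
	 * so the error path must drop the reference explicitly: */
	rc = pm_runtime_get_sync(dev);
	if (rc < 0) {
		pm_runtime_put_noidle(dev);	/* easy to forget */
		return rc;
	}

	/* pm_runtime_resume_and_get() drops the count itself on failure: */
	rc = pm_runtime_resume_and_get(dev);
	if (rc < 0)
		return rc;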
Fixes: 41a7426d25fa ("usb: xhci: tegra: Unlink power domain devices")
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Zhang Qilong <zhangqilong3(a)huawei.com>
Link: https://lore.kernel.org/r/20220319023822.145641-1-zhangqilong3@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/host/xhci-tegra.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-tegra.c b/drivers/usb/host/xhci-tegra.c
index c8af2cd2216d..996958a6565c 100644
--- a/drivers/usb/host/xhci-tegra.c
+++ b/drivers/usb/host/xhci-tegra.c
@@ -1034,13 +1034,13 @@ static int tegra_xusb_unpowergate_partitions(struct tegra_xusb *tegra)
int rc;
if (tegra->use_genpd) {
- rc = pm_runtime_get_sync(tegra->genpd_dev_ss);
+ rc = pm_runtime_resume_and_get(tegra->genpd_dev_ss);
if (rc < 0) {
dev_err(dev, "failed to enable XUSB SS partition\n");
return rc;
}
- rc = pm_runtime_get_sync(tegra->genpd_dev_host);
+ rc = pm_runtime_resume_and_get(tegra->genpd_dev_host);
if (rc < 0) {
dev_err(dev, "failed to enable XUSB Host partition\n");
pm_runtime_put_sync(tegra->genpd_dev_ss);
--
2.36.0
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
commit d6b88ce2eb9d ("ACPI: processor idle: Allow playing dead in C3 state")
was supposedly just trying to enable C3 when the CPU is offlined,
but it also mistakenly enabled C3 usage without setting ARB_DIS=1
in normal idle scenarios.
This results in a machine that won't boot past the point when it first
enters C3. Restore the correct behaviour (either demote to C1/C2, or
use C3 but also set ARB_DIS=1).
I hit this on a Fujitsu Siemens Lifebook S6010 (P3) machine.
Cc: stable(a)vger.kernel.org
Cc: Woody Suwalski <wsuwalski(a)gmail.com>
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Richard Gong <richard.gong(a)amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Fixes: d6b88ce2eb9d ("ACPI: processor idle: Allow playing dead in C3 state")
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/acpi/processor_idle.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 4556c86c3465..54f0a1915025 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -793,10 +793,10 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
state->flags = 0;
if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
- cx->type == ACPI_STATE_C3) {
+ cx->type == ACPI_STATE_C3)
state->enter_dead = acpi_idle_play_dead;
+ if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2)
drv->safe_state_index = count;
- }
/*
* Halt-induced C1 is not good for ->enter_s2idle, because it
* re-enables interrupts on exit. Moreover, C1 is generally not
--
2.35.1
We have now seen a panel (XMG Core 15 e21 laptop) advertising support
for Intel proprietary eDP backlight control via DPCD registers, but
actually working only with legacy PWM control.
This patch adds a panel EDID check for HDR static metadata, and Intel
proprietary eDP backlight control is used only if that exists. Missing
HDR static metadata is ignored if the user specifically asks for Intel
proprietary eDP backlight control via the enable_dpcd_backlight
parameter.
v2 :
- Ignore missing HDR static metadata if Intel proprietary eDP
backlight control is forced via i915.enable_dpcd_backlight
- Printout info message if panel is missing HDR static metadata and
support for Intel proprietary eDP backlight control is detected
Fixes: 4a8d79901d5b ("drm/i915/dp: Enable Intel's HDR backlight interface (only SDR for now)")
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5284
Cc: Lyude Paul <lyude(a)redhat.com>
Cc: Mika Kahola <mika.kahola(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: Filippo Falezza <filippo.falezza(a)outlook.it>
Cc: stable(a)vger.kernel.org
Signed-off-by: Jouni Högander <jouni.hogander(a)intel.com>
---
.../drm/i915/display/intel_dp_aux_backlight.c | 34 ++++++++++++++-----
1 file changed, 26 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c b/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c
index 97cf3cac0105..fb6cf30ee628 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_aux_backlight.c
@@ -97,6 +97,14 @@
#define INTEL_EDP_BRIGHTNESS_OPTIMIZATION_1 0x359
+enum intel_dp_aux_backlight_modparam {
+ INTEL_DP_AUX_BACKLIGHT_AUTO = -1,
+ INTEL_DP_AUX_BACKLIGHT_OFF = 0,
+ INTEL_DP_AUX_BACKLIGHT_ON = 1,
+ INTEL_DP_AUX_BACKLIGHT_FORCE_VESA = 2,
+ INTEL_DP_AUX_BACKLIGHT_FORCE_INTEL = 3,
+};
+
/* Intel EDP backlight callbacks */
static bool
intel_dp_aux_supports_hdr_backlight(struct intel_connector *connector)
@@ -126,6 +134,24 @@ intel_dp_aux_supports_hdr_backlight(struct intel_connector *connector)
return false;
}
+ /*
+ * If we don't have HDR static metadata there is no way to
+ * runtime detect used range for nits based control. For now
+ * do not use Intel proprietary eDP backlight control if we
+ * don't have this data in panel EDID. In case we find panel
+ * which supports only nits based control, but doesn't provide
+ * HDR static metadata we need to start maintaining table of
+ * ranges for such panels.
+ */
+ if (i915->params.enable_dpcd_backlight != INTEL_DP_AUX_BACKLIGHT_FORCE_INTEL &&
+ !(connector->base.hdr_sink_metadata.hdmi_type1.metadata_type &
+ BIT(HDMI_STATIC_METADATA_TYPE1))) {
+ drm_info(&i915->drm,
+ "Panel is missing HDR static metadata. Possible support for Intel HDR backlight interface is not used. If your backlight controls don't work try booting with i915.enable_dpcd_backlight=%d. needs this, please file a _new_ bug report on drm/i915, see " FDO_BUG_URL " for details.\n",
+ INTEL_DP_AUX_BACKLIGHT_FORCE_INTEL);
+ return false;
+ }
+
panel->backlight.edp.intel.sdr_uses_aux =
tcon_cap[2] & INTEL_EDP_SDR_TCON_BRIGHTNESS_AUX_CAP;
@@ -413,14 +439,6 @@ static const struct intel_panel_bl_funcs intel_dp_vesa_bl_funcs = {
.get = intel_dp_aux_vesa_get_backlight,
};
-enum intel_dp_aux_backlight_modparam {
- INTEL_DP_AUX_BACKLIGHT_AUTO = -1,
- INTEL_DP_AUX_BACKLIGHT_OFF = 0,
- INTEL_DP_AUX_BACKLIGHT_ON = 1,
- INTEL_DP_AUX_BACKLIGHT_FORCE_VESA = 2,
- INTEL_DP_AUX_BACKLIGHT_FORCE_INTEL = 3,
-};
-
int intel_dp_aux_init_backlight_funcs(struct intel_connector *connector)
{
struct drm_device *dev = connector->base.dev;
--
2.25.1
From: Joerg Roedel <jroedel(a)suse.de>
Allow a runtime opt-out of kexec support for architecture code in case
the kernel is running in an environment where kexec is not properly
supported yet.
This will be used on x86 when the kernel is running as an SEV-ES
guest. SEV-ES guests need special handling for kexec to hand over all
CPUs to the new kernel. This requires special hypervisor support and
handling code in the guest which is not yet implemented.
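A hypothetical architecture override, to show how the hook is meant to
be used (a sketch only; the SEV-ES predicate is an assumption, not part
of this patch):

/* arch/x86: refuse kexec while running as an SEV-ES guest */
bool arch_kexec_supported(void)
{
	return !cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT);
}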
Cc: stable(a)vger.kernel.org # v5.10+
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
---
include/linux/kexec.h | 1 +
kernel/kexec.c | 14 ++++++++++++++
kernel/kexec_file.c | 9 +++++++++
3 files changed, 24 insertions(+)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 0c994ae37729..85c30dcd0bdc 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -201,6 +201,7 @@ int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
unsigned long buf_len);
#endif
int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
+bool arch_kexec_supported(void);
extern int kexec_add_buffer(struct kexec_buf *kbuf);
int kexec_locate_mem_hole(struct kexec_buf *kbuf);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index c82c6c06f051..d03134160458 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -195,11 +195,25 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
* that to happen you need to do that yourself.
*/
+bool __weak arch_kexec_supported(void)
+{
+ return true;
+}
+
static inline int kexec_load_check(unsigned long nr_segments,
unsigned long flags)
{
int result;
+ /*
+ * The architecture may support kexec in general, but the kernel could
+ * run in an environment where it is not (yet) possible to execute a new
+ * kernel. Allow the architecture code to opt-out of kexec support when
+ * it is running in such an environment.
+ */
+ if (!arch_kexec_supported())
+ return -ENOSYS;
+
/* We only trust the superuser with rebooting the system. */
if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
return -EPERM;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 33400ff051a8..96d08a512e9c 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -358,6 +358,15 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
int ret = 0, i;
struct kimage **dest_image, *image;
+ /*
+ * The architecture may support kexec in general, but the kernel could
+ * run in an environment where it is not (yet) possible to execute a new
+ * kernel. Allow the architecture code to opt-out of kexec support when
+ * it is running in such an environment.
+ */
+ if (!arch_kexec_supported())
+ return -ENOSYS;
+
/* We only trust the superuser with rebooting the system. */
if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
return -EPERM;
--
2.31.1
Reviewed-by: Vabhav Sharma <vabhav.sharma(a)nxp.com>
> -----Original Message-----
> From: Fabio Estevam <festevam(a)gmail.com>
> Sent: Wednesday, April 20, 2022 5:36 PM
> To: herbert(a)gondor.apana.org.au
> Cc: Horia Geanta <horia.geanta(a)nxp.com>; Gaurav Jain
> <gaurav.jain(a)nxp.com>; Varun Sethi <V.Sethi(a)nxp.com>; linux-
> crypto(a)vger.kernel.org; Fabio Estevam <festevam(a)denx.de>;
> stable(a)vger.kernel.org
> Subject: [PATCH v5] crypto: caam - fix i.MX6SX entropy delay value
>
> From: Fabio Estevam <festevam(a)denx.de>
>
> Since commit 358ba762d9f1 ("crypto: caam - enable prediction resistance in
> HRWNG") the following CAAM errors can be seen on i.MX6SX:
>
> caam_jr 2101000.jr: 20003c5b: CCB: desc idx 60: RNG: Hardware error
> hwrng: no data available
>
> This error is due to an incorrect entropy delay for i.MX6SX.
>
> Fix it by increasing the minimum entropy delay for i.MX6SX as done in U-Boot:
> https://patchwork.ozlabs.org/project/uboot/patch/20220415111049.2565744-1-gaurav.jain@nxp.com/
>
> As explained in the U-Boot patch:
>
> "RNG self tests are run to determine the correct entropy delay.
> Such tests are executed with different voltages and temperatures to identify
> the worst case value for the entropy delay. For i.MX6SX, it was determined
> that after adding a margin value of 1000 the minimum entropy delay should
> be at least 12000."
>
> Cc: <stable(a)vger.kernel.org>
> Fixes: 358ba762d9f1 ("crypto: caam - enable prediction resistance in HRWNG")
> Signed-off-by: Fabio Estevam <festevam(a)denx.de>
> Reviewed-by: Horia Geantă <horia.geanta(a)nxp.com>
> ---
> Changes since v4:
> - Change the function name to needs_entropy_delay_adjustment() -
> Vabhav
> - Improve the commit log by adding the explanation from the U-Boot patch -
> Vabhav
>
> drivers/crypto/caam/ctrl.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c
> index ca0361b2dbb0..f87aa2169e5f 100644
> --- a/drivers/crypto/caam/ctrl.c
> +++ b/drivers/crypto/caam/ctrl.c
> @@ -609,6 +609,13 @@ static bool check_version(struct fsl_mc_version *mc_version, u32 major,
> }
> #endif
>
> +static bool needs_entropy_delay_adjustment(void)
> +{
> + if (of_machine_is_compatible("fsl,imx6sx"))
> + return true;
> + return false;
> +}
> +
> /* Probe routine for CAAM top (controller) level */
> static int caam_probe(struct platform_device *pdev)
> {
> @@ -855,6 +862,8 @@ static int caam_probe(struct platform_device *pdev)
> * Also, if a handle was instantiated, do not change
> * the TRNG parameters.
> */
> + if (needs_entropy_delay_adjustment())
> + ent_delay = 12000;
> if (!(ctrlpriv->rng4_sh_init || inst_handles)) {
> dev_info(dev,
> "Entropy delay = %u\n",
> @@ -871,6 +880,15 @@ static int caam_probe(struct platform_device *pdev)
> */
> ret = instantiate_rng(dev, inst_handles,
> gen_sk);
> + /*
> + * Entropy delay is determined via TRNG characterization.
> + * TRNG characterization is run across different voltages
> + * and temperatures.
> + * If worst case value for ent_dly is identified,
> + * the loop can be skipped for that platform.
> + */
> + if (needs_entropy_delay_adjustment())
> + break;
> if (ret == -EAGAIN)
> /*
> * if here, the loop will rerun,
> --
> 2.25.1
Hi,
Can you please bring
commit 1210b17dd4ece454d68a9283f391e3b036aeb010 ("drm/amd/display: Only set PSR version when valid")
to 5.17.y+
This fixes a hang in certain GPU firmware on select panels.
You can also add to the commit log:
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1969407
Thanks,
commit a668cc07f990d2ed19424d5c1a529521a9d1cee1 upstream
perf_evsel::sample_id is an xyarray which can cause a segfault when
accessed beyond its size. e.g.
# perf record -e intel_pt// -C 1 sleep 1
Segmentation fault (core dumped)
#
That is happening because a dummy event is opened to capture text poke
events across all CPUs; however, the mmap logic allocates according
to the number of user_requested_cpus.
In general, perf sometimes uses the evsel cpus to open events, and
sometimes the evlist user_requested_cpus. However, it is not necessary
to determine which case is which because the opened event file
descriptors are also in an xyarray, the size of which can be used
to correctly allocate the size of the sample_id xyarray, because there
is one ID per file descriptor.
Note, in the affected code path, perf_evsel fd array is subsequently
used to get the file descriptor for the mmap, so it makes sense for the
xyarrays to be the same size there.
Fixes: d1a177595b3a824c ("libperf: Adopt perf_evlist__mmap()/munmap() from tools/perf")
Fixes: 246eba8e9041c477 ("perf tools: Add support for PERF_RECORD_TEXT_POKE")
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
Acked-by: Ian Rogers <irogers(a)google.com>
Cc: Adrian Hunter <adrian.hunter(a)intel.com>
Cc: Jiri Olsa <jolsa(a)kernel.org>
Cc: stable(a)vger.kernel.org # 5.5+
Link: https://lore.kernel.org/r/20220413114232.26914-1-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
[backport by Adrian]
Signed-off-by: Adrian Hunter <adrian.hunter(a)intel.com>
---
tools/lib/perf/evlist.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 17465d454a0e..f76b1a9d5a6e 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -571,7 +571,6 @@ int perf_evlist__mmap_ops(struct perf_evlist *evlist,
{
struct perf_evsel *evsel;
const struct perf_cpu_map *cpus = evlist->cpus;
- const struct perf_thread_map *threads = evlist->threads;
if (!ops || !ops->get || !ops->mmap)
return -EINVAL;
@@ -583,7 +582,7 @@ int perf_evlist__mmap_ops(struct perf_evlist *evlist,
perf_evlist__for_each_entry(evlist, evsel) {
if ((evsel->attr.read_format & PERF_FORMAT_ID) &&
evsel->sample_id == NULL &&
- perf_evsel__alloc_id(evsel, perf_cpu_map__nr(cpus), threads->nr) < 0)
+ perf_evsel__alloc_id(evsel, evsel->fd->max_x, evsel->fd->max_y) < 0)
return -ENOMEM;
}
--
2.25.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 08c1af8f1c13bbf210f1760132f4df24d0ed46d6 Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Sun, 3 Apr 2022 14:38:22 -0400
Subject: [PATCH] dm integrity: fix memory corruption when tag_size is less
than digest size
It is possible to set up dm-integrity in such a way that the
"tag_size" parameter is less than the actual digest size. In this
situation, a part of the digest beyond tag_size is ignored.
In this case, dm-integrity would write beyond the end of the
ic->recalc_tags array and corrupt memory. The corruption happened in
integrity_recalc->integrity_sector_checksum->crypto_shash_final.
Fix this corruption by increasing the tags array so that it has enough
padding at the end to accommodate the loop in integrity_recalc() being
able to write a full digest size for the last member of the tags
array.
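A worked example of the sizing (illustrative numbers; the hash choice
is an assumption):

	/* tag_size = 4, internal_hash = sha1 => digest size = 20.
	 * The recalc loop advances by tag_size per block, but
	 * crypto_shash_final() writes a full digest each time, so the
	 * last entry would overrun by 20 - 4 = 16 bytes unless the
	 * allocation carries that much padding at the end. */
	recalc_tags_size = nr_tags * ic->tag_size;	/* old size */
	if (crypto_shash_digestsize(ic->internal_hash) > ic->tag_size)
		recalc_tags_size += crypto_shash_digestsize(ic->internal_hash)
				    - ic->tag_size;	/* +16 here */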
Cc: stable(a)vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)kernel.org>
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index ad2d5faa2ebb..36ae30b73a6e 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -4399,6 +4399,7 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv)
}
if (ic->internal_hash) {
+ size_t recalc_tags_size;
ic->recalc_wq = alloc_workqueue("dm-integrity-recalc", WQ_MEM_RECLAIM, 1);
if (!ic->recalc_wq ) {
ti->error = "Cannot allocate workqueue";
@@ -4412,8 +4413,10 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv)
r = -ENOMEM;
goto bad;
}
- ic->recalc_tags = kvmalloc_array(RECALC_SECTORS >> ic->sb->log2_sectors_per_block,
- ic->tag_size, GFP_KERNEL);
+ recalc_tags_size = (RECALC_SECTORS >> ic->sb->log2_sectors_per_block) * ic->tag_size;
+ if (crypto_shash_digestsize(ic->internal_hash) > ic->tag_size)
+ recalc_tags_size += crypto_shash_digestsize(ic->internal_hash) - ic->tag_size;
+ ic->recalc_tags = kvmalloc(recalc_tags_size, GFP_KERNEL);
if (!ic->recalc_tags) {
ti->error = "Cannot allocate tags for recalculating";
r = -ENOMEM;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ce33c845b030c9cf768370c951bc699470b09fa7 Mon Sep 17 00:00:00 2001
From: Daniel Bristot de Oliveira <bristot(a)kernel.org>
Date: Sun, 20 Feb 2022 23:49:57 +0100
Subject: [PATCH] tracing: Dump stacktrace trigger to the corresponding
instance
The stacktrace event trigger is not dumping the stacktrace to the instance
where it was enabled, but to the global "instance."
Use the private_data, pointing to the trigger file, to figure out the
corresponding trace instance, and use it in the trigger action, like
snapshot_trigger does.
Link: https://lkml.kernel.org/r/afbb0b4f18ba92c276865bc97204d438473f4ebc.16453962…
Cc: stable(a)vger.kernel.org
Fixes: ae63b31e4d0e2 ("tracing: Separate out trace events from global variables")
Reviewed-by: Tom Zanussi <zanussi(a)kernel.org>
Tested-by: Tom Zanussi <zanussi(a)kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index d00fee705f9c..e0d50c9577f3 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1540,7 +1540,12 @@ stacktrace_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
- trace_dump_stack(STACK_SKIP);
+ struct trace_event_file *file = data->private_data;
+
+ if (file)
+ __trace_stack(file->tr, tracing_gen_ctx(), STACK_SKIP);
+ else
+ trace_dump_stack(STACK_SKIP);
}
static void
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 302e9edd54985f584cfc180098f3554774126969 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Wed, 23 Feb 2022 22:38:37 -0500
Subject: [PATCH] tracing: Have traceon and traceoff trigger honor the instance
If a trigger is set on an event to disable or enable tracing within an
instance, then tracing should be disabled or enabled in the instance and
not at the top level, which is confusing to users.
Link: https://lkml.kernel.org/r/20220223223837.14f94ec3@rorschach.local.home
Cc: stable(a)vger.kernel.org
Fixes: ae63b31e4d0e2 ("tracing: Separate out trace events from global variables")
Tested-by: Daniel Bristot de Oliveira <bristot(a)kernel.org>
Reviewed-by: Tom Zanussi <zanussi(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index e0d50c9577f3..efe563140f27 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1295,6 +1295,16 @@ traceon_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
+ struct trace_event_file *file = data->private_data;
+
+ if (file) {
+ if (tracer_tracing_is_on(file->tr))
+ return;
+
+ tracer_tracing_on(file->tr);
+ return;
+ }
+
if (tracing_is_on())
return;
@@ -1306,8 +1316,15 @@ traceon_count_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
- if (tracing_is_on())
- return;
+ struct trace_event_file *file = data->private_data;
+
+ if (file) {
+ if (tracer_tracing_is_on(file->tr))
+ return;
+ } else {
+ if (tracing_is_on())
+ return;
+ }
if (!data->count)
return;
@@ -1315,7 +1332,10 @@ traceon_count_trigger(struct event_trigger_data *data,
if (data->count != -1)
(data->count)--;
- tracing_on();
+ if (file)
+ tracer_tracing_on(file->tr);
+ else
+ tracing_on();
}
static void
@@ -1323,6 +1343,16 @@ traceoff_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
+ struct trace_event_file *file = data->private_data;
+
+ if (file) {
+ if (!tracer_tracing_is_on(file->tr))
+ return;
+
+ tracer_tracing_off(file->tr);
+ return;
+ }
+
if (!tracing_is_on())
return;
@@ -1334,8 +1364,15 @@ traceoff_count_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
- if (!tracing_is_on())
- return;
+ struct trace_event_file *file = data->private_data;
+
+ if (file) {
+ if (!tracer_tracing_is_on(file->tr))
+ return;
+ } else {
+ if (!tracing_is_on())
+ return;
+ }
if (!data->count)
return;
@@ -1343,7 +1380,10 @@ traceoff_count_trigger(struct event_trigger_data *data,
if (data->count != -1)
(data->count)--;
- tracing_off();
+ if (file)
+ tracer_tracing_off(file->tr);
+ else
+ tracing_off();
}
static int
--
Hello Stable kernel maintainers,
I would like to request a backport of the patches below to the
linux-5.15.y branch in the stable tree:
2618a0dae09e etherdevice: Adjust ether_addr* prototypes to silence
-Wstringop-overead
ca831f29f8f2 mm: page_alloc: fix building error on -Werror=array-compare
These two patches are required to fix the build with GCC 12 on Arm
architectures. I have validated them on top of 5.15.34.
Thank you
-Khem
From: Mike Rapoport <rppt(a)linux.ibm.com>
[ Upstream commit a9c38c5d267cb94871dfa2de5539c92025c855d7 ]
dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
However, pfn_valid() only checks for availability of the memory map for a
PFN but it does not ensure that the PFN is actually backed by RAM.
As dma_map_resource() is the only method in DMA mapping APIs that has this
check, simply drop the pfn_valid() test from dma_map_resource().
Link: https://lore.kernel.org/all/20210824173741.GC623@arm.com/
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Acked-by: David Hildenbrand <david(a)redhat.com>
Link: https://lore.kernel.org/r/20210930013039.11260-2-rppt@kernel.org
Signed-off-by: Will Deacon <will(a)kernel.org>
Fixes: 859a85ddf90e ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
Link: https://lore.kernel.org/r/Yl0IZWT2nsiYtqBT@linux.ibm.com
Signed-off-by: Georgi Djakov <quic_c_gdjako(a)quicinc.com>
---
kernel/dma/mapping.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 8349a9f2c345..9478eccd1c8e 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -296,10 +296,6 @@ dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
if (WARN_ON_ONCE(!dev->dma_mask))
return DMA_MAPPING_ERROR;
- /* Don't allow RAM to be mapped */
- if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
- return DMA_MAPPING_ERROR;
-
if (dma_map_direct(dev, ops))
addr = dma_direct_map_resource(dev, phys_addr, size, dir, attrs);
else if (ops->map_resource)
This looks like it's harmless, as both the source and the destination are
currently the same allocation size (4 bytes) and don't use their padding,
but if anything were to ever be added after the "mcr" member in "struct
whiteheat_private", it would be overwritten. The structs both have a
single u8 "mcr" member, but are 4 bytes in padded size. The memcpy()
destination was explicitly targeting the u8 member (size 1) with the
length of the whole structure (size 4), triggering the memcpy buffer
overflow warning:
In file included from include/linux/string.h:253,
from include/linux/bitmap.h:11,
from include/linux/cpumask.h:12,
from include/linux/smp.h:13,
from include/linux/lockdep.h:14,
from include/linux/spinlock.h:62,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from drivers/usb/serial/whiteheat.c:17:
In function 'fortify_memcpy_chk',
inlined from 'firm_send_command' at drivers/usb/serial/whiteheat.c:587:4:
include/linux/fortify-string.h:328:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
328 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand the memcpy() to the entire structure, though perhaps the correct
solution is to mark all the USB command structures as "__packed".
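For illustration, the __packed alternative mentioned above would make
sizeof() match the wire format exactly, so even the original
field-targeted copy would no longer overflow (a sketch; the real struct
contents are assumed):

struct whiteheat_dr_info {
	__u8 mcr;
} __packed;	/* sizeof() is now 1, not 4 */

	/* with packed wire structs the old copy is exactly one byte: */
	memcpy(&info->mcr, command_info->result_buffer,
	       sizeof(struct whiteheat_dr_info));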
Reported-by: kernel test robot <lkp(a)intel.com>
Link: https://lore.kernel.org/lkml/202204142318.vDqjjSFn-lkp@intel.com
Cc: Johan Hovold <johan(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: linux-usb(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
drivers/usb/serial/whiteheat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/serial/whiteheat.c b/drivers/usb/serial/whiteheat.c
index da65d14c9ed5..6e00498843fb 100644
--- a/drivers/usb/serial/whiteheat.c
+++ b/drivers/usb/serial/whiteheat.c
@@ -584,7 +584,7 @@ static int firm_send_command(struct usb_serial_port *port, __u8 command,
switch (command) {
case WHITEHEAT_GET_DTR_RTS:
info = usb_get_serial_port_data(port);
- memcpy(&info->mcr, command_info->result_buffer,
+ memcpy(info, command_info->result_buffer,
sizeof(struct whiteheat_dr_info));
break;
}
--
2.32.0
This patch fixes a memory corruption that occurred in the
nand_scan() path for Hynix NAND devices.
On boot, a Hynix NAND device will panic at a weird place:
| Unable to handle kernel NULL pointer dereference at virtual address 00000070
| [00000070] *pgd=00000000
| Internal error: Oops: 5 [#1] PREEMPT SMP ARM
| Modules linked in:
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-01473-g13ae1769cfb0 #38
| Hardware name: Generic DT based system
| PC is at nandc_set_reg+0x8/0x1c
| LR is at qcom_nandc_command+0x20c/0x5d0
| pc : [<c088b74c>] lr : [<c088d9c8>] psr: 00000113
| sp : c14adc50 ip : c14ee208 fp : c0cc970c
| r10: 000000a3 r9 : 00000000 r8 : 00000040
| r7 : c16f6a00 r6 : 00000090 r5 : 00000004 r4 : c14ee040
| r3 : 00000000 r2 : 0000000b r1 : 00000000 r0 : c14ee040
| Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
| Control: 10c5387d Table: 8020406a DAC: 00000051
| Register r0 information: slab kmalloc-2k start c14ee000 pointer offset 64 size 2048
| Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
| Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
| nandc_set_reg from qcom_nandc_command+0x20c/0x5d0
| qcom_nandc_command from nand_readid_op+0x198/0x1e8
| nand_readid_op from hynix_nand_has_valid_jedecid+0x30/0x78
| hynix_nand_has_valid_jedecid from hynix_nand_init+0xb8/0x454
| hynix_nand_init from nand_scan_with_ids+0xa30/0x14a8
| nand_scan_with_ids from qcom_nandc_probe+0x648/0x7b0
| qcom_nandc_probe from platform_probe+0x58/0xac
The problem is that nand_scan()'s qcom_nand_attach_chip callback
updates nandc->max_cwperpage from 1 to 4 or 8 based on page size.
This causes the sg_init_table of clear_bam_transaction() in the driver's
qcom_nandc_command() to memset much more than what was initially
allocated by alloc_bam_transaction().
This patch updates nandc->max_cwperpage from 1 to 4 or 8 based on page
size in the qcom_nand_attach_chip callback, after first freeing the BAM
transaction memory previously allocated for nandc->max_cwperpage = 1,
and then reallocates the BAM transaction for the updated
nandc->max_cwperpage within the same callback.
Cc: stable(a)vger.kernel.org
Fixes: 6a3cec64f18c ("mtd: rawnand: qcom: convert driver to nand_scan()")
Reported-by: Konrad Dybcio <konrad.dybcio(a)somainline.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Co-developed-by: Sricharan R <quic_srichara(a)quicinc.com>
Signed-off-by: Sricharan R <quic_srichara(a)quicinc.com>
Signed-off-by: Md Sadre Alam <quic_mdalam(a)quicinc.com>
---
Changes in V5:
* Incorporated "missing Co-developed-by tag" comment from Mani
* Added Co-developed-by tag Co-developed-by: Sricharan R <quic_srichara(a)quicinc.com>
* Incorporated " Add Reviewed-by tag" comment from Mani
* Added Reviewed-by tag Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Changes in V4:
* Incorporated "commit log wrong" comment from Mani
* Updated commit log
Changes in V3:
* Incorporated "Fixes tags are missing" comment from Miquèl
* Added Fixes tag Fixes:6a3cec64f18c ("mtd: rawnand: qcom: convert driver to nand_scan()")
* Incorporated "stable tag missing" comment from Miquèl
* Added stable tag Cc: stable(a)vger.kernel.org
* Incorporated "Reported-by tag missing" comment from Mani
* Added Reported-by tag Reported-by: Konrad Dybcio <konrad.dybcio(a)somainline.org>
Changes in V2:
* Incorporated "alloc_bam_transaction inside qcom_nand_attach_chip" suggestion from Mani
* Freed previously allocated memory for bam txn before updating max_cwperpage inside
  qcom_nand_attach_chip().
* Moved alloc_bam_transaction() inside qcom_nand_attach_chip() after updating max_cwperpage
  to 4 or 8 based on page size.
drivers/mtd/nand/raw/qcom_nandc.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/mtd/nand/raw/qcom_nandc.c b/drivers/mtd/nand/raw/qcom_nandc.c
index 1a77542..048b255 100644
--- a/drivers/mtd/nand/raw/qcom_nandc.c
+++ b/drivers/mtd/nand/raw/qcom_nandc.c
@@ -2651,10 +2651,23 @@ static int qcom_nand_attach_chip(struct nand_chip *chip)
ecc->engine_type = NAND_ECC_ENGINE_TYPE_ON_HOST;
mtd_set_ooblayout(mtd, &qcom_nand_ooblayout_ops);
+ /* Free the initially allocated BAM transaction for reading the ONFI params */
+ if (nandc->props->is_bam)
+ free_bam_transaction(nandc);
nandc->max_cwperpage = max_t(unsigned int, nandc->max_cwperpage,
cwperpage);
+ /* Now allocate the BAM transaction based on updated max_cwperpage */
+ if (nandc->props->is_bam) {
+ nandc->bam_txn = alloc_bam_transaction(nandc);
+ if (!nandc->bam_txn) {
+ dev_err(nandc->dev,
+ "failed to allocate bam transaction\n");
+ return -ENOMEM;
+ }
+ }
+
/*
* DATA_UD_BYTES varies based on whether the read/write command protects
* spare data with ECC too. We protect spare data by default, so we set
@@ -2955,17 +2968,6 @@ static int qcom_nand_host_init_and_register(struct qcom_nand_controller *nandc,
if (ret)
return ret;
- if (nandc->props->is_bam) {
- free_bam_transaction(nandc);
- nandc->bam_txn = alloc_bam_transaction(nandc);
- if (!nandc->bam_txn) {
- dev_err(nandc->dev,
- "failed to allocate bam transaction\n");
- nand_cleanup(chip);
- return -ENOMEM;
- }
- }
-
ret = mtd_device_parse_register(mtd, probes, NULL, NULL, 0);
if (ret)
nand_cleanup(chip);
--
2.7.4
Fixes device tree schema validation error messages like 'clocks
does not match any of the regexes: 'pinctrl-[0-9]+''.
The bindings for the memory element don't define the 'clocks' and
'status' properties, and the presence of these elements was causing the
dt-schema checker to trip up. Our operating assumption is that the
platform doesn't rely on the presence of these elements, and that
they were introduced by a typographical oversight.
Fixes: a2770b57d083 ("dt-bindings: timer: Add CLINT bindings")
Cc: stable(a)vger.kernel.org
Signed-off-by: Atul Khare <atulkhare(a)rivosinc.com>
---
arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts | 4 ----
1 file changed, 4 deletions(-)
diff --git a/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts b/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
index cd2fe80fa81a..0a498a0f7eeb 100644
--- a/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
+++ b/arch/riscv/boot/dts/microchip/microchip-mpfs-icicle-kit.dts
@@ -32,15 +32,11 @@ cpus {
ddrc_cache_lo: memory@80000000 {
device_type = "memory";
reg = <0x0 0x80000000 0x0 0x2e000000>;
- clocks = <&clkcfg CLK_DDRC>;
- status = "okay";
};
ddrc_cache_hi: memory@1000000000 {
device_type = "memory";
reg = <0x10 0x0 0x0 0x40000000>;
- clocks = <&clkcfg CLK_DDRC>;
- status = "okay";
};
};
--
2.35.1
Fixes device tree schema validation error messages like
'... cache-sets:0:0: 1024 was expected'.
The existing bindings had a single enumerated value of 1024, which
trips up the dt-schema checks. The ISA permits any arbitrary power
of two for the cache-sets value, but we decided to add the single
additional value of 2048 because we couldn't spot an obvious way
to express the constraint in the schema.
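For reference, the power-of-two test that we could not find a direct
schema keyword for is the usual bit trick; a minimal C sketch:

#include <stdbool.h>

/* True for 1, 2, 4, 8, ... -- the values the ISA permits here. */
static bool is_power_of_two(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}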
Fixes: a2770b57d083 ("dt-bindings: timer: Add CLINT bindings")
Cc: stable(a)vger.kernel.org
Signed-off-by: Atul Khare <atulkhare(a)rivosinc.com>
---
Documentation/devicetree/bindings/riscv/sifive-l2-cache.yaml | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/riscv/sifive-l2-cache.yaml b/Documentation/devicetree/bindings/riscv/sifive-l2-cache.yaml
index e2d330bd4608..309517b78e84 100644
--- a/Documentation/devicetree/bindings/riscv/sifive-l2-cache.yaml
+++ b/Documentation/devicetree/bindings/riscv/sifive-l2-cache.yaml
@@ -46,7 +46,9 @@ properties:
const: 2
cache-sets:
- const: 1024
+ # Note: Technically this can be any power of 2, but we didn't see
+ # an obvious way to express the constraint in YAML
+ enum: [1024, 2048]
cache-size:
const: 2097152
--
2.35.1
Fixes device tree schema validation error messages like 'clint@2000000:
interrupts-extended: [[3, 3], [3, 7] ... is too long'.
The CLINT bindings don't define an "interrupts-extended: maxItems",
which trips up the dt-schema checks. Since there's no ISA-mandated
limit, we arbitrarily chose 1024 to reflect the soon-to-be maximum of
NR_CPUS=512 (systems typically have two hart contexts per CPU).
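The bound is just that product; as a sketch (macro names are
illustrative, not taken from the kernel):

#define MAX_NR_CPUS		512	/* soon-to-be max(NR_CPUS) */
#define CONTEXTS_PER_CPU	2	/* software (3) + timer (7) per hart */
#define CLINT_MAX_INTERRUPTS	(MAX_NR_CPUS * CONTEXTS_PER_CPU)	/* 1024 */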
Fixes: a2770b57d083 ("dt-bindings: timer: Add CLINT bindings")
Cc: stable(a)vger.kernel.org
Signed-off-by: Atul Khare <atulkhare(a)rivosinc.com>
---
Documentation/devicetree/bindings/timer/sifive,clint.yaml | 2 ++
1 file changed, 2 insertions(+)
diff --git a/Documentation/devicetree/bindings/timer/sifive,clint.yaml b/Documentation/devicetree/bindings/timer/sifive,clint.yaml
index 8d5f4687add9..4a1f6d422138 100644
--- a/Documentation/devicetree/bindings/timer/sifive,clint.yaml
+++ b/Documentation/devicetree/bindings/timer/sifive,clint.yaml
@@ -44,6 +44,8 @@ properties:
interrupts-extended:
minItems: 1
+# Based on updated max(NR_CPUS) (512) * (2 contexts per CPU)
+ maxItems: 1024
additionalProperties: false
--
2.35.1
The sizeof(struct whiteheat_dr_info) can be 4 bytes under CONFIG_AEABI=n
due to "-mabi=apcs-gnu", even though it has a single u8:
whiteheat_private {
__u8 mcr; /* 0 1 */
/* size: 4, cachelines: 1, members: 1 */
/* padding: 3 */
/* last cacheline: 4 bytes */
};
The result is technically harmless, as both the source and the
destination are currently the same allocation size (4 bytes) and don't
use their padding, but if anything were ever added after the "mcr"
member in "struct whiteheat_private", it would be overwritten. Both
structs have a single u8 "mcr" member and a padded size of 4 bytes,
and the memcpy() destination was explicitly targeting the u8 member
(size 1) with the length of the whole structure (size 4), triggering
the memcpy buffer overflow warning:
In file included from include/linux/string.h:253,
from include/linux/bitmap.h:11,
from include/linux/cpumask.h:12,
from include/linux/smp.h:13,
from include/linux/lockdep.h:14,
from include/linux/spinlock.h:62,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/slab.h:15,
from drivers/usb/serial/whiteheat.c:17:
In function 'fortify_memcpy_chk',
inlined from 'firm_send_command' at drivers/usb/serial/whiteheat.c:587:4:
include/linux/fortify-string.h:328:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
328 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead, just assign the one byte directly.
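A minimal userspace sketch of the before/after (the struct name and
aligned(4) padding are stand-ins mimicking the apcs-gnu layout
described above, not the whiteheat.h definitions):

#include <stdio.h>
#include <string.h>

/* aligned(4) forces the 4-byte padded size discussed above. */
struct dr_info { unsigned char mcr; } __attribute__((aligned(4)));

int main(void)
{
	unsigned char result_buffer[4] = { 0x5a };
	struct dr_info info;

	/*
	 * Before: destination is the 1-byte field, length is the 4-byte
	 * padded struct size; kernel FORTIFY_SOURCE flags this as a write
	 * beyond the field even though it lands in padding here.
	 */
	memcpy(&info.mcr, result_buffer, sizeof(struct dr_info));

	/* After: assign exactly the one byte that is wanted. */
	info.mcr = result_buffer[0];

	printf("mcr = 0x%02x\n", info.mcr);
	return 0;
}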
Reported-by: kernel test robot <lkp(a)intel.com>
Link: https://lore.kernel.org/lkml/202204142318.vDqjjSFn-lkp@intel.com
Cc: Johan Hovold <johan(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: linux-usb(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
v1: https://lore.kernel.org/lkml/20220419041742.4117026-1-keescook@chromium.org/
v2: - just assign the single byte
---
drivers/usb/serial/whiteheat.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/serial/whiteheat.c b/drivers/usb/serial/whiteheat.c
index da65d14c9ed5..06aad0d727dd 100644
--- a/drivers/usb/serial/whiteheat.c
+++ b/drivers/usb/serial/whiteheat.c
@@ -584,9 +584,8 @@ static int firm_send_command(struct usb_serial_port *port, __u8 command,
switch (command) {
case WHITEHEAT_GET_DTR_RTS:
info = usb_get_serial_port_data(port);
- memcpy(&info->mcr, command_info->result_buffer,
- sizeof(struct whiteheat_dr_info));
- break;
+ info->mcr = command_info->result_buffer[0];
+ break;
}
}
exit:
--
2.32.0
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ce33c845b030c9cf768370c951bc699470b09fa7 Mon Sep 17 00:00:00 2001
From: Daniel Bristot de Oliveira <bristot(a)kernel.org>
Date: Sun, 20 Feb 2022 23:49:57 +0100
Subject: [PATCH] tracing: Dump stacktrace trigger to the corresponding
instance
The stacktrace event trigger is not dumping the stacktrace to the instance
where it was enabled, but to the global "instance".
Use the private_data, pointing to the trigger file, to figure out the
corresponding trace instance, and use it in the trigger action, like
snapshot_trigger does.
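For comparison, snapshot_trigger resolves the instance the same way;
roughly like this (paraphrased from the upstream
kernel/trace/trace_events_trigger.c, not a verbatim quote):

static void
snapshot_trigger(struct event_trigger_data *data,
		 struct trace_buffer *buffer, void *rec,
		 struct ring_buffer_event *event)
{
	struct trace_event_file *file = data->private_data;

	if (file)
		tracing_snapshot_instance(file->tr);
	else
		tracing_snapshot();
}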
Link: https://lkml.kernel.org/r/afbb0b4f18ba92c276865bc97204d438473f4ebc.16453962…
Cc: stable(a)vger.kernel.org
Fixes: ae63b31e4d0e2 ("tracing: Separate out trace events from global variables")
Reviewed-by: Tom Zanussi <zanussi(a)kernel.org>
Tested-by: Tom Zanussi <zanussi(a)kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot(a)kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index d00fee705f9c..e0d50c9577f3 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1540,7 +1540,12 @@ stacktrace_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
- trace_dump_stack(STACK_SKIP);
+ struct trace_event_file *file = data->private_data;
+
+ if (file)
+ __trace_stack(file->tr, tracing_gen_ctx(), STACK_SKIP);
+ else
+ trace_dump_stack(STACK_SKIP);
}
static void
--