The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 469c0682e03d67d8dc970ecaa70c2d753057c7c0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025010322-villain-exes-fff1@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 469c0682e03d67d8dc970ecaa70c2d753057c7c0 Mon Sep 17 00:00:00 2001
From: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Date: Sun, 15 Dec 2024 12:01:59 +0900
Subject: [PATCH] pmdomain: imx: gpcv2: fix an OF node reference leak in
imx_gpcv2_probe()
imx_gpcv2_probe() leaks an OF node reference obtained by
of_get_child_by_name(). Fix it by declaring the device node with the
__free(device_node) cleanup construct.
This bug was found by an experimental static analysis tool that I am
developing.
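For reference, here is a minimal sketch of the cleanup pattern the fix relies
on (illustrative only, assuming the scope-based helpers from <linux/cleanup.h>
and <linux/of.h>; it mirrors the hunk below rather than the full probe function):
```
        struct device_node *pgc_np __free(device_node) =
                        of_get_child_by_name(dev->of_node, "pgc");

        if (!pgc_np) {
                dev_err(dev, "No power domains specified in DT\n");
                return -EINVAL;  /* of_node_put() runs automatically here */
        }
        /* ... use pgc_np; every return path drops the reference ... */
```
With a plain of_get_child_by_name() call, each return after the lookup would
need an explicit of_node_put(); the scoped declaration makes that put automatic.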
Fixes: 03aa12629fc4 ("soc: imx: Add GPCv2 power gating driver")
Signed-off-by: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Cc: stable(a)vger.kernel.org
Message-ID: <20241215030159.1526624-1-joe(a)pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/pmdomain/imx/gpcv2.c b/drivers/pmdomain/imx/gpcv2.c
index e67ecf99ef84..9bdb80fd7210 100644
--- a/drivers/pmdomain/imx/gpcv2.c
+++ b/drivers/pmdomain/imx/gpcv2.c
@@ -1458,12 +1458,12 @@ static int imx_gpcv2_probe(struct platform_device *pdev)
.max_register = SZ_4K,
};
struct device *dev = &pdev->dev;
- struct device_node *pgc_np;
+ struct device_node *pgc_np __free(device_node) =
+ of_get_child_by_name(dev->of_node, "pgc");
struct regmap *regmap;
void __iomem *base;
int ret;
- pgc_np = of_get_child_by_name(dev->of_node, "pgc");
if (!pgc_np) {
dev_err(dev, "No power domains specified in DT\n");
return -EINVAL;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 469c0682e03d67d8dc970ecaa70c2d753057c7c0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025010321-pampers-glazing-2689@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 469c0682e03d67d8dc970ecaa70c2d753057c7c0 Mon Sep 17 00:00:00 2001
From: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Date: Sun, 15 Dec 2024 12:01:59 +0900
Subject: [PATCH] pmdomain: imx: gpcv2: fix an OF node reference leak in
imx_gpcv2_probe()
imx_gpcv2_probe() leaks an OF node reference obtained by
of_get_child_by_name(). Fix it by declaring the device node with the
__free(device_node) cleanup construct.
This bug was found by an experimental static analysis tool that I am
developing.
Fixes: 03aa12629fc4 ("soc: imx: Add GPCv2 power gating driver")
Signed-off-by: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Cc: stable(a)vger.kernel.org
Message-ID: <20241215030159.1526624-1-joe(a)pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/pmdomain/imx/gpcv2.c b/drivers/pmdomain/imx/gpcv2.c
index e67ecf99ef84..9bdb80fd7210 100644
--- a/drivers/pmdomain/imx/gpcv2.c
+++ b/drivers/pmdomain/imx/gpcv2.c
@@ -1458,12 +1458,12 @@ static int imx_gpcv2_probe(struct platform_device *pdev)
.max_register = SZ_4K,
};
struct device *dev = &pdev->dev;
- struct device_node *pgc_np;
+ struct device_node *pgc_np __free(device_node) =
+ of_get_child_by_name(dev->of_node, "pgc");
struct regmap *regmap;
void __iomem *base;
int ret;
- pgc_np = of_get_child_by_name(dev->of_node, "pgc");
if (!pgc_np) {
dev_err(dev, "No power domains specified in DT\n");
return -EINVAL;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 469c0682e03d67d8dc970ecaa70c2d753057c7c0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025010321-bacteria-pebbly-7249@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 469c0682e03d67d8dc970ecaa70c2d753057c7c0 Mon Sep 17 00:00:00 2001
From: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Date: Sun, 15 Dec 2024 12:01:59 +0900
Subject: [PATCH] pmdomain: imx: gpcv2: fix an OF node reference leak in
imx_gpcv2_probe()
imx_gpcv2_probe() leaks an OF node reference obtained by
of_get_child_by_name(). Fix it by declaring the device node with the
__free(device_node) cleanup construct.
This bug was found by an experimental static analysis tool that I am
developing.
Fixes: 03aa12629fc4 ("soc: imx: Add GPCv2 power gating driver")
Signed-off-by: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Cc: stable(a)vger.kernel.org
Message-ID: <20241215030159.1526624-1-joe(a)pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/pmdomain/imx/gpcv2.c b/drivers/pmdomain/imx/gpcv2.c
index e67ecf99ef84..9bdb80fd7210 100644
--- a/drivers/pmdomain/imx/gpcv2.c
+++ b/drivers/pmdomain/imx/gpcv2.c
@@ -1458,12 +1458,12 @@ static int imx_gpcv2_probe(struct platform_device *pdev)
.max_register = SZ_4K,
};
struct device *dev = &pdev->dev;
- struct device_node *pgc_np;
+ struct device_node *pgc_np __free(device_node) =
+ of_get_child_by_name(dev->of_node, "pgc");
struct regmap *regmap;
void __iomem *base;
int ret;
- pgc_np = of_get_child_by_name(dev->of_node, "pgc");
if (!pgc_np) {
dev_err(dev, "No power domains specified in DT\n");
return -EINVAL;
The put_device(&epc->dev) call will trigger pci_epc_release(), which
frees "epc", so the kfree(epc) on the next line is a double free.
Found by code review.
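A rough sketch of the ownership rule behind this (simplified, not the exact
driver code; pci_epc_release() is the device release callback in
pci-epc-core.c):
```
        static void pci_epc_release(struct device *dev)
        {
                kfree(to_pci_epc(dev));   /* the epc object is freed here */
        }

        /* error path in __pci_epc_create() */
        put_device(&epc->dev);  /* drops the last reference -> pci_epc_release() */
        kfree(epc);             /* double free: the object is already gone      */
```
Once the embedded struct device has been initialized, freeing the object is
the release callback's job, so the explicit kfree() must go.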
Cc: stable(a)vger.kernel.org
Fixes: 5e8cb4033807 ("PCI: endpoint: Add EP core layer to enable EP controller and EP functions")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
drivers/pci/endpoint/pci-epc-core.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
index 46c9a5c3ca14..652350f054cf 100644
--- a/drivers/pci/endpoint/pci-epc-core.c
+++ b/drivers/pci/endpoint/pci-epc-core.c
@@ -818,7 +818,6 @@ __pci_epc_create(struct device *dev, const struct pci_epc_ops *ops,
put_dev:
put_device(&epc->dev);
- kfree(epc);
err_ret:
return ERR_PTR(ret);
--
2.25.1
This fixes data corruption when accessing the internal SD card in mass
storage mode.
I am actually not too sure why. I didn't figure out a straightforward way to
reproduce the issue, but I seem to get garbage when a lot (over 50) of large
reads (over 120 sectors) are done in quick succession. That is, time seems to
matter here -- larger reads are fine if they are done with some delay between
them.
But I'm not great at understanding this sort of thing, so I'll assume the
issue that other, smarter folks were seeing with similar phones is the same
problem, and I'll just put my quirk next to theirs.
The "Software details" screen on the phone is as follows:
V 04.06
07-08-13
RM-849
(c) Nokia
TL;DR version of the device descriptor:
idVendor 0x0421 Nokia Mobile Phones
idProduct 0x06c2
bcdDevice 4.06
iManufacturer 1 Nokia
iProduct 2 Nokia 208
The patch assumes older firmwares are broken too (I'm unable to test, but
no biggie if they aren't I guess), and I have no idea if newer firmware
exists.
Signed-off-by: Lubomir Rintel <lkundrak(a)v3.sk>
Cc: <stable(a)vger.kernel.org>
---
drivers/usb/storage/unusual_devs.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/usb/storage/unusual_devs.h b/drivers/usb/storage/unusual_devs.h
index e5ad23d86833..54f0b1c83317 100644
--- a/drivers/usb/storage/unusual_devs.h
+++ b/drivers/usb/storage/unusual_devs.h
@@ -255,6 +255,13 @@ UNUSUAL_DEV( 0x0421, 0x06aa, 0x1110, 0x1110,
USB_SC_DEVICE, USB_PR_DEVICE, NULL,
US_FL_MAX_SECTORS_64 ),
+/* Added by Lubomir Rintel <lkundrak(a)v3.sk>, a very fine chap */
+UNUSUAL_DEV( 0x0421, 0x06c2, 0x0000, 0x0406,
+ "Nokia",
+ "Nokia 208",
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_MAX_SECTORS_64 ),
+
#ifdef NO_SDDR09
UNUSUAL_DEV( 0x0436, 0x0005, 0x0100, 0x0100,
"Microtech",
--
2.47.1
There's a USB error when the Tegra board is shutting down:
[ 180.919315] usb 2-3: Failed to set U1 timeout to 0x0,error code -113
[ 180.919995] usb 2-3: Failed to set U1 timeout to 0xa,error code -113
[ 180.920512] usb 2-3: Failed to set U2 timeout to 0x4,error code -113
[ 186.157172] tegra-xusb 3610000.usb: xHCI host controller not responding, assume dead
[ 186.157858] tegra-xusb 3610000.usb: HC died; cleaning up
[ 186.317280] tegra-xusb 3610000.usb: Timeout while waiting for evaluate context command
The issue is caused by disabling LPM on already suspended ports.
For USB2 LPM, LPM is already disabled during port suspend. For USB3
LPM, the port won't transition to U1/U2 when it's already suspended in U3,
hence disabling LPM is only needed for ports that are not suspended.
Cc: Wayne Chang <waynec(a)nvidia.com>
Cc: stable(a)vger.kernel.org
Fixes: d920a2ed8620 ("usb: Disable USB3 LPM at shutdown")
Signed-off-by: Kai-Heng Feng <kaihengf(a)nvidia.com>
---
v3:
Use udev->port_is_suspended which reflects upstream port status
v2:
Add "Cc: stable(a)vger.kernel.org"
drivers/usb/core/port.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
index e7da2fca11a4..c92fb648a1c4 100644
--- a/drivers/usb/core/port.c
+++ b/drivers/usb/core/port.c
@@ -452,10 +452,11 @@ static int usb_port_runtime_suspend(struct device *dev)
static void usb_port_shutdown(struct device *dev)
{
struct usb_port *port_dev = to_usb_port(dev);
+ struct usb_device *udev = port_dev->child;
- if (port_dev->child) {
- usb_disable_usb2_hardware_lpm(port_dev->child);
- usb_unlocked_disable_lpm(port_dev->child);
+ if (udev && !udev->port_is_suspended) {
+ usb_disable_usb2_hardware_lpm(udev);
+ usb_unlocked_disable_lpm(udev);
}
}
--
2.47.0
Here are 3 different fixes, all related to the MPTCP receive buffer:
- Patch 1: fix receive buffer space when recvmsg() blocks after
receiving some data. For a fix introduced in v6.12, backported to
v6.1.
- Patch 2: mptcp_cleanup_rbuf() can be called when no data has been
copied. For 5.11.
- Patch 3: prevent excessive coalescing on receive, which can affect the
throughput badly. It looks better to wait a bit before backporting
this one to stable versions, to get more results. For 5.10.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Please note that there is no urgency here as well: this can of course be
sent to Linus next year!
Enjoy this holiday period!
---
Paolo Abeni (3):
mptcp: fix recvbuffer adjust on sleeping rcvmsg
mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
mptcp: prevent excessive coalescing on receive
net/mptcp/protocol.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
---
base-commit: a024e377efed31ecfb39210bed562932321345b3
change-id: 20241230-net-mptcp-rbuf-fixes-74526e59d951
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 900f83cf376bdaf798b6f5dcb2eae0c822e908b6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024122319-risk-starlit-ce4a@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 900f83cf376bdaf798b6f5dcb2eae0c822e908b6 Mon Sep 17 00:00:00 2001
From: Thiébaud Weksteen <tweek(a)google.com>
Date: Thu, 5 Dec 2024 12:09:19 +1100
Subject: [PATCH] selinux: ignore unknown extended permissions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When evaluating extended permissions, ignore unknown permissions instead
of calling BUG(). This commit ensures that future permissions can be
added without interfering with older kernels.
Cc: stable(a)vger.kernel.org
Fixes: fa1aa143ac4a ("selinux: extended permissions for ioctls")
Signed-off-by: Thiébaud Weksteen <tweek(a)google.com>
Signed-off-by: Paul Moore <paul(a)paul-moore.com>
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 971c45d576ba..3d5c563cfc4c 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -979,7 +979,10 @@ void services_compute_xperms_decision(struct extended_perms_decision *xpermd,
return;
break;
default:
- BUG();
+ pr_warn_once(
+ "SELinux: unknown extended permission (%u) will be ignored\n",
+ node->datum.u.xperms->specified);
+ return;
}
if (node->key.specified == AVTAB_XPERMS_ALLOWED) {
@@ -998,7 +1001,8 @@ void services_compute_xperms_decision(struct extended_perms_decision *xpermd,
&node->datum.u.xperms->perms,
xpermd->dontaudit);
} else {
- BUG();
+ pr_warn_once("SELinux: unknown specified key (%u)\n",
+ node->key.specified);
}
}
From: Zilin Guan <zilin(a)seu.edu.cn>
In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[]
elements, which are fgraph_ops. The loop checks if an element is a
fgraph_stub to prevent using a fgraph_stub afterward.
However, if the compiler reloads fgraph_array[] after this check, it might
race with an update to fgraph_array[] that introduces a fgraph_stub. This
could result in the stub being processed, but the stub contains a null
"func_hash" field, leading to a NULL pointer dereference.
To ensure that the gops compared against the fgraph_stub matches the gops
processed later, add a READ_ONCE(). A similar patch appears in commit
63a8dfb ("function_graph: Add READ_ONCE() when accessing fgraph_array[]").
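A generic sketch of the hazard (hypothetical names; 'slot' stands in for
fgraph_array[] and 'stub' for fgraph_stub, whose hash pointer is NULL;
READ_ONCE() comes from <linux/compiler.h>):
```
        struct ops { void *func_hash; };
        extern struct ops stub;            /* func_hash == NULL    */
        extern struct ops *slot[];         /* updated concurrently */
        void use(void *hash);

        void handler(int i)
        {
                struct ops *gops = READ_ONCE(slot[i]);  /* exactly one load */

                if (gops == &stub)
                        return;
                /*
                 * Every later access goes through the 'gops' snapshot. With a
                 * plain load the compiler may legally re-read slot[i] here,
                 * observe &stub, and dereference its NULL func_hash.
                 */
                use(gops->func_hash);
        }
```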
Cc: stable@vger.kernel.org
Fixes: 37238abe3cb47 ("ftrace/function_graph: Pass fgraph_ops to function graph callbacks")
Link: https://lore.kernel.org/20241231113731.277668-1-zilin@seu.edu.cn
Signed-off-by: Zilin Guan <zilin(a)seu.edu.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/fgraph.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index ddedcb50917f..30e3ddc8a8a8 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -833,7 +833,7 @@ static unsigned long __ftrace_return_to_handler(struct fgraph_ret_regs *ret_regs
#endif
{
for_each_set_bit(i, &bitmap, sizeof(bitmap) * BITS_PER_BYTE) {
- struct fgraph_ops *gops = fgraph_array[i];
+ struct fgraph_ops *gops = READ_ONCE(fgraph_array[i]);
if (gops == &fgraph_stub)
continue;
--
2.45.2
The patch titled
Subject: fs/proc: do_task_stat: fix ESP not readable during coredump
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
fs-proc-do_task_stat-fix-esp-not-readable-during-coredump.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Nam Cao <namcao(a)linutronix.de>
Subject: fs/proc: do_task_stat: fix ESP not readable during coredump
Date: Thu, 2 Jan 2025 09:22:56 +0100
The field "eip" (instruction pointer) and "esp" (stack pointer) of a task
can be read from /proc/PID/stat. These fields can be interesting for
coredump.
However, these fields were disabled by commit 0a1eb2d474ed ("fs/proc: Stop
reporting eip and esp in /proc/PID/stat"), because it is generally unsafe
to do so. But it is safe for a coredumping process, and therefore
exceptions were made:
- for a coredumping thread by commit fd7d56270b52 ("fs/proc: Report
eip/esp in /prod/PID/stat for coredumping").
- for all other threads in a coredumping process by commit cb8f381f1613
("fs/proc/array.c: allow reporting eip/esp for all coredumping
threads").
The above two commits check the PF_DUMPCORE flag to determine a coredump
thread and the PF_EXITING flag for the other threads.
Unfortunately, commit 92307383082d ("coredump: Don't perform any cleanups
before dumping core") moved coredump to happen earlier and before
PF_EXITING is set. Thus, checking PF_EXITING is no longer the correct way
to determine threads in a coredumping process.
Instead of PF_EXITING, use PF_POSTCOREDUMP to determine the other threads.
The check for PF_EXITING was added for coredumping, so it probably can now
be removed. But it doesn't hurt to keep it.
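For completeness, a hypothetical user-space snippet showing where these
values surface; kstkesp and kstkeip are fields 29 and 30 of /proc/PID/stat
per proc(5), and both read as 0 unless the caller is permitted and the task
matches the flag check described above:
```
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char buf[4096], *p;
                unsigned long long esp = 0, eip = 0;
                FILE *f = fopen("/proc/self/stat", "r");

                if (!f || !fgets(buf, sizeof(buf), f))
                        return 1;
                fclose(f);

                p = strrchr(buf, ')');                  /* comm may contain spaces */
                if (!p)
                        return 1;
                p += 2;                                 /* field 3, "state"        */
                for (int i = 0; i < 26 && p; i++)       /* skip fields 3..28       */
                        p = strchr(p, ' ') ? strchr(p, ' ') + 1 : NULL;
                if (!p)
                        return 1;

                sscanf(p, "%llu %llu", &esp, &eip);     /* kstkesp, kstkeip        */
                printf("kstkesp=%llu kstkeip=%llu\n", esp, eip);
                return 0;
        }
```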
Link: https://lkml.kernel.org/r/d89af63d478d6c64cc46a01420b46fd6eb147d6f.17358057…
Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core")
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Acked-by: Kees Cook <kees(a)kernel.org>
Cc: Eric W. Biederman <ebiederm(a)xmission.com>
Cc: Dylan Hatch <dylanbhatch(a)google.com>
Cc: John Ogness <john.ogness(a)linutronix.de>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/array.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/proc/array.c~fs-proc-do_task_stat-fix-esp-not-readable-during-coredump
+++ a/fs/proc/array.c
@@ -500,7 +500,7 @@ static int do_task_stat(struct seq_file
* a program is not able to use ptrace(2) in that case. It is
* safe because the task has stopped executing permanently.
*/
- if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) {
+ if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE|PF_POSTCOREDUMP))) {
if (try_get_task_stack(task)) {
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
_
Patches currently in -mm which might be from namcao(a)linutronix.de are
fs-proc-do_task_stat-fix-esp-not-readable-during-coredump.patch
selftests-coredump-add-stackdump-test.patch
Hi,
This series fixes several suspend issues on Qcom platforms. Patch 1 fixes
the resume failure with spm_lvl=5 suspend on most of the Qcom platforms. For
this patch, I couldn't figure out the exact commit that caused the issue, so
I used the commit that introduced reinit support as a placeholder.
Patch 4 fixes the suspend issue on SM8550 and SM8650 platforms where UFS
PHY retention is not supported. Hence the default spm_lvl=3 suspend fails. So
this patch configures spm_lvl=5 as the default suspend level to force UFSHC/
device powerdown during suspend. This supersedes the previous series [1] that
tried to fix the issue in clock drivers.
This series is tested on Qcom SM8550 QRD, SM8650 QRD and Qcom RB5 boards.
[1] https://lore.kernel.org/linux-arm-msm/20241107-ufs-clk-fix-v1-0-6032ff22a05…
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
Changes in v3:
- Added a patch that honors the runtime/system PM levels set by host drivers.
Otherwise patch 4 doesn't have any effect. This was discovered with SM8650
QRD.
- Collected tags
- Link to v2: https://lore.kernel.org/r/20241213-ufs-qcom-suspend-fix-v2-0-1de6cd2d6146@l…
Changes in v2:
- Changed 'ufs_qcom_drvdata::quirks' type to 'enum ufshcd_quirks'
- Collected tags
- Link to v1: https://lore.kernel.org/r/20241211-ufs-qcom-suspend-fix-v1-0-83ebbde76b1c@l…
---
Manivannan Sadhasivam (4):
scsi: ufs: qcom: Power off the PHY if it was already powered on in ufs_qcom_power_up_sequence()
scsi: ufs: core: Honor runtime/system PM levels if set by host controller drivers
scsi: ufs: qcom: Allow passing platform specific OF data
scsi: ufs: qcom: Power down the controller/device during system suspend for SM8550/SM8650 SoCs
drivers/ufs/core/ufshcd-priv.h | 6 ------
drivers/ufs/core/ufshcd.c | 10 ++++++----
drivers/ufs/host/ufs-qcom.c | 31 +++++++++++++++++++------------
drivers/ufs/host/ufs-qcom.h | 5 +++++
include/ufs/ufshcd.h | 2 --
5 files changed, 30 insertions(+), 24 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241211-ufs-qcom-suspend-fix-5618e9c56d93
Best regards,
--
Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
The patch titled
Subject: filemap: avoid truncating 64-bit offset to 32 bits
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
filemap-avoid-truncating-64-bit-offset-to-32-bits.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Marco Nelissen <marco.nelissen(a)gmail.com>
Subject: filemap: avoid truncating 64-bit offset to 32 bits
Date: Thu, 2 Jan 2025 11:04:11 -0800
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
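A worked example of the truncation (sketch; assumes a 32-bit size_t block
size, as on 32-bit kernels, and a file offset just above 4 GiB):
```
        loff_t start = 0x100001000ULL;  /* 64-bit offset just above 4 GiB      */
        size_t bsz   = 0x1000;          /* 4 KiB block size, 32 bits wide here */

        /* old expression: ~(bsz - 1) is the 32-bit value 0xFFFFF000, which
         * zero-extends to 0x00000000FFFFF000 and wipes the upper half */
        loff_t bad  = (start + bsz) & ~(bsz - 1);       /* = 0x2000        */

        /* fixed expression: widen before inverting so bits 32..63 survive */
        loff_t good = (start + bsz) & ~((u64)bsz - 1);  /* = 0x100002000   */
```
The truncated offset jumps back below 4 GiB, which is how the seek loses
forward progress and can spin forever.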
Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com
Fixes: 54fa39ac2e00b ("iomap: use mapping_seek_hole_data")
Signed-off-by: Marco Nelissen <marco.nelissen(a)gmail.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/filemap.c~filemap-avoid-truncating-64-bit-offset-to-32-bits
+++ a/mm/filemap.c
@@ -2996,7 +2996,7 @@ static inline loff_t folio_seek_hole_dat
if (ops->is_partially_uptodate(folio, offset, bsz) ==
seek_data)
break;
- start = (start + bsz) & ~(bsz - 1);
+ start = (start + bsz) & ~((u64)bsz - 1);
offset += bsz;
} while (offset < folio_size(folio));
unlock:
_
Patches currently in -mm which might be from marco.nelissen(a)gmail.com are
filemap-avoid-truncating-64-bit-offset-to-32-bits.patch
From: Kohei Enju <enjuk(a)amazon.com>
Since commit c132be2c4fcc ("function_graph: Have the instances use their own
ftrace_ops for filtering"), the function profiler (enabled via
function_profile_enabled) has been showing statistics for all functions,
ignoring set_ftrace_filter settings.
While tracers are instantiated, the function profiler is not. Therefore, it
should use the global set_ftrace_filter for consistency. This patch
modifies the function profiler to use the global filter, fixing the
filtering functionality.
Before (filtering not working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                     Hit     Time           Avg             s^2
  --------                     ---     ----           ---             ---
  schedule                     314     22290594 us    70989.15 us     40372231 us
  x64_sys_call                 1527    8762510 us     5738.382 us     3414354 us
  schedule_hrtimeout_range     176     8665356 us     49234.98 us     405618876 us
  __x64_sys_ppoll              324     5656635 us     17458.75 us     19203976 us
  do_sys_poll                  324     5653747 us     17449.83 us     19214945 us
  schedule_timeout             67      5531396 us     82558.15 us     2136740827 us
  __x64_sys_pselect6           12      3029540 us     252461.7 us     63296940171 us
  do_pselect.constprop.0       12      3029532 us     252461.0 us     63296952931 us
```
After (filtering working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                     Hit     Time           Avg             s^2
  --------                     ---     ----           ---             ---
  vfs_write                    462     68476.43 us    148.217 us      25874.48 us
  vfs_read                     641     9611.356 us    14.994 us       28868.07 us
  vfs_fstat                    890     878.094 us     0.986 us        1.667 us
  vfs_fstatat                  227     757.176 us     3.335 us        18.928 us
  vfs_statx                    226     610.610 us     2.701 us        17.749 us
  vfs_getattr_nosec            1187    460.919 us     0.388 us        0.326 us
  vfs_statx_path               297     343.287 us     1.155 us        11.116 us
  vfs_rename                   6       291.575 us     48.595 us       9889.236 us
```
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com
Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Kohei Enju <enjuk(a)amazon.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/ftrace.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 9b17efb1a87d..2e113f8b13a2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -902,16 +902,13 @@ static void profile_graph_return(struct ftrace_graph_ret *trace,
}
static struct fgraph_ops fprofiler_ops = {
- .ops = {
- .flags = FTRACE_OPS_FL_INITIALIZED,
- INIT_OPS_HASH(fprofiler_ops.ops)
- },
.entryfunc = &profile_graph_entry,
.retfunc = &profile_graph_return,
};
static int register_ftrace_profiler(void)
{
+ ftrace_ops_set_global_filter(&fprofiler_ops.ops);
return register_ftrace_graph(&fprofiler_ops);
}
@@ -922,12 +919,11 @@ static void unregister_ftrace_profiler(void)
#else
static struct ftrace_ops ftrace_profile_ops __read_mostly = {
.func = function_profile_call,
- .flags = FTRACE_OPS_FL_INITIALIZED,
- INIT_OPS_HASH(ftrace_profile_ops)
};
static int register_ftrace_profiler(void)
{
+ ftrace_ops_set_global_filter(&ftrace_profile_ops);
return register_ftrace_function(&ftrace_profile_ops);
}
--
2.45.2
From: Joshua Washington <joshwash(a)google.com>
In GVE, dedicated XDP queues only exist when an XDP program is installed
and the interface is up. As such, the NDO XDP XMIT callback should
return early if either of these conditions is false.
When no XDP program is loaded, priv->num_xdp_queues is 0, which can
cause a divide-by-zero error; when the interface is down, num_xdp_queues
remains untouched to preserve the XDP queue count for the next interface
up, but the TX pointer itself would be NULL.
The XDP xmit callback also needs to synchronize with a device
transitioning from open to close. This synchronization will happen via
the GVE_PRIV_FLAGS_NAPI_ENABLED bit along with a synchronize_net() call,
which waits for any RCU critical sections at call-time to complete.
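A simplified sketch of that pairing, condensed from the two hunks below (not
the literal driver code):
```
        /* close path (gve_turndown) */
        gve_clear_napi_enabled(priv);   /* new ndo_xdp_xmit calls bail out...  */
        synchronize_net();              /* ...and in-flight ones, running in
                                         * RCU/softirq context, are waited for
                                         * before the rings are torn down      */

        /* transmit path (gve_xdp_xmit) */
        if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK) || !priv->xdp_prog)
                return -EINVAL;         /* no XDP program: num_xdp_queues is 0 */
        if (!gve_get_napi_enabled(priv))
                return -ENETDOWN;       /* interface down: TX rings unavailable */
```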
Fixes: 39a7f4aa3e4a ("gve: Add XDP REDIRECT support for GQI-QPL format")
Cc: stable(a)vger.kernel.org
Signed-off-by: Joshua Washington <joshwash(a)google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Shailend Chand <shailend(a)google.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
drivers/net/ethernet/google/gve/gve_main.c | 3 +++
drivers/net/ethernet/google/gve/gve_tx.c | 5 ++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index e171ca248f9a..5d7b0cc59959 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1899,6 +1899,9 @@ static void gve_turndown(struct gve_priv *priv)
gve_clear_napi_enabled(priv);
gve_clear_report_stats(priv);
+
+ /* Make sure that all traffic is finished processing. */
+ synchronize_net();
}
static void gve_turnup(struct gve_priv *priv)
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
index 83ad278ec91f..852f8c7e39d2 100644
--- a/drivers/net/ethernet/google/gve/gve_tx.c
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -837,9 +837,12 @@ int gve_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
struct gve_tx_ring *tx;
int i, err = 0, qid;
- if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+ if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK) || !priv->xdp_prog)
return -EINVAL;
+ if (!gve_get_napi_enabled(priv))
+ return -ENETDOWN;
+
qid = gve_xdp_tx_queue_id(priv,
smp_processor_id() % priv->num_xdp_queues);
--
2.47.1.613.gc27f4b7a9f-goog
From: Claudiu Beznea <claudiu.beznea.uj(a)bp.renesas.com>
In the sh-sci driver, serial ports are mapped to the sci_ports[] array,
with earlycon mapped at index zero.
The uart_add_one_port() function eventually calls __device_attach(),
which, in turn, calls pm_request_idle(). The identified code path is as
follows:
uart_add_one_port() ->
  serial_ctrl_register_port() ->
    serial_core_register_port() ->
      serial_core_port_device_add() ->
        serial_base_port_add() ->
          device_add() ->
            bus_probe_device() ->
              device_initial_probe() ->
                __device_attach() ->
                  // ...
                  if (dev->p->dead) {
                    // ...
                  } else if (dev->driver) {
                    // ...
                  } else {
                    // ...
                    pm_request_idle(dev);
                    // ...
                  }
The earlycon device clocks are enabled by the bootloader. However, the
pm_request_idle() call in __device_attach() disables the SCI port clocks
while earlycon is still active.
The earlycon write function, serial_console_write(), calls
sci_poll_put_char() via serial_console_putchar(). If the SCI port clocks
are disabled, writing to earlycon may sometimes cause the SR.TDFE bit to
remain unset indefinitely, causing the while loop in sci_poll_put_char()
to never exit. On single-core SoCs, this can result in the system being
blocked during boot when this issue occurs.
To resolve this, increment the runtime PM usage counter for the earlycon
SCI device before registering the UART port.
Fixes: 0b0cced19ab1 ("serial: sh-sci: Add CONFIG_SERIAL_EARLYCON support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj(a)bp.renesas.com>
---
drivers/tty/serial/sh-sci.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index f74eb68774ca..6acdc8588d2d 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -3435,7 +3435,24 @@ static int sci_probe_single(struct platform_device *dev,
sciport->port.flags |= UPF_HARD_FLOW;
}
+ /*
+ * In case:
+ * - this is the earlycon port (mapped on index 0 in sci_ports[]) and
+ * - it now maps to an alias other than zero and
+ * - the earlycon is still alive (e.g., "earlycon keep_bootcon" is
+ * available in bootargs)
+ *
+ * we need to avoid disabling clocks and PM domains through the runtime
+ * PM APIs called in __device_attach(). For this, increment the runtime
+ * PM reference counter (the clocks and PM domains were already enabled
+ * by the bootloader). Otherwise the earlycon may access the HW when it
+ * has no clocks enabled leading to failures (infinite loop in
+ * sci_poll_put_char()).
+ */
+
if (sci_ports[0].earlycon && sci_ports[0].port.mapbase == sci_res->start) {
+ pm_runtime_get_noresume(&dev->dev);
+
/*
* Skip cleanup up the sci_port[0] in early_console_exit(), this
* port is the same as the earlycon one.
--
2.39.2
The hardware limitation "max=19" actually comes from the SMMU Command Queue,
so it'd be more natural for the tegra241-cmdqv driver to read it out rather
than hardcoding it itself.
This is not an issue yet for a kernel on a bare-metal system, but a guest
kernel setting the queue base/size in the form of an IPA/gPA might end up
with a noncontiguous queue in the physical address space, if the underlying
physical pages backing the guest RAM aren't entirely contiguous: e.g.
2MB-page-backed guest RAM cannot guarantee a contiguous queue if it is 8MB
(capped to VCMDQ_LOG2SIZE_MAX=19). This might lead to command errors when the
HW does a linear read from noncontiguous queue memory.
Adding this extra IDR1.CMDQS cap (in the guest kernel) allows the VMM to set
the SMMU's IDR1.CMDQS=17 for the case mentioned above, so a guest-level queue
will be capped to a maximum of 2MB, ensuring contiguous queue memory.
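A quick size check behind those numbers (assuming the standard 16-byte
SMMUv3 command, i.e. CMDQ_ENT_DWORDS = 2, with max_n_shift being the log2 of
the number of queue entries):
```
        size_t old_cap = (1UL << 19) * 16;      /* log2size 19 -> 8 MiB: may span
                                                 * several 2MB guest pages that
                                                 * aren't physically contiguous  */
        size_t new_cap = (1UL << 17) * 16;      /* IDR1.CMDQS=17 -> 2 MiB: fits a
                                                 * single 2MB-page-backed region */
```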
Fixes: a3799717b881 ("iommu/tegra241-cmdqv: Fix alignment failure at max_n_shift")
Reported-by: Ian Kalinowski <ikalinowski(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Nicolin Chen <nicolinc(a)nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 6e41ddaa24d6..d525ab43a4ae 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -79,7 +79,6 @@
#define TEGRA241_VCMDQ_PAGE1(q) (TEGRA241_VCMDQ_PAGE1_BASE + 0x80*(q))
#define VCMDQ_ADDR GENMASK(47, 5)
#define VCMDQ_LOG2SIZE GENMASK(4, 0)
-#define VCMDQ_LOG2SIZE_MAX 19
#define TEGRA241_VCMDQ_BASE 0x00000
#define TEGRA241_VCMDQ_CONS_INDX_BASE 0x00008
@@ -505,12 +504,15 @@ static int tegra241_vcmdq_alloc_smmu_cmdq(struct tegra241_vcmdq *vcmdq)
struct arm_smmu_cmdq *cmdq = &vcmdq->cmdq;
struct arm_smmu_queue *q = &cmdq->q;
char name[16];
+ u32 regval;
int ret;
snprintf(name, 16, "vcmdq%u", vcmdq->idx);
- /* Queue size, capped to ensure natural alignment */
- q->llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT, VCMDQ_LOG2SIZE_MAX);
+ /* Cap queue size to SMMU's IDR1.CMDQS and ensure natural alignment */
+ regval = readl_relaxed(smmu->base + ARM_SMMU_IDR1);
+ q->llq.max_n_shift =
+ min_t(u32, CMDQ_MAX_SZ_SHIFT, FIELD_GET(IDR1_CMDQS, regval));
/* Use the common helper to init the VCMDQ, and then... */
ret = arm_smmu_init_one_queue(smmu, q, vcmdq->page0,
--
2.43.0
From: Chao Yu <chao(a)kernel.org>
[ Upstream commit 96cfeb0389530ae32ade8a48ae3ae1ac3b6c009d ]
It should wait for all existing dio write IOs before block removal;
otherwise, a previous direct write IO may overwrite data in a
block which may be reused by another inode.
Cc: stable(a)vger.kernel.org
Signed-off-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
[ Resolve line conflicts to make it work on 6.6.y ]
Signed-off-by: Alva Lan <alvalan9(a)foxmail.com>
---
fs/f2fs/file.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 196755a34833..ae129044c52f 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1037,6 +1037,13 @@ int f2fs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
return err;
}
+ /*
+ * wait for inflight dio, blocks should be removed after
+ * IO completion.
+ */
+ if (attr->ia_size < old_size)
+ inode_dio_wait(inode);
+
f2fs_down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
filemap_invalidate_lock(inode->i_mapping);
@@ -1873,6 +1880,12 @@ static long f2fs_fallocate(struct file *file, int mode,
if (ret)
goto out;
+ /*
+ * wait for inflight dio, blocks should be removed after IO
+ * completion.
+ */
+ inode_dio_wait(inode);
+
if (mode & FALLOC_FL_PUNCH_HOLE) {
if (offset >= inode->i_size)
goto out;
--
2.43.0
I'm announcing the release of the 6.6.69 kernel.
All users of the 6.6 kernel series must upgrade.
The updated 6.6.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.6.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/loongarch/include/asm/inst.h | 12 +
arch/loongarch/kernel/efi.c | 2
arch/loongarch/kernel/inst.c | 2
arch/loongarch/net/bpf_jit.c | 6
arch/mips/Makefile | 2
arch/mips/include/asm/mipsregs.h | 13 +
arch/powerpc/platforms/book3s/vas-api.c | 36 +++++
arch/x86/include/asm/intel-family.h | 87 ++++++++++++
arch/x86/include/asm/processor.h | 20 ++
arch/x86/kernel/cpu/intel.c | 118 ++++++++---------
arch/x86/kernel/cpu/match.c | 3
block/blk-mq.c | 119 +++++++++++++++--
drivers/base/power/domain.c | 1
drivers/base/regmap/regmap.c | 4
drivers/block/virtio_blk.c | 7 -
drivers/dma/apple-admac.c | 7 -
drivers/dma/at_xdmac.c | 2
drivers/dma/dw/acpi.c | 6
drivers/dma/dw/internal.h | 8 +
drivers/dma/dw/pci.c | 4
drivers/dma/fsl-edma-common.h | 1
drivers/dma/fsl-edma-main.c | 41 +++++-
drivers/dma/mv_xor.c | 2
drivers/dma/tegra186-gpc-dma.c | 10 +
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 22 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 2
drivers/gpu/drm/amd/amdgpu/hdp_v4_0.c | 14 +-
drivers/gpu/drm/amd/amdgpu/hdp_v5_0.c | 9 -
drivers/gpu/drm/amd/amdgpu/hdp_v6_0.c | 8 -
drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 28 ++--
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 15 ++
drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 3
drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 21 +--
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c | 32 ++--
drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c | 63 ++++++---
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 43 ++----
drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 44 ++----
drivers/gpu/drm/amd/include/kgd_kfd_interface.h | 14 --
drivers/gpu/drm/display/drm_dp_mst_topology.c | 24 ++-
drivers/i2c/busses/i2c-imx.c | 1
drivers/i2c/busses/i2c-microchip-corei2c.c | 122 +++++++++++++-----
drivers/media/dvb-frontends/dib3000mb.c | 2
drivers/mtd/nand/raw/arasan-nand-controller.c | 11 +
drivers/mtd/nand/raw/atmel/pmecc.c | 4
drivers/mtd/nand/raw/diskonchip.c | 2
drivers/pci/msi/irqdomain.c | 7 -
drivers/pci/msi/msi.c | 4
drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c | 6
drivers/phy/phy-core.c | 21 +--
drivers/phy/qualcomm/phy-qcom-qmp-usb.c | 2
drivers/phy/rockchip/phy-rockchip-naneng-combphy.c | 2
drivers/platform/x86/asus-nb-wmi.c | 1
drivers/power/supply/gpio-charger.c | 8 +
drivers/scsi/megaraid/megaraid_sas_base.c | 5
drivers/scsi/mpt3sas/mpt3sas_base.c | 7 -
drivers/scsi/qla1280.h | 12 -
drivers/scsi/storvsc_drv.c | 7 -
drivers/spi/spi-intel-pci.c | 2
drivers/spi/spi-omap2-mcspi.c | 6
drivers/watchdog/it87_wdt.c | 39 +++++
drivers/watchdog/mtk_wdt.c | 6
fs/btrfs/inode.c | 2
fs/btrfs/sysfs.c | 6
fs/ceph/addr.c | 4
fs/ceph/file.c | 43 +++---
fs/ceph/super.h | 14 ++
fs/nfsd/export.c | 31 ----
fs/nfsd/export.h | 4
fs/nfsd/nfs4callback.c | 4
fs/smb/server/smb_common.c | 4
fs/udf/namei.c | 6
include/linux/ceph/osd_client.h | 7 -
include/linux/sched.h | 3
include/linux/sched/task_stack.h | 2
include/linux/skmsg.h | 11 +
include/linux/trace_events.h | 2
include/linux/vmstat.h | 2
include/net/sock.h | 10 +
include/uapi/linux/stddef.h | 13 +
io_uring/sqpoll.c | 6
kernel/trace/trace.c | 3
kernel/trace/trace_kprobe.c | 2
net/ceph/osd_client.c | 2
net/core/filter.c | 21 ++-
net/core/skmsg.c | 6
net/ipv4/tcp_bpf.c | 6
sound/pci/hda/patch_conexant.c | 28 ++++
sound/pci/hda/patch_realtek.c | 7 +
sound/sh/sh_dac_audio.c | 5
tools/include/uapi/linux/stddef.h | 15 +-
91 files changed, 977 insertions(+), 426 deletions(-)
Aapo Vienamo (1):
spi: intel: Add Panther Lake SPI controller support
Akhil R (1):
dmaengine: tegra: Return correct DMA status when paused
Alex Deucher (5):
drm/amdgpu/hdp4.0: do a posting read when flushing HDP
drm/amdgpu/hdp5.0: do a posting read when flushing HDP
drm/amdgpu/hdp6.0: do a posting read when flushing HDP
drm/amdkfd: reduce stack size in kfd_topology_add_device()
drm/amdkfd: drop struct kfd_cu_info
Alexander Lobakin (1):
stddef: make __struct_group() UAPI C++-friendly
Andrew Cooper (1):
x86/cpu/intel: Drop stray FAM6 check with new Intel CPU model defines
Andy Shevchenko (1):
dmaengine: dw: Select only supported masters for ACPI devices
Armin Wolf (1):
platform/x86: asus-nb-wmi: Ignore unknown event 0xCF
Bart Van Assche (1):
mm/vmstat: fix a W=1 clang compiler warning
Brahmajit Das (1):
smb: server: Fix building with GCC 15
Carlos Song (1):
i2c: imx: add imx7d compatible string for applying erratum ERR007805
Cathy Avery (1):
scsi: storvsc: Do not flag MAINTENANCE_IN return of SRB_STATUS_DATA_OVERRUN as an error
Chen Ridong (2):
dmaengine: at_xdmac: avoid null_prt_deref in at_xdmac_prep_dma_memset
freezer, sched: Report frozen tasks as 'D' instead of 'R'
Chris Chiu (1):
ALSA: hda/realtek: fix micmute LEDs don't work on HP Laptops
Christian Göttsche (1):
tracing: Constify string literal data member in struct trace_event_call
Chukun Pan (1):
phy: rockchip: naneng-combphy: fix phy reset
Colin Ian King (1):
ALSA: hda/realtek: Fix spelling mistake "Firelfy" -> "Firefly"
Cong Wang (2):
tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
bpf: Check negative offsets in __bpf_skb_min_len()
Conor Dooley (2):
i2c: microchip-core: actually use repeated sends
i2c: microchip-core: fix "ghost" detections
Dan Carpenter (1):
mtd: rawnand: fix double free in atmel_pmecc_create_user()
Dimitri Fedrau (1):
power: supply: gpio-charger: Fix set charge current limits
Dirk Su (1):
ALSA: hda/realtek: fix mute/micmute LEDs don't work for EliteBook X G1i
Filipe Manana (1):
btrfs: avoid monopolizing a core when activating a swap file
Greg Kroah-Hartman (1):
Linux 6.6.69
Haren Myneni (1):
powerpc/pseries/vas: Add close() callback in vas_vm_ops struct
Huacai Chen (1):
LoongArch: Fix reserving screen info memory for above-4G firmware
Ilya Dryomov (2):
ceph: fix memory leak in ceph_direct_read_write()
ceph: allocate sparse_ext map only for sparse reads
Imre Deak (1):
drm/dp_mst: Ensure mst_primary pointer is valid in drm_dp_mst_handle_up_req()
James Hilliard (1):
watchdog: it87_wdt: add PWRGD enable quirk for Qotom QCML04
Jan Kara (1):
udf: Skip parent dir link count update if corrupted
Javier Carrasco (1):
dmaengine: mv_xor: fix child node refcount handling in early exit
Jesse.zhang(a)amd.com (1):
drm/amdkfd: pause autosuspend when creating pdd
Jiaxun Yang (2):
MIPS: Probe toolchain support of -msym32
MIPS: mipsregs: Set proper ISA level for virt extensions
Joe Hattori (1):
dmaengine: fsl-edma: implement the cleanup path of fsl_edma3_attach_pd()
Justin Chen (1):
phy: usb: Toggle the PHY power during init
Krishna Kurapati (1):
phy: qcom-qmp: Fix register name in RX Lane config of SC8280XP
Len Brown (1):
x86/cpu: Add Lunar Lake to list of CPUs with a broken MONITOR implementation
Lijo Lazar (1):
drm/amdkfd: Use device based logging for errors
Lizhi Xu (1):
tracing: Prevent bad count for tracing_cpumask_write
Maciej Andrzejewski (2):
mtd: rawnand: arasan: Fix double assertion of chip-select
mtd: rawnand: arasan: Fix missing de-registration of NAND
Magnus Lindholm (1):
scsi: qla1280: Fix hw revision numbering for ISP1020/1040
Mark Brown (1):
regmap: Use correct format specifier for logging range errors
Masami Hiramatsu (Google) (1):
tracing/kprobe: Make trace_kprobe's module callback called after jump_label update
Ming Lei (4):
virtio-blk: don't keep queue frozen during system suspend
blk-mq: register cpuhp callback after hctx is added to xarray table
blk-mq: move cpuhp callback registering out of q->sysfs_lock
block: avoid to reuse `hctx` not removed from cpuhp callback list
NeilBrown (1):
nfsd: restore callback functionality for NFSv4.0
Nikita Zhandarovich (1):
media: dvb-frontends: dib3000mb: fix uninit-value in dib3000_write_reg
Pavel Begunkov (1):
io_uring/sqpoll: fix sqpoll error handling races
Purushothama Siddaiah (1):
spi: omap2-mcspi: Fix the IS_ERR() bug for devm_clk_get_optional_enabled()
Qu Wenruo (1):
btrfs: sysfs: fix direct super block member reads
Qun-Wei Lin (1):
sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
Ranjan Kumar (1):
scsi: mpt3sas: Diag-Reset when Doorbell-In-Use bit is set during driver load time
Sasha Finkelstein (1):
dmaengine: apple-admac: Avoid accessing registers in probe
Takashi Iwai (2):
ALSA: sh: Use standard helper for buffer accesses
ALSA: sh: Fix wrong argument order for copy_from_iter()
Thomas Gleixner (1):
PCI/MSI: Handle lack of irqdomain gracefully
Tiezhu Yang (1):
LoongArch: BPF: Adjust the parameter of emit_jirl()
Tomas Henzl (1):
scsi: megaraid_sas: Fix for a potential deadlock
Tony Luck (5):
x86/cpu: Add model number for Intel Clearwater Forest processor
x86/cpu: Add model number for another Intel Arrow Lake mobile processor
x86/cpu/vfm: Add/initialize x86_vfm field to struct cpuinfo_x86
x86/cpu/vfm: Update arch/x86/include/asm/intel-family.h
x86/cpu/intel: Switch to new Intel CPU model defines
Ulf Hansson (1):
pmdomain: core: Add missing put_device()
Victor Zhao (1):
drm/amd/amdgpu: allow use kiq to do hdp flush under sriov
Xiubo Li (1):
ceph: try to allocate a smaller extent map for sparse read
Yang Erkun (1):
nfsd: Revert "nfsd: release svc_expkey/svc_export with rcu_work"
Yassine Oudjana (1):
watchdog: mediatek: Add support for MT6735 TOPRGU/WDT
Zichen Xie (1):
mtd: diskonchip: Cast an operand to prevent potential overflow
Zijian Zhang (1):
tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
Zijun Hu (5):
phy: core: Fix an OF node refcount leakage in _of_phy_get()
phy: core: Fix an OF node refcount leakage in of_phy_provider_lookup()
phy: core: Fix that API devm_phy_put() fails to release the phy
phy: core: Fix that API devm_of_phy_provider_unregister() fails to unregister the phy provider
phy: core: Fix that API devm_phy_destroy() fails to destroy the phy
bo liu (1):
ALSA: hda/conexant: fix Z60MR100 startup pop issue
This is the start of the stable review cycle for the 6.1.123 release.
There are 60 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 01 Jan 2025 15:41:48 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.123-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.123-rc1
Colin Ian King <colin.i.king(a)gmail.com>
ALSA: hda/realtek: Fix spelling mistake "Firelfy" -> "Firefly"
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "rcu-tasks: Fix access non-existent percpu rtpcp variable in rcu_tasks_need_gpcb()"
Yang Erkun <yangerkun(a)huaweicloud.com>
nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
Qu Wenruo <wqu(a)suse.com>
btrfs: sysfs: fix direct super block member reads
Filipe Manana <fdmanana(a)suse.com>
btrfs: avoid monopolizing a core when activating a swap file
Dimitri Fedrau <dimitri.fedrau(a)liebherr.com>
power: supply: gpio-charger: Fix set charge current limits
Conor Dooley <conor.dooley(a)microchip.com>
i2c: microchip-core: fix "ghost" detections
Carlos Song <carlos.song(a)nxp.com>
i2c: imx: add imx7d compatible string for applying erratum ERR007805
Conor Dooley <conor.dooley(a)microchip.com>
i2c: microchip-core: actually use repeated sends
Pavel Begunkov <asml.silence(a)gmail.com>
io_uring/sqpoll: fix sqpoll error handling races
Lizhi Xu <lizhi.xu(a)windriver.com>
tracing: Prevent bad count for tracing_cpumask_write
Christian Göttsche <cgzones(a)googlemail.com>
tracing: Constify string literal data member in struct trace_event_call
Chen Ridong <chenridong(a)huawei.com>
freezer, sched: Report frozen tasks as 'D' instead of 'R'
NeilBrown <neilb(a)suse.de>
sched/core: Report correct state for TASK_IDLE | TASK_FREEZABLE
Ulf Hansson <ulf.hansson(a)linaro.org>
pmdomain: core: Add missing put_device()
Chris Chiu <chris.chiu(a)canonical.com>
ALSA: hda/realtek: fix micmute LEDs don't work on HP Laptops
Dirk Su <dirk.su(a)canonical.com>
ALSA: hda/realtek: fix mute/micmute LEDs don't work for EliteBook X G1i
Hou Tao <houtao1(a)huawei.com>
bpf: Check validity of link->type in bpf_link_show_fdinfo()
Cosmin Ratiu <cratiu(a)nvidia.com>
net/mlx5e: Don't call cleanup on profile rollback failure
Jiaxun Yang <jiaxun.yang(a)flygoat.com>
MIPS: mipsregs: Set proper ISA level for virt extensions
Jiaxun Yang <jiaxun.yang(a)flygoat.com>
MIPS: Probe toolchain support of -msym32
Matthew Wilcox (Oracle) <willy(a)infradead.org>
vmalloc: fix accounting with i915
Ming Lei <ming.lei(a)redhat.com>
blk-mq: register cpuhp callback after hctx is added to xarray table
Ming Lei <ming.lei(a)redhat.com>
virtio-blk: don't keep queue frozen during system suspend
Imre Deak <imre.deak(a)intel.com>
drm/dp_mst: Ensure mst_primary pointer is valid in drm_dp_mst_handle_up_req()
Cathy Avery <cavery(a)redhat.com>
scsi: storvsc: Do not flag MAINTENANCE_IN return of SRB_STATUS_DATA_OVERRUN as an error
Ranjan Kumar <ranjan.kumar(a)broadcom.com>
scsi: mpt3sas: Diag-Reset when Doorbell-In-Use bit is set during driver load time
Armin Wolf <W_Armin(a)gmx.de>
platform/x86: asus-nb-wmi: Ignore unknown event 0xCF
Mark Brown <broonie(a)kernel.org>
regmap: Use correct format specifier for logging range errors
Brahmajit Das <brahmajit.xyz(a)gmail.com>
smb: server: Fix building with GCC 15
bo liu <bo.liu(a)senarytech.com>
ALSA: hda/conexant: fix Z60MR100 startup pop issue
Tomas Henzl <thenzl(a)redhat.com>
scsi: megaraid_sas: Fix for a potential deadlock
Magnus Lindholm <linmag7(a)gmail.com>
scsi: qla1280: Fix hw revision numbering for ISP1020/1040
James Hilliard <james.hilliard1(a)gmail.com>
watchdog: it87_wdt: add PWRGD enable quirk for Qotom QCML04
Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
tracing/kprobe: Make trace_kprobe's module callback called after jump_label update
Alexander Lobakin <aleksander.lobakin(a)intel.com>
stddef: make __struct_group() UAPI C++-friendly
Haren Myneni <haren(a)linux.ibm.com>
powerpc/pseries/vas: Add close() callback in vas_vm_ops struct
Dan Carpenter <dan.carpenter(a)linaro.org>
mtd: rawnand: fix double free in atmel_pmecc_create_user()
Chen Ridong <chenridong(a)huawei.com>
dmaengine: at_xdmac: avoid null_prt_deref in at_xdmac_prep_dma_memset
Sasha Finkelstein <fnkl.kernel(a)gmail.com>
dmaengine: apple-admac: Avoid accessing registers in probe
Akhil R <akhilrajeev(a)nvidia.com>
dmaengine: tegra: Return correct DMA status when paused
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
dmaengine: dw: Select only supported masters for ACPI devices
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
dmaengine: mv_xor: fix child node refcount handling in early exit
Chukun Pan <amadeus(a)jmu.edu.cn>
phy: rockchip: naneng-combphy: fix phy reset
Justin Chen <justin.chen(a)broadcom.com>
phy: usb: Toggle the PHY power during init
Zijun Hu <quic_zijuhu(a)quicinc.com>
phy: core: Fix that API devm_phy_destroy() fails to destroy the phy
Zijun Hu <quic_zijuhu(a)quicinc.com>
phy: core: Fix that API devm_of_phy_provider_unregister() fails to unregister the phy provider
Zijun Hu <quic_zijuhu(a)quicinc.com>
phy: core: Fix that API devm_phy_put() fails to release the phy
Zijun Hu <quic_zijuhu(a)quicinc.com>
phy: core: Fix an OF node refcount leakage in of_phy_provider_lookup()
Zijun Hu <quic_zijuhu(a)quicinc.com>
phy: core: Fix an OF node refcount leakage in _of_phy_get()
Krishna Kurapati <quic_kriskura(a)quicinc.com>
phy: qcom-qmp: Fix register name in RX Lane config of SC8280XP
Maciej Andrzejewski <maciej.andrzejewski(a)m-works.net>
mtd: rawnand: arasan: Fix missing de-registration of NAND
Maciej Andrzejewski <maciej.andrzejewski(a)m-works.net>
mtd: rawnand: arasan: Fix double assertion of chip-select
Zichen Xie <zichenxie0106(a)gmail.com>
mtd: diskonchip: Cast an operand to prevent potential overflow
NeilBrown <neilb(a)suse.de>
nfsd: restore callback functionality for NFSv4.0
Cong Wang <cong.wang(a)bytedance.com>
bpf: Check negative offsets in __bpf_skb_min_len()
Zijian Zhang <zijianzhang(a)bytedance.com>
tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
Cong Wang <cong.wang(a)bytedance.com>
tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
Bart Van Assche <bvanassche(a)acm.org>
mm/vmstat: fix a W=1 clang compiler warning
Nikita Zhandarovich <n.zhandarovich(a)fintech.ru>
media: dvb-frontends: dib3000mb: fix uninit-value in dib3000_write_reg
-------------
Diffstat:
Makefile | 4 +-
arch/mips/Makefile | 2 +-
arch/mips/include/asm/mipsregs.h | 13 ++-
arch/powerpc/platforms/book3s/vas-api.c | 36 ++++++
block/blk-mq.c | 15 ++-
drivers/base/power/domain.c | 1 +
drivers/base/regmap/regmap.c | 4 +-
drivers/block/virtio_blk.c | 7 +-
drivers/dma/apple-admac.c | 7 +-
drivers/dma/at_xdmac.c | 2 +
drivers/dma/dw/acpi.c | 6 +-
drivers/dma/dw/internal.h | 8 ++
drivers/dma/dw/pci.c | 4 +-
drivers/dma/mv_xor.c | 2 +
drivers/dma/tegra186-gpc-dma.c | 10 ++
drivers/gpu/drm/display/drm_dp_mst_topology.c | 24 +++-
drivers/i2c/busses/i2c-imx.c | 1 +
drivers/i2c/busses/i2c-microchip-corei2c.c | 126 ++++++++++++++++-----
drivers/media/dvb-frontends/dib3000mb.c | 2 +-
drivers/mtd/nand/raw/arasan-nand-controller.c | 11 +-
drivers/mtd/nand/raw/atmel/pmecc.c | 4 +-
drivers/mtd/nand/raw/diskonchip.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 +-
drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c | 6 +
drivers/phy/phy-core.c | 21 ++--
drivers/phy/qualcomm/phy-qcom-qmp-usb.c | 2 +-
drivers/phy/rockchip/phy-rockchip-naneng-combphy.c | 2 +-
drivers/platform/x86/asus-nb-wmi.c | 1 +
drivers/power/supply/gpio-charger.c | 8 ++
drivers/scsi/megaraid/megaraid_sas_base.c | 5 +-
drivers/scsi/mpt3sas/mpt3sas_base.c | 7 +-
drivers/scsi/qla1280.h | 12 +-
drivers/scsi/storvsc_drv.c | 7 +-
drivers/watchdog/it87_wdt.c | 39 +++++++
fs/btrfs/inode.c | 2 +
fs/btrfs/sysfs.c | 6 +-
fs/nfsd/nfs4callback.c | 4 +-
fs/nfsd/nfs4state.c | 2 +-
fs/smb/server/smb_common.c | 4 +-
include/linux/sched.h | 5 +-
include/linux/skmsg.h | 11 +-
include/linux/trace_events.h | 2 +-
include/linux/vmstat.h | 2 +-
include/net/sock.h | 10 +-
include/uapi/linux/stddef.h | 13 ++-
io_uring/sqpoll.c | 6 +
kernel/bpf/syscall.c | 13 ++-
kernel/rcu/tasks.h | 82 +++++---------
kernel/trace/trace.c | 3 +
kernel/trace/trace_kprobe.c | 2 +-
mm/vmalloc.c | 6 +-
net/core/filter.c | 21 +++-
net/core/skmsg.c | 6 +-
net/ipv4/tcp_bpf.c | 6 +-
sound/pci/hda/patch_conexant.c | 28 +++++
sound/pci/hda/patch_realtek.c | 7 ++
tools/include/uapi/linux/stddef.h | 15 ++-
57 files changed, 474 insertions(+), 187 deletions(-)
The field "eip" (instruction pointer) and "esp" (stack pointer) of a task
can be read from /proc/PID/stat. These fields can be interesting for
coredump.
However, these fields were disabled by commit 0a1eb2d474ed ("fs/proc: Stop
reporting eip and esp in /proc/PID/stat"), because it is generally unsafe
to do so. But it is safe for a coredumping process, and therefore
exceptions were made:
- for a coredumping thread by commit fd7d56270b52 ("fs/proc: Report
eip/esp in /prod/PID/stat for coredumping").
- for all other threads in a coredumping process by commit cb8f381f1613
("fs/proc/array.c: allow reporting eip/esp for all coredumping
threads").
The above two commits check the PF_DUMPCORE flag to determine a coredump thread
and the PF_EXITING flag for the other threads.
Unfortunately, commit 92307383082d ("coredump: Don't perform any cleanups
before dumping core") moved coredump to happen earlier and before PF_EXITING is
set. Thus, checking PF_EXITING is no longer the correct way to determine
threads in a coredumping process.
Instead of PF_EXITING, use PF_POSTCOREDUMP to determine the other threads.
The check for PF_EXITING was added for coredumping, so it probably can now be
removed. But it doesn't hurt to keep it.
Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core")
Cc: stable(a)vger.kernel.org
Cc: Eric W. Biederman <ebiederm(a)xmission.com>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Acked-by: Kees Cook <kees(a)kernel.org>
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
---
fs/proc/array.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 55ed3510d2bb..d6a0369caa93 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -500,7 +500,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
* a program is not able to use ptrace(2) in that case. It is
* safe because the task has stopped executing permanently.
*/
- if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) {
+ if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE|PF_POSTCOREDUMP))) {
if (try_get_task_stack(task)) {
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
--
2.39.5
From commit c132be2c4fcc ("function_graph: Have the instances use their own
ftrace_ops for filtering"), function profiler (enabled via
function_profile_enabled) has been showing statistics for all functions,
ignoring set_ftrace_filter settings.
While tracers are instantiated, the function profiler is not. Therefore, it
should use the global set_ftrace_filter for consistency.
This patch modifies the function profiler to use the global filter, fixing
the filtering functionality.
Before (filtering not working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                        Hit    Time            Avg             s^2
  --------                        ---    ----            ---             ---
  schedule                        314    22290594 us     70989.15 us     40372231 us
  x64_sys_call                   1527    8762510 us      5738.382 us     3414354 us
  schedule_hrtimeout_range        176    8665356 us      49234.98 us     405618876 us
  __x64_sys_ppoll                 324    5656635 us      17458.75 us     19203976 us
  do_sys_poll                     324    5653747 us      17449.83 us     19214945 us
  schedule_timeout                 67    5531396 us      82558.15 us     2136740827 us
  __x64_sys_pselect6               12    3029540 us      252461.7 us     63296940171 us
  do_pselect.constprop.0           12    3029532 us      252461.0 us     63296952931 us
```
After (filtering working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                        Hit    Time            Avg             s^2
  --------                        ---    ----            ---             ---
  vfs_write                       462    68476.43 us     148.217 us      25874.48 us
  vfs_read                        641    9611.356 us     14.994 us       28868.07 us
  vfs_fstat                       890    878.094 us      0.986 us        1.667 us
  vfs_fstatat                     227    757.176 us      3.335 us        18.928 us
  vfs_statx                       226    610.610 us      2.701 us        17.749 us
  vfs_getattr_nosec              1187    460.919 us      0.388 us        0.326 us
  vfs_statx_path                  297    343.287 us      1.155 us        11.116 us
  vfs_rename                        6    291.575 us      48.595 us       9889.236 us
```
Cc: stable(a)vger.kernel.org
Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Kohei Enju <enjuk(a)amazon.com>
---
kernel/trace/ftrace.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 9b17efb1a87d..2e113f8b13a2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -902,16 +902,13 @@ static void profile_graph_return(struct ftrace_graph_ret *trace,
}
static struct fgraph_ops fprofiler_ops = {
- .ops = {
- .flags = FTRACE_OPS_FL_INITIALIZED,
- INIT_OPS_HASH(fprofiler_ops.ops)
- },
.entryfunc = &profile_graph_entry,
.retfunc = &profile_graph_return,
};
static int register_ftrace_profiler(void)
{
+ ftrace_ops_set_global_filter(&fprofiler_ops.ops);
return register_ftrace_graph(&fprofiler_ops);
}
@@ -922,12 +919,11 @@ static void unregister_ftrace_profiler(void)
#else
static struct ftrace_ops ftrace_profile_ops __read_mostly = {
.func = function_profile_call,
- .flags = FTRACE_OPS_FL_INITIALIZED,
- INIT_OPS_HASH(ftrace_profile_ops)
};
static int register_ftrace_profiler(void)
{
+ ftrace_ops_set_global_filter(&ftrace_profile_ops);
return register_ftrace_function(&ftrace_profile_ops);
}
--
2.39.5 (Apple Git-154)
Dear all,
I can confirm that the following patch works:
diff --git a/drivers/acpi/resource.c b/drivers/acpi/resource.c
index 821867de43be..ab4c0e0b6b8e 100644
--- a/drivers/acpi/resource.c
+++ b/drivers/acpi/resource.c
@@ -440,6 +440,13 @@ static const struct dmi_system_id irq1_level_low_skip_override[] = {
DMI_MATCH(DMI_BOARD_NAME, "S5602ZA"),
},
},
+ {
+ /* Asus Vivobook X1504VAP */
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."),
+ DMI_MATCH(DMI_BOARD_NAME, "X1504VAP"),
+ },
+ },
{
/* Asus Vivobook X1704VAP */
.matches = {
--
2.47.1
I followed all the steps in the tutorial https://docs.fedoraproject.org/en-US/quick-docs/kernel-testing-patches/ and the keyboard on the Asus Vivobook X1504VAP is working fine!
Please include the patch in future kernels.
Thank you very much.
Hi,
> This is a note to let you know that I've just added the patch titled
>
> phy: rockchip: naneng-combphy: fix phy reset
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> phy-rockchip-naneng-combphy-fix-phy-reset.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Please backport the following commit together with it:
arm64: dts: rockchip: add reset-names for combphy on rk3568
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/…
If it fails to apply, please change arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
to arch/arm64/boot/dts/rockchip/rk356x.dtsi.
Thanks,
Chukun
--
2.25.1
The quilt patch titled
Subject: scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity
has been removed from the -mm tree. Its filename was
scripts-sorttable-fix-orc_sort_cmp-to-maintain-symmetry-and-transitivity.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Subject: scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity
Date: Thu, 26 Dec 2024 22:03:32 +0800
The orc_sort_cmp() function, used with qsort(), previously violated the
symmetry and transitivity rules required by the C standard. Specifically,
when both entries are ORC_TYPE_UNDEFINED, it could result in both a < b
and b < a, which breaks the required symmetry and transitivity. This can
lead to undefined behavior and incorrect sorting results, potentially
causing memory corruption in glibc implementations [1].
Symmetry: If x < y, then y > x.
Transitivity: If x < y and y < z, then x < z.
Fix the comparison logic to return 0 when both entries are
ORC_TYPE_UNDEFINED, ensuring compliance with qsort() requirements.
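Concretely, with the old code kept below: when two entries a and b are both
ORC_TYPE_UNDEFINED and their IP values compare equal, orc_sort_cmp(a, b) and
orc_sort_cmp(b, a) both return -1, i.e. each entry is reported as less than
the other, which is exactly the symmetry violation described above.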
Link: https://www.qualys.com/2024/01/30/qsort.txt [1]
Link: https://lkml.kernel.org/r/20241226140332.2670689-1-visitorckw@gmail.com
Fixes: 57fa18994285 ("scripts/sorttable: Implement build-time ORC unwind table sorting")
Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two")
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Cc: Ching-Chun (Jim) Huang <jserv(a)ccns.ncku.edu.tw>
Cc: <chuang(a)cs.nycu.edu.tw>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Shile Zhang <shile.zhang(a)linux.alibaba.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
scripts/sorttable.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/scripts/sorttable.h~scripts-sorttable-fix-orc_sort_cmp-to-maintain-symmetry-and-transitivity
+++ a/scripts/sorttable.h
@@ -110,7 +110,7 @@ static inline unsigned long orc_ip(const
static int orc_sort_cmp(const void *_a, const void *_b)
{
- struct orc_entry *orc_a;
+ struct orc_entry *orc_a, *orc_b;
const int *a = g_orc_ip_table + *(int *)_a;
const int *b = g_orc_ip_table + *(int *)_b;
unsigned long a_val = orc_ip(a);
@@ -128,6 +128,9 @@ static int orc_sort_cmp(const void *_a,
* whitelisted .o files which didn't get objtool generation.
*/
orc_a = g_orc_table + (a - g_orc_ip_table);
+ orc_b = g_orc_table + (b - g_orc_ip_table);
+ if (orc_a->type == ORC_TYPE_UNDEFINED && orc_b->type == ORC_TYPE_UNDEFINED)
+ return 0;
return orc_a->type == ORC_TYPE_UNDEFINED ? -1 : 1;
}
_
Patches currently in -mm which might be from visitorckw(a)gmail.com are
lib-min_heap-improve-type-safety-in-min_heap-macros-by-using-container_of.patch
lib-test_min_heap-use-inline-min-heap-variants-to-reduce-attack-vector.patch
lib-min_heap-add-brief-introduction-to-min-heap-api.patch
documentation-core-api-min_heap-add-author-information.patch
lib-sort-clarify-comparison-function-requirements-in-sort_r.patch
lib-list_sort-clarify-comparison-function-requirements-in-sort_r.patch
The quilt patch titled
Subject: mm/damon/core: fix ignored quota goals and filters of newly committed schemes
has been removed from the -mm tree. Its filename was
mm-damon-core-fix-ignored-quota-goals-and-filters-of-newly-committed-schemes.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/core: fix ignored quota goals and filters of newly committed schemes
Date: Sun, 22 Dec 2024 15:12:22 -0800
damon_commit_schemes() ignores quota goals and filters of the newly
committed schemes. This makes users confused about the behaviors.
Correctly handle those inputs.
Link: https://lkml.kernel.org/r/20241222231222.85060-3-sj@kernel.org
Fixes: 9cb3d0b9dfce ("mm/damon/core: implement DAMON context commit function")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/core.c | 5 +++++
1 file changed, 5 insertions(+)
--- a/mm/damon/core.c~mm-damon-core-fix-ignored-quota-goals-and-filters-of-newly-committed-schemes
+++ a/mm/damon/core.c
@@ -868,6 +868,11 @@ static int damon_commit_schemes(struct d
NUMA_NO_NODE);
if (!new_scheme)
return -ENOMEM;
+ err = damos_commit(new_scheme, src_scheme);
+ if (err) {
+ damon_destroy_scheme(new_scheme);
+ return err;
+ }
damon_add_scheme(dst, new_scheme);
}
return 0;
_
Patches currently in -mm which might be from sj(a)kernel.org are
samples-add-a-skeleton-of-a-sample-damon-module-for-working-set-size-estimation.patch
samples-damon-wsse-start-and-stop-damon-as-the-user-requests.patch
samples-damon-wsse-implement-working-set-size-estimation-and-logging.patch
samples-damon-introduce-a-skeleton-of-a-smaple-damon-module-for-proactive-reclamation.patch
samples-damon-prcl-implement-schemes-setup.patch
replace-free-hugepage-folios-after-migration-fix-2.patch
The quilt patch titled
Subject: mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
has been removed from the -mm tree. Its filename was
mm-damon-core-fix-new-damon_target-objects-leaks-on-damon_commit_targets.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
Date: Sun, 22 Dec 2024 15:12:21 -0800
Patch series "mm/damon/core: fix memory leaks and ignored inputs from
damon_commit_ctx()".
Due to two bugs in damon_commit_targets() and damon_commit_schemes(),
which are called from damon_commit_ctx(), some user inputs can be ignored,
and some memory objects can be leaked. Fix those.
Note that only DAMON sysfs interface users are affected. Other DAMON core
API user modules that are more focused on simple and dedicated production
usages, including DAMON_RECLAIM and DAMON_LRU_SORT, are not using the buggy
function in that way, so they are not affected.
This patch (of 2):
When new DAMON targets are added via damon_commit_targets(), the newly
created targets are not deallocated when updating the internal data
(damon_commit_target()) is failed. Worse yet, even if the setup is
successfully done, the new target is not linked to the context. Hence,
the new targets are always leaked regardless of the internal data setup
failure. Fix the leaks.
Link: https://lkml.kernel.org/r/20241222231222.85060-2-sj@kernel.org
Fixes: 9cb3d0b9dfce ("mm/damon/core: implement DAMON context commit function")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/core.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/damon/core.c~mm-damon-core-fix-new-damon_target-objects-leaks-on-damon_commit_targets
+++ a/mm/damon/core.c
@@ -961,8 +961,11 @@ static int damon_commit_targets(
return -ENOMEM;
err = damon_commit_target(new_target, false,
src_target, damon_target_has_pid(src));
- if (err)
+ if (err) {
+ damon_destroy_target(new_target);
return err;
+ }
+ damon_add_target(dst, new_target);
}
return 0;
}
_
Patches currently in -mm which might be from sj(a)kernel.org are
samples-add-a-skeleton-of-a-sample-damon-module-for-working-set-size-estimation.patch
samples-damon-wsse-start-and-stop-damon-as-the-user-requests.patch
samples-damon-wsse-implement-working-set-size-estimation-and-logging.patch
samples-damon-introduce-a-skeleton-of-a-smaple-damon-module-for-proactive-reclamation.patch
samples-damon-prcl-implement-schemes-setup.patch
replace-free-hugepage-folios-after-migration-fix-2.patch
The quilt patch titled
Subject: vmstat: disable vmstat_work on vmstat_cpu_down_prep()
has been removed from the -mm tree. Its filename was
vmstat-disable-vmstat_work-on-vmstat_cpu_down_prep.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Koichiro Den <koichiro.den(a)canonical.com>
Subject: vmstat: disable vmstat_work on vmstat_cpu_down_prep()
Date: Sat, 21 Dec 2024 12:33:20 +0900
Even after mm/vmstat:online teardown, shepherd may still queue work for
the dying cpu until the cpu is removed from the online mask. While it's quite
rare, this means that after unbind_workers() unbinds a per-cpu kworker, it
potentially runs vmstat_update for the dying CPU on an irrelevant cpu
before entering atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results
in the following error with the backtrace.
BUG: using smp_processor_id() in preemptible [00000000] code: \
kworker/7:3/1702
caller is refresh_cpu_vm_stats+0x235/0x5f0
CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
Tainted: [N]=TEST
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xb0
check_preemption_disabled+0xce/0xe0
refresh_cpu_vm_stats+0x235/0x5f0
vmstat_update+0x17/0xa0
process_one_work+0x869/0x1aa0
worker_thread+0x5e5/0x1100
kthread+0x29e/0x380
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
symmetrically enable it on startup.
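(As I understand the workqueue API, the difference from
cancel_delayed_work_sync() is that disable_delayed_work_sync() also leaves the
work item marked disabled, so a later queue attempt from the shepherd becomes
a no-op until enable_delayed_work() re-enables it on the online path.)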
Link: https://lkml.kernel.org/r/20241221033321.4154409-1-koichiro.den@canonical.c…
Signed-off-by: Koichiro Den <koichiro.den(a)canonical.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmstat.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/vmstat.c~vmstat-disable-vmstat_work-on-vmstat_cpu_down_prep
+++ a/mm/vmstat.c
@@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned in
if (!node_state(cpu_to_node(cpu), N_CPU)) {
node_set_state(cpu_to_node(cpu), N_CPU);
}
+ enable_delayed_work(&per_cpu(vmstat_work, cpu));
return 0;
}
static int vmstat_cpu_down_prep(unsigned int cpu)
{
- cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+ disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
return 0;
}
_
Patches currently in -mm which might be from koichiro.den(a)canonical.com are
hugetlb-prioritize-surplus-allocation-from-current-node.patch
The quilt patch titled
Subject: mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
has been removed from the -mm tree. Its filename was
mm-shmem-fix-the-update-of-shmem_falloc-nr_unswapped.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Subject: mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
Date: Thu, 19 Dec 2024 15:30:09 +0800
The 'shmem_falloc->nr_unswapped' is used to record how many times writepage
refused to swap out because fallocate() is allocating, but after shmem
supports large folio swap out, the update of 'shmem_falloc->nr_unswapped'
does not use the correct number of pages in the large folio, which may
lead to fallocate() not exiting as soon as possible.
Anyway, this is found through code inspection, and I am not sure whether
it would actually cause serious issues.
Link: https://lkml.kernel.org/r/f66a0119d0564c2c37c84f045835b870d1b2196f.17345931…
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/shmem.c~mm-shmem-fix-the-update-of-shmem_falloc-nr_unswapped
+++ a/mm/shmem.c
@@ -1535,7 +1535,7 @@ try_split:
!shmem_falloc->waitq &&
index >= shmem_falloc->start &&
index < shmem_falloc->next)
- shmem_falloc->nr_unswapped++;
+ shmem_falloc->nr_unswapped += nr_pages;
else
shmem_falloc = NULL;
spin_unlock(&inode->i_lock);
_
Patches currently in -mm which might be from baolin.wang(a)linux.alibaba.com are
mm-factor-out-the-order-calculation-into-a-new-helper.patch
mm-shmem-change-shmem_huge_global_enabled-to-return-huge-order-bitmap.patch
mm-shmem-add-large-folio-support-for-tmpfs.patch
mm-shmem-add-a-kernel-command-line-to-change-the-default-huge-policy-for-tmpfs.patch
docs-tmpfs-drop-fadvise-from-the-documentation.patch
The quilt patch titled
Subject: mm: shmem: fix incorrect index alignment for within_size policy
has been removed from the -mm tree. Its filename was
mm-shmem-fix-incorrect-index-alignment-for-within_size-policy.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Subject: mm: shmem: fix incorrect index alignment for within_size policy
Date: Thu, 19 Dec 2024 15:30:08 +0800
With the shmem per-size within_size policy enabled, using an incorrect
'order' size to round_up() the index can lead to incorrect i_size checks,
resulting in inappropriately large orders being returned.
Change to using '1 << order' to round_up() the index to fix this issue.
Additionally, add an 'aligned_index' variable to avoid affecting the
index checks.
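As a rough worked example (illustrative numbers only): with order = 4, i.e. a
16-page mTHP, and index = 100, the old round_up(101, order) yields 104, so the
i_size check only requires 104 pages to pass, while the intended
round_up(101, 1 << order) yields 112, the end of the 16-page block containing
the index, which is what actually has to lie within i_size for that order to
be allowed.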
Link: https://lkml.kernel.org/r/77d8ef76a7d3d646e9225e9af88a76549a68aab1.17345931…
Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/mm/shmem.c~mm-shmem-fix-incorrect-index-alignment-for-within_size-policy
+++ a/mm/shmem.c
@@ -1689,6 +1689,7 @@ unsigned long shmem_allowable_huge_order
unsigned long mask = READ_ONCE(huge_shmem_orders_always);
unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
unsigned long vm_flags = vma ? vma->vm_flags : 0;
+ pgoff_t aligned_index;
bool global_huge;
loff_t i_size;
int order;
@@ -1723,9 +1724,9 @@ unsigned long shmem_allowable_huge_order
/* Allow mTHP that will be fully within i_size. */
order = highest_order(within_size_orders);
while (within_size_orders) {
- index = round_up(index + 1, order);
+ aligned_index = round_up(index + 1, 1 << order);
i_size = round_up(i_size_read(inode), PAGE_SIZE);
- if (i_size >> PAGE_SHIFT >= index) {
+ if (i_size >> PAGE_SHIFT >= aligned_index) {
mask |= within_size_orders;
break;
}
_
Patches currently in -mm which might be from baolin.wang(a)linux.alibaba.com are
mm-factor-out-the-order-calculation-into-a-new-helper.patch
mm-shmem-change-shmem_huge_global_enabled-to-return-huge-order-bitmap.patch
mm-shmem-add-large-folio-support-for-tmpfs.patch
mm-shmem-add-a-kernel-command-line-to-change-the-default-huge-policy-for-tmpfs.patch
docs-tmpfs-drop-fadvise-from-the-documentation.patch
The quilt patch titled
Subject: mm: zswap: fix race between [de]compression and CPU hotunplug
has been removed from the -mm tree. Its filename was
mm-zswap-fix-race-between-compression-and-cpu-hotunplug.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yosry Ahmed <yosryahmed(a)google.com>
Subject: mm: zswap: fix race between [de]compression and CPU hotunplug
Date: Thu, 19 Dec 2024 21:24:37 +0000
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
current CPU at the beginning of the operation is retrieved and used
throughout. However, since neither preemption nor migration are disabled,
it is possible that the operation continues on a different CPU.
If the original CPU is hotunplugged while the acomp_ctx is still in use,
we run into a UAF bug as the resources attached to the acomp_ctx are freed
during hotunplug in zswap_cpu_comp_dead().
The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
crypto_acomp API for hardware acceleration") when the switch to the
crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
retrieved using get_cpu_ptr() which disables preemption and makes sure the
CPU cannot go away from under us. Preemption cannot be disabled with the
crypto_acomp API as a sleepable context is needed.
Commit 8ba2f844f050 ("mm/zswap: change per-cpu mutex and buffer to
per-acomp_ctx") increased the UAF surface area by making the per-CPU
buffers dynamic, adding yet another resource that can be freed from under
zswap compression/decompression by CPU hotunplug.
There are a few ways to fix this:
(a) Add a refcount for acomp_ctx.
(b) Disable migration while using the per-CPU acomp_ctx.
(c) Disable CPU hotunplug while using the per-CPU acomp_ctx by holding
the CPUs read lock.
Implement (c) since it's simpler than (a), and (b) involves using
migrate_disable() which is apparently undesired (see huge comment in
include/linux/preempt.h).
Link: https://lkml.kernel.org/r/20241219212437.2714151-1-yosryahmed@google.com
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
Signed-off-by: Yosry Ahmed <yosryahmed(a)google.com>
Reported-by: Johannes Weiner <hannes(a)cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
Reported-by: Sam Sun <samsun1006219(a)gmail.com>
Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tP…
Reviewed-by: Chengming Zhou <chengming.zhou(a)linux.dev>
Acked-by: Barry Song <baohua(a)kernel.org>
Reviewed-by: Nhat Pham <nphamcs(a)gmail.com>
Cc: Vitaly Wool <vitalywool(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/zswap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
--- a/mm/zswap.c~mm-zswap-fix-race-between-compression-and-cpu-hotunplug
+++ a/mm/zswap.c
@@ -880,6 +880,18 @@ static int zswap_cpu_comp_dead(unsigned
return 0;
}
+/* Prevent CPU hotplug from freeing up the per-CPU acomp_ctx resources */
+static struct crypto_acomp_ctx *acomp_ctx_get_cpu(struct crypto_acomp_ctx __percpu *acomp_ctx)
+{
+ cpus_read_lock();
+ return raw_cpu_ptr(acomp_ctx);
+}
+
+static void acomp_ctx_put_cpu(void)
+{
+ cpus_read_unlock();
+}
+
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -893,8 +905,7 @@ static bool zswap_compress(struct page *
gfp_t gfp;
u8 *dst;
- acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-
+ acomp_ctx = acomp_ctx_get_cpu(pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
dst = acomp_ctx->buffer;
@@ -950,6 +961,7 @@ unlock:
zswap_reject_alloc_fail++;
mutex_unlock(&acomp_ctx->mutex);
+ acomp_ctx_put_cpu();
return comp_ret == 0 && alloc_ret == 0;
}
@@ -960,7 +972,7 @@ static void zswap_decompress(struct zswa
struct crypto_acomp_ctx *acomp_ctx;
u8 *src;
- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+ acomp_ctx = acomp_ctx_get_cpu(entry->pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
@@ -990,6 +1002,7 @@ static void zswap_decompress(struct zswa
if (src != acomp_ctx->buffer)
zpool_unmap_handle(zpool, entry->handle);
+ acomp_ctx_put_cpu();
}
/*********************************
_
Patches currently in -mm which might be from yosryahmed(a)google.com are
The quilt patch titled
Subject: ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
has been removed from the -mm tree. Its filename was
ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dennis Lam <dennis.lamerice(a)gmail.com>
Subject: ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
Date: Tue, 17 Dec 2024 21:39:25 -0500
When mounting ocfs2 and then remounting it as read-only, a
slab-use-after-free occurs after the user uses a syscall to
quota_getnextquota. Specifically, sb_dqinfo(sb, type)->dqi_priv is the
dangling pointer.
During the remounting process, the pointer dqi_priv is freed but is never
set as null leaving it to be accessed. Additionally, the read-only option
for remounting sets the DQUOT_SUSPENDED flag instead of setting the
DQUOT_USAGE_ENABLED flags. Moreover, later in the process of getting the
next quota, the function ocfs2_get_next_id is called and only checks the
quota usage flags and not the quota suspended flags.
To fix this, I set dqi_priv to null when it is freed after remounting with
read-only and put a check for DQUOT_SUSPENDED in ocfs2_get_next_id.
[akpm(a)linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20241218023924.22821-2-dennis.lamerice@gmail.com
Fixes: 8f9e8f5fcc05 ("ocfs2: Fix Q_GETNEXTQUOTA for filesystem without quotas")
Signed-off-by: Dennis Lam <dennis.lamerice(a)gmail.com>
Reported-by: syzbot+d173bf8a5a7faeede34c(a)syzkaller.appspotmail.com
Tested-by: syzbot+d173bf8a5a7faeede34c(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6731d26f.050a0220.1fb99c.014b.GAE@google.com/T/
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/quota_global.c | 2 +-
fs/ocfs2/quota_local.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
--- a/fs/ocfs2/quota_global.c~ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv
+++ a/fs/ocfs2/quota_global.c
@@ -893,7 +893,7 @@ static int ocfs2_get_next_id(struct supe
int status = 0;
trace_ocfs2_get_next_id(from_kqid(&init_user_ns, *qid), type);
- if (!sb_has_quota_loaded(sb, type)) {
+ if (!sb_has_quota_active(sb, type)) {
status = -ESRCH;
goto out;
}
--- a/fs/ocfs2/quota_local.c~ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv
+++ a/fs/ocfs2/quota_local.c
@@ -867,6 +867,7 @@ out:
brelse(oinfo->dqi_libh);
brelse(oinfo->dqi_lqi_bh);
kfree(oinfo);
+ info->dqi_priv = NULL;
return status;
}
_
Patches currently in -mm which might be from dennis.lamerice(a)gmail.com are
The quilt patch titled
Subject: fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
has been removed from the -mm tree. Its filename was
fs-proc-task_mmu-fix-pagemap-flags-with-pmd-thp-entries-on-32bit.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
Date: Tue, 17 Dec 2024 20:50:00 +0100
Entries (including flags) are u64, even on 32bit. So right now we are
cutting off the flags on 32bit. This way, for example, the cow selftest
complains about:
# ./cow
...
Bail Out! read and ioctl return unmatched results for populated: 0 1
Link: https://lkml.kernel.org/r/20241217195000.1734039-1-david@redhat.com
Fixes: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/task_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/proc/task_mmu.c~fs-proc-task_mmu-fix-pagemap-flags-with-pmd-thp-entries-on-32bit
+++ a/fs/proc/task_mmu.c
@@ -1810,7 +1810,7 @@ static int pagemap_pmd_range(pmd_t *pmdp
}
for (; addr != end; addr += PAGE_SIZE, idx++) {
- unsigned long cur_flags = flags;
+ u64 cur_flags = flags;
pagemap_entry_t pme;
if (folio && (flags & PM_PRESENT) &&
_
Patches currently in -mm which might be from david(a)redhat.com are
docs-tmpfs-update-the-large-folios-policy-for-tmpfs-and-shmem.patch
mm-memory_hotplug-move-debug_pagealloc_map_pages-into-online_pages_range.patch
mm-page_isolation-dont-pass-gfp-flags-to-isolate_single_pageblock.patch
mm-page_isolation-dont-pass-gfp-flags-to-start_isolate_page_range.patch
mm-page_alloc-make-__alloc_contig_migrate_range-static.patch
mm-page_alloc-sort-out-the-alloc_contig_range-gfp-flags-mess.patch
mm-page_alloc-forward-the-gfp-flags-from-alloc_contig_range-to-post_alloc_hook.patch
powernv-memtrace-use-__gfp_zero-with-alloc_contig_pages.patch
mm-hugetlb-dont-map-folios-writable-without-vm_write-when-copying-during-fork.patch
fs-proc-vmcore-convert-vmcore_cb_lock-into-vmcore_mutex.patch
fs-proc-vmcore-replace-vmcoredd_mutex-by-vmcore_mutex.patch
fs-proc-vmcore-disallow-vmcore-modifications-while-the-vmcore-is-open.patch
fs-proc-vmcore-prefix-all-pr_-with-vmcore.patch
fs-proc-vmcore-move-vmcore-definitions-out-of-kcoreh.patch
fs-proc-vmcore-factor-out-allocating-a-vmcore-range-and-adding-it-to-a-list.patch
fs-proc-vmcore-factor-out-freeing-a-list-of-vmcore-ranges.patch
fs-proc-vmcore-introduce-proc_vmcore_device_ram-to-detect-device-ram-ranges-in-2nd-kernel.patch
virtio-mem-mark-device-ready-before-registering-callbacks-in-kdump-mode.patch
virtio-mem-remember-usable-region-size.patch
virtio-mem-support-config_proc_vmcore_device_ram.patch
s390-kdump-virtio-mem-kdump-support-config_proc_vmcore_device_ram.patch
mm-page_alloc-dont-use-__gfp_hardwall-when-migrating-pages-via-alloc_contig.patch
mm-memory_hotplug-dont-use-__gfp_hardwall-when-migrating-pages-via-memory-offlining.patch
The quilt patch titled
Subject: kcov: mark in_softirq_really() as __always_inline
has been removed from the -mm tree. Its filename was
kcov-mark-in_softirq_really-as-__always_inline.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Arnd Bergmann <arnd(a)arndb.de>
Subject: kcov: mark in_softirq_really() as __always_inline
Date: Tue, 17 Dec 2024 08:18:10 +0100
If gcc decides not to inline in_softirq_really(), objtool warns about a
function call with UACCESS enabled:
kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc+0x1e: call to in_softirq_really() with UACCESS enabled
kernel/kcov.o: warning: objtool: check_kcov_mode+0x11: call to in_softirq_really() with UACCESS enabled
Mark this as __always_inline to avoid the problem.
Link: https://lkml.kernel.org/r/20241217071814.2261620-1-arnd@kernel.org
Fixes: 7d4df2dad312 ("kcov: properly check for softirq context")
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Reviewed-by: Marco Elver <elver(a)google.com>
Cc: Aleksandr Nogikh <nogikh(a)google.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/kcov.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/kcov.c~kcov-mark-in_softirq_really-as-__always_inline
+++ a/kernel/kcov.c
@@ -166,7 +166,7 @@ static void kcov_remote_area_put(struct
* Unlike in_serving_softirq(), this function returns false when called during
* a hardirq or an NMI that happened in the softirq context.
*/
-static inline bool in_softirq_really(void)
+static __always_inline bool in_softirq_really(void)
{
return in_serving_softirq() && !in_hardirq() && !in_nmi();
}
_
Patches currently in -mm which might be from arnd(a)arndb.de are
The quilt patch titled
Subject: mm/kmemleak: fix sleeping function called from invalid context at print message
has been removed from the -mm tree. Its filename was
mm-kmemleak-fix-sleeping-function-called-from-invalid-context-at-print-message.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Alessandro Carminati <acarmina(a)redhat.com>
Subject: mm/kmemleak: fix sleeping function called from invalid context at print message
Date: Tue, 17 Dec 2024 14:20:33 +0000
Address a bug in the kernel that triggers a "sleeping function called from
invalid context" warning when /sys/kernel/debug/kmemleak is printed under
specific conditions:
- CONFIG_PREEMPT_RT=y
- Set SELinux as the LSM for the system
- Set kptr_restrict to 1
- kmemleak buffer contains at least one item
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 2
6 locks held by cat/136:
#0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
#1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
#3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
#4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
#5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
irq event stamp: 136660
hardirqs last enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
softirqs last enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G E 6.11.0-rt7+ #34
Tainted: [E]=UNSIGNED_MODULE
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xa0/0x128
show_stack+0x1c/0x30
dump_stack_lvl+0xe8/0x198
dump_stack+0x18/0x20
rt_spin_lock+0x8c/0x1a8
avc_perm_nonode+0xa0/0x150
cred_has_capability.isra.0+0x118/0x218
selinux_capable+0x50/0x80
security_capable+0x7c/0xd0
has_ns_capability_noaudit+0x94/0x1b0
has_capability_noaudit+0x20/0x30
restricted_pointer+0x21c/0x4b0
pointer+0x298/0x760
vsnprintf+0x330/0xf70
seq_printf+0x178/0x218
print_unreferenced+0x1a4/0x2d0
kmemleak_seq_show+0xd0/0x1e0
seq_read_iter+0x354/0xe30
seq_read+0x250/0x378
full_proxy_read+0xd8/0x148
vfs_read+0x190/0x918
ksys_read+0xf0/0x1e0
__arm64_sys_read+0x70/0xa8
invoke_syscall.constprop.0+0xd4/0x1d8
el0_svc+0x50/0x158
el0t_64_sync+0x17c/0x180
%pS and %pK in the same backtrace line are redundant, and %pS can defeat
the purpose of %pK in certain contexts.
%pS alone already provides the necessary information, and if it cannot
resolve the symbol, it falls back to printing the raw address, voiding
the original intent behind %pK.
Additionally, %pK requires a CAP_SYSLOG privilege check enforced through
the LSM, which can trigger a "sleeping function called from invalid
context" warning under PREEMPT_RT kernels when the check occurs in an
atomic context. This issue may also affect other LSMs.
This change avoids the unnecessary privilege check and resolves the
sleeping function warning without any loss of information.
Link: https://lkml.kernel.org/r/20241217142032.55793-1-acarmina@redhat.com
Fixes: 3a6f33d86baa ("mm/kmemleak: use %pK to display kernel pointers in backtrace")
Signed-off-by: Alessandro Carminati <acarmina(a)redhat.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Clément Léger <clement.leger(a)bootlin.com>
Cc: Alessandro Carminati <acarmina(a)redhat.com>
Cc: Eric Chanudet <echanude(a)redhat.com>
Cc: Gabriele Paoloni <gpaoloni(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kmemleak.c~mm-kmemleak-fix-sleeping-function-called-from-invalid-context-at-print-message
+++ a/mm/kmemleak.c
@@ -373,7 +373,7 @@ static void print_unreferenced(struct se
for (i = 0; i < nr_entries; i++) {
void *ptr = (void *)entries[i];
- warn_or_seq_printf(seq, " [<%pK>] %pS\n", ptr, ptr);
+ warn_or_seq_printf(seq, " %pS\n", ptr);
}
}
_
Patches currently in -mm which might be from acarmina(a)redhat.com are
The quilt patch titled
Subject: mm: hugetlb: independent PMD page table shared count
has been removed from the -mm tree. Its filename was
mm-hugetlb-independent-pmd-page-table-shared-count.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: mm: hugetlb: independent PMD page table shared count
Date: Mon, 16 Dec 2024 15:11:47 +0800
The folio refcount may be increased unexpectedly through try_get_folio() by
callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
to check whether a pmd page table is shared. The check is incorrect if
the refcount is increased by the above callers, and this can cause the page
table to be leaked:
BUG: Bad page state in process sh pfn:109324
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
page_type: f2(table)
raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
page dumped because: nonzero mapcount
...
CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7
Tainted: [B]=BAD_PAGE
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x80/0xf8
dump_stack+0x18/0x28
bad_page+0x8c/0x130
free_page_is_bad_report+0xa4/0xb0
free_unref_page+0x3cc/0x620
__folio_put+0xf4/0x158
split_huge_pages_all+0x1e0/0x3e8
split_huge_pages_write+0x25c/0x2d8
full_proxy_write+0x64/0xd8
vfs_write+0xcc/0x280
ksys_write+0x70/0x110
__arm64_sys_write+0x24/0x38
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x34/0x128
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x190/0x198
The issue may be triggered by damon, offline_page, page_idle, etc., which
will increase the refcount of the page table.
1. The page table itself will be discarded after reporting the
"nonzero mapcount".
2. The HugeTLB page mapped by the page table misses being freed, since we
treat the page table as shared and a shared page table will not be
unmapped.
Fix it by introducing independent PMD page table shared count. As
described by comment, pt_index/pt_mm/pt_frag_refcount are used for s390
gmap, x86 pgds and powerpc, pt_share_count is used for x86/arm64/riscv
pmds, so we can reuse the field as pt_share_count.
Link: https://lkml.kernel.org/r/20241216071147.3984217-1-liushixin2@huawei.com
Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Ken Chen <kenneth.w.chen(a)intel.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Nanyong Sun <sunnanyong(a)huawei.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 1 +
include/linux/mm_types.h | 30 ++++++++++++++++++++++++++++++
mm/hugetlb.c | 16 +++++++---------
3 files changed, 38 insertions(+), 9 deletions(-)
--- a/include/linux/mm.h~mm-hugetlb-independent-pmd-page-table-shared-count
+++ a/include/linux/mm.h
@@ -3125,6 +3125,7 @@ static inline bool pagetable_pmd_ctor(st
if (!pmd_ptlock_init(ptdesc))
return false;
__folio_set_pgtable(folio);
+ ptdesc_pmd_pts_init(ptdesc);
lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
}
--- a/include/linux/mm_types.h~mm-hugetlb-independent-pmd-page-table-shared-count
+++ a/include/linux/mm_types.h
@@ -445,6 +445,7 @@ FOLIO_MATCH(compound_head, _head_2a);
* @pt_index: Used for s390 gmap.
* @pt_mm: Used for x86 pgds.
* @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
+ * @pt_share_count: Used for HugeTLB PMD page table share count.
* @_pt_pad_2: Padding to ensure proper alignment.
* @ptl: Lock for the page table.
* @__page_type: Same as page->page_type. Unused for page tables.
@@ -471,6 +472,9 @@ struct ptdesc {
pgoff_t pt_index;
struct mm_struct *pt_mm;
atomic_t pt_frag_refcount;
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+ atomic_t pt_share_count;
+#endif
};
union {
@@ -516,6 +520,32 @@ static_assert(sizeof(struct ptdesc) <= s
const struct page *: (const struct ptdesc *)(p), \
struct page *: (struct ptdesc *)(p)))
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+ atomic_set(&ptdesc->pt_share_count, 0);
+}
+
+static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
+{
+ atomic_inc(&ptdesc->pt_share_count);
+}
+
+static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
+{
+ atomic_dec(&ptdesc->pt_share_count);
+}
+
+static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
+{
+ return atomic_read(&ptdesc->pt_share_count);
+}
+#else
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+}
+#endif
+
/*
* Used for sizing the vmemmap region on some architectures
*/
--- a/mm/hugetlb.c~mm-hugetlb-independent-pmd-page-table-shared-count
+++ a/mm/hugetlb.c
@@ -7211,7 +7211,7 @@ pte_t *huge_pmd_share(struct mm_struct *
spte = hugetlb_walk(svma, saddr,
vma_mmu_pagesize(svma));
if (spte) {
- get_page(virt_to_page(spte));
+ ptdesc_pmd_pts_inc(virt_to_ptdesc(spte));
break;
}
}
@@ -7226,7 +7226,7 @@ pte_t *huge_pmd_share(struct mm_struct *
(pmd_t *)((unsigned long)spte & PAGE_MASK));
mm_inc_nr_pmds(mm);
} else {
- put_page(virt_to_page(spte));
+ ptdesc_pmd_pts_dec(virt_to_ptdesc(spte));
}
spin_unlock(&mm->page_table_lock);
out:
@@ -7238,10 +7238,6 @@ out:
/*
* unmap huge page backed by shared pte.
*
- * Hugetlb pte page is ref counted at the time of mapping. If pte is shared
- * indicated by page_count > 1, unmap is achieved by clearing pud and
- * decrementing the ref count. If count == 1, the pte page is not shared.
- *
* Called with page table lock held.
*
* returns: 1 successfully unmapped a shared pte page
@@ -7250,18 +7246,20 @@ out:
int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
+ unsigned long sz = huge_page_size(hstate_vma(vma));
pgd_t *pgd = pgd_offset(mm, addr);
p4d_t *p4d = p4d_offset(pgd, addr);
pud_t *pud = pud_offset(p4d, addr);
i_mmap_assert_write_locked(vma->vm_file->f_mapping);
hugetlb_vma_assert_locked(vma);
- BUG_ON(page_count(virt_to_page(ptep)) == 0);
- if (page_count(virt_to_page(ptep)) == 1)
+ if (sz != PMD_SIZE)
+ return 0;
+ if (!ptdesc_pmd_pts_count(virt_to_ptdesc(ptep)))
return 0;
pud_clear(pud);
- put_page(virt_to_page(ptep));
+ ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
mm_dec_nr_pmds(mm);
return 1;
}
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
The quilt patch titled
Subject: maple_tree: reload mas before the second call for mas_empty_area
has been removed from the -mm tree. Its filename was
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yang Erkun <yangerkun(a)huawei.com>
Subject: maple_tree: reload mas before the second call for mas_empty_area
Date: Sat, 14 Dec 2024 17:30:05 +0800
Change the LONG_MAX in simple_offset_add() to 1024, and then do the following:
[root@fedora ~]# mkdir /tmp/dir
[root@fedora ~]# for i in {1..1024}; do touch /tmp/dir/$i; done
touch: cannot touch '/tmp/dir/1024': Device or resource busy
[root@fedora ~]# rm /tmp/dir/123
[root@fedora ~]# touch /tmp/dir/1024
[root@fedora ~]# rm /tmp/dir/100
[root@fedora ~]# touch /tmp/dir/1025
touch: cannot touch '/tmp/dir/1025': Device or resource busy
After we delete file 100, this actually leaves an empty entry, but the later
create fails unexpectedly.
mas_alloc_cyclic() has two chances to find an empty entry. It first searches
the range from range_lo to range_hi; if no empty entry exists and range_lo >
min, it retries with the range from min to range_hi. However, the first call
to mas_empty_area() may mark the mas as EBUSY, and the second call to
mas_empty_area() will then return failure directly. Fix this by reloading the
mas before the second call to mas_empty_area().
[Liam.Howlett(a)Oracle.com: fix mas_alloc_cyclic() second search]
Link: https://lore.kernel.org/all/20241216060600.287B4C4CED0@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20241216190113.1226145-2-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20241214093005.72284-1-yangerkun@huaweicloud.com
Fixes: 9b6713cc7522 ("maple_tree: Add mtree_alloc_cyclic()")
Signed-off-by: Yang Erkun <yangerkun(a)huawei.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Chuck Lever <chuck.lever(a)oracle.com>
Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 1 +
1 file changed, 1 insertion(+)
--- a/lib/maple_tree.c~maple_tree-reload-mas-before-the-second-call-for-mas_empty_area
+++ a/lib/maple_tree.c
@@ -4354,6 +4354,7 @@ int mas_alloc_cyclic(struct ma_state *ma
ret = 1;
}
if (ret < 0 && range_lo > min) {
+ mas_reset(mas);
ret = mas_empty_area(mas, min, range_hi, 1);
if (ret == 0)
ret = 1;
_
Patches currently in -mm which might be from yangerkun(a)huawei.com are
The quilt patch titled
Subject: mm/readahead: fix large folio support in async readahead
has been removed from the -mm tree. Its filename was
mm-readahead-fix-large-folio-support-in-async-readahead.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yafang Shao <laoar.shao(a)gmail.com>
Subject: mm/readahead: fix large folio support in async readahead
Date: Fri, 6 Dec 2024 16:30:25 +0800
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After I tune it to 2MB, the large folio can
work as expected. However, I believe the large folio behavior should not
be dependent on the value of read_ahead_kb. It would be more robust if
the kernel could automatically adapt to it.
With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
/proc/meminfo are as follows:
- before this patch
FileHugePages: 18432 kB
FilePmdMapped: 4096 kB
- after this patch
FileHugePages: 1067008 kB
FilePmdMapped: 1048576 kB
This shows that after applying the patch, the entire 1GB file is mapped to
huge pages. The stable list is CCed, as without this patch, large folios
don't function optimally in the readahead path.
It's worth noting that if read_ahead_kb is set to a larger value that
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail
to map to hugepages.
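For reference, a minimal reproduction sketch along the lines described above;
the file path and mount point are assumptions for illustration, and the
MADV_HUGEPAGE usage and /proc/meminfo counters are as in the text:
```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	/* Assumed test file: a large (e.g. 1 GiB) file on an XFS mount. */
	int fd = open("/mnt/test/file", O_RDONLY);
	struct stat st;
	unsigned char *p;
	volatile unsigned char sink = 0;
	off_t i;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Request huge pages for the mapping, then touch it sequentially. */
	madvise(p, st.st_size, MADV_HUGEPAGE);
	for (i = 0; i < st.st_size; i += 4096)
		sink += p[i];

	/* Compare FileHugePages/FilePmdMapped in /proc/meminfo before/after. */
	munmap(p, st.st_size);
	close(fd);
	return 0;
}
```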
Link: https://lkml.kernel.org/r/20241108141710.9721-1-laoar.shao@gmail.com
Link: https://lkml.kernel.org/r/20241206083025.3478-1-laoar.shao@gmail.com
Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Signed-off-by: Yafang Shao <laoar.shao(a)gmail.com>
Tested-by: kernel test robot <oliver.sang(a)intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/readahead.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/readahead.c~mm-readahead-fix-large-folio-support-in-async-readahead
+++ a/mm/readahead.c
@@ -646,7 +646,11 @@ void page_cache_async_ra(struct readahea
1UL << order);
if (index == expected) {
ra->start += ra->size;
- ra->size = get_next_ra_size(ra, max_pages);
+ /*
+ * In the case of MADV_HUGEPAGE, the actual size might exceed
+ * the readahead window.
+ */
+ ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
ra->async_size = ra->size;
goto readit;
}
_
Patches currently in -mm which might be from laoar.shao(a)gmail.com are
kernel-remove-get_task_comm-and-print-task-comm-directly.patch
arch-remove-get_task_comm-and-print-task-comm-directly.patch
net-remove-get_task_comm-and-print-task-comm-directly.patch
security-remove-get_task_comm-and-print-task-comm-directly.patch
drivers-remove-get_task_comm-and-print-task-comm-directly.patch
The quilt patch titled
Subject: mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
has been removed from the -mm tree. Its filename was
mm-vmscan-account-for-free-pages-to-prevent-infinite-loop-in-throttle_direct_reclaim.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Seiji Nishikawa <snishika(a)redhat.com>
Subject: mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
Date: Sun, 1 Dec 2024 01:12:34 +0900
The task sometimes continues looping in throttle_direct_reclaim() because
allow_direct_reclaim(pgdat) keeps returning false.
#0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
#1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
#2 [ffff80002cb6f990] schedule at ffff800008abc50c
#3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
#4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
#5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
#6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
#7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
#8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
#9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4
At this point, the pgdat contains the following two zones:
NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32"
SIZE: 20480 MIN/LOW/HIGH: 11/28/45
VM_STAT:
NR_FREE_PAGES: 359
NR_ZONE_INACTIVE_ANON: 18813
NR_ZONE_ACTIVE_ANON: 0
NR_ZONE_INACTIVE_FILE: 50
NR_ZONE_ACTIVE_FILE: 0
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal"
SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264
VM_STAT:
NR_FREE_PAGES: 146
NR_ZONE_INACTIVE_ANON: 94668
NR_ZONE_ACTIVE_ANON: 3
NR_ZONE_INACTIVE_FILE: 735
NR_ZONE_ACTIVE_FILE: 78
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero.
Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.
crash> p nr_swap_pages
nr_swap_pages = $1937 = {
counter = 0
}
As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
free pages significantly exceeding the high watermark.
The problem is that the pgdat->kswapd_failures hasn't been incremented.
crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
$1935 = 0x0
This is because the node deemed balanced. The node balancing logic in
balance_pgdat() evaluates all zones collectively. If one or more zones
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the
entire node is deemed balanced. This causes balance_pgdat() to exit early
before incrementing the kswapd_failures, as it considers the overall
memory state acceptable, even though some zones (like ZONE_NORMAL) remain
under significant pressure.
The patch ensures that zone_reclaimable_pages() includes free pages
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are
available (e.g., file-backed or anonymous pages). This change prevents
zones like ZONE_DMA32, which have sufficient free pages, from being
mistakenly deemed unreclaimable. By doing so, the patch ensures proper
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by
allow_direct_reclaim(pgdat) repeatedly returning false.
The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
by a node being incorrectly deemed balanced despite pressure in certain
zones, such as ZONE_NORMAL. This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.
The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
during reclaim, masking pressure in other zones. Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().
This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones
with sufficient free pages are not skipped, enabling proper balancing and
reclaim behavior.
[akpm(a)linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com
Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com
Fixes: 5a1c84b404a7 ("mm: remove reclaim and compaction retry approximations")
Signed-off-by: Seiji Nishikawa <snishika(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/mm/vmscan.c~mm-vmscan-account-for-free-pages-to-prevent-infinite-loop-in-throttle_direct_reclaim
+++ a/mm/vmscan.c
@@ -374,7 +374,14 @@ unsigned long zone_reclaimable_pages(str
if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
-
+ /*
+ * If there are no reclaimable file-backed or anonymous pages,
+ * ensure zones with sufficient free pages are not skipped.
+ * This prevents zones like DMA32 from being ignored in reclaim
+ * scenarios where they can still help alleviate memory pressure.
+ */
+ if (nr == 0)
+ nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);
return nr;
}
_
Patches currently in -mm which might be from snishika(a)redhat.com are
The quilt patch titled
Subject: mm: reinstate ability to map write-sealed memfd mappings read-only
has been removed from the -mm tree. Its filename was
mm-reinstate-ability-to-map-write-sealed-memfd-mappings-read-only.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Subject: mm: reinstate ability to map write-sealed memfd mappings read-only
Date: Thu, 28 Nov 2024 15:06:17 +0000
Patch series "mm: reinstate ability to map write-sealed memfd mappings
read-only".
In commit 158978945f31 ("mm: perform the mapping_map_writable() check
after call_mmap()") (and preceding changes in the same series) it became
possible to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
behaviour") unintentionally undid this logic by moving the
mapping_map_writable() check before the shmem_mmap() hook is invoked,
thereby regressing this change.
This series reworks how we both permit write-sealed mappings being mapped
read-only and disallow mprotect() from undoing the write-seal, fixing this
regression.
We also add a regression test to ensure that we do not accidentally
regress this in future.
Thanks to Julian Orth for reporting this regression.
This patch (of 2):
In commit 158978945f31 ("mm: perform the mapping_map_writable() check
after call_mmap()") (and preceding changes in the same series) it became
possible to mmap() F_SEAL_WRITE sealed memfd mappings read-only.
This was previously unnecessarily disallowed, despite the man page
documentation indicating that it would be, thereby limiting the usefulness
of F_SEAL_WRITE logic.
We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE
seal (one which disallows future writes to the memfd) to also be used for
F_SEAL_WRITE.
For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a
read-only mapping to disallow mprotect() from overriding the seal - an
operation performed by seal_check_write(), invoked from shmem_mmap(), the
f_op->mmap() hook used by shmem mappings.
By extending this to F_SEAL_WRITE and critically - checking
mapping_map_writable() to determine if we may map the memfd AFTER we
invoke shmem_mmap() - the desired logic becomes possible. This is because
mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will
have cleared.
Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path
behaviour") unintentionally undid this logic by moving the
mapping_map_writable() check before the shmem_mmap() hook is invoked,
thereby regressing this change.
We reinstate this functionality by moving the check out of shmem_mmap()
and instead performing it in do_mmap() at the point at which VMA flags are
being determined, which seems in any case to be a more appropriate place
in which to make this determination.
In order to achieve this we rework memfd seal logic to allow us access to
this information using existing logic and eliminate the clearing of
VM_MAYWRITE from seal_check_write() which we are performing in do_mmap()
instead.
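As a concrete illustration of the behaviour being reinstated, the following
minimal userspace sketch (a hypothetical test program, not part of this series)
creates a write-sealed memfd and then maps it read-only and MAP_SHARED; on
kernels carrying the regression the mmap() fails, while with this series it
succeeds again:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
int main(void)
{
	const char msg[] = "hello";
	int fd = memfd_create("sealed", MFD_ALLOW_SEALING);
	if (fd < 0 || write(fd, msg, sizeof(msg)) < 0)
		return 1;
	/* Forbid any further writes through any fd or mapping. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
		return 1;
	/* Read-only MAP_SHARED mapping of the write-sealed memfd: the
	 * case that commit 5de195060b2e regressed. */
	void *p = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");		/* fails on affected kernels */
		return 1;
	}
	puts(p);
	return 0;
}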
Link: https://lkml.kernel.org/r/99fc35d2c62bd2e05571cf60d9f8b843c56069e0.17328047…
Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reported-by: Julian Orth <ju.orth(a)gmail.com>
Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZx…
Cc: Jann Horn <jannh(a)google.com>
Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/memfd.h | 14 +++++++++
include/linux/mm.h | 58 +++++++++++++++++++++++++++-------------
mm/memfd.c | 2 -
mm/mmap.c | 4 ++
4 files changed, 59 insertions(+), 19 deletions(-)
--- a/include/linux/memfd.h~mm-reinstate-ability-to-map-write-sealed-memfd-mappings-read-only
+++ a/include/linux/memfd.h
@@ -7,6 +7,7 @@
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);
struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx);
+unsigned int *memfd_file_seals_ptr(struct file *file);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
{
@@ -16,6 +17,19 @@ static inline struct folio *memfd_alloc_
{
return ERR_PTR(-EINVAL);
}
+
+static inline unsigned int *memfd_file_seals_ptr(struct file *file)
+{
+ return NULL;
+}
#endif
+/* Retrieve memfd seals associated with the file, if any. */
+static inline unsigned int memfd_file_seals(struct file *file)
+{
+ unsigned int *sealsp = memfd_file_seals_ptr(file);
+
+ return sealsp ? *sealsp : 0;
+}
+
#endif /* __LINUX_MEMFD_H */
--- a/include/linux/mm.h~mm-reinstate-ability-to-map-write-sealed-memfd-mappings-read-only
+++ a/include/linux/mm.h
@@ -4101,6 +4101,37 @@ void mem_dump_obj(void *object);
static inline void mem_dump_obj(void *object) {}
#endif
+static inline bool is_write_sealed(int seals)
+{
+ return seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE);
+}
+
+/**
+ * is_readonly_sealed - Checks whether write-sealed but mapped read-only,
+ *                      in which case writes should be disallowed moving
+ *                      forwards.
+ * @seals: the seals to check
+ * @vm_flags: the VMA flags to check
+ *
+ * Returns whether readonly sealed, in which case writes should be disallowed
+ * going forward.
+ */
+static inline bool is_readonly_sealed(int seals, vm_flags_t vm_flags)
+{
+ /*
+ * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as
+ * MAP_SHARED and read-only, take care to not allow mprotect to
+ * revert protections on such mappings. Do this only for shared
+ * mappings. For private mappings, don't need to mask
+ * VM_MAYWRITE as we still want them to be COW-writable.
+ */
+ if (is_write_sealed(seals) &&
+ ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_SHARED))
+ return true;
+
+ return false;
+}
+
/**
* seal_check_write - Check for F_SEAL_WRITE or F_SEAL_FUTURE_WRITE flags and
* handle them.
@@ -4112,24 +4143,15 @@ static inline void mem_dump_obj(void *ob
*/
static inline int seal_check_write(int seals, struct vm_area_struct *vma)
{
- if (seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
- /*
- * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
- * write seals are active.
- */
- if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
- return -EPERM;
-
- /*
- * Since an F_SEAL_[FUTURE_]WRITE sealed memfd can be mapped as
- * MAP_SHARED and read-only, take care to not allow mprotect to
- * revert protections on such mappings. Do this only for shared
- * mappings. For private mappings, don't need to mask
- * VM_MAYWRITE as we still want them to be COW-writable.
- */
- if (vma->vm_flags & VM_SHARED)
- vm_flags_clear(vma, VM_MAYWRITE);
- }
+ if (!is_write_sealed(seals))
+ return 0;
+
+ /*
+ * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
+ * write seals are active.
+ */
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
+ return -EPERM;
return 0;
}
--- a/mm/memfd.c~mm-reinstate-ability-to-map-write-sealed-memfd-mappings-read-only
+++ a/mm/memfd.c
@@ -170,7 +170,7 @@ static int memfd_wait_for_pins(struct ad
return error;
}
-static unsigned int *memfd_file_seals_ptr(struct file *file)
+unsigned int *memfd_file_seals_ptr(struct file *file)
{
if (shmem_file(file))
return &SHMEM_I(file_inode(file))->seals;
--- a/mm/mmap.c~mm-reinstate-ability-to-map-write-sealed-memfd-mappings-read-only
+++ a/mm/mmap.c
@@ -47,6 +47,7 @@
#include <linux/oom.h>
#include <linux/sched/mm.h>
#include <linux/ksm.h>
+#include <linux/memfd.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
@@ -368,6 +369,7 @@ unsigned long do_mmap(struct file *file,
if (file) {
struct inode *inode = file_inode(file);
+ unsigned int seals = memfd_file_seals(file);
unsigned long flags_mask;
if (!file_mmap_ok(file, inode, pgoff, len))
@@ -408,6 +410,8 @@ unsigned long do_mmap(struct file *file,
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
+ else if (is_readonly_sealed(seals, vm_flags))
+ vm_flags &= ~VM_MAYWRITE;
fallthrough;
case MAP_PRIVATE:
if (!(file->f_mode & FMODE_READ))
_
Patches currently in -mm which might be from lorenzo.stoakes(a)oracle.com are
mm-vma-move-brk-internals-to-mm-vmac.patch
mm-vma-move-brk-internals-to-mm-vmac-fix.patch
mm-vma-move-unmapped_area-internals-to-mm-vmac.patch
mm-abstract-get_arg_page-stack-expansion-and-mmap-read-lock.patch
mm-vma-move-stack-expansion-logic-to-mm-vmac.patch
mm-vma-move-__vm_munmap-to-mm-vmac.patch
selftests-mm-add-fork-cow-guard-page-test.patch
mm-enforce-__must_check-on-vma-merge-and-split.patch
mm-perform-all-memfd-seal-checks-in-a-single-place.patch
mm-perform-all-memfd-seal-checks-in-a-single-place-fix.patch
maintainers-update-memory-mapping-section.patch
mm-assert-mmap-write-lock-held-on-do_mmap-mmap_region.patch
mm-add-comments-to-do_mmap-mmap_region-and-vm_mmap.patch
tools-testing-add-simple-__mmap_region-userland-test.patch
The quilt patch titled
Subject: maple_tree: fix mas_alloc_cyclic() second search
has been removed from the -mm tree. Its filename was
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area-fix.patch
This patch was dropped because it was folded into maple_tree-reload-mas-before-the-second-call-for-mas_empty_area.patch
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)Oracle.com>
Subject: maple_tree: fix mas_alloc_cyclic() second search
Date: Mon, 16 Dec 2024 14:01:12 -0500
The first search may leave the maple state in an error state. Reset the
maple state before the second search so that the search has a chance of
executing correctly after an exhausted first search.
Link: https://lore.kernel.org/all/20241216060600.287B4C4CED0@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20241216190113.1226145-2-Liam.Howlett@oracle.com
Fixes: 9b6713cc7522 ("maple_tree: Add mtree_alloc_cyclic()")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reviewed-by: Yang Erkun <yangerkun(a)huawei.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Chuck Lever <chuck.lever(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/lib/maple_tree.c~maple_tree-reload-mas-before-the-second-call-for-mas_empty_area-fix
+++ a/lib/maple_tree.c
@@ -4346,7 +4346,6 @@ int mas_alloc_cyclic(struct ma_state *ma
{
unsigned long min = range_lo;
int ret = 0;
- struct ma_state m = *mas;
range_lo = max(min, *next);
ret = mas_empty_area(mas, range_lo, range_hi, 1);
@@ -4355,7 +4354,7 @@ int mas_alloc_cyclic(struct ma_state *ma
ret = 1;
}
if (ret < 0 && range_lo > min) {
- *mas = m;
+ mas_reset(mas);
ret = mas_empty_area(mas, min, range_hi, 1);
if (ret == 0)
ret = 1;
_
Patches currently in -mm which might be from Liam.Howlett(a)Oracle.com are
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area.patch
test_maple_tree-test-exhausted-upper-limit-of-mtree_alloc_cyclic.patch
From: Joshua Washington <joshwash(a)google.com>
Commit ba0925c34e0f ("gve: process XSK TX descriptors as part of RX NAPI")
moved XSK TX processing to be part of the RX NAPI. However, that commit
did not include triggering the RX NAPI in gve_xsk_wakeup. This is
necessary because the TX NAPI only processes TX completions, meaning
that a TX wakeup would not actually trigger XSK descriptor processing.
Also, the branch on XDP_WAKEUP_TX was supposed to have been removed, as
the NAPI should be scheduled whether the wakeup is for RX or TX.
Fixes: ba0925c34e0f ("gve: process XSK TX descriptors as part of RX NAPI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Joshua Washington <joshwash(a)google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
---
drivers/net/ethernet/google/gve/gve_main.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 09fb7f16f73e..8a8f6ab12a98 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1714,7 +1714,7 @@ static int gve_xsk_pool_disable(struct net_device *dev,
static int gve_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
{
struct gve_priv *priv = netdev_priv(dev);
- int tx_queue_id = gve_xdp_tx_queue_id(priv, queue_id);
+ struct napi_struct *napi;
if (!gve_get_napi_enabled(priv))
return -ENETDOWN;
@@ -1722,19 +1722,12 @@ static int gve_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
if (queue_id >= priv->rx_cfg.num_queues || !priv->xdp_prog)
return -EINVAL;
- if (flags & XDP_WAKEUP_TX) {
- struct gve_tx_ring *tx = &priv->tx[tx_queue_id];
- struct napi_struct *napi =
- &priv->ntfy_blocks[tx->ntfy_id].napi;
-
- if (!napi_if_scheduled_mark_missed(napi)) {
- /* Call local_bh_enable to trigger SoftIRQ processing */
- local_bh_disable();
- napi_schedule(napi);
- local_bh_enable();
- }
-
- tx->xdp_xsk_wakeup++;
+ napi = &priv->ntfy_blocks[gve_rx_idx_to_ntfy(priv, queue_id)].napi;
+ if (!napi_if_scheduled_mark_missed(napi)) {
+ /* Call local_bh_enable to trigger SoftIRQ processing */
+ local_bh_disable();
+ napi_schedule(napi);
+ local_bh_enable();
}
return 0;
--
2.47.1.613.gc27f4b7a9f-goog
Greg,
A Gentoo user found an issue with btrfs where:
"... upstream commit a5d74fa24752
("btrfs: avoid unnecessary device path update for the same device") breaks
6.6 LTS kernel behavior where previously lsblk can properly show multiple
mount points of the same btrfs"
There's a revert on the bug report [1].
The fix by the author can be found on linux-btrfs [2]
[1] https://bugs.gentoo.org/947126
[2] https://lore.kernel.org/linux-btrfs/30aefd8b4e8c1f0c5051630b106a1ff3570d28e…
Mike
--
Mike Pagano
Gentoo Developer - Kernel Project
E-Mail : mpagano(a)gentoo.org
GnuPG FP : 52CC A0B0 F631 0B17 0142 F83F 92A6 DBEC 81F2 B137
Public Key : http://pgp.mit.edu/pks/lookup?search=0x92A6DBEC81F2B137&op=index
The fields "eip" (instruction pointer) and "esp" (stack pointer) of a task
can be read from /proc/PID/stat. These fields can be interesting for
coredump.
However, these fields were disabled by commit 0a1eb2d474ed ("fs/proc: Stop
reporting eip and esp in /proc/PID/stat"), because it is generally unsafe
to do so. But it is safe for a coredumping process, and therefore
exceptions were made:
- for a coredumping thread by commit fd7d56270b52 ("fs/proc: Report
eip/esp in /prod/PID/stat for coredumping").
- for all other threads in a coredumping process by commit cb8f381f1613
("fs/proc/array.c: allow reporting eip/esp for all coredumping
threads").
The above two commits check the PF_DUMPCORE flag to determine a coredump thread
and the PF_EXITING flag for the other threads.
Unfortunately, commit 92307383082d ("coredump: Don't perform any cleanups
before dumping core") moved coredump to happen earlier and before PF_EXITING is
set. Thus, checking PF_EXITING is no longer the correct way to determine
threads in a coredumping process.
Instead of PF_EXITING, use PF_POSTCOREDUMP to determine the other threads.
The check of PF_EXITING was added for coredumping, so it can probably be
removed now. But it doesn't hurt to keep it.
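For reference, a minimal userspace sketch (a hypothetical program, assuming the
proc(5) field layout in which kstkesp and kstkeip are the 29th and 30th fields)
that prints the two values; they read back as 0 unless the rules above permit
reporting them:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char **argv)
{
	char path[64], buf[4096];
	unsigned long long esp = 0, eip = 0;
	int field = 2;
	char *p;
	snprintf(path, sizeof(path), "/proc/%s/stat",
		 argc > 1 ? argv[1] : "self");
	FILE *f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	/* comm may contain spaces, so skip past the closing ')'. */
	p = strrchr(buf, ')');
	if (!p)
		return 1;
	for (p = strtok(p + 1, " "); p; p = strtok(NULL, " ")) {
		field++;
		if (field == 29) {
			esp = strtoull(p, NULL, 10);	/* kstkesp */
		} else if (field == 30) {
			eip = strtoull(p, NULL, 10);	/* kstkeip */
			break;
		}
	}
	printf("kstkesp=%llu kstkeip=%llu\n", esp, eip);
	fclose(f);
	return 0;
}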
Fixes: 92307383082d ("coredump: Don't perform any cleanups before dumping core")
Cc: stable(a)vger.kernel.org
Cc: Eric W. Biederman <ebiederm(a)xmission.com>
Acked-by: Oleg Nesterov <oleg(a)redhat.com>
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
---
fs/proc/array.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 55ed3510d2bb..d6a0369caa93 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -500,7 +500,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
* a program is not able to use ptrace(2) in that case. It is
* safe because the task has stopped executing permanently.
*/
- if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE))) {
+ if (permitted && (task->flags & (PF_EXITING|PF_DUMPCORE|PF_POSTCOREDUMP))) {
if (try_get_task_stack(task)) {
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
--
2.39.5
On 28/12/2024 15:20, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> watchdog: s3c2410_wdt: add support for exynosautov920 SoC
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> watchdog-s3c2410_wdt-add-support-for-exynosautov920-.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
This is new device support, depending on several other pieces like
syscon support, the Exynos PMU and of course the arm64/boot/dts patches. It
really relies on the rest of the SoC support and is useless on its own.
This is way beyond simple quirks, so I believe it does not meet the stable
criteria at all.
Based on that, I recommend dropping this patch from stable backports.
Best regards,
Krzysztof
From: Peilin He <he.peilin(a)zte.com.cn>
Upstream commit 6c24a03a61a2 ("net: dsa: improve shutdown sequence")
Issue
=====
Repeatedly accessing the DSA Ethernet controller via the ethtool command,
followed by a system reboot, may trigger a DSA null pointer dereference,
causing a kernel panic and preventing the system from rebooting properly.
This can lead to data loss or denial-of-service, resulting in serious
consequences.
The following is the panic log:
[ 172.523467] Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000020
[ 172.706923] Call trace:
[ 172.709371] dsa_master_get_sset_count+0x24/0xa4
[ 172.714000] ethtool_get_drvinfo+0x8c/0x210
[ 172.718193] dev_ethtool+0x780/0x2120
[ 172.721863] dev_ioctl+0x1b0/0x580
[ 172.725273] sock_do_ioctl+0xc0/0x100
[ 172.728944] sock_ioctl+0x130/0x3c0
[ 172.732440] __arm64_sys_ioctl+0xb4/0x100
[ 172.736460] invoke_syscall+0x50/0x120
[ 172.740219] el0_svc_common.constprop.0+0x4c/0xf4
[ 172.744936] do_el0_svc+0x2c/0xa0
[ 172.748257] el0_svc+0x20/0x60
[ 172.751318] el0t_64_sync_handler+0xe8/0x114
[ 172.755599] el0t_64_sync+0x180/0x184
[ 172.759271] Code: a90153f3 2a0103f4 a9025bf5 f9418015 (f94012b6)
[ 172.765383] ---[ end trace 0000000000000002 ]---
Root Cause
==========
Based on analysis of the Linux 5.15 stable version, the function
dsa_master_get_sset_count() accesses members of the structure pointed
to by cpu_dp without checking for a null pointer. If cpu_dp is a
null pointer, this will cause a kernel panic.
static int dsa_master_get_sset_count(struct net_device *dev, int sset)
{
struct dsa_port *cpu_dp = dev->dsa_ptr;
const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
struct dsa_switch *ds = cpu_dp->ds;
...
}
dev->dsa_ptr is set to NULL in the dsa_switch_shutdown() or
dsa_master_teardown() functions. When the DSA module unloads,
dsa_master_ethtool_teardown(dev) restores the original copy of
the DSA device's ethtool_ops using "dev->ethtool_ops =
cpu_dp->orig_ethtool_ops;" before setting dev->dsa_ptr to NULL.
This ensures that ethtool_ops remains accessible after DSA unloads.
However, dsa_switch_shutdown does not restore the original copy of
the DSA device's ethtool_ops, potentially leading to a null pointer
dereference of dsa_ptr and causing a system panic. Essentially,
when we set master->dsa_ptr to NULL, we need to ensure that
no user ports are making requests to the DSA driver.
Solution
========
The netif_device_detach() call is added to ensure that ioctl, rtnetlink
and ethtool requests on the user ports no longer propagate down to the
driver - we are no longer prepared to handle them.
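For context, entry points that honour the detach typically gate on
netif_device_present(), which tests the bit that netif_device_detach() clears.
A minimal sketch of that guard pattern (a hypothetical handler, not the actual
net core or DSA code):
#include <linux/netdevice.h>
/* Hypothetical request handler following the common guard pattern. */
static int example_request(struct net_device *dev)
{
	/* False once dsa_switch_shutdown() has called netif_device_detach()
	 * on the user port, so the request never reaches the DSA driver. */
	if (!netif_device_present(dev))
		return -ENODEV;
	/* ... hand the request to the driver as usual ... */
	return 0;
}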
Fixes: ee534378f005 ("net: dsa: fix panic when DSA master device unbinds on shutdown")
Suggested-by: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Signed-off-by: Peilin He <he.peilin(a)zte.com.cn>
Reviewed-by: xu xin <xu.xin16(a)zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2(a)zte.com.cn>
Cc: Fan Yu <fan.yu9(a)zte.com.cn>
Cc: Yutan Qiu <qiu.yutan(a)zte.com.cn>
Cc: Yaxin Wang <wang.yaxin(a)zte.com.cn>
Cc: tuqiang <tu.qiang35(a)zte.com.cn>
Cc: Yang Yang <yang.yang29(a)zte.com.cn>
Cc: ye xingchen <ye.xingchen(a)zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai(a)zte.com.cn>
---
net/dsa/dsa2.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 543834e31298..bf384b30ec0a 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1656,6 +1656,7 @@ EXPORT_SYMBOL_GPL(dsa_unregister_switch);
void dsa_switch_shutdown(struct dsa_switch *ds)
{
struct net_device *master, *slave_dev;
+ LIST_HEAD(close_list);
struct dsa_port *dp;
mutex_lock(&dsa2_mutex);
@@ -1665,6 +1666,11 @@ void dsa_switch_shutdown(struct dsa_switch *ds)
rtnl_lock();
+ dsa_switch_for_each_cpu_port(dp, ds)
+ list_add(&dp->master->close_list, &close_list);
+
+ dev_close_many(&close_list, true);
+
list_for_each_entry(dp, &ds->dst->ports, list) {
if (dp->ds != ds)
continue;
@@ -1675,6 +1681,7 @@ void dsa_switch_shutdown(struct dsa_switch *ds)
master = dp->cpu_dp->master;
slave_dev = dp->slave;
+ netif_device_detach(slave_dev);
netdev_upper_dev_unlink(master, slave_dev);
}
--
2.25.1
From: Jennifer Berringer <jberring(a)redhat.com>
When __nvmem_cell_entry_write() is called for an nvmem cell that does
not need bit shifting, it requires that the len parameter exactly
matches the nvmem cell size. However, when the nvmem cell has a nonzero
bit_offset, it was skipping this check.
Accepting values of len larger than the cell size results in
nvmem_cell_prepare_write_buffer() trying to write past the end of a heap
buffer that it allocates. Add a check to avoid that problem and instead
return -EINVAL when len doesn't match the number of bits expected by the
nvmem cell when bit_offset is nonzero.
This check uses cell->nbits in order to allow providing the smaller size
to cells that are shifted into another byte by bit_offset. For example,
a cell with nbits=8 and nonzero bit_offset would have bytes=2 but should
accept a 1-byte write here, although no current callers depend on this.
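To illustrate which lengths the new check accepts, here is a small standalone
sketch (hypothetical struct and helper names, with BITS_TO_BYTES redefined
locally) that mirrors the added condition for the nbits=8, bit_offset=4 example
above:
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#define BITS_TO_BYTES(nb)	(((nb) + 7) / 8)
/* Hypothetical stand-in for the relevant nvmem cell fields. */
struct cell_layout {
	unsigned int bit_offset;	/* first bit inside the first byte */
	unsigned int nbits;		/* logical width of the cell */
	unsigned int bytes;		/* bytes the cell spans in storage */
};
static bool write_len_ok(const struct cell_layout *c, size_t len)
{
	/* Mirrors the added check: accept the logical size or the
	 * storage size, reject anything else. */
	return len == BITS_TO_BYTES(c->nbits) || len == c->bytes;
}
int main(void)
{
	/* 8-bit cell shifted by 4 bits: it spans 2 storage bytes, but a
	 * 1-byte buffer is still a valid source for the shifted write. */
	struct cell_layout c = { .bit_offset = 4, .nbits = 8, .bytes = 2 };
	printf("len=1 -> %s\n", write_len_ok(&c, 1) ? "ok" : "-EINVAL");
	printf("len=2 -> %s\n", write_len_ok(&c, 2) ? "ok" : "-EINVAL");
	printf("len=3 -> %s\n", write_len_ok(&c, 3) ? "ok" : "-EINVAL");
	return 0;
}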
Fixes: 69aba7948cbe ("nvmem: Add a simple NVMEM framework for consumers")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jennifer Berringer <jberring(a)redhat.com>
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla(a)linaro.org>
---
drivers/nvmem/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c
index d6494dfc20a7..845540b57e68 100644
--- a/drivers/nvmem/core.c
+++ b/drivers/nvmem/core.c
@@ -1790,6 +1790,8 @@ static int __nvmem_cell_entry_write(struct nvmem_cell_entry *cell, void *buf, si
return -EINVAL;
if (cell->bit_offset || cell->nbits) {
+ if (len != BITS_TO_BYTES(cell->nbits) && len != cell->bytes)
+ return -EINVAL;
buf = nvmem_cell_prepare_write_buffer(cell, buf, len);
if (IS_ERR(buf))
return PTR_ERR(buf);
--
2.25.1
On 6.12 there is a kernel crash during the release of a btusb Mediatek
device.
list_del corruption, ffff8aae1f024000->next is LIST_POISON1 (dead000000000100)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:56!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 UID: 0 PID: 3770 Comm: qemu-system-x86 Tainted: G W 6.12.5-200.fc41.x86_64 #1
Tainted: [W]=WARN
Hardware name: ASUS System Product Name/PRIME X670E-PRO WIFI, BIOS 3035 09/05/2024
RIP: 0010:__list_del_entry_valid_or_report.cold+0x5c/0x6f
Call Trace:
<TASK>
hci_unregister_dev+0x46/0x1f0 [bluetooth]
btusb_disconnect+0x67/0x170 [btusb]
usb_unbind_interface+0x95/0x2d0
device_release_driver_internal+0x19c/0x200
proc_ioctl+0x1be/0x230
usbdev_ioctl+0x6bd/0x1430
__x64_sys_ioctl+0x91/0xd0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Note: Taint is due to the amdgpu warnings, totally unrelated to the
issue.
The bug has been fixed "silently" in upstream with the following series
of 4 commits [1]:
ad0c6f603bb0 ("Bluetooth: btusb: mediatek: move Bluetooth power off command position")
cea1805f165c ("Bluetooth: btusb: mediatek: add callback function in btusb_disconnect")
489304e67087 ("Bluetooth: btusb: mediatek: add intf release flow when usb disconnect")
defc33b5541e ("Bluetooth: btusb: mediatek: change the conditions for ISO interface")
These commits can be cleanly cherry-picked to 6.12.y, and I can confirm
that they fix the problem.
FWIW, the offending commit is ceac1cb0259d ("Bluetooth: btusb: mediatek:
add ISO data transmission functions") and it is present in 6.11.y and
6.12.y.
6.11.y is EOL, so please apply the patches to 6.12.y.
[1]: https://lore.kernel.org/linux-bluetooth/20240923084705.14123-1-chris.lu@med…
--
Thanks,
Fedor
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123040-crablike-vision-ad10@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 16:31:41 +0000
Subject: [PATCH] btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b87f19630b00..c4675f4345fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
}
if (bsi.block_len)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123039-broadside-tuesday-404a@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 16:31:41 +0000
Subject: [PATCH] btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b87f19630b00..c4675f4345fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
}
if (bsi.block_len)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123039-vivacious-clutch-03a5@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 16:31:41 +0000
Subject: [PATCH] btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b87f19630b00..c4675f4345fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
}
if (bsi.block_len)
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123038-headscarf-village-fcdb@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 16:31:41 +0000
Subject: [PATCH] btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b87f19630b00..c4675f4345fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
}
if (bsi.block_len)
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123037-theology-zookeeper-1b67@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 16:31:41 +0000
Subject: [PATCH] btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b87f19630b00..c4675f4345fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
}
if (bsi.block_len)
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123036-grout-exposure-3af8@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9a45022a0efadd99bcc58f7f1cc2b6fb3b808c40 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 16:31:41 +0000
Subject: [PATCH] btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b87f19630b00..c4675f4345fd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10073,6 +10073,11 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
+
+ if (fatal_signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
}
if (bsi.block_len)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 44f52bbe96dfdbe4aca3818a2534520082a07040
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123027-yeah-suitcase-41e7@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44f52bbe96dfdbe4aca3818a2534520082a07040 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 11 Dec 2024 16:08:07 +0000
Subject: [PATCH] btrfs: fix use-after-free when COWing tree bock and tracing
is enabled
When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@googl…
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 693dc27ffb89..185985a337b3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -654,6 +654,8 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
goto error_unlock_cow;
}
}
+
+ trace_btrfs_cow_block(root, buf, cow);
if (unlock_orig)
btrfs_tree_unlock(buf);
free_extent_buffer_stale(buf);
@@ -710,7 +712,6 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
{
struct btrfs_fs_info *fs_info = root->fs_info;
u64 search_start;
- int ret;
if (unlikely(test_bit(BTRFS_ROOT_DELETING, &root->state))) {
btrfs_abort_transaction(trans, -EUCLEAN);
@@ -751,12 +752,8 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
* Also We don't care about the error, as it's handled internally.
*/
btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
- ret = btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
- cow_ret, search_start, 0, nest);
-
- trace_btrfs_cow_block(root, buf, *cow_ret);
-
- return ret;
+ return btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
+ cow_ret, search_start, 0, nest);
}
ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 44f52bbe96dfdbe4aca3818a2534520082a07040
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123025-animal-itunes-c557@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44f52bbe96dfdbe4aca3818a2534520082a07040 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 11 Dec 2024 16:08:07 +0000
Subject: [PATCH] btrfs: fix use-after-free when COWing tree bock and tracing
is enabled
When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@googl…
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 693dc27ffb89..185985a337b3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -654,6 +654,8 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
goto error_unlock_cow;
}
}
+
+ trace_btrfs_cow_block(root, buf, cow);
if (unlock_orig)
btrfs_tree_unlock(buf);
free_extent_buffer_stale(buf);
@@ -710,7 +712,6 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
{
struct btrfs_fs_info *fs_info = root->fs_info;
u64 search_start;
- int ret;
if (unlikely(test_bit(BTRFS_ROOT_DELETING, &root->state))) {
btrfs_abort_transaction(trans, -EUCLEAN);
@@ -751,12 +752,8 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
* Also We don't care about the error, as it's handled internally.
*/
btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
- ret = btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
- cow_ret, search_start, 0, nest);
-
- trace_btrfs_cow_block(root, buf, *cow_ret);
-
- return ret;
+ return btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
+ cow_ret, search_start, 0, nest);
}
ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 44f52bbe96dfdbe4aca3818a2534520082a07040
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123023-hardy-ravioli-a818@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44f52bbe96dfdbe4aca3818a2534520082a07040 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 11 Dec 2024 16:08:07 +0000
Subject: [PATCH] btrfs: fix use-after-free when COWing tree bock and tracing
is enabled
When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@googl…
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 693dc27ffb89..185985a337b3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -654,6 +654,8 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
goto error_unlock_cow;
}
}
+
+ trace_btrfs_cow_block(root, buf, cow);
if (unlock_orig)
btrfs_tree_unlock(buf);
free_extent_buffer_stale(buf);
@@ -710,7 +712,6 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
{
struct btrfs_fs_info *fs_info = root->fs_info;
u64 search_start;
- int ret;
if (unlikely(test_bit(BTRFS_ROOT_DELETING, &root->state))) {
btrfs_abort_transaction(trans, -EUCLEAN);
@@ -751,12 +752,8 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
* Also We don't care about the error, as it's handled internally.
*/
btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
- ret = btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
- cow_ret, search_start, 0, nest);
-
- trace_btrfs_cow_block(root, buf, *cow_ret);
-
- return ret;
+ return btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
+ cow_ret, search_start, 0, nest);
}
ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 44f52bbe96dfdbe4aca3818a2534520082a07040
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123021-campus-undoing-7345@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44f52bbe96dfdbe4aca3818a2534520082a07040 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 11 Dec 2024 16:08:07 +0000
Subject: [PATCH] btrfs: fix use-after-free when COWing tree bock and tracing
is enabled
When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@googl…
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 693dc27ffb89..185985a337b3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -654,6 +654,8 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
goto error_unlock_cow;
}
}
+
+ trace_btrfs_cow_block(root, buf, cow);
if (unlock_orig)
btrfs_tree_unlock(buf);
free_extent_buffer_stale(buf);
@@ -710,7 +712,6 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
{
struct btrfs_fs_info *fs_info = root->fs_info;
u64 search_start;
- int ret;
if (unlikely(test_bit(BTRFS_ROOT_DELETING, &root->state))) {
btrfs_abort_transaction(trans, -EUCLEAN);
@@ -751,12 +752,8 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
* Also We don't care about the error, as it's handled internally.
*/
btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
- ret = btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
- cow_ret, search_start, 0, nest);
-
- trace_btrfs_cow_block(root, buf, *cow_ret);
-
- return ret;
+ return btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
+ cow_ret, search_start, 0, nest);
}
ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 44f52bbe96dfdbe4aca3818a2534520082a07040
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123018-overplant-backwash-53db@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44f52bbe96dfdbe4aca3818a2534520082a07040 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 11 Dec 2024 16:08:07 +0000
Subject: [PATCH] btrfs: fix use-after-free when COWing tree block and tracing
is enabled
When COWing a tree block, at btrfs_cow_block(), with the tracepoint
trace_btrfs_cow_block() enabled and preemption also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free on the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf, so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that on a kernel with preemption the current
task may be preempted before calling trace_btrfs_cow_block(), and the
extent buffer may already have been released by the time
trace_btrfs_cow_block() is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5(a)syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@googl…
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 693dc27ffb89..185985a337b3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -654,6 +654,8 @@ int btrfs_force_cow_block(struct btrfs_trans_handle *trans,
goto error_unlock_cow;
}
}
+
+ trace_btrfs_cow_block(root, buf, cow);
if (unlock_orig)
btrfs_tree_unlock(buf);
free_extent_buffer_stale(buf);
@@ -710,7 +712,6 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
{
struct btrfs_fs_info *fs_info = root->fs_info;
u64 search_start;
- int ret;
if (unlikely(test_bit(BTRFS_ROOT_DELETING, &root->state))) {
btrfs_abort_transaction(trans, -EUCLEAN);
@@ -751,12 +752,8 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans,
* Also We don't care about the error, as it's handled internally.
*/
btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
- ret = btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
- cow_ret, search_start, 0, nest);
-
- trace_btrfs_cow_block(root, buf, *cow_ret);
-
- return ret;
+ return btrfs_force_cow_block(trans, root, buf, parent, parent_slot,
+ cow_ret, search_start, 0, nest);
}
ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x fca432e73db2bec0fdbfbf6d98d3ebcd5388a977
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123006-unhappy-grudge-df18@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fca432e73db2bec0fdbfbf6d98d3ebcd5388a977 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 18 Dec 2024 17:00:56 +1030
Subject: [PATCH] btrfs: sysfs: fix direct super block member reads
The following sysfs entries read super block members directly, which can
have a different endianness than the CPU and therefore show wrong values:
- sys/fs/btrfs/<uuid>/nodesize
- sys/fs/btrfs/<uuid>/sectorsize
- sys/fs/btrfs/<uuid>/clone_alignment
Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so this won't trigger unaligned read errors,
just endianness problems.
Fix them by using the native cached members instead.
Fixes: df93589a1737 ("btrfs: export more from FS_INFO to sysfs")
CC: stable(a)vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index fdcbf650ac31..7f09b6c9cc2d 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1118,7 +1118,7 @@ static ssize_t btrfs_nodesize_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->nodesize);
+ return sysfs_emit(buf, "%u\n", fs_info->nodesize);
}
BTRFS_ATTR(, nodesize, btrfs_nodesize_show);
@@ -1128,7 +1128,7 @@ static ssize_t btrfs_sectorsize_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->sectorsize);
+ return sysfs_emit(buf, "%u\n", fs_info->sectorsize);
}
BTRFS_ATTR(, sectorsize, btrfs_sectorsize_show);
@@ -1180,7 +1180,7 @@ static ssize_t btrfs_clone_alignment_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->sectorsize);
+ return sysfs_emit(buf, "%u\n", fs_info->sectorsize);
}
BTRFS_ATTR(, clone_alignment, btrfs_clone_alignment_show);
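For illustration only, a minimal userspace C sketch of the endianness issue
fixed above. The struct names are hypothetical stand-ins for the on-disk
super block and the cached fs_info values; htole32()/le32toh() model the
conversion btrfs performs when caching the values at mount time.

#include <endian.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins: on-disk super block fields are fixed
 * little-endian, while the in-memory copy is kept in CPU endianness. */
struct on_disk_super {
	uint32_t nodesize;	/* little-endian on disk */
};

struct cached_fs_info {
	uint32_t nodesize;	/* native-endian copy made at mount time */
};

int main(void)
{
	struct on_disk_super sb = { .nodesize = htole32(16384) };
	struct cached_fs_info fs_info = { .nodesize = le32toh(sb.nodesize) };

	/* Printing the raw field is only correct on little-endian CPUs;
	 * on a big-endian host it shows a byte-swapped value. */
	printf("raw on-disk field: %u\n", sb.nodesize);

	/* The cached native value is correct on any endianness, which is
	 * why the sysfs handlers were switched to fs_info->nodesize. */
	printf("cached value:      %u\n", fs_info.nodesize);
	return 0;
}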
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x fca432e73db2bec0fdbfbf6d98d3ebcd5388a977
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123004-opt-riot-9e91@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fca432e73db2bec0fdbfbf6d98d3ebcd5388a977 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 18 Dec 2024 17:00:56 +1030
Subject: [PATCH] btrfs: sysfs: fix direct super block member reads
The following sysfs entries read super block members directly, which can
have a different endianness than the CPU and therefore show wrong values:
- sys/fs/btrfs/<uuid>/nodesize
- sys/fs/btrfs/<uuid>/sectorsize
- sys/fs/btrfs/<uuid>/clone_alignment
Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so this won't trigger unaligned read errors,
just endianness problems.
Fix them by using the native cached members instead.
Fixes: df93589a1737 ("btrfs: export more from FS_INFO to sysfs")
CC: stable(a)vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index fdcbf650ac31..7f09b6c9cc2d 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1118,7 +1118,7 @@ static ssize_t btrfs_nodesize_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->nodesize);
+ return sysfs_emit(buf, "%u\n", fs_info->nodesize);
}
BTRFS_ATTR(, nodesize, btrfs_nodesize_show);
@@ -1128,7 +1128,7 @@ static ssize_t btrfs_sectorsize_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->sectorsize);
+ return sysfs_emit(buf, "%u\n", fs_info->sectorsize);
}
BTRFS_ATTR(, sectorsize, btrfs_sectorsize_show);
@@ -1180,7 +1180,7 @@ static ssize_t btrfs_clone_alignment_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->sectorsize);
+ return sysfs_emit(buf, "%u\n", fs_info->sectorsize);
}
BTRFS_ATTR(, clone_alignment, btrfs_clone_alignment_show);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x fca432e73db2bec0fdbfbf6d98d3ebcd5388a977
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123002-nylon-baffle-a6ad@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fca432e73db2bec0fdbfbf6d98d3ebcd5388a977 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 18 Dec 2024 17:00:56 +1030
Subject: [PATCH] btrfs: sysfs: fix direct super block member reads
The following sysfs entries read super block members directly, which can
have a different endianness than the CPU and therefore show wrong values:
- sys/fs/btrfs/<uuid>/nodesize
- sys/fs/btrfs/<uuid>/sectorsize
- sys/fs/btrfs/<uuid>/clone_alignment
Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so this won't trigger unaligned read errors,
just endianness problems.
Fix them by using the native cached members instead.
Fixes: df93589a1737 ("btrfs: export more from FS_INFO to sysfs")
CC: stable(a)vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index fdcbf650ac31..7f09b6c9cc2d 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1118,7 +1118,7 @@ static ssize_t btrfs_nodesize_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->nodesize);
+ return sysfs_emit(buf, "%u\n", fs_info->nodesize);
}
BTRFS_ATTR(, nodesize, btrfs_nodesize_show);
@@ -1128,7 +1128,7 @@ static ssize_t btrfs_sectorsize_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->sectorsize);
+ return sysfs_emit(buf, "%u\n", fs_info->sectorsize);
}
BTRFS_ATTR(, sectorsize, btrfs_sectorsize_show);
@@ -1180,7 +1180,7 @@ static ssize_t btrfs_clone_alignment_show(struct kobject *kobj,
{
struct btrfs_fs_info *fs_info = to_fs_info(kobj);
- return sysfs_emit(buf, "%u\n", fs_info->super_copy->sectorsize);
+ return sysfs_emit(buf, "%u\n", fs_info->sectorsize);
}
BTRFS_ATTR(, clone_alignment, btrfs_clone_alignment_show);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f2363e6fcc7938c5f0f6ac066fad0dd247598b51
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123034-luxury-clumsily-dce2@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f2363e6fcc7938c5f0f6ac066fad0dd247598b51 Mon Sep 17 00:00:00 2001
From: Julian Sun <sunjunchao2870(a)gmail.com>
Date: Wed, 11 Dec 2024 19:13:15 +0800
Subject: [PATCH] btrfs: fix transaction atomicity bug when enabling simple
quotas
Set the squota incompat bit before committing the transaction that enables
the feature.
With the config CONFIG_BTRFS_ASSERT enabled, an assertion
failure occurs regarding the simple quota feature.
[5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365
[5.597098] ------------[ cut here ]------------
[5.597371] kernel BUG at fs/btrfs/qgroup.c:365!
[5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146
[5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0
[5.604303] <TASK>
[5.605230] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.605538] ? exc_invalid_op+0x56/0x70
[5.605775] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.606066] ? asm_exc_invalid_op+0x1f/0x30
[5.606441] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.606741] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.607038] ? try_to_wake_up+0x317/0x760
[5.607286] open_ctree+0xd9c/0x1710
[5.607509] btrfs_get_tree+0x58a/0x7e0
[5.608002] vfs_get_tree+0x2e/0x100
[5.608224] fc_mount+0x16/0x60
[5.608420] btrfs_get_tree+0x2f8/0x7e0
[5.608897] vfs_get_tree+0x2e/0x100
[5.609121] path_mount+0x4c8/0xbc0
[5.609538] __x64_sys_mount+0x10d/0x150
The issue can be easily reproduced using the following reproducer:
root@q:linux# cat repro.sh
set -e
mkfs.btrfs -q -f /dev/sdb
mount /dev/sdb /mnt/btrfs
btrfs quota enable -s /mnt/btrfs
umount /mnt/btrfs
mount /dev/sdb /mnt/btrfs
The issue is that when enabling quotas, at btrfs_quota_enable(), we set
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist
it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but
we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we
commit the transaction used to enable simple quotas.
This means that if after that transaction commit we unmount the filesystem
without starting and committing any other transaction, or we have a power
failure, the next time we mount the filesystem we will find the flag
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key
BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit
BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an
assertion failure at:
btrfs_read_qgroup_config() -> qgroup_read_enable_gen()
To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag
immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE.
This ensures that both flags are flushed to disk within the same
transaction.
Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas")
CC: stable(a)vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Julian Sun <sunjunchao2870(a)gmail.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a6f92836c9b1..f9b214992212 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1121,6 +1121,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
if (simple) {
fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE;
+ btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
btrfs_set_qgroup_status_enable_gen(leaf, ptr, trans->transid);
} else {
fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
@@ -1254,8 +1255,6 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
spin_lock(&fs_info->qgroup_lock);
fs_info->quota_root = quota_root;
set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
- if (simple)
- btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
spin_unlock(&fs_info->qgroup_lock);
/* Skip rescan for simple qgroups. */
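For illustration only, a minimal userspace C sketch of the atomicity problem
fixed above, under the assumption that a transaction commit persists
everything staged so far in one go. The struct and function names are
hypothetical, not the btrfs API.

#include <stdbool.h>
#include <stdio.h>

/* In-memory (staged) state and the state a "transaction commit" persists.
 * Both flags must be persisted by the same commit. */
struct state {
	bool simple_mode;	/* models BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE */
	bool incompat_bit;	/* models the SIMPLE_QUOTA incompat bit */
};

static void commit_transaction(struct state *disk, const struct state *mem)
{
	*disk = *mem;		/* everything staged so far hits disk together */
}

int main(void)
{
	struct state mem = { false, false };
	struct state disk = { false, false };

	/* Buggy order: the incompat bit is set only after the commit, so an
	 * unmount or power failure at this point leaves the on-disk state
	 * inconsistent (simple_mode persisted, incompat bit not). */
	mem.simple_mode = true;
	commit_transaction(&disk, &mem);
	mem.incompat_bit = true;
	printf("buggy: disk simple=%d incompat=%d\n", disk.simple_mode, disk.incompat_bit);

	/* Fixed order: both flags are staged before the same commit. */
	mem = (struct state){ false, false };
	disk = (struct state){ false, false };
	mem.simple_mode = true;
	mem.incompat_bit = true;
	commit_transaction(&disk, &mem);
	printf("fixed: disk simple=%d incompat=%d\n", disk.simple_mode, disk.incompat_bit);
	return 0;
}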
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 03018e5d8508254534511d40fb57bc150e6a87f2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123028-alibi-limpness-1931@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 03018e5d8508254534511d40fb57bc150e6a87f2 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 12:54:14 +0000
Subject: [PATCH] btrfs: fix swap file activation failure due to extents that
used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do an unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extents are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() so that, instead of using
extent maps and checking for shared extents with can_nocow_extent(), it
iterates over the inode's file extent items and uses the accurate
btrfs_is_data_extent_shared().
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2abc0aa5300..b87f19630b00 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_state *cached_state = NULL;
- struct extent_map *em = NULL;
struct btrfs_chunk_map *map = NULL;
struct btrfs_device *device = NULL;
struct btrfs_swap_info bsi = {
.lowest_ppage = (sector_t)-1ULL,
};
+ struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+ struct btrfs_path *path = NULL;
int ret = 0;
u64 isize;
- u64 start;
+ u64 prev_extent_end = 0;
/*
* Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
goto out_unlock_mmap;
}
+ path = btrfs_alloc_path();
+ backref_ctx = btrfs_alloc_backref_share_check_ctx();
+ if (!path || !backref_ctx) {
+ ret = -ENOMEM;
+ goto out_unlock_mmap;
+ }
+
/*
* Balance or device remove/replace/resize can move stuff around from
* under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
lock_extent(io_tree, 0, isize - 1, &cached_state);
- start = 0;
- while (start < isize) {
- u64 logical_block_start, physical_block_start;
+ while (prev_extent_end < isize) {
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *ei;
struct btrfs_block_group *bg;
- u64 len = isize - start;
+ u64 logical_block_start;
+ u64 physical_block_start;
+ u64 extent_gen;
+ u64 disk_bytenr;
+ u64 len;
- em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
- if (IS_ERR(em)) {
- ret = PTR_ERR(em);
+ key.objectid = btrfs_ino(BTRFS_I(inode));
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = prev_extent_end;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
goto out;
- }
- if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+ /*
+ * If key not found it means we have an implicit hole (NO_HOLES
+ * is enabled).
+ */
+ if (ret > 0) {
btrfs_warn(fs_info, "swapfile must not have holes");
ret = -EINVAL;
goto out;
}
- if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+ leaf = path->nodes[0];
+ ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
/*
* It's unlikely we'll ever actually find ourselves
* here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
ret = -EINVAL;
goto out;
}
- if (extent_map_is_compressed(em)) {
+
+ if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
btrfs_warn(fs_info, "swapfile must not be compressed");
ret = -EINVAL;
goto out;
}
- logical_block_start = extent_map_block_start(em) + (start - em->start);
- len = min(len, em->len - (start - em->start));
- free_extent_map(em);
- em = NULL;
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr == 0) {
+ btrfs_warn(fs_info, "swapfile must not have holes");
+ ret = -EINVAL;
+ goto out;
+ }
- ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+ logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+ extent_gen = btrfs_file_extent_generation(leaf, ei);
+ prev_extent_end = btrfs_file_extent_end(path);
+
+ if (prev_extent_end > isize)
+ len = isize - key.offset;
+ else
+ len = btrfs_file_extent_num_bytes(leaf, ei);
+
+ backref_ctx->curr_leaf_bytenr = leaf->start;
+
+ /*
+ * Don't need the path anymore, release to avoid deadlocks when
+ * calling btrfs_is_data_extent_shared() because when joining a
+ * transaction it can block waiting for the current one's commit
+ * which in turn may be trying to lock the same leaf to flush
+ * delayed items for example.
+ */
+ btrfs_release_path(path);
+
+ ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+ extent_gen, backref_ctx);
if (ret < 0) {
goto out;
- } else if (ret) {
- ret = 0;
- } else {
+ } else if (ret > 0) {
btrfs_warn(fs_info,
"swapfile must not be copy-on-write");
ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
physical_block_start = (map->stripes[0].physical +
(logical_block_start - map->start));
- len = min(len, map->chunk_len - (logical_block_start - map->start));
btrfs_free_chunk_map(map);
map = NULL;
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (ret)
goto out;
}
- bsi.start = start;
+ bsi.start = key.offset;
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
-
- start += len;
}
if (bsi.block_len)
ret = btrfs_add_swap_extent(sis, &bsi);
out:
- if (!IS_ERR_OR_NULL(em))
- free_extent_map(em);
if (!IS_ERR_OR_NULL(map))
btrfs_free_chunk_map(map);
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
out_unlock_mmap:
up_write(&BTRFS_I(inode)->i_mmap_lock);
+ btrfs_free_backref_share_ctx(backref_ctx);
+ btrfs_free_path(path);
if (ret)
return ret;
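For illustration only, a minimal userspace C sketch contrasting the fast,
false-positive-prone heuristic with an accurate shared-extent check, as
described in the commit message above. All names and sizes are hypothetical
stand-ins, not the real btrfs on-disk structures.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of one extent's metadata: the size of its item in
 * the extent tree leaf, and the true number of references it has. */
struct extent_info {
	unsigned int item_size;
	unsigned int nr_refs;
};

/* Stand-in sizes (not the real on-disk sizes). */
#define EXTENT_ITEM_SIZE	24
#define INLINE_REF_SIZE		9

/* Models the fast heuristic in check_committed_ref(): if the item is not
 * exactly "extent item + one inline ref", assume the extent is shared.
 * Cheap, but wrong when the only reference is stored out of line. */
static bool maybe_shared_fast(const struct extent_info *e)
{
	return e->item_size != EXTENT_ITEM_SIZE + INLINE_REF_SIZE;
}

/* Models the accurate check (btrfs_is_data_extent_shared() walks all
 * backrefs): shared means more than one real reference. */
static bool shared_accurate(const struct extent_info *e)
{
	return e->nr_refs > 1;
}

int main(void)
{
	/* One reference left, but it lives in a separate (non-inline)
	 * backref item, so the leaf item is just the bare extent item. */
	struct extent_info e = { .item_size = EXTENT_ITEM_SIZE, .nr_refs = 1 };

	printf("fast check:     shared=%d (false positive)\n", maybe_shared_fast(&e));
	printf("accurate check: shared=%d\n", shared_accurate(&e));
	return 0;
}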
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 03018e5d8508254534511d40fb57bc150e6a87f2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123021-swimmable-curled-c1b6@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 03018e5d8508254534511d40fb57bc150e6a87f2 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 12:54:14 +0000
Subject: [PATCH] btrfs: fix swap file activation failure due to extents that
used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do an unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extents are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() so that, instead of using
extent maps and checking for shared extents with can_nocow_extent(), it
iterates over the inode's file extent items and uses the accurate
btrfs_is_data_extent_shared().
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2abc0aa5300..b87f19630b00 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_state *cached_state = NULL;
- struct extent_map *em = NULL;
struct btrfs_chunk_map *map = NULL;
struct btrfs_device *device = NULL;
struct btrfs_swap_info bsi = {
.lowest_ppage = (sector_t)-1ULL,
};
+ struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+ struct btrfs_path *path = NULL;
int ret = 0;
u64 isize;
- u64 start;
+ u64 prev_extent_end = 0;
/*
* Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
goto out_unlock_mmap;
}
+ path = btrfs_alloc_path();
+ backref_ctx = btrfs_alloc_backref_share_check_ctx();
+ if (!path || !backref_ctx) {
+ ret = -ENOMEM;
+ goto out_unlock_mmap;
+ }
+
/*
* Balance or device remove/replace/resize can move stuff around from
* under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
lock_extent(io_tree, 0, isize - 1, &cached_state);
- start = 0;
- while (start < isize) {
- u64 logical_block_start, physical_block_start;
+ while (prev_extent_end < isize) {
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *ei;
struct btrfs_block_group *bg;
- u64 len = isize - start;
+ u64 logical_block_start;
+ u64 physical_block_start;
+ u64 extent_gen;
+ u64 disk_bytenr;
+ u64 len;
- em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
- if (IS_ERR(em)) {
- ret = PTR_ERR(em);
+ key.objectid = btrfs_ino(BTRFS_I(inode));
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = prev_extent_end;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
goto out;
- }
- if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+ /*
+ * If key not found it means we have an implicit hole (NO_HOLES
+ * is enabled).
+ */
+ if (ret > 0) {
btrfs_warn(fs_info, "swapfile must not have holes");
ret = -EINVAL;
goto out;
}
- if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+ leaf = path->nodes[0];
+ ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
/*
* It's unlikely we'll ever actually find ourselves
* here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
ret = -EINVAL;
goto out;
}
- if (extent_map_is_compressed(em)) {
+
+ if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
btrfs_warn(fs_info, "swapfile must not be compressed");
ret = -EINVAL;
goto out;
}
- logical_block_start = extent_map_block_start(em) + (start - em->start);
- len = min(len, em->len - (start - em->start));
- free_extent_map(em);
- em = NULL;
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr == 0) {
+ btrfs_warn(fs_info, "swapfile must not have holes");
+ ret = -EINVAL;
+ goto out;
+ }
- ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+ logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+ extent_gen = btrfs_file_extent_generation(leaf, ei);
+ prev_extent_end = btrfs_file_extent_end(path);
+
+ if (prev_extent_end > isize)
+ len = isize - key.offset;
+ else
+ len = btrfs_file_extent_num_bytes(leaf, ei);
+
+ backref_ctx->curr_leaf_bytenr = leaf->start;
+
+ /*
+ * Don't need the path anymore, release to avoid deadlocks when
+ * calling btrfs_is_data_extent_shared() because when joining a
+ * transaction it can block waiting for the current one's commit
+ * which in turn may be trying to lock the same leaf to flush
+ * delayed items for example.
+ */
+ btrfs_release_path(path);
+
+ ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+ extent_gen, backref_ctx);
if (ret < 0) {
goto out;
- } else if (ret) {
- ret = 0;
- } else {
+ } else if (ret > 0) {
btrfs_warn(fs_info,
"swapfile must not be copy-on-write");
ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
physical_block_start = (map->stripes[0].physical +
(logical_block_start - map->start));
- len = min(len, map->chunk_len - (logical_block_start - map->start));
btrfs_free_chunk_map(map);
map = NULL;
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (ret)
goto out;
}
- bsi.start = start;
+ bsi.start = key.offset;
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
-
- start += len;
}
if (bsi.block_len)
ret = btrfs_add_swap_extent(sis, &bsi);
out:
- if (!IS_ERR_OR_NULL(em))
- free_extent_map(em);
if (!IS_ERR_OR_NULL(map))
btrfs_free_chunk_map(map);
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
out_unlock_mmap:
up_write(&BTRFS_I(inode)->i_mmap_lock);
+ btrfs_free_backref_share_ctx(backref_ctx);
+ btrfs_free_path(path);
if (ret)
return ret;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 03018e5d8508254534511d40fb57bc150e6a87f2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123015-briar-craftily-a6d3@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 03018e5d8508254534511d40fb57bc150e6a87f2 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 12:54:14 +0000
Subject: [PATCH] btrfs: fix swap file activation failure due to extents that
used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do an unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extents are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() so that, instead of using
extent maps and checking for shared extents with can_nocow_extent(), it
iterates over the inode's file extent items and uses the accurate
btrfs_is_data_extent_shared().
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2abc0aa5300..b87f19630b00 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_state *cached_state = NULL;
- struct extent_map *em = NULL;
struct btrfs_chunk_map *map = NULL;
struct btrfs_device *device = NULL;
struct btrfs_swap_info bsi = {
.lowest_ppage = (sector_t)-1ULL,
};
+ struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+ struct btrfs_path *path = NULL;
int ret = 0;
u64 isize;
- u64 start;
+ u64 prev_extent_end = 0;
/*
* Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
goto out_unlock_mmap;
}
+ path = btrfs_alloc_path();
+ backref_ctx = btrfs_alloc_backref_share_check_ctx();
+ if (!path || !backref_ctx) {
+ ret = -ENOMEM;
+ goto out_unlock_mmap;
+ }
+
/*
* Balance or device remove/replace/resize can move stuff around from
* under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
lock_extent(io_tree, 0, isize - 1, &cached_state);
- start = 0;
- while (start < isize) {
- u64 logical_block_start, physical_block_start;
+ while (prev_extent_end < isize) {
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *ei;
struct btrfs_block_group *bg;
- u64 len = isize - start;
+ u64 logical_block_start;
+ u64 physical_block_start;
+ u64 extent_gen;
+ u64 disk_bytenr;
+ u64 len;
- em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
- if (IS_ERR(em)) {
- ret = PTR_ERR(em);
+ key.objectid = btrfs_ino(BTRFS_I(inode));
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = prev_extent_end;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
goto out;
- }
- if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+ /*
+ * If key not found it means we have an implicit hole (NO_HOLES
+ * is enabled).
+ */
+ if (ret > 0) {
btrfs_warn(fs_info, "swapfile must not have holes");
ret = -EINVAL;
goto out;
}
- if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+ leaf = path->nodes[0];
+ ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
/*
* It's unlikely we'll ever actually find ourselves
* here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
ret = -EINVAL;
goto out;
}
- if (extent_map_is_compressed(em)) {
+
+ if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
btrfs_warn(fs_info, "swapfile must not be compressed");
ret = -EINVAL;
goto out;
}
- logical_block_start = extent_map_block_start(em) + (start - em->start);
- len = min(len, em->len - (start - em->start));
- free_extent_map(em);
- em = NULL;
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr == 0) {
+ btrfs_warn(fs_info, "swapfile must not have holes");
+ ret = -EINVAL;
+ goto out;
+ }
- ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+ logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+ extent_gen = btrfs_file_extent_generation(leaf, ei);
+ prev_extent_end = btrfs_file_extent_end(path);
+
+ if (prev_extent_end > isize)
+ len = isize - key.offset;
+ else
+ len = btrfs_file_extent_num_bytes(leaf, ei);
+
+ backref_ctx->curr_leaf_bytenr = leaf->start;
+
+ /*
+ * Don't need the path anymore, release to avoid deadlocks when
+ * calling btrfs_is_data_extent_shared() because when joining a
+ * transaction it can block waiting for the current one's commit
+ * which in turn may be trying to lock the same leaf to flush
+ * delayed items for example.
+ */
+ btrfs_release_path(path);
+
+ ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+ extent_gen, backref_ctx);
if (ret < 0) {
goto out;
- } else if (ret) {
- ret = 0;
- } else {
+ } else if (ret > 0) {
btrfs_warn(fs_info,
"swapfile must not be copy-on-write");
ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
physical_block_start = (map->stripes[0].physical +
(logical_block_start - map->start));
- len = min(len, map->chunk_len - (logical_block_start - map->start));
btrfs_free_chunk_map(map);
map = NULL;
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (ret)
goto out;
}
- bsi.start = start;
+ bsi.start = key.offset;
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
-
- start += len;
}
if (bsi.block_len)
ret = btrfs_add_swap_extent(sis, &bsi);
out:
- if (!IS_ERR_OR_NULL(em))
- free_extent_map(em);
if (!IS_ERR_OR_NULL(map))
btrfs_free_chunk_map(map);
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
out_unlock_mmap:
up_write(&BTRFS_I(inode)->i_mmap_lock);
+ btrfs_free_backref_share_ctx(backref_ctx);
+ btrfs_free_path(path);
if (ret)
return ret;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 03018e5d8508254534511d40fb57bc150e6a87f2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123009-curve-huntsman-5a05@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 03018e5d8508254534511d40fb57bc150e6a87f2 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 12:54:14 +0000
Subject: [PATCH] btrfs: fix swap file activation failure due to extents that
used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do an unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extents are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() so that, instead of using
extent maps and checking for shared extents with can_nocow_extent(), it
iterates over the inode's file extent items and uses the accurate
btrfs_is_data_extent_shared().
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2abc0aa5300..b87f19630b00 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_state *cached_state = NULL;
- struct extent_map *em = NULL;
struct btrfs_chunk_map *map = NULL;
struct btrfs_device *device = NULL;
struct btrfs_swap_info bsi = {
.lowest_ppage = (sector_t)-1ULL,
};
+ struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+ struct btrfs_path *path = NULL;
int ret = 0;
u64 isize;
- u64 start;
+ u64 prev_extent_end = 0;
/*
* Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
goto out_unlock_mmap;
}
+ path = btrfs_alloc_path();
+ backref_ctx = btrfs_alloc_backref_share_check_ctx();
+ if (!path || !backref_ctx) {
+ ret = -ENOMEM;
+ goto out_unlock_mmap;
+ }
+
/*
* Balance or device remove/replace/resize can move stuff around from
* under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
lock_extent(io_tree, 0, isize - 1, &cached_state);
- start = 0;
- while (start < isize) {
- u64 logical_block_start, physical_block_start;
+ while (prev_extent_end < isize) {
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *ei;
struct btrfs_block_group *bg;
- u64 len = isize - start;
+ u64 logical_block_start;
+ u64 physical_block_start;
+ u64 extent_gen;
+ u64 disk_bytenr;
+ u64 len;
- em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
- if (IS_ERR(em)) {
- ret = PTR_ERR(em);
+ key.objectid = btrfs_ino(BTRFS_I(inode));
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = prev_extent_end;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
goto out;
- }
- if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+ /*
+ * If key not found it means we have an implicit hole (NO_HOLES
+ * is enabled).
+ */
+ if (ret > 0) {
btrfs_warn(fs_info, "swapfile must not have holes");
ret = -EINVAL;
goto out;
}
- if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+ leaf = path->nodes[0];
+ ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
/*
* It's unlikely we'll ever actually find ourselves
* here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
ret = -EINVAL;
goto out;
}
- if (extent_map_is_compressed(em)) {
+
+ if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
btrfs_warn(fs_info, "swapfile must not be compressed");
ret = -EINVAL;
goto out;
}
- logical_block_start = extent_map_block_start(em) + (start - em->start);
- len = min(len, em->len - (start - em->start));
- free_extent_map(em);
- em = NULL;
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr == 0) {
+ btrfs_warn(fs_info, "swapfile must not have holes");
+ ret = -EINVAL;
+ goto out;
+ }
- ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+ logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+ extent_gen = btrfs_file_extent_generation(leaf, ei);
+ prev_extent_end = btrfs_file_extent_end(path);
+
+ if (prev_extent_end > isize)
+ len = isize - key.offset;
+ else
+ len = btrfs_file_extent_num_bytes(leaf, ei);
+
+ backref_ctx->curr_leaf_bytenr = leaf->start;
+
+ /*
+ * Don't need the path anymore, release to avoid deadlocks when
+ * calling btrfs_is_data_extent_shared() because when joining a
+ * transaction it can block waiting for the current one's commit
+ * which in turn may be trying to lock the same leaf to flush
+ * delayed items for example.
+ */
+ btrfs_release_path(path);
+
+ ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+ extent_gen, backref_ctx);
if (ret < 0) {
goto out;
- } else if (ret) {
- ret = 0;
- } else {
+ } else if (ret > 0) {
btrfs_warn(fs_info,
"swapfile must not be copy-on-write");
ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
physical_block_start = (map->stripes[0].physical +
(logical_block_start - map->start));
- len = min(len, map->chunk_len - (logical_block_start - map->start));
btrfs_free_chunk_map(map);
map = NULL;
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (ret)
goto out;
}
- bsi.start = start;
+ bsi.start = key.offset;
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
-
- start += len;
}
if (bsi.block_len)
ret = btrfs_add_swap_extent(sis, &bsi);
out:
- if (!IS_ERR_OR_NULL(em))
- free_extent_map(em);
if (!IS_ERR_OR_NULL(map))
btrfs_free_chunk_map(map);
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
out_unlock_mmap:
up_write(&BTRFS_I(inode)->i_mmap_lock);
+ btrfs_free_backref_share_ctx(backref_ctx);
+ btrfs_free_path(path);
if (ret)
return ret;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 03018e5d8508254534511d40fb57bc150e6a87f2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123002-remedial-drainpipe-efb2@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 03018e5d8508254534511d40fb57bc150e6a87f2 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 9 Dec 2024 12:54:14 +0000
Subject: [PATCH] btrfs: fix swap file activation failure due to extents that
used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do an unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared. That is not
necessarily the case, however: we can have a single reference that is
simply not inlined. The reason we do this is to be fast and avoid
inspecting non-inline references, which may be located in another leaf
of the extent tree, slowing down
write paths.
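For illustration only, the heuristic can be sketched as a tiny userspace C
program; the stub sizes and the helper name are invented and do not match the
real on-disk format, they merely show how "item larger than one inline
reference" gets reported as shared even after the extra references are gone:
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
/* Stand-in sizes for the real on-disk structures (illustrative only). */
#define EXTENT_ITEM_SIZE	24	/* struct btrfs_extent_item stand-in */
#define INLINE_DATA_REF_SIZE	29	/* one inline data ref stand-in */
#define OWNER_REF_SIZE		12	/* simple-quotas owner ref stand-in */
/*
 * Fast-path heuristic: if the extent item is bigger than "header plus a
 * single inline reference", assume the extent is shared. Cheap, because it
 * never walks non-inline references in other leaves, but it reports a false
 * positive when the only remaining reference is simply not inlined.
 */
static bool extent_maybe_shared(size_t item_size, bool simple_quotas)
{
	size_t expected = EXTENT_ITEM_SIZE + INLINE_DATA_REF_SIZE;
	if (simple_quotas)
		expected += OWNER_REF_SIZE;
	return item_size != expected;
}
int main(void)
{
	/* A single reference that lives outside the item: no inline ref. */
	printf("maybe shared: %d\n", extent_maybe_shared(EXTENT_ITEM_SIZE, false));
	return 0;
}
This prints 1 for an item whose only reference is non-inline, which is exactly
the false positive that makes swapon fail in the runs below.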
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extents are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() so that, instead of using
extent maps and checking for shared extents with can_nocow_extent(), it
iterates over the inode's file extent items and uses the accurate
btrfs_is_data_extent_shared().
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2abc0aa5300..b87f19630b00 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9799,15 +9799,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_state *cached_state = NULL;
- struct extent_map *em = NULL;
struct btrfs_chunk_map *map = NULL;
struct btrfs_device *device = NULL;
struct btrfs_swap_info bsi = {
.lowest_ppage = (sector_t)-1ULL,
};
+ struct btrfs_backref_share_check_ctx *backref_ctx = NULL;
+ struct btrfs_path *path = NULL;
int ret = 0;
u64 isize;
- u64 start;
+ u64 prev_extent_end = 0;
/*
* Acquire the inode's mmap lock to prevent races with memory mapped
@@ -9846,6 +9847,13 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
goto out_unlock_mmap;
}
+ path = btrfs_alloc_path();
+ backref_ctx = btrfs_alloc_backref_share_check_ctx();
+ if (!path || !backref_ctx) {
+ ret = -ENOMEM;
+ goto out_unlock_mmap;
+ }
+
/*
* Balance or device remove/replace/resize can move stuff around from
* under us. The exclop protection makes sure they aren't running/won't
@@ -9904,24 +9912,39 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
lock_extent(io_tree, 0, isize - 1, &cached_state);
- start = 0;
- while (start < isize) {
- u64 logical_block_start, physical_block_start;
+ while (prev_extent_end < isize) {
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *ei;
struct btrfs_block_group *bg;
- u64 len = isize - start;
+ u64 logical_block_start;
+ u64 physical_block_start;
+ u64 extent_gen;
+ u64 disk_bytenr;
+ u64 len;
- em = btrfs_get_extent(BTRFS_I(inode), NULL, start, len);
- if (IS_ERR(em)) {
- ret = PTR_ERR(em);
+ key.objectid = btrfs_ino(BTRFS_I(inode));
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = prev_extent_end;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
goto out;
- }
- if (em->disk_bytenr == EXTENT_MAP_HOLE) {
+ /*
+ * If key not found it means we have an implicit hole (NO_HOLES
+ * is enabled).
+ */
+ if (ret > 0) {
btrfs_warn(fs_info, "swapfile must not have holes");
ret = -EINVAL;
goto out;
}
- if (em->disk_bytenr == EXTENT_MAP_INLINE) {
+
+ leaf = path->nodes[0];
+ ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
/*
* It's unlikely we'll ever actually find ourselves
* here, as a file small enough to fit inline won't be
@@ -9933,23 +9956,45 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
ret = -EINVAL;
goto out;
}
- if (extent_map_is_compressed(em)) {
+
+ if (btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
btrfs_warn(fs_info, "swapfile must not be compressed");
ret = -EINVAL;
goto out;
}
- logical_block_start = extent_map_block_start(em) + (start - em->start);
- len = min(len, em->len - (start - em->start));
- free_extent_map(em);
- em = NULL;
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr == 0) {
+ btrfs_warn(fs_info, "swapfile must not have holes");
+ ret = -EINVAL;
+ goto out;
+ }
- ret = can_nocow_extent(inode, start, &len, NULL, false, true);
+ logical_block_start = disk_bytenr + btrfs_file_extent_offset(leaf, ei);
+ extent_gen = btrfs_file_extent_generation(leaf, ei);
+ prev_extent_end = btrfs_file_extent_end(path);
+
+ if (prev_extent_end > isize)
+ len = isize - key.offset;
+ else
+ len = btrfs_file_extent_num_bytes(leaf, ei);
+
+ backref_ctx->curr_leaf_bytenr = leaf->start;
+
+ /*
+ * Don't need the path anymore, release to avoid deadlocks when
+ * calling btrfs_is_data_extent_shared() because when joining a
+ * transaction it can block waiting for the current one's commit
+ * which in turn may be trying to lock the same leaf to flush
+ * delayed items for example.
+ */
+ btrfs_release_path(path);
+
+ ret = btrfs_is_data_extent_shared(BTRFS_I(inode), disk_bytenr,
+ extent_gen, backref_ctx);
if (ret < 0) {
goto out;
- } else if (ret) {
- ret = 0;
- } else {
+ } else if (ret > 0) {
btrfs_warn(fs_info,
"swapfile must not be copy-on-write");
ret = -EINVAL;
@@ -9984,7 +10029,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
physical_block_start = (map->stripes[0].physical +
(logical_block_start - map->start));
- len = min(len, map->chunk_len - (logical_block_start - map->start));
btrfs_free_chunk_map(map);
map = NULL;
@@ -10025,20 +10069,16 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (ret)
goto out;
}
- bsi.start = start;
+ bsi.start = key.offset;
bsi.block_start = physical_block_start;
bsi.block_len = len;
}
-
- start += len;
}
if (bsi.block_len)
ret = btrfs_add_swap_extent(sis, &bsi);
out:
- if (!IS_ERR_OR_NULL(em))
- free_extent_map(em);
if (!IS_ERR_OR_NULL(map))
btrfs_free_chunk_map(map);
@@ -10053,6 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
out_unlock_mmap:
up_write(&BTRFS_I(inode)->i_mmap_lock);
+ btrfs_free_backref_share_ctx(backref_ctx);
+ btrfs_free_path(path);
if (ret)
return ret;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x dc81e556f2a017d681251ace21bf06c126d5a192
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123023-arise-undercook-60f9@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dc81e556f2a017d681251ace21bf06c126d5a192 Mon Sep 17 00:00:00 2001
From: "Xin Li (Intel)" <xin(a)zytor.com>
Date: Wed, 13 Nov 2024 09:59:34 -0800
Subject: [PATCH] x86/fred: Clear WFE in missing-ENDBRANCH #CPs
An indirect branch instruction sets the CPU indirect branch tracker
(IBT) into WAIT_FOR_ENDBRANCH (WFE) state and WFE stays asserted
across the instruction boundary. When the decoder finds an
inappropriate instruction (anything other than ENDBR) while WFE is set,
the CPU raises a #CP fault.
For the "kernel IBT no ENDBR" selftest where #CPs are deliberately
triggered, the WFE state of the interrupted context needs to be
cleared to let execution continue. Otherwise when the CPU resumes
from the instruction that just caused the previous #CP, another
missing-ENDBRANCH #CP is raised and the CPU enters a dead loop.
This is not a problem with IDT because it doesn't preserve WFE and
IRET doesn't set WFE. But FRED provides space on the entry stack
(in an expanded CS area) to save and restore the WFE state, thus the
WFE state is no longer clobbered, so software must clear it.
Clear WFE, via ibt_clear_fred_wfe(), in the selftest and !ibt_fatal code
paths where execution is allowed to continue, to avoid dead looping.
Clobbering WFE in any other circumstance is a security-relevant bug.
[ dhansen: changelog rewording ]
Fixes: a5f6c2ace997 ("x86/shstk: Add user control-protection fault handler")
Signed-off-by: Xin Li (Intel) <xin(a)zytor.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20241113175934.3897541-1-xin%40zytor.com
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index d2c732a34e5d..303bf74d175b 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -81,6 +81,34 @@ static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
static __ro_after_init bool ibt_fatal = true;
+/*
+ * By definition, all missing-ENDBRANCH #CPs are a result of WFE && !ENDBR.
+ *
+ * For the kernel IBT no ENDBR selftest where #CPs are deliberately triggered,
+ * the WFE state of the interrupted context needs to be cleared to let execution
+ * continue. Otherwise when the CPU resumes from the instruction that just
+ * caused the previous #CP, another missing-ENDBRANCH #CP is raised and the CPU
+ * enters a dead loop.
+ *
+ * This is not a problem with IDT because it doesn't preserve WFE and IRET doesn't
+ * set WFE. But FRED provides space on the entry stack (in an expanded CS area)
+ * to save and restore the WFE state, thus the WFE state is no longer clobbered,
+ * so software must clear it.
+ */
+static void ibt_clear_fred_wfe(struct pt_regs *regs)
+{
+ /*
+ * No need to do any FRED checks.
+ *
+ * For IDT event delivery, the high-order 48 bits of CS are pushed
+ * as 0s into the stack, and later IRET ignores these bits.
+ *
+ * For FRED, a test to check if fred_cs.wfe is set would be dropped
+ * by compilers.
+ */
+ regs->fred_cs.wfe = 0;
+}
+
static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
{
if ((error_code & CP_EC) != CP_ENDBR) {
@@ -90,6 +118,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (unlikely(regs->ip == (unsigned long)&ibt_selftest_noendbr)) {
regs->ax = 0;
+ ibt_clear_fred_wfe(regs);
return;
}
@@ -97,6 +126,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (!ibt_fatal) {
printk(KERN_DEFAULT CUT_HERE);
__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+ ibt_clear_fred_wfe(regs);
return;
}
BUG();
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 0525064bb82e50d59543b62b9d41a606198a4a44
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123011-designed-giving-6a82@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0525064bb82e50d59543b62b9d41a606198a4a44 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 29 Nov 2024 12:25:30 +0000
Subject: [PATCH] btrfs: fix race with memory mapped writes when activating
swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6baa0269a85b..b2abc0aa5300 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
u64 isize;
u64 start;
+ /*
+ * Acquire the inode's mmap lock to prevent races with memory mapped
+ * writes, as they could happen after we flush delalloc below and before
+ * we lock the extent range further below. The inode was already locked
+ * up in the call chain.
+ */
+ btrfs_assert_inode_locked(BTRFS_I(inode));
+ down_write(&BTRFS_I(inode)->i_mmap_lock);
+
/*
* If the swap file was just created, make sure delalloc is done. If the
* file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
*/
ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
if (ret)
- return ret;
+ goto out_unlock_mmap;
/*
* The inode is locked, so these flags won't change after we check them.
*/
if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
btrfs_warn(fs_info, "swapfile must not be compressed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
btrfs_warn(fs_info, "swapfile must not be copy-on-write");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
btrfs_warn(fs_info, "swapfile must not be checksummed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
btrfs_warn(fs_info,
"cannot activate swapfile while exclusive operation is running");
- return -EBUSY;
+ ret = -EBUSY;
+ goto out_unlock_mmap;
}
/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
btrfs_warn(fs_info,
"cannot activate swapfile because snapshot creation is in progress");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
* Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_warn(fs_info,
"cannot activate swapfile because subvolume %llu is being deleted",
btrfs_root_id(root));
- return -EPERM;
+ ret = -EPERM;
+ goto out_unlock_mmap;
}
atomic_inc(&root->nr_swapfiles);
spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
+out_unlock_mmap:
+ up_write(&BTRFS_I(inode)->i_mmap_lock);
if (ret)
return ret;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 0525064bb82e50d59543b62b9d41a606198a4a44
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123012-thyself-commodity-0069@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0525064bb82e50d59543b62b9d41a606198a4a44 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 29 Nov 2024 12:25:30 +0000
Subject: [PATCH] btrfs: fix race with memory mapped writes when activating
swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6baa0269a85b..b2abc0aa5300 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
u64 isize;
u64 start;
+ /*
+ * Acquire the inode's mmap lock to prevent races with memory mapped
+ * writes, as they could happen after we flush delalloc below and before
+ * we lock the extent range further below. The inode was already locked
+ * up in the call chain.
+ */
+ btrfs_assert_inode_locked(BTRFS_I(inode));
+ down_write(&BTRFS_I(inode)->i_mmap_lock);
+
/*
* If the swap file was just created, make sure delalloc is done. If the
* file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
*/
ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
if (ret)
- return ret;
+ goto out_unlock_mmap;
/*
* The inode is locked, so these flags won't change after we check them.
*/
if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
btrfs_warn(fs_info, "swapfile must not be compressed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
btrfs_warn(fs_info, "swapfile must not be copy-on-write");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
btrfs_warn(fs_info, "swapfile must not be checksummed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
btrfs_warn(fs_info,
"cannot activate swapfile while exclusive operation is running");
- return -EBUSY;
+ ret = -EBUSY;
+ goto out_unlock_mmap;
}
/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
btrfs_warn(fs_info,
"cannot activate swapfile because snapshot creation is in progress");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
* Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_warn(fs_info,
"cannot activate swapfile because subvolume %llu is being deleted",
btrfs_root_id(root));
- return -EPERM;
+ ret = -EPERM;
+ goto out_unlock_mmap;
}
atomic_inc(&root->nr_swapfiles);
spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
+out_unlock_mmap:
+ up_write(&BTRFS_I(inode)->i_mmap_lock);
if (ret)
return ret;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 0525064bb82e50d59543b62b9d41a606198a4a44
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123011-creature-slideshow-d673@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0525064bb82e50d59543b62b9d41a606198a4a44 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 29 Nov 2024 12:25:30 +0000
Subject: [PATCH] btrfs: fix race with memory mapped writes when activating
swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6baa0269a85b..b2abc0aa5300 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
u64 isize;
u64 start;
+ /*
+ * Acquire the inode's mmap lock to prevent races with memory mapped
+ * writes, as they could happen after we flush delalloc below and before
+ * we lock the extent range further below. The inode was already locked
+ * up in the call chain.
+ */
+ btrfs_assert_inode_locked(BTRFS_I(inode));
+ down_write(&BTRFS_I(inode)->i_mmap_lock);
+
/*
* If the swap file was just created, make sure delalloc is done. If the
* file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
*/
ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
if (ret)
- return ret;
+ goto out_unlock_mmap;
/*
* The inode is locked, so these flags won't change after we check them.
*/
if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
btrfs_warn(fs_info, "swapfile must not be compressed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
btrfs_warn(fs_info, "swapfile must not be copy-on-write");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
btrfs_warn(fs_info, "swapfile must not be checksummed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
btrfs_warn(fs_info,
"cannot activate swapfile while exclusive operation is running");
- return -EBUSY;
+ ret = -EBUSY;
+ goto out_unlock_mmap;
}
/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
btrfs_warn(fs_info,
"cannot activate swapfile because snapshot creation is in progress");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
* Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_warn(fs_info,
"cannot activate swapfile because subvolume %llu is being deleted",
btrfs_root_id(root));
- return -EPERM;
+ ret = -EPERM;
+ goto out_unlock_mmap;
}
atomic_inc(&root->nr_swapfiles);
spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
+out_unlock_mmap:
+ up_write(&BTRFS_I(inode)->i_mmap_lock);
if (ret)
return ret;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 0525064bb82e50d59543b62b9d41a606198a4a44
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123010-voltage-disobey-16ec@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0525064bb82e50d59543b62b9d41a606198a4a44 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 29 Nov 2024 12:25:30 +0000
Subject: [PATCH] btrfs: fix race with memory mapped writes when activating
swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6baa0269a85b..b2abc0aa5300 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
u64 isize;
u64 start;
+ /*
+ * Acquire the inode's mmap lock to prevent races with memory mapped
+ * writes, as they could happen after we flush delalloc below and before
+ * we lock the extent range further below. The inode was already locked
+ * up in the call chain.
+ */
+ btrfs_assert_inode_locked(BTRFS_I(inode));
+ down_write(&BTRFS_I(inode)->i_mmap_lock);
+
/*
* If the swap file was just created, make sure delalloc is done. If the
* file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
*/
ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
if (ret)
- return ret;
+ goto out_unlock_mmap;
/*
* The inode is locked, so these flags won't change after we check them.
*/
if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
btrfs_warn(fs_info, "swapfile must not be compressed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
btrfs_warn(fs_info, "swapfile must not be copy-on-write");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
btrfs_warn(fs_info, "swapfile must not be checksummed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
btrfs_warn(fs_info,
"cannot activate swapfile while exclusive operation is running");
- return -EBUSY;
+ ret = -EBUSY;
+ goto out_unlock_mmap;
}
/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
btrfs_warn(fs_info,
"cannot activate swapfile because snapshot creation is in progress");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
* Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_warn(fs_info,
"cannot activate swapfile because subvolume %llu is being deleted",
btrfs_root_id(root));
- return -EPERM;
+ ret = -EPERM;
+ goto out_unlock_mmap;
}
atomic_inc(&root->nr_swapfiles);
spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
+out_unlock_mmap:
+ up_write(&BTRFS_I(inode)->i_mmap_lock);
if (ret)
return ret;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 0525064bb82e50d59543b62b9d41a606198a4a44
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123009-carbon-thrower-2611@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0525064bb82e50d59543b62b9d41a606198a4a44 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 29 Nov 2024 12:25:30 +0000
Subject: [PATCH] btrfs: fix race with memory mapped writes when activating
swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6baa0269a85b..b2abc0aa5300 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9809,6 +9809,15 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
u64 isize;
u64 start;
+ /*
+ * Acquire the inode's mmap lock to prevent races with memory mapped
+ * writes, as they could happen after we flush delalloc below and before
+ * we lock the extent range further below. The inode was already locked
+ * up in the call chain.
+ */
+ btrfs_assert_inode_locked(BTRFS_I(inode));
+ down_write(&BTRFS_I(inode)->i_mmap_lock);
+
/*
* If the swap file was just created, make sure delalloc is done. If the
* file changes again after this, the user is doing something stupid and
@@ -9816,22 +9825,25 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
*/
ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
if (ret)
- return ret;
+ goto out_unlock_mmap;
/*
* The inode is locked, so these flags won't change after we check them.
*/
if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
btrfs_warn(fs_info, "swapfile must not be compressed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
btrfs_warn(fs_info, "swapfile must not be copy-on-write");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
btrfs_warn(fs_info, "swapfile must not be checksummed");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
@@ -9846,7 +9858,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE)) {
btrfs_warn(fs_info,
"cannot activate swapfile while exclusive operation is running");
- return -EBUSY;
+ ret = -EBUSY;
+ goto out_unlock_mmap;
}
/*
@@ -9860,7 +9873,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
btrfs_warn(fs_info,
"cannot activate swapfile because snapshot creation is in progress");
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_unlock_mmap;
}
/*
* Snapshots can create extents which require COW even if NODATACOW is
@@ -9881,7 +9895,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_warn(fs_info,
"cannot activate swapfile because subvolume %llu is being deleted",
btrfs_root_id(root));
- return -EPERM;
+ ret = -EPERM;
+ goto out_unlock_mmap;
}
atomic_inc(&root->nr_swapfiles);
spin_unlock(&root->root_item_lock);
@@ -10036,6 +10051,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
btrfs_exclop_finish(fs_info);
+out_unlock_mmap:
+ up_write(&BTRFS_I(inode)->i_mmap_lock);
if (ret)
return ret;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123053-acting-chase-d961@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0 Mon Sep 17 00:00:00 2001
From: Lizhi Xu <lizhi.xu(a)windriver.com>
Date: Mon, 16 Dec 2024 15:32:38 +0800
Subject: [PATCH] tracing: Prevent bad count for tracing_cpumask_write
If a large count is provided, it will trigger a warning in bitmap_parse_user.
Also check for a zero count.
Cc: stable(a)vger.kernel.org
Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions")
Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com
Reported-by: syzbot+0aecfd34fb878546f3fd(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd
Tested-by: syzbot+0aecfd34fb878546f3fd(a)syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu(a)windriver.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 957f941a08e7..f8aebcb01e62 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5087,6 +5087,9 @@ tracing_cpumask_write(struct file *filp, const char __user *ubuf,
cpumask_var_t tracing_cpumask_new;
int err;
+ if (count == 0 || count > KMALLOC_MAX_SIZE)
+ return -EINVAL;
+
if (!zalloc_cpumask_var(&tracing_cpumask_new, GFP_KERNEL))
return -ENOMEM;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024123047-unfounded-coil-47c5@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0 Mon Sep 17 00:00:00 2001
From: Lizhi Xu <lizhi.xu(a)windriver.com>
Date: Mon, 16 Dec 2024 15:32:38 +0800
Subject: [PATCH] tracing: Prevent bad count for tracing_cpumask_write
If a large count is provided, it will trigger a warning in bitmap_parse_user.
Also check for a zero count.
Cc: stable(a)vger.kernel.org
Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions")
Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com
Reported-by: syzbot+0aecfd34fb878546f3fd(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd
Tested-by: syzbot+0aecfd34fb878546f3fd(a)syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu(a)windriver.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 957f941a08e7..f8aebcb01e62 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5087,6 +5087,9 @@ tracing_cpumask_write(struct file *filp, const char __user *ubuf,
cpumask_var_t tracing_cpumask_new;
int err;
+ if (count == 0 || count > KMALLOC_MAX_SIZE)
+ return -EINVAL;
+
if (!zalloc_cpumask_var(&tracing_cpumask_new, GFP_KERNEL))
return -ENOMEM;
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 4b65d5322e1d8994acfdb9b867aa00bdb30d177b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024122721-badge-research-e542@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b65d5322e1d8994acfdb9b867aa00bdb30d177b Mon Sep 17 00:00:00 2001
From: Binbin Zhou <zhoubinbin(a)loongson.cn>
Date: Mon, 28 Oct 2024 17:34:13 +0800
Subject: [PATCH] dmaengine: loongson2-apb: Change GENMASK to GENMASK_ULL
Fix the following smatch static checker warning:
drivers/dma/loongson2-apb-dma.c:189 ls2x_dma_write_cmd()
warn: was expecting a 64 bit value instead of '~(((0)) + (((~((0))) - (((1)) << (0)) + 1) & (~((0)) >> ((8 * 4) - 1 - (4)))))'
The GENMASK macro used "unsigned long", which caused build issues when
using a 32-bit toolchain because it would try to access bits > 31. This
patch switches GENMASK to GENMASK_ULL, which uses "unsigned long long".
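As a hedged illustration (simplified macro definitions, not the kernel's exact
ones), the following standalone C program shows how the unsigned long variant
silently drops the upper 32 bits of a 64-bit value when built with a 32-bit
toolchain (e.g. gcc -m32); on an LP64 build both lines print the same value:
#include <stdio.h>
/* Simplified stand-ins for GENMASK()/GENMASK_ULL() (illustrative only). */
#define BITS_PER_LONG_DEMO	(8 * (int)sizeof(unsigned long))
#define GENMASK_DEMO(h, l) \
	(((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG_DEMO - 1 - (h))))
#define GENMASK_ULL_DEMO(h, l) \
	(((~0ULL) - (1ULL << (l)) + 1) & (~0ULL >> (64 - 1 - (h))))
int main(void)
{
	unsigned long long reg = 0x1122334455667788ULL;	/* pretend 64-bit register value */
	/*
	 * With a 32-bit unsigned long, ~GENMASK_DEMO(4, 0) is only 32 bits
	 * wide, so the & below clears bits 32-63 of reg as a side effect.
	 */
	printf("unsigned long mask:      %#llx\n", reg & ~GENMASK_DEMO(4, 0));
	printf("unsigned long long mask: %#llx\n", reg & ~GENMASK_ULL_DEMO(4, 0));
	return 0;
}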
Fixes: 71e7d3cb6e55 ("dmaengine: ls2x-apb: New driver for the Loongson LS2X APB DMA controller")
Reported-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Closes: https://lore.kernel.org/all/87cdc025-7246-4548-85ca-3d36fdc2be2d@stanley.mo…
Signed-off-by: Binbin Zhou <zhoubinbin(a)loongson.cn>
Link: https://lore.kernel.org/r/20241028093413.1145820-1-zhoubinbin@loongson.cn
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
diff --git a/drivers/dma/loongson2-apb-dma.c b/drivers/dma/loongson2-apb-dma.c
index 367ed34ce4da..c528f02b9f84 100644
--- a/drivers/dma/loongson2-apb-dma.c
+++ b/drivers/dma/loongson2-apb-dma.c
@@ -31,7 +31,7 @@
#define LDMA_ASK_VALID BIT(2)
#define LDMA_START BIT(3) /* DMA start operation */
#define LDMA_STOP BIT(4) /* DMA stop operation */
-#define LDMA_CONFIG_MASK GENMASK(4, 0) /* DMA controller config bits mask */
+#define LDMA_CONFIG_MASK GENMASK_ULL(4, 0) /* DMA controller config bits mask */
/* Bitfields in ndesc_addr field of HW descriptor */
#define LDMA_DESC_EN BIT(0) /*1: The next descriptor is valid */
From: Jiayuan Chen <mrpre(a)163.com>
commit 8ca2a1eeadf09862190b2810697702d803ceef2d upstream.
When the stream_verdict program returns SK_PASS, it places the received skb
into its own receive queue, but a recursive lock eventually occurs, leading
to an operating system deadlock. This issue has been present since v6.9.
'''
sk_psock_strp_data_ready
write_lock_bh(&sk->sk_callback_lock)
strp_data_ready
strp_read_sock
read_sock -> tcp_read_sock
strp_recv
cb.rcv_msg -> sk_psock_strp_read
# now stream_verdict return SK_PASS without peer sock assign
__SK_PASS = sk_psock_map_verd(SK_PASS, NULL)
sk_psock_verdict_apply
sk_psock_skb_ingress_self
sk_psock_skb_ingress_enqueue
sk_psock_data_ready
read_lock_bh(&sk->sk_callback_lock) <= dead lock
'''
This topic has been discussed before, but it has not been fixed.
Previous discussion:
https://lore.kernel.org/all/6684a5864ec86_403d20898@john.notmuch
Fixes: 6648e613226e ("bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue")
Reported-by: Vincent Whitchurch <vincent.whitchurch(a)datadoghq.com>
Signed-off-by: Jiayuan Chen <mrpre(a)163.com>
Signed-off-by: John Fastabend <john.fastabend(a)gmail.com>
Acked-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Link: https://patch.msgid.link/20241118030910.36230-2-mrpre@163.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
[srish: Apply to stable branch linux-5.10.y]
Signed-off-by: Srish Srinivasan <srishwap4(a)gmail.com>
---
net/core/skmsg.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 51792dda1..890e16bbc 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -940,9 +940,9 @@ static void sk_psock_strp_data_ready(struct sock *sk)
if (tls_sw_has_ctx_rx(sk)) {
psock->parser.saved_data_ready(sk);
} else {
- write_lock_bh(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
strp_data_ready(&psock->parser.strp);
- write_unlock_bh(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
}
}
rcu_read_unlock();
--
2.25.1
From: Yang Erkun <yangerkun(a)huaweicloud.com>
[ Upstream commit d5ff2fb2e7167e9483846e34148e60c0c016a1f6 ]
In the normal case, when we execute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Numerous use-after-free errors may occur
as well.
nfsd_client_shrinker                        echo 0 > /proc/fs/nfsd/threads
 expire_client                              nfsd_shutdown_net
  unhash_client                             ...
                                            nfs4_state_shutdown_net
                                            /* won't wait shrinker exit */
  /* cancel_work(&nn->nfsd_shrinker_work)
   * nfsd_file for this                     /* won't destroy unhashed client1 */
   * client1 still alive                    nfs4_state_destroy_net
   */
                                            nfsd_file_cache_shutdown
                                            /* trigger warning */
                                            kmem_cache_destroy(nfsd_file_slab)
                                            kmem_cache_destroy(nfsd_file_mark_slab)
 /* release nfsd_file and mark */
 __destroy_client
====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1
dump_stack_lvl+0x53/0x70
slab_err+0xb0/0xf0
__kmem_cache_shutdown+0x15c/0x310
kmem_cache_destroy+0x66/0x160
nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
nfsd_destroy_serv+0x251/0x2a0 [nfsd]
nfsd_svc+0x125/0x1e0 [nfsd]
write_threads+0x16a/0x2a0 [nfsd]
nfsctl_transaction_write+0x74/0xa0 [nfsd]
vfs_write+0x1a5/0x6d0
ksys_write+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
====================================================================
BUG nfsd_file_mark (Tainted: G B W ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------
dump_stack_lvl+0x53/0x70
slab_err+0xb0/0xf0
__kmem_cache_shutdown+0x15c/0x310
kmem_cache_destroy+0x66/0x160
nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
nfsd_destroy_serv+0x251/0x2a0 [nfsd]
nfsd_svc+0x125/0x1e0 [nfsd]
write_threads+0x16a/0x2a0 [nfsd]
nfsctl_transaction_write+0x74/0xa0 [nfsd]
vfs_write+0x1a5/0x6d0
ksys_write+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.
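For context, a minimal kernel-style sketch of the difference being relied on;
the helper below is hypothetical, while cancel_work() and cancel_work_sync()
are the real workqueue API:
#include <linux/workqueue.h>
/*
 * cancel_work() only removes a pending work item and returns immediately even
 * if the callback is currently running, so the shrinker may still be touching
 * per-net state after it returns. cancel_work_sync() additionally waits for a
 * running callback to finish. (Both calls are shown back to back purely for
 * illustration.)
 */
static void example_shutdown(struct work_struct *shrinker_work)
{
	cancel_work(shrinker_work);		/* racy: does not wait for a running callback */
	cancel_work_sync(shrinker_work);	/* safe: waits until the callback has finished */
}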
Fixes: 7c24fa225081 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <yangerkun(a)huaweicloud.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
(cherry picked from commit f965dc0f099a54fca100acf6909abe52d0c85328)
Signed-off-by: Vasiliy Kovalev <kovalev(a)altlinux.org>
---
Backport to fix CVE-2024-50121
Link: https://www.cve.org/CVERecord/?id=CVE-2024-50121
---
fs/nfsd/nfs4state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 8bceae771c1c75..f6fa719ee32668 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -8208,7 +8208,7 @@ nfs4_state_shutdown_net(struct net *net)
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
unregister_shrinker(&nn->nfsd_client_shrinker);
- cancel_work(&nn->nfsd_shrinker_work);
+ cancel_work_sync(&nn->nfsd_shrinker_work);
cancel_delayed_work_sync(&nn->laundromat_work);
locks_end_grace(&nn->nfsd4_manager);
--
2.33.8
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: dc81e556f2a017d681251ace21bf06c126d5a192
Gitweb: https://git.kernel.org/tip/dc81e556f2a017d681251ace21bf06c126d5a192
Author: Xin Li (Intel) <xin(a)zytor.com>
AuthorDate: Wed, 13 Nov 2024 09:59:34 -08:00
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitterDate: Sun, 29 Dec 2024 10:18:10 +01:00
x86/fred: Clear WFE in missing-ENDBRANCH #CPs
An indirect branch instruction sets the CPU indirect branch tracker
(IBT) into WAIT_FOR_ENDBRANCH (WFE) state and WFE stays asserted
across the instruction boundary. When the decoder finds an
inappropriate instruction (anything other than ENDBR) while WFE is set,
the CPU raises a #CP fault.
For the "kernel IBT no ENDBR" selftest where #CPs are deliberately
triggered, the WFE state of the interrupted context needs to be
cleared to let execution continue. Otherwise when the CPU resumes
from the instruction that just caused the previous #CP, another
missing-ENDBRANCH #CP is raised and the CPU enters a dead loop.
This is not a problem with IDT because it doesn't preserve WFE and
IRET doesn't set WFE. But FRED provides space on the entry stack
(in an expanded CS area) to save and restore the WFE state, thus the
WFE state is no longer clobbered, so software must clear it.
Clear WFE, via ibt_clear_fred_wfe(), in the selftest and !ibt_fatal code
paths where execution is allowed to continue, to avoid dead looping.
Clobbering WFE in any other circumstance is a security-relevant bug.
[ dhansen: changelog rewording ]
Fixes: a5f6c2ace997 ("x86/shstk: Add user control-protection fault handler")
Signed-off-by: Xin Li (Intel) <xin(a)zytor.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20241113175934.3897541-1-xin%40zytor.com
---
arch/x86/kernel/cet.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index d2c732a..303bf74 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -81,6 +81,34 @@ static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
static __ro_after_init bool ibt_fatal = true;
+/*
+ * By definition, all missing-ENDBRANCH #CPs are a result of WFE && !ENDBR.
+ *
+ * For the kernel IBT no ENDBR selftest where #CPs are deliberately triggered,
+ * the WFE state of the interrupted context needs to be cleared to let execution
+ * continue. Otherwise when the CPU resumes from the instruction that just
+ * caused the previous #CP, another missing-ENDBRANCH #CP is raised and the CPU
+ * enters a dead loop.
+ *
+ * This is not a problem with IDT because it doesn't preserve WFE and IRET doesn't
+ * set WFE. But FRED provides space on the entry stack (in an expanded CS area)
+ * to save and restore the WFE state, thus the WFE state is no longer clobbered,
+ * so software must clear it.
+ */
+static void ibt_clear_fred_wfe(struct pt_regs *regs)
+{
+ /*
+ * No need to do any FRED checks.
+ *
+ * For IDT event delivery, the high-order 48 bits of CS are pushed
+ * as 0s into the stack, and later IRET ignores these bits.
+ *
+ * For FRED, a test to check if fred_cs.wfe is set would be dropped
+ * by compilers.
+ */
+ regs->fred_cs.wfe = 0;
+}
+
static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
{
if ((error_code & CP_EC) != CP_ENDBR) {
@@ -90,6 +118,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (unlikely(regs->ip == (unsigned long)&ibt_selftest_noendbr)) {
regs->ax = 0;
+ ibt_clear_fred_wfe(regs);
return;
}
@@ -97,6 +126,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (!ibt_fatal) {
printk(KERN_DEFAULT CUT_HERE);
__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+ ibt_clear_fred_wfe(regs);
return;
}
BUG();
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 27834971f616c5e154423c578fa95e0444444ce1
Gitweb: https://git.kernel.org/tip/27834971f616c5e154423c578fa95e0444444ce1
Author: Li RongQing <lirongqing(a)baidu.com>
AuthorDate: Wed, 19 Jun 2024 19:18:01 +08:00
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitterDate: Sun, 29 Dec 2024 10:18:44 +01:00
virt: tdx-guest: Just leak decrypted memory on unrecoverable errors
In CoCo VMs it is possible for the untrusted host to cause
set_memory_decrypted() to fail such that an error is returned
and the resulting memory is shared. Callers need to take care
to handle these errors to avoid returning decrypted (shared)
memory to the page allocator, which could lead to functional
or security issues.
Leak the decrypted memory when set_memory_decrypted() fails; there is
no need to print an error, since set_memory_decrypted() already calls
WARN_ONCE().
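A minimal sketch of the resulting pattern, generic and with illustrative
names rather than the tdx-guest code: when flipping a page's encryption
state fails, its state is indeterminate, so the buffer must not be handed
back to the page allocator.

/* Illustrative sketch only -- generic CoCo shared-buffer allocation. */
#include <linux/mm.h>
#include <linux/set_memory.h>

static void *demo_alloc_shared_buf(size_t len)
{
	unsigned int count = len >> PAGE_SHIFT;
	void *addr = alloc_pages_exact(len, GFP_KERNEL | __GFP_ZERO);

	if (!addr)
		return NULL;

	/*
	 * On failure the page's encryption state is unknown, so leak the
	 * allocation rather than return possibly-shared memory to the page
	 * allocator.  set_memory_decrypted() already WARN_ONCE()s, so there
	 * is nothing useful to print here.
	 */
	if (set_memory_decrypted((unsigned long)addr, count))
		return NULL;

	return addr;
}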
Fixes: f4738f56d1dc ("virt: tdx-guest: Add Quote generation support using TSM_REPORTS")
Signed-off-by: Li RongQing <lirongqing(a)baidu.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20240619111801.25630-1-lirongqing%40baidu.com
---
drivers/virt/coco/tdx-guest/tdx-guest.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/virt/coco/tdx-guest/tdx-guest.c b/drivers/virt/coco/tdx-guest/tdx-guest.c
index d7db6c8..224e7dd 100644
--- a/drivers/virt/coco/tdx-guest/tdx-guest.c
+++ b/drivers/virt/coco/tdx-guest/tdx-guest.c
@@ -124,10 +124,8 @@ static void *alloc_quote_buf(void)
if (!addr)
return NULL;
- if (set_memory_decrypted((unsigned long)addr, count)) {
- free_pages_exact(addr, len);
+ if (set_memory_decrypted((unsigned long)addr, count))
return NULL;
- }
return addr;
}
From: Chuck Lever <chuck.lever(a)oracle.com>
Testing shows that the EBUSY error return from mtree_alloc_cyclic()
leaks into user space. The ERRORS section of "man creat(2)" says:
> EBUSY O_EXCL was specified in flags and pathname refers
> to a block device that is in use by the system
> (e.g., it is mounted).
ENOSPC is closer to what applications expect in this situation.
Note that the normal range of simple directory offset values is
2..2^63, so hitting this error is going to be rare to impossible.
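As an illustration of the user-visible difference, here is a hypothetical
userspace snippet (not part of the patch; the path is made up): before the
change an application creating a file on such a filesystem could see EBUSY,
which creat(2) documents only for block devices; after it, the same failure
reports the expected ENOSPC.

/* Hypothetical userspace illustration -- not part of the patch. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int fd = open("/mnt/tmpfs/newfile", O_CREAT | O_EXCL | O_WRONLY, 0600);

	if (fd < 0) {
		/*
		 * With the directory-offset allocator exhausted (rare to
		 * impossible in practice), errno used to surface as EBUSY;
		 * the fix maps it to the expected ENOSPC.
		 */
		perror("open");
		return errno == ENOSPC ? 1 : 2;
	}
	return 0;
}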
Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets")
Cc: <stable(a)vger.kernel.org> # v6.9+
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Reviewed-by: Yang Erkun <yangerkun(a)huawei.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/libfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 748ac5923154..3da58a92f48f 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -292,8 +292,8 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
LONG_MAX, &octx->next_offset, GFP_KERNEL);
- if (ret < 0)
- return ret;
+ if (unlikely(ret < 0))
+ return ret == -EBUSY ? -ENOSPC : ret;
offset_set(dentry, offset);
return 0;
--
2.47.0
>
>
>> On Dec 25, 2024, at 13:37, Kun Hu <huk23(a)m.fudan.edu.cn> wrote:
>>
>> Hello,
>>
>>> BUG: KASAN: slab-out-of-bounds in snd_seq_oss_synth_sysex+0x5d1/0x6c0 sound/core/seq/oss/seq_oss_synth.c:516
>>
>> We further analyzed the issue at line 516 in ./sound/core/seq/oss/seq_oss_synth.c.
>> The slab-out-of-bounds write actually occurs at line 509, when sysex->len = 128: the write to dest[0] then lands beyond the bounds of sysex->buf (128 bytes).
>> To resolve this issue, we suggest adding a bounds check on sysex->len before entering the loop, so that no write past the end of sysex->buf is possible:
>>
>> if (sysex->len >= MAX_SYSEX_BUFLEN) {
>> sysex->len = 0;
>> sysex->skip = 1;
>> return -EINVAL; /* Exit early if sysex->len is out of bounds */
>> }
>>
>> If you fix this issue, please add the following tag to the commit:
>> Reported-by: Kun Hu <huk23(a)m.fudan.edu.cn>
>>
>> —————
>> Thanks,
>> Kun Hu
>>
>>> On Dec 24, 2024, at 19:16, Kun Hu <huk23(a)m.fudan.edu.cn> wrote:
>>>
>>> Hello,
>>>
>>> While fuzzing the latest Linux kernel with our fuzzer tool, the following
>>> crash was triggered.
>>>
>>> HEAD commit: 78d4f34e2115b517bcbfe7ec0d018bbbb6f9b0b8
>>> git tree: upstream
>>> Console output: https://drive.google.com/file/d/17oCyKDW_kNhSW5Bbvm23vnpD1eo0MHFi/vi…
>>> Kernel config: https://drive.google.com/file/d/1RhT5dFTs6Vx1U71PbpenN7TPtnPoa3NI/view?usp=…
>>> C reproducer: https://drive.google.com/file/d/177HJht6a7-6F3YLudKb_d4kiPGd1VA_i/view?usp=…
>>> Syzlang reproducer: https://drive.google.com/file/d/1AuP5UGGc47rEXXPuvjmCKgJ3d0U1P84j/view?usp=…
>>>
>>>
>>> If you fix this issue, please add the following tag to the commit:
>>> Reported-by: Kun Hu <huk23(a)m.fudan.edu.cn>
>>>
>>> ==================================================================
>>> BUG: KASAN: slab-out-of-bounds in snd_seq_oss_synth_sysex+0x5d1/0x6c0 sound/core/seq/oss/seq_oss_synth.c:516
>>> Write of size 1 at addr ff1100000588e288 by task syz-executor411/824
>>>
>>> CPU: 2 UID: 0 PID: 824 Comm: syz-executor411 Not tainted 6.13.0-rc3 #5
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
>>> Call Trace:
>>> <TASK>
>>> __dump_stack lib/dump_stack.c:94 [inline]
>>> dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
>>> print_address_description mm/kasan/report.c:378 [inline]
>>> print_report+0xcf/0x5f0 mm/kasan/report.c:489
>>> kasan_report+0x93/0xc0 mm/kasan/report.c:602
>>> snd_seq_oss_synth_sysex+0x5d1/0x6c0 sound/core/seq/oss/seq_oss_synth.c:516
>>> snd_seq_oss_process_event+0x46a/0x2620 sound/core/seq/oss/seq_oss_event.c:61
>>> insert_queue sound/core/seq/oss/seq_oss_rw.c:167 [inline]
>>> snd_seq_oss_write+0x261/0x7f0 sound/core/seq/oss/seq_oss_rw.c:135
>>> odev_write+0x53/0xa0 sound/core/seq/oss/seq_oss.c:168
>>> vfs_write fs/read_write.c:677 [inline]
>>> vfs_write+0x2e3/0x10f0 fs/read_write.c:659
>>> ksys_write+0x122/0x240 fs/read_write.c:731
>>> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>>> do_syscall_64+0xc3/0x1d0 arch/x86/entry/common.c:83
>>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>
>
Hello,
Is this issue being considered? And is it possible for the value of sysex->len at line 509 of snd_seq_oss_synth_sysex() to exceed 127 and thus index past the end of sysex->buf?
———
Best Regards,
Kun Hu
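To make the proposed guard concrete, here is a self-contained sketch of the
idea. The types and the append loop below are simplified stand-ins, not the
actual sound/core/seq/oss code; the point is only that the record is dropped
before any byte can be written past the end of the buffer.

/* Illustrative sketch only -- simplified stand-ins for the OSS sequencer types. */
#include <linux/errno.h>

#define DEMO_MAX_SYSEX_BUFLEN 128

struct demo_sysex {
	int len;			/* bytes accumulated so far */
	int skip;			/* set when the record is discarded */
	unsigned char buf[DEMO_MAX_SYSEX_BUFLEN];
};

/* Append up to n bytes, refusing to write past the end of buf. */
static int demo_sysex_append(struct demo_sysex *sysex,
			     const unsigned char *src, int n)
{
	int i;

	if (sysex->len >= DEMO_MAX_SYSEX_BUFLEN) {
		/* Record already full: reset and skip instead of overflowing. */
		sysex->len = 0;
		sysex->skip = 1;
		return -EINVAL;
	}

	for (i = 0; i < n && sysex->len < DEMO_MAX_SYSEX_BUFLEN; i++)
		sysex->buf[sysex->len++] = src[i];

	return 0;
}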
The patch titled
Subject: mm/kmemleak: fix percpu memory leak detection failure
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-kmemleak-fix-percpu-memory-leak-detection-failure.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Guo Weikang <guoweikang.kernel(a)gmail.com>
Subject: mm/kmemleak: fix percpu memory leak detection failure
Date: Fri, 27 Dec 2024 17:23:10 +0800
kmemleak_alloc_percpu() passes an incorrect min_count parameter (0 instead
of 1), causing percpu memory to be treated as a gray object, i.e. scanned
for pointers but never reported as a leak, so leaked percpu allocations go
undetected.
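For readers unfamiliar with the API, a short illustration of what min_count
means to kmemleak; the allocation below is hypothetical and not taken from
this patch.

/* Illustrative sketch only -- hypothetical manual kmemleak registration. */
#include <linux/gfp.h>
#include <linux/kmemleak.h>

static void *demo_alloc_tracked(size_t size)
{
	/* Page-allocator memory is not a slab object, so register it by hand. */
	void *buf = alloc_pages_exact(size, GFP_KERNEL);

	if (!buf)
		return NULL;

	/*
	 * min_count = 1: report the object as a leak if fewer than one
	 * reference to it is found while scanning.  Passing 0 instead marks
	 * it "gray" -- scanned for pointers but never reported -- which is
	 * the behaviour this patch fixes for percpu allocations.
	 */
	kmemleak_alloc(buf, size, 1, GFP_KERNEL);
	return buf;
}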
Link: https://lkml.kernel.org/r/20241227092311.3572500-1-guoweikang.kernel@gmail.…
Fixes: 8c8685928910 ("mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space")
Signed-off-by: Guo Weikang <guoweikang.kernel(a)gmail.com>
Acked-by: Uros Bizjak <ubizjak(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Guo Weikang <guoweikang.kernel(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kmemleak.c~mm-kmemleak-fix-percpu-memory-leak-detection-failure
+++ a/mm/kmemleak.c
@@ -1093,7 +1093,7 @@ void __ref kmemleak_alloc_percpu(const v
pr_debug("%s(0x%px, %zu)\n", __func__, ptr, size);
if (kmemleak_enabled && ptr && !IS_ERR_PCPU(ptr))
- create_object_percpu((__force unsigned long)ptr, size, 0, gfp);
+ create_object_percpu((__force unsigned long)ptr, size, 1, gfp);
}
EXPORT_SYMBOL_GPL(kmemleak_alloc_percpu);
_
Patches currently in -mm which might be from guoweikang.kernel(a)gmail.com are
mm-kmemleak-fix-percpu-memory-leak-detection-failure.patch
mm-shmem-refactor-to-reuse-vfs_parse_monolithic_sep-for-option-parsing.patch
mm-early_ioremap-add-null-pointer-checks-to-prevent-null-pointer-dereference.patch