From: Fengnan Chang <changfengnan(a)bytedance.com>
[ Upstream commit 9d83e1f05c98bab5de350bef89177e2be8b34db0 ]
After commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker
if it can make progress"), in our produce environment, we still
observe that part of io_worker threads keeps creating and destroying.
After analysis, it was confirmed that this was due to a more complex
scenario involving a large number of fsync operations, which can be
abstracted as frequent write + fsync operations on multiple files in
a single uring instance. Since write is a hash operation while fsync
is not, and fsync is likely to be suspended during execution, the
action of checking the hash value in
io_wqe_dec_running cannot handle such scenarios.
Similarly, if hash-based work and non-hash-based work are sent at the
same time, similar issues are likely to occur.
Returning to the starting point of the issue, when a new work
arrives, io_wq_enqueue may wake up free worker A, while
io_wq_dec_running may create worker B. Ultimately, only one of A and
B can obtain and process the task, leaving the other in an idle
state. In the end, the issue is caused by inconsistent logic in the
checks performed by io_wq_enqueue and io_wq_dec_running.
Therefore, the problem can be resolved by checking for available
workers in io_wq_dec_running.
Signed-off-by: Fengnan Chang <changfengnan(a)bytedance.com>
Reviewed-by: Diangang Li <lidiangang(a)bytedance.com>
Link: https://lore.kernel.org/r/20250813120214.18729-1-changfengnan@bytedance.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now let me analyze the nature of the fix to determine if it's
appropriate for stable:
**Backport Status: YES**
This commit should be backported to stable kernel trees for the
following reasons:
## Bug Fix Nature
The commit fixes a **real performance bug** that occurs in production
environments. The issue causes unnecessary creation and destruction of
io_worker threads, leading to:
1. CPU overhead from constant thread creation/destruction
2. Resource wastage from redundant workers
3. Performance degradation in production workloads with mixed hash/non-
hash operations
## Root Cause Analysis
The patch addresses a **race condition** between `io_wq_enqueue()` and
`io_wq_dec_running()`:
- When new work arrives, `io_wq_enqueue()` may wake up a free worker A
- Simultaneously, `io_wq_dec_running()` may create a new worker B
- Only one worker can actually process the task, leaving the other idle
- This creates a worker "churn" pattern that wastes resources
## Small and Contained Fix
The fix is minimal and surgical:
1. Adds a check in `create_worker_cb()` to verify if a free worker is
available before creating a new one
2. Uses the existing `io_acct_activate_free_worker()` function (lines
361-365)
3. Only adds 8 lines of code with proper RCU locking
4. Introduces a single goto label for clean error handling
## Production Impact
The commit message explicitly states this was observed in **production
environments** with real workloads involving:
- Frequent write + fsync operations
- Mixed hash and non-hash operations
- Multiple files in a single io_uring instance
## Regression Potential
The fix has **low regression risk** because:
1. It reuses existing, well-tested infrastructure
(`io_acct_activate_free_worker()`)
2. The logic is consistent with what `io_wq_enqueue()` already does
3. It adds a defensive check rather than changing core logic
4. The RCU locking is properly scoped and follows existing patterns
## Stable Kernel Rules Compliance
The patch follows stable kernel criteria:
- ✓ Fixes a real bug affecting users (worker thread churn in production)
- ✓ Small change (8 lines added)
- ✓ Obviously correct (aligns logic between enqueue and dec_running
paths)
- ✓ Already tested in production (reported by ByteDance engineers)
- ✓ Not a new feature or optimization
## Affected Versions
This should be backported to kernels containing commit 0b2b066f8a85
("io_uring/io-wq: only create a new worker if it can make progress")
which introduced incomplete handling of the worker creation logic. The
issue affects mixed workload scenarios that are common in production
environments.
The fact that this was discovered and reported by engineers from a major
cloud provider (ByteDance) running production workloads further
validates its importance for stable backporting.
io_uring/io-wq.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index be91edf34f01..17dfaa0395c4 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -357,6 +357,13 @@ static void create_worker_cb(struct callback_head *cb)
worker = container_of(cb, struct io_worker, create_work);
wq = worker->wq;
acct = worker->acct;
+
+ rcu_read_lock();
+ do_create = !io_acct_activate_free_worker(acct);
+ rcu_read_unlock();
+ if (!do_create)
+ goto no_need_create;
+
raw_spin_lock(&acct->workers_lock);
if (acct->nr_workers < acct->max_workers) {
@@ -367,6 +374,7 @@ static void create_worker_cb(struct callback_head *cb)
if (do_create) {
create_io_worker(wq, acct);
} else {
+no_need_create:
atomic_dec(&acct->nr_running);
io_worker_ref_put(wq);
}
--
2.50.1
amdgpu_dm_connector_ddc_get_modes() reinitializes a connector's probed
modes list without cleaning it up. First time it is called during the
driver's initialization phase, then via drm_mode_getconnector() ioctl.
The leaks observed with Kmemleak are as following:
unreferenced object 0xffff88812f91b200 (size 128):
comm "(udev-worker)", pid 388, jiffies 4294695475
hex dump (first 32 bytes):
ac dd 07 00 80 02 70 0b 90 0b e0 0b 00 00 e0 01 ......p.........
0b 07 10 07 5c 07 00 00 0a 00 00 00 00 00 00 00 ....\...........
backtrace (crc 89db554f):
__kmalloc_cache_noprof+0x3a3/0x490
drm_mode_duplicate+0x8e/0x2b0
amdgpu_dm_create_common_mode+0x40/0x150 [amdgpu]
amdgpu_dm_connector_add_common_modes+0x336/0x488 [amdgpu]
amdgpu_dm_connector_get_modes+0x428/0x8a0 [amdgpu]
amdgpu_dm_initialize_drm_device+0x1389/0x17b4 [amdgpu]
amdgpu_dm_init.cold+0x157b/0x1a1e [amdgpu]
dm_hw_init+0x3f/0x110 [amdgpu]
amdgpu_device_ip_init+0xcf4/0x1180 [amdgpu]
amdgpu_device_init.cold+0xb84/0x1863 [amdgpu]
amdgpu_driver_load_kms+0x15/0x90 [amdgpu]
amdgpu_pci_probe+0x391/0xce0 [amdgpu]
local_pci_probe+0xd9/0x190
pci_call_probe+0x183/0x540
pci_device_probe+0x171/0x2c0
really_probe+0x1e1/0x890
Found by Linux Verification Center (linuxtesting.org).
Fixes: acc96ae0d127 ("drm/amd/display: set panel orientation before drm_dev_register")
Cc: stable(a)vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin(a)ispras.ru>
---
I can't reproduce the issue before the commit in Fixes which placed that
extra amdgpu_dm_connector_get_modes() call. Though the reinitializing part
/* empty probed_modes */
INIT_LIST_HEAD(&connector->probed_modes);
was added years before and it looks OK since drm_connector_list_update()
should shake the list in between drm_mode_getconnector() ioctl calls.
For what the patch does there exists a drm_mode_remove() helper but it's
static at drivers/gpu/drm/drm_connector.c and requires to be exported
first. This probably looks like a subject for an independent for-next
patch, if needed.
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index cd0e2976e268..4b84f944f066 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8227,9 +8227,14 @@ static void amdgpu_dm_connector_ddc_get_modes(struct drm_connector *connector,
{
struct amdgpu_dm_connector *amdgpu_dm_connector =
to_amdgpu_dm_connector(connector);
+ struct drm_display_mode *mode, *t;
if (drm_edid) {
/* empty probed_modes */
+ list_for_each_entry_safe(mode, t, &connector->probed_modes, head) {
+ list_del(&mode->head);
+ drm_mode_destroy(connector->dev, mode);
+ }
INIT_LIST_HEAD(&connector->probed_modes);
amdgpu_dm_connector->num_modes =
drm_edid_connector_add_modes(connector);
--
2.50.1
Fix smb3_init_transform_rq() to initialise buffer to NULL before calling
netfs_alloc_folioq_buffer() as netfs assumes it can append to the buffer it
is given. Setting it to NULL means it should start a fresh buffer, but the
value is currently undefined.
Fixes: a2906d3316fc ("cifs: Switch crypto buffer to use a folio_queue rather than an xarray")
Signed-off-by: David Howells <dhowells(a)redhat.com>
cc: Steve French <sfrench(a)samba.org>
cc: Paulo Alcantara <pc(a)manguebit.org>
cc: linux-cifs(a)vger.kernel.org
cc: linux-fsdevel(a)vger.kernel.org
---
fs/smb/client/smb2ops.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c
index ad8947434b71..cd0c9b5a35c3 100644
--- a/fs/smb/client/smb2ops.c
+++ b/fs/smb/client/smb2ops.c
@@ -4487,7 +4487,7 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, int num_rqst,
for (int i = 1; i < num_rqst; i++) {
struct smb_rqst *old = &old_rq[i - 1];
struct smb_rqst *new = &new_rq[i];
- struct folio_queue *buffer;
+ struct folio_queue *buffer = NULL;
size_t size = iov_iter_count(&old->rq_iter);
orig_len += smb_rqst_len(server, old);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 47b0f6d8f0d2be4d311a49e13d2fd5f152f492b2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025081848-proximity-feline-dfea@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 47b0f6d8f0d2be4d311a49e13d2fd5f152f492b2 Mon Sep 17 00:00:00 2001
From: Breno Leitao <leitao(a)debian.org>
Date: Thu, 31 Jul 2025 02:57:18 -0700
Subject: [PATCH] mm/kmemleak: avoid deadlock by moving pr_warn() outside
kmemleak_lock
When netpoll is enabled, calling pr_warn_once() while holding
kmemleak_lock in mem_pool_alloc() can cause a deadlock due to lock
inversion with the netconsole subsystem. This occurs because
pr_warn_once() may trigger netpoll, which eventually leads to
__alloc_skb() and back into kmemleak code, attempting to reacquire
kmemleak_lock.
This is the path for the deadlock.
mem_pool_alloc()
-> raw_spin_lock_irqsave(&kmemleak_lock, flags);
-> pr_warn_once()
-> netconsole subsystem
-> netpoll
-> __alloc_skb
-> __create_object
-> raw_spin_lock_irqsave(&kmemleak_lock, flags);
Fix this by setting a flag and issuing the pr_warn_once() after
kmemleak_lock is released.
Link: https://lkml.kernel.org/r/20250731-kmemleak_lock-v1-1-728fd470198f@debian.o…
Fixes: c5665868183f ("mm: kmemleak: use the memory pool for early allocations")
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 8d588e685311..e0333455c738 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -470,6 +470,7 @@ static struct kmemleak_object *mem_pool_alloc(gfp_t gfp)
{
unsigned long flags;
struct kmemleak_object *object;
+ bool warn = false;
/* try the slab allocator first */
if (object_cache) {
@@ -488,8 +489,10 @@ static struct kmemleak_object *mem_pool_alloc(gfp_t gfp)
else if (mem_pool_free_count)
object = &mem_pool[--mem_pool_free_count];
else
- pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n");
+ warn = true;
raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
+ if (warn)
+ pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n");
return object;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 59305202c67fea50378dcad0cc199dbc13a0e99a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025081801-undertake-nuzzle-c4b1@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 59305202c67fea50378dcad0cc199dbc13a0e99a Mon Sep 17 00:00:00 2001
From: Anshuman Khandual <anshuman.khandual(a)arm.com>
Date: Fri, 20 Jun 2025 10:54:27 +0530
Subject: [PATCH] mm/ptdump: take the memory hotplug lock inside
ptdump_walk_pgd()
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take memory hotplug lock, while dumping kernel page table
via the sysfs interface /sys/kernel/debug/kernel_page_tables.
Similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside generic
ptdump_check_wx() which will benefit both the scenarios.
Drop get_online_mems() and put_online_mems() combination from all existing
platform ptdump code paths.
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Dev Jain <dev.jain(a)arm.com>
Acked-by: Alexander Gordeev <agordeev(a)linux.ibm.com> [s390]
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 68bf1a125502..1e308328c079 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <asm/ptdump.h>
@@ -9,9 +8,7 @@ static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- get_online_mems();
ptdump_walk(m, info);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 32922550a50a..3b51690cc876 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,7 +6,6 @@
#include <linux/efi.h>
#include <linux/init.h>
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <linux/ptdump.h>
@@ -413,9 +412,7 @@ bool ptdump_check_wx(void)
static int ptdump_show(struct seq_file *m, void *v)
{
- get_online_mems();
ptdump_walk(m, m->private);
- put_online_mems();
return 0;
}
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index ac604b176660..9af2aae0a515 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -247,11 +247,9 @@ static int ptdump_show(struct seq_file *m, void *v)
.marker = markers,
};
- get_online_mems();
mutex_lock(&cpa_mutex);
ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
mutex_unlock(&cpa_mutex);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 61a352aa12ed..b600c7f864b8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -176,6 +176,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
{
const struct ptdump_range *range = st->range;
+ get_online_mems();
mmap_write_lock(mm);
while (range->start != range->end) {
walk_page_range_debug(mm, range->start, range->end,
@@ -183,6 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
range++;
}
mmap_write_unlock(mm);
+ put_online_mems();
/* Flush out the last page */
st->note_page_flush(st);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 59305202c67fea50378dcad0cc199dbc13a0e99a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025081800-autopilot-booted-fb7f@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 59305202c67fea50378dcad0cc199dbc13a0e99a Mon Sep 17 00:00:00 2001
From: Anshuman Khandual <anshuman.khandual(a)arm.com>
Date: Fri, 20 Jun 2025 10:54:27 +0530
Subject: [PATCH] mm/ptdump: take the memory hotplug lock inside
ptdump_walk_pgd()
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take memory hotplug lock, while dumping kernel page table
via the sysfs interface /sys/kernel/debug/kernel_page_tables.
Similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside generic
ptdump_check_wx() which will benefit both the scenarios.
Drop get_online_mems() and put_online_mems() combination from all existing
platform ptdump code paths.
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Dev Jain <dev.jain(a)arm.com>
Acked-by: Alexander Gordeev <agordeev(a)linux.ibm.com> [s390]
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 68bf1a125502..1e308328c079 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <asm/ptdump.h>
@@ -9,9 +8,7 @@ static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- get_online_mems();
ptdump_walk(m, info);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 32922550a50a..3b51690cc876 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,7 +6,6 @@
#include <linux/efi.h>
#include <linux/init.h>
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <linux/ptdump.h>
@@ -413,9 +412,7 @@ bool ptdump_check_wx(void)
static int ptdump_show(struct seq_file *m, void *v)
{
- get_online_mems();
ptdump_walk(m, m->private);
- put_online_mems();
return 0;
}
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index ac604b176660..9af2aae0a515 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -247,11 +247,9 @@ static int ptdump_show(struct seq_file *m, void *v)
.marker = markers,
};
- get_online_mems();
mutex_lock(&cpa_mutex);
ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
mutex_unlock(&cpa_mutex);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 61a352aa12ed..b600c7f864b8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -176,6 +176,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
{
const struct ptdump_range *range = st->range;
+ get_online_mems();
mmap_write_lock(mm);
while (range->start != range->end) {
walk_page_range_debug(mm, range->start, range->end,
@@ -183,6 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
range++;
}
mmap_write_unlock(mm);
+ put_online_mems();
/* Flush out the last page */
st->note_page_flush(st);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 1ef94169db0958d6de39f9ea6e063ce887342e2d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025081801-shortlist-acutely-2100@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1ef94169db0958d6de39f9ea6e063ce887342e2d Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 2 Jul 2025 15:08:13 +0930
Subject: [PATCH] btrfs: populate otime when logging an inode item
[TEST FAILURE WITH EXPERIMENTAL FEATURES]
When running test case generic/508, the test case will fail with the new
btrfs shutdown support:
generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad)
--- tests/generic/508.out 2022-05-11 11:25:30.806666664 +0930
+++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.401824212 +0930
@@ -1,2 +1,6 @@
QA output created by 508
Silence is golden
+Before:
+After : stat.btime = Thu Jan 1 09:30:00 1970
+Before:
+After : stat.btime = Wed Jul 2 14:53:22 2025
...
(Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff)
Ran: generic/508
Failures: generic/508
Failed 1 of 1 tests
Please note that the test case requires shutdown support, thus the test
case will be skipped using the current upstream kernel, as it doesn't
have shutdown ioctl support.
[CAUSE]
The direct cause the 0 time stamp in the log tree:
leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG
leaf 30507008 flags 0x1(WRITTEN) backref revision 1
checksum stored e522548d
checksum calced e522548d
fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0
chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398
item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 9 transid 9 size 0 nbytes 0
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
atime 1751432947.492000000 (2025-07-02 14:39:07)
ctime 1751432947.492000000 (2025-07-02 14:39:07)
mtime 1751432947.492000000 (2025-07-02 14:39:07)
otime 0.0 (1970-01-01 09:30:00) <<<
But the old fs tree has all the correct time stamp:
btrfs-progs v6.12
fs tree key (FS_TREE ROOT_ITEM 0)
leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE
leaf 30425088 flags 0x1(WRITTEN) backref revision 1
checksum stored 48f6c57e
checksum calced 48f6c57e
fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0
chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 3 transid 0 size 0 nbytes 16384
block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x0(none)
atime 1751432947.0 (2025-07-02 14:39:07)
ctime 1751432947.0 (2025-07-02 14:39:07)
mtime 1751432947.0 (2025-07-02 14:39:07)
otime 1751432947.0 (2025-07-02 14:39:07) <<<
The root cause is that fill_inode_item() in tree-log.c is only
populating a/c/m time, not the otime (or btime in statx output).
Part of the reason is that, the vfs inode only has a/c/m time, no native
btime support yet.
[FIX]
Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and
btrfs_inode::i_otime_nsec.
So what we really need is just fill the otime time stamp in
fill_inode_item() of tree-log.c
There is another fill_inode_item() in inode.c, which is doing the proper
otime population.
Fixes: 94edf4ae43a5 ("Btrfs: don't bother committing delayed inode updates when fsyncing")
CC: stable(a)vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1e805dabfc4b..ab0815d9e7e5 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4233,6 +4233,9 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_timespec_sec(leaf, &item->ctime, inode_get_ctime_sec(inode));
btrfs_set_timespec_nsec(leaf, &item->ctime, inode_get_ctime_nsec(inode));
+ btrfs_set_timespec_sec(leaf, &item->otime, BTRFS_I(inode)->i_otime_sec);
+ btrfs_set_timespec_nsec(leaf, &item->otime, BTRFS_I(inode)->i_otime_nsec);
+
/*
* We do not need to set the nbytes field, in fact during a fast fsync
* its value may not even be correct, since a fast fsync does not wait
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 59305202c67fea50378dcad0cc199dbc13a0e99a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025081859-excess-willfully-0f01@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 59305202c67fea50378dcad0cc199dbc13a0e99a Mon Sep 17 00:00:00 2001
From: Anshuman Khandual <anshuman.khandual(a)arm.com>
Date: Fri, 20 Jun 2025 10:54:27 +0530
Subject: [PATCH] mm/ptdump: take the memory hotplug lock inside
ptdump_walk_pgd()
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take memory hotplug lock, while dumping kernel page table
via the sysfs interface /sys/kernel/debug/kernel_page_tables.
Similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside generic
ptdump_check_wx() which will benefit both the scenarios.
Drop get_online_mems() and put_online_mems() combination from all existing
platform ptdump code paths.
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Dev Jain <dev.jain(a)arm.com>
Acked-by: Alexander Gordeev <agordeev(a)linux.ibm.com> [s390]
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 68bf1a125502..1e308328c079 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <asm/ptdump.h>
@@ -9,9 +8,7 @@ static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- get_online_mems();
ptdump_walk(m, info);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 32922550a50a..3b51690cc876 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,7 +6,6 @@
#include <linux/efi.h>
#include <linux/init.h>
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <linux/ptdump.h>
@@ -413,9 +412,7 @@ bool ptdump_check_wx(void)
static int ptdump_show(struct seq_file *m, void *v)
{
- get_online_mems();
ptdump_walk(m, m->private);
- put_online_mems();
return 0;
}
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index ac604b176660..9af2aae0a515 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -247,11 +247,9 @@ static int ptdump_show(struct seq_file *m, void *v)
.marker = markers,
};
- get_online_mems();
mutex_lock(&cpa_mutex);
ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
mutex_unlock(&cpa_mutex);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 61a352aa12ed..b600c7f864b8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -176,6 +176,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
{
const struct ptdump_range *range = st->range;
+ get_online_mems();
mmap_write_lock(mm);
while (range->start != range->end) {
walk_page_range_debug(mm, range->start, range->end,
@@ -183,6 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
range++;
}
mmap_write_unlock(mm);
+ put_online_mems();
/* Flush out the last page */
st->note_page_flush(st);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 59305202c67fea50378dcad0cc199dbc13a0e99a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025081858-sabbath-blunt-7735@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 59305202c67fea50378dcad0cc199dbc13a0e99a Mon Sep 17 00:00:00 2001
From: Anshuman Khandual <anshuman.khandual(a)arm.com>
Date: Fri, 20 Jun 2025 10:54:27 +0530
Subject: [PATCH] mm/ptdump: take the memory hotplug lock inside
ptdump_walk_pgd()
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take memory hotplug lock, while dumping kernel page table
via the sysfs interface /sys/kernel/debug/kernel_page_tables.
Similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside generic
ptdump_check_wx() which will benefit both the scenarios.
Drop get_online_mems() and put_online_mems() combination from all existing
platform ptdump code paths.
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Dev Jain <dev.jain(a)arm.com>
Acked-by: Alexander Gordeev <agordeev(a)linux.ibm.com> [s390]
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 68bf1a125502..1e308328c079 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <asm/ptdump.h>
@@ -9,9 +8,7 @@ static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
- get_online_mems();
ptdump_walk(m, info);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 32922550a50a..3b51690cc876 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,7 +6,6 @@
#include <linux/efi.h>
#include <linux/init.h>
#include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <linux/ptdump.h>
@@ -413,9 +412,7 @@ bool ptdump_check_wx(void)
static int ptdump_show(struct seq_file *m, void *v)
{
- get_online_mems();
ptdump_walk(m, m->private);
- put_online_mems();
return 0;
}
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index ac604b176660..9af2aae0a515 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -247,11 +247,9 @@ static int ptdump_show(struct seq_file *m, void *v)
.marker = markers,
};
- get_online_mems();
mutex_lock(&cpa_mutex);
ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
mutex_unlock(&cpa_mutex);
- put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 61a352aa12ed..b600c7f864b8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -176,6 +176,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
{
const struct ptdump_range *range = st->range;
+ get_online_mems();
mmap_write_lock(mm);
while (range->start != range->end) {
walk_page_range_debug(mm, range->start, range->end,
@@ -183,6 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
range++;
}
mmap_write_unlock(mm);
+ put_online_mems();
/* Flush out the last page */
st->note_page_flush(st);