The patch titled
Subject: fs/proc/task_mmu: check cur_buf for NULL
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
fs-proc-task_mmu-check-cur_buf-for-null.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Jakub Acs <acsjakub(a)amazon.de>
Subject: fs/proc/task_mmu: check cur_buf for NULL
Date: Fri, 19 Sep 2025 14:21:04 +0000
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 and the page walk
reaches pagemap_scan_backout_range(), the kernel panics with a null-ptr-deref:
[ 44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[ 44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 #22 PREEMPT(none)
[ 44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80
<snip registers, unreliable trace>
[ 44.946828] Call Trace:
[ 44.947030] <TASK>
[ 44.949219] pagemap_scan_pmd_entry+0xec/0xfa0
[ 44.952593] walk_pmd_range.isra.0+0x302/0x910
[ 44.954069] walk_pud_range.isra.0+0x419/0x790
[ 44.954427] walk_p4d_range+0x41e/0x620
[ 44.954743] walk_pgd_range+0x31e/0x630
[ 44.955057] __walk_page_range+0x160/0x670
[ 44.956883] walk_page_range_mm+0x408/0x980
[ 44.958677] walk_page_range+0x66/0x90
[ 44.958984] do_pagemap_scan+0x28d/0x9c0
[ 44.961833] do_pagemap_cmd+0x59/0x80
[ 44.962484] __x64_sys_ioctl+0x18d/0x210
[ 44.962804] do_syscall_64+0x5b/0x290
[ 44.963111] entry_SYSCALL_64_after_hwframe+0x76/0x7e
vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.
This breaks an assumption made later in pagemap_scan_backout_range(): that
a page_region is always allocated for p->vec_buf_index.
Fix it by explicitly checking cur_buf for NULL before dereferencing.
Other sites that might run into the same deref issue are already (directly
or transitively) protected by a check on p->vec_buf.
Note:
According to the PAGEMAP_SCAN man page, vec_len = 0 is valid when no output
is requested and the caller is only interested in the side effects, hence it
passes the check in pagemap_scan_get_args().
This issue was found by syzkaller.
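For illustration, a minimal userspace sketch of such an invocation (a
hypothetical reproducer, not the syzkaller one; the exact conditions needed
to reach the backout path may differ):

  #include <fcntl.h>
  #include <linux/fs.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  int main(void)
  {
          const long len = 0x200000;
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          int fd = open("/proc/self/pagemap", O_RDONLY);
          struct pm_scan_arg arg;

          if (p == MAP_FAILED || fd < 0)
                  return 1;

          memset(&arg, 0, sizeof(arg));
          arg.size = sizeof(arg);
          arg.start = (__u64)(unsigned long)p;
          arg.end = arg.start + len;
          arg.vec = 0;                    /* no output buffer ... */
          arg.vec_len = 0;                /* ... which is valid per the man page */
          arg.return_mask = PAGE_IS_WRITTEN;

          /* before the fix, a walk hitting the backout path dereferenced
           * p->vec_buf[p->vec_buf_index] with vec_buf == NULL */
          return ioctl(fd, PAGEMAP_SCAN, &arg);
  }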
Link: https://lkml.kernel.org/r/20250919142106.43527-1-acsjakub@amazon.de
Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Jinjiang Tu <tujinjiang(a)huawei.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Penglei Jiang <superman.xpt(a)gmail.com>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Andrei Vagin <avagin(a)gmail.com>
Cc: "Micha�� Miros��aw" <mirq-linux(a)rere.qmqm.pl>
Cc: Stephen Rothwell <sfr(a)canb.auug.org.au>
Cc: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
Cc: Alexey Dobriyan <adobriyan(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/task_mmu.c | 3 +++
1 file changed, 3 insertions(+)
--- a/fs/proc/task_mmu.c~fs-proc-task_mmu-check-cur_buf-for-null
+++ a/fs/proc/task_mmu.c
@@ -2417,6 +2417,9 @@ static void pagemap_scan_backout_range(s
{
struct page_region *cur_buf = &p->vec_buf[p->vec_buf_index];
+ if (!cur_buf)
+ return;
+
if (cur_buf->start != addr)
cur_buf->end = addr;
else
_
Patches currently in -mm which might be from acsjakub(a)amazon.de are
fs-proc-task_mmu-check-cur_buf-for-null.patch
Bug-report: https://lore.kernel.org/all/915c0e00-b92d-4e37-9d4b-0f6a4580da97@oracle.com/
Summary: while backporting commit 7c62c442b6eb ("x86/vmscape: Enumerate
VMSCAPE bug") to 6.12.y, VULNBL_AMD(0x1a, SRSO | VMSCAPE) was added even
though 6.12.y doesn't have commit 877818802c3e ("x86/bugs: Add
SRSO_USER_KERNEL_NO support").
Boris Ostrovsky suggested backporting three commits to 6.12.y:
1. commit: 877818802c3e ("x86/bugs: Add SRSO_USER_KERNEL_NO support")
2. commit: 8442df2b49ed ("x86/bugs: KVM: Add support for SRSO_MSR_FIX")
and its fix
3. commit: e3417ab75ab2 ("KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0
<=> 1 VM count transitions") -- Maybe optional
Backporting these changes the mitigation status on Turin for 6.12.48 from
Safe RET to Reduced Speculation; leaving it at Safe RET likely causes heavy
performance regressions.
These three patches together change the mitigation status from Safe RET to
Reduced Speculation.
Tested on Turin:
[ 3.188134] Speculative Return Stack Overflow: Mitigation: Reduced Speculation
Backports:
1. Patch 1 had a minor conflict, as the VMSCAPE commit added
   VULNBL_AMD(0x1a, SRSO | VMSCAPE); the resolution is to skip that line.
2. Patches 2 and 3 are clean cherry-picks; 3 is a fix for 2.
Note: I checked whether this problem also exists on other stable trees
(6.6 down to 5.10); no, they don't have this backport problem.
Thanks,
Harshit
Borislav Petkov (1):
x86/bugs: KVM: Add support for SRSO_MSR_FIX
Borislav Petkov (AMD) (1):
x86/bugs: Add SRSO_USER_KERNEL_NO support
Sean Christopherson (1):
KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count
transitions
Documentation/admin-guide/hw-vuln/srso.rst | 13 +++++
arch/x86/include/asm/cpufeatures.h | 5 ++
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/bugs.c | 28 ++++++++--
arch/x86/kvm/svm/svm.c | 65 ++++++++++++++++++++++
arch/x86/kvm/svm/svm.h | 2 +
arch/x86/lib/msr.c | 2 +
7 files changed, 112 insertions(+), 4 deletions(-)
--
2.50.1
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: 32278c677947ae2f042c9535674a7fff9a245dd3
Gitweb: https://git.kernel.org/tip/32278c677947ae2f042c9535674a7fff9a245dd3
Author: Sean Christopherson <seanjc(a)google.com>
AuthorDate: Fri, 08 Aug 2025 10:23:56 -07:00
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Fri, 19 Sep 2025 20:21:12 +02:00
x86/umip: Check that the instruction opcode is at least two bytes
When checking for a potential UMIP violation on #GP, verify the decoder found
at least two opcode bytes to avoid false positives when the kernel encounters
an unknown instruction that starts with 0f. Because the array of opcode.bytes
is zero-initialized by insn_init(), peeking at bytes[1] will misinterpret
garbage as a potential SLDT or STR instruction, and can incorrectly trigger
emulation.
E.g. if a VPALIGNR instruction
62 83 c5 05 0f 08 ff vpalignr xmm17{k5},xmm23,XMMWORD PTR [r8],0xff
hits a #GP, the kernel emulates it as STR and squashes the #GP (and corrupts
the userspace code stream).
Arguably the check should look for exactly two bytes, but no three byte
opcodes use '0f 00 xx' or '0f 01 xx' as an escape, i.e. it should be
impossible to get a false positive if the first two opcode bytes match '0f 00'
or '0f 01'. Go with a more conservative check with respect to the existing
code to minimize the chances of breaking userspace, e.g. due to decoder
weirdness.
Analyzed by Nick Bray <ncbray(a)google.com>.
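A standalone sketch of why the old check was fooled (illustrative, not
kernel code): insn_init() zero-fills opcode.bytes, so an unknown
instruction whose decode stops after 0f leaves bytes[1] == 0x00, which is
indistinguishable from a real '0f 00' (SLDT/STR) unless nbytes is checked:

  #include <stdio.h>

  struct opcode { unsigned char bytes[4]; int nbytes; };

  static int old_check(const struct opcode *op)
  {
          return op->bytes[0] == 0x0f;    /* bytes[1] may be a stale zero */
  }

  static int new_check(const struct opcode *op)
  {
          return op->nbytes >= 2 && op->bytes[0] == 0x0f;
  }

  int main(void)
  {
          /* decoder found only one opcode byte of an unknown instruction */
          struct opcode unknown = { .bytes = { 0x0f }, .nbytes = 1 };

          printf("old: %d (false positive)\n", old_check(&unknown));
          printf("new: %d\n", new_check(&unknown));
          return 0;
  }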
Fixes: 1e5db223696a ("x86/umip: Add emulation code for UMIP instructions")
Reported-by: Dan Snyder <dansnyder(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
---
arch/x86/kernel/umip.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
index 5a4b213..406ac01 100644
--- a/arch/x86/kernel/umip.c
+++ b/arch/x86/kernel/umip.c
@@ -156,8 +156,8 @@ static int identify_insn(struct insn *insn)
if (!insn->modrm.nbytes)
return -EINVAL;
- /* All the instructions of interest start with 0x0f. */
- if (insn->opcode.bytes[0] != 0xf)
+ /* The instructions of interest have 2-byte opcodes: 0F 00 or 0F 01. */
+ if (insn->opcode.nbytes < 2 || insn->opcode.bytes[0] != 0xf)
return -EINVAL;
if (insn->opcode.bytes[1] == 0x1) {
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: 27b1fd62012dfe9d3eb8ecde344d7aa673695ecf
Gitweb: https://git.kernel.org/tip/27b1fd62012dfe9d3eb8ecde344d7aa673695ecf
Author: Sean Christopherson <seanjc(a)google.com>
AuthorDate: Fri, 08 Aug 2025 10:23:57 -07:00
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Fri, 19 Sep 2025 21:34:48 +02:00
x86/umip: Fix decoding of register forms of 0F 01 (SGDT and SIDT aliases)
Filter out the register forms of 0F 01 when determining whether or not to
emulate in response to a potential UMIP violation #GP, as SGDT and SIDT only
accept memory operands. The register variants of 0F 01 are used to encode
instructions for things like VMX and SGX, i.e. not checking the Mod field
would cause the kernel to incorrectly emulate on #GP, e.g. due to a CPL
violation on VMLAUNCH.
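For reference, a standalone sketch of how the Mod field distinguishes the
register forms (the macros mirror the kernel's X86_MODRM_* definitions; the
VMLAUNCH encoding 0f 01 c2 is from the SDM):

  #include <stdio.h>

  #define MODRM_MOD(m) (((m) & 0xc0) >> 6)
  #define MODRM_REG(m) (((m) & 0x38) >> 3)
  #define MODRM_RM(m)  ((m) & 0x07)

  int main(void)
  {
          unsigned char vmlaunch = 0xc2;  /* 0f 01 c2: mod == 3, register form */
          unsigned char sgdt     = 0x00;  /* 0f 01 00: sgdt [rax], memory form */

          printf("vmlaunch: mod=%d reg=%d rm=%d -> reject\n",
                 MODRM_MOD(vmlaunch), MODRM_REG(vmlaunch), MODRM_RM(vmlaunch));
          printf("sgdt:     mod=%d reg=%d rm=%d -> UMIP_INST_SGDT\n",
                 MODRM_MOD(sgdt), MODRM_REG(sgdt), MODRM_RM(sgdt));
          return 0;
  }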
Fixes: 1e5db223696a ("x86/umip: Add emulation code for UMIP instructions")
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
---
arch/x86/kernel/umip.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
index 406ac01..d432f38 100644
--- a/arch/x86/kernel/umip.c
+++ b/arch/x86/kernel/umip.c
@@ -163,8 +163,19 @@ static int identify_insn(struct insn *insn)
if (insn->opcode.bytes[1] == 0x1) {
switch (X86_MODRM_REG(insn->modrm.value)) {
case 0:
+ /* The reg form of 0F 01 /0 encodes VMX instructions. */
+ if (X86_MODRM_MOD(insn->modrm.value) == 3)
+ return -EINVAL;
+
return UMIP_INST_SGDT;
case 1:
+ /*
+ * The reg form of 0F 01 /1 encodes MONITOR/MWAIT,
+ * STAC/CLAC, and ENCLS.
+ */
+ if (X86_MODRM_MOD(insn->modrm.value) == 3)
+ return -EINVAL;
+
return UMIP_INST_SIDT;
case 4:
return UMIP_INST_SMSW;
From: HariKrishna Sagala <hariconscious(a)gmail.com>
Syzbot reported an uninit-value bug at kmalloc_reserve() on
commit 320475fbd590 ("Merge tag 'mtd/fixes-for-6.17-rc6' of
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux").
Syzbot's KMSAN report shows use of uninitialized memory originating from
kmalloc_reserve(), where memory allocated via kmem_cache_alloc_node() or
kmalloc_node_track_caller() was not explicitly initialized. This can lead
to undefined behavior when the allocated buffer is later accessed.
Fix this by requesting zero-initialized memory, OR-ing __GFP_ZERO into the
gfp flags.
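For context, __GFP_ZERO simply asks the allocator for zeroed memory; a
minimal sketch (kernel context):

  void *buf  = kmalloc(len, GFP_ATOMIC);                  /* contents undefined */
  void *zbuf = kmalloc(len, GFP_ATOMIC | __GFP_ZERO);     /* all bytes zero */
  /* kzalloc(len, GFP_ATOMIC) is exactly the second form */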
Reported-by: syzbot+9a4fbb77c9d4aacd3388(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9a4fbb77c9d4aacd3388
Fixes: 915d975b2ffa ("net: deal with integer overflows in kmalloc_reserve()")
Tested-by: syzbot+9a4fbb77c9d4aacd3388(a)syzkaller.appspotmail.com
Signed-off-by: HariKrishna Sagala <hariconscious(a)gmail.com>
---
net/core/skbuff.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ee0274417948..2308ebf99bbd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -573,6 +573,7 @@ static void *kmalloc_reserve(unsigned int *size, gfp_t flags, int node,
void *obj;
obj_size = SKB_HEAD_ALIGN(*size);
+ flags |= __GFP_ZERO;
if (obj_size <= SKB_SMALL_HEAD_CACHE_SIZE &&
!(flags & KMALLOC_NOT_NORMAL_BITS)) {
obj = kmem_cache_alloc_node(net_hotdata.skb_small_head_cache,
--
2.43.0
If of_device_register() fails, we should call put_device() to drop the
reference count for cleanup; otherwise memory is leaked. Fix this by
calling put_device(), so the name can be freed in kobject_cleanup().
Calling path: of_device_register() -> of_device_add() -> device_add().
As the comment for device_add() says: 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has not
succeeded, use only put_device() to drop the reference count'.
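A sketch of that rule in a hypothetical registration helper (illustrative
only, not the sparc code itself):

  static int register_op(struct platform_device *op)
  {
          int err = of_device_register(op);       /* ends in device_add() */

          if (err) {
                  /* device_add() did not succeed: put_device() only;
                   * the name is then freed in kobject_cleanup() */
                  put_device(&op->dev);
                  return err;
          }

          /* success: later teardown uses device_del() + put_device() */
          return 0;
  }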
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: cf44bbc26cf1 ("[SPARC]: Beginnings of generic of_device framework.")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- retained the manual kfree() due to the lack of a release callback function.
---
arch/sparc/kernel/of_device_64.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/sparc/kernel/of_device_64.c b/arch/sparc/kernel/of_device_64.c
index f98c2901f335..f53092b07b9e 100644
--- a/arch/sparc/kernel/of_device_64.c
+++ b/arch/sparc/kernel/of_device_64.c
@@ -677,6 +677,7 @@ static struct platform_device * __init scan_one_device(struct device_node *dp,
if (of_device_register(op)) {
printk("%pOF: Could not register of device.\n", dp);
+ put_device(&op->dev);
kfree(op);
op = NULL;
}
--
2.25.1
Hi stable maintainers,
While skimming over stable backports for VMSCAPE commits, I found
something unusual.
This is regarding the 6.12.y commit: 7c62c442b6eb ("x86/vmscape:
Enumerate VMSCAPE bug")
commit 7c62c442b6eb95d21bc4c5afc12fee721646ebe2
Author: Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
Date: Thu Aug 14 10:20:42 2025 -0700
x86/vmscape: Enumerate VMSCAPE bug
Commit a508cec6e5215a3fbc7e73ae86a5c5602187934d upstream.
The VMSCAPE vulnerability may allow a guest to cause Branch Target
Injection (BTI) in userspace hypervisors.

Kernels (both host and guest) have existing defenses against direct BTI
attacks from guests. There are also inter-process BTI mitigations which
prevent processes from attacking each other. However, the threat in this
case is to a userspace hypervisor within the same process as the attacker.

Userspace hypervisors have access to their own sensitive data like disk
encryption keys and also typically have access to all guest data. This
means guest userspace may use the hypervisor as a confused deputy to
attack sensitive guest kernel data. There are no existing mitigations
for these attacks.
Introduce X86_BUG_VMSCAPE for this vulnerability and set it on affected
Intel and AMD CPUs.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
So the problem in this commit is this part of the backport:
in file: arch/x86/kernel/cpu/common.c
VULNBL_AMD(0x15, RETBLEED),
VULNBL_AMD(0x16, RETBLEED),
- VULNBL_AMD(0x17, RETBLEED | SMT_RSB | SRSO),
- VULNBL_HYGON(0x18, RETBLEED | SMT_RSB | SRSO),
- VULNBL_AMD(0x19, SRSO | TSA),
+ VULNBL_AMD(0x17, RETBLEED | SMT_RSB | SRSO | VMSCAPE),
+ VULNBL_HYGON(0x18, RETBLEED | SMT_RSB | SRSO | VMSCAPE),
+ VULNBL_AMD(0x19, SRSO | TSA | VMSCAPE),
+ VULNBL_AMD(0x1a, SRSO | VMSCAPE),
+
{}
Notice the part where VULNBL_AMD(0x1a, SRSO | VMSCAPE) is added: 6.12.y
doesn't have commit 877818802c3e ("x86/bugs: Add SRSO_USER_KERNEL_NO
support"), so I think we shouldn't be adding VULNBL_AMD(0x1a, SRSO |
VMSCAPE) directly.
Boris Ostrovsky suggested that I verify this on a Turin machine, as this
could cause a very big performance regression; he stated that if the SRSO
mitigation status is Safe RET, we likely have a problem, and we are in
that situation.
# lscpu | grep -E "CPU family"
CPU family: 26
Note: CPU family 26 -> 0x1a
And Turin machine reports the SRSO mitigation status as "Safe RET"
# uname -r
6.12.48-master.20250917.el8.rc1.x86_64
# cat /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow
Mitigation: Safe RET
Boris Ostrovsky suggested backporting three commits to 6.12.y:
1. commit: 877818802c3e ("x86/bugs: Add SRSO_USER_KERNEL_NO support")
2. commit: 8442df2b49ed ("x86/bugs: KVM: Add support for SRSO_MSR_FIX")
and its fix
3. commit: e3417ab75ab2 ("KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0
<=> 1 VM count transitions") -- Maybe optional
After backporting these three:
# uname -r
6.12.48-master.20250919.el8.dev.x86_64 // Note: this kernel has the above three patches applied.
# dmesg | grep -C 2 Reduce
[ 3.186135] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[ 3.187135] Speculative Return Stack Overflow: Reducing speculation to address VM/HV SRSO attack vector.
[ 3.188134] Speculative Return Stack Overflow: Mitigation: Reduced Speculation
[ 3.189135] VMSCAPE: Mitigation: IBPB before exit to userspace
[ 3.191139] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
# cat /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow
Mitigation: Reduced Speculation
I can send my backports to stable if this looks good. Thoughts?
Thanks,
Harshit
The iput() function is a dangerous one - if the reference counter goes
to zero, the function may block for a long time due to:
- inode_wait_for_writeback() waits until writeback on this inode
completes
- the filesystem-specific "evict_inode" callback can do similar
things; e.g. all netfs-based filesystems will call
netfs_wait_for_outstanding_io() which is similar to
inode_wait_for_writeback()
Therefore, callers must carefully evaluate the context they're in and
check whether invoking iput() is a good idea at all.
Most of the time, this is not a problem because the dcache holds
references to all inodes, and the dcache is usually the one to release
the last reference. But this assumption is fragile. For example,
under (memcg) memory pressure, the dcache shrinker is more likely to
release inode references, moving the inode eviction to contexts where
that was extremely unlikely to occur.
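The hazardous call chain, sketched (simplified) from fs/inode.c:

  iput(inode)
    -> iput_final(inode)                    /* i_count dropped to zero */
      -> evict(inode)
        -> inode_wait_for_writeback(inode)  /* may block for a long time */
        -> op->evict_inode(inode)           /* e.g. ceph_evict_inode() ->
                                               netfs_wait_for_outstanding_io() */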
Our production servers "found" at least two deadlock bugs in the Ceph
filesystem that were caused by this iput() behavior:
1. Writeback may lead to iput() calls in Ceph (e.g. from
ceph_put_wrbuffer_cap_refs()) which deadlocks in
inode_wait_for_writeback(). Waiting for writeback completion from
within writeback will obviously never be able to make any progress.
This leads to blocked kworkers like this:
INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
Not tainted 6.16.7-i1-es #773
task:kworker/u777:6 state:D stack:0 pid:1270802 tgid:1270802 ppid:2
task_flags:0x4208060 flags:0x00004000
Workqueue: writeback wb_workfn (flush-ceph-3)
Call Trace:
<TASK>
__schedule+0x4ea/0x17d0
schedule+0x1c/0xc0
inode_wait_for_writeback+0x71/0xb0
evict+0xcf/0x200
ceph_put_wrbuffer_cap_refs+0xdd/0x220
ceph_invalidate_folio+0x97/0xc0
ceph_writepages_start+0x127b/0x14d0
do_writepages+0xba/0x150
__writeback_single_inode+0x34/0x290
writeback_sb_inodes+0x203/0x470
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x189/0x2b0
wb_workfn+0x30b/0x3d0
process_one_work+0x143/0x2b0
worker_thread+0x30a/0x450
2. In the Ceph messenger thread (net/ceph/messenger*.c), any iput()
call may invoke ceph_evict_inode() which will deadlock in
netfs_wait_for_outstanding_io(); since this blocks the messenger
thread, completions from the Ceph servers will not ever be received
and handled.
It looks like these deadlock bugs have been in the Ceph filesystem
code since forever (therefore no "Fixes" tag in this patch). There
may be various ways to solve this:
- make iput() asynchronous and defer the actual eviction like fput()
(may add overhead)
- make iput() only asynchronous if I_SYNC is set (doesn't solve random
things happening inside the "evict_inode" callback)
- add iput_deferred() to make this asynchronous behavior/overhead
optional and explicit
- refactor Ceph to avoid iput() calls from within writeback and
messenger (if that is even possible)
- add a Ceph-specific workaround
After advice from Mateusz Guzik, I decided to go with the last option. The
implementation is simple because it piggybacks on the existing
work_struct for ceph_queue_inode_work() - ceph_inode_work() calls
iput() at the end which means we can donate the last reference to it.
Since Ceph has a few iput() callers in a loop, it seemed simple enough
to pass this counter and use atomic_sub() instead of atomic_dec().
This patch adds ceph_iput_n_async() and converts lots of iput() calls
to it - at least those that may come through writeback and the
messenger.
Signed-off-by: Max Kellermann <max.kellermann(a)ionos.com>
Cc: Mateusz Guzik <mjguzik(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
fs/ceph/addr.c | 2 +-
fs/ceph/caps.c | 21 ++++++++++-----------
fs/ceph/dir.c | 2 +-
fs/ceph/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.c | 32 ++++++++++++++++----------------
fs/ceph/quota.c | 4 ++--
fs/ceph/snap.c | 10 +++++-----
fs/ceph/super.h | 7 +++++++
8 files changed, 84 insertions(+), 36 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 322ed268f14a..fc497c91530e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -265,7 +265,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
subreq->error = err;
trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
netfs_read_subreq_terminated(subreq);
- iput(req->r_inode);
+ ceph_iput_async(req->r_inode);
ceph_dec_osd_stopping_blocker(fsc->mdsc);
}
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b1a8ff612c41..bd88b5287a2b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1771,7 +1771,7 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
spin_unlock(&mdsc->snap_flush_lock);
if (need_put)
- iput(inode);
+ ceph_iput_async(inode);
}
/*
@@ -3318,8 +3318,8 @@ static void __ceph_put_cap_refs(struct ceph_inode_info *ci, int had,
}
if (wake)
wake_up_all(&ci->i_cap_wq);
- while (put-- > 0)
- iput(inode);
+ if (put > 0)
+ ceph_iput_n_async(inode, put);
}
void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
@@ -3418,9 +3418,8 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
}
if (complete_capsnap)
wake_up_all(&ci->i_cap_wq);
- while (put-- > 0) {
- iput(inode);
- }
+ if (put > 0)
+ ceph_iput_n_async(inode, put);
}
/*
@@ -3917,7 +3916,7 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
if (wake_mdsc)
wake_up_all(&mdsc->cap_flushing_wq);
if (drop)
- iput(inode);
+ ceph_iput_async(inode);
}
void __ceph_remove_capsnap(struct inode *inode, struct ceph_cap_snap *capsnap,
@@ -4008,7 +4007,7 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
wake_up_all(&ci->i_cap_wq);
if (wake_mdsc)
wake_up_all(&mdsc->cap_flushing_wq);
- iput(inode);
+ ceph_iput_async(inode);
}
}
@@ -4557,7 +4556,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
done:
mutex_unlock(&session->s_mutex);
done_unlocked:
- iput(inode);
+ ceph_iput_async(inode);
out:
ceph_dec_mds_stopping_blocker(mdsc);
@@ -4636,7 +4635,7 @@ unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
doutc(cl, "on %p %llx.%llx\n", inode,
ceph_vinop(inode));
ceph_check_caps(ci, 0);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_delay_lock);
}
@@ -4675,7 +4674,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
spin_unlock(&mdsc->cap_dirty_lock);
ceph_wait_on_async_create(inode);
ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_dirty_lock);
}
spin_unlock(&mdsc->cap_dirty_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 32973c62c1a2..ec73ed52a227 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1290,7 +1290,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
ceph_mdsc_free_path_info(&path_info);
}
out:
- iput(req->r_old_inode);
+ ceph_iput_async(req->r_old_inode);
ceph_mdsc_release_dir_caps(req);
}
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f67025465de0..385d5261632d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2191,6 +2191,48 @@ void ceph_queue_inode_work(struct inode *inode, int work_bit)
}
}
+/**
+ * Queue an asynchronous iput() call in a worker thread. Use this
+ * instead of iput() in contexts where evicting the inode is unsafe.
+ * For example, inode eviction may cause deadlocks in
+ * inode_wait_for_writeback() (when called from within writeback) or
+ * in netfs_wait_for_outstanding_io() (when called from within the
+ * Ceph messenger).
+ *
+ * @n: how many references to put
+ */
+void ceph_iput_n_async(struct inode *inode, int n)
+{
+ if (unlikely(!inode))
+ return;
+
+ if (likely(atomic_sub_return(n, &inode->i_count) > 0))
+ /* somebody else is holding another reference -
+ * nothing left to do for us
+ */
+ return;
+
+ doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
+
+ /* the reference counter is now 0, i.e. nobody else is holding
+ * a reference to this inode; restore it to 1 and donate it to
+ * ceph_inode_work() which will call iput() at the end
+ */
+ atomic_set(&inode->i_count, 1);
+
+ /* simply queue a ceph_inode_work() without setting
+ * i_work_mask bit; other than putting the reference, there is
+ * nothing to do
+ */
+ WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
+ &ceph_inode(inode)->i_work));
+
+ /* note: queue_work() cannot fail; if i_work were already
+ * queued, then it would be holding another reference, but no
+ * such reference exists
+ */
+}
+
static void ceph_do_invalidate_pages(struct inode *inode)
{
struct ceph_client *cl = ceph_inode_to_client(inode);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 3bc72b47fe4d..d7fce1ad8073 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1097,14 +1097,14 @@ void ceph_mdsc_release_request(struct kref *kref)
ceph_msg_put(req->r_reply);
if (req->r_inode) {
ceph_put_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
- iput(req->r_inode);
+ ceph_iput_async(req->r_inode);
}
if (req->r_parent) {
ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
- iput(req->r_parent);
+ ceph_iput_async(req->r_parent);
}
- iput(req->r_target_inode);
- iput(req->r_new_inode);
+ ceph_iput_async(req->r_target_inode);
+ ceph_iput_async(req->r_new_inode);
if (req->r_dentry)
dput(req->r_dentry);
if (req->r_old_dentry)
@@ -1118,7 +1118,7 @@ void ceph_mdsc_release_request(struct kref *kref)
*/
ceph_put_cap_refs(ceph_inode(req->r_old_dentry_dir),
CEPH_CAP_PIN);
- iput(req->r_old_dentry_dir);
+ ceph_iput_async(req->r_old_dentry_dir);
}
kfree(req->r_path1);
kfree(req->r_path2);
@@ -1240,7 +1240,7 @@ static void __unregister_request(struct ceph_mds_client *mdsc,
}
if (req->r_unsafe_dir) {
- iput(req->r_unsafe_dir);
+ ceph_iput_async(req->r_unsafe_dir);
req->r_unsafe_dir = NULL;
}
@@ -1413,7 +1413,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node);
if (!cap) {
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
goto random;
}
mds = cap->session->s_mds;
@@ -1422,7 +1422,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
cap == ci->i_auth_cap ? "auth " : "", cap);
spin_unlock(&ci->i_ceph_lock);
out:
- iput(inode);
+ ceph_iput_async(inode);
return mds;
random:
@@ -1841,7 +1841,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
spin_unlock(&session->s_cap_lock);
if (last_inode) {
- iput(last_inode);
+ ceph_iput_async(last_inode);
last_inode = NULL;
}
if (old_cap) {
@@ -1874,7 +1874,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
session->s_cap_iterator = NULL;
spin_unlock(&session->s_cap_lock);
- iput(last_inode);
+ ceph_iput_async(last_inode);
if (old_cap)
ceph_put_cap(session->s_mdsc, old_cap);
@@ -1903,8 +1903,8 @@ static int remove_session_caps_cb(struct inode *inode, int mds, void *arg)
wake_up_all(&ci->i_cap_wq);
if (invalidate)
ceph_queue_invalidate(inode);
- while (iputs--)
- iput(inode);
+ if (iputs > 0)
+ ceph_iput_n_async(inode, iputs);
return 0;
}
@@ -1944,7 +1944,7 @@ static void remove_session_caps(struct ceph_mds_session *session)
spin_unlock(&session->s_cap_lock);
inode = ceph_find_inode(sb, vino);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&session->s_cap_lock);
}
@@ -2512,7 +2512,7 @@ static void ceph_cap_unlink_work(struct work_struct *work)
doutc(cl, "on %p %llx.%llx\n", inode,
ceph_vinop(inode));
ceph_check_caps(ci, CHECK_CAPS_FLUSH);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->cap_delay_lock);
}
}
@@ -3933,7 +3933,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
!req->r_reply_info.has_create_ino) {
/* This should never happen on an async create */
WARN_ON_ONCE(req->r_deleg_ino);
- iput(in);
+ ceph_iput_async(in);
in = NULL;
}
@@ -5313,7 +5313,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
out:
mutex_unlock(&session->s_mutex);
- iput(inode);
+ ceph_iput_async(inode);
ceph_dec_mds_stopping_blocker(mdsc);
return;
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index d90eda19bcc4..bba00f8926e6 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -76,7 +76,7 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
le64_to_cpu(h->max_files));
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
out:
ceph_dec_mds_stopping_blocker(mdsc);
}
@@ -190,7 +190,7 @@ void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc)
node = rb_first(&mdsc->quotarealms_inodes);
qri = rb_entry(node, struct ceph_quotarealm_inode, node);
rb_erase(node, &mdsc->quotarealms_inodes);
- iput(qri->inode);
+ ceph_iput_async(qri->inode);
kfree(qri);
}
mutex_unlock(&mdsc->quotarealms_inodes_mutex);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c65f2b202b2b..19f097e79b3c 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -735,7 +735,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
if (!inode)
continue;
spin_unlock(&realm->inodes_with_caps_lock);
- iput(lastinode);
+ ceph_iput_async(lastinode);
lastinode = inode;
/*
@@ -762,7 +762,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
spin_lock(&realm->inodes_with_caps_lock);
}
spin_unlock(&realm->inodes_with_caps_lock);
- iput(lastinode);
+ ceph_iput_async(lastinode);
if (capsnap)
kmem_cache_free(ceph_cap_snap_cachep, capsnap);
@@ -955,7 +955,7 @@ static void flush_snaps(struct ceph_mds_client *mdsc)
ihold(inode);
spin_unlock(&mdsc->snap_flush_lock);
ceph_flush_snaps(ci, &session);
- iput(inode);
+ ceph_iput_async(inode);
spin_lock(&mdsc->snap_flush_lock);
}
spin_unlock(&mdsc->snap_flush_lock);
@@ -1116,12 +1116,12 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
ceph_get_snap_realm(mdsc, realm);
ceph_change_snap_realm(inode, realm);
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
continue;
skip_inode:
spin_unlock(&ci->i_ceph_lock);
- iput(inode);
+ ceph_iput_async(inode);
}
/* we may have taken some of the old realm's children. */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cf176aab0f82..15c09b6c94aa 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1085,6 +1085,13 @@ static inline void ceph_queue_flush_snaps(struct inode *inode)
ceph_queue_inode_work(inode, CEPH_I_WORK_FLUSH_SNAPS);
}
+void ceph_iput_n_async(struct inode *inode, int n);
+
+static inline void ceph_iput_async(struct inode *inode)
+{
+ ceph_iput_n_async(inode, 1);
+}
+
extern int ceph_try_to_choose_auth_mds(struct inode *inode, int mask);
extern int __ceph_do_getattr(struct inode *inode, struct page *locked_page,
int mask, bool force);
--
2.47.3