The patch titled
Subject: nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
Date: Fri, 13 Dec 2024 01:43:28 +0900
When block_invalidatepage was converted to block_invalidate_folio,
folio_invalidate() lost its fallback to block_invalidatepage for the case
where the address_space_operations method invalidatepage (now
invalidate_folio) was not set.
Unfortunately, some pseudo-inodes in nilfs2 keep the empty_aops set by
inode_init_always_gfp() as is, or explicitly set it as their
address_space_operations. With this change, block_invalidatepage() is
therefore no longer called from folio_invalidate(), and as a result the
buffer_head structures attached to these pages/folios are no longer freed
via try_to_free_buffers().
Thus, these buffer heads are now leaked by truncate_inode_pages(), which
cleans up the page cache from inode evict(), etc.
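For reference, after that conversion folio_invalidate() dispatches roughly
like this (abridged sketch of mm/truncate.c, for illustration only); with
empty_aops there is no ->invalidate_folio callback, so the buffer heads
stay attached:

    void folio_invalidate(struct folio *folio, size_t offset, size_t length)
    {
            const struct address_space_operations *aops = folio->mapping->a_ops;

            /* With empty_aops this is NULL and nothing frees the buffers. */
            if (aops->invalidate_folio)
                    aops->invalidate_folio(folio, offset, length);
    }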
Three types of caches use empty_aops: the gc inode caches, the DAT shadow
inode used by GC, and the b-tree node caches. Of these, the b-tree node
caches explicitly call invalidate_mapping_pages() during cleanup, which in
turn calls try_to_free_buffers(), so the leak was not visible during
normal operation, but it worsened when GC was performed.
Fix this issue by using an address_space_operations with invalidate_folio
set to block_invalidate_folio instead of empty_aops, which ensures the
same behavior as before.
Link: https://lkml.kernel.org/r/20241212164556.21338-1-konishi.ryusuke@gmail.com
Fixes: 7ba13abbd31e ("fs: Turn block_invalidatepage into block_invalidate_folio")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/btnode.c | 1 +
fs/nilfs2/gcinode.c | 2 +-
fs/nilfs2/inode.c | 5 +++++
fs/nilfs2/nilfs.h | 1 +
4 files changed, 8 insertions(+), 1 deletion(-)
--- a/fs/nilfs2/btnode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/btnode.c
@@ -35,6 +35,7 @@ void nilfs_init_btnc_inode(struct inode
ii->i_flags = 0;
memset(&ii->i_bmap_data, 0, sizeof(struct nilfs_bmap));
mapping_set_gfp_mask(btnc_inode->i_mapping, GFP_NOFS);
+ btnc_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
}
void nilfs_btnode_cache_clear(struct address_space *btnc)
--- a/fs/nilfs2/gcinode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/gcinode.c
@@ -163,7 +163,7 @@ int nilfs_init_gcinode(struct inode *ino
inode->i_mode = S_IFREG;
mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
- inode->i_mapping->a_ops = &empty_aops;
+ inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
ii->i_flags = 0;
nilfs_bmap_init_gc(ii->i_bmap);
--- a/fs/nilfs2/inode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/inode.c
@@ -276,6 +276,10 @@ const struct address_space_operations ni
.is_partially_uptodate = block_is_partially_uptodate,
};
+const struct address_space_operations nilfs_buffer_cache_aops = {
+ .invalidate_folio = block_invalidate_folio,
+};
+
static int nilfs_insert_inode_locked(struct inode *inode,
struct nilfs_root *root,
unsigned long ino)
@@ -681,6 +685,7 @@ struct inode *nilfs_iget_for_shadow(stru
NILFS_I(s_inode)->i_flags = 0;
memset(NILFS_I(s_inode)->i_bmap, 0, sizeof(struct nilfs_bmap));
mapping_set_gfp_mask(s_inode->i_mapping, GFP_NOFS);
+ s_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
err = nilfs_attach_btree_node_cache(s_inode);
if (unlikely(err)) {
--- a/fs/nilfs2/nilfs.h~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/nilfs.h
@@ -401,6 +401,7 @@ extern const struct file_operations nilf
extern const struct inode_operations nilfs_file_inode_operations;
extern const struct file_operations nilfs_file_operations;
extern const struct address_space_operations nilfs_aops;
+extern const struct address_space_operations nilfs_buffer_cache_aops;
extern const struct inode_operations nilfs_dir_inode_operations;
extern const struct inode_operations nilfs_special_inode_operations;
extern const struct inode_operations nilfs_symlink_inode_operations;
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages.patch
The patch titled
Subject: zram: fix panic when using ext4 over zram
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
zram-panic-when-use-ext4-over-zram.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: caiqingfu <caiqingfu(a)ruijie.com.cn>
Subject: zram: fix panic when using ext4 over zram
Date: Fri, 29 Nov 2024 19:57:35 +0800
[ 52.073080 ] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 52.073511 ] Modules linked in:
[ 52.074094 ] CPU: 0 UID: 0 PID: 3825 Comm: a.out Not tainted 6.12.0-07749-g28eb75e178d3-dirty #3
[ 52.074672 ] Hardware name: linux,dummy-virt (DT)
[ 52.075128 ] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 52.075619 ] pc : obj_malloc+0x5c/0x160
[ 52.076402 ] lr : zs_malloc+0x200/0x570
[ 52.076630 ] sp : ffff80008dd335f0
[ 52.076797 ] x29: ffff80008dd335f0 x28: ffff000004104a00 x27: ffff000004dfc400
[ 52.077319 ] x26: 000000000000ca18 x25: ffff00003fcaf0e0 x24: ffff000006925cf0
[ 52.077785 ] x23: 0000000000000c0a x22: ffff0000032ee780 x21: ffff000006925cf0
[ 52.078257 ] x20: 0000000000088000 x19: 0000000000000000 x18: 0000000000fffc18
[ 52.078701 ] x17: 00000000fffffffd x16: 0000000000000803 x15: 00000000fffffffe
[ 52.079203 ] x14: 000000001824429d x13: ffff000006e84000 x12: ffff000006e83fec
[ 52.079711 ] x11: ffff000006e83000 x10: 00000000000002a5 x9 : ffff000006e83ff3
[ 52.080269 ] x8 : 0000000000000001 x7 : 0000000017e80000 x6 : 0000000000017e80
[ 52.080724 ] x5 : 0000000000000003 x4 : ffff00000402a5e8 x3 : 0000000000000066
[ 52.081081 ] x2 : ffff000006925cf0 x1 : ffff00000402a5e8 x0 : ffff000004104a00
[ 52.081595 ] Call trace:
[ 52.081925 ] obj_malloc+0x5c/0x160 (P)
[ 52.082220 ] zs_malloc+0x200/0x570 (L)
[ 52.082504 ] zs_malloc+0x200/0x570
[ 52.082716 ] zram_submit_bio+0x788/0x9e8
[ 52.083017 ] __submit_bio+0x1c4/0x338
[ 52.083343 ] submit_bio_noacct_nocheck+0x128/0x2c0
[ 52.083518 ] submit_bio_noacct+0x1c8/0x308
[ 52.083722 ] submit_bio+0xa8/0x14c
[ 52.083942 ] submit_bh_wbc+0x140/0x1bc
[ 52.084088 ] __block_write_full_folio+0x23c/0x5f0
[ 52.084232 ] block_write_full_folio+0x134/0x21c
[ 52.084524 ] write_cache_pages+0x64/0xd4
[ 52.084778 ] blkdev_writepages+0x50/0x8c
[ 52.085040 ] do_writepages+0x80/0x2b0
[ 52.085292 ] filemap_fdatawrite_wbc+0x6c/0x90
[ 52.085597 ] __filemap_fdatawrite_range+0x64/0x94
[ 52.085900 ] filemap_fdatawrite+0x1c/0x28
[ 52.086158 ] sync_bdevs+0x170/0x17c
[ 52.086374 ] ksys_sync+0x6c/0xb8
[ 52.086597 ] __arm64_sys_sync+0x10/0x20
[ 52.086847 ] invoke_syscall+0x44/0x100
[ 52.087230 ] el0_svc_common.constprop.0+0x40/0xe0
[ 52.087550 ] do_el0_svc+0x1c/0x28
[ 52.087690 ] el0_svc+0x30/0xd0
[ 52.087818 ] el0t_64_sync_handler+0xc8/0xcc
[ 52.088046 ] el0t_64_sync+0x198/0x19c
[ 52.088500 ] Code: 110004a5 6b0500df f9401273 54000160 (f9401664)
[ 52.089097 ] ---[ end trace 0000000000000000 ]---
When using ext4 on zram, the panic shown above occasionally occurs under
high memory usage.
The reason is that when the handle is obtained via the slow path, the page
is re-compressed. If the data in the page has changed in the meantime, the
new compressed length may exceed the previous one, and the write into the
zsmalloc object then overflows, causing the panic. Commenting out the fast
path to force the slow path, combined with heavy filesystem read/write
load, reproduces the problem quickly.
The solution is to re-obtain the handle after re-compression if the length
is different from the previous one.
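A condensed, illustrative sketch of the hazard in zram_write_page()
(simplified pseudo-C, not the exact driver code):

    compress_again:
            comp_len = compress(page);        /* may change between passes */
            if (IS_ERR_VALUE(handle))         /* fast path, atomic context */
                    handle = zs_malloc(pool, comp_len, fast_gfp);
            if (IS_ERR_VALUE(handle)) {
                    handle = zs_malloc(pool, comp_len, slow_gfp); /* may sleep */
                    if (!IS_ERR_VALUE(handle))
                            goto compress_again;  /* page may have changed */
            }
            dst = zs_map_object(pool, handle, ZS_MM_WO);
            memcpy(dst, src, comp_len);       /* overflows if the object was
                                               * sized for the first, smaller
                                               * comp_len */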
Link: https://lkml.kernel.org/r/20241129115735.136033-1-baicaiaichibaicai@gmail.c…
Signed-off-by: caiqingfu <caiqingfu(a)ruijie.com.cn>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
--- a/drivers/block/zram/zram_drv.c~zram-panic-when-use-ext4-over-zram
+++ a/drivers/block/zram/zram_drv.c
@@ -1633,6 +1633,7 @@ static int zram_write_page(struct zram *
unsigned long alloced_pages;
unsigned long handle = -ENOMEM;
unsigned int comp_len = 0;
+ unsigned int last_comp_len = 0;
void *src, *dst, *mem;
struct zcomp_strm *zstrm;
unsigned long element = 0;
@@ -1664,6 +1665,11 @@ compress_again:
if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
+
+ if (last_comp_len && (last_comp_len != comp_len)) {
+ zs_free(zram->mem_pool, handle);
+ handle = (unsigned long)ERR_PTR(-ENOMEM);
+ }
/*
* handle allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
@@ -1692,8 +1698,10 @@ compress_again:
if (IS_ERR_VALUE(handle))
return PTR_ERR((void *)handle);
- if (comp_len != PAGE_SIZE)
+ if (comp_len != PAGE_SIZE) {
+ last_comp_len = comp_len;
goto compress_again;
+ }
/*
* If the page is not compressible, you need to acquire the
* lock and execute the code below. The zcomp_stream_get()
_
Patches currently in -mm which might be from caiqingfu(a)ruijie.com.cn are
zram-panic-when-use-ext4-over-zram.patch
Since 5.16 and prior to 6.13, KVM can't be used with FSDAX guest memory
(PMD pages). To reproduce the issue, reserve guest memory with the
`memmap=` cmdline parameter, then create and mount a filesystem in DAX
mode (tested with both XFS and ext4); see the doc link below. ndctl
command for the test:
ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M
Then pass memory object to qemu like:
-m 8G -object memory-backend-file,id=ram0,size=8G,\
mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \
-numa node,memdev=ram0,cpus=0-1
QEMU fails to run the guest with the error: kvm run failed Bad address
and there are two warnings in dmesg:
WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page() and
WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63)
It looks like, in the past, the assumption was made that the pfn won't
change between faultin_pfn() and release_pfn_clean(), e.g. see
commit 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level").
But the kvm_page_fault structure made pfn part of mutable state, so
release_pfn_clean() can now receive a hugepage-adjusted pfn.
That works for all cases (/dev/shm, hugetlb, devdax) except fsdax.
Apparently, in fsdax mode the faulted-in pfn and the adjusted pfn may
refer to different folios, so we get a get_page/put_page imbalance.
To solve this, preserve the faulted-in pfn in a separate local variable
and pass it to kvm_release_pfn_clean().
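In condensed form (illustrative outline only; the full diff is below):

    kvm_pfn_t orig_pfn;
    ...
    r = kvm_faultin_pfn(vcpu, fault, ...);  /* takes a ref on fault->pfn  */
    orig_pfn = fault->pfn;                  /* snapshot before mapping    */
    ...                                     /* mapping code may hugepage-
                                             * adjust fault->pfn          */
    kvm_release_pfn_clean(orig_pfn);        /* drop the ref we took       */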
The patch was tested for all mentioned guest memory backends with
tdp_mmu={0,1}.
Upstream is not affected, as the problem was solved fundamentally by
commit 8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns")
and the related patch series.
Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html
Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault")
Co-developed-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Reviewed-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
v1 -> v2:
* Instead of new struct field prefer local variable to snapshot faultin pfn
as suggested by Sean Christopherson.
* Tested patch for 6.1 and 6.12
arch/x86/kvm/mmu/mmu.c | 5 ++++-
arch/x86/kvm/mmu/paging_tmpl.h | 5 ++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13134954e24d..d392022dcb89 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4245,6 +4245,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
unsigned long mmu_seq;
+ kvm_pfn_t orig_pfn;
int r;
fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4272,6 +4273,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r != RET_PF_CONTINUE)
return r;
+ orig_pfn = fault->pfn;
+
r = RET_PF_RETRY;
if (is_tdp_mmu_fault)
@@ -4296,7 +4299,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(orig_pfn);
return r;
}
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1f4f5e703f13..685560a45bf6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -790,6 +790,7 @@ FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct guest_walker walker;
+ kvm_pfn_t orig_pfn;
int r;
unsigned long mmu_seq;
bool is_self_change_mapping;
@@ -868,6 +869,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
walker.pte_access &= ~ACC_EXEC_MASK;
}
+ orig_pfn = fault->pfn;
+
r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);
@@ -881,7 +884,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ kvm_release_pfn_clean(orig_pfn);
return r;
}
--
2.34.1
This patch series fixes bugs in the APIs below:
- devm_pci_epc_destroy()
- pci_epf_remove_vepf()
and simplifies the API below:
- pci_epc_get()
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v3:
- Remove stable tag of patch 1/3
- Add one more patch 3/3
- Link to v2: https://lore.kernel.org/all/20241102-pci-epc-core_fix-v2-0-0785f8435be5@qui…
Changes in v2:
- Correct title and commit message for patch 1/2.
- Add one more patch 2/2 to simplify API pci_epc_get().
- Link to v1: https://lore.kernel.org/r/20241020-pci-epc-core_fix-v1-1-3899705e3537@quici…
---
Zijun Hu (3):
PCI: endpoint: Fix that API devm_pci_epc_destroy() fails to destroy the EPC device
PCI: endpoint: Simplify API pci_epc_get() implementation
PCI: endpoint: Fix API pci_epf_add_vepf() returning -EBUSY error
drivers/pci/endpoint/pci-epc-core.c | 23 +++++++----------------
drivers/pci/endpoint/pci-epf-core.c | 1 +
2 files changed, 8 insertions(+), 16 deletions(-)
---
base-commit: 11066801dd4b7c4d75fce65c812723a80c1481ae
change-id: 20241020-pci-epc-core_fix-a92512fa9d19
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
While testing the encoded read feature, the following crash was observed,
and it can be reliably reproduced:
[ 2916.441731] Oops: general protection fault, probably for non-canonical address 0xa3f64e06d5eee2c7: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 2916.441736] CPU: 5 UID: 0 PID: 592 Comm: kworker/u38:4 Kdump: loaded Not tainted 6.13.0-rc1+ #4
[ 2916.441739] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 2916.441740] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[ 2916.441777] RIP: 0010:__wake_up_common+0x29/0xa0
[ 2916.441808] RSP: 0018:ffffaaec0128fd80 EFLAGS: 00010216
[ 2916.441810] RAX: 0000000000000001 RBX: ffff95a6429cf020 RCX: 0000000000000000
[ 2916.441811] RDX: a3f64e06d5eee2c7 RSI: 0000000000000003 RDI: ffff95a6429cf000
^^^^^^^^^^^^^^^^
This comes from `priv->wait.head.next`
[ 2916.441823] Call Trace:
[ 2916.441833] <TASK>
[ 2916.441881] ? __wake_up_common+0x29/0xa0
[ 2916.441883] __wake_up_common_lock+0x37/0x60
[ 2916.441887] btrfs_encoded_read_endio+0x73/0x90 [btrfs] <<< UAF of `priv` object,
[ 2916.441921] btrfs_check_read_bio+0x321/0x500 [btrfs] details below.
[ 2916.441947] process_scheduled_works+0xc1/0x410
[ 2916.441960] worker_thread+0x105/0x240
crash> btrfs_encoded_read_private.wait.head ffff95a6429cf000 # `priv` from RDI ^^
wait.head = {
next = 0xa3f64e06d5eee2c7, # Corrupted as the object was already freed/reused.
prev = 0xffff95a6429cf020 # Stale data still points to itself (`&priv->wait.head`,
} also in RBX ^^), i.e. the list was free.
Possibly, this is easier (or perhaps only) reproducible on a preemptible
kernel; an RT kernel just happened to be built for additional testing
coverage.
Enabling slab debug gives us further related details, mostly confirming
what's expected:
[11:23:07] =============================================================================
[11:23:07] BUG kmalloc-64 (Not tainted): Poison overwritten
[11:23:07] -----------------------------------------------------------------------------
[11:23:07] 0xffff8fc7c5b6b542-0xffff8fc7c5b6b543 @offset=5442. First byte 0x4 instead of 0x6b
^
That makes two bytes into the `priv->wait.lock`
[11:23:07] FIX kmalloc-64: Restoring Poison 0xffff8fc7c5b6b542-0xffff8fc7c5b6b543=0x6b
[11:23:07] Allocated in btrfs_encoded_read_regular_fill_pages+0x5e/0x260 [btrfs] age=4 cpu=0 pid=18295
[11:23:07] __kmalloc_cache_noprof+0x81/0x2a0
[11:23:07] btrfs_encoded_read_regular_fill_pages+0x5e/0x260 [btrfs]
[11:23:07] btrfs_encoded_read_regular+0xee/0x200 [btrfs]
[11:23:07] btrfs_ioctl_encoded_read+0x477/0x600 [btrfs]
[11:23:07] btrfs_ioctl+0xefe/0x2a00 [btrfs]
[11:23:07] __x64_sys_ioctl+0xa3/0xc0
[11:23:07] do_syscall_64+0x74/0x180
[11:23:07] entry_SYSCALL_64_after_hwframe+0x76/0x7e
9121 unsigned long i = 0;
9122 struct btrfs_bio *bbio;
9123 int ret;
9124
* 9125 priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
9126 if (!priv)
9127 return -ENOMEM;
9128
9129 init_waitqueue_head(&priv->wait);
[11:23:07] Freed in btrfs_encoded_read_regular_fill_pages+0x1f9/0x260 [btrfs] age=4 cpu=0 pid=18295
[11:23:07] btrfs_encoded_read_regular_fill_pages+0x1f9/0x260 [btrfs]
[11:23:07] btrfs_encoded_read_regular+0xee/0x200 [btrfs]
[11:23:07] btrfs_ioctl_encoded_read+0x477/0x600 [btrfs]
[11:23:07] btrfs_ioctl+0xefe/0x2a00 [btrfs]
[11:23:07] __x64_sys_ioctl+0xa3/0xc0
[11:23:07] do_syscall_64+0x74/0x180
[11:23:07] entry_SYSCALL_64_after_hwframe+0x76/0x7e
9171 if (atomic_dec_return(&priv->pending) != 0)
9172 io_wait_event(priv->wait, !atomic_read(&priv->pending));
9173 /* See btrfs_encoded_read_endio() for ordering. */
9174 ret = blk_status_to_errno(READ_ONCE(priv->status));
* 9175 kfree(priv);
9176 return ret;
9177 }
9178 }
`priv` was freed here but was then used further. The report comes soon
after, see below. Note that the report is delayed by a few seconds due to
the RCU stall timeout. (It is the same example as the GPF crash above,
except that one was reported right away, without any delay.)
Due to the poison, this time the UAF caused a CPU hard lockup instead of
the GPF exception observed above (reported by the RCU stall check, as this
was a VM):
[11:23:28] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[11:23:28] rcu: 0-...!: (1 GPs behind) idle=48b4/1/0x4000000000000000 softirq=0/0 fqs=5 rcuc=5254 jiffies(starved)
[11:23:28] rcu: (detected by 1, t=5252 jiffies, g=1631241, q=250054 ncpus=8)
[11:23:28] Sending NMI from CPU 1 to CPUs 0:
[11:23:28] NMI backtrace for cpu 0
[11:23:28] CPU: 0 UID: 0 PID: 21445 Comm: kworker/u33:3 Kdump: loaded Tainted: G B 6.13.0-rc1+ #4
[11:23:28] Tainted: [B]=BAD_PAGE
[11:23:28] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[11:23:28] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[11:23:28] RIP: 0010:native_halt+0xa/0x10
[11:23:28] RSP: 0018:ffffb42ec277bc48 EFLAGS: 00000046
[11:23:28] Call Trace:
[11:23:28] <TASK>
[11:23:28] kvm_wait+0x53/0x60
[11:23:28] __pv_queued_spin_lock_slowpath+0x2ea/0x350
[11:23:28] _raw_spin_lock_irq+0x2b/0x40
[11:23:28] rtlock_slowlock_locked+0x1f3/0xce0
[11:23:28] rt_spin_lock+0x7b/0xb0
[11:23:28] __wake_up_common_lock+0x23/0x60
[11:23:28] btrfs_encoded_read_endio+0x73/0x90 [btrfs] <<< UAF of `priv` object.
[11:23:28] btrfs_check_read_bio+0x321/0x500 [btrfs]
[11:23:28] process_scheduled_works+0xc1/0x410
[11:23:28] worker_thread+0x105/0x240
9105 if (priv->uring_ctx) {
9106 int err = blk_status_to_errno(READ_ONCE(priv->status));
9107 btrfs_uring_read_extent_endio(priv->uring_ctx, err);
9108 kfree(priv);
9109 } else {
* 9110 wake_up(&priv->wait); <<< So we know UAF/GPF happens here.
9111 }
9112 }
9113 bio_put(&bbio->bio);
Now, the wait queue here does not really guarantee proper synchronization
between `btrfs_encoded_read_regular_fill_pages()` and
`btrfs_encoded_read_endio()`, which eventually results in various
use-after-free effects like a general protection fault or a CPU hard
lockup. Using a plain wait queue without additional instrumentation on top
of the `pending` counter is simply insufficient in this context. The
reason the wait queue fails here is that the lifespan of the structure is
confined to the `btrfs_encoded_read_regular_fill_pages()` function, and a
plain wait queue cannot be used to synchronize against its own
destruction.
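A simplified interleaving that illustrates the race (hypothetical
schedule, not an actual trace):

    submitter                               endio worker
    ---------                               ------------
                                            atomic_dec_and_test(&priv->pending)
                                                -> pending drops to 0
    io_wait_event(priv->wait, ...)
        sees pending == 0, returns
    kfree(priv)
                                            wake_up(&priv->wait)   <<< UAF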
Fix this by correctly using a completion instead.
Also, since in the sync case the lifespan of the structure is strictly
limited to the `..._fill_pages()` function, there is no need to allocate
it from the slab; the stack can safely be used instead.
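For context, a completion is safe here because complete() finishes with
the object under its internal lock before the waiter can return, so an
on-stack object works; a generic sketch of the pattern (not the btrfs
code):

    DECLARE_COMPLETION_ONSTACK(done);

    /* completing side, e.g. an endio handler: */
    complete(&done);

    /* waiting side: once this returns, the stack frame holding
     * `done` can safely go away */
    wait_for_completion(&done);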
Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable(a)vger.kernel.org # 5.18+
Signed-off-by: Daniel Vacek <neelx(a)suse.com>
---
fs/btrfs/inode.c | 62 ++++++++++++++++++++++++++----------------------
1 file changed, 33 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fa648ab6fe806..61e0fd5c6a15f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9078,7 +9078,7 @@ static ssize_t btrfs_encoded_read_inline(
}
struct btrfs_encoded_read_private {
- wait_queue_head_t wait;
+ struct completion *sync_read;
void *uring_ctx;
atomic_t pending;
blk_status_t status;
@@ -9090,23 +9090,22 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
if (bbio->bio.bi_status) {
/*
- * The memory barrier implied by the atomic_dec_return() here
- * pairs with the memory barrier implied by the
- * atomic_dec_return() or io_wait_event() in
- * btrfs_encoded_read_regular_fill_pages() to ensure that this
- * write is observed before the load of status in
- * btrfs_encoded_read_regular_fill_pages().
+ * The memory barrier implied by the
+ * atomic_dec_and_test() here pairs with the memory
+ * barrier implied by the atomic_dec_and_test() in
+ * btrfs_encoded_read_regular_fill_pages() to ensure
+ * that this write is observed before the load of
+ * status in btrfs_encoded_read_regular_fill_pages().
*/
WRITE_ONCE(priv->status, bbio->bio.bi_status);
}
if (atomic_dec_and_test(&priv->pending)) {
- int err = blk_status_to_errno(READ_ONCE(priv->status));
-
if (priv->uring_ctx) {
+ int err = blk_status_to_errno(READ_ONCE(priv->status));
btrfs_uring_read_extent_endio(priv->uring_ctx, err);
kfree(priv);
} else {
- wake_up(&priv->wait);
+ complete(priv->sync_read);
}
}
bio_put(&bbio->bio);
@@ -9117,16 +9116,21 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
struct page **pages, void *uring_ctx)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct btrfs_encoded_read_private *priv;
+ struct completion sync_read;
+ struct btrfs_encoded_read_private sync_priv, *priv;
unsigned long i = 0;
struct btrfs_bio *bbio;
- int ret;
- priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
- if (!priv)
- return -ENOMEM;
+ if (uring_ctx) {
+ priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
+ if (!priv)
+ return -ENOMEM;
+ } else {
+ priv = &sync_priv;
+ init_completion(&sync_read);
+ priv->sync_read = &sync_read;
+ }
- init_waitqueue_head(&priv->wait);
atomic_set(&priv->pending, 1);
priv->status = 0;
priv->uring_ctx = uring_ctx;
@@ -9158,23 +9162,23 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
atomic_inc(&priv->pending);
btrfs_submit_bbio(bbio, 0);
- if (uring_ctx) {
- if (atomic_dec_return(&priv->pending) == 0) {
- ret = blk_status_to_errno(READ_ONCE(priv->status));
- btrfs_uring_read_extent_endio(uring_ctx, ret);
+ if (atomic_dec_and_test(&priv->pending)) {
+ if (uring_ctx) {
+ int err = blk_status_to_errno(READ_ONCE(priv->status));
+ btrfs_uring_read_extent_endio(uring_ctx, err);
kfree(priv);
- return ret;
+ return err;
+ } else {
+ complete(&sync_read);
}
+ }
+ if (uring_ctx)
return -EIOCBQUEUED;
- } else {
- if (atomic_dec_return(&priv->pending) != 0)
- io_wait_event(priv->wait, !atomic_read(&priv->pending));
- /* See btrfs_encoded_read_endio() for ordering. */
- ret = blk_status_to_errno(READ_ONCE(priv->status));
- kfree(priv);
- return ret;
- }
+
+ wait_for_completion_io(&sync_read);
+ /* See btrfs_encoded_read_endio() for ordering. */
+ return blk_status_to_errno(READ_ONCE(priv->status));
}
ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
--
2.45.2
From: Shu Han <ebpqwerty472123(a)gmail.com>
[ Upstream commit ea7e2d5e49c05e5db1922387b09ca74aa40f46e2 ]
The remap_file_pages syscall handler calls do_mmap() directly, which
skips the LSM security check. If the process has previously called
personality(READ_IMPLIES_EXEC) and remap_file_pages() is then called on RW
pages, the pages are actually remapped to RWX, bypassing a W^X policy
enforced by SELinux.
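This happens because do_mmap() honors the personality flag roughly as
follows (abridged from mm/mmap.c for illustration):

    /* Does the application expect PROT_READ to imply PROT_EXEC? */
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
            if (!(file && path_noexec(&file->f_path)))
                    prot |= PROT_EXEC;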
So we should check prot with the security_mmap_file() LSM hook in the
remap_file_pages syscall handler before do_mmap() is called. Otherwise, an
attacker can potentially bypass a W^X policy enforced by SELinux.
The bypass is similar to CVE-2016-10044, which bypassed the same
protection via AIO; details can be found in [1].
The PoC:
$ cat > test.c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/personality.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
size_t pagesz = sysconf(_SC_PAGE_SIZE);
int mfd = syscall(SYS_memfd_create, "test", 0);
const char *buf = mmap(NULL, 4 * pagesz, PROT_READ | PROT_WRITE,
MAP_SHARED, mfd, 0);
unsigned int old = syscall(SYS_personality, 0xffffffff);
syscall(SYS_personality, READ_IMPLIES_EXEC | old);
syscall(SYS_remap_file_pages, buf, pagesz, 0, 2, 0);
syscall(SYS_personality, old);
// show the RWX page exists even if W^X policy is enforced
int fd = open("/proc/self/maps", O_RDONLY);
unsigned char buf2[1024];
while (1) {
int ret = read(fd, buf2, 1024);
if (ret <= 0) break;
write(1, buf2, ret);
}
close(fd);
}
$ gcc test.c -o test
$ ./test | grep rwx
7f1836c34000-7f1836c35000 rwxs 00002000 00:01 2050 /memfd:test (deleted)
Link: https://project-zero.issues.chromium.org/issues/42452389 [1]
Cc: stable(a)vger.kernel.org
Signed-off-by: Shu Han <ebpqwerty472123(a)gmail.com>
Acked-by: Stephen Smalley <stephen.smalley.work(a)gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul(a)paul-moore.com>
[ Resolve merge conflict in mm/mmap.c. ]
Signed-off-by: Bin Lan <bin.lan.cn(a)windriver.com>
---
mm/mmap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a9933ede542..ebc3583fa612 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3021,8 +3021,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
flags |= MAP_LOCKED;
file = get_file(vma->vm_file);
+ ret = security_mmap_file(vma->vm_file, prot, flags);
+ if (ret)
+ goto out_fput;
ret = do_mmap(vma->vm_file, start, size,
prot, flags, pgoff, &populate, NULL);
+out_fput:
fput(file);
out:
mmap_write_unlock(mm);
--
2.43.0