Christoffer reports that on some implementations, writing to
GICR_ISACTIVER0 (and similar GICD registers) can race badly
with a guest issuing a deactivation of that interrupt via the
system register interface.
There are multiple reasons to this:
- we use an early write-acknoledgement memory type (nGnRE), meaning
that the write may only have made it as far as some interconnect
by the time the store is considered "done"
- the GIC itself is allowed to buffer the write until it decides to
take it into account (as long as it is in finite time)
The effects are that the activation may not have taken effect by the
time we enter the guest, forcing an immediate exit, or that a guest
deactivation occurs before the interrupt is active, doing nothing.
In order to guarantee that the write to the ISACTIVER register has
taken effect, read back from it, forcing the interconnect to propagate
the write, and the GIC to process the write before returning the read.
Reported-by: Christoffer Dall <christoffer.dall(a)arm.com>
Acked-by: Christoffer Dall <christoffer.dall(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
drivers/irqchip/irq-gic-v3.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index ce87205e3e823..8b6159f4cdafa 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -524,6 +524,13 @@ static int gic_irq_set_irqchip_state(struct irq_data *d,
}
gic_poke_irq(d, reg);
+
+ /*
+ * Force read-back to guarantee that the active state has taken
+ * effect, and won't race with a guest-driven deactivation.
+ */
+ if (reg == GICD_ISACTIVER)
+ gic_peek_irq(d, reg);
return 0;
}
--
2.39.2
When FPMR and SME are both present then entering and exiting streaming mode
clears FPMR in the same manner as it clears the V/Z and P registers.
Since entering and exiting streaming mode via ptrace is expected to have
the same effect as doing so via SMSTART/SMSTOP it should clear FPMR too
but this was missed when FPMR support was added. Add the required reset
of FPMR.
Since changing the vector length resets SVCR a SME vector length change
implemented via a write to ZA can trigger an exit of streaming mode and
we need to check when writing to ZA as well.
Fixes: 4035c22ef7d4 ("arm64/ptrace: Expose FPMR via ptrace")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
arch/arm64/kernel/ptrace.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index b756578aeaeea1d3250276734520e3eaae8a671d..f242df53de2992bf8a3fd51710d6653fe82f7779 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -876,6 +876,7 @@ static int sve_set_common(struct task_struct *target,
const void *kbuf, const void __user *ubuf,
enum vec_type type)
{
+ u64 old_svcr = target->thread.svcr;
int ret;
struct user_sve_header header;
unsigned int vq;
@@ -903,8 +904,6 @@ static int sve_set_common(struct task_struct *target,
/* Enter/exit streaming mode */
if (system_supports_sme()) {
- u64 old_svcr = target->thread.svcr;
-
switch (type) {
case ARM64_VEC_SVE:
target->thread.svcr &= ~SVCR_SM_MASK;
@@ -1003,6 +1002,10 @@ static int sve_set_common(struct task_struct *target,
start, end);
out:
+ /* If we entered or exited streaming mode then reset FPMR */
+ if ((target->thread.svcr & SVCR_SM) != (old_svcr & SVCR_SM))
+ target->thread.uw.fpmr = 0;
+
fpsimd_flush_task_state(target);
return ret;
}
@@ -1099,6 +1102,7 @@ static int za_set(struct task_struct *target,
unsigned int pos, unsigned int count,
const void *kbuf, const void __user *ubuf)
{
+ u64 old_svcr = target->thread.svcr;
int ret;
struct user_za_header header;
unsigned int vq;
@@ -1175,6 +1179,10 @@ static int za_set(struct task_struct *target,
target->thread.svcr |= SVCR_ZA_MASK;
out:
+ /* If we entered or exited streaming mode then reset FPMR */
+ if ((target->thread.svcr & SVCR_SM) != (old_svcr & SVCR_SM))
+ target->thread.uw.fpmr = 0;
+
fpsimd_flush_task_state(target);
return ret;
}
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241106-arm64-ptrace-fpmr-sm-45390f592574
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The patch titled
Subject: ocfs2: fix UBSAN warning in ocfs2_verify_volume()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
ocfs2-fix-ubsan-warning-in-ocfs2_verify_volume.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Dmitry Antipov <dmantipov(a)yandex.ru>
Subject: ocfs2: fix UBSAN warning in ocfs2_verify_volume()
Date: Wed, 6 Nov 2024 12:21:00 +0300
Syzbot has reported the following splat triggered by UBSAN:
UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10
shift exponent 32768 is too large for 32-bit type 'int'
CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x241/0x360
? __pfx_dump_stack_lvl+0x10/0x10
? __pfx__printk+0x10/0x10
? __asan_memset+0x23/0x50
? lockdep_init_map_type+0xa1/0x910
__ubsan_handle_shift_out_of_bounds+0x3c8/0x420
ocfs2_fill_super+0xf9c/0x5750
? __pfx_ocfs2_fill_super+0x10/0x10
? __pfx_validate_chain+0x10/0x10
? __pfx_validate_chain+0x10/0x10
? validate_chain+0x11e/0x5920
? __lock_acquire+0x1384/0x2050
? __pfx_validate_chain+0x10/0x10
? string+0x26a/0x2b0
? widen_string+0x3a/0x310
? string+0x26a/0x2b0
? bdev_name+0x2b1/0x3c0
? pointer+0x703/0x1210
? __pfx_pointer+0x10/0x10
? __pfx_format_decode+0x10/0x10
? __lock_acquire+0x1384/0x2050
? vsnprintf+0x1ccd/0x1da0
? snprintf+0xda/0x120
? __pfx_lock_release+0x10/0x10
? do_raw_spin_lock+0x14f/0x370
? __pfx_snprintf+0x10/0x10
? set_blocksize+0x1f9/0x360
? sb_set_blocksize+0x98/0xf0
? setup_bdev_super+0x4e6/0x5d0
mount_bdev+0x20c/0x2d0
? __pfx_ocfs2_fill_super+0x10/0x10
? __pfx_mount_bdev+0x10/0x10
? vfs_parse_fs_string+0x190/0x230
? __pfx_vfs_parse_fs_string+0x10/0x10
legacy_get_tree+0xf0/0x190
? __pfx_ocfs2_mount+0x10/0x10
vfs_get_tree+0x92/0x2b0
do_new_mount+0x2be/0xb40
? __pfx_do_new_mount+0x10/0x10
__se_sys_mount+0x2d6/0x3c0
? __pfx___se_sys_mount+0x10/0x10
? do_syscall_64+0x100/0x230
? __x64_sys_mount+0x20/0xc0
do_syscall_64+0xf3/0x230
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f37cae96fda
Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda
RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240
RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000
R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0
R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000
</TASK>
For a really damaged superblock, the value of 'i_super.s_blocksize_bits'
may exceed the maximum possible shift for an underlying 'int'. So add an
extra check whether the aforementioned field represents the valid block
size, which is 512 bytes, 1K, 2K, or 4K.
Link: https://lkml.kernel.org/r/20241106092100.2661330-1-dmantipov@yandex.ru
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Dmitry Antipov <dmantipov(a)yandex.ru>
Reported-by: syzbot+56f7cd1abe4b8e475180(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/super.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
--- a/fs/ocfs2/super.c~ocfs2-fix-ubsan-warning-in-ocfs2_verify_volume
+++ a/fs/ocfs2/super.c
@@ -2319,6 +2319,7 @@ static int ocfs2_verify_volume(struct oc
struct ocfs2_blockcheck_stats *stats)
{
int status = -EAGAIN;
+ u32 blksz_bits;
if (memcmp(di->i_signature, OCFS2_SUPER_BLOCK_SIGNATURE,
strlen(OCFS2_SUPER_BLOCK_SIGNATURE)) == 0) {
@@ -2333,11 +2334,15 @@ static int ocfs2_verify_volume(struct oc
goto out;
}
status = -EINVAL;
- if ((1 << le32_to_cpu(di->id2.i_super.s_blocksize_bits)) != blksz) {
+ /* Acceptable block sizes are 512 bytes, 1K, 2K and 4K. */
+ blksz_bits = le32_to_cpu(di->id2.i_super.s_blocksize_bits);
+ if (blksz_bits < 9 || blksz_bits > 12) {
mlog(ML_ERROR, "found superblock with incorrect block "
- "size: found %u, should be %u\n",
- 1 << le32_to_cpu(di->id2.i_super.s_blocksize_bits),
- blksz);
+ "size bits: found %u, should be 9, 10, 11, or 12\n",
+ blksz_bits);
+ } else if ((1 << le32_to_cpu(blksz_bits)) != blksz) {
+ mlog(ML_ERROR, "found superblock with incorrect block "
+ "size: found %u, should be %u\n", 1 << blksz_bits, blksz);
} else if (le16_to_cpu(di->id2.i_super.s_major_rev_level) !=
OCFS2_MAJOR_REV_LEVEL ||
le16_to_cpu(di->id2.i_super.s_minor_rev_level) !=
_
Patches currently in -mm which might be from dmantipov(a)yandex.ru are
ocfs2-fix-ubsan-warning-in-ocfs2_verify_volume.patch
ocfs2-fix-uninitialized-value-in-ocfs2_file_read_iter.patch
The patch titled
Subject: nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
Date: Thu, 7 Nov 2024 01:07:33 +0900
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.
This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.
In the current implementation, nilfs_grab_buffer(), which grabs a buffer
to read (or create) a block of metadata, including b-tree node blocks,
does not set the block device, but instead does so only if the buffer is
not in the "uptodate" state for each of its caller block reading
functions. However, if the uptodate flag is set on a folio/page, and the
buffer heads are detached from it by try_to_free_buffers(), and new buffer
heads are then attached by create_empty_buffers(), the uptodate flag may
be restored to each buffer without the block device being set to
bh->b_bdev, and mark_buffer_dirty() may be called later in that state,
resulting in the bug mentioned above.
Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.
Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Ubisectech Sirius <bugreport(a)valiantsec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/btnode.c | 2 --
fs/nilfs2/gcinode.c | 4 +---
fs/nilfs2/mdt.c | 1 -
fs/nilfs2/page.c | 1 +
4 files changed, 2 insertions(+), 6 deletions(-)
--- a/fs/nilfs2/btnode.c~nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint
+++ a/fs/nilfs2/btnode.c
@@ -68,7 +68,6 @@ nilfs_btnode_create_block(struct address
goto failed;
}
memset(bh->b_data, 0, i_blocksize(inode));
- bh->b_bdev = inode->i_sb->s_bdev;
bh->b_blocknr = blocknr;
set_buffer_mapped(bh);
set_buffer_uptodate(bh);
@@ -133,7 +132,6 @@ int nilfs_btnode_submit_block(struct add
goto found;
}
set_buffer_mapped(bh);
- bh->b_bdev = inode->i_sb->s_bdev;
bh->b_blocknr = pblocknr; /* set block address for read */
bh->b_end_io = end_buffer_read_sync;
get_bh(bh);
--- a/fs/nilfs2/gcinode.c~nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint
+++ a/fs/nilfs2/gcinode.c
@@ -83,10 +83,8 @@ int nilfs_gccache_submit_read_data(struc
goto out;
}
- if (!buffer_mapped(bh)) {
- bh->b_bdev = inode->i_sb->s_bdev;
+ if (!buffer_mapped(bh))
set_buffer_mapped(bh);
- }
bh->b_blocknr = pbn;
bh->b_end_io = end_buffer_read_sync;
get_bh(bh);
--- a/fs/nilfs2/mdt.c~nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint
+++ a/fs/nilfs2/mdt.c
@@ -89,7 +89,6 @@ static int nilfs_mdt_create_block(struct
if (buffer_uptodate(bh))
goto failed_bh;
- bh->b_bdev = sb->s_bdev;
err = nilfs_mdt_insert_new_block(inode, block, bh, init_block);
if (likely(!err)) {
get_bh(bh);
--- a/fs/nilfs2/page.c~nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint
+++ a/fs/nilfs2/page.c
@@ -63,6 +63,7 @@ struct buffer_head *nilfs_grab_buffer(st
folio_put(folio);
return NULL;
}
+ bh->b_bdev = inode->i_sb->s_bdev;
return bh;
}
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-null-ptr-deref-in-block_touch_buffer-tracepoint.patch
nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint.patch
The patch titled
Subject: nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
nilfs2-fix-null-ptr-deref-in-block_touch_buffer-tracepoint.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
Date: Thu, 7 Nov 2024 01:07:32 +0900
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints".
This series fixes null pointer dereference bugs that occur when using
nilfs2 and two block-related tracepoints.
This patch (of 2):
It has been reported that when using "block:block_touch_buffer"
tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a
NULL pointer dereference, or a general protection fault when KASAN is
enabled.
This happens because since the tracepoint was added in touch_buffer(), it
references the dev_t member bh->b_bdev->bd_dev regardless of whether the
buffer head has a pointer to a block_device structure. In the current
implementation, the block_device structure is set after the function
returns to the caller.
Here, touch_buffer() is used to mark the folio/page that owns the buffer
head as accessed, but the common search helper for folio/page used by the
caller function was optimized to mark the folio/page as accessed when it
was reimplemented a long time ago, eliminating the need to call
touch_buffer() here in the first place.
So this solves the issue by eliminating the touch_buffer() call itself.
Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: Ubisectech Sirius <bugreport(a)valiantsec.com>
Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@va…
Reported-by: syzbot+9982fb8d18eba905abe2(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2
Tested-by: syzbot+9982fb8d18eba905abe2(a)syzkaller.appspotmail.com
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/page.c | 1 -
1 file changed, 1 deletion(-)
--- a/fs/nilfs2/page.c~nilfs2-fix-null-ptr-deref-in-block_touch_buffer-tracepoint
+++ a/fs/nilfs2/page.c
@@ -39,7 +39,6 @@ static struct buffer_head *__nilfs_get_f
first_block = (unsigned long)index << (PAGE_SHIFT - blkbits);
bh = get_nth_bh(bh, block - first_block);
- touch_buffer(bh);
wait_on_buffer(bh);
return bh;
}
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
nilfs2-fix-null-ptr-deref-in-block_touch_buffer-tracepoint.patch
nilfs2-fix-null-ptr-deref-in-block_dirty_buffer-tracepoint.patch
The patch titled
Subject: mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-page_alloc-move-mlocked-flag-clearance-into-free_pages_prepare.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Roman Gushchin <roman.gushchin(a)linux.dev>
Subject: mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
Date: Wed, 6 Nov 2024 19:53:54 +0000
Syzbot reported a bad page state problem caused by a page being freed
using free_page() still having a mlocked flag at free_pages_prepare()
stage:
BUG: Bad page state in process syz.5.504 pfn:61f45
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45
flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
prep_new_page mm/page_alloc.c:1545 [inline]
get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457
__alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733
alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline]
kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530
__do_compat_sys_ioctl fs/ioctl.c:1007 [inline]
__se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
page last free pid 8399 tgid 8399 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1108 [inline]
free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686
folios_put_refs+0x76c/0x860 mm/swap.c:1007
free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335
__tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373
tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465
exit_mmap+0x496/0xc40 mm/mmap.c:1926
__mmput+0x115/0x390 kernel/fork.c:1348
exit_mm+0x220/0x310 kernel/exit.c:571
do_exit+0x9b2/0x28e0 kernel/exit.c:926
do_group_exit+0x207/0x2c0 kernel/exit.c:1088
__do_sys_exit_group kernel/exit.c:1099 [inline]
__se_sys_exit_group kernel/exit.c:1097 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Modules linked in:
CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
bad_page+0x176/0x1d0 mm/page_alloc.c:501
free_page_is_bad mm/page_alloc.c:918 [inline]
free_pages_prepare mm/page_alloc.c:1100 [inline]
free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638
kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline]
kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386
kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143
__fput+0x23f/0x880 fs/file_table.c:431
task_work_run+0x24f/0x310 kernel/task_work.c:239
exit_task_work include/linux/task_work.h:43 [inline]
do_exit+0xa2f/0x28e0 kernel/exit.c:939
do_group_exit+0x207/0x2c0 kernel/exit.c:1088
__do_sys_exit_group kernel/exit.c:1099 [inline]
__se_sys_exit_group kernel/exit.c:1097 [inline]
__ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253
do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
__do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
RIP: 0023:0xf745d579
Code: Unable to access opcode bytes at 0xf745d54f.
RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4
RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The problem was originally introduced by commit b109b87050df ("mm/munlock:
replace clear_page_mlock() by final clearance"): it was focused on
handling pagecache and anonymous memory and wasn't suitable for lower
level get_page()/free_page() API's used for example by KVM, as with this
reproducer.
Fix it by moving the mlocked flag clearance down to free_page_prepare().
The bug itself if fairly old and harmless (aside from generating these
warnings), aside from a small memory leak - "bad" pages are stopped from
being allocated again.
Link: https://lkml.kernel.org/r/20241106195354.270757-1-roman.gushchin@linux.dev
Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
Signed-off-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Reported-by: syzbot+e985d3026c4fd041578e(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6729f475.050a0220.701a.0019.GAE@google.com
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 15 +++++++++++++++
mm/swap.c | 14 --------------
2 files changed, 15 insertions(+), 14 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-move-mlocked-flag-clearance-into-free_pages_prepare
+++ a/mm/page_alloc.c
@@ -1048,6 +1048,7 @@ __always_inline bool free_pages_prepare(
bool skip_kasan_poison = should_skip_kasan_poison(page);
bool init = want_init_on_free();
bool compound = PageCompound(page);
+ struct folio *folio = page_folio(page);
VM_BUG_ON_PAGE(PageTail(page), page);
@@ -1057,6 +1058,20 @@ __always_inline bool free_pages_prepare(
if (memcg_kmem_online() && PageMemcgKmem(page))
__memcg_kmem_uncharge_page(page, order);
+ /*
+ * In rare cases, when truncation or holepunching raced with
+ * munlock after VM_LOCKED was cleared, Mlocked may still be
+ * found set here. This does not indicate a problem, unless
+ * "unevictable_pgs_cleared" appears worryingly large.
+ */
+ if (unlikely(folio_test_mlocked(folio))) {
+ long nr_pages = folio_nr_pages(folio);
+
+ __folio_clear_mlocked(folio);
+ zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
+ count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
+ }
+
if (unlikely(PageHWPoison(page)) && !order) {
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
--- a/mm/swap.c~mm-page_alloc-move-mlocked-flag-clearance-into-free_pages_prepare
+++ a/mm/swap.c
@@ -78,20 +78,6 @@ static void __page_cache_release(struct
lruvec_del_folio(*lruvecp, folio);
__folio_clear_lru_flags(folio);
}
-
- /*
- * In rare cases, when truncation or holepunching raced with
- * munlock after VM_LOCKED was cleared, Mlocked may still be
- * found set here. This does not indicate a problem, unless
- * "unevictable_pgs_cleared" appears worryingly large.
- */
- if (unlikely(folio_test_mlocked(folio))) {
- long nr_pages = folio_nr_pages(folio);
-
- __folio_clear_mlocked(folio);
- zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
- count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
- }
}
/*
_
Patches currently in -mm which might be from roman.gushchin(a)linux.dev are
signal-restore-the-override_rlimit-logic.patch
mm-page_alloc-move-mlocked-flag-clearance-into-free_pages_prepare.patch
We intend that signal handlers are entered with PSTATE.{SM,ZA}={0,0}.
The logic for this in setup_return() manipulates the saved state and
live CPU state in an unsafe manner, and consequently, when a task enters
a signal handler:
* The task entering the signal handler might not have its PSTATE.{SM,ZA}
bits cleared, and other register state that is affected by changes to
PSTATE.{SM,ZA} might not be zeroed as expected.
* An unrelated task might have its PSTATE.{SM,ZA} bits cleared
unexpectedly, potentially zeroing other register state that is
affected by changes to PSTATE.{SM,ZA}.
Tasks which do not set PSTATE.{SM,ZA} (i.e. those only using plain
FPSIMD or non-streaming SVE) are not affected, as there is no
resulting change to PSTATE.{SM,ZA}.
Consider for example two tasks on one CPU:
A: Begins signal entry in kernel mode, is preempted prior to SMSTOP.
B: Using SM and/or ZA in userspace with register state current on the
CPU, is preempted.
A: Scheduled in, no register state changes made as in kernel mode.
A: Executes SMSTOP, modifying live register state.
A: Scheduled out.
B: Scheduled in, fpsimd_thread_switch() sees the register state on the
CPU is tracked as being that for task B so the state is not reloaded
prior to returning to userspace.
Task B is now running with SM and ZA incorrectly cleared.
Fix this by:
* Checking TIF_FOREIGN_FPSTATE, and only updating the saved or live
state as appropriate.
* Using {get,put}_cpu_fpsimd_context() to ensure mutual exclusion
against other code which manipulates this state. To allow their use,
the logic is moved into a new fpsimd_enter_sighandler() helper in
fpsimd.c.
This race has been observed intermittently with fp-stress, especially
with preempt disabled, commonly but not exclusively reporting "Bad SVCR: 0".
While we're at it also fix a discrepancy between in register and in memory
entries. When operating on the register state we issue a SMSTOP, exiting
streaming mode if we were in it. This clears the V/Z and P register and
FPMR but nothing else. The in memory version clears all the user FPSIMD
state including FPCR and FPSR but does not clear FPMR. Add the clear of
FPMR and limit the existing memset() to only cover the vregs, preserving
the state of FPCR and FPSR like SMSTOP does.
Fixes: 40a8e87bb3285 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
Changes in v3:
- Fix the in memory update to follow the behaviour of the SMSTOP we
issue when updating in registers.
- Link to v2: https://lore.kernel.org/r/20241030-arm64-fp-sme-sigentry-v2-1-43ce805d1b20@…
Changes in v2:
- Commit message tweaks.
- Flush the task state when updating in memory to ensure we reload.
- Link to v1: https://lore.kernel.org/r/20241023-arm64-fp-sme-sigentry-v1-1-249ff7ec3ad0@…
---
arch/arm64/include/asm/fpsimd.h | 1 +
arch/arm64/kernel/fpsimd.c | 39 +++++++++++++++++++++++++++++++++++++++
arch/arm64/kernel/signal.c | 19 +------------------
3 files changed, 41 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index f2a84efc361858d4deda99faf1967cc7cac386c1..09af7cfd9f6c2cec26332caa4c254976e117b1bf 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -76,6 +76,7 @@ extern void fpsimd_load_state(struct user_fpsimd_state *state);
extern void fpsimd_thread_switch(struct task_struct *next);
extern void fpsimd_flush_thread(void);
+extern void fpsimd_enter_sighandler(void);
extern void fpsimd_signal_preserve_current_state(void);
extern void fpsimd_preserve_current_state(void);
extern void fpsimd_restore_current_state(void);
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 77006df20a75aee7c991cf116b6d06bfe953d1a4..10c8efd1c5ce83f4ea4025b213111ab9519263b1 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1693,6 +1693,45 @@ void fpsimd_signal_preserve_current_state(void)
sve_to_fpsimd(current);
}
+/*
+ * Called by the signal handling code when preparing current to enter
+ * a signal handler. Currently this only needs to take care of exiting
+ * streaming mode and clearing ZA on SME systems.
+ */
+void fpsimd_enter_sighandler(void)
+{
+ if (!system_supports_sme())
+ return;
+
+ get_cpu_fpsimd_context();
+
+ if (test_thread_flag(TIF_FOREIGN_FPSTATE)) {
+ /*
+ * Exiting streaming mode zeros the V/Z and P
+ * registers and FPMR. Zero FPMR and the V registers,
+ * marking the state as FPSIMD only to force a clear
+ * of the remaining bits during reload if needed.
+ */
+ if (current->thread.svcr & SVCR_SM_MASK) {
+ memset(¤t->thread.uw.fpsimd_state.vregs, 0,
+ sizeof(current->thread.uw.fpsimd_state.vregs));
+ current->thread.uw.fpmr = 0;
+ current->thread.fp_type = FP_STATE_FPSIMD;
+ }
+
+ current->thread.svcr &= ~(SVCR_ZA_MASK |
+ SVCR_SM_MASK);
+
+ /* Ensure any copies on other CPUs aren't reused */
+ fpsimd_flush_task_state(current);
+ } else {
+ /* The register state is current, just update it. */
+ sme_smstop();
+ }
+
+ put_cpu_fpsimd_context();
+}
+
/*
* Called by KVM when entering the guest.
*/
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 5619869475304776fc005fe24a385bf86bfdd253..fe07d0bd9f7978d73973f07ce38b7bdd7914abb2 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1218,24 +1218,7 @@ static void setup_return(struct pt_regs *regs, struct k_sigaction *ka,
/* TCO (Tag Check Override) always cleared for signal handlers */
regs->pstate &= ~PSR_TCO_BIT;
- /* Signal handlers are invoked with ZA and streaming mode disabled */
- if (system_supports_sme()) {
- /*
- * If we were in streaming mode the saved register
- * state was SVE but we will exit SM and use the
- * FPSIMD register state - flush the saved FPSIMD
- * register state in case it gets loaded.
- */
- if (current->thread.svcr & SVCR_SM_MASK) {
- memset(¤t->thread.uw.fpsimd_state, 0,
- sizeof(current->thread.uw.fpsimd_state));
- current->thread.fp_type = FP_STATE_FPSIMD;
- }
-
- current->thread.svcr &= ~(SVCR_ZA_MASK |
- SVCR_SM_MASK);
- sme_smstop();
- }
+ fpsimd_enter_sighandler();
if (system_supports_poe())
write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241023-arm64-fp-sme-sigentry-a2bd7187e71b
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Eric Dumazet <edumazet(a)google.com>
commit ac888d58869bb99753e7652be19a151df9ecb35d upstream.
dst_entries_add() uses per-cpu data that might be freed at netns
dismantle from ip6_route_net_exit() calling dst_entries_destroy()
Before ip6_route_net_exit() can be called, we release all
the dsts associated with this netns, via calls to dst_release(),
which waits an rcu grace period before calling dst_destroy()
dst_entries_add() use in dst_destroy() is racy, because
dst_entries_destroy() could have been called already.
Decrementing the number of dsts must happen sooner.
Notes:
1) in CONFIG_XFRM case, dst_destroy() can call
dst_release_immediate(child), this might also cause UAF
if the child does not have DST_NOCOUNT set.
IPSEC maintainers might take a look and see how to address this.
2) There is also discussion about removing this count of dst,
which might happen in future kernels.
Fixes: f88649721268 ("ipv4: fix dst race in sk_dst_get()")
Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnk…
Reported-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Cc: Xin Long <lucien.xin(a)gmail.com>
Cc: Steffen Klassert <steffen.klassert(a)secunet.com>
Reviewed-by: Xin Long <lucien.xin(a)gmail.com>
Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
[Conflict due to
bc9d3a9f2afc ("net: dst: Switch to rcuref_t reference counting")
is not in the tree]
Signed-off-by: Abdelkareem Abdelsaamad <kareemem(a)amazon.com>
---
net/core/dst.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 453ec8aafc4a..5bb143857336 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -109,9 +109,6 @@ struct dst_entry *dst_destroy(struct dst_entry * dst)
child = xdst->child;
}
#endif
- if (!(dst->flags & DST_NOCOUNT))
- dst_entries_add(dst->ops, -1);
-
if (dst->ops->destroy)
dst->ops->destroy(dst);
if (dst->dev)
@@ -162,6 +159,12 @@ void dst_dev_put(struct dst_entry *dst)
}
EXPORT_SYMBOL(dst_dev_put);
+static void dst_count_dec(struct dst_entry *dst)
+{
+ if (!(dst->flags & DST_NOCOUNT))
+ dst_entries_add(dst->ops, -1);
+}
+
void dst_release(struct dst_entry *dst)
{
if (dst) {
@@ -171,8 +174,10 @@ void dst_release(struct dst_entry *dst)
if (WARN_ONCE(newrefcnt < 0, "dst_release underflow"))
net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
__func__, dst, newrefcnt);
- if (!newrefcnt)
+ if (!newrefcnt){
+ dst_count_dec(dst);
call_rcu(&dst->rcu_head, dst_destroy_rcu);
+ }
}
}
EXPORT_SYMBOL(dst_release);
@@ -186,8 +191,10 @@ void dst_release_immediate(struct dst_entry *dst)
if (WARN_ONCE(newrefcnt < 0, "dst_release_immediate underflow"))
net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
__func__, dst, newrefcnt);
- if (!newrefcnt)
+ if (!newrefcnt){
+ dst_count_dec(dst);
dst_destroy(dst);
+ }
}
}
EXPORT_SYMBOL(dst_release_immediate);
--
2.40.1