From: Kairui Song <kasong(a)tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and try to move the found folio to the faulting vma when.
Currently, it relies on the PTE value check to ensure the moved folio
still belongs to the src swap entry, which turns out is not reliable.
While working and reviewing the swap table series with Barry, following
existing race is observed and reproduced [1]:
( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
CPU1 CPU2
userfaultfd_move
move_pages_pte()
entry = pte_to_swp_entry(orig_src_pte);
// Here it got entry = S1
... < Somehow interrupted> ...
<swapin src_pte, alloc and use folio A>
// folio A is just a new allocated folio
// and get installed into src_pte
<frees swap entry S1>
// src_pte now points to folio A, S1
// has swap count == 0, it can be freed
// by folio_swap_swap or swap
// allocator's reclaim.
<try to swap out another folio B>
// folio B is a folio in another VMA.
<put folio B to swap cache using S1 >
// S1 is freed, folio B could use it
// for swap out with no problem.
...
folio = filemap_get_folio(S1)
// Got folio B here !!!
... < Somehow interrupted again> ...
<swapin folio B and free S1>
// Now S1 is free to be used again.
<swapout src_pte & folio A using S1>
// Now src_pte is a swap entry pte
// holding S1 again.
folio_trylock(folio)
move_swap_pte
double_pt_lock
is_pte_pages_stable
// Check passed because src_pte == S1
folio_move_anon_rmap(...)
// Moved invalid folio B here !!!
The race window is very short and requires multiple collisions of
multiple rare events, so it's very unlikely to happen, but with a
deliberately constructed reproducer and increased time window, it can be
reproduced [1].
It's also possible that folio (A) is swapped in, and swapped out again
after the filemap_get_folio lookup, in such case folio (A) may stay in
swap cache so it needs to be moved too. In this case we should also try
again so kernel won't miss a folio move.
Fix this by checking if the folio is the valid swap cache folio after
acquiring the folio lock, and checking the swap cache again after
acquiring the src_pte lock.
SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far
we don't need to worry about that since folios only might get exposed to
swap cache in the swap out path, and it's covered in this patch too by
checking the swap cache again after acquiring src_pte lock.
Testing with a simple C program to allocate and move several GB of memory
did not show any observable performance change.
Cc: <stable(a)vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoS… [1]
Signed-off-by: Kairui Song <kasong(a)tencent.com>
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
value.
mm/userfaultfd.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..5dc05346e360 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
pte_t orig_dst_pte, pte_t orig_src_pte,
pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio *src_folio)
+ struct folio *src_folio,
+ struct swap_info_struct *si, swp_entry_t entry)
{
+ /*
+ * Check if the folio still belongs to the target swap entry after
+ * acquiring the lock. Folio can be freed in the swap cache while
+ * not locked.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
if (src_folio) {
folio_move_anon_rmap(src_folio, dst_vma);
src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else {
+ /*
+ * Check if the swap entry is cached after acquiring the src_pte
+ * lock. Or we might miss a new loaded swap cache folio.
+ */
+ if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
}
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
- dst_ptl, src_ptl, src_folio);
+ dst_ptl, src_ptl, src_folio, si, entry);
}
out:
--
2.49.0
Use the helper screen_info_video_type() to get the framebuffer
type from struct screen_info. Handle supported values in sorted
switch statement.
Reading orig_video_isVGA is unreliable. On most systems it is a
VIDEO_TYPE_ constant. On some systems with VGA it is simply set
to 1 to signal the presence of a VGA output. See vga_probe() for
an example. Retrieving the screen_info type with the helper
screen_info_video_type() detects these cases and returns the
appropriate VIDEO_TYPE_ constant. For VGA, sysfb creates a device
named "vga-framebuffer".
The sysfb code has been taken from vga16fb, where it likely didn't
work correctly either. With this bugfix applied, vga16fb loads for
compatible vga-framebuffer devices.
Fixes: 0db5b61e0dc0 ("fbdev/vga16fb: Create EGA/VGA devices in sysfb code")
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Javier Martinez Canillas <javierm(a)redhat.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Tzung-Bi Shih <tzungbi(a)kernel.org>
Cc: Helge Deller <deller(a)gmx.de>
Cc: "Uwe Kleine-König" <u.kleine-koenig(a)baylibre.com>
Cc: Zsolt Kajtar <soci(a)c64.rulez.org>
Cc: <stable(a)vger.kernel.org> # v6.1+
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
---
drivers/firmware/sysfb.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/drivers/firmware/sysfb.c b/drivers/firmware/sysfb.c
index 7c5c03f274b9..889e5b05c739 100644
--- a/drivers/firmware/sysfb.c
+++ b/drivers/firmware/sysfb.c
@@ -143,6 +143,7 @@ static __init int sysfb_init(void)
{
struct screen_info *si = &screen_info;
struct device *parent;
+ unsigned int type;
struct simplefb_platform_data mode;
const char *name;
bool compatible;
@@ -170,17 +171,26 @@ static __init int sysfb_init(void)
goto put_device;
}
+ type = screen_info_video_type(si);
+
/* if the FB is incompatible, create a legacy framebuffer device */
- if (si->orig_video_isVGA == VIDEO_TYPE_EFI)
- name = "efi-framebuffer";
- else if (si->orig_video_isVGA == VIDEO_TYPE_VLFB)
- name = "vesa-framebuffer";
- else if (si->orig_video_isVGA == VIDEO_TYPE_VGAC)
- name = "vga-framebuffer";
- else if (si->orig_video_isVGA == VIDEO_TYPE_EGAC)
+ switch (type) {
+ case VIDEO_TYPE_EGAC:
name = "ega-framebuffer";
- else
+ break;
+ case VIDEO_TYPE_VGAC:
+ name = "vga-framebuffer";
+ break;
+ case VIDEO_TYPE_VLFB:
+ name = "vesa-framebuffer";
+ break;
+ case VIDEO_TYPE_EFI:
+ name = "efi-framebuffer";
+ break;
+ default:
name = "platform-framebuffer";
+ break;
+ }
pd = platform_device_alloc(name, 0);
if (!pd) {
--
2.49.0
Hi,
The patches fix some errors in the gs101 clock driver as well as a
trivial comment typo in the Exynos E850 clock driver.
Cheers,
Andre
Signed-off-by: André Draszik <andre.draszik(a)linaro.org>
---
André Draszik (3):
clk: samsung: gs101: fix CLK_DOUT_CMU_G3D_BUSD
clk: samsung: gs101: fix alternate mout_hsi0_usb20_ref parent clock
clk: samsung: exynos850: fix a comment
drivers/clk/samsung/clk-exynos850.c | 2 +-
drivers/clk/samsung/clk-gs101.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
---
base-commit: a0bea9e39035edc56a994630e6048c8a191a99d8
change-id: 20250519-samsung-clk-fixes-a4f5bfb54c73
Best regards,
--
André Draszik <andre.draszik(a)linaro.org>
From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org>
Hi,
Jürgen Groß reported some bugs in interaction of ITS mitigation with
execmem [1] when running on a Xen PV guest.
These patches fix the issue by moving all the permissions management of
ITS memory allocated from execmem into ITS code.
I didn't test on a real Xen PV guest, but I emulated !PSE variant by
force-disabling the ROX cache in x86::execmem_arch_setup().
Peter, I took liberty to put your SoB in the patch that actually
implements the execmem permissions management in ITS, please let me know
if I need to update something about the authorship.
The patches are against v6.15.
They are also available in git:
https://web.git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=i…
[1] https://lore.kernel.org/all/20250528123557.12847-2-jgross@suse.com/
Juergen Gross (1):
x86/mm/pat: don't collapse pages without PSE set
Mike Rapoport (Microsoft) (3):
x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set
x86/its: move its_pages array to struct mod_arch_specific
Revert "mm/execmem: Unify early execmem_cache behaviour"
Peter Zijlstra (Intel) (1):
x86/its: explicitly manage permissions for ITS pages
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/module.h | 8 ++++
arch/x86/kernel/alternative.c | 89 ++++++++++++++++++++++++++---------
arch/x86/mm/init_32.c | 3 --
arch/x86/mm/init_64.c | 3 --
arch/x86/mm/pat/set_memory.c | 3 ++
include/linux/execmem.h | 8 +---
include/linux/module.h | 5 --
mm/execmem.c | 40 ++--------------
9 files changed, 82 insertions(+), 79 deletions(-)
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
--
2.47.2
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 8b68e978718f14fdcb080c2a7791c52a0d09bc6d
Gitweb: https://git.kernel.org/tip/8b68e978718f14fdcb080c2a7791c52a0d09bc6d
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Wed, 26 Feb 2025 16:01:57 +01:00
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Tue, 03 Jun 2025 15:56:39 +02:00
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
io_bitmap_exit() is invoked from exit_thread() when a task exists or
when a fork fails. In the latter case the exit_thread() cleans up
resources which were allocated during fork().
io_bitmap_exit() invokes task_update_io_bitmap(), which in turn ends up
in tss_update_io_bitmap(). tss_update_io_bitmap() operates on the
current task. If current has TIF_IO_BITMAP set, but no bitmap installed,
tss_update_io_bitmap() crashes with a NULL pointer dereference.
There are two issues, which lead to that problem:
1) io_bitmap_exit() should not invoke task_update_io_bitmap() when
the task, which is cleaned up, is not the current task. That's a
clear indicator for a cleanup after a failed fork().
2) A task should not have TIF_IO_BITMAP set and neither a bitmap
installed nor IOPL emulation level 3 activated.
This happens when a kernel thread is created in the context of
a user space thread, which has TIF_IO_BITMAP set as the thread
flags are copied and the IO bitmap pointer is cleared.
Other than in the failed fork() case this has no impact because
kernel threads including IO workers never return to user space and
therefore never invoke tss_update_io_bitmap().
Cure this by adding the missing cleanups and checks:
1) Prevent io_bitmap_exit() to invoke task_update_io_bitmap() if
the to be cleaned up task is not the current task.
2) Clear TIF_IO_BITMAP in copy_thread() unconditionally. For user
space forks it is set later, when the IO bitmap is inherited in
io_bitmap_share().
For paranoia sake, add a warning into tss_update_io_bitmap() to catch
the case, when that code is invoked with inconsistent state.
Fixes: ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped")
Reported-by: syzbot+e2b1803445d236442e54(a)syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/87wmdceom2.ffs@tglx
---
arch/x86/kernel/ioport.c | 13 +++++++++----
arch/x86/kernel/process.c | 6 ++++++
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c
index 6290dd1..ff40f09 100644
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -33,8 +33,9 @@ void io_bitmap_share(struct task_struct *tsk)
set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
}
-static void task_update_io_bitmap(struct task_struct *tsk)
+static void task_update_io_bitmap(void)
{
+ struct task_struct *tsk = current;
struct thread_struct *t = &tsk->thread;
if (t->iopl_emul == 3 || t->io_bitmap) {
@@ -54,7 +55,12 @@ void io_bitmap_exit(struct task_struct *tsk)
struct io_bitmap *iobm = tsk->thread.io_bitmap;
tsk->thread.io_bitmap = NULL;
- task_update_io_bitmap(tsk);
+ /*
+ * Don't touch the TSS when invoked on a failed fork(). TSS
+ * reflects the state of @current and not the state of @tsk.
+ */
+ if (tsk == current)
+ task_update_io_bitmap();
if (iobm && refcount_dec_and_test(&iobm->refcnt))
kfree(iobm);
}
@@ -192,8 +198,7 @@ SYSCALL_DEFINE1(iopl, unsigned int, level)
}
t->iopl_emul = level;
- task_update_io_bitmap(current);
-
+ task_update_io_bitmap();
return 0;
}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c1d2dac..704883c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -176,6 +176,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->ret_addr = (unsigned long) ret_from_fork_asm;
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap = NULL;
+ clear_tsk_thread_flag(p, TIF_IO_BITMAP);
p->thread.iopl_warn = 0;
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
@@ -464,6 +465,11 @@ void native_tss_update_io_bitmap(void)
} else {
struct io_bitmap *iobm = t->io_bitmap;
+ if (WARN_ON_ONCE(!iobm)) {
+ clear_thread_flag(TIF_IO_BITMAP);
+ native_tss_invalidate_io_bitmap();
+ }
+
/*
* Only copy bitmap data when the sequence number differs. The
* update time is accounted to the incoming task.
This change makes the tty device file available only after the tty's
backing character device is ready.
Since 6a7e6f78c235975cc14d4e141fa088afffe7062c, the class device is
registered before the cdev is created, and userspace may pick it up,
yet open() will fail because the backing cdev doesn't exist yet.
Userspace is racing the bottom half of tty_register_device_attr() here,
specifically the call to tty_cdev_add().
dev_set_uevent_suppress() was used to work around this, but this fails
on embedded systems that rely on bare devtmpfs rather than udev.
On such systems, the device file is created as part of device_add(),
and userspace can pick it up via inotify, irrespective of uevent
suppression.
So let's undo the existing patch, and create the cdev first, and only
afterwards register the class device in the kernel's device tree.
However, this restores the original race of the cdev existing before the
class device is registered, and an attempt to tty_[k]open() the chardev
between these two steps will lead to tty->dev being assigned NULL by
alloc_tty_struct().
This will be addressed in a second patch.
Fixes: 6a7e6f78c235 ("tty: close race between device register and open")
Signed-off-by: Max Staudt <max(a)enpas.org>
Cc: <stable(a)vger.kernel.org>
---
drivers/tty/tty_io.c | 54 +++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 23 deletions(-)
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index ca9b7d7bad2b..e922b84524d2 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3245,6 +3245,7 @@ struct device *tty_register_device_attr(struct tty_driver *driver,
struct ktermios *tp;
struct device *dev;
int retval;
+ bool cdev_added = false;
if (index >= driver->num) {
pr_err("%s: Attempt to register invalid tty line number (%d)\n",
@@ -3257,24 +3258,6 @@ struct device *tty_register_device_attr(struct tty_driver *driver,
else
tty_line_name(driver, index, name);
- dev = kzalloc(sizeof(*dev), GFP_KERNEL);
- if (!dev)
- return ERR_PTR(-ENOMEM);
-
- dev->devt = devt;
- dev->class = &tty_class;
- dev->parent = device;
- dev->release = tty_device_create_release;
- dev_set_name(dev, "%s", name);
- dev->groups = attr_grp;
- dev_set_drvdata(dev, drvdata);
-
- dev_set_uevent_suppress(dev, 1);
-
- retval = device_register(dev);
- if (retval)
- goto err_put;
-
if (!(driver->flags & TTY_DRIVER_DYNAMIC_ALLOC)) {
/*
* Free any saved termios data so that the termios state is
@@ -3288,19 +3271,44 @@ struct device *tty_register_device_attr(struct tty_driver *driver,
retval = tty_cdev_add(driver, devt, index, 1);
if (retval)
- goto err_del;
+ return ERR_PTR(retval);
+
+ cdev_added = true;
+ }
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev) {
+ retval = -ENOMEM;
+ goto err_del_cdev;
}
- dev_set_uevent_suppress(dev, 0);
- kobject_uevent(&dev->kobj, KOBJ_ADD);
+ dev->devt = devt;
+ dev->class = &tty_class;
+ dev->parent = device;
+ dev->release = tty_device_create_release;
+ dev_set_name(dev, "%s", name);
+ dev->groups = attr_grp;
+ dev_set_drvdata(dev, drvdata);
+
+ retval = device_register(dev);
+ if (retval)
+ goto err_put;
return dev;
-err_del:
- device_del(dev);
err_put:
+ /*
+ * device_register() calls device_add(), after which
+ * we must use put_device() instead of kfree().
+ */
put_device(dev);
+err_del_cdev:
+ if (cdev_added) {
+ cdev_del(driver->cdevs[index]);
+ driver->cdevs[index] = NULL;
+ }
+
return ERR_PTR(retval);
}
EXPORT_SYMBOL_GPL(tty_register_device_attr);
--
2.39.5
fxls8962af_fifo_flush() uses indio_dev->active_scan_mask (with
iio_for_each_active_channel()) without making sure the indio_dev
stays in buffer mode.
There is a race if indio_dev exits buffer mode in the middle of the
interrupt that flushes the fifo. Fix this by calling
synchronize_irq() to ensure that no interrupt is currently running when
disabling buffer mode.
Unable to handle kernel NULL pointer dereference at virtual address 00000000 when read
[...]
_find_first_bit_le from fxls8962af_fifo_flush+0x17c/0x290
fxls8962af_fifo_flush from fxls8962af_interrupt+0x80/0x178
fxls8962af_interrupt from irq_thread_fn+0x1c/0x7c
irq_thread_fn from irq_thread+0x110/0x1f4
irq_thread from kthread+0xe0/0xfc
kthread from ret_from_fork+0x14/0x2c
Fixes: 79e3a5bdd9ef ("iio: accel: fxls8962af: add hw buffered sampling")
Cc: stable(a)vger.kernel.org
Suggested-by: David Lechner <dlechner(a)baylibre.com>
Signed-off-by: Sean Nyekjaer <sean(a)geanix.com>
---
Changes in v2:
- As per David's suggestion; switched to use synchronize_irq() instead.
- Link to v1: https://lore.kernel.org/r/20250524-fxlsrace-v1-1-dec506dc87ae@geanix.com
---
drivers/iio/accel/fxls8962af-core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/iio/accel/fxls8962af-core.c b/drivers/iio/accel/fxls8962af-core.c
index 6d23da3e7aa22c61f2d9348bb91d70cc5719a732..f2558fba491dffa78b26d47d2cd9f1f4d9811f54 100644
--- a/drivers/iio/accel/fxls8962af-core.c
+++ b/drivers/iio/accel/fxls8962af-core.c
@@ -866,6 +866,8 @@ static int fxls8962af_buffer_predisable(struct iio_dev *indio_dev)
if (ret)
return ret;
+ synchronize_irq(data->irq);
+
ret = __fxls8962af_fifo_set_mode(data, false);
if (data->enable_event)
---
base-commit: 5c3fcb36c92443a9a037683626a2e43d8825f783
change-id: 20250524-fxlsrace-f4d20e29fb29
Best regards,
--
Sean Nyekjaer <sean(a)geanix.com>