From: Kairui Song <kasong(a)tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma.
Currently, it relies on the PTE value check to ensure the moved folio
still belongs to the src swap entry, which turns out not to be reliable.
While working on and reviewing the swap table series with Barry, the
following existing race was observed and reproduced [1]:
(move_pages_pte is moving src_pte to dst_pte, where src_pte is a
swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap() or the swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1>
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
    ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < Somehow interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!
The race window is very short and requires multiple collisions of
multiple rare events, so it's very unlikely to happen, but with a
deliberately constructed reproducer and increased time window, it can be
reproduced [1].
It's also possible that folio (A) is swapped in, and swapped out again
after the filemap_get_folio lookup; in such a case folio (A) may stay in
the swap cache, so it needs to be moved too. In this case we should also
retry, so the kernel won't miss a folio move.
Fix this by checking if the folio is the valid swap cache folio after
acquiring the folio lock, and checking the swap cache again after
acquiring the src_pte lock.
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about that, since folios might only get
exposed to the swap cache in the swap out path, and that is covered in
this patch too by checking the swap cache again after acquiring the
src_pte lock.
Testing with a simple C program to allocate and move several GB of memory
did not show any observable performance change.
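For reference, a minimal sketch of such a test (not the author's exact
program; it assumes a 6.8+ kernel with UFFD_FEATURE_MOVE, permission to
create a userfaultfd, and abbreviates error handling): allocate a large
anonymous range, populate it, and move it with UFFDIO_MOVE.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1UL << 30; /* 1 GB per pass; loop for several GB */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_MOVE };

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
                return 1;

        char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(src, 1, len); /* populate src so there are pages to move */

        /* UFFDIO_MOVE resolves missing faults in dst, so dst must be
         * registered in MISSING mode and still be unpopulated. */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)dst, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                return 1;

        struct uffdio_move mv = {
                .dst = (unsigned long)dst,
                .src = (unsigned long)src,
                .len = len,
        };
        if (ioctl(uffd, UFFDIO_MOVE, &mv))
                perror("UFFDIO_MOVE");

        return 0;
}

Timing the UFFDIO_MOVE ioctl before and after the patch is enough to
see whether the extra swap cache checks add measurable overhead.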
Cc: <stable(a)vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoS… [1]
Signed-off-by: Kairui Song <kasong(a)tencent.com>
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads an
  up-to-date value.
mm/userfaultfd.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..5dc05346e360 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
pte_t orig_dst_pte, pte_t orig_src_pte,
pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio *src_folio)
+ struct folio *src_folio,
+ struct swap_info_struct *si, swp_entry_t entry)
{
+ /*
+ * Check if the folio still belongs to the target swap entry after
+ * acquiring the lock. Folio can be freed in the swap cache while
+ * not locked.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
if (src_folio) {
folio_move_anon_rmap(src_folio, dst_vma);
src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else {
+ /*
+ * Check if the swap entry is cached after acquiring the src_pte
+ * lock. Or we might miss a new loaded swap cache folio.
+ */
+ if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
}
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
- dst_ptl, src_ptl, src_folio);
+ dst_ptl, src_ptl, src_folio, si, entry);
}
out:
--
2.49.0
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 8b68e978718f14fdcb080c2a7791c52a0d09bc6d
Gitweb: https://git.kernel.org/tip/8b68e978718f14fdcb080c2a7791c52a0d09bc6d
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Wed, 26 Feb 2025 16:01:57 +01:00
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Tue, 03 Jun 2025 15:56:39 +02:00
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
io_bitmap_exit() is invoked from exit_thread() when a task exits or
when a fork fails. In the latter case exit_thread() cleans up
resources which were allocated during fork().
io_bitmap_exit() invokes task_update_io_bitmap(), which in turn ends up
in tss_update_io_bitmap(). tss_update_io_bitmap() operates on the
current task. If current has TIF_IO_BITMAP set, but no bitmap installed,
tss_update_io_bitmap() crashes with a NULL pointer dereference.
There are two issues, which lead to that problem:
1) io_bitmap_exit() should not invoke task_update_io_bitmap() when
   the task being cleaned up is not the current task. That's a
   clear indicator of a cleanup after a failed fork().
2) A task should not have TIF_IO_BITMAP set while having neither a
   bitmap installed nor IOPL emulation level 3 activated.

   This happens when a kernel thread is created in the context of
   a user space thread with TIF_IO_BITMAP set: the thread flags are
   copied, but the IO bitmap pointer is cleared.

   Other than in the failed fork() case this has no impact because
   kernel threads, including IO workers, never return to user space
   and therefore never invoke tss_update_io_bitmap().
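For illustration, a minimal userspace sketch (hypothetical; needs
CAP_SYS_RAWIO) of the two ways a task ends up with TIF_IO_BITMAP set,
which is the state the cleanups below must keep consistent:

#include <stdio.h>
#include <sys/io.h> /* glibc wrappers for ioperm()/iopl() */

int main(void)
{
        /* ioperm() makes the kernel allocate a per-task IO bitmap
         * (thread.io_bitmap) and set TIF_IO_BITMAP. */
        if (ioperm(0x80, 1, 1))
                perror("ioperm");

        /* iopl(3) instead activates IOPL emulation level 3
         * (thread.iopl_emul == 3), also with TIF_IO_BITMAP set. */
        if (iopl(3))
                perror("iopl");

        return 0;
}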
Cure this by adding the missing cleanups and checks:
1) Prevent io_bitmap_exit() from invoking task_update_io_bitmap() if
   the task being cleaned up is not the current task.
2) Clear TIF_IO_BITMAP in copy_thread() unconditionally. For user
space forks it is set later, when the IO bitmap is inherited in
io_bitmap_share().
For paranoia's sake, add a warning to tss_update_io_bitmap() to catch
the case when that code is invoked with inconsistent state.
Fixes: ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped")
Reported-by: syzbot+e2b1803445d236442e54(a)syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/87wmdceom2.ffs@tglx
---
arch/x86/kernel/ioport.c | 13 +++++++++----
arch/x86/kernel/process.c | 6 ++++++
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c
index 6290dd1..ff40f09 100644
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -33,8 +33,9 @@ void io_bitmap_share(struct task_struct *tsk)
set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
}
-static void task_update_io_bitmap(struct task_struct *tsk)
+static void task_update_io_bitmap(void)
{
+ struct task_struct *tsk = current;
struct thread_struct *t = &tsk->thread;
if (t->iopl_emul == 3 || t->io_bitmap) {
@@ -54,7 +55,12 @@ void io_bitmap_exit(struct task_struct *tsk)
struct io_bitmap *iobm = tsk->thread.io_bitmap;
tsk->thread.io_bitmap = NULL;
- task_update_io_bitmap(tsk);
+ /*
+ * Don't touch the TSS when invoked on a failed fork(). TSS
+ * reflects the state of @current and not the state of @tsk.
+ */
+ if (tsk == current)
+ task_update_io_bitmap();
if (iobm && refcount_dec_and_test(&iobm->refcnt))
kfree(iobm);
}
@@ -192,8 +198,7 @@ SYSCALL_DEFINE1(iopl, unsigned int, level)
}
t->iopl_emul = level;
- task_update_io_bitmap(current);
-
+ task_update_io_bitmap();
return 0;
}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c1d2dac..704883c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -176,6 +176,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->ret_addr = (unsigned long) ret_from_fork_asm;
p->thread.sp = (unsigned long) fork_frame;
p->thread.io_bitmap = NULL;
+ clear_tsk_thread_flag(p, TIF_IO_BITMAP);
p->thread.iopl_warn = 0;
memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
@@ -464,6 +465,11 @@ void native_tss_update_io_bitmap(void)
} else {
struct io_bitmap *iobm = t->io_bitmap;
+ if (WARN_ON_ONCE(!iobm)) {
+ clear_thread_flag(TIF_IO_BITMAP);
+ native_tss_invalidate_io_bitmap();
+ }
+
/*
* Only copy bitmap data when the sequence number differs. The
* update time is accounted to the incoming task.
This change makes the tty device file available only after the tty's
backing character device is ready.
Since 6a7e6f78c235975cc14d4e141fa088afffe7062c, the class device is
registered before the cdev is created, and userspace may pick it up,
yet open() will fail because the backing cdev doesn't exist yet.
Userspace is racing the bottom half of tty_register_device_attr() here,
specifically the call to tty_cdev_add().
dev_set_uevent_suppress() was used to work around this, but this fails
on embedded systems that rely on bare devtmpfs rather than udev.
On such systems, the device file is created as part of device_add(),
and userspace can pick it up via inotify, irrespective of uevent
suppression.
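To illustrate, a hedged sketch of the userspace side of the race (not
taken from an actual report; the ttyUSB name is only an example): an
inotify watch on /dev sees the node as soon as device_add() creates
it, yet open() can still fail while the cdev doesn't exist.

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
        char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
                __attribute__((aligned(__alignof__(struct inotify_event))));
        int in = inotify_init1(IN_CLOEXEC);

        inotify_add_watch(in, "/dev", IN_CREATE);
        for (;;) {
                /* One event per read is enough for a sketch; real
                 * code would walk all events in the buffer. */
                if (read(in, buf, sizeof(buf)) <= 0)
                        break;
                struct inotify_event *ev = (struct inotify_event *)buf;
                if (strncmp(ev->name, "ttyUSB", 6))
                        continue;
                char path[PATH_MAX];
                snprintf(path, sizeof(path), "/dev/%s", ev->name);
                /* Before this patch, this open() raced tty_cdev_add()
                 * and could fail although the node already exists. */
                if (open(path, O_RDWR) < 0)
                        perror(path);
        }
        return 0;
}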
So let's undo the existing workaround: create the cdev first, and only
afterwards register the class device in the kernel's device tree.
However, this restores the original race of the cdev existing before the
class device is registered, and an attempt to tty_[k]open() the chardev
between these two steps will lead to tty->dev being assigned NULL by
alloc_tty_struct().
This will be addressed in a second patch.
Fixes: 6a7e6f78c235 ("tty: close race between device register and open")
Signed-off-by: Max Staudt <max(a)enpas.org>
Cc: <stable(a)vger.kernel.org>
---
drivers/tty/tty_io.c | 54 +++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 23 deletions(-)
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index ca9b7d7bad2b..e922b84524d2 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3245,6 +3245,7 @@ struct device *tty_register_device_attr(struct tty_driver *driver,
struct ktermios *tp;
struct device *dev;
int retval;
+ bool cdev_added = false;
if (index >= driver->num) {
pr_err("%s: Attempt to register invalid tty line number (%d)\n",
@@ -3257,24 +3258,6 @@ struct device *tty_register_device_attr(struct tty_driver *driver,
else
tty_line_name(driver, index, name);
- dev = kzalloc(sizeof(*dev), GFP_KERNEL);
- if (!dev)
- return ERR_PTR(-ENOMEM);
-
- dev->devt = devt;
- dev->class = &tty_class;
- dev->parent = device;
- dev->release = tty_device_create_release;
- dev_set_name(dev, "%s", name);
- dev->groups = attr_grp;
- dev_set_drvdata(dev, drvdata);
-
- dev_set_uevent_suppress(dev, 1);
-
- retval = device_register(dev);
- if (retval)
- goto err_put;
-
if (!(driver->flags & TTY_DRIVER_DYNAMIC_ALLOC)) {
/*
* Free any saved termios data so that the termios state is
@@ -3288,19 +3271,44 @@ struct device *tty_register_device_attr(struct tty_driver *driver,
retval = tty_cdev_add(driver, devt, index, 1);
if (retval)
- goto err_del;
+ return ERR_PTR(retval);
+
+ cdev_added = true;
+ }
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev) {
+ retval = -ENOMEM;
+ goto err_del_cdev;
}
- dev_set_uevent_suppress(dev, 0);
- kobject_uevent(&dev->kobj, KOBJ_ADD);
+ dev->devt = devt;
+ dev->class = &tty_class;
+ dev->parent = device;
+ dev->release = tty_device_create_release;
+ dev_set_name(dev, "%s", name);
+ dev->groups = attr_grp;
+ dev_set_drvdata(dev, drvdata);
+
+ retval = device_register(dev);
+ if (retval)
+ goto err_put;
return dev;
-err_del:
- device_del(dev);
err_put:
+ /*
+ * device_register() calls device_add(), after which
+ * we must use put_device() instead of kfree().
+ */
put_device(dev);
+err_del_cdev:
+ if (cdev_added) {
+ cdev_del(driver->cdevs[index]);
+ driver->cdevs[index] = NULL;
+ }
+
return ERR_PTR(retval);
}
EXPORT_SYMBOL_GPL(tty_register_device_attr);
--
2.39.5
The num_cpu and feature properties are read-only once the eiointc is
created; they are set with the KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL attr
group before device creation.

The KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS attr group exists to update
register and software state for migration and reset purposes; the
num_cpu and feature properties cannot be updated through it once the
device has been created.

So discard write operations on the num_cpu and feature properties in
the KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS attr group.
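For illustration, a hedged sketch of the VMM side (KVM_CREATE_DEVICE
setup omitted; assumes LoongArch uapi headers that provide these
constants):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* eiointc_fd: device fd returned by KVM_CREATE_DEVICE. */
int sw_status_num_cpu(int eiointc_fd)
{
        uint32_t num_cpu = 0;
        struct kvm_device_attr attr = {
                .group = KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS,
                .attr  = KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_NUM_CPU,
                .addr  = (uint64_t)(unsigned long)&num_cpu,
        };

        /* Reading still works, e.g. for migration save: */
        if (ioctl(eiointc_fd, KVM_GET_DEVICE_ATTR, &attr))
                return -1;

        /* With this patch a write through the SW_STATUS group is
         * discarded: num_cpu and feature are fixed once they have
         * been set via the CTRL group. */
        ioctl(eiointc_fd, KVM_SET_DEVICE_ATTR, &attr);
        return 0;
}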
Cc: stable(a)vger.kernel.org
Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo(a)loongson.cn>
---
arch/loongarch/kvm/intc/eiointc.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/loongarch/kvm/intc/eiointc.c b/arch/loongarch/kvm/intc/eiointc.c
index 0b648c56b0c3..b48511f903b5 100644
--- a/arch/loongarch/kvm/intc/eiointc.c
+++ b/arch/loongarch/kvm/intc/eiointc.c
@@ -910,9 +910,22 @@ static int kvm_eiointc_sw_status_access(struct kvm_device *dev,
data = (void __user *)attr->addr;
switch (addr) {
case KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_NUM_CPU:
+ /*
+ * Property num_cpu and feature is read-only once eiointc is
+ * created with KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL group API
+ *
+ * Disable writing with KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS
+ * group API
+ */
+ if (is_write)
+ return ret;
+
p = &s->num_cpu;
break;
case KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_FEATURE:
+ if (is_write)
+ return ret;
+
p = &s->features;
break;
case KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_STATE:
--
2.39.3