From: Kairui Song <kasong(a)tencent.com>
On seeing a swap entry PTE, userfaultfd_move does a lockless swap
cache lookup, and tries to move the found folio to the faulting vma.
Currently, it relies on checking the PTE value to ensure that the moved
folio still belongs to the src swap entry and that no new folio has
been added to the swap cache, which turns out to be unreliable.
While working and reviewing the swap table series with Barry, following
existing races are observed and reproduced [1]:
In the example below, move_pages_pte is moving src_pte to dst_pte,
where src_pte is a swap entry PTE holding swap entry S1, and S1
is not in the swap cache:
CPU1 CPU2
userfaultfd_move
move_pages_pte()
entry = pte_to_swp_entry(orig_src_pte);
// Here it got entry = S1
... < interrupted> ...
<swapin src_pte, alloc and use folio A>
// folio A is a new allocated folio
// and get installed into src_pte
<frees swap entry S1>
// src_pte now points to folio A, S1
// has swap count == 0, it can be freed
// by folio_swap_swap or swap
// allocator's reclaim.
<try to swap out another folio B>
// folio B is a folio in another VMA.
<put folio B to swap cache using S1 >
// S1 is freed, folio B can use it
// for swap out with no problem.
...
folio = filemap_get_folio(S1)
// Got folio B here !!!
... < interrupted again> ...
<swapin folio B and free S1>
// Now S1 is free to be used again.
<swapout src_pte & folio A using S1>
// Now src_pte is a swap entry PTE
// holding S1 again.
folio_trylock(folio)
move_swap_pte
double_pt_lock
is_pte_pages_stable
// Check passed because src_pte == S1
folio_move_anon_rmap(...)
// Moved invalid folio B here !!!
The race window is very short and requires multiple collisions of
multiple rare events, so it's very unlikely to happen, but with a
deliberately constructed reproducer and increased time window, it
can be reproduced easily.
This can be fixed by checking if the folio returned by filemap is the
valid swap cache folio after acquiring the folio lock.
Another similar race is possible: filemap_get_folio may return NULL, but
folio (A) could be swapped in and then swapped out again using the same
swap entry after the lookup. In such a case, folio (A) may remain in the
swap cache, so it must be moved too:
CPU1 CPU2
userfaultfd_move
move_pages_pte()
entry = pte_to_swp_entry(orig_src_pte);
// Here it got entry = S1, and S1 is not in swap cache
folio = filemap_get_folio(S1)
// Got NULL
... < interrupted again> ...
<swapin folio A and free S1>
<swapout folio A re-using S1>
move_swap_pte
double_pt_lock
is_pte_pages_stable
// Check passed because src_pte == S1
folio_move_anon_rmap(...)
// folio A is ignored !!!
Fix this by checking the swap cache again after acquiring the src_pte
lock. And to avoid the filemap overhead, we check swap_map directly [2].
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about that, since folios can only be exposed
to the swap cache in the swap out path, and this is covered in this
patch by checking the swap cache again after acquiring the src_pte lock.
Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.
Cc: <stable(a)vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoS… [1]
Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8… [2]
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reviewed-by: Lokesh Gidra <lokeshgidra(a)google.com>
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads a up to dated
value.
V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
Changes:
- Add more comments and more context in commit message.
mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..8253978ee0fb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
pte_t orig_dst_pte, pte_t orig_src_pte,
pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
- struct folio *src_folio)
+ struct folio *src_folio,
+ struct swap_info_struct *si, swp_entry_t entry)
{
+ /*
+ * Check if the folio still belongs to the target swap entry after
+ * acquiring the lock. Folio can be freed in the swap cache while
+ * not locked.
+ */
+ if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+ entry.val != src_folio->swap.val))
+ return -EAGAIN;
+
double_pt_lock(dst_ptl, src_ptl);
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
if (src_folio) {
folio_move_anon_rmap(src_folio, dst_vma);
src_folio->index = linear_page_index(dst_vma, dst_addr);
+ } else {
+ /*
+ * Check if the swap entry is cached after acquiring the src_pte
+ * lock. Otherwise, we might miss a newly loaded swap cache folio.
+ *
+ * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
+ * We are trying to catch newly added swap cache, the only possible case is
+ * when a folio is swapped in and out again staying in swap cache, using the
+ * same entry before the PTE check above. The PTL is acquired and released
+ * twice, each time after updating the swap_map's flag. So holding
+ * the PTL here ensures we see the updated value. False positive is possible,
+ * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
+ * cache, or during the tiny synchronization window between swap cache and
+ * swap_map, but it will be gone very quickly, worst result is retry jitters.
+ */
+ if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
}
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
}
err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
- dst_ptl, src_ptl, src_folio);
+ dst_ptl, src_ptl, src_folio, si, entry);
}
out:
--
2.49.0
The patch titled
Subject: mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
has been added to the -mm mm-new branch. Its filename is
mm-huge_memory-dont-ignore-queried-cachemode-in-vmf_insert_pfn_pud.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
Date: Wed, 11 Jun 2025 14:06:52 +0200
Patch series "mm/huge_memory: vmf_insert_folio_*() and
vmf_insert_pfn_pud() fixes", v2.
While working on improving vm_normal_page() and friends, I stumbled over
this issues: refcounted "normal" pages must not be marked using
pmd_special() / pud_special().
Fortunately, so far there doesn't seem to be serious damage.
This patch (of 3):
We setup the cache mode but ... don't forward the updated pgprot to
insert_pfn_pud().
Only a problem on x86-64 PAT when mapping PFNs using PUDs that require a
special cachemode.
Fix it by using the proper pgprot where the cachemode was setup.
Identified by code inspection.
Link: https://lkml.kernel.org/r/20250611120654.545963-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250611120654.545963-2-david@redhat.com
Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Mariano Pache <npache(a)redhat.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--- a/mm/huge_memory.c~mm-huge_memory-dont-ignore-queried-cachemode-in-vmf_insert_pfn_pud
+++ a/mm/huge_memory.c
@@ -1516,10 +1516,9 @@ static pud_t maybe_pud_mkwrite(pud_t pud
}
static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, pfn_t pfn, bool write)
+ pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
{
struct mm_struct *mm = vma->vm_mm;
- pgprot_t prot = vma->vm_page_prot;
pud_t entry;
if (!pud_none(*pud)) {
@@ -1581,7 +1580,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_
pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
ptl = pud_lock(vma->vm_mm, vmf->pud);
- insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
+ insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
spin_unlock(ptl);
return VM_FAULT_NOPAGE;
@@ -1625,7 +1624,7 @@ vm_fault_t vmf_insert_folio_pud(struct v
add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
}
insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
- write);
+ vma->vm_page_prot, write);
spin_unlock(ptl);
return VM_FAULT_NOPAGE;
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-gup-revert-mm-gup-fix-infinite-loop-within-__get_longterm_locked.patch
mm-gup-remove-vm_bug_ons.patch
mm-gup-remove-vm_bug_ons-fix.patch
mm-huge_memory-dont-ignore-queried-cachemode-in-vmf_insert_pfn_pud.patch
mm-huge_memory-dont-mark-refcounted-folios-special-in-vmf_insert_folio_pmd.patch
mm-huge_memory-dont-mark-refcounted-folios-special-in-vmf_insert_folio_pud.patch
From: Francesco Dolcini <francesco.dolcini(a)toradex.com>
This reverts commit 4fcfcbe457349267fe048524078e8970807c1a5b.
That commit introduces a regression, when HT40 mode is enabled,
received packets are lost, this was experience with W8997 with both
SDIO-UART and SDIO-SDIO variants. From an initial investigation the
issue solves on its own after some time, but it's not clear what is
the reason. Given that this was just a performance optimization, let's
revert it till we have a better understanding of the issue and a proper
fix.
Cc: Jeff Chen <jeff.chen_1(a)nxp.com>
Cc: stable(a)vger.kernel.org
Fixes: 4fcfcbe45734 ("wifi: mwifiex: Fix HT40 bandwidth issue.")
Closes: https://lore.kernel.org/all/20250603203337.GA109929@francesco-nb/
Signed-off-by: Francesco Dolcini <francesco.dolcini(a)toradex.com>
---
v2: fix reverted commit sha
v1: https://lore.kernel.org/all/20250605100313.34014-1-francesco@dolcini.it/
---
drivers/net/wireless/marvell/mwifiex/11n.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/wireless/marvell/mwifiex/11n.c b/drivers/net/wireless/marvell/mwifiex/11n.c
index 738bafc3749b..66f0f5377ac1 100644
--- a/drivers/net/wireless/marvell/mwifiex/11n.c
+++ b/drivers/net/wireless/marvell/mwifiex/11n.c
@@ -403,14 +403,12 @@ mwifiex_cmd_append_11n_tlv(struct mwifiex_private *priv,
if (sband->ht_cap.cap & IEEE80211_HT_CAP_SUP_WIDTH_20_40 &&
bss_desc->bcn_ht_oper->ht_param &
- IEEE80211_HT_PARAM_CHAN_WIDTH_ANY) {
- chan_list->chan_scan_param[0].radio_type |=
- CHAN_BW_40MHZ << 2;
+ IEEE80211_HT_PARAM_CHAN_WIDTH_ANY)
SET_SECONDARYCHAN(chan_list->chan_scan_param[0].
radio_type,
(bss_desc->bcn_ht_oper->ht_param &
IEEE80211_HT_PARAM_CHA_SEC_OFFSET));
- }
+
*buffer += struct_size(chan_list, chan_scan_param, 1);
ret_len += struct_size(chan_list, chan_scan_param, 1);
}
--
2.39.5
The patch titled
Subject: mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-gup-revert-mm-gup-fix-infinite-loop-within-__get_longterm_locked.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"
Date: Wed, 11 Jun 2025 15:13:14 +0200
After commit 1aaf8c122918 ("mm: gup: fix infinite loop within
__get_longterm_locked") we are able to longterm pin folios that are not
supposed to get longterm pinned, simply because they temporarily have the
LRU flag cleared (esp. temporarily isolated).
For example, two __get_longterm_locked() callers can race, or
__get_longterm_locked() can race with anything else that temporarily
isolates folios.
The introducing commit mentions the use case of a driver that uses
vm_ops->fault to insert pages allocated through cma_alloc() into the page
tables, assuming they can later get longterm pinned. These pages/ folios
would never have the LRU flag set and consequently cannot get isolated.
There is no known in-tree user making use of that so far, fortunately.
To handle that in the future -- and avoid retrying forever to
isolate/migrate them -- we will need a different mechanism for the CMA
area *owner* to indicate that it actually already allocated the page and
is fine with longterm pinning it. The LRU flag is not suitable for that.
Probably we can lookup the relevant CMA area and query the bitmap; we only
have have to care about some races, probably. If already allocated, we
could just allow longterm pinning)
Anyhow, let's fix the "must not be longterm pinned" problem first by
reverting the original commit.
Link: https://lkml.kernel.org/r/20250611131314.594529-1-david@redhat.com
Fixes: 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Closes: https://lore.kernel.org/all/20250522092755.GA3277597@tiffany/
Reported-by: Hyesoo Yu <hyesoo.yu(a)samsung.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
Cc: Aijun Sun <aijun.sun(a)unisoc.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
--- a/mm/gup.c~mm-gup-revert-mm-gup-fix-infinite-loop-within-__get_longterm_locked
+++ a/mm/gup.c
@@ -2303,13 +2303,13 @@ static void pofs_unpin(struct pages_or_f
/*
* Returns the number of collected folios. Return value is always >= 0.
*/
-static void collect_longterm_unpinnable_folios(
+static unsigned long collect_longterm_unpinnable_folios(
struct list_head *movable_folio_list,
struct pages_or_folios *pofs)
{
+ unsigned long i, collected = 0;
struct folio *prev_folio = NULL;
bool drain_allow = true;
- unsigned long i;
for (i = 0; i < pofs->nr_entries; i++) {
struct folio *folio = pofs_get_folio(pofs, i);
@@ -2321,6 +2321,8 @@ static void collect_longterm_unpinnable_
if (folio_is_longterm_pinnable(folio))
continue;
+ collected++;
+
if (folio_is_device_coherent(folio))
continue;
@@ -2342,6 +2344,8 @@ static void collect_longterm_unpinnable_
NR_ISOLATED_ANON + folio_is_file_lru(folio),
folio_nr_pages(folio));
}
+
+ return collected;
}
/*
@@ -2418,9 +2422,11 @@ static long
check_and_migrate_movable_pages_or_folios(struct pages_or_folios *pofs)
{
LIST_HEAD(movable_folio_list);
+ unsigned long collected;
- collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
- if (list_empty(&movable_folio_list))
+ collected = collect_longterm_unpinnable_folios(&movable_folio_list,
+ pofs);
+ if (!collected)
return 0;
return migrate_longterm_unpinnable_folios(&movable_folio_list, pofs);
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-gup-revert-mm-gup-fix-infinite-loop-within-__get_longterm_locked.patch
mm-gup-remove-vm_bug_ons.patch
mm-gup-remove-vm_bug_ons-fix.patch
From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org>
Hi,
Jürgen Groß reported some bugs in interaction of ITS mitigation with
execmem [1] when running on a Xen PV guest.
These patches fix the issue by moving all the permissions management of
ITS memory allocated from execmem into ITS code.
I didn't test on a real Xen PV guest, but I emulated !PSE variant by
force-disabling the ROX cache in x86::execmem_arch_setup().
Peter, I took liberty to put your SoB in the patch that actually
implements the execmem permissions management in ITS, please let me know
if I need to update something about the authorship.
The patches are against v6.15.
They are also available in git:
https://web.git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=i…
[1] https://lore.kernel.org/all/20250528123557.12847-2-jgross@suse.com/
Juergen Gross (1):
x86/mm/pat: don't collapse pages without PSE set
Mike Rapoport (Microsoft) (3):
x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set
x86/its: move its_pages array to struct mod_arch_specific
Revert "mm/execmem: Unify early execmem_cache behaviour"
Peter Zijlstra (Intel) (1):
x86/its: explicitly manage permissions for ITS pages
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/module.h | 8 ++++
arch/x86/kernel/alternative.c | 89 ++++++++++++++++++++++++++---------
arch/x86/mm/init_32.c | 3 --
arch/x86/mm/init_64.c | 3 --
arch/x86/mm/pat/set_memory.c | 3 ++
include/linux/execmem.h | 8 +---
include/linux/module.h | 5 --
mm/execmem.c | 40 ++--------------
9 files changed, 82 insertions(+), 79 deletions(-)
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
--
2.47.2
Number of apqn target list entries contained in 'nr_apqns' variable is
determined by userspace via an ioctl call so the result of the product in
calculation of size passed to memdup_user() may overflow.
In this case the actual size of the allocated area and the value
describing it won't be in sync leading to various types of unpredictable
behaviour later.
Return an error if an overflow is detected. Note that it is different
from when nr_apqns is zero - that case is considered valid and should be
handled in subsequent pkey_handler implementations.
Found by Linux Verification Center (linuxtesting.org).
Fixes: f2bbc96e7cfa ("s390/pkey: add CCA AES cipher key support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin(a)ispras.ru>
---
drivers/s390/crypto/pkey_api.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/s390/crypto/pkey_api.c b/drivers/s390/crypto/pkey_api.c
index cef60770f68b..a731fc9c62a7 100644
--- a/drivers/s390/crypto/pkey_api.c
+++ b/drivers/s390/crypto/pkey_api.c
@@ -83,10 +83,15 @@ static void *_copy_key_from_user(void __user *ukey, size_t keylen)
static void *_copy_apqns_from_user(void __user *uapqns, size_t nr_apqns)
{
+ size_t size;
+
if (!uapqns || nr_apqns == 0)
return NULL;
- return memdup_user(uapqns, nr_apqns * sizeof(struct pkey_apqn));
+ if (check_mul_overflow(nr_apqns, sizeof(struct pkey_apqn), &size))
+ return ERR_PTR(-EINVAL);
+
+ return memdup_user(uapqns, size);
}
static int pkey_ioctl_genseck(struct pkey_genseck __user *ugs)
--
2.49.0
Hello Cassio,
On 10/06/2025 21:31:48+0100, Cassio Neri wrote:
> Hi all,
>
> Although untested, I'm pretty sure that with very small changes, the
> previous revision (1d1bb12) can handle dates prior to 1970-01-01 with no
> need to add extra branches or arithmetic operations. Indeed, 1d1bb12
> contains:
>
> <code>
> /* time must be positive */
> days = div_s64_rem(time, 86400, &secs);
>
> /* day of the week, 1970-01-01 was a Thursday */
> tm->tm_wday = (days + 4) % 7;
>
> /* long comments */
>
> udays = ((u32) days) + 719468;
> </code>
>
> This could have been changed to:
>
> <code>
> /* time must be >= -719468 * 86400 which corresponds to 0000-03-01 */
> udays = div_u64_rem(time + 719468 * 86400, 86400, &secs);
>
> /* day of the week, 0000-03-01 was a Wednesday (in the proleptic Gregorian
> calendar) */
> tm->tm_wday = (days + 3) % 7;
>
> /* long comments */
> </code>
>
> Indeed, the addition of 719468 * 86400 to `time` makes `days` to be 719468
> more than it should be. Therefore, in the calculation of `udays`, the
> addition of 719468 becomes unnecessary and thus, `udays == days`. Moreover,
> this means that `days` can be removed altogether and replaced by `udays`.
> (Not the other way around because in the remaining code `udays` must be
> u32.)
>
> Now, 719468 % 7 = 1 and thus tm->wday is 1 day after what it should be and
> we correct that by adding 3 instead of 4.
>
> Therefore, I suggest these changes on top of 1d1bb12 instead of those made
> in 7df4cfe. Since you're working on this, can I please kindly suggest two
> other changes?
>
> 1) Change the reference provided in the long comment. It should say, "The
> following algorithm is, basically, Figure 12 of Neri and Schneider [1]" and
> [1] should refer to the published article:
>
> Neri C, Schneider L. Euclidean affine functions and their application to
> calendar algorithms. Softw Pract Exper. 2023;53(4):937-970. doi:
> 10.1002/spe.3172
> https://doi.org/10.1002/spe.3172
>
> The article is much better written and clearer than the pre-print currently
> referred to.
>
Thanks for your input, I wanted to look again at your paper and make those
optimizations which is why I took so long to review the original patch.
Unfortunately, I didn't have the time before the merge window.
I would also gladly take patches for this if you are up for the task.
> 2) Function rtc_time64_to_tm_test_date_range in drivers/rtc/lib_test.c, is
> a kunit test that checks the result for everyday in a 160000 years range
> starting at 1970-01-01. It'd be nice if this test is adapted to the new
> code and starts at 1900-01-01 (technically, it could start at 0000-03-01
> but since tm->year counts from 1900, it would be weird to see tm->year ==
> -1900 to mean that the calendar year is 0.) Also 160000 is definitely an
> overkill (my bad!) and a couple of thousands of years, say 3000, should be
> more than safe for anyone. :-)
This is also something on my radar as some have been complaining about the time
it takes to run those tests.
>
> Many thanks,
> Cassio.
>
>
>
> On Tue, 10 Jun 2025 at 08:35, Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> wrote:
>
> > From: Alexandre Mergnat <amergnat(a)baylibre.com>
> >
> > commit 7df4cfef8b351fec3156160bedfc7d6d29de4cce upstream.
> >
> > Conversion of dates before 1970 is still relevant today because these
> > dates are reused on some hardwares to store dates bigger than the
> > maximal date that is representable in the device's native format.
> > This prominently and very soon affects the hardware covered by the
> > rtc-mt6397 driver that can only natively store dates in the interval
> > 1900-01-01 up to 2027-12-31. So to store the date 2028-01-01 00:00:00
> > to such a device, rtc_time64_to_tm() must do the right thing for
> > time=-2208988800.
> >
> > Signed-off-by: Alexandre Mergnat <amergnat(a)baylibre.com>
> > Reviewed-by: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> > Link:
> > https://lore.kernel.org/r/20250428-enable-rtc-v4-1-2b2f7e3f9349@baylibre.com
> > Signed-off-by: Alexandre Belloni <alexandre.belloni(a)bootlin.com>
> > Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> > ---
> > drivers/rtc/lib.c | 24 +++++++++++++++++++-----
> > 1 file changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/rtc/lib.c b/drivers/rtc/lib.c
> > index fe361652727a..13b5b1f20465 100644
> > --- a/drivers/rtc/lib.c
> > +++ b/drivers/rtc/lib.c
> > @@ -46,24 +46,38 @@ EXPORT_SYMBOL(rtc_year_days);
> > * rtc_time64_to_tm - converts time64_t to rtc_time.
> > *
> > * @time: The number of seconds since 01-01-1970 00:00:00.
> > - * (Must be positive.)
> > + * Works for values since at least 1900
> > * @tm: Pointer to the struct rtc_time.
> > */
> > void rtc_time64_to_tm(time64_t time, struct rtc_time *tm)
> > {
> > - unsigned int secs;
> > - int days;
> > + int days, secs;
> >
> > u64 u64tmp;
> > u32 u32tmp, udays, century, day_of_century, year_of_century, year,
> > day_of_year, month, day;
> > bool is_Jan_or_Feb, is_leap_year;
> >
> > - /* time must be positive */
> > + /*
> > + * Get days and seconds while preserving the sign to
> > + * handle negative time values (dates before 1970-01-01)
> > + */
> > days = div_s64_rem(time, 86400, &secs);
> >
> > + /*
> > + * We need 0 <= secs < 86400 which isn't given for negative
> > + * values of time. Fixup accordingly.
> > + */
> > + if (secs < 0) {
> > + days -= 1;
> > + secs += 86400;
> > + }
> > +
> > /* day of the week, 1970-01-01 was a Thursday */
> > tm->tm_wday = (days + 4) % 7;
> > + /* Ensure tm_wday is always positive */
> > + if (tm->tm_wday < 0)
> > + tm->tm_wday += 7;
> >
> > /*
> > * The following algorithm is, basically, Proposition 6.3 of Neri
> > @@ -93,7 +107,7 @@ void rtc_time64_to_tm(time64_t time, struct rtc_time
> > *tm)
> > * thus, is slightly different from [1].
> > */
> >
> > - udays = ((u32) days) + 719468;
> > + udays = days + 719468;
> >
> > u32tmp = 4 * udays + 3;
> > century = u32tmp / 146097;
> > --
> > 2.49.0
> >
> >
This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
Commit <e70c301faece> ("block: don't reorder requests in
blk_add_rq_to_plug") reversed how requests are stored in the blk_plug
list, this had significant impact on bio merging with requests exist on
the plug list. This impact has been reported in [1] and could easily be
reproducible using 4k randwrite fio benchmark on an NVME based SSD without
having any filesystem on the disk.
My benchmark is:
fio --time_based --name=benchmark --size=50G --rw=randwrite \
--runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
--randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
--verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
--group_reporting
On 1.9TiB SSD(180K Max IOPS) attached to i3.16xlarge AWS EC2 instance.
Kernel | fio (B.W MiB/sec) | I/O size (iostat)
--------------+---------------------+--------------------
6.15.1 | 362 | 2KiB
6.15.1+revert | 660 (+82%) | 4KiB
--------------+---------------------+--------------------
I have run iostat while the fio benchmark was running and was able to
see that the I/O size seen on the disk is shown as 2KB without this revert
while it's 4KB with the revert. In the bad case the write bandwidth
is capped at around 362MiB/sec which almost 2KiB * 180K IOPS so we are
hitting the SSD Disk IOPS limit which is 180K. After the revert the I/O
size has been doubled to 4KiB hence the bandwidth has been almost doubled
as we no longer hit the Disk IOPS limit.
I have done some tracing using bpftrace & bcc and was able to conclude that
the reason behind the I/O size discrepancy with the revert is that this
fio benchmark is subimitting each 4k I/O as 2 contiguous 2KB bios.
In the good case each 2 bios are merged in a 4KB request that's then been
submitted to the disk while in the bad case 2K bios are submitted to the
disk without merging because blk_attempt_plug_merge() failed to merge
them as seen below.
**Without the revert**
[12:12:28]
r::blk_attempt_plug_merge():int:$retval
COUNT EVENT
5618 $retval = 1
176578 $retval = 0
**With the revert**
[12:11:43]
r::blk_attempt_plug_merge():int:$retval
COUNT EVENT
146684 $retval = 0
146686 $retval = 1
In blk_attempt_plug_merge() we are iterating ithrought the plug list
from head to tail looking for a request with which we can merge the
most recently submitted bio.
With commit <e70c301faece> ("block: don't reorder requests in
blk_add_rq_to_plug") the most recent request will be at the tail so
blk_attempt_plug_merge() will fail because it tries to merge bio with
the plug list head. In blk_attempt_plug_merge() we don't iterate across
the whole plug list because as we exit the loop once we fail merging in
blk_attempt_bio_merge().
In commit <bc490f81731> ("block: change plugging to use a singly linked
list") the plug list has been changed to single linked list so there's
no way to iterate the list from tail to head which is the only way to
mitigate the impact on bio merging if we want to keep commit <e70c301faece>
("block: don't reorder requests in blk_add_rq_to_plug").
Given that moving plug list to a single linked list was mainly for
performance reason then let's revert commit <e70c301faece> ("block: don't
reorder requests in blk_add_rq_to_plug") for now to mitigate the
reported performance regression.
[1] https://lore.kernel.org/lkml/202412122112.ca47bcec-lkp@intel.com/
Cc: stable(a)vger.kernel.org # 6.12
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reported-by: Hagar Hemdan <hagarhem(a)amazon.com>
Reported-and-bisected-by: Shaoying Xu <shaoyi(a)amazon.com>
Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze(a)amazon.com>
---
block/blk-mq.c | 4 ++--
drivers/block/virtio_blk.c | 2 +-
drivers/nvme/host/pci.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..28965cac19fb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1394,7 +1394,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
*/
if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
plug->has_elevator = true;
- rq_list_add_tail(&plug->mq_list, rq);
+ rq_list_add_head(&plug->mq_list, rq);
plug->rq_count++;
}
@@ -2846,7 +2846,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
rq_list_add_tail(&requeue_list, rq);
continue;
}
- list_add_tail(&rq->queuelist, &list);
+ list_add(&rq->queuelist, &list);
depth++;
} while (!rq_list_empty(&plug->mq_list));
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7cffea01d868..7992a171f905 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,7 +513,7 @@ static void virtio_queue_rqs(struct rq_list *rqlist)
vq = this_vq;
if (virtblk_prep_rq_batch(req))
- rq_list_add_tail(&submit_list, req);
+ rq_list_add_head(&submit_list, req); /* reverse order */
else
rq_list_add_tail(&requeue_list, req);
}
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f1dd804151b1..5f7da42f9dac 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1026,7 +1026,7 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
nvmeq = req->mq_hctx->driver_data;
if (nvme_prep_rq_batch(nvmeq, req))
- rq_list_add_tail(&submit_list, req);
+ rq_list_add_head(&submit_list, req); /* reverse order */
else
rq_list_add_tail(&requeue_list, req);
}
--
2.47.1
From: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
When requesting pins whose intr_detection_width setting is not 1 or 2
for interrupts (for example by running `gpiomon -c 0 113` on RB2), we'll
hit a BUG() in msm_gpio_irq_set_type(). Potentially crashing the kernel
due to an invalid request from user-space is not optimal, so let's go
through the pins and mark those that would fail the check as invalid for
the irq chip as we should not even register them as available irqs.
This function can be extended if we determine that there are more
corner-cases like this.
Fixes: f365be092572 ("pinctrl: Add Qualcomm TLMM driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
---
drivers/pinctrl/qcom/pinctrl-msm.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/pinctrl/qcom/pinctrl-msm.c b/drivers/pinctrl/qcom/pinctrl-msm.c
index f012ea88aa22c..77e0c2f023455 100644
--- a/drivers/pinctrl/qcom/pinctrl-msm.c
+++ b/drivers/pinctrl/qcom/pinctrl-msm.c
@@ -1038,6 +1038,24 @@ static bool msm_gpio_needs_dual_edge_parent_workaround(struct irq_data *d,
test_bit(d->hwirq, pctrl->skip_wake_irqs);
}
+static void msm_gpio_irq_init_valid_mask(struct gpio_chip *gc,
+ unsigned long *valid_mask,
+ unsigned int ngpios)
+{
+ struct msm_pinctrl *pctrl = gpiochip_get_data(gc);
+ const struct msm_pingroup *g;
+ int i;
+
+ bitmap_fill(valid_mask, ngpios);
+
+ for (i = 0; i < ngpios; i++) {
+ g = &pctrl->soc->groups[i];
+ if (g->intr_detection_width != 1 &&
+ g->intr_detection_width != 2)
+ clear_bit(i, valid_mask);
+ }
+}
+
static int msm_gpio_irq_set_type(struct irq_data *d, unsigned int type)
{
struct gpio_chip *gc = irq_data_get_irq_chip_data(d);
@@ -1441,6 +1459,7 @@ static int msm_gpio_init(struct msm_pinctrl *pctrl)
girq->default_type = IRQ_TYPE_NONE;
girq->handler = handle_bad_irq;
girq->parents[0] = pctrl->irq;
+ girq->init_valid_mask = msm_gpio_irq_init_valid_mask;
ret = gpiochip_add_data(&pctrl->chip, pctrl);
if (ret) {
--
2.48.1