The quilt patch titled
Subject: mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
has been removed from the -mm tree. Its filename was
mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
Date: Thu, 14 Mar 2024 17:12:59 +0100
Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.
A reproducer can be found at [1].
The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.
So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().
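
The difference between the two retry strategies can be sketched with a toy model. This is only an illustration under stated assumptions: every toy_* name and constant below is hypothetical, and the "fault handler" simply reports an error once it sees that the caller has already tried, which mirrors the role FOLL_TRIED plays in the description above.

```c
#include <errno.h>

/* Hypothetical stand-ins; the values are not the kernel's. */
enum { TOY_FAULT_RETRY = 1, TOY_FAULT_SIGBUS = 2 };
enum { TOY_FOLL_TRIED = 1 << 0 };

/*
 * Toy fault handler for a page that can never be read: it asks for a
 * retry on the first attempt and only reports the error once the caller
 * signals that it has already tried.
 */
static int toy_handle_fault(unsigned int flags)
{
	return (flags & TOY_FOLL_TRIED) ? TOY_FAULT_SIGBUS : TOY_FAULT_RETRY;
}

/* Old behaviour: every retry starts from scratch, so TRIED is never set. */
static int toy_populate_old(int max_attempts)
{
	for (int i = 0; i < max_attempts; i++) {
		if (toy_handle_fault(0) == TOY_FAULT_SIGBUS)
			return -EFAULT;
	}
	return 1;	/* gave up: an unbounded loop would still be spinning */
}

/* New behaviour: remember the retry, so the next attempt can fail. */
static int toy_populate_new(int max_attempts)
{
	unsigned int flags = 0;

	for (int i = 0; i < max_attempts; i++) {
		if (toy_handle_fault(flags) == TOY_FAULT_SIGBUS)
			return -EFAULT;
		flags |= TOY_FOLL_TRIED;
	}
	return 1;
}
```

In this model toy_populate_old() exhausts whatever attempt budget it is given, while toy_populate_new() returns -EFAULT on its second attempt.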
But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.
As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.
With this change, Darrick's reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Darrick J. Wong <djwong(a)kernel.org>
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Cc: Darrick J. Wong <djwong(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 54 ++++++++++++++++++++++++++++--------------------
mm/internal.h | 10 +++++---
mm/madvise.c | 17 +--------------
3 files changed, 40 insertions(+), 41 deletions(-)
--- a/mm/gup.c~mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly
+++ a/mm/gup.c
@@ -1206,6 +1206,22 @@ static long __get_user_pages(struct mm_s
/* first iteration or cross vma bound */
if (!vma || start >= vma->vm_end) {
+ /*
+ * MADV_POPULATE_(READ|WRITE) wants to handle VMA
+ * lookups+error reporting differently.
+ */
+ if (gup_flags & FOLL_MADV_POPULATE) {
+ vma = vma_lookup(mm, start);
+ if (!vma) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ if (check_vma_flags(vma, gup_flags)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ goto retry;
+ }
vma = gup_vma_lookup(mm, start);
if (!vma && in_gate_area(mm, start)) {
ret = get_gate_page(mm, start & PAGE_MASK,
@@ -1685,35 +1701,35 @@ long populate_vma_page_range(struct vm_a
}
/*
- * faultin_vma_page_range() - populate (prefault) page tables inside the
- * given VMA range readable/writable
+ * faultin_page_range() - populate (prefault) page tables inside the
+ * given range readable/writable
*
* This takes care of mlocking the pages, too, if VM_LOCKED is set.
*
- * @vma: target vma
+ * @mm: the mm to populate page tables in
* @start: start address
* @end: end address
* @write: whether to prefault readable or writable
* @locked: whether the mmap_lock is still held
*
- * Returns either number of processed pages in the vma, or a negative error
- * code on error (see __get_user_pages()).
+ * Returns either number of processed pages in the MM, or a negative error
+ * code on error (see __get_user_pages()). Note that this function reports
+ * errors related to VMAs, such as incompatible mappings, as expected by
+ * MADV_POPULATE_(READ|WRITE).
*
- * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and
- * covered by the VMA. If it's released, *@locked will be set to 0.
+ * The range must be page-aligned.
+ *
+ * mm->mmap_lock must be held. If it's released, *@locked will be set to 0.
*/
-long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, bool write, int *locked)
+long faultin_page_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, bool write, int *locked)
{
- struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
int gup_flags;
long ret;
VM_BUG_ON(!PAGE_ALIGNED(start));
VM_BUG_ON(!PAGE_ALIGNED(end));
- VM_BUG_ON_VMA(start < vma->vm_start, vma);
- VM_BUG_ON_VMA(end > vma->vm_end, vma);
mmap_assert_locked(mm);
/*
@@ -1725,19 +1741,13 @@ long faultin_vma_page_range(struct vm_ar
* a poisoned page.
* !FOLL_FORCE: Require proper access permissions.
*/
- gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE;
+ gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE |
+ FOLL_MADV_POPULATE;
if (write)
gup_flags |= FOLL_WRITE;
- /*
- * We want to report -EINVAL instead of -EFAULT for any permission
- * problems or incompatible mappings.
- */
- if (check_vma_flags(vma, gup_flags))
- return -EINVAL;
-
- ret = __get_user_pages(mm, start, nr_pages, gup_flags,
- NULL, locked);
+ ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+ gup_flags);
lru_add_drain();
return ret;
}
--- a/mm/internal.h~mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly
+++ a/mm/internal.h
@@ -686,9 +686,8 @@ struct anon_vma *folio_anon_vma(struct f
void unmap_mapping_folio(struct folio *folio);
extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *locked);
-extern long faultin_vma_page_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- bool write, int *locked);
+extern long faultin_page_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, bool write, int *locked);
extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
unsigned long bytes);
@@ -1127,10 +1126,13 @@ enum {
FOLL_FAST_ONLY = 1 << 20,
/* allow unlocking the mmap lock */
FOLL_UNLOCKABLE = 1 << 21,
+ /* VMA lookup+checks compatible with MADV_POPULATE_(READ|WRITE) */
+ FOLL_MADV_POPULATE = 1 << 22,
};
#define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
- FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+ FOLL_FAST_ONLY | FOLL_UNLOCKABLE | \
+ FOLL_MADV_POPULATE)
/*
* Indicates for which pages that are write-protected in the page table,
--- a/mm/madvise.c~mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly
+++ a/mm/madvise.c
@@ -908,27 +908,14 @@ static long madvise_populate(struct vm_a
{
const bool write = behavior == MADV_POPULATE_WRITE;
struct mm_struct *mm = vma->vm_mm;
- unsigned long tmp_end;
int locked = 1;
long pages;
*prev = vma;
while (start < end) {
- /*
- * We might have temporarily dropped the lock. For example,
- * our VMA might have been split.
- */
- if (!vma || start >= vma->vm_end) {
- vma = vma_lookup(mm, start);
- if (!vma)
- return -ENOMEM;
- }
-
- tmp_end = min_t(unsigned long, end, vma->vm_end);
/* Populate (prefault) page tables readable/writable. */
- pages = faultin_vma_page_range(vma, start, tmp_end, write,
- &locked);
+ pages = faultin_page_range(mm, start, end, write, &locked);
if (!locked) {
mmap_read_lock(mm);
locked = 1;
@@ -949,7 +936,7 @@ static long madvise_populate(struct vm_a
pr_warn_once("%s: unhandled return value: %ld\n",
__func__, pages);
fallthrough;
- case -ENOMEM:
+ case -ENOMEM: /* No VMA or out of memory. */
return -ENOMEM;
}
}
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-madvise-dont-perform-madvise-vma-walk-for-madv_populate_readwrite.patch
mm-convert-folio_estimated_sharers-to-folio_likely_mapped_shared.patch
mm-convert-folio_estimated_sharers-to-folio_likely_mapped_shared-fix.patch
selftests-memfd_secret-add-vmsplice-test.patch
mm-merge-folio_is_secretmem-and-folio_fast_pin_allowed-into-gup_fast_folio_allowed.patch
mm-optimize-config_per_vma_lock-member-placement-in-vm_area_struct.patch
mm-remove-prot-parameter-from-move_pte.patch
mm-gup-consistently-name-gup-fast-functions.patch
mm-treewide-rename-config_have_fast_gup-to-config_have_gup_fast.patch
mm-use-gup-fast-instead-fast-gup-in-remaining-comments.patch
drivers-virt-acrn-fix-pfnmap-pte-checks-in-acrn_vm_ram_map.patch
mm-pass-vma-instead-of-mm-to-follow_pte.patch
mm-follow_pte-improvements.patch
mm-allow-for-detecting-underflows-with-page_mapcount-again.patch
mm-rmap-always-inline-anon-file-rmap-duplication-of-a-single-pte.patch
mm-rmap-add-fast-path-for-small-folios-when-adding-removing-duplicating.patch
mm-track-mapcount-of-large-folios-in-single-value.patch
mm-improve-folio_likely_mapped_shared-using-the-mapcount-of-large-folios.patch
mm-make-folio_mapcount-return-0-for-small-typed-folios.patch
mm-memory-use-folio_mapcount-in-zap_present_folio_ptes.patch
mm-huge_memory-use-folio_mapcount-in-zap_huge_pmd-sanity-check.patch
mm-memory-failure-use-folio_mapcount-in-hwpoison_user_mappings.patch
mm-page_alloc-use-folio_mapped-in-__alloc_contig_migrate_range.patch
mm-migrate-use-folio_likely_mapped_shared-in-add_page_for_migration.patch
sh-mm-cache-use-folio_mapped-in-copy_from_user_page.patch
mm-filemap-use-folio_mapcount-in-filemap_unaccount_folio.patch
mm-migrate_device-use-folio_mapcount-in-migrate_vma_check_page.patch
trace-events-page_ref-trace-the-raw-page-mapcount-value.patch
xtensa-mm-convert-check_tlb_entry-to-sanity-check-folios.patch
mm-debug-print-only-page-mapcount-excluding-folio-entire-mapcount-in-__dump_folio.patch
documentation-admin-guide-cgroup-v1-memoryrst-dont-reference-page_mapcount.patch
mm-ksm-rename-get_ksm_page_flags-to-ksm_get_folio_flags.patch
mm-ksm-remove-page_mapcount-usage-in-stable_tree_search.patch
loongarch-tlb-fix-error-parameter-ptep-set-but-not-used-due-to-__tlb_remove_tlb_entry.patch
The quilt patch titled
Subject: nilfs2: fix OOB in nilfs_set_de_type
has been removed from the -mm tree. Its filename was
nilfs2-fix-oob-in-nilfs_set_de_type.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jeongjun Park <aha310510(a)gmail.com>
Subject: nilfs2: fix OOB in nilfs_set_de_type
Date: Tue, 16 Apr 2024 03:20:48 +0900
The size of the nilfs_type_by_mode array in fs/nilfs2/dir.c is defined as
"S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function, which uses this
array, computes the index to read from the array as
"(mode & S_IFMT) >> S_SHIFT".
static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode *inode)
{
	umode_t mode = inode->i_mode;

	de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob
}
However, when the index is determined this way, an out-of-bounds (OOB)
access occurs when the condition "(mode & S_IFMT) == S_IFMT" is satisfied:
the index then equals the array size, which is one past the last valid
element. Therefore, the nilfs_type_by_mode array should be enlarged by one
element to prevent the OOB access.
Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com
Reported-by: syzbot+2e22057de05b9f3b30d8(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8
Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations")
Signed-off-by: Jeongjun Park <aha310510(a)gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/dir.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/nilfs2/dir.c~nilfs2-fix-oob-in-nilfs_set_de_type
+++ a/fs/nilfs2/dir.c
@@ -240,7 +240,7 @@ nilfs_filetype_table[NILFS_FT_MAX] = {
#define S_SHIFT 12
static unsigned char
-nilfs_type_by_mode[S_IFMT >> S_SHIFT] = {
+nilfs_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
[S_IFREG >> S_SHIFT] = NILFS_FT_REG_FILE,
[S_IFDIR >> S_SHIFT] = NILFS_FT_DIR,
[S_IFCHR >> S_SHIFT] = NILFS_FT_CHRDEV,
_
Patches currently in -mm which might be from aha310510(a)gmail.com are
The quilt patch titled
Subject: fork: defer linking file vma until vma is fully initialized
has been removed from the -mm tree. Its filename was
fork-defer-linking-file-vma-until-vma-is-fully-initialized.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Miaohe Lin <linmiaohe(a)huawei.com>
Subject: fork: defer linking file vma until vma is fully initialized
Date: Wed, 10 Apr 2024 17:14:41 +0800
Thorvald reported a WARNING [1]. The root cause is the race below:
    CPU 1                                  CPU 2
    fork                                   hugetlbfs_fallocate
     dup_mmap                               hugetlbfs_punch_hole
      i_mmap_lock_write(mapping);
      vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
      i_mmap_unlock_write(mapping);
      hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                            i_mmap_lock_write(mapping);
                                            hugetlb_vmdelete_list
                                             vma_interval_tree_foreach
                                              hugetlb_vma_trylock_write -- Vma_lock is cleared.
      tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                              hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                            i_mmap_unlock_write(mapping);
hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the
i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this
by deferring linking the file vma until the vma is fully initialized. Those
vmas should be initialized first before they can be used.
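
The ordering argument can be illustrated with a deliberately simplified, single-threaded model. It does not reproduce the race; it only shows what an observer sees at the moment an object becomes visible. All toy_* names are hypothetical, and toy_publish() stands in for linking the vma into the i_mmap tree.

```c
#include <stddef.h>

/* Hypothetical miniature of a VMA with a private lock pointer. */
struct toy_vma {
	int *lock;		/* NULL until the object is initialized */
};

/* What a concurrent walker of the shared tree would observe. */
static int observed_initialized;

/* Stand-in for linking into the shared tree: from here on others can look. */
static void toy_publish(const struct toy_vma *vma)
{
	observed_initialized = (vma->lock != NULL);
}

static void toy_init(struct toy_vma *vma, int *lock)
{
	vma->lock = lock;
}

/* Old order: publish first, initialize afterwards. */
static int toy_dup_old(void)
{
	static int lock;
	struct toy_vma vma = { .lock = NULL };

	toy_publish(&vma);
	toy_init(&vma, &lock);
	return observed_initialized;
}

/* New order: finish initialization, then publish. */
static int toy_dup_new(void)
{
	static int lock;
	struct toy_vma vma = { .lock = NULL };

	toy_init(&vma, &lock);
	toy_publish(&vma);
	return observed_initialized;
}
```

With the old order the observer sees an object whose lock is still unset; with the new order it can only ever see a fully initialized one.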
Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reported-by: Thorvald Natvig <thorvald(a)google.com>
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu <jane.chu(a)oracle.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Mateusz Guzik <mjguzik(a)gmail.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Peng Zhang <zhangpeng.00(a)bytedance.com>
Cc: Tycho Andersen <tandersen(a)netflix.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)
--- a/kernel/fork.c~fork-defer-linking-file-vma-until-vma-is-fully-initialized
+++ a/kernel/fork.c
@@ -714,6 +714,23 @@ static __latent_entropy int dup_mmap(str
} else if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
vm_flags_clear(tmp, VM_LOCKED_MASK);
+ /*
+ * Copy/update hugetlb private vma information.
+ */
+ if (is_vm_hugetlb_page(tmp))
+ hugetlb_dup_vma_private(tmp);
+
+ /*
+ * Link the vma into the MT. After using __mt_dup(), memory
+ * allocation is not necessary here, so it cannot fail.
+ */
+ vma_iter_bulk_store(&vmi, tmp);
+
+ mm->map_count++;
+
+ if (tmp->vm_ops && tmp->vm_ops->open)
+ tmp->vm_ops->open(tmp);
+
file = tmp->vm_file;
if (file) {
struct address_space *mapping = file->f_mapping;
@@ -730,25 +747,9 @@ static __latent_entropy int dup_mmap(str
i_mmap_unlock_write(mapping);
}
- /*
- * Copy/update hugetlb private vma information.
- */
- if (is_vm_hugetlb_page(tmp))
- hugetlb_dup_vma_private(tmp);
-
- /*
- * Link the vma into the MT. After using __mt_dup(), memory
- * allocation is not necessary here, so it cannot fail.
- */
- vma_iter_bulk_store(&vmi, tmp);
-
- mm->map_count++;
if (!(tmp->vm_flags & VM_WIPEONFORK))
retval = copy_page_range(tmp, mpnt);
- if (tmp->vm_ops && tmp->vm_ops->open)
- tmp->vm_ops->open(tmp);
-
if (retval) {
mpnt = vma_next(&vmi);
goto loop_out;
_
Patches currently in -mm which might be from linmiaohe(a)huawei.com are
The quilt patch titled
Subject: Squashfs: check the inode number is not the invalid value of zero
has been removed from the -mm tree. Its filename was
squashfs-check-the-inode-number-is-not-the-invalid-value-of-zero.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Phillip Lougher <phillip(a)squashfs.org.uk>
Subject: Squashfs: check the inode number is not the invalid value of zero
Date: Mon, 8 Apr 2024 23:02:06 +0100
Syzkaller has produced an out of bounds access in fill_meta_index().
That out of bounds access is ultimately caused because the inode
has an inode number with the invalid value of zero, which was not checked.
The reason this causes the out of bounds access is due to the following
sequence of events:
1. fill_meta_index() is called to allocate (via empty_meta_index())
and fill a metadata index. It however suffers a data read error
and aborts, invalidating the newly returned empty metadata index.
It does this by setting the inode number of the index to zero,
which means unused (zero is not a valid inode number).
2. When fill_meta_index() is subsequently called again on another
read operation, locate_meta_index() returns the previous index
because it matches the inode number of 0. Because this index
has been returned it is expected to have been filled, and because
it hasn't been, an out of bounds access is performed.
This patch adds a sanity check which checks that the inode number
is not zero when the inode is created and returns -EINVAL if it is.
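
The collision between "unused" slots and an inode numbered zero can be sketched as below. This is a toy cache, not the Squashfs data structure: the toy_* names, the slot count, and the filled field are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_SLOTS 4

/* Hypothetical cache slot; inode_number == 0 marks it as unused. */
struct toy_meta_index {
	unsigned int inode_number;
	int filled;
};

static struct toy_meta_index toy_cache[TOY_SLOTS];	/* all unused */

/* Lookup by inode number, with no special case for zero. */
static struct toy_meta_index *toy_locate(unsigned int ino)
{
	for (size_t i = 0; i < TOY_SLOTS; i++)
		if (toy_cache[i].inode_number == ino)
			return &toy_cache[i];
	return NULL;
}

/* The added sanity check: reject inode number zero up front. */
static int toy_new_inode(unsigned int ino)
{
	return ino == 0 ? -EINVAL : 0;
}
```

Looking up inode 0 in this model returns an unused, unfilled slot as if it were a valid hit, which is why rejecting inode number zero at inode creation closes the hole.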
[phillip(a)squashfs.org.uk: whitespace fix]
Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk
Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip(a)squashfs.org.uk>
Reported-by: "Ubisectech Sirius" <bugreport(a)ubisectech.com>
Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport…
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/squashfs/inode.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/fs/squashfs/inode.c~squashfs-check-the-inode-number-is-not-the-invalid-value-of-zero
+++ a/fs/squashfs/inode.c
@@ -48,6 +48,10 @@ static int squashfs_new_inode(struct sup
gid_t i_gid;
int err;
+ inode->i_ino = le32_to_cpu(sqsh_ino->inode_number);
+ if (inode->i_ino == 0)
+ return -EINVAL;
+
err = squashfs_get_id(sb, le16_to_cpu(sqsh_ino->uid), &i_uid);
if (err)
return err;
@@ -58,7 +62,6 @@ static int squashfs_new_inode(struct sup
i_uid_write(inode, i_uid);
i_gid_write(inode, i_gid);
- inode->i_ino = le32_to_cpu(sqsh_ino->inode_number);
inode_set_mtime(inode, le32_to_cpu(sqsh_ino->mtime), 0);
inode_set_atime(inode, inode_get_mtime_sec(inode), 0);
inode_set_ctime(inode, inode_get_mtime_sec(inode), 0);
_
Patches currently in -mm which might be from phillip(a)squashfs.org.uk are
squashfs-remove-deprecated-strncpy-by-not-copying-the-string.patch
The quilt patch titled
Subject: mm,swapops: update check in is_pfn_swap_entry for hwpoison entries
has been removed from the -mm tree. Its filename was
mmswapops-update-check-in-is_pfn_swap_entry-for-hwpoison-entries.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Oscar Salvador <osalvador(a)suse.de>
Subject: mm,swapops: update check in is_pfn_swap_entry for hwpoison entries
Date: Sun, 7 Apr 2024 15:05:37 +0200
Tony reported that the Machine check recovery was broken in v6.9-rc1, as
he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to
DRAM.
After some more digging and debugging on his side, he realized that this
went back to v6.1, with the introduction of 'commit 0d206b5d2e0d
("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That
commit, among other things, introduced swp_offset_pfn(), replacing
hwpoison_entry_to_pfn() in its favour.
The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but
is_pfn_swap_entry() never got updated to cover hwpoison entries, which
means that we would hit the VM_BUG_ON whenever we would call
swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM
set. Fix this by updating the check to cover hwpoison entries as well,
and update the comment while we are at it.
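
The effect of the widened check can be modelled with a small predicate pair. The enum below is hypothetical and does not reflect how the kernel encodes swap entry types; it only captures which kinds of entries carry a PFN.

```c
#include <stdbool.h>

/* Hypothetical entry types; the kernel encodes these differently. */
enum toy_swp_type {
	TOY_SWP_REGULAR,
	TOY_SWP_MIGRATION,
	TOY_SWP_DEVICE_PRIVATE,
	TOY_SWP_DEVICE_EXCLUSIVE,
	TOY_SWP_HWPOISON,
};

/* Before the fix: hwpoison entries are not recognized as PFN entries. */
static bool toy_is_pfn_entry_old(enum toy_swp_type t)
{
	return t == TOY_SWP_MIGRATION || t == TOY_SWP_DEVICE_PRIVATE ||
	       t == TOY_SWP_DEVICE_EXCLUSIVE;
}

/* After the fix: hwpoison entries are covered as well. */
static bool toy_is_pfn_entry_new(enum toy_swp_type t)
{
	return toy_is_pfn_entry_old(t) || t == TOY_SWP_HWPOISON;
}
```

An assertion of the form "this entry must be a PFN entry" therefore fires for a hwpoison entry under the old predicate and passes under the new one.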
Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de
Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: Oscar Salvador <osalvador(a)suse.de>
Reported-by: Tony Luck <tony.luck(a)intel.com>
Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
Tested-by: Tony Luck <tony.luck(a)intel.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: <stable(a)vger.kernel.org> [6.1.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/swapops.h | 65 +++++++++++++++++++-------------------
1 file changed, 33 insertions(+), 32 deletions(-)
--- a/include/linux/swapops.h~mmswapops-update-check-in-is_pfn_swap_entry-for-hwpoison-entries
+++ a/include/linux/swapops.h
@@ -390,6 +390,35 @@ static inline bool is_migration_entry_di
}
#endif /* CONFIG_MIGRATION */
+#ifdef CONFIG_MEMORY_FAILURE
+
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_HWPOISON;
+}
+
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif
+
typedef unsigned long pte_marker;
#define PTE_MARKER_UFFD_WP BIT(0)
@@ -483,8 +512,9 @@ static inline struct folio *pfn_swap_ent
/*
* A pfn swap entry is a special type of swap entry that always has a pfn stored
- * in the swap offset. They are used to represent unaddressable device memory
- * and to restrict access to a page undergoing migration.
+ * in the swap offset. They can either be used to represent unaddressable device
+ * memory, to restrict access to a page undergoing migration or to represent a
+ * pfn which has been hwpoisoned and unmapped.
*/
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
@@ -492,7 +522,7 @@ static inline bool is_pfn_swap_entry(swp
BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
return is_migration_entry(entry) || is_device_private_entry(entry) ||
- is_device_exclusive_entry(entry);
+ is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
}
struct page_vma_mapped_walk;
@@ -561,35 +591,6 @@ static inline int is_pmd_migration_entry
}
#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
-#ifdef CONFIG_MEMORY_FAILURE
-
-/*
- * Support for hardware poisoned pages
- */
-static inline swp_entry_t make_hwpoison_entry(struct page *page)
-{
- BUG_ON(!PageLocked(page));
- return swp_entry(SWP_HWPOISON, page_to_pfn(page));
-}
-
-static inline int is_hwpoison_entry(swp_entry_t entry)
-{
- return swp_type(entry) == SWP_HWPOISON;
-}
-
-#else
-
-static inline swp_entry_t make_hwpoison_entry(struct page *page)
-{
- return swp_entry(0, 0);
-}
-
-static inline int is_hwpoison_entry(swp_entry_t swp)
-{
- return 0;
-}
-#endif
-
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
_
Patches currently in -mm which might be from osalvador(a)suse.de are
The quilt patch titled
Subject: mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
has been removed from the -mm tree. Its filename was
mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Miaohe Lin <linmiaohe(a)huawei.com>
Subject: mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
Date: Sun, 7 Apr 2024 16:54:56 +0800
When I ran a hard offline test with hugetlb pages, the deadlock below occurred:
======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60
but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
__mutex_lock+0x6c/0x770
page_alloc_cpu_online+0x3c/0x70
cpuhp_invoke_callback+0x397/0x5f0
__cpuhp_invoke_callback_range+0x71/0xe0
_cpu_up+0xeb/0x210
cpu_up+0x91/0xe0
cpuhp_bringup_mask+0x49/0xb0
bringup_nonboot_cpus+0xb7/0xe0
smp_init+0x25/0xa0
kernel_init_freeable+0x15f/0x3e0
kernel_init+0x15/0x1b0
ret_from_fork+0x2f/0x50
ret_from_fork_asm+0x1a/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(pcp_batch_high_lock);
                               lock(cpu_hotplug_lock);
                               lock(pcp_batch_high_lock);
  rlock(cpu_hotplug_lock);
*** DEADLOCK ***
5 locks held by bash/46904:
#0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
#1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
#2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
#3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
#4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40
stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xa0
check_noncircular+0x129/0x140
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00
In short, the call chain below breaks the lock dependency chain:
 memory_failure
  __page_handle_poison
   zone_pcp_disable -- lock(pcp_batch_high_lock)
   dissolve_free_huge_page
    __hugetlb_vmemmap_restore_folio
     static_key_slow_dec
      cpus_read_lock -- rlock(cpu_hotplug_lock)
Fix this by calling drain_all_pages() instead.
This issue did not occur before commit a6b40850c442 ("mm: hugetlb: replace
hugetlb_free_vmemmap_enabled with a static_key"), which introduced
rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while
lock(pcp_batch_high_lock) is already held in __page_handle_poison().
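
The inversion reported above can be reduced to a toy order tracker. This is a sketch only and far simpler than lockdep: the toy_* names are hypothetical, and it detects direct ABBA inversions between two locks, not longer dependency chains.

```c
#include <stdbool.h>

enum toy_lock { TOY_CPU_HOTPLUG, TOY_PCP_BATCH_HIGH, TOY_NR_LOCKS };

/* toy_order[a][b] is true once "b was taken while a was held" was seen. */
static bool toy_order[TOY_NR_LOCKS][TOY_NR_LOCKS];

/*
 * Record that `next` is acquired while `held` is held. Returns false if
 * the opposite order was recorded earlier, i.e. an ABBA inversion.
 */
static bool toy_acquire(enum toy_lock held, enum toy_lock next)
{
	if (toy_order[next][held])
		return false;
	toy_order[held][next] = true;
	return true;
}
```

Recording the CPU bring-up order first (cpu_hotplug_lock, then pcp_batch_high_lock) and the memory-failure order second (pcp_batch_high_lock, then cpu_hotplug_lock) makes the second call report the inversion.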
[linmiaohe(a)huawei.com: extend comment per Oscar]
[akpm(a)linux-foundation.org: reflow block comment]
Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Acked-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Jane Chu <jane.chu(a)oracle.com>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
--- a/mm/memory-failure.c~mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled
+++ a/mm/memory-failure.c
@@ -154,11 +154,23 @@ static int __page_handle_poison(struct p
{
int ret;
- zone_pcp_disable(page_zone(page));
+ /*
+ * zone_pcp_disable() can't be used here. It will
+ * hold pcp_batch_high_lock and dissolve_free_huge_page() might hold
+ * cpu_hotplug_lock via static_key_slow_dec() when hugetlb vmemmap
+ * optimization is enabled. This will break current lock dependency
+ * chain and leads to deadlock.
+ * Disabling pcp before dissolving the page was a deterministic
+ * approach because we made sure that those pages cannot end up in any
+ * PCP list. Draining PCP lists expels those pages to the buddy system,
+ * but nothing guarantees that those pages do not get back to a PCP
+ * queue if we need to refill those.
+ */
ret = dissolve_free_huge_page(page);
- if (!ret)
+ if (!ret) {
+ drain_all_pages(page_zone(page));
ret = take_page_off_buddy(page);
- zone_pcp_enable(page_zone(page));
+ }
return ret;
}
_
Patches currently in -mm which might be from linmiaohe(a)huawei.com are
The quilt patch titled
Subject: mm/userfaultfd: allow hugetlb change protection upon poison entry
has been removed from the -mm tree. Its filename was
mm-userfaultfd-allow-hugetlb-change-protection-upon-poison-entry.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/userfaultfd: allow hugetlb change protection upon poison entry
Date: Fri, 5 Apr 2024 19:19:20 -0400
After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
the POISON one or UFFD_WP one.
Allow change protection to run on a poisoned marker just like !hugetlb
cases, ignoring the marker irrespective of the permission.
Here the two bits are mutually exclusive. For example, when installing a
poisoned entry it must not be UFFD_WP already (by checking pte_none()
before such install). And it also means if UFFD_WP is set there must be
no POISON bit set. It makes sense because UFFD_WP is a bit to reflect
permission, and permissions do not apply if the pte is poisoned and
destined to sigbus.
So here we simply check whether the uffd_wp bit is set first, and do nothing otherwise.
Attach the Fixes to UFFDIO_POISON work, as before that it should not be
possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap,
so no chance of swapin errors).
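The decision described above can be sketched in plain user-space C. This is only an illustration of the rule "act on a UFFD_WP marker when resolving, ignore everything else"; the `MARKER_*` bits, the `action` enum and `change_prot_on_marker()` are hypothetical names invented for this sketch and are not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical marker bits, for illustration only (not the kernel's values). */
enum marker_bits {
	MARKER_UFFD_WP = 1 << 0,
	MARKER_POISON  = 1 << 1,
};

/* What the change-protection path would do with a marker entry. */
enum action {
	ACTION_NONE,	/* leave the entry untouched */
	ACTION_CLEAR,	/* drop the marker (non-present -> none) */
};

/*
 * Only a UFFD_WP marker that is being resolved gets cleared.
 * A poison marker falls through to ACTION_NONE: the page is
 * corrupted, so permissions do not apply to it.
 */
static enum action change_prot_on_marker(unsigned int marker,
					 bool uffd_wp_resolve)
{
	if ((marker & MARKER_UFFD_WP) && uffd_wp_resolve)
		return ACTION_CLEAR;
	return ACTION_NONE;
}
```

Because the two bits are mutually exclusive, checking the UFFD_WP bit alone is enough; no separate poison check or warning is needed in this model.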
Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reported-by: syzbot+b07c8ac8eee3d4d8440f(a)syzkaller.appspotmail.com
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: <stable(a)vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
--- a/mm/hugetlb.c~mm-userfaultfd-allow-hugetlb-change-protection-upon-poison-entry
+++ a/mm/hugetlb.c
@@ -7044,9 +7044,13 @@ long hugetlb_change_protection(struct vm
if (!pte_same(pte, newpte))
set_huge_pte_at(mm, address, ptep, newpte, psize);
} else if (unlikely(is_pte_marker(pte))) {
- /* No other markers apply for now. */
- WARN_ON_ONCE(!pte_marker_uffd_wp(pte));
- if (uffd_wp_resolve)
+ /*
+ * Do nothing on a poison marker; page is
+ * corrupted, permissions do not apply. Here
+ * pte_marker_uffd_wp()==true implies !poison
+ * because they're mutually exclusive.
+ */
+ if (pte_marker_uffd_wp(pte) && uffd_wp_resolve)
/* Safe to modify directly (non-present->none). */
huge_pte_clear(mm, address, ptep, psize);
} else if (!huge_pte_none(pte)) {
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-hmm-process-pud-swap-entry-without-pud_huge.patch
mm-gup-cache-p4d-in-follow_p4d_mask.patch
mm-gup-check-p4d-presence-before-going-on.patch
mm-x86-change-pxd_huge-behavior-to-exclude-swap-entries.patch
mm-sparc-change-pxd_huge-behavior-to-exclude-swap-entries.patch
mm-arm-use-macros-to-define-pmd-pud-helpers.patch
mm-arm-redefine-pmd_huge-with-pmd_leaf.patch
mm-arm64-merge-pxd_huge-and-pxd_leaf-definitions.patch
mm-powerpc-redefine-pxd_huge-with-pxd_leaf.patch
mm-gup-merge-pxd-huge-mapping-checks.patch
mm-treewide-replace-pxd_huge-with-pxd_leaf.patch
mm-treewide-remove-pxd_huge.patch
mm-arm-remove-pmd_thp_or_huge.patch
mm-document-pxd_leaf-api.patch
selftests-mm-run_vmtestssh-fix-hugetlb-mem-size-calculation.patch
selftests-mm-run_vmtestssh-fix-hugetlb-mem-size-calculation-fix.patch
mm-kconfig-config_pgtable_has_huge_leaves.patch
mm-hugetlb-declare-hugetlbfs_pagecache_present-non-static.patch
mm-make-hpage_pxd_-macros-even-if-thp.patch
mm-introduce-vma_pgtable_walk_beginend.patch
mm-arch-provide-pud_pfn-fallback.patch
mm-arch-provide-pud_pfn-fallback-fix.patch
mm-gup-drop-folio_fast_pin_allowed-in-hugepd-processing.patch
mm-gup-refactor-record_subpages-to-find-1st-small-page.patch
mm-gup-handle-hugetlb-for-no_page_table.patch
mm-gup-cache-pudp-in-follow_pud_mask.patch
mm-gup-handle-huge-pud-for-follow_pud_mask.patch
mm-gup-handle-huge-pmd-for-follow_pmd_mask.patch
mm-gup-handle-huge-pmd-for-follow_pmd_mask-fix.patch
mm-gup-handle-hugepd-for-follow_page.patch
mm-gup-handle-hugetlb-in-the-generic-follow_page_mask-code.patch
mm-allow-anon-exclusive-check-over-hugetlb-tail-pages.patch
mm-page_table_check-support-userfault-wr-protect-entries.patch
The quilt patch titled
Subject: userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE
has been removed from the -mm tree. Its filename was
userfaultfd-change-src_folio-after-ensuring-its-unpinned-in-uffdio_move.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Lokesh Gidra <lokeshgidra(a)google.com>
Subject: userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE
Date: Thu, 4 Apr 2024 10:17:26 -0700
Commit d7a08838ab74 ("mm: userfaultfd: fix unexpected change to src_folio
when UFFDIO_MOVE fails") moved the update of src_folio->{mapping, index}
to after the page table is cleared and the folio is confirmed not to be
pinned. This avoids failure of swapout+migration and possibly memory
corruption.
However, the commit missed fixing it in the huge-page case.
Link: https://lkml.kernel.org/r/20240404171726.2302435-1-lokeshgidra@google.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Lokesh Gidra <lokeshgidra(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: Lokesh Gidra <lokeshgidra(a)google.com>
Cc: Nicolas Geoffray <ngeoffray(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/huge_memory.c~userfaultfd-change-src_folio-after-ensuring-its-unpinned-in-uffdio_move
+++ a/mm/huge_memory.c
@@ -2259,9 +2259,6 @@ int move_pages_huge_pmd(struct mm_struct
goto unlock_ptls;
}
- folio_move_anon_rmap(src_folio, dst_vma);
- WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
-
src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
/* Folio got pinned from under us. Put it back and fail the move. */
if (folio_maybe_dma_pinned(src_folio)) {
@@ -2270,6 +2267,9 @@ int move_pages_huge_pmd(struct mm_struct
goto unlock_ptls;
}
+ folio_move_anon_rmap(src_folio, dst_vma);
+ WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
+
_dst_pmd = mk_huge_pmd(&src_folio->page, dst_vma->vm_page_prot);
/* Follow mremap() behavior and treat the entry dirty after the move */
_dst_pmd = pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
_
Patches currently in -mm which might be from lokeshgidra(a)google.com are
The quilt patch titled
Subject: mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
has been removed from the -mm tree. Its filename was
mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled-v2.patch
This patch was dropped because it was folded into mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled.patch
------------------------------------------------------
From: Miaohe Lin <linmiaohe(a)huawei.com>
Subject: mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
Date: Fri, 12 Apr 2024 10:57:54 +0800
extend comment per Oscar
Link: https://lkml.kernel.org/r/20240412025754.1897615-1-linmiaohe@huawei.com
Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Acked-by: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/mm/memory-failure.c~mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled-v2
+++ a/mm/memory-failure.c
@@ -159,6 +159,10 @@ static int __page_handle_poison(struct p
* dissolve_free_huge_page() might hold cpu_hotplug_lock via static_key_slow_dec()
* when hugetlb vmemmap optimization is enabled. This will break current lock
* dependency chain and leads to deadlock.
+ * Disabling pcp before dissolving the page was a deterministic approach because
+ * we made sure that those pages cannot end up in any PCP list. Draining PCP lists
+ * expels those pages to the buddy system, but nothing guarantees that those pages
+ * do not get back to a PCP queue if we need to refill those.
*/
ret = dissolve_free_huge_page(page);
if (!ret) {
_
Patches currently in -mm which might be from linmiaohe(a)huawei.com are
mm-memory-failure-fix-deadlock-when-hugetlb_optimize_vmemmap-is-enabled.patch
fork-defer-linking-file-vma-until-vma-is-fully-initialized.patch
The `module!` macro creates glue code that is called by C to initialize
the Rust modules using the `Module::init` function. Part of this glue
code are the local functions `__init` and `__exit` that are used to
initialize/destroy the Rust module.
These functions are safe and also visible to the Rust mod in which the
`module!` macro is invoked. This means that they can be called by other
safe Rust code. But since they contain `unsafe` blocks that rely on only
being called at the right time, this is a soundness issue.
Wrap these generated functions inside two private modules; this
guarantees that the public functions cannot be called from the outside.
Make the safe functions `unsafe` and add SAFETY comments.
Cc: stable(a)vger.kernel.org
Closes: https://github.com/Rust-for-Linux/linux/issues/629
Fixes: 1fbde52bde73 ("rust: add `macros` crate")
Signed-off-by: Benno Lossin <benno.lossin(a)proton.me>
---
v1: https://lore.kernel.org/rust-for-linux/20240327160346.22442-1-benno.lossin@…
v1 -> v2:
- wrapped `__init` and `__exit` calls in `unsafe` blocks and added
SAFETY comments,
- fixed safety requirement on `__exit` and `__init`,
- rebased onto rust-next.
rust/macros/module.rs | 213 +++++++++++++++++++++++++-----------------
1 file changed, 127 insertions(+), 86 deletions(-)
diff --git a/rust/macros/module.rs b/rust/macros/module.rs
index 27979e582e4b..293beca0a583 100644
--- a/rust/macros/module.rs
+++ b/rust/macros/module.rs
@@ -199,103 +199,144 @@ pub(crate) fn module(ts: TokenStream) -> TokenStream {
/// Used by the printing macros, e.g. [`info!`].
const __LOG_PREFIX: &[u8] = b\"{name}\\0\";
- /// The \"Rust loadable module\" mark.
- //
- // This may be best done another way later on, e.g. as a new modinfo
- // key or a new section. For the moment, keep it simple.
- #[cfg(MODULE)]
- #[doc(hidden)]
- #[used]
- static __IS_RUST_MODULE: () = ();
-
- static mut __MOD: Option<{type_}> = None;
-
- // SAFETY: `__this_module` is constructed by the kernel at load time and will not be
- // freed until the module is unloaded.
- #[cfg(MODULE)]
- static THIS_MODULE: kernel::ThisModule = unsafe {{
- kernel::ThisModule::from_ptr(&kernel::bindings::__this_module as *const _ as *mut _)
- }};
- #[cfg(not(MODULE))]
- static THIS_MODULE: kernel::ThisModule = unsafe {{
- kernel::ThisModule::from_ptr(core::ptr::null_mut())
- }};
-
- // Loadable modules need to export the `{{init,cleanup}}_module` identifiers.
- /// # Safety
- ///
- /// This function must not be called after module initialization, because it may be
- /// freed after that completes.
- #[cfg(MODULE)]
- #[doc(hidden)]
- #[no_mangle]
- #[link_section = \".init.text\"]
- pub unsafe extern \"C\" fn init_module() -> core::ffi::c_int {{
- __init()
- }}
+ // Double nested modules, since then nobody can access the public items inside.
+ mod __module_init {{
+ mod __module_init {{
+ use super::super::{type_};
+
+ /// The \"Rust loadable module\" mark.
+ //
+ // This may be best done another way later on, e.g. as a new modinfo
+ // key or a new section. For the moment, keep it simple.
+ #[cfg(MODULE)]
+ #[doc(hidden)]
+ #[used]
+ static __IS_RUST_MODULE: () = ();
+
+ static mut __MOD: Option<{type_}> = None;
+
+ // SAFETY: `__this_module` is constructed by the kernel at load time and will not be
+ // freed until the module is unloaded.
+ #[cfg(MODULE)]
+ static THIS_MODULE: kernel::ThisModule = unsafe {{
+ kernel::ThisModule::from_ptr(&kernel::bindings::__this_module as *const _ as *mut _)
+ }};
+ #[cfg(not(MODULE))]
+ static THIS_MODULE: kernel::ThisModule = unsafe {{
+ kernel::ThisModule::from_ptr(core::ptr::null_mut())
+ }};
+
+ // Loadable modules need to export the `{{init,cleanup}}_module` identifiers.
+ /// # Safety
+ ///
+ /// This function must not be called after module initialization, because it may be
+ /// freed after that completes.
+ #[cfg(MODULE)]
+ #[doc(hidden)]
+ #[no_mangle]
+ #[link_section = \".init.text\"]
+ pub unsafe extern \"C\" fn init_module() -> core::ffi::c_int {{
+ // SAFETY: this function is inaccessible to the outside due to the double
+ // module wrapping it. It is called exactly once by the C side via its
+ // unique name.
+ unsafe {{ __init() }}
+ }}
- #[cfg(MODULE)]
- #[doc(hidden)]
- #[no_mangle]
- pub extern \"C\" fn cleanup_module() {{
- __exit()
- }}
+ #[cfg(MODULE)]
+ #[doc(hidden)]
+ #[no_mangle]
+ pub extern \"C\" fn cleanup_module() {{
+ // SAFETY:
+ // - this function is inaccessible to the outside due to the double
+ // module wrapping it. It is called exactly once by the C side via its
+ // unique name,
+ // - furthermore it is only called after `init_module` has returned `0`
+ // (which delegates to `__init`).
+ unsafe {{ __exit() }}
+ }}
- // Built-in modules are initialized through an initcall pointer
- // and the identifiers need to be unique.
- #[cfg(not(MODULE))]
- #[cfg(not(CONFIG_HAVE_ARCH_PREL32_RELOCATIONS))]
- #[doc(hidden)]
- #[link_section = \"{initcall_section}\"]
- #[used]
- pub static __{name}_initcall: extern \"C\" fn() -> core::ffi::c_int = __{name}_init;
-
- #[cfg(not(MODULE))]
- #[cfg(CONFIG_HAVE_ARCH_PREL32_RELOCATIONS)]
- core::arch::global_asm!(
- r#\".section \"{initcall_section}\", \"a\"
- __{name}_initcall:
- .long __{name}_init - .
- .previous
- \"#
- );
+ // Built-in modules are initialized through an initcall pointer
+ // and the identifiers need to be unique.
+ #[cfg(not(MODULE))]
+ #[cfg(not(CONFIG_HAVE_ARCH_PREL32_RELOCATIONS))]
+ #[doc(hidden)]
+ #[link_section = \"{initcall_section}\"]
+ #[used]
+ pub static __{name}_initcall: extern \"C\" fn() -> core::ffi::c_int = __{name}_init;
+
+ #[cfg(not(MODULE))]
+ #[cfg(CONFIG_HAVE_ARCH_PREL32_RELOCATIONS)]
+ core::arch::global_asm!(
+ r#\".section \"{initcall_section}\", \"a\"
+ __{name}_initcall:
+ .long __{name}_init - .
+ .previous
+ \"#
+ );
+
+ #[cfg(not(MODULE))]
+ #[doc(hidden)]
+ #[no_mangle]
+ pub extern \"C\" fn __{name}_init() -> core::ffi::c_int {{
+ // SAFETY: this function is inaccessible to the outside due to the double
+ // module wrapping it. It is called exactly once by the C side via its
+ // placement above in the initcall section.
+ unsafe {{ __init() }}
+ }}
- #[cfg(not(MODULE))]
- #[doc(hidden)]
- #[no_mangle]
- pub extern \"C\" fn __{name}_init() -> core::ffi::c_int {{
- __init()
- }}
+ #[cfg(not(MODULE))]
+ #[doc(hidden)]
+ #[no_mangle]
+ pub extern \"C\" fn __{name}_exit() {{
+ // SAFETY:
+ // - this function is inaccessible to the outside due to the double
+ // module wrapping it. It is called exactly once by the C side via its
+ // unique name,
+ // - furthermore it is only called after `__{name}_init` has returned `0`
+ // (which delegates to `__init`).
+ unsafe {{ __exit() }}
+ }}
- #[cfg(not(MODULE))]
- #[doc(hidden)]
- #[no_mangle]
- pub extern \"C\" fn __{name}_exit() {{
- __exit()
- }}
+ /// # Safety
+ ///
+ /// This function must only be called once.
+ unsafe fn __init() -> core::ffi::c_int {{
+ match <{type_} as kernel::Module>::init(&THIS_MODULE) {{
+ Ok(m) => {{
+ // SAFETY:
+ // no data race, since `__MOD` can only be accessed by this module and
+ // there only `__init` and `__exit` access it. These functions are only
+ // called once and `__exit` cannot be called before or during `__init`.
+ unsafe {{
+ __MOD = Some(m);
+ }}
+ return 0;
+ }}
+ Err(e) => {{
+ return e.to_errno();
+ }}
+ }}
+ }}
- fn __init() -> core::ffi::c_int {{
- match <{type_} as kernel::Module>::init(&THIS_MODULE) {{
- Ok(m) => {{
+ /// # Safety
+ ///
+ /// This function must
+ /// - only be called once,
+ /// - be called after `__init` has been called and returned `0`.
+ unsafe fn __exit() {{
+ // SAFETY:
+ // no data race, since `__MOD` can only be accessed by this module and there
+ // only `__init` and `__exit` access it. These functions are only called once
+ // and `__init` was already called.
unsafe {{
- __MOD = Some(m);
+ // Invokes `drop()` on `__MOD`, which should be used for cleanup.
+ __MOD = None;
}}
- return 0;
}}
- Err(e) => {{
- return e.to_errno();
- }}
- }}
- }}
- fn __exit() {{
- unsafe {{
- // Invokes `drop()` on `__MOD`, which should be used for cleanup.
- __MOD = None;
+ {modinfo}
}}
}}
-
- {modinfo}
",
type_ = info.type_,
name = info.name,
base-commit: 9ffe2a730313f27cebd0859ea856247ac59c576c
--
2.44.0
In the LUCID EVO PLL, the CAL_L_VAL and L_VAL bitfields are part of a
single PLL_L_VAL register. Updating the L_VAL bitfield with the
regmap_write() API in the __alpha_pll_trion_set_rate callback overwrites
the whole register, and so overrides the LUCID EVO PLL initial
configuration held in the PLL_CAL_L_VAL bitfield of PLL_L_VAL.
Random PLL lock failures were observed during PLL enable because of this
override of the PLL calibration value. Use regmap_update_bits() with the
L_VAL bitfield mask instead of regmap_write() so that only the L_VAL
bitfield of PLL_L_VAL is updated in the __alpha_pll_trion_set_rate
callback.
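The difference between the two write styles can be modelled in user-space C. This is a minimal sketch, not driver code: the register is a plain `uint32_t`, `reg_write()` and `reg_update_bits()` are hypothetical stand-ins that only mimic the semantics of regmap_write() and regmap_update_bits(), and the field layout (L_VAL in bits 15:0, CAL_L_VAL in bits 31:16) is an assumption made for illustration.

```c
#include <stdint.h>

/* Assumed layout for illustration: L_VAL in bits 15:0, CAL_L_VAL in 31:16. */
#define L_VAL_MASK	0x0000ffffu
#define CAL_L_VAL_SHIFT	16

/* Full-register write: every bit of the old value is replaced. */
static uint32_t reg_write(uint32_t old, uint32_t val)
{
	(void)old;	/* the previous contents are discarded */
	return val;
}

/* Read-modify-write: only the bits selected by mask are changed. */
static uint32_t reg_update_bits(uint32_t old, uint32_t mask, uint32_t val)
{
	return (old & ~mask) | (val & mask);
}

/* Helpers to read the two fields back out of the register image. */
static uint32_t cal_l_of(uint32_t reg)
{
	return reg >> CAL_L_VAL_SHIFT;
}

static uint32_t l_of(uint32_t reg)
{
	return reg & L_VAL_MASK;
}
```

Starting from a register image with a non-zero calibration field, `reg_write()` with a new L value zeroes CAL_L_VAL, while `reg_update_bits()` with `L_VAL_MASK` changes only the L field and keeps the calibration value intact.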
Fixes: 260e36606a03 ("clk: qcom: clk-alpha-pll: add Lucid EVO PLL configuration interfaces")
Signed-off-by: Ajit Pandey <quic_ajipan(a)quicinc.com>
Cc: stable(a)vger.kernel.org
---
drivers/clk/qcom/clk-alpha-pll.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/clk/qcom/clk-alpha-pll.c b/drivers/clk/qcom/clk-alpha-pll.c
index 8a412ef47e16..81cabd28eabe 100644
--- a/drivers/clk/qcom/clk-alpha-pll.c
+++ b/drivers/clk/qcom/clk-alpha-pll.c
@@ -1656,7 +1656,7 @@ static int __alpha_pll_trion_set_rate(struct clk_hw *hw, unsigned long rate,
if (ret < 0)
return ret;
- regmap_write(pll->clkr.regmap, PLL_L_VAL(pll), l);
+ regmap_update_bits(pll->clkr.regmap, PLL_L_VAL(pll), LUCID_EVO_PLL_L_VAL_MASK, l);
regmap_write(pll->clkr.regmap, PLL_ALPHA_VAL(pll), a);
/* Latch the PLL input */
--
2.25.1
This is the start of the stable review cycle for the 6.8.7 release.
There are 172 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 17 Apr 2024 14:19:30 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.8.7-rc1.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.8.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.8.7-rc1
Fudongwang <fudong.wang(a)amd.com>
drm/amd/display: fix disable otg wa logic in DCN316
Wenjing Liu <wenjing.liu(a)amd.com>
drm/amd/display: always reset ODM mode in context when adding first plane
Alex Hung <alex.hung(a)amd.com>
drm/amd/display: Return max resolution supported by DWB
Dillon Varone <dillon.varone(a)amd.com>
drm/amd/display: Do not recursively call manual trigger programming
Harry Wentland <harry.wentland(a)amd.com>
drm/amd/display: Set VSC SDP Colorimetry same way for MST and SST
Harry Wentland <harry.wentland(a)amd.com>
drm/amd/display: Program VSC SDP colorimetry for all DP sinks >= 1.4
Yifan Zhang <yifan1.zhang(a)amd.com>
drm/amdgpu: differentiate external rev id for gfx 11.5.0
Tim Huang <Tim.Huang(a)amd.com>
drm/amdgpu: fix incorrect number of active RBs for gfx11
Alex Deucher <alexander.deucher(a)amd.com>
drm/amdgpu: always force full reset for SOC21
Lijo Lazar <lijo.lazar(a)amd.com>
drm/amdgpu: Reset dGPU if suspend got aborted
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/i915: Disable live M/N updates when using bigjoiner
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/i915: Disable port sync when bigjoiner is used
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/i915/psr: Disable PSR when bigjoiner is used
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/i915/cdclk: Fix CDCLK programming order when pipes are active
Josh Poimboeuf <jpoimboe(a)kernel.org>
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
Josh Poimboeuf <jpoimboe(a)kernel.org>
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
Josh Poimboeuf <jpoimboe(a)kernel.org>
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
Josh Poimboeuf <jpoimboe(a)kernel.org>
x86/bugs: Fix BHI handling of RRSBA
Ingo Molnar <mingo(a)kernel.org>
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
Josh Poimboeuf <jpoimboe(a)kernel.org>
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
Josh Poimboeuf <jpoimboe(a)kernel.org>
x86/bugs: Fix BHI documentation
Daniel Sneddon <daniel.sneddon(a)linux.intel.com>
x86/bugs: Fix return type of spectre_bhi_state()
Amir Goldstein <amir73il(a)gmail.com>
kernfs: annotate different lockdep class for of->mutex of writable files
Oleg Nesterov <oleg(a)redhat.com>
selftests: kselftest: Fix build failure with NOLIBC
Arnd Bergmann <arnd(a)arndb.de>
irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
Adam Dunlap <acdunlap(a)google.com>
x86/apic: Force native_apic_mem_read() to use the MOV instruction
Nathan Chancellor <nathan(a)kernel.org>
selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
John Stultz <jstultz(a)google.com>
selftests: timers: Fix abs() warning in posix_timers test
John Stultz <jstultz(a)google.com>
selftests: timers: Fix posix_timers ksft_print_msg() warning
Oleg Nesterov <oleg(a)redhat.com>
selftests/timers/posix_timers: Reimplement check_timer_distribution()
Sean Christopherson <seanjc(a)google.com>
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
Namhyung Kim <namhyung(a)kernel.org>
perf/x86: Fix out of range data
Gavin Shan <gshan(a)redhat.com>
vhost: Add smp_rmb() in vhost_enable_notify()
Gavin Shan <gshan(a)redhat.com>
vhost: Add smp_rmb() in vhost_vq_avail_empty()
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-dma: fix spi lpcg indices
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-lsio: fix pwm lpcg indices
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-dma: fix pwm lpcg indices
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-conn: fix usb lpcg indices
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-dma: fix adc lpcg indices
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-dma: fix can lpcg indices
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8qm-ss-dma: fix can lpcg indices
Lang Yu <Lang.Yu(a)amd.com>
drm/amdgpu/umsch: reinitialize write pointer in hw init
Johan Hovold <johan+linaro(a)kernel.org>
drm/msm/dp: fix runtime PM leak on connect failure
Johan Hovold <johan+linaro(a)kernel.org>
drm/msm/dp: fix runtime PM leak on disconnect
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/client: Fully protect modes[] with dev->mode_config.mutex
Boris Brezillon <boris.brezillon(a)collabora.com>
drm/panfrost: Fix the error path in panfrost_mmu_map_fault_addr()
Jammy Huang <jammy_huang(a)aspeedtech.com>
drm/ast: Fix soft lockup
Harish Kasiviswanathan <Harish.Kasiviswanathan(a)amd.com>
drm/amdkfd: Reset GPU on queue preemption failure
Ville Syrjälä <ville.syrjala(a)linux.intel.com>
drm/i915/vrr: Disable VRR when using bigjoiner
Zack Rusin <zack.rusin(a)broadcom.com>
drm/vmwgfx: Enable DMA mappings with SEV
Jacek Lawrynowicz <jacek.lawrynowicz(a)linux.intel.com>
accel/ivpu: Fix deadlock in context_xa
Jacek Lawrynowicz <jacek.lawrynowicz(a)linux.intel.com>
accel/ivpu: Return max freq for DRM_IVPU_PARAM_CORE_CLOCK_RATE
Jacek Lawrynowicz <jacek.lawrynowicz(a)linux.intel.com>
accel/ivpu: Put NPU back to D3hot after failed resume
Wachowski, Karol <karol.wachowski(a)intel.com>
accel/ivpu: Fix PCI D0 state entry in resume
Wachowski, Karol <karol.wachowski(a)intel.com>
accel/ivpu: Check return code of ipc->lock init
Alexander Wetzel <Alexander(a)wetzel-home.de>
scsi: sg: Avoid race in error handling & drop bogus warn
Alexander Wetzel <Alexander(a)wetzel-home.de>
scsi: sg: Avoid sg device teardown race
Masami Hiramatsu <mhiramat(a)kernel.org>
fs/proc: Skip bootloader comment if no embedded kernel parameters
Zhenhua Huang <quic_zhenhuah(a)quicinc.com>
fs/proc: remove redundant comments from /proc/bootconfig
Zheng Yejian <zhengyejian1(a)huawei.com>
kprobes: Fix possible use-after-free issue on kprobe registration
Pavel Begunkov <asml.silence(a)gmail.com>
io_uring/net: restore msg_control on sendzc retry
Boris Burkov <boris(a)bur.io>
btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
Boris Burkov <boris(a)bur.io>
btrfs: record delayed inode root in transaction
Boris Burkov <boris(a)bur.io>
btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
Boris Burkov <boris(a)bur.io>
btrfs: qgroup: correctly model root qgroup rsv in convert
Jens Axboe <axboe(a)kernel.dk>
io_uring: disable io-wq execution of multishot NOWAIT requests
Pavel Begunkov <asml.silence(a)gmail.com>
io_uring: refactor DEFER_TASKRUN multishot checks
Lu Baolu <baolu.lu(a)linux.intel.com>
iommu/vt-d: Fix WARN_ON in iommu probe path
Jacob Pan <jacob.jun.pan(a)linux.intel.com>
iommu/vt-d: Allocate local memory for page request queue
Xuchun Shang <xuchun.shang(a)linux.alibaba.com>
iommu/vt-d: Fix wrong use of pasid config
Arnd Bergmann <arnd(a)arndb.de>
tracing: hide unused ftrace_event_id_fops
Karthik Poosa <karthik.poosa(a)intel.com>
drm/xe/hwmon: Cast result to output precision on left shift of operand
Lucas De Marchi <lucas.demarchi(a)intel.com>
drm/xe/display: Fix double mutex initialization
David Arinzon <darinzon(a)amazon.com>
net: ena: Set tx_info->xdpf value to NULL
David Arinzon <darinzon(a)amazon.com>
net: ena: Fix incorrect descriptor free behavior
David Arinzon <darinzon(a)amazon.com>
net: ena: Wrong missing IO completions check order
David Arinzon <darinzon(a)amazon.com>
net: ena: Fix potential sign extension issue
Michal Luczaj <mhal(a)rbox.co>
af_unix: Fix garbage collector racing against connect()
Kuniyuki Iwashima <kuniyu(a)amazon.com>
af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
Arınç ÜNAL <arinc.unal(a)arinc9.com>
net: dsa: mt7530: trap link-local frames regardless of ST Port State
Gerd Bayer <gbayer(a)linux.ibm.com>
Revert "s390/ism: fix receive message buffer allocation"
Daniel Machon <daniel.machon(a)microchip.com>
net: sparx5: fix wrong config being used when reconfiguring PCS
Rahul Rameshbabu <rrameshbabu(a)nvidia.com>
net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
Carolina Jubran <cjubran(a)nvidia.com>
net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
Carolina Jubran <cjubran(a)nvidia.com>
net/mlx5e: Fix mlx5e_priv_init() cleanup flow
Carolina Jubran <cjubran(a)nvidia.com>
net/mlx5e: RSS, Block changing channels number when RXFH is configured
Cosmin Ratiu <cratiu(a)nvidia.com>
net/mlx5: Correctly compare pkt reformat ids
Cosmin Ratiu <cratiu(a)nvidia.com>
net/mlx5: Properly link new fs rules into the tree
Michael Liang <mliang(a)purestorage.com>
net/mlx5: offset comp irq index in name by one
Shay Drory <shayd(a)nvidia.com>
net/mlx5: Register devlink first under devlink lock
Moshe Shemesh <moshe(a)nvidia.com>
net/mlx5: SF, Stop waiting for FW as teardown was called
Eric Dumazet <edumazet(a)google.com>
netfilter: complete validation of user input
Archie Pusaka <apusaka(a)chromium.org>
Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: hci_sock: Fix not validating setsockopt user input
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: ISO: Fix not validating setsockopt user input
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: L2CAP: Fix not validating setsockopt user input
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: RFCOMM: Fix not validating setsockopt user input
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: SCO: Fix not validating setsockopt user input
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: hci_sync: Use QoS to determine which PHY to scan
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
Luiz Augusto von Dentz <luiz.von.dentz(a)intel.com>
Bluetooth: ISO: Align broadcast sync_timeout with connection timeout
Brett Creeley <brett.creeley(a)amd.com>
pds_core: Fix pdsc_check_pci_health function to use work thread
Shannon Nelson <shannon.nelson(a)amd.com>
pds_core: use pci_reset_function for health reset
Jiri Benc <jbenc(a)redhat.com>
ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
Arnd Bergmann <arnd(a)arndb.de>
ipv4/route: avoid unused-but-set-variable warning
Arnd Bergmann <arnd(a)arndb.de>
ipv6: fib: hide unused 'pn' variable
Geetha sowjanya <gakula(a)marvell.com>
octeontx2-af: Fix NIX SQ mode and BP config
Kuniyuki Iwashima <kuniyu(a)amazon.com>
af_unix: Clear stale u->oob_skb.
Marek Vasut <marex(a)denx.de>
net: ks8851: Handle softirqs at the end of IRQ thread to fix hang
Marek Vasut <marex(a)denx.de>
net: ks8851: Inline ks8851_rx_skb()
Dave Jiang <dave.jiang(a)intel.com>
cxl: Fix retrieving of access_coordinates in PCIe path
Dave Jiang <dave.jiang(a)intel.com>
cxl: Remove checking of iter in cxl_endpoint_get_perf_coordinates()
Dave Jiang <dave.jiang(a)intel.com>
cxl: Split out host bridge access coordinates
Dave Jiang <dave.jiang(a)intel.com>
cxl: Split out combine_coordinates() for common shared usage
Dave Jiang <dave.jiang(a)intel.com>
ACPI: HMAT / cxl: Add retrieval of generic port coordinates for both access classes
Dave Jiang <dave.jiang(a)intel.com>
ACPI: HMAT: Introduce 2 levels of generic port access class
Dave Jiang <dave.jiang(a)intel.com>
base/node / ACPI: Enumerate node access class for 'struct access_coordinate'
Raag Jadav <raag.jadav(a)intel.com>
ACPI: bus: allow _UID matching for integer zero
Pavan Chebbi <pavan.chebbi(a)broadcom.com>
bnxt_en: Reset PTP tx_avail after possible firmware reset
Vikas Gupta <vikas.gupta(a)broadcom.com>
bnxt_en: Fix error recovery for RoCE ulp client
Vikas Gupta <vikas.gupta(a)broadcom.com>
bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()
Gerd Bayer <gbayer(a)linux.ibm.com>
s390/ism: fix receive message buffer allocation
Eric Dumazet <edumazet(a)google.com>
geneve: fix header validation in geneve[6]_xmit_skb
Arnd Bergmann <arnd(a)arndb.de>
lib: checksum: hide unused expected_csum_ipv6_magic[]
Ming Lei <ming.lei(a)redhat.com>
block: fix q->blkg_list corruption during disk rebind
Hariprasad Kelam <hkelam(a)marvell.com>
octeontx2-pf: Fix transmit scheduler resource leak
Eric Dumazet <edumazet(a)google.com>
xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
Petr Tesarik <petr(a)tesarici.cz>
u64_stats: fix u64_stats_init() for lockdep when used repeatedly in one file
Ilya Maximets <i.maximets(a)ovn.org>
net: openvswitch: fix unwanted error log on timeout policy probing
Dan Carpenter <dan.carpenter(a)linaro.org>
scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
Xiang Chen <chenxiang66(a)hisilicon.com>
scsi: hisi_sas: Modify the deadline for ata_wait_after_reset()
Luca Weiss <luca.weiss(a)fairphone.com>
drm/msm/adreno: Set highest_bank_bit for A619
Arnd Bergmann <arnd(a)arndb.de>
nouveau: fix function cast warning
Alex Constantino <dreaming.about.electric.sheep(a)gmail.com>
Revert "drm/qxl: simplify qxl_fence_wait"
Kwangjin Ko <kwangjin.ko(a)sk.com>
cxl/core: Fix initialization of mbox_cmd.size_out in get event
Frank Li <Frank.Li(a)nxp.com>
arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order
Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
dt-bindings: display/msm: sm8150-mdss: add DP node
Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
drm/msm/dpu: make error messages at dpu_core_irq_register_callback() more sensible
Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
drm/msm/dpu: don't allow overriding data from catalog
Stephen Boyd <swboyd(a)chromium.org>
drm/msm: Add newlines to some debug prints
Tim Harvey <tharvey(a)gateworks.com>
arm64: dts: freescale: imx8mp-venice-gw73xx-2x: fix USB vbus regulator
Tim Harvey <tharvey(a)gateworks.com>
arm64: dts: freescale: imx8mp-venice-gw72xx-2x: fix USB vbus regulator
Dave Jiang <dave.jiang(a)intel.com>
cxl/core/regs: Fix usage of map->reg_type in cxl_decode_regblock() before assigned
Yuquan Wang <wangyuquan1236(a)phytium.com.cn>
cxl/mem: Fix for the index of Clear Event Record Handle
Cristian Marussi <cristian.marussi(a)arm.com>
firmware: arm_scmi: Make raw debugfs entries non-seekable
Jens Wiklander <jens.wiklander(a)linaro.org>
firmware: arm_ffa: Fix the partition ID check in ffa_notification_info_get()
Aaro Koskinen <aaro.koskinen(a)iki.fi>
ARM: OMAP2+: fix USB regression on Nokia N8x0
Aaro Koskinen <aaro.koskinen(a)iki.fi>
mmc: omap: restore original power up/down steps
Aaro Koskinen <aaro.koskinen(a)iki.fi>
mmc: omap: fix deferred probe
Aaro Koskinen <aaro.koskinen(a)iki.fi>
mmc: omap: fix broken slot switch lookup
Aaro Koskinen <aaro.koskinen(a)iki.fi>
ARM: OMAP2+: fix N810 MMC gpiod table
Aaro Koskinen <aaro.koskinen(a)iki.fi>
ARM: OMAP2+: fix bogus MMC GPIO labels on Nokia N8x0
David Sterba <dsterba(a)suse.com>
btrfs: tests: allocate dummy fs_info and root in test_find_delalloc()
Nini Song <nini.song(a)mediatek.com>
media: cec: core: remove length check of Timer Status
Anna-Maria Behnsen <anna-maria(a)linutronix.de>
PM: s2idle: Make sure CPUs will wakeup directly on resume
Hans de Goede <hdegoede(a)redhat.com>
ACPI: scan: Do not increase dep_unmet for already met dependencies
Noah Loomans <noah(a)noahloomans.com>
platform/chrome: cros_ec_uart: properly fix race condition
Tim Huang <Tim.Huang(a)amd.com>
drm/amd/pm: fixes a random hang in S4 for SMU v13.0.4/11
Dmitry Antipov <dmantipov(a)yandex.ru>
Bluetooth: Fix memory leak in hci_req_sync_complete()
Steven Rostedt (Google) <rostedt(a)goodmis.org>
ring-buffer: Only update pages_touched when a new page is touched
Yu Kuai <yukuai3(a)huawei.com>
raid1: fix use-after-free for original bio in raid1_write_request()
Fabio Estevam <festevam(a)denx.de>
ARM: dts: imx7s-warp: Pass OV2680 link-frequencies
Gavin Shan <gshan(a)redhat.com>
arm64: tlb: Fix TLBI RANGE operand
Breno Leitao <leitao(a)debian.org>
virtio_net: Do not send RSS key if it is not supported
Xiubo Li <xiubli(a)redhat.com>
ceph: switch to use cap_delay_lock for the unlink delay list
NeilBrown <neilb(a)suse.de>
ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Avoid infinite loop trying to resize local TT
Peyton Lee <peytolee(a)amd.com>
drm/amdgpu/vpe: power on vpe when hw_init
Damien Le Moal <dlemoal(a)kernel.org>
ata: libata-scsi: Fix ata_scsi_dev_rescan() error path
Igor Pylypiv <ipylypiv(a)google.com>
ata: libata-core: Allow command duration limits detection for ACS-4 drives
Steve French <stfrench(a)microsoft.com>
smb3: fix Open files on server counter going negative
-------------
Diffstat:
Documentation/admin-guide/hw-vuln/spectre.rst | 22 +-
Documentation/admin-guide/kernel-parameters.txt | 12 +-
.../bindings/display/msm/qcom,sm8150-mdss.yaml | 9 +
Makefile | 4 +-
arch/arm/boot/dts/nxp/imx/imx7s-warp.dts | 1 +
arch/arm/mach-omap2/board-n8x0.c | 23 +--
arch/arm64/boot/dts/freescale/imx8-ss-conn.dtsi | 16 +-
arch/arm64/boot/dts/freescale/imx8-ss-dma.dtsi | 40 ++--
arch/arm64/boot/dts/freescale/imx8-ss-lsio.dtsi | 16 +-
.../boot/dts/freescale/imx8mp-venice-gw72xx.dtsi | 2 +-
.../boot/dts/freescale/imx8mp-venice-gw73xx.dtsi | 2 +-
arch/arm64/boot/dts/freescale/imx8qm-ss-dma.dtsi | 8 +-
arch/arm64/include/asm/tlbflush.h | 20 +-
arch/x86/Kconfig | 21 +-
arch/x86/events/core.c | 1 +
arch/x86/include/asm/apic.h | 3 +-
arch/x86/kernel/apic/apic.c | 6 +-
arch/x86/kernel/cpu/bugs.c | 82 ++++----
arch/x86/kernel/cpu/common.c | 48 ++---
block/blk-cgroup.c | 9 +-
block/blk-cgroup.h | 2 +
block/blk-core.c | 2 +
drivers/accel/ivpu/ivpu_drv.c | 20 +-
drivers/accel/ivpu/ivpu_hw.h | 6 +
drivers/accel/ivpu/ivpu_hw_37xx.c | 7 +-
drivers/accel/ivpu/ivpu_hw_40xx.c | 6 +
drivers/accel/ivpu/ivpu_ipc.c | 8 +-
drivers/accel/ivpu/ivpu_pm.c | 7 +-
drivers/acpi/numa/hmat.c | 43 ++--
drivers/acpi/scan.c | 3 +-
drivers/ata/libata-core.c | 2 +-
drivers/ata/libata-scsi.c | 9 +-
drivers/base/node.c | 6 +-
drivers/cxl/acpi.c | 8 +-
drivers/cxl/core/cdat.c | 58 ++++--
drivers/cxl/core/mbox.c | 5 +-
drivers/cxl/core/port.c | 76 ++++---
drivers/cxl/core/regs.c | 5 +-
drivers/cxl/cxl.h | 8 +-
drivers/firmware/arm_ffa/driver.c | 2 +-
drivers/firmware/arm_scmi/raw_mode.c | 7 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 6 +
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 2 +-
drivers/gpu/drm/amd/amdgpu/soc21.c | 32 ++-
drivers/gpu/drm/amd/amdgpu/umsch_mm_v4_0.c | 2 +
.../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 15 +-
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_wb.c | 6 +-
.../amd/display/dc/clk_mgr/dcn316/dcn316_clk_mgr.c | 19 +-
drivers/gpu/drm/amd/display/dc/core/dc_state.c | 9 +
.../gpu/drm/amd/display/dc/optc/dcn32/dcn32_optc.c | 3 -
.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c | 12 +-
drivers/gpu/drm/ast/ast_dp.c | 3 +
drivers/gpu/drm/drm_client_modeset.c | 3 +-
drivers/gpu/drm/i915/display/intel_cdclk.c | 7 +-
drivers/gpu/drm/i915/display/intel_cdclk.h | 3 +
drivers/gpu/drm/i915/display/intel_ddi.c | 5 +
drivers/gpu/drm/i915/display/intel_dp.c | 6 +-
drivers/gpu/drm/i915/display/intel_psr.c | 11 +
drivers/gpu/drm/i915/display/intel_vrr.c | 7 +
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 +
drivers/gpu/drm/msm/disp/dpu1/dpu_core_perf.c | 10 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_hw_interrupts.c | 8 +-
drivers/gpu/drm/msm/dp/dp_display.c | 2 +
drivers/gpu/drm/msm/msm_fb.c | 6 +-
drivers/gpu/drm/msm/msm_kms.c | 4 +-
.../gpu/drm/nouveau/nvkm/subdev/bios/shadowof.c | 7 +-
drivers/gpu/drm/panfrost/panfrost_mmu.c | 13 +-
drivers/gpu/drm/qxl/qxl_release.c | 50 ++++-
drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 11 +-
drivers/gpu/drm/xe/xe_display.c | 5 -
drivers/gpu/drm/xe/xe_hwmon.c | 4 +-
drivers/iommu/intel/iommu.c | 11 +-
drivers/iommu/intel/perfmon.c | 2 +-
drivers/iommu/intel/svm.c | 2 +-
drivers/md/raid1.c | 2 +-
drivers/media/cec/core/cec-adap.c | 14 --
drivers/mmc/host/omap.c | 48 +++--
drivers/net/dsa/mt7530.c | 229 ++++++++++++++++++---
drivers/net/dsa/mt7530.h | 5 +
drivers/net/ethernet/amazon/ena/ena_com.c | 2 +-
drivers/net/ethernet/amazon/ena/ena_netdev.c | 35 ++--
drivers/net/ethernet/amazon/ena/ena_xdp.c | 4 +-
drivers/net/ethernet/amd/pds_core/core.c | 14 +-
drivers/net/ethernet/amd/pds_core/core.h | 5 +-
drivers/net/ethernet/amd/pds_core/dev.c | 3 +
drivers/net/ethernet/amd/pds_core/main.c | 8 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 6 +-
.../net/ethernet/marvell/octeontx2/af/rvu_nix.c | 22 +-
drivers/net/ethernet/marvell/octeontx2/nic/qos.c | 1 +
drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h | 8 +-
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c | 33 +--
drivers/net/ethernet/mellanox/mlx5/core/en/selq.c | 2 +
.../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 17 ++
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 -
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 7 +-
drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 17 +-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 37 ++--
drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 4 +-
.../ethernet/mellanox/mlx5/core/sf/dev/driver.c | 22 +-
drivers/net/ethernet/micrel/ks8851.h | 3 -
drivers/net/ethernet/micrel/ks8851_common.c | 16 +-
drivers/net/ethernet/micrel/ks8851_par.c | 11 -
drivers/net/ethernet/micrel/ks8851_spi.c | 11 -
.../net/ethernet/microchip/sparx5/sparx5_port.c | 4 +-
drivers/net/geneve.c | 4 +-
drivers/net/virtio_net.c | 28 ++-
drivers/platform/chrome/cros_ec_uart.c | 28 +--
drivers/scsi/hisi_sas/hisi_sas_main.c | 2 +-
drivers/scsi/qla2xxx/qla_edif.c | 2 +-
drivers/scsi/sg.c | 20 +-
drivers/vhost/vhost.c | 28 ++-
fs/btrfs/delayed-inode.c | 3 +
fs/btrfs/inode.c | 13 +-
fs/btrfs/ioctl.c | 37 +++-
fs/btrfs/qgroup.c | 2 +
fs/btrfs/root-tree.c | 10 -
fs/btrfs/root-tree.h | 2 -
fs/btrfs/tests/extent-io-tests.c | 28 ++-
fs/btrfs/transaction.c | 17 +-
fs/ceph/addr.c | 4 +-
fs/ceph/caps.c | 4 +-
fs/ceph/mds_client.c | 9 +-
fs/ceph/mds_client.h | 3 +-
fs/kernfs/file.c | 9 +-
fs/proc/bootconfig.c | 12 +-
fs/smb/client/cached_dir.c | 4 +-
include/acpi/acpi_bus.h | 8 +-
include/linux/bootconfig.h | 1 +
include/linux/dma-fence.h | 7 +
include/linux/irqflags.h | 2 +-
include/linux/node.h | 18 +-
include/linux/u64_stats_sync.h | 9 +-
include/net/addrconf.h | 4 +
include/net/af_unix.h | 2 +-
include/net/bluetooth/bluetooth.h | 11 +
include/net/ip_tunnels.h | 33 +++
init/main.c | 5 +
io_uring/io_uring.c | 25 +++
io_uring/net.c | 22 +-
io_uring/rw.c | 2 -
kernel/cpu.c | 3 +-
kernel/kprobes.c | 18 +-
kernel/power/suspend.c | 6 +
kernel/trace/ring_buffer.c | 6 +-
kernel/trace/trace_events.c | 4 +
lib/checksum_kunit.c | 5 +-
net/batman-adv/translation-table.c | 2 +-
net/bluetooth/hci_request.c | 4 +-
net/bluetooth/hci_sock.c | 21 +-
net/bluetooth/hci_sync.c | 66 +++++-
net/bluetooth/iso.c | 50 ++---
net/bluetooth/l2cap_core.c | 3 +-
net/bluetooth/l2cap_sock.c | 52 ++---
net/bluetooth/rfcomm/sock.c | 14 +-
net/bluetooth/sco.c | 23 +--
net/ipv4/netfilter/arp_tables.c | 4 +
net/ipv4/netfilter/ip_tables.c | 4 +
net/ipv4/route.c | 4 +-
net/ipv6/addrconf.c | 7 +-
net/ipv6/ip6_fib.c | 7 +-
net/ipv6/netfilter/ip6_tables.c | 4 +
net/openvswitch/conntrack.c | 5 +-
net/unix/af_unix.c | 8 +-
net/unix/garbage.c | 35 +++-
net/unix/scm.c | 8 +-
net/xdp/xsk.c | 2 +
tools/testing/selftests/kselftest.h | 33 ++-
tools/testing/selftests/timers/posix_timers.c | 105 +++++-----
170 files changed, 1559 insertions(+), 882 deletions(-)
The BLKRRPART ioctl used to report errors such as EIO before we changed
the blkdev_reread_part() logic.
Let's add a flag and capture the errors returned by bdev_disk_changed()
when the flag is set. Set this flag for the BLKRRPART path, where we
want errors to be reported when rereading the partitions on the disk.
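The error-propagation rule described above can be sketched in userspace C.
This is a simplified model under stated assumptions, not the kernel code:
struct fake_disk, scan_partitions(), open_whole() and MODE_STRICT_SCAN are
hypothetical stand-ins for the gendisk state, bdev_disk_changed(),
blkdev_get_whole() and BLK_OPEN_STRICT_SCAN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BLK_OPEN_STRICT_SCAN. */
#define MODE_STRICT_SCAN (1u << 5)

struct fake_disk {
	bool need_part_scan;	/* models GD_NEED_PART_SCAN */
	int scan_result;	/* what the simulated rescan returns */
	int openers;		/* models bd_openers */
};

/* Models bdev_disk_changed(): returns 0 or a negative errno. */
static int scan_partitions(const struct fake_disk *disk)
{
	return disk->scan_result;
}

/*
 * Models the patched open path: a scan failure is reported only when
 * the caller asked for it; otherwise the open still succeeds.
 */
static int open_whole(struct fake_disk *disk, unsigned int mode)
{
	assert(disk != NULL);

	disk->openers++;
	if (disk->need_part_scan) {
		int ret = scan_partitions(disk);

		if (ret && (mode & MODE_STRICT_SCAN)) {
			disk->openers--;	/* undo the open on failure */
			return ret;
		}
	}
	return 0;
}
```

In this model a plain open of a disk whose rescan fails still returns 0,
while the same open with MODE_STRICT_SCAN set returns the scan error and
leaves the opener count unchanged.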
Link: https://lore.kernel.org/all/20240320015134.GA14267@lst.de/
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Tested: Tested by simulating a failure of the block device; a new test
will be proposed to blktests.
Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part")
Reported-by: Saranya Muruganandam <saranyamohan(a)google.com>
Signed-off-by: Saranya Muruganandam <saranyamohan(a)google.com>
Change-Id: Idf3d97390ed78061556f8468d10d6cab24ae20b1
---
block/bdev.c | 31 +++++++++++++++++++++----------
block/ioctl.c | 3 ++-
include/linux/blkdev.h | 3 +++
3 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 77fa77cd29bee..71478f8865546 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -632,6 +632,14 @@ static void blkdev_flush_mapping(struct block_device *bdev)
bdev_write_inode(bdev);
}
+static void blkdev_put_whole(struct block_device *bdev)
+{
+ if (atomic_dec_and_test(&bdev->bd_openers))
+ blkdev_flush_mapping(bdev);
+ if (bdev->bd_disk->fops->release)
+ bdev->bd_disk->fops->release(bdev->bd_disk);
+}
+
static int blkdev_get_whole(struct block_device *bdev, blk_mode_t mode)
{
struct gendisk *disk = bdev->bd_disk;
@@ -650,18 +658,21 @@ static int blkdev_get_whole(struct block_device *bdev, blk_mode_t mode)
if (!atomic_read(&bdev->bd_openers))
set_init_blocksize(bdev);
- if (test_bit(GD_NEED_PART_SCAN, &disk->state))
- bdev_disk_changed(disk, false);
atomic_inc(&bdev->bd_openers);
- return 0;
-}
-static void blkdev_put_whole(struct block_device *bdev)
-{
- if (atomic_dec_and_test(&bdev->bd_openers))
- blkdev_flush_mapping(bdev);
- if (bdev->bd_disk->fops->release)
- bdev->bd_disk->fops->release(bdev->bd_disk);
+ if (test_bit(GD_NEED_PART_SCAN, &disk->state)) {
+ /*
+ * Only return scanning errors if we are called from contexts
+ * that explicitly want them, e.g. the BLKRRPART ioctl.
+ */
+ ret = bdev_disk_changed(disk, false);
+ if (ret && (mode & BLK_OPEN_STRICT_SCAN)) {
+ blkdev_put_whole(bdev);
+ return ret;
+ }
+ }
+
+ return 0;
}
static int blkdev_get_part(struct block_device *part, blk_mode_t mode)
diff --git a/block/ioctl.c b/block/ioctl.c
index aa46f3761c3ed..e8d72d9f327fd 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -557,7 +557,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return -EACCES;
if (bdev_is_partition(bdev))
return -EINVAL;
- return disk_scan_partitions(bdev->bd_disk, mode);
+ return disk_scan_partitions(bdev->bd_disk,
+ mode | BLK_OPEN_STRICT_SCAN);
case BLKTRACESTART:
case BLKTRACESTOP:
case BLKTRACETEARDOWN:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 01983eece8f2a..d0104dc839b0d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -151,6 +151,9 @@ struct access_rules_head {
int max_rules;
};
+/* return partition scanning errors */
+#define BLK_OPEN_STRICT_SCAN ((__force blk_mode_t)(1 << 5))
+
struct gendisk {
/*
* major/first_minor/minors should not be set by any new driver, the
--
2.44.0.478.gd926399ef9-goog
The following behavior is inconsistent:
* For request-based dm queues the default value of rq_affinity is 1.
* For bio-based dm queues the default value of rq_affinity is 0.
The default value for request-based dm queues is 1 because of the following
code in blk_mq_init_allocated_queue():
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
From <linux/blkdev.h>:
#define QUEUE_FLAG_MQ_DEFAULT ((1UL << QUEUE_FLAG_IO_STAT) | \
(1UL << QUEUE_FLAG_SAME_COMP) | \
(1UL << QUEUE_FLAG_NOWAIT))
The default value of rq_affinity for bio-based dm queues is 0 because the
dm alloc_dev() function does not set any of the QUEUE_FLAG_SAME_* flags. I
think the different default values are the result of an oversight when
blk-mq support was added in the device mapper code. Hence, this patch
changes the default value of rq_affinity from 0 to 1 for bio-based dm
queues.
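The relationship between the queue flags and the reported rq_affinity value
can be illustrated with a small userspace sketch. The bit positions and the
helper names (flag_is_set(), rq_affinity(), FLAGS_MQ_DEFAULT) are
assumptions made for illustration; the real definitions live in
<linux/blkdev.h>, and this model deliberately ignores the "force" variant
of the setting.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative bit positions only; not the kernel's actual values. */
enum { FLAG_IO_STAT = 0, FLAG_SAME_COMP = 1, FLAG_NOWAIT = 2 };

/* Mirrors the shape of QUEUE_FLAG_MQ_DEFAULT quoted above. */
#define FLAGS_MQ_DEFAULT ((1UL << FLAG_IO_STAT) | \
			  (1UL << FLAG_SAME_COMP) | \
			  (1UL << FLAG_NOWAIT))

static int flag_is_set(unsigned long flags, unsigned int bit)
{
	assert(bit < sizeof(flags) * CHAR_BIT);
	return (flags >> bit) & 1UL;
}

/* Simplified rq_affinity: 1 if the SAME_COMP bit is set, else 0. */
static int rq_affinity(unsigned long flags)
{
	return flag_is_set(flags, FLAG_SAME_COMP);
}
```

With these definitions, rq_affinity(FLAGS_MQ_DEFAULT) models the
request-based default, rq_affinity(0) models a bio-based queue before the
patch, and rq_affinity(1UL << FLAG_SAME_COMP) models it afterwards.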
This patch reduces the boot time from 12.23 to 12.20 seconds on my test
setup, a Pixel 2023 development board. The storage controller on that test
setup supports a single completion interrupt and hence benefits from
redirecting I/O completions to a CPU core that is closer to the submitter.
Cc: Mikulas Patocka <mpatocka(a)redhat.com>
Cc: Eric Biggers <ebiggers(a)kernel.org>
Cc: Jaegeuk Kim <jaegeuk(a)kernel.org>
Cc: Daniel Lee <chullee(a)google.com>
Cc: stable(a)vger.kernel.org
Fixes: bfebd1cdb497 ("dm: add full blk-mq support to request-based DM")
Signed-off-by: Bart Van Assche <bvanassche(a)acm.org>
---
drivers/md/dm.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 56aa2a8b9d71..9af216c11cf7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2106,6 +2106,7 @@ static struct mapped_device *alloc_dev(int minor)
if (IS_ERR(md->disk))
goto bad;
md->queue = md->disk->queue;
+ blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, md->queue);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb()
which the core scheduler code has depended upon since commit:
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can
unset the actively used cid when it fails to observe an active task after
it sets lazy_put.
There *is* a memory barrier between storing to rq->curr and _return to
userspace_ (as required by membarrier), but the rseq mm_cid has stricter
requirements: the barrier needs to be issued between store to rq->curr
and switch_mm_cid(), which happens earlier than:
- spin_unlock(),
- switch_to().
So it's fine when the architecture switch_mm() happens to have that
barrier already, but less so when the architecture only provides the
full barrier in switch_to() or spin_unlock().
It is a bug in the rseq switch_mm_cid() implementation. All architectures
that don't have memory barriers in switch_mm(), but rather have the full
barrier either in finish_lock_switch() or switch_to() have them too late
for the needs of switch_mm_cid().
Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the
generic barrier.h header, and use it in switch_mm_cid() for scheduler
transitions where switch_mm() is expected to provide a memory barrier.
Architectures can override smp_mb__after_switch_mm() if their
switch_mm() implementation provides an implicit memory barrier.
Override it with a no-op on x86, which implicitly provides this memory
barrier by writing to CR3.
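The override pattern described above can be sketched in portable C11. This
is only a userspace illustration: my_mb_after_switch_mm(), enum
barrier_kind and to_user_barrier() are hypothetical names, and
atomic_thread_fence() stands in for smp_mb(). Only the "to user" branch
touched by this patch is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * An architecture whose switch_mm() already implies a full barrier could
 * define the hook as a no-op before this point:
 *
 *	#define my_mb_after_switch_mm() do { } while (0)
 */

/* Generic fallback, used only when no override was defined. */
#ifndef my_mb_after_switch_mm
#define my_mb_after_switch_mm() atomic_thread_fence(memory_order_seq_cst)
#endif

enum barrier_kind { BARRIER_FULL, BARRIER_AFTER_SWITCH_MM };

/*
 * Models the "to user" branch: coming from a kernel thread needs an
 * explicit full barrier, coming from another user task goes through
 * the overridable hook.
 */
static enum barrier_kind to_user_barrier(bool prev_has_mm)
{
	if (!prev_has_mm) {
		atomic_thread_fence(memory_order_seq_cst);
		return BARRIER_FULL;
	}
	my_mb_after_switch_mm();
	return BARRIER_AFTER_SWITCH_MM;
}
```

The point of the #ifndef guard is that generic code always calls the hook,
and each architecture decides at compile time whether it costs a fence or
nothing.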
Link: https://lore.kernel.org/lkml/20240305145335.2696125-1-yeoreum.yun@arm.com/
Reported-by: levi.yun <yeoreum.yun(a)arm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com> # for arm64
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com> # for x86
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Cc: <stable(a)vger.kernel.org> # 6.4.x
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: Valentin Schneider <vschneid(a)redhat.com>
Cc: levi.yun <yeoreum.yun(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Aaron Lu <aaron.lu(a)intel.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: linux-arch(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: x86(a)kernel.org
---
arch/x86/include/asm/barrier.h | 3 +++
include/asm-generic/barrier.h | 8 ++++++++
kernel/sched/sched.h | 20 ++++++++++++++------
3 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index fe1e7e3cc844..63bdc6b85219 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -79,6 +79,9 @@ do { \
#define __smp_mb__before_atomic() do { } while (0)
#define __smp_mb__after_atomic() do { } while (0)
+/* Writing to CR3 provides a full memory barrier in switch_mm(). */
+#define smp_mb__after_switch_mm() do { } while (0)
+
#include <asm-generic/barrier.h>
#endif /* _ASM_X86_BARRIER_H */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 0c0695763bea..dc32b96140c1 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -294,5 +294,13 @@ do { \
#define io_stop_wc() do { } while (0)
#endif
+/*
+ * Architectures that guarantee an implicit smp_mb() in switch_mm()
+ * can override smp_mb__after_switch_mm.
+ */
+#ifndef smp_mb__after_switch_mm
+#define smp_mb__after_switch_mm() smp_mb()
+#endif
+
#endif /* !__ASSEMBLY__ */
#endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..d2895d264196 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -79,6 +79,8 @@
# include <asm/paravirt_api_clock.h>
#endif
+#include <asm/barrier.h>
+
#include "cpupri.h"
#include "cpudeadline.h"
@@ -3445,13 +3447,19 @@ static inline void switch_mm_cid(struct rq *rq,
* between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
* Provide it here.
*/
- if (!prev->mm) // from kernel
+ if (!prev->mm) { // from kernel
smp_mb();
- /*
- * user -> user transition guarantees a memory barrier through
- * switch_mm() when current->mm changes. If current->mm is
- * unchanged, no barrier is needed.
- */
+ } else { // from user
+ /*
+ * user -> user transition relies on an implicit
+ * memory barrier in switch_mm() when
+ * current->mm changes. If the architecture
+ * switch_mm() does not have an implicit memory
+ * barrier, it is emitted here. If current->mm
+ * is unchanged, no barrier is needed.
+ */
+ smp_mb__after_switch_mm();
+ }
}
if (prev->mm_cid_active) {
mm_cid_snapshot_time(rq, prev->mm);
--
2.39.2
Inspired by a patch from [Justin][1] I took a closer look at kdb_read().
Despite Justin's patch being a (correct) one-line manipulation it was a
tough patch to review because the surrounding code was hard to read and
it looked like there were unfixed problems.
This series isn't enough to make kdb_read() beautiful but it does make
it shorter and easier to reason about, and it fixes a buffer overflow and
a screen redraw problem!
[1]: https://lore.kernel.org/all/20240403-strncpy-kernel-debug-kdb-kdb_io-c-v1-1…
Signed-off-by: Daniel Thompson <daniel.thompson(a)linaro.org>
---
Daniel Thompson (7):
kdb: Fix buffer overflow during tab-complete
kdb: Use format-strings rather than '\0' injection in kdb_read()
kdb: Fix console handling when editing and tab-completing commands
kdb: Replace double memcpy() with memmove() in kdb_read()
kdb: Merge identical case statements in kdb_read()
kdb: Use format-specifiers rather than memset() for padding in kdb_read()
kdb: Simplify management of tmpbuffer in kdb_read()
kernel/debug/kdb/kdb_io.c | 133 ++++++++++++++++++++--------------------------
1 file changed, 58 insertions(+), 75 deletions(-)
---
base-commit: dccce9b8780618986962ba37c373668bcf426866
change-id: 20240415-kgdb_read_refactor-2ea2dfc15dbb
Best regards,
--
Daniel Thompson <daniel.thompson(a)linaro.org>
This reverts commit 7dcd3e014aa7faeeaf4047190b22d8a19a0db696.
Qualcomm Bluetooth controllers like WCN6855 do not have persistent
storage for the Bluetooth address and must therefore start as
unconfigured to allow the user to set a valid address unless one has
been provided by the boot firmware in the devicetree.
A recent change snuck into v6.8-rc7 and incorrectly started marking the
default (non-unique) address as valid. This specifically also breaks the
Bluetooth setup for some users of the Lenovo ThinkPad X13s.
Note that this is the second time Qualcomm breaks the driver this way
and that this was fixed last year by commit 6945795bc81a ("Bluetooth:
fix use-bdaddr-property quirk"), which also has some further details.
Fixes: 7dcd3e014aa7 ("Bluetooth: hci_qca: Set BDA quirk bit if fwnode exists in DT")
Cc: stable(a)vger.kernel.org # 6.8
Cc: Janaki Ramaiah Thota <quic_janathot(a)quicinc.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/bluetooth/hci_qca.c | 13 +------------
1 file changed, 1 insertion(+), 12 deletions(-)
diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index edd2a81b4d5e..f989c05f8177 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -7,7 +7,6 @@
*
* Copyright (C) 2007 Texas Instruments, Inc.
* Copyright (c) 2010, 2012, 2018 The Linux Foundation. All rights reserved.
- * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
*
* Acknowledgements:
* This file is based on hci_ll.c, which was...
@@ -1904,17 +1903,7 @@ static int qca_setup(struct hci_uart *hu)
case QCA_WCN6750:
case QCA_WCN6855:
case QCA_WCN7850:
-
- /* Set BDA quirk bit for reading BDA value from fwnode property
- * only if that property exist in DT.
- */
- if (fwnode_property_present(dev_fwnode(hdev->dev.parent), "local-bd-address")) {
- set_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks);
- bt_dev_info(hdev, "setting quirk bit to read BDA from fwnode later");
- } else {
- bt_dev_dbg(hdev, "local-bd-address` is not present in the devicetree so not setting quirk bit for BDA");
- }
-
+ set_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks);
hci_set_aosp_capable(hdev);
ret = qca_read_soc_version(hdev, &ver, soc_type);
--
2.43.2
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024041526-whisking-flyover-825a@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
325f3fb551f8 ("kprobes: Fix possible use-after-free issue on kprobe registration")
1efda38d6f9b ("kprobes: Prohibit probes in gate area")
28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
223a76b268c9 ("kprobes: Fix coding style issues")
9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes")
02afb8d6048d ("kprobe: Simplify prepare_kprobe() by dropping redundant version")
9840cfcb97fc ("Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Wed, 10 Apr 2024 09:58:02 +0800
Subject: [PATCH] kprobes: Fix possible use-after-free issue on kprobe
registration
When unloading a module, its state changes MODULE_STATE_LIVE ->
MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change takes some
time. `is_module_text_address()` and `__module_text_address()`
work with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one succeeds but the
next one fails because module->state becomes MODULE_STATE_UNFORMED
between those operations.
In `check_kprobe_address_safe()`, if the second `__module_text_address()`
fails, the failure is ignored because a kernel_text address is assumed.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
a non-existent module text address (use-after-free).
To fix this problem, we should not use `is_module_text_address()`
and `__module_text_address()` separately, but call `__module_text_address()`
only once and then do `try_module_get(module)`, which only succeeds with
MODULE_STATE_LIVE.
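The single-lookup approach described above can be modelled in userspace C.
This is a simplified sketch, not the kernel implementation: struct
fake_module, lookup_text(), try_get() and check_addr() are hypothetical
stand-ins for struct module, `__module_text_address()`,
`try_module_get()` and the relevant part of `check_kprobe_address_safe()`,
and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mod_state { MOD_LIVE, MOD_GOING, MOD_UNFORMED };

struct fake_module {
	enum mod_state state;
	unsigned long start, end;	/* text range [start, end) */
	int refcount;
};

/* Models the text lookup: the module is visible while LIVE or GOING. */
static struct fake_module *lookup_text(struct fake_module *m,
				       unsigned long addr)
{
	assert(m != NULL);
	if (m->state == MOD_UNFORMED)
		return NULL;
	return (addr >= m->start && addr < m->end) ? m : NULL;
}

/* Models taking a reference: only succeeds for a LIVE module. */
static bool try_get(struct fake_module *m)
{
	if (m->state != MOD_LIVE)
		return false;
	m->refcount++;
	return true;
}

/*
 * One lookup, then a reference on that same result. Returns 0 on
 * success (*probed stays NULL for core kernel text), -1 otherwise.
 */
static int check_addr(struct fake_module *m, unsigned long addr,
		      bool is_core_text, struct fake_module **probed)
{
	*probed = NULL;
	if (!is_core_text) {
		*probed = lookup_text(m, addr);
		if (!*probed)
			return -1;	/* neither core nor module text */
		if (!try_get(*probed))
			return -1;	/* module is already going away */
	}
	return 0;
}
```

Because the pointer that is checked is the same pointer that is later
pinned, a state change between two separate lookups can no longer turn a
module address into an apparent kernel-text address.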
Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/
Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable(a)vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..65adc815fc6e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1567,10 +1567,17 @@ static int check_kprobe_address_safe(struct kprobe *p,
jump_label_lock();
preempt_disable();
- /* Ensure it is not in reserved area nor out of text */
- if (!(core_kernel_text((unsigned long) p->addr) ||
- is_module_text_address((unsigned long) p->addr)) ||
- in_gate_area_no_mm((unsigned long) p->addr) ||
+ /* Ensure the address is in a text area, and find a module if exists. */
+ *probed_mod = NULL;
+ if (!core_kernel_text((unsigned long) p->addr)) {
+ *probed_mod = __module_text_address((unsigned long) p->addr);
+ if (!(*probed_mod)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+ /* Ensure it is not in reserved area. */
+ if (in_gate_area_no_mm((unsigned long) p->addr) ||
within_kprobe_blacklist((unsigned long) p->addr) ||
jump_label_text_reserved(p->addr, p->addr) ||
static_call_text_reserved(p->addr, p->addr) ||
@@ -1580,8 +1587,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
- /* Check if 'p' is probing a module. */
- *probed_mod = __module_text_address((unsigned long) p->addr);
+ /* Get module refcount and reject __init functions for loaded modules. */
if (*probed_mod) {
/*
* We must hold a refcount of the probed module while updating
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024041525-compel-delta-953a@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
325f3fb551f8 ("kprobes: Fix possible use-after-free issue on kprobe registration")
1efda38d6f9b ("kprobes: Prohibit probes in gate area")
28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
223a76b268c9 ("kprobes: Fix coding style issues")
9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes")
02afb8d6048d ("kprobe: Simplify prepare_kprobe() by dropping redundant version")
9840cfcb97fc ("Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Wed, 10 Apr 2024 09:58:02 +0800
Subject: [PATCH] kprobes: Fix possible use-after-free issue on kprobe
registration
When unloading a module, its state changes MODULE_STATE_LIVE ->
MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change takes some
time. `is_module_text_address()` and `__module_text_address()`
work with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one succeeds but the
next one fails because module->state becomes MODULE_STATE_UNFORMED
between those operations.
In `check_kprobe_address_safe()`, if the second `__module_text_address()`
fails, the failure is ignored because a kernel_text address is assumed.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
a non-existent module text address (use-after-free).
To fix this problem, we should not use `is_module_text_address()`
and `__module_text_address()` separately, but call `__module_text_address()`
only once and then do `try_module_get(module)`, which only succeeds with
MODULE_STATE_LIVE.
Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/
Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable(a)vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..65adc815fc6e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1567,10 +1567,17 @@ static int check_kprobe_address_safe(struct kprobe *p,
jump_label_lock();
preempt_disable();
- /* Ensure it is not in reserved area nor out of text */
- if (!(core_kernel_text((unsigned long) p->addr) ||
- is_module_text_address((unsigned long) p->addr)) ||
- in_gate_area_no_mm((unsigned long) p->addr) ||
+ /* Ensure the address is in a text area, and find a module if exists. */
+ *probed_mod = NULL;
+ if (!core_kernel_text((unsigned long) p->addr)) {
+ *probed_mod = __module_text_address((unsigned long) p->addr);
+ if (!(*probed_mod)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+ /* Ensure it is not in reserved area. */
+ if (in_gate_area_no_mm((unsigned long) p->addr) ||
within_kprobe_blacklist((unsigned long) p->addr) ||
jump_label_text_reserved(p->addr, p->addr) ||
static_call_text_reserved(p->addr, p->addr) ||
@@ -1580,8 +1587,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
- /* Check if 'p' is probing a module. */
- *probed_mod = __module_text_address((unsigned long) p->addr);
+ /* Get module refcount and reject __init functions for loaded modules. */
if (*probed_mod) {
/*
* We must hold a refcount of the probed module while updating
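The race described in the commit message above can be modeled in userspace. The following is only a minimal sketch under stated assumptions, not kernel code: `struct fake_module`, `lookup_text()`, `try_get()`, `check_old()` and `check_new()` are hypothetical stand-ins for the module struct, `__module_text_address()`, `try_module_get()` and the old and new logic of `check_kprobe_address_safe()`. The `unload_between` flag simulates the module reaching MODULE_STATE_UNFORMED at the worst possible moment.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the module states named in the commit. */
enum mod_state { MOD_LIVE, MOD_GOING, MOD_UNFORMED };

struct fake_module {
	enum mod_state state;
	unsigned long text_start;
	unsigned long text_end;
};

/* Stand-in for __module_text_address(): an UNFORMED module is invisible. */
static struct fake_module *lookup_text(struct fake_module *m, unsigned long addr)
{
	if (m->state == MOD_UNFORMED)
		return NULL;
	if (addr < m->text_start || addr >= m->text_end)
		return NULL;
	return m;
}

/* Stand-in for try_module_get(): modeled as succeeding only for LIVE. */
static bool try_get(struct fake_module *m)
{
	return m->state == MOD_LIVE;
}

/* Old pattern: one lookup to validate the address, a second to find the module. */
static int check_old(struct fake_module *m, unsigned long addr,
		     bool is_core_text, bool unload_between)
{
	struct fake_module *probed;

	if (!is_core_text && !lookup_text(m, addr))
		return -EINVAL;

	if (unload_between)
		m->state = MOD_UNFORMED;	/* simulated concurrent unload */

	probed = lookup_text(m, addr);
	if (probed && !try_get(probed))
		return -ENOENT;

	/* probed == NULL is silently treated as "core kernel text" here. */
	return 0;
}

/* New pattern: a single lookup whose result is reused for the refcount step. */
static int check_new(struct fake_module *m, unsigned long addr,
		     bool is_core_text, bool unload_between)
{
	struct fake_module *probed = NULL;

	if (!is_core_text) {
		probed = lookup_text(m, addr);
		if (!probed)
			return -EINVAL;
	}

	if (unload_between)
		m->state = MOD_UNFORMED;	/* simulated concurrent unload */

	if (probed && !try_get(probed))
		return -ENOENT;

	return 0;
}
```

In this model, a module that starts in `MOD_GOING` and is unloaded between the two steps is accepted by `check_old()` (the stale address would then be armed), while `check_new()` rejects it with `-ENOENT` because the module pointer from the single lookup is still checked by `try_get()`.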
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024041524-unlovely-blemish-8954@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
325f3fb551f8 ("kprobes: Fix possible use-after-free issue on kprobe registration")
1efda38d6f9b ("kprobes: Prohibit probes in gate area")
28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
223a76b268c9 ("kprobes: Fix coding style issues")
9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes")
02afb8d6048d ("kprobe: Simplify prepare_kprobe() by dropping redundant version")
9840cfcb97fc ("Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8 Mon Sep 17 00:00:00 2001
From: Zheng Yejian <zhengyejian1(a)huawei.com>
Date: Wed, 10 Apr 2024 09:58:02 +0800
Subject: [PATCH] kprobes: Fix possible use-after-free issue on kprobe
registration
When a module is unloaded, its state changes MODULE_STATE_LIVE ->
MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each transition takes
some time. `is_module_text_address()` and `__module_text_address()`
both work with modules in MODULE_STATE_LIVE and MODULE_STATE_GOING.
If `is_module_text_address()` and `__module_text_address()` are called
separately, the first call can succeed while the second one fails,
because module->state may become MODULE_STATE_UNFORMED between the
two calls.
In `check_kprobe_address_safe()`, a failure of the second
`__module_text_address()` call is ignored, because the code then assumes
the address is a kernel_text address. But the call may have failed simply
because module->state has changed to MODULE_STATE_UNFORMED. In that case,
arm_kprobe() will try to modify a module text address that no longer
exists (use-after-free).
To fix this problem, do not call `is_module_text_address()` and
`__module_text_address()` separately. Call `__module_text_address()`
only once and then do `try_module_get(module)`, which succeeds only in
MODULE_STATE_LIVE.
Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/
Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable(a)vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1(a)huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..65adc815fc6e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1567,10 +1567,17 @@ static int check_kprobe_address_safe(struct kprobe *p,
jump_label_lock();
preempt_disable();
- /* Ensure it is not in reserved area nor out of text */
- if (!(core_kernel_text((unsigned long) p->addr) ||
- is_module_text_address((unsigned long) p->addr)) ||
- in_gate_area_no_mm((unsigned long) p->addr) ||
+ /* Ensure the address is in a text area, and find a module if exists. */
+ *probed_mod = NULL;
+ if (!core_kernel_text((unsigned long) p->addr)) {
+ *probed_mod = __module_text_address((unsigned long) p->addr);
+ if (!(*probed_mod)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+ /* Ensure it is not in reserved area. */
+ if (in_gate_area_no_mm((unsigned long) p->addr) ||
within_kprobe_blacklist((unsigned long) p->addr) ||
jump_label_text_reserved(p->addr, p->addr) ||
static_call_text_reserved(p->addr, p->addr) ||
@@ -1580,8 +1587,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
- /* Check if 'p' is probing a module. */
- *probed_mod = __module_text_address((unsigned long) p->addr);
+ /* Get module refcount and reject __init functions for loaded modules. */
if (*probed_mod) {
/*
* We must hold a refcount of the probed module while updating