Commit 14be4e6f3522 ("selftests: vDSO: fix ELF hash table entry size for s390x")
changed the type of the ELF hash table entries to 64-bit on s390x.
However, the *GNU* hash table entries are always 32-bit.
The "bucket" pointer is shared between both hash algorithms.
On s390x the GNU algorithm therefore uses this pointer to 64-bit values
as if it pointed to 32-bit values, leading to compiler warnings and
runtime crashes.
Introduce a new dedicated "gnu_bucket" pointer which is used by the GNU hash
algorithm.
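For reference, the DT_GNU_HASH layout looks roughly as follows (an
illustrative sketch, not part of the patch; only the bloom filter words
scale with the ELF class):

	struct gnu_hash_table {
		uint32_t nbucket;	/* number of hash buckets */
		uint32_t symoffset;	/* index of the first hashed symbol */
		uint32_t bloom_size;	/* number of bloom filter words */
		uint32_t bloom_shift;
		/* followed by:
		 *   size_t   bloom[bloom_size];  - 64-bit words on s390x
		 *   uint32_t bucket[nbucket];    - always 32-bit
		 *   uint32_t chain[];            - always 32-bit
		 */
	};

Hence "gnu_bucket" has to stay a pointer to 32-bit values even when
ELF_HASH_ENTRY is 64-bit.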
Fixes: e0746bde6f82 ("selftests/vDSO: support DT_GNU_HASH")
Reviewed-by: Jens Remus <jremus(a)linux.ibm.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Fix wording around the width of pointers vs the pointed-to values
- Link to v1: https://lore.kernel.org/r/20250213-selftests-vdso-s390-gnu-hash-v1-1-ace3bc…
---
tools/testing/selftests/vDSO/parse_vdso.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/vDSO/parse_vdso.c b/tools/testing/selftests/vDSO/parse_vdso.c
index 2fe5e983cb22f1ed066d0310a54f6aef2ed77ed8..f89d052c730eb43eea28d69ca27b56e897503e16 100644
--- a/tools/testing/selftests/vDSO/parse_vdso.c
+++ b/tools/testing/selftests/vDSO/parse_vdso.c
@@ -53,7 +53,7 @@ static struct vdso_info
/* Symbol table */
ELF(Sym) *symtab;
const char *symstrings;
- ELF(Word) *gnu_hash;
+ ELF(Word) *gnu_hash, *gnu_bucket;
ELF_HASH_ENTRY *bucket, *chain;
ELF_HASH_ENTRY nbucket, nchain;
@@ -185,8 +185,8 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
/* The bucket array is located after the header (4 uint32) and the bloom
* filter (size_t array of gnu_hash[2] elements).
*/
- vdso_info.bucket = vdso_info.gnu_hash + 4 +
- sizeof(size_t) / 4 * vdso_info.gnu_hash[2];
+ vdso_info.gnu_bucket = vdso_info.gnu_hash + 4 +
+ sizeof(size_t) / 4 * vdso_info.gnu_hash[2];
} else {
vdso_info.nbucket = hash[0];
vdso_info.nchain = hash[1];
@@ -268,11 +268,11 @@ void *vdso_sym(const char *version, const char *name)
if (vdso_info.gnu_hash) {
uint32_t h1 = gnu_hash(name), h2, *hashval;
- i = vdso_info.bucket[h1 % vdso_info.nbucket];
+ i = vdso_info.gnu_bucket[h1 % vdso_info.nbucket];
if (i == 0)
return 0;
h1 |= 1;
- hashval = vdso_info.bucket + vdso_info.nbucket +
+ hashval = vdso_info.gnu_bucket + vdso_info.nbucket +
(i - vdso_info.gnu_hash[1]);
for (;; i++) {
ELF(Sym) *sym = &vdso_info.symtab[i];
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20250213-selftests-vdso-s390-gnu-hash-7206671abc85
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
The implementation is limited and only supports numeric conversions.
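A rough sketch of the supported subset, written as a hypothetical
testcase in the __LINE__-on-failure style mentioned in the changelog
below (not taken from the series itself):

	#include <stdio.h>

	static int test_sscanf(void)
	{
		int a;
		unsigned int b;

		/* Numeric conversions such as %d and %u are supported. */
		if (sscanf("123 456", "%d %u", &a, &b) != 2)
			return __LINE__;
		if (a != 123 || b != 456)
			return __LINE__;
		return 0;
	}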
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v2:
- Return __LINE__ from different testcases to directly point to the
failed testcase
- Add some comments
- Expand commit message
- Link to v1: https://lore.kernel.org/r/20240731-nolibc-scanf-v1-0-f71bcc4abb9e@weissschu…
---
Thomas Weißschuh (2):
tools/nolibc: add support for [v]sscanf()
Revert "selftests: kselftest: Fix build failure with NOLIBC"
tools/include/nolibc/stdio.h | 98 ++++++++++++++++++++++++++++
tools/testing/selftests/kselftest.h | 5 --
tools/testing/selftests/nolibc/nolibc-test.c | 68 +++++++++++++++++++
3 files changed, 166 insertions(+), 5 deletions(-)
---
base-commit: 665fa8dea90d9fbc0e7137c7e1315d6f7e15757e
change-id: 20240414-nolibc-scanf-f1db6930d0c6
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Hi all,
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from an order-n folio to order-m with m < n. It reduces
1. the total number of after-split folios from 2^(n-m) to n-m+1;
2. the amount of memory needed for multi-index xarray split from 2^(n/6-m/6) to
n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. and it keeps more large folios after a split: the after-split folios range
from order-(n-1) down to order-m instead of all being order-m.
For example, to split an order-9 to order-0, folio split generates 10
(or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0
folios (or 4 order-0 for anonymous memory) instead of 512 order-0 folios.
It is on top of mm-everything-2025-02-15-05-49 with V7 reverted. It is ready to
be merged.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs.
Changelog
===
From V7[9]:
1. Fixed a wrong function name in lib/test_xarray.c.
2. Made __split_folio_to_order() never fail, since the old order check
is already done in __folio_split(). (per David Hildenbrand)
3. Fixed an issue reported by syzbot[10] by not dropping the original
folio during truncate.
4. Fixed a WARNING when READ_ONLY_THP_FOR_FS is enabled. (Thanks to David
Hildenbrand for reporting the issue)
5. Used two separate struct page* parameters, split_at and lock_at, to
specify at which subpage the non-uniform split happens and which subpage
to keep locked after the split, respectively. It improves code
readability.
From V6[8]:
1. Added an xarray function xas_try_split() to support iterative folio split,
removing the need for xas_split_alloc() and xas_split(). The
function guarantees that at most one xa_node is allocated for each
call.
2. Added concrete numbers of after-split folios and xa_node savings to
cover letter, commit log. (per Andrew)
From V5[7]:
1. The patches to split shmem folios to any lower order are in the mm tree,
so they were dropped from this series.
2. Rename split_folio_at() to try_folio_split() to clarify that
non-uniform split will not be used if it is not supported.
From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator split is
not supported (CONFIG_READ_ONLY_THP_FOR_FS and FS without large folio).
4. Added the missing folio_clear_has_hwpoisoned() to
__split_unmapped_folio().
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache,
since anon folios do not support order-1 yet.
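The decomposition above can be reproduced with a few lines of C (an
illustrative sketch only; folio_split() itself of course operates on
struct folio, not on raw orders):

	#include <stdbool.h>
	#include <stdio.h>

	/* Print the orders produced by a non-uniform split of an
	 * order-`old_order` folio around subpage `at`, down to order-0. */
	static void split_orders(unsigned int old_order, unsigned long at,
				 bool anon)
	{
		for (unsigned int order = old_order; order > 0; order--) {
			if (anon && order == 2) {
				/* anon folios skip order-1 */
				printf("O-0, O-0, O-0, O-0\n");
				return;
			}
			/* emit the half that does not contain subpage `at` */
			printf("O-%u, ", order - 1);
			at &= (1UL << (order - 1)) - 1;
		}
		printf("O-0\n");	/* the folio containing subpage `at` */
	}

	int main(void)
	{
		split_orders(9, 2, true);	/* anon */
		split_orders(9, 2, false);	/* pagecache */
		return 0;
	}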
The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if they exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of splitting
   the whole folio into multiple smaller ones with the same order in one
   shot, this approach splits the folio iteratively. Taking the example
   above, this approach first splits the original order-9 into two order-8
   folios, then splits the left order-8 into two order-7 folios, and so on;
4. Post-process split folios, like writing mapping->i_pages for pagecache,
   adjusting folio refcounts, adding split folios to the corresponding lists;
5. Remap split folios;
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: a single call to __split_folio_to_order() is used to
   uniformly split the given folio. All resulting folios are put back on
   the list after the split. The folio containing the given page is left
   for the caller to unlock; the others are unlocked.
2. buddy allocator like (or non-uniform) split (sketched below):
   (old_order - new_order) calls to __split_folio_to_order() are used to
   split the given folio from order N to order N-1. After each call, the
   target folio is changed to the one containing the page, which is given
   as a folio_split() parameter. After each call, folios not containing
   the page are put back on the list. The folio containing the page is
   put back on the list when its order is new_order. All folios are
   unlocked except the first folio, which is left for the caller to
   unlock.
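In rough pseudocode, the non-uniform path looks like this (a sketch of
the flow described above; folio_containing(), all_except() and
release_to_list() are invented helper names, not functions from the
series):

	folio = original_folio;
	for (order = old_order; order > new_order; order--) {
		__split_folio_to_order(folio, order - 1);
		/* continue with the after-split folio holding split_at */
		next = folio_containing(split_at);
		/* the other after-split folios go back on the list and
		 * are unlocked */
		release_to_list(all_except(next));
		folio = next;
	}
	/* 'folio' (now order new_order) is put back on the list too;
	 * only the first folio stays locked for the caller */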
Patch Overview
===
1. Patch 1 added a new xarray function xas_try_split() to perform
iterative xarray split.
2. Patch 2 added __split_unmapped_folio() and __split_folio_to_order() to
prepare for moving to new backend split code.
3. Patch 3 moved common code in split_huge_page_to_list_to_order() to
__folio_split().
4. Patch 4 added new folio_split() and made
split_huge_page_to_list_to_order() share the new
__split_unmapped_folio() with folio_split().
5. Patch 5 removed no longer used __split_huge_page() and
__split_huge_page_tail().
6. Patch 6 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
7. Patch 7 used try_folio_split() for truncate operation.
8. Patch 8 added folio_split() tests.
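For illustration, the xarray side of such an iterative split might look
like this (a sketch only; the exact signature of xas_try_split() from
patch 1 is an assumption based on the changelog above):

	XA_STATE_ORDER(xas, &mapping->i_pages, index, old_order);
	unsigned int order;

	for (order = old_order; order > new_order; order--) {
		/* each step allocates at most one xa_node */
		xas_set_order(&xas, index, order - 1);
		xas_try_split(&xas, folio, order);
		if (xas_error(&xas))
			break;
	}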
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43nc…
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
[8] https://lore.kernel.org/linux-mm/20250205031417.1771278-1-ziy@nvidia.com/
[9] https://lore.kernel.org/linux-mm/20250211155034.268962-1-ziy@nvidia.com/
[10] https://lore.kernel.org/all/67af65cb.050a0220.21dd3.004a.GAE@google.com/
Zi Yan (8):
xarray: add xas_try_split() to split a multi-index entry
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface
mm/truncate: use buddy allocator like folio split for truncate
operation
selftests/mm: add tests for folio_split(), buddy allocator like split
Documentation/core-api/xarray.rst | 14 +-
include/linux/huge_mm.h | 36 +
include/linux/xarray.h | 7 +
lib/test_xarray.c | 47 ++
lib/xarray.c | 138 +++-
mm/huge_memory.c | 756 ++++++++++++------
mm/truncate.c | 31 +-
tools/testing/radix-tree/Makefile | 1 +
.../selftests/mm/split_huge_page_test.c | 34 +-
9 files changed, 783 insertions(+), 281 deletions(-)
--
2.47.2
The uprobe events test fails on s390, but also on x86 (Fedora 41). The
problem appears to be that there is an assumption that adding a uprobe to
the beginning of the executable mapping of /bin/sh is sufficient to trigger
a uprobe event when /bin/sh is executed.
This assumption is not necessarily true. Therefore use "readelf -h" to find
the entry point address of /bin/sh and use this address when adding the
uprobe event.
This adds a dependency on readelf, which is not always installed. Therefore
add a check and exit with exit_unresolved if it is not installed.
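For reference, the entry point address readelf prints is simply e_entry
from the ELF header; a minimal standalone reader (an illustrative sketch
assuming a 64-bit ELF binary, not part of the patch):

	#include <elf.h>
	#include <stdio.h>

	int main(int argc, char **argv)
	{
		Elf64_Ehdr ehdr;
		FILE *f = fopen(argc > 1 ? argv[1] : "/bin/sh", "rb");

		if (!f || fread(&ehdr, sizeof(ehdr), 1, f) != 1)
			return 1;
		/* e_entry is the virtual address of the entry point */
		printf("0x%lx\n", (unsigned long)ehdr.e_entry);
		fclose(f);
		return 0;
	}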
Signed-off-by: Heiko Carstens <hca(a)linux.ibm.com>
---
.../ftrace/test.d/dynevent/add_remove_uprobe.tc | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_uprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_uprobe.tc
index 86c76679c56e..f2048c244526 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_uprobe.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_uprobe.tc
@@ -3,14 +3,18 @@
# description: Generic dynamic event - add/remove/test uprobe events
# requires: uprobe_events
+if ! which readelf > /dev/null 2>&1 ; then
+ echo "No readelf found. skipped."
+ exit_unresolved
+fi
+
echo 0 > events/enable
echo > dynamic_events
REALBIN=`readlink -f /bin/sh`
+ENTRYPOINT=`readelf -h ${REALBIN} | grep Entry | sed -e 's/[^0]*//'`
-echo 'cat /proc/$$/maps' | /bin/sh | \
- grep "r-xp .*${REALBIN}$" | \
- awk '{printf "p:myevent %s:0x%s\n", $6,$3 }' >> uprobe_events
+echo "p:myevent ${REALBIN}:${ENTRYPOINT}" >> uprobe_events
grep -q myevent uprobe_events
test -d events/uprobes/myevent
--
2.45.2
Hi,
thank you for your review. As promised, here is V3 of this patch series.
I noticed that the updated selftests were sometimes flaky due to the kernel
networking stack sending IPv6 multicast listener reports on the created
test interfaces.
This can be seen here:
https://github.com/kernel-patches/bpf/actions/runs/13449071153/job/37580497…
Setting the NOARP flag on the interfaces should fix this race condition.
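For context, setting the flag from C boils down to a SIOCSIFFLAGS ioctl
(an illustrative sketch; the selftest may achieve the equivalent
differently):

	#include <net/if.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <unistd.h>

	/* Set IFF_NOARP so the kernel does not send packets (e.g. IPv6
	 * multicast listener reports) on the test interface on its own. */
	static int set_noarp(const char *ifname)
	{
		struct ifreq ifr = {};
		int ret = -1;
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		if (fd < 0)
			return -1;
		strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
		if (!ioctl(fd, SIOCGIFFLAGS, &ifr)) {
			ifr.ifr_flags |= IFF_NOARP;
			ret = ioctl(fd, SIOCSIFFLAGS, &ifr);
		}
		close(fd);
		return ret;
	}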
Successful pipeline:
https://github.com/kernel-patches/bpf/actions/runs/13500667544
Signed-off-by: Marcus Wichelmann <marcus.wichelmann(a)hetzner-cloud.de>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
v3:
- change the condition to handle xdp_buffs without metadata support, as
suggested by Willem de Bruijn <willemb(a)google.com>
- add clarifying comment why that condition is needed
- set NOARP flag in selftests to ensure that the kernel does not send
packets on the test interfaces that may interfere with the tests
v2: https://lore.kernel.org/bpf/20250217172308.3291739-1-marcus.wichelmann@hetz…
- submit against bpf-next subtree
- split commits and improved commit messages
- remove redundant metasize check and add clarifying comment instead
- use max() instead of ternary operator
- add selftest for metadata support in the tun driver
v1: https://lore.kernel.org/all/20250130171614.1657224-1-marcus.wichelmann@hetz…
Marcus Wichelmann (6):
net: tun: enable XDP metadata support
net: tun: enable transfer of XDP metadata to skb
selftests/bpf: move open_tuntap to network helpers
selftests/bpf: refactor xdp_context_functional test and bpf program
selftests/bpf: add test for XDP metadata support in tun driver
selftests/bpf: fix file descriptor assertion in open_tuntap helper
drivers/net/tun.c | 28 ++-
tools/testing/selftests/bpf/network_helpers.c | 28 +++
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../selftests/bpf/prog_tests/lwt_helpers.h | 29 ----
.../bpf/prog_tests/xdp_context_test_run.c | 163 ++++++++++++++++--
.../selftests/bpf/progs/test_xdp_meta.c | 56 +++---
6 files changed, 230 insertions(+), 77 deletions(-)
--
2.43.0
Hello,
This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.
The patchset covers:
+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
+ Selftests also demonstrate the conversion flow between VM, userspace and
guest_memfd.
Why 1G pages in guest memfd?
Bring guest_memfd to performance and memory savings parity with VMs that are
backed by HugeTLBfs.
+ Performance is improved with 1G pages through more TLB hits and faster page
walks on TLB misses.
+ Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
Options for 1G page support:
1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!
Comparison between options:
1. HugeTLB
+ Refactor HugeTLB to separate allocator from the rest of HugeTLB
+ Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
+ Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
+ Pro: Can provide iterative steps toward new future allocator
+ Unexplored: Managing userspace-visible changes
+ e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
but not when the future allocator is used
2. CMA
+ Port some HugeTLB features to be applied on CMA
+ Pro: Clean slate
What would refactoring HugeTLB involve?
(Some refactoring was done in this RFC, more can be done.)
1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
+ Brings more modularity to HugeTLB
+ No functionality change intended
+ Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
the complex parts of HugeTLB like
+ Userspace reservations (resv_map)
+ Shared PMD mappings
+ Special page walkers
What features would need to be ported to CMA?
+ Improved allocation guarantees
+ Per NUMA node pool of huge pages
+ Subpools per guest_memfd
+ Memory savings
+ Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
+ Configuration of number of pages available (and per NUMA node) at and
after host boot
+ Reporting of memory usage/availability statistics at runtime
HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.
To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
confidential VM were to be scheduled on that host, some HugeTLBfs pages would
have to be given up and returned to CMA for guest_memfd pages to be rebuilt from
that memory. This requires memory to be reserved so that HVO can be removed and
reapplied on the new guest_memfd memory. This not only slows down memory
allocation but also trims the benefits of HVO. Memory would have to be reserved
on the host to facilitate these transitions.
Improving how guest_memfd uses the allocator in a future revision of this RFC:
To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:
+ reserve(node, page_size, num_pages) => opaque handle
+ Used when a guest_memfd inode is created to reserve memory from backend
allocator
+ allocate(handle, mempolicy, page_size) => folio
+ To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
+ To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
+ To take a folio, and reconstruct a huge folio out of nr_pages from the
first_folio
+ free(handle, folio) => void
+ To return folio to guest_memfd's reservation
+ error(handle, folio) => void
+ To handle memory errors
+ unreserve(handle) => void
+ To return guest_memfd's reservation to allocator backend
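Spelled out as a C interface, this could look roughly like the following
(a sketch only; the operation names come from the list above, while the
types and exact signatures are assumptions):

	struct gmem_allocator {
		void *(*reserve)(int nid, size_t page_size, size_t num_pages);
		struct folio *(*allocate)(void *handle, struct mempolicy *mpol,
					  size_t page_size);
		void (*split)(void *handle, struct folio *folio,
			      size_t target_page_size);
		void (*reconstruct)(void *handle, struct folio *first_folio,
				    size_t nr_pages);
		void (*free)(void *handle, struct folio *folio);
		void (*error)(void *handle, struct folio *folio);
		void (*unreserve)(void *handle);
	};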
Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.
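From userspace, that could reduce to something like this (the flag name
is entirely hypothetical, for illustration only):

	struct kvm_create_guest_memfd args = {
		.size  = guest_mem_size,
		.flags = KVM_GUEST_MEMFD_HUGE_1G, /* hypothetical flag */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);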
Overview of patches:
+ Patches 01-12
+ Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
+ Letting guest_memfd use HugeTLB
+ Creation of each guest_memfd reserves pages from HugeTLB's global hstate
and puts them into the guest_memfd inode's subpool
+ Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
+ Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
+ More small changes on the HugeTLB side to expose functions needed by
guest_memfd
+ Patch 25:
+ Uses the newly available functions from patches 22-24 to split HugeTLB
pages. In this patch, HugeTLB folios are always split to 4K before any
usage, private or shared.
+ Patches 26-28
+ Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
+ Enables conversion between private/shared pages
+ Patch 30
+ Required to zero folios after conversions to avoid leaking initialized
kernel memory
+ Patch 31-38
+ Add selftests to test mapping pages to userspace, guest/host memory
sharing and update conversions tests
+ Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
+ Dynamically split and reconstruct HugeTLB pages instead of always
splitting before use. All earlier selftests are expected to still pass.
TODOs:
+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
+ Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only is the
mem_attr_array and faultability updated, the pages in the requested range
are also split/reconstructed as necessary. We want to look into delaying
splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
+ When running private_mem_conversions_test with more than 1 vCPU, a folio
getting truncated may get faulted in by another process, causing elevated
mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
faultability flag
+ Refactor HugeTLB better, present generic allocator interface
Please let us know your thoughts on:
+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
+ Requiring user to request kernel to unmap pages from userspace using
madvise(MADV_DONTNEED)
+ Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quic…
Ackerley Tng (37):
mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
mm: hugetlb: Remove unnecessary check for avoid_reserve
mm: mempolicy: Refactor out policy_node_nodemask()
mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
interpret mempolicy instead of vma
mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
mm: hugetlb: Refactor out hugetlb_alloc_folio
mm: truncate: Expose preparation steps for truncate_inode_pages_final
mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
mm: hugetlb: Add option to create new subpool without using surplus
mm: hugetlb: Expose hugetlb_acct_memory()
mm: hugetlb: Move and expose hugetlb_zero_partial_page()
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: hugetlb: initialization and cleanup
KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
KVM: selftests: Add private_mem_conversions_test.sh
KVM: selftests: Test that guest_memfd usage is reported via hugetlb
mm: hugetlb: Expose vmemmap optimization functions
mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
mm: hugetlb: Add functions to add/move/remove from hugetlb lists
KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
KVM: guest_memfd: Allow mmapping guest_memfd files
KVM: guest_memfd: Use vm_type to determine default faultability
KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
KVM: selftests: Allow vm_set_memory_attributes to be used without
asserting return value of 0
KVM: selftests: Test using guest_memfd memory from userspace
KVM: selftests: Test guest_memfd memory sharing between guest and host
KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
guest_memfd
KVM: selftests: Test that pinned pages block KVM from setting memory
attributes to PRIVATE
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Add helper to perform madvise by memslots
KVM: selftests: Update private_mem_conversions_test for mmap()able
guest_memfd
Vishal Annapurve (2):
KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
fs/hugetlbfs/inode.c | 35 +-
include/linux/hugetlb.h | 54 +-
include/linux/kvm_host.h | 1 +
include/linux/mempolicy.h | 2 +
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 26 +
include/uapi/linux/magic.h | 1 +
mm/hugetlb.c | 346 ++--
mm/hugetlb_vmemmap.h | 11 -
mm/mempolicy.c | 36 +-
mm/truncate.c | 26 +-
tools/include/linux/kernel.h | 4 +-
tools/testing/selftests/kvm/Makefile | 3 +
.../kvm/guest_memfd_hugetlb_reporting_test.c | 222 +++
.../selftests/kvm/guest_memfd_pin_test.c | 104 ++
.../selftests/kvm/guest_memfd_sharing_test.c | 160 ++
.../testing/selftests/kvm/guest_memfd_test.c | 238 ++-
.../testing/selftests/kvm/include/kvm_util.h | 45 +-
.../testing/selftests/kvm/include/test_util.h | 18 +
tools/testing/selftests/kvm/lib/kvm_util.c | 443 +++--
tools/testing/selftests/kvm/lib/test_util.c | 99 ++
.../kvm/x86_64/private_mem_conversions_test.c | 158 +-
.../x86_64/private_mem_conversions_test.sh | 91 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 11 +-
virt/kvm/guest_memfd.c | 1563 ++++++++++++++++-
virt/kvm/kvm_main.c | 17 +
virt/kvm/kvm_mm.h | 16 +
27 files changed, 3288 insertions(+), 443 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
--
2.46.0.598.g6f2099f65c-goog
This patch series introduces changes to add default build support for
the sched tests in selftests.
The only test under sched is cs_prctl_test, which validates cookies when
core scheduling is in effect. This test fails on systems where core
scheduling is disabled. The patch series also modifies this behaviour to
gracefully skip the test on such systems.
A system with core scheduling disabled would skip the test like:
~# ./run_kselftest.sh
TAP version 13
1..1
timeout set to 45
selftests: sched: cs_prctl_test
prctl failed: Invalid argument
Core sched not supported, hence skipping tests
ok 1 selftests: sched: cs_prctl_test # SKIP
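The detection itself boils down to a single prctl() call (a sketch of the
idea, assuming the PR_SCHED_CORE interface; the actual patch may differ in
detail):

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_SCHED_CORE
	#define PR_SCHED_CORE		62
	#define PR_SCHED_CORE_CREATE	1
	#endif

	int main(void)
	{
		/* Fails with EINVAL on kernels without core scheduling. */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(), 0, 0)) {
			perror("prctl failed");
			printf("Core sched not supported, hence skipping tests\n");
			return 4; /* KSFT_SKIP */
		}
		return 0;
	}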
Signed-off-by: Sinadin Shan <sinadin.shan(a)oracle.com>
---
v3:
* Use prctl to check core sched support instead of config
* v2 link: https://lore.kernel.org/all/20250221115750.631990-1-sinadin.shan@oracle.com/
v2:
* Add patch to skip cs_prctl_test on core scheduling disabled systems
* v1 link: https://lore.kernel.org/all/20250219064658.449069-1-sinadin.shan@oracle.com
---
Sinadin Shan (2):
selftests: sched: add sched as a default selftest target
selftests: sched: skip cs_prctl_test for systems with core scheduling
disabled
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/sched/cs_prctl_test.c | 34 ++++++++++++++++++-
2 files changed, 34 insertions(+), 1 deletion(-)
--
2.43.5
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue, turned
out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
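In code, the coalescing boils down to a per-CPU bitmask of pending work
types that is flushed by the ct_work_flush() mentioned below (a simplified
sketch; the names are illustrative, not the series' actual implementation):

	static DEFINE_PER_CPU(atomic_t, ct_pending_work);

	enum ct_work {
		CT_WORK_SYNC_CORE = BIT(0),
		CT_WORK_TLB_FLUSH = BIT(1),
	};

	/* Sender side: instead of an IPI, set a bit for the target CPU. */
	static void defer_ipi(int cpu, enum ct_work work)
	{
		atomic_or(work, per_cpu_ptr(&ct_pending_work, cpu));
	}

	/* Target side: run the coalesced callbacks on kernel entry. */
	static void ct_work_flush(void)
	{
		unsigned int work = atomic_xchg(this_cpu_ptr(&ct_pending_work), 0);

		if (work & CT_WORK_SYNC_CORE)
			sync_core();
		if (work & CT_WORK_TLB_FLUSH)
			__flush_tlb_all();
	}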
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can itself be executed (i.e. before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care with what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-19 use said annotations on relevant keys / calls.
o Patch 20 enforces proper usage of said annotations (courtesy of Josh).
o Patches 21-23 fiddle with CT_STATE* within context tracking
o Patches 24-29 add the actual IPI deferral faff
o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much!
If you care about battery-powered devices and energy consumption, go give it
a try!
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4
Stuff I'd like eyes and neurons on
==================================
Context-tracking vs idle. Patch 22 "improves" the situation by adding an
IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the
following window:
~> IRQ
ct_nmi_enter()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE
[...]
ct_nmi_exit()
state = state - CT_STATE_KERNEL + CT_STATE_IDLE
[...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\
ct_cpuidle_exit()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE
Said window is contained within cpu_idle_poll() and the cpuidle call within
cpuidle_enter_state(), both being noinstr (the former is __cpuidle which is
noinstr itself). Thus objtool will consider it as early entry and will warn
accordingly of any static key / call misuse, so the damage is somewhat
contained, but it's not ideal.
I tried fiddling with this but idle polling likes being annoying, as it is
shaped like so:
ct_cpuidle_enter();
raw_local_irq_enable();
while (!tif_need_resched() &&
(cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();
ct_cpuidle_exit();
IOW, getting an IRQ that doesn't end up setting NEED_RESCHED while idle-polling
doesn't come near ct_cpuidle_exit(), which prevents me from having the outermost
ct_nmi_exit() leave the state as CT_STATE_KERNEL (rather than CT_STATE_IDLE).
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.
v6.13-rc6
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
531 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
12818 func=do_flush_tlb_all
910 func=do_kernel_range_flush
78 func=do_sync_core
v6.13-rc6 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
# Zilch!
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
# Nada!
Note that tlb_remove_table_smp_sync() showed up during testing of v3, and has
gone as mysteriously as it showed up. Yair had a series addressing this [5] which
would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
RFCv3 -> v4
++++++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Peter Zijlstra (1):
x86,tlb: Make __flush_tlb_global() noinstr-compliant
Valentin Schneider (26):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
context_tracking: Explicitly use CT_STATE_KERNEL where it is missing
context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
context_tracking: Turn CT_STATE_* into bits
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
x86/tlb: Make __flush_tlb_local() noinstr-compliant
x86/tlb: Make __flush_tlb_all() noinstr
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs
context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 22 ++++
arch/x86/include/asm/invpcid.h | 13 +--
arch/x86/include/asm/paravirt.h | 4 +-
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 3 +-
arch/x86/include/asm/xen/hypercall.h | 11 +-
arch/x86/kernel/alternative.c | 38 ++++++-
arch/x86/kernel/cpu/bugs.c | 9 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/mm/tlb.c | 46 ++++++--
arch/x86/xen/mmu_pv.c | 10 +-
arch/x86/xen/xen-ops.h | 12 +-
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 64 +++++++++--
include/linux/context_tracking_work.h | 28 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 98 ++++++++++++++--
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/stackleak.c | 6 +-
kernel/time/Kconfig | 19 ++++
mm/vmalloc.c | 35 +++++-
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 18 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
44 files changed, 635 insertions(+), 107 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.43.0