Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre gadgets
and similar techniques cannot be used to target virtual machine memory.
Roughly 60% of speculative execution issues fall into this category
[1, Table 1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, so that the above protection
can be attained for KVM guests backed by guest_memfd.
=== Changes to RFC v3 ===
- Settle relationship between direct map removal and shared/private
memory in guest_memfd (David H.)
- Omit TLB flushes upon direct map removal again
- Settle uABI for how KVM accesses guest memory in non-CoCo guest_memfd
VMs (upstream guest_memfd calls)
- Add selftests that exercise the codepaths of non-CoCo guest_memfd VMs
Lastly, this series is rebased on top of Fuad's v4 for shared mapping of
guest_memfd [2]. The KVM parts should also apply on top of 0ad2507d5d93
("Linux 6.14-rc3"), but the selftest patches need Fuad's series as base.
=== Overview ===
guest_memfd should be usable for "non-CoCo" VMs - virtual machines where
host userspace is trusted (e.g. can have access to all of guest memory),
but which should still be hardened against speculative execution attacks
(Spectre, etc.) staged through potentially existing gadgets in the host
kernel.
To attain this hardening, unmap guest memory from the host kernel's
address space (i.e. zap direct map entries), while allowing KVM to
continue accessing guest memory through userspace mappings. This works
because KVM already uses userspace mappings for almost all of its
accesses to guest memory - the only parts that require direct map
entries (because they use GUP) are KVM's MMU and kvm-clock on x86.
Building on top of guest_memfd sidesteps the MMU problem, as for
memslots with KVM_MEM_GUEST_MEMFD set, the MMU consumes fd + offset
directly without going through any VMAs. kvm-clock on the other hand is
not strictly needed (guests boot fine without it), so ignore it for
now.
=== Implementation ===
Make KVM_CREATE_GUEST_MEMFD accept a flag (KVM_GMEM_NO_DIRECT_MAP) that
instructs it to remove newly allocated folios from the host kernel's
direct map immediately after preparation.
Nothing further is needed to make non-CoCo VMs work - particularly, KVM
does not need to be taught any special ways of accessing guest memory if
it is in guest_memfd. Userspace can simply mmap guest_memfd (via
KVM_GMEM_SHARED_MEM added in Fuad's series), and set the memslot's
userspace_addr to this userspace mapping of guest_memfd.
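For illustration, here is a rough userspace sketch of that flow (not taken
from this series' selftests). KVM_CREATE_GUEST_MEMFD, KVM_MEM_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2 are existing uAPI; KVM_GMEM_NO_DIRECT_MAP is the
flag proposed here (the value below is a placeholder) and KVM_GMEM_SHARED_MEM
comes from Fuad's series:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

#ifndef KVM_GMEM_NO_DIRECT_MAP
#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 1)	/* placeholder value */
#endif

static int setup_gmem_memslot(int vm_fd, uint64_t gpa, uint64_t size)
{
	struct kvm_create_guest_memfd gmem = {
		.size = size,
		/* plus KVM_GMEM_SHARED_MEM from Fuad's series so mmap() works */
		.flags = KVM_GMEM_NO_DIRECT_MAP,
	};
	struct kvm_userspace_memory_region2 region;
	int gmem_fd;
	void *uaddr;

	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (gmem_fd < 0)
		return -1;

	/* KVM accesses guest memory through this userspace mapping */
	uaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);
	if (uaddr == MAP_FAILED)
		return -1;

	region = (struct kvm_userspace_memory_region2) {
		.slot			= 0,
		.flags			= KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr	= gpa,
		.memory_size		= size,
		.userspace_addr		= (uint64_t)uaddr,
		.guest_memfd		= gmem_fd,
		.guest_memfd_offset	= 0,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}

Nothing KVM-internal changes in this sketch: the MMU consumes gmem_fd +
offset directly, while all other KVM accesses to guest memory go through
userspace_addr as usual.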
=== Open Questions ===
In this patch series, stale TLB entries do not get flushed after direct
map entries are marked as not present. This is fine from a functional
point of view (as the mapping is still valid, it's just temporarily not
supposed to be used), but pokes a theoretical hole into the speculation
protection: Something could try to keep alive stale TLB entries for
specific pages until the guest starts using them for sensitive
information, and then stage a Spectre attack on that memory, despite it
being unmapped. In practice, this would require knowing in advance, at
gmem fault-time, which pages will eventually contain information of
interest, and then preventing these specific TLB entries from getting
naturally evicted (where the number of pages that can be targeted like
this is limited by the size of the TLB). These seem to be fairly
difficult requisites to fulfill, but we were wondering what the
community thinks.
=== Summary ===
Patch 1 adds a struct address_space flag that indicates that folios in a
mapping are direct map removed, and threads it through mm code to ensure
direct map removed folios don't end up in places where they can cause
mayhem (particularly, we reject them in get_user_pages). Since these
checks end up being duplicates of already existing checks for secretmem
folios, patch 2 unifies the two by using the new address_space flag for
secretmem mappings. Patches 3 through 5 are about support for direct map
removal in guest_memfd, while patches 6 through 12 are about testing the
non-CoCo setup in KVM selftests, with patches 6 through 9 being
preparatory, and patches 10 through 12 adding the actual test cases.
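For reference, the AS_NO_DIRECT_MAP flag follows the existing address_space
flag pattern in include/linux/pagemap.h (e.g. AS_UNEVICTABLE). A minimal
sketch of what such helpers typically look like - the helper names here are
illustrative, not necessarily the ones used in patch 1:

/* illustrative only - mirrors e.g. mapping_set_unevictable() */
static inline void mapping_set_no_direct_map(struct address_space *mapping)
{
	set_bit(AS_NO_DIRECT_MAP, &mapping->flags);
}

static inline bool mapping_no_direct_map(struct address_space *mapping)
{
	return test_bit(AS_NO_DIRECT_MAP, &mapping->flags);
}

With a flag like this, GUP (and, after patch 2, secretmem) can test the
mapping flag instead of special-casing particular filesystems.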
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://lore.kernel.org/kvm/20250218172500.807733-1-tabba@google.com/
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFC v3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
Patrick Roy (12):
mm: introduce AS_NO_DIRECT_MAP
mm/secretmem: set AS_NO_DIRECT_MAP instead of special-casing
KVM: guest_memfd: Add flag to remove from direct map
KVM: Add capability to discover KVM_GMEM_NO_DIRECT_MAP support
KVM: Documentation: document KVM_GMEM_NO_DIRECT_MAP flag
KVM: selftests: load elf via bounce buffer
KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
!= -1
KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
KVM: selftests: adjust test_create_guest_memfd_invalid
KVM: selftests: set KVM_GMEM_NO_DIRECT_MAP in mem conversion tests
KVM: selftests: Test guest execution from direct map removed gmem
Documentation/virt/kvm/api.rst | 13 ++++
include/linux/pagemap.h | 16 +++++
include/linux/secretmem.h | 18 ------
include/uapi/linux/kvm.h | 3 +
lib/buildid.c | 4 +-
mm/gup.c | 14 +---
mm/mlock.c | 2 +-
mm/secretmem.c | 6 +-
.../testing/selftests/kvm/guest_memfd_test.c | 2 +-
.../testing/selftests/kvm/include/kvm_util.h | 29 ++++++---
.../testing/selftests/kvm/include/test_util.h | 8 +++
tools/testing/selftests/kvm/lib/elf.c | 8 +--
tools/testing/selftests/kvm/lib/io.c | 23 +++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 64 +++++++++++--------
tools/testing/selftests/kvm/lib/test_util.c | 8 +++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 40 ++++++++++++
.../kvm/x86/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 24 ++++++-
virt/kvm/kvm_main.c | 5 ++
21 files changed, 214 insertions(+), 82 deletions(-)
base-commit: da40655874b54a2b563f8ceb3ed839c6cd38e0b4
--
2.48.1
Hi all,
This patchset adds a new buddy allocator like (or non-uniform) large folio
split, from an order-n folio to order-m with m < n. Compared to a uniform
split, it
1. reduces the total number of after-split folios from 2^(n-m) to n-m+1;
2. reduces the amount of memory needed for the multi-index xarray split from
2^(n/6-m/6) to n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keeps more large folios after a split: instead of only order-m folios, the
result ranges from order-(n-1) down to order-m.
For example, to split an order-9 to order-0, folio split generates 10
(or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0
folios (or 4 order-0 for anonymous memory) instead of 512 order-0 folios.
It is on top of mm-everything-2025-02-15-05-49 with V7 reverted. It is ready to
be merged.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs.
Changelog
===
From V7[9]:
1. Fixed a wrong function name in lib/test_xarray.c.
2. Made __split_folio_to_order() never fail, since the old order check
is already done in __folio_split(). (per David Hildenbrand)
3. Fixed an issue reported by syzbot[10] by not dropping the original
folio during truncate.
4. Fixed a WARNING when READ_ONLY_THP_FOR_FS is enabled. (Thanks to David
Hildenbrand for reporting the issue.)
5. Used two separate struct page* parameters, split_at and lock_at, to
specify at which subpage the non-uniform split happens and which subpage
to keep locked after the split, respectively. It improves code
readability.
From V6[8]:
1. Added an xarray function xas_try_split() to support iterative folio split,
removing the need to use xas_split_alloc() and xas_split(). The
function guarantees that at most one xa_node is allocated for each
call.
2. Added concrete numbers of after-split folios and xa_node savings to
cover letter, commit log. (per Andrew)
From V5[7]:
1. The split-shmem-to-any-lower-order patches are in the mm tree, so they
were dropped from this series.
2. Rename split_folio_at() to try_folio_split() to clarify that
non-uniform split will not be used if it is not supported.
From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator split is
not supported (CONFIG_READ_ONLY_THP_FOR_FS and FS without large folio).
4. Added the missing folio_clear_has_hwpoisoned() to
__split_unmapped_folio().
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
(the two differ because anon folios do not support order-1 yet).
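The orders in the example above can be reproduced with a small, purely
illustrative userspace sketch of the buddy allocator like walk (this is not
kernel code; in the anon case the order-1 buddy would simply be split once
more into two order-0 folios):

#include <stdio.h>

static void nonuniform_split_orders(unsigned int n, unsigned int m,
				    unsigned long at)
{
	unsigned long start = 0;

	printf("splitting order-%u at page index %lu down to order-%u:\n",
	       n, at, m);
	for (unsigned int order = n; order > m; order--) {
		unsigned long half = 1UL << (order - 1);

		/* the half not containing "at" is left behind as a buddy */
		if (at < start + half) {
			printf("  order-%u buddy at [%lu, %lu)\n",
			       order - 1, start + half, start + 2 * half);
		} else {
			printf("  order-%u buddy at [%lu, %lu)\n",
			       order - 1, start, start + half);
			start += half;
		}
	}
	/* the piece containing "at" ends up at order-m */
	printf("  order-%u folio at [%lu, %lu) contains index %lu\n",
	       m, start, start + (1UL << m), at);
}

int main(void)
{
	/* the example above: order-9 folio, free the 3rd subpage (index 2) */
	nonuniform_split_orders(9, 0, 2);
	return 0;
}

Running it prints one buddy per level, order-8 down to order-1, plus the
final order-0 folio containing index 2 - the 10 after-split folios mentioned
earlier.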
The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if they exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of splitting
the whole folio into multiple smaller ones with the same order in one
shot, this approach splits the folio iteratively. Taking the example
above, it first splits the original order-9 folio into two order-8 folios,
then splits the left order-8 folio into two order-7 folios, and so on;
4. Post-process split folios, e.g. write mapping->i_pages for pagecache,
adjust folio refcounts, add split folios to the corresponding list;
5. Remap split folios;
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: a single call to __split_folio_to_order() is used to
uniformly split the given folio. All resulting folios are put back to
the list after the split. The folio containing the given page is left
for the caller to unlock; the others are unlocked.
2. buddy allocator like (or non-uniform) split: (old_order - new_order) calls
to __split_folio_to_order() are used, each splitting the target folio from
order N to order N-1. After each call, the target folio becomes the one
containing the page given as a folio_split() parameter, and the folios not
containing that page are put back to the list. The folio containing the
page is put back to the list once its order reaches new_order. All folios
are unlocked except the first folio, which is left for the caller to
unlock.
Patch Overview
===
1. Patch 1 added a new xarray function xas_try_split() to perform
iterative xarray split.
2. Patch 2 added __split_unmapped_folio() and __split_folio_to_order() to
prepare for moving to new backend split code.
3. Patch 3 moved common code in split_huge_page_to_list_to_order() to
__folio_split().
4. Patch 4 added new folio_split() and made
split_huge_page_to_list_to_order() share the new
__split_unmapped_folio() with folio_split().
5. Patch 5 removed no longer used __split_huge_page() and
__split_huge_page_tail().
6. Patch 6 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
7. Patch 7 used try_folio_split() for truncate operation.
8. Patch 8 added folio_split() tests.
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43nc…
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
[8] https://lore.kernel.org/linux-mm/20250205031417.1771278-1-ziy@nvidia.com/
[9] https://lore.kernel.org/linux-mm/20250211155034.268962-1-ziy@nvidia.com/
[10] https://lore.kernel.org/all/67af65cb.050a0220.21dd3.004a.GAE@google.com/
Zi Yan (8):
xarray: add xas_try_split() to split a multi-index entry
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface
mm/truncate: use buddy allocator like folio split for truncate
operation
selftests/mm: add tests for folio_split(), buddy allocator like split
Documentation/core-api/xarray.rst | 14 +-
include/linux/huge_mm.h | 36 +
include/linux/xarray.h | 7 +
lib/test_xarray.c | 47 ++
lib/xarray.c | 138 +++-
mm/huge_memory.c | 756 ++++++++++++------
mm/truncate.c | 31 +-
tools/testing/radix-tree/Makefile | 1 +
.../selftests/mm/split_huge_page_test.c | 34 +-
9 files changed, 783 insertions(+), 281 deletions(-)
--
2.47.2
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and in a timely manner.
The test spawns 1 thread pinned to each CPU; each thread, including the
main one, then runs in short bursts for some time. During this period, the
mm_cids should span all numbers between 0 and nproc.
At the end of this phase, a thread with high enough mm_cid (>= nproc/2)
is selected to be the new leader, all other threads terminate.
After some time, the only running thread should see 0 as mm_cid; if that
doesn't happen, the compaction mechanism didn't work and the test fails.
The test never fails if only 1 core is available, in which case we
cannot test anything, as the only available mm_cid is 0.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 200 ++++++++++++++++++
3 files changed, 202 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce4..2c89f97e4f737 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 5a3432fceb586..ce1b38f46a355 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -16,7 +16,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
- param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
+ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..7ddde3b657dd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 0.5 s */
+#define RUNNER_PERIOD 500000
+/* Number of runs before we terminate or get the token */
+#define THREAD_RUNS 5
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.48.1
v1/v2:
There is only the first patch: RISC-V: Enable cbo.clean/flush in usermode,
which mainly removes the enabling of cbo.inval in user mode.
v3:
Add the functionality to expose Zicbom, plus selftests for Zicbom.
v4:
Modify the order of macros; the test_no_cbo_inval function is added
separately.
v5:
1. Modify the order of RISCV_HWPROBE_KEY_ZICBOM_BLOCK_SIZE in hwprobe.rst
2. "TEST_NO_ZICBOINVAL" -> "TEST_NO_CBO_INVAL"
v6:
Change hwprobe_ext0_has's second param to u64.
v7:
Rebase to the latest code of linux-next.
Yunhui Cui (3):
RISC-V: Enable cbo.clean/flush in usermode
RISC-V: hwprobe: Expose Zicbom extension and its block size
RISC-V: selftests: Add TEST_ZICBOM into CBO tests
Documentation/arch/riscv/hwprobe.rst | 6 ++
arch/riscv/include/asm/hwprobe.h | 2 +-
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/kernel/cpufeature.c | 8 +++
arch/riscv/kernel/sys_hwprobe.c | 8 ++-
tools/testing/selftests/riscv/hwprobe/cbo.c | 66 +++++++++++++++++----
6 files changed, 79 insertions(+), 13 deletions(-)
--
2.39.2
Hi,
thank you for your review. As promised, here is V3 of this patch series.
I noticed that the updated selftests were sometimes flaky due to the kernel
networking stack sending IPv6 multicast listener reports on the created
test interfaces.
This can be seen here:
https://github.com/kernel-patches/bpf/actions/runs/13449071153/job/37580497…
Setting the NOARP flag on the interfaces should fix this race condition.
Successful pipeline:
https://github.com/kernel-patches/bpf/actions/runs/13500667544
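For reference, one way a test can bring up a TAP interface with NOARP set
(so the stack does not emit the unsolicited traffic mentioned above, per the
fix described here) looks roughly like the sketch below; the actual selftest
change may use different helpers, so treat this as illustrative only:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <unistd.h>

static int open_tap_noarp(const char *name)
{
	struct ifreq ifr = {};
	int tap_fd, sock_fd;

	tap_fd = open("/dev/net/tun", O_RDWR);
	if (tap_fd < 0)
		return -1;

	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
	if (ioctl(tap_fd, TUNSETIFF, &ifr) < 0)
		goto err_tap;

	/* bring the interface up with neighbour resolution disabled */
	sock_fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock_fd < 0)
		goto err_tap;
	if (ioctl(sock_fd, SIOCGIFFLAGS, &ifr) < 0)
		goto err_sock;
	ifr.ifr_flags |= IFF_UP | IFF_NOARP;
	if (ioctl(sock_fd, SIOCSIFFLAGS, &ifr) < 0)
		goto err_sock;

	close(sock_fd);
	return tap_fd;

err_sock:
	close(sock_fd);
err_tap:
	close(tap_fd);
	return -1;
}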
Signed-off-by: Marcus Wichelmann <marcus.wichelmann(a)hetzner-cloud.de>
Acked-by: Jason Wang <jasowang(a)redhat.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
v3:
- change the condition to handle xdp_buffs without metadata support, as
suggested by Willem de Bruijn <willemb(a)google.com>
- add clarifying comment why that condition is needed
- set NOARP flag in selftests to ensure that the kernel does not send
packets on the test interfaces that may interfere with the tests
v2: https://lore.kernel.org/bpf/20250217172308.3291739-1-marcus.wichelmann@hetz…
- submit against bpf-next subtree
- split commits and improved commit messages
- remove redundant metasize check and add clarifying comment instead
- use max() instead of ternary operator
- add selftest for metadata support in the tun driver
v1: https://lore.kernel.org/all/20250130171614.1657224-1-marcus.wichelmann@hetz…
Marcus Wichelmann (6):
net: tun: enable XDP metadata support
net: tun: enable transfer of XDP metadata to skb
selftests/bpf: move open_tuntap to network helpers
selftests/bpf: refactor xdp_context_functional test and bpf program
selftests/bpf: add test for XDP metadata support in tun driver
selftests/bpf: fix file descriptor assertion in open_tuntap helper
drivers/net/tun.c | 28 ++-
tools/testing/selftests/bpf/network_helpers.c | 28 +++
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../selftests/bpf/prog_tests/lwt_helpers.h | 29 ----
.../bpf/prog_tests/xdp_context_test_run.c | 163 ++++++++++++++++--
.../selftests/bpf/progs/test_xdp_meta.c | 56 +++---
6 files changed, 230 insertions(+), 77 deletions(-)
--
2.43.0
The following series fixes some bugs and adds some error messages for
cases which were not handled before.
It also adds some selftests which test the new error messages.
Thank you,
---
Masami Hiramatsu (Google) (8):
tracing: tprobe-events: Fix a memory leak when tprobe with $retval
tracing: tprobe-events: Reject invalid tracepoint name
tracing: fprobe-events: Log error for exceeding the number of entry args
tracing: probe-events: Log error for exceeding the number of arguments
tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro
selftests/ftrace: Expand the tprobe event test to check wrong format
selftests/ftrace: Add new syntax error test
selftests/ftrace: Add dynamic events argument limitation test case
kernel/trace/trace_eprobe.c | 2 +
kernel/trace/trace_fprobe.c | 25 +++++++++++-
kernel/trace/trace_kprobe.c | 5 ++
kernel/trace/trace_probe.h | 6 ++-
kernel/trace/trace_uprobe.c | 9 +++-
.../ftrace/test.d/dynevent/add_remove_tprobe.tc | 14 +++++++
.../ftrace/test.d/dynevent/dynevent_limitations.tc | 42 ++++++++++++++++++++
.../ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 1
8 files changed, 98 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/dynevent_limitations.tc
--
Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
As the vIOMMU infrastructure series part-3, this introduces a new vEVENTQ
object. The existing FAULT object provides a nice notification pathway to
userspace with a queue already, so let vEVENTQ reuse that.
Mimicking the HWPT structure, add a common EVENTQ structure to support its
derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new).
An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate a vEVENTQ object for
vIOMMUs. One vIOMMU can have multiple vEVENTQs of different types but cannot
have multiple vEVENTQs of the same type.
The forwarding part is fairly simple but might need to replace a physical
device ID with a virtual device ID in a driver-level event data structure.
So, this also adds some helpers for drivers to use.
As usual, this series comes with the selftest coverage for this new ioctl
and with a real world use case in the ARM SMMUv3 driver.
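For a feel of the intended userspace flow, a hedged sketch (not copied from
the selftests): allocate a vEVENTQ on an existing vIOMMU and read forwarded
events from the returned fd. The struct layout, field names and constants
below follow my reading of this series' uAPI patch and should be treated as
assumptions; the include/uapi/linux/iommufd.h changes in the series are
authoritative.

#include <linux/iommufd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int alloc_and_drain_veventq(int iommufd, __u32 viommu_id)
{
	struct iommu_veventq_alloc cmd = {
		.size = sizeof(cmd),
		.viommu_id = viommu_id,			/* from IOMMU_VIOMMU_ALLOC */
		.type = IOMMU_VEVENTQ_TYPE_ARM_SMMUV3,	/* one queue per type */
		.veventq_depth = 64,			/* bound on queued events */
	};
	char buf[4096];

	if (ioctl(iommufd, IOMMU_VEVENTQ_ALLOC, &cmd))
		return -1;

	/*
	 * Each event read from the fd is prefixed with a small header carrying
	 * a sequence number and a lost-events flag, followed by the driver
	 * event record with the physical device ID replaced by the virtual
	 * device ID.
	 */
	struct pollfd pfd = { .fd = cmd.out_veventq_fd, .events = POLLIN };

	if (poll(&pfd, 1, -1) < 0)
		return -1;
	return read(cmd.out_veventq_fd, buf, sizeof(buf));
}

Since one vIOMMU supports only one vEVENTQ per type, a second
IOMMU_VEVENTQ_ALLOC of the same type on the same vIOMMU is expected to fail.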
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v8
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v8
Changelog
v8
* Add Reviewed-by from Jason and Pranjal
* Fix errno returned in arm_smmu_handle_event()
* Validate domain->type outside of arm_smmu_attach_prepare_vmaster()
* Drop unnecessary vmaster comparison in arm_smmu_attach_commit_vmaster()
v7
https://lore.kernel.org/all/cover.1740238876.git.nicolinc@nvidia.com/
* Rebase on Jason's for-next tree for latest fault.c
* Add Reviewed-by
* Update commit logs
* Add __reserved field sanity
* Skip kfree() on the static header
* Replace "bool on_list" with list_is_last()
* Use u32 for flags in iommufd_vevent_header
* Drop casting in iommufd_viommu_get_vdev_id()
* Update the bounding logic to veventq->sequence
* Add missing cpu_to_le64() around STRTAB_STE_1_MEV
* Reuse veventq->common.lock to fence sequence and num_events
* Rename overflow to lost_events and log it in upon kmalloc failure
* Correct the error handling part in iommufd_veventq_deliver_fetch()
* Add an arm_smmu_clear_vmaster() to simplify identity/blocked domain
attach ops
* Add additional four event records to forward to user space VM, and
update the uAPI doc
* Reuse the existing smmu->streams_mutex lock to fence master->vmaster
pointer, instead of adding a new rwsem
v6
https://lore.kernel.org/all/cover.1737754129.git.nicolinc@nvidia.com/
* Drop supports_veventq viommu op
* Split bug/cosmetics fixes out of the series
* Drop the blocking mutex around copy_to_user()
* Add veventq_depth in uAPI to limit vEVENTQ size
* Revise the documentation for a clear description
* Fix sparse warnings in arm_vmaster_report_event()
* Rework iommufd_viommu_get_vdev_id() to return -ENOENT v.s. 0
* Allow Abort/Bypass STEs to allocate vEVENTQ and set STE.MEV for DoS
mitigations
v5
https://lore.kernel.org/all/cover.1736237481.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu
* Reorder the OBJ list as well
* Fix alphabetical order after renaming in v4
* Add supports_veventq viommu op for vEVENTQ type validation
v4
https://lore.kernel.org/all/cover.1735933254.git.nicolinc@nvidia.com/
* Rename "vIRQ" to "vEVENTQ"
* Use flexible array in struct iommufd_vevent
* Add the new ioctl command to union ucmd_buffer
* Fix the alphabetical order in union ucmd_buffer too
* Rename _TYPE_NONE to _TYPE_DEFAULT aligning with vIOMMU naming
v3
https://lore.kernel.org/all/cover.1734477608.git.nicolinc@nvidia.com/
* Rebase on Will's for-joerg/arm-smmu/updates for arm_smmu_event series
* Add "Reviewed-by" lines from Kevin
* Fix typos in comments, kdocs, and jump tags
* Add a patch to sort struct iommufd_ioctl_op
* Update iommufd's userspace-api documentation
* Update uAPI kdoc to quote the SMMUv3 official spec
* Drop the unused workqueue in struct iommufd_virq
* Drop might_sleep() in iommufd_viommu_report_irq() helper
* Add missing "break" in iommufd_viommu_get_vdev_id() helper
* Shrink the scope of the vmaster's read lock in SMMUv3 driver
* Pass in two arguments to iommufd_eventq_virq_handler() helper
* Move "!ops || !ops->read" validation into iommufd_eventq_init()
* Move "fault->ictx = ictx" closer to iommufd_ctx_get(fault->ictx)
* Update commit message for arm_smmu_attach_prepare/commit_vmaster()
* Keep "iommufd_fault" as-is and rename "iommufd_eventq_virq" to just
"iommufd_virq"
v2
https://lore.kernel.org/all/cover.1733263737.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc1
* Add IOPF and vIRQ in iommufd.rst (userspace-api)
* Add a proper locking in iommufd_event_virq_destroy
* Add iommufd_event_virq_abort with a lockdep_assert_held
* Rename "EVENT_*" to "EVENTQ_*" to describe the objects better
* Reorganize flows in iommufd_eventq_virq_alloc for abort() to work
* Added struct arm_smmu_vmaster to store vSID upon attaching to a nested
domain, calling a newly added iommufd_viommu_get_vdev_id helper
* Added an arm_vmaster_report_event helper in the arm-smmu-v3-iommufd file
to simplify the routine in arm_smmu_handle_evt() of the main driver
v1
https://lore.kernel.org/all/cover.1724777091.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (14):
iommufd/fault: Move two fault functions out of the header
iommufd/fault: Add an iommufd_fault_init() helper
iommufd: Abstract an iommufd_eventq from iommufd_fault
iommufd: Rename fault.c to eventq.c
iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
iommufd/viommu: Add iommufd_viommu_report_event helper
iommufd/selftest: Require vdev_id when attaching to a nested domain
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ
coverage
iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
iommu/arm-smmu-v3: Report events that belong to devices attached to
vIOMMU
iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 36 ++
drivers/iommu/iommufd/iommufd_private.h | 135 +++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
include/linux/iommufd.h | 23 +
include/uapi/linux/iommufd.h | 105 +++
tools/testing/selftests/iommu/iommufd_utils.h | 115 ++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 64 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 ++-
drivers/iommu/iommufd/driver.c | 72 +++
drivers/iommu/iommufd/eventq.c | 597 ++++++++++++++++++
drivers/iommu/iommufd/fault.c | 342 ----------
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 54 ++
drivers/iommu/iommufd/viommu.c | 2 +
tools/testing/selftests/iommu/iommufd.c | 36 ++
.../selftests/iommu/iommufd_fail_nth.c | 7 +
Documentation/userspace-api/iommufd.rst | 17 +
19 files changed, 1304 insertions(+), 408 deletions(-)
create mode 100644 drivers/iommu/iommufd/eventq.c
delete mode 100644 drivers/iommu/iommufd/fault.c
base-commit: 598749522d4254afb33b8a6c1bea614a95896868
--
2.43.0
Some distributions may not enable MPTCP by default. All other MPTCP tests
source mptcp_lib.sh to ensure MPTCP is enabled before testing. However,
the ip_local_port_range test is the only one that does not include this
step.
Let's also ensure MPTCP is enabled in netns for ip_local_port_range so
that it passes on all distributions.
Suggested-by: Davide Caratti <dcaratti(a)redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/net/ip_local_port_range.sh | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/ip_local_port_range.sh b/tools/testing/selftests/net/ip_local_port_range.sh
index 6c6ad346eaa0..4ff746db1256 100755
--- a/tools/testing/selftests/net/ip_local_port_range.sh
+++ b/tools/testing/selftests/net/ip_local_port_range.sh
@@ -2,4 +2,6 @@
# SPDX-License-Identifier: GPL-2.0
./in_netns.sh \
- sh -c 'sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && ./ip_local_port_range'
+ sh -c 'sysctl -q -w net.mptcp.enabled=1 && \
+ sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && \
+ ./ip_local_port_range'
--
2.46.0