July 2025 - Linux-kselftest-mirror

[PATCH 0/4] selftests/mm/uffd: refactor global variables

by Ujwal Kundur

This patchset refactors non-composite global variables into a common struct that can be initialized and passed around per-test instead of relying on the presence of global variables. This allows:

- Better encapsulation
- Debugging becomes easier -- local variable state can be viewed per stack frame, and we can more easily reason about the variable mutations

Patch 1 needs to be applied first and can be followed by any of the other patches.

I've ensured that the tests are passing locally (or at least have the same output as the code on master).

Ujwal Kundur (4):
  selftests/mm/uffd: Refactor non-composite global vars into struct
  selftests/mm/uffd: Swap global vars with global test options
  selftests/mm/uffd: Swap global variables with global test opts
  selftests/mm/uffd: Swap global variables with global test opts

 tools/testing/selftests/mm/uffd-common.c     | 269 +++++-----
 tools/testing/selftests/mm/uffd-common.h     |  78 +--
 tools/testing/selftests/mm/uffd-stress.c     | 226 ++++----
 tools/testing/selftests/mm/uffd-unit-tests.c | 523 ++++++++++---------
 tools/testing/selftests/mm/uffd-wp-mremap.c  |  23 +-
 5 files changed, 591 insertions(+), 528 deletions(-)

--
2.20.1
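The refactoring pattern described in this cover letter can be illustrated with a minimal sketch. The struct name, its fields, and both helpers are hypothetical and are not taken from the patches; the point is only that each test owns its own state instead of reading file-scope globals.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical per-test option struct. The field names are
 * illustrative only and do not come from the patch series.
 */
struct uffd_test_opts {
	unsigned long nr_pages;
	unsigned long page_size;
	int uffd;	/* userfaultfd file descriptor, -1 while not open */
};

/* Initialize one instance; every test gets its own copy. */
static void uffd_test_opts_init(struct uffd_test_opts *opts,
				unsigned long nr_pages,
				unsigned long page_size)
{
	assert(opts != NULL);
	opts->nr_pages = nr_pages;
	opts->page_size = page_size;
	opts->uffd = -1;
}

/* Helpers take the struct explicitly instead of reading globals. */
static unsigned long uffd_test_area_bytes(const struct uffd_test_opts *opts)
{
	return opts->nr_pages * opts->page_size;
}
```

Because the state travels as an argument, two tests with different page sizes can coexist in one process, and a debugger shows the relevant values in each stack frame.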

1 week, 1 day


[PATCH v18 0/8] fork: Support shadow stacks in clone3()

by Mark Brown

The kernel has recently added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and makes it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process.

Our API for shadow stacks does not currently offer userspace any flexibility for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over-allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack pointer, this must point to memory mapped for use as a shadow stack by map_shadow_stack() with an architecture specified shadow stack token at the top of the stack.

Yuri Khrustalev has raised questions from the libc side regarding discoverability of extended clone3() structure sizes[2], this seems like a general issue with clone3(). There was a suggestion to add a hwcap on arm64 which isn't ideal but is doable there, though architecture specific mechanisms would also be needed for x86 (and RISC-V if its support gets merged before this does). The idea has, however, had strong pushback from the architecture maintainers and it is possible to detect support for this in clone3() by attempting a call with a misaligned shadow stack pointer specified so no hwcap has been added.

[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yuri Khrustalev this version has been tested on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@k…

Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k…

Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k…

Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k…

Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…

Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…

Changes in v12:
- Add the regular prctl() to the userspace API document since arm64 support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…

Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…

Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…

Changes in v9:
- Pull token validation earlier and report problems with an error return to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…

Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…

---
Mark Brown (8):
  arm64/gcs: Return a success value from gcs_alloc_thread_stack()
  Documentation: userspace-api: Add shadow stack API documentation
  selftests: Provide helper header for shadow stack testing
  fork: Add shadow stack support to clone3()
  selftests/clone3: Remove redundant flushes of output streams
  selftests/clone3: Factor more of main loop into test_clone3()
  selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
  selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst             |   1 +
 Documentation/userspace-api/shadow_stack.rst      |  44 +++++
 arch/arm64/include/asm/gcs.h                      |   8 +-
 arch/arm64/kernel/process.c                       |   8 +-
 arch/arm64/mm/gcs.c                               |  55 +++++-
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           |  53 ++++-
 include/asm-generic/cacheflush.h                  |  11 ++
 include/linux/sched/task.h                        |  17 ++
 include/uapi/linux/sched.h                        |   9 +-
 kernel/fork.c                                     |  93 +++++++--
 tools/testing/selftests/clone3/clone3.c           | 226 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h |  65 ++++++-
 tools/testing/selftests/ksft_shstk.h              |  98 ++++++++++
 15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
--
Mark Brown <broonie(a)kernel.org>
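The detection approach mentioned in the cover letter depends on passing a deliberately misaligned shadow stack pointer. The sketch below covers only the userspace-side arithmetic and makes no clone3() call. It assumes the alignment requirement is sizeof(void *), which the v10 changelog states for the base and size of the earlier interface; whether the final pointer-based interface uses the same rule is not confirmed here. Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of the pointer size. */
static bool shstk_ptr_is_aligned(uint64_t addr)
{
	return (addr % sizeof(void *)) == 0;
}

/*
 * Hypothetical helper: derive a deliberately misaligned value from an
 * aligned one, suitable for a probe call that is expected to fail.
 */
static uint64_t shstk_misaligned_probe(uint64_t aligned_addr)
{
	assert(shstk_ptr_is_aligned(aligned_addr));
	return aligned_addr + 1;
}
```

A probe built this way can never start a thread on a kernel that validates alignment, so the only observable effect of the call would be its error code.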

1 week, 1 day


[PATCH v4 00/23] ARM64 PMU Partitioning

by Colton Lewis

This series creates a new PMU scheme on ARM, a partitioned PMU that allows reserving a subset of counters for more direct guest access, significantly reducing overhead. More details, including performance benchmarks, can be read in the v1 cover letter linked below.

v4:
* Apply Mark Brown's non-UNDEF FGT control commit to the PMU FGT controls and calculate those controls with the others in kvm_calculate_traps()
* Introduce lazy context swaps, which only turn on for guests that have enabled partitioning and accessed PMU registers.
* Rename pmu-part.c to pmu-direct.c because future features might achieve direct PMU access without partitioning.
* Better explain certain commits, such as why the untrapped registers are safe to untrap.
* Reduce the PMU include cleanup down to only what is still necessary and explain why.

v3: https://lore.kernel.org/kvm/20250626200459.1153955-1-coltonlewis@google.com/
v2: https://lore.kernel.org/kvm/20250620221326.1261128-1-coltonlewis@google.com/
v1: https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/

Colton Lewis (21):
  arm64: cpufeature: Add cpucap for HPMN0
  KVM: arm64: Reorganize PMU functions
  perf: arm_pmuv3: Introduce method to partition the PMU
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Keep out of guest counter partition
  KVM: arm64: Account for partitioning in kvm_pmu_get_max_counters()
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Writethrough trapped PMEVTYPER register
  KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
  KVM: arm64: Writethrough trapped PMOVS register
  KVM: arm64: Write fast path PMU register handlers
  KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU
  KVM: arm64: Account for partitioning in PMCR_EL0 access
  KVM: arm64: Context swap Partitioned PMU guest registers
  KVM: arm64: Enforce PMU event filter at vcpu_load()
  KVM: arm64: Extract enum debug_owner to enum vcpu_register_owner
  KVM: arm64: Implement lazy PMU context swaps
  perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Inject recorded guest interrupts
  KVM: arm64: Add ioctl to partition the PMU when supported
  KVM: arm64: selftests: Add test case for partitioned PMU

Marc Zyngier (1):
  KVM: arm64: Reorganize PMU includes

Mark Brown (1):
  KVM: arm64: Introduce non-UNDEF FGT control

 Documentation/virt/kvm/api.rst                |  21 +
 arch/arm/include/asm/arm_pmuv3.h              |  38 +
 arch/arm64/include/asm/arm_pmuv3.h            |  61 +-
 arch/arm64/include/asm/kvm_host.h             |  34 +-
 arch/arm64/include/asm/kvm_pmu.h              | 123 +++
 arch/arm64/include/asm/kvm_types.h            |   7 +-
 arch/arm64/kernel/cpufeature.c                |   8 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |  22 +
 arch/arm64/kvm/debug.c                        |  33 +-
 arch/arm64/kvm/hyp/include/hyp/debug-sr.h     |   6 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       | 181 ++++-
 arch/arm64/kvm/pmu-direct.c                   | 395 ++++++++++
 arch/arm64/kvm/pmu-emul.c                     | 674 +---------------
 arch/arm64/kvm/pmu.c                          | 725 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     | 137 +++-
 arch/arm64/tools/cpucaps                      |   1 +
 arch/arm64/tools/sysreg                       |   6 +-
 drivers/perf/arm_pmuv3.c                      | 128 +++-
 include/linux/perf/arm_pmu.h                  |   1 +
 include/linux/perf/arm_pmuv3.h                |  14 +-
 include/uapi/linux/kvm.h                      |   4 +
 tools/include/uapi/linux/kvm.h                |   2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c |  62 +-
 24 files changed, 1910 insertions(+), 775 deletions(-)
 create mode 100644 arch/arm64/kvm/pmu-direct.c

base-commit: 79150772457f4d45e38b842d786240c36bb1f97f
--
2.50.0.727.gbf7dc18ff4-goog
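Reserving a subset of counters for the guest can be pictured as splitting the counter index range at a single point. The sketch below assumes the split follows the architectural MDCR_EL2.HPMN convention, where counters below the split point are the guest's and the remainder stay with the host. The function names are hypothetical and the series' actual bitmask code is not shown in this cover letter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bitmask of counters [0, hpmn) reserved for the
 * guest. hpmn == 0 means the guest gets no counters.
 */
static uint32_t guest_counter_mask(unsigned int hpmn)
{
	assert(hpmn <= 31);
	return hpmn ? (UINT32_MAX >> (32 - hpmn)) : 0;
}

/*
 * Hypothetical helper: bitmask of counters [hpmn, nr_counters) that
 * remain available to the host.
 */
static uint32_t host_counter_mask(unsigned int nr_counters, unsigned int hpmn)
{
	uint32_t all;

	assert(nr_counters <= 31 && hpmn <= nr_counters);
	all = nr_counters ? (UINT32_MAX >> (32 - nr_counters)) : 0;
	return all & ~guest_counter_mask(hpmn);
}
```

With six implemented counters and a split point of four, the guest would own counters 0 to 3 and the host counters 4 and 5; the two masks never overlap.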

1 week, 3 days


[PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks

by Matt Fleming

From: Matt Fleming <mfleming(a)cloudflare.com>

Add benchmarks for the standard set of operations: lookup, update, delete. Also, include a benchmark for trie_free() which is known to have terrible performance for maps with many entries.

Benchmarks operate on tries without gaps in the key range, i.e. each test begins with a trie with valid keys in the range [0, nr_entries). This is intended to cause maximum branching when traversing the trie.

All measurements are recorded inside the kernel to remove syscall overhead. Most benchmarks run an XDP program to generate stats but free needs to collect latencies using fentry/fexit on map_free_deferred() because it's not possible to use fentry directly on lpm_trie.c since commit c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs") and there's no way to create/destroy a map from within an XDP program.

Here is example output from an AMD EPYC 9684X 96-Core machine for each of the benchmarks using a trie with 10K entries and a 32-bit prefix length, e.g.

  $ ./bench lpm-trie-$op \
  	--prefix_len=32 \
  	--producers=1 \
  	--nr_entries=10000

  lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
  update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
  delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
    free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op

Tested-by: Jesper Dangaard Brouer <hawk(a)kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk(a)kernel.org>
Signed-off-by: Matt Fleming <mfleming(a)cloudflare.com>
---
Changes in v3:
- Replace BPF_CORE_READ() with BPF_CORE_READ_STR_INTO() to avoid gcc-bpf CI build failure

Changes in v2:
- Add Jesper's Tested-by and Reviewed-by tags
- Remove use of atomic_*() in favour of __sync_add_and_fetch()
- Use a file-local 'deleted_entries' in the DELETE op benchmark and add a comment explaining why non-atomic accesses are safe.
- Bump 'hits' with the number of bpf_loop() loops actually executed tools/testing/selftests/bpf/Makefile | 2 + tools/testing/selftests/bpf/bench.c | 10 + tools/testing/selftests/bpf/bench.h | 1 + .../selftests/bpf/benchs/bench_lpm_trie_map.c | 337 ++++++++++++++++++ .../selftests/bpf/progs/lpm_trie_bench.c | 171 +++++++++ .../selftests/bpf/progs/lpm_trie_map.c | 19 + 6 files changed, 540 insertions(+) create mode 100644 tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_bench.c create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_map.c diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 910d8d6402ef..10a5f1d0fa41 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -815,6 +815,7 @@ $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h $(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h +$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ) $(OUTPUT)/bench: LDLIBS += -lm $(OUTPUT)/bench: $(OUTPUT)/bench.o \ @@ -836,6 +837,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \ $(OUTPUT)/bench_htab_mem.o \ $(OUTPUT)/bench_bpf_crypto.o \ $(OUTPUT)/bench_sockmap.o \ + $(OUTPUT)/bench_lpm_trie_map.o \ # $(call msg,BINARY,,$@) $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@ diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index ddd73d06a1eb..fd15f60fd5a8 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -284,6 +284,7 @@ extern struct argp bench_htab_mem_argp; extern struct argp bench_trigger_batch_argp; extern struct argp bench_crypto_argp; extern struct argp bench_sockmap_argp; 
+extern struct argp bench_lpm_trie_map_argp; static const struct argp_child bench_parsers[] = { { &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 }, @@ -299,6 +300,7 @@ static const struct argp_child bench_parsers[] = { { &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 }, { &bench_crypto_argp, 0, "bpf crypto benchmark", 0 }, { &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 }, + { &bench_lpm_trie_map_argp, 0, "LPM trie map benchmark", 0 }, {}, }; @@ -558,6 +560,10 @@ extern const struct bench bench_htab_mem; extern const struct bench bench_crypto_encrypt; extern const struct bench bench_crypto_decrypt; extern const struct bench bench_sockmap; +extern const struct bench bench_lpm_trie_lookup; +extern const struct bench bench_lpm_trie_update; +extern const struct bench bench_lpm_trie_delete; +extern const struct bench bench_lpm_trie_free; static const struct bench *benchs[] = { &bench_count_global, @@ -625,6 +631,10 @@ static const struct bench *benchs[] = { &bench_crypto_encrypt, &bench_crypto_decrypt, &bench_sockmap, + &bench_lpm_trie_lookup, + &bench_lpm_trie_update, + &bench_lpm_trie_delete, + &bench_lpm_trie_free, }; static void find_benchmark(void) diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h index 005c401b3e22..bea323820ffb 100644 --- a/tools/testing/selftests/bpf/bench.h +++ b/tools/testing/selftests/bpf/bench.h @@ -46,6 +46,7 @@ struct bench_res { unsigned long gp_ns; unsigned long gp_ct; unsigned int stime; + unsigned long duration_ns; }; struct bench { diff --git a/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c new file mode 100644 index 000000000000..435b5c7ceee9 --- /dev/null +++ b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Cloudflare */ + +/* + * All of these benchmarks operate on tries with keys in the range + * [0, args.nr_entries), 
i.e. there are no gaps or partially filled + * branches of the trie for any key < args.nr_entries. + * + * This gives an idea of worst-case behaviour. + */ + +#include <argp.h> +#include <linux/time64.h> +#include <linux/if_ether.h> +#include "lpm_trie_bench.skel.h" +#include "lpm_trie_map.skel.h" +#include "bench.h" +#include "testing_helpers.h" + +static struct ctx { + struct lpm_trie_bench *bench; +} ctx; + +static struct { + __u32 nr_entries; + __u32 prefixlen; +} args = { + .nr_entries = 10000, + .prefixlen = 32, +}; + +enum { + ARG_NR_ENTRIES = 9000, + ARG_PREFIX_LEN, +}; + +static const struct argp_option opts[] = { + { "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0, + "Number of unique entries in the LPM trie" }, + { "prefix_len", ARG_PREFIX_LEN, "PREFIX_LEN", 0, + "Number of prefix bits to use in the LPM trie" }, + {}, +}; + +static error_t lpm_parse_arg(int key, char *arg, struct argp_state *state) +{ + long ret; + + switch (key) { + case ARG_NR_ENTRIES: + ret = strtol(arg, NULL, 10); + if (ret < 1 || ret > UINT_MAX) { + fprintf(stderr, "Invalid nr_entries count."); + argp_usage(state); + } + args.nr_entries = ret; + break; + case ARG_PREFIX_LEN: + ret = strtol(arg, NULL, 10); + if (ret < 1 || ret > UINT_MAX) { + fprintf(stderr, "Invalid prefix_len value."); + argp_usage(state); + } + args.prefixlen = ret; + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} + +const struct argp bench_lpm_trie_map_argp = { + .options = opts, + .parser = lpm_parse_arg, +}; + +static void __lpm_validate(void) +{ + if (env.consumer_cnt != 0) { + fprintf(stderr, "benchmark doesn't support consumer!\n"); + exit(1); + } + + if ((1UL << args.prefixlen) < args.nr_entries) { + fprintf(stderr, "prefix_len value too small for nr_entries!\n"); + exit(1); + }; +} + +enum { OP_LOOKUP = 1, OP_UPDATE, OP_DELETE, OP_FREE }; + +static void lpm_delete_validate(void) +{ + __lpm_validate(); + + if (env.producer_cnt != 1) { + fprintf(stderr, + "lpm-trie-delete requires a single 
producer!\n"); + exit(1); + } +} + +static void lpm_free_validate(void) +{ + __lpm_validate(); + + if (env.producer_cnt != 1) { + fprintf(stderr, "lpm-trie-free requires a single producer!\n"); + exit(1); + } +} + +static void fill_map(int map_fd) +{ + int i, err; + + for (i = 0; i < args.nr_entries; i++) { + struct trie_key { + __u32 prefixlen; + __u32 data; + } key = { args.prefixlen, i }; + __u32 val = 1; + + err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST); + if (err) { + fprintf(stderr, "failed to add key %d to map: %d\n", + key.data, -err); + exit(1); + } + } +} + +static void __lpm_setup(void) +{ + ctx.bench = lpm_trie_bench__open_and_load(); + if (!ctx.bench) { + fprintf(stderr, "failed to open skeleton\n"); + exit(1); + } + + ctx.bench->bss->nr_entries = args.nr_entries; + ctx.bench->bss->prefixlen = args.prefixlen; + + if (lpm_trie_bench__attach(ctx.bench)) { + fprintf(stderr, "failed to attach skeleton\n"); + exit(1); + } +} + +static void lpm_setup(void) +{ + int fd; + + __lpm_setup(); + + fd = bpf_map__fd(ctx.bench->maps.trie_map); + fill_map(fd); +} + +static void lpm_lookup_setup(void) +{ + lpm_setup(); + + ctx.bench->bss->op = OP_LOOKUP; +} + +static void lpm_update_setup(void) +{ + lpm_setup(); + + ctx.bench->bss->op = OP_UPDATE; +} + +static void lpm_delete_setup(void) +{ + lpm_setup(); + + ctx.bench->bss->op = OP_DELETE; +} + +static void lpm_free_setup(void) +{ + __lpm_setup(); + ctx.bench->bss->op = OP_FREE; +} + +static void lpm_measure(struct bench_res *res) +{ + res->hits = atomic_swap(&ctx.bench->bss->hits, 0); + res->duration_ns = atomic_swap(&ctx.bench->bss->duration_ns, 0); +} + +/* For LOOKUP, UPDATE, and DELETE */ +static void *lpm_producer(void *unused __always_unused) +{ + int err; + char in[ETH_HLEN]; /* unused */ + + LIBBPF_OPTS(bpf_test_run_opts, opts, .data_in = in, + .data_size_in = sizeof(in), .repeat = 1, ); + + while (true) { + int fd = bpf_program__fd(ctx.bench->progs.run_bench); + err = bpf_prog_test_run_opts(fd, 
&opts); + if (err) { + fprintf(stderr, "failed to run BPF prog: %d\n", err); + exit(1); + } + + if (opts.retval < 0) { + fprintf(stderr, "BPF prog returned error: %d\n", + opts.retval); + exit(1); + } + + if (ctx.bench->bss->op == OP_DELETE && opts.retval == 1) { + /* trie_map needs to be refilled */ + fill_map(bpf_map__fd(ctx.bench->maps.trie_map)); + } + } + + return NULL; +} + +static void *lpm_free_producer(void *unused __always_unused) +{ + while (true) { + struct lpm_trie_map *skel; + + skel = lpm_trie_map__open_and_load(); + if (!skel) { + fprintf(stderr, "failed to open skeleton\n"); + exit(1); + } + + fill_map(bpf_map__fd(skel->maps.trie_free_map)); + lpm_trie_map__destroy(skel); + } + + return NULL; +} + +static void free_ops_report_progress(int iter, struct bench_res *res, + long delta_ns) +{ + double hits_per_sec, hits_per_prod; + double rate_divisor = 1000.0; + char rate = 'K'; + + hits_per_sec = res->hits / (res->duration_ns / (double)NSEC_PER_SEC) / + rate_divisor; + hits_per_prod = hits_per_sec / env.producer_cnt; + + printf("Iter %3d (%7.3lfus): ", iter, + (delta_ns - NSEC_PER_SEC) / 1000.0); + printf("hits %8.3lf%c/s (%7.3lf%c/prod)\n", hits_per_sec, rate, + hits_per_prod, rate); +} + +static void free_ops_report_final(struct bench_res res[], int res_cnt) +{ + double hits_mean = 0.0, hits_stddev = 0.0; + double lat_divisor = 1000000.0; + double rate_divisor = 1000.0; + const char *unit = "ms"; + double latency = 0.0; + char rate = 'K'; + int i; + + for (i = 0; i < res_cnt; i++) { + double val = res[i].hits / rate_divisor / + (res[i].duration_ns / (double)NSEC_PER_SEC); + hits_mean += val / (0.0 + res_cnt); + latency += res[i].duration_ns / res[i].hits / (0.0 + res_cnt); + } + + if (res_cnt > 1) { + for (i = 0; i < res_cnt; i++) { + double val = + res[i].hits / rate_divisor / + (res[i].duration_ns / (double)NSEC_PER_SEC); + hits_stddev += (hits_mean - val) * (hits_mean - val) / + (res_cnt - 1.0); + } + + hits_stddev = sqrt(hits_stddev); + } + 
printf("Summary: throughput %8.3lf \u00B1 %5.3lf %c ops/s (%7.3lf%c ops/prod), ", + hits_mean, hits_stddev, rate, hits_mean / env.producer_cnt, + rate); + printf("latency %8.3lf %s/op\n", + latency / lat_divisor / env.producer_cnt, unit); +} + +const struct bench bench_lpm_trie_lookup = { + .name = "lpm-trie-lookup", + .argp = &bench_lpm_trie_map_argp, + .validate = __lpm_validate, + .setup = lpm_lookup_setup, + .producer_thread = lpm_producer, + .measure = lpm_measure, + .report_progress = ops_report_progress, + .report_final = ops_report_final, +}; + +const struct bench bench_lpm_trie_update = { + .name = "lpm-trie-update", + .argp = &bench_lpm_trie_map_argp, + .validate = __lpm_validate, + .setup = lpm_update_setup, + .producer_thread = lpm_producer, + .measure = lpm_measure, + .report_progress = ops_report_progress, + .report_final = ops_report_final, +}; + +const struct bench bench_lpm_trie_delete = { + .name = "lpm-trie-delete", + .argp = &bench_lpm_trie_map_argp, + .validate = lpm_delete_validate, + .setup = lpm_delete_setup, + .producer_thread = lpm_producer, + .measure = lpm_measure, + .report_progress = ops_report_progress, + .report_final = ops_report_final, +}; + +const struct bench bench_lpm_trie_free = { + .name = "lpm-trie-free", + .argp = &bench_lpm_trie_map_argp, + .validate = lpm_free_validate, + .setup = lpm_free_setup, + .producer_thread = lpm_free_producer, + .measure = lpm_measure, + .report_progress = free_ops_report_progress, + .report_final = free_ops_report_final, +}; diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c new file mode 100644 index 000000000000..522e1cbef490 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c @@ -0,0 +1,171 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Cloudflare */ + +#include <vmlinux.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> +#include "bpf_misc.h" + 
+#define BPF_OBJ_NAME_LEN 16U +#define MAX_ENTRIES 100000000 +#define NR_LOOPS 10000 + +struct trie_key { + __u32 prefixlen; + __u32 data; +}; + +char _license[] SEC("license") = "GPL"; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 512); + __type(key, struct bpf_map *); + __type(value, __u64); +} latency_free_start SEC(".maps"); + +/* Filled by userspace. See fill_map() in bench_lpm_trie_map.c */ +struct { + __uint(type, BPF_MAP_TYPE_LPM_TRIE); + __type(key, struct trie_key); + __type(value, __u32); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, MAX_ENTRIES); +} trie_map SEC(".maps"); + +long hits; +long duration_ns; + +/* Configured from userspace */ +__u32 nr_entries; +__u32 prefixlen; +__u8 op; + +SEC("fentry/bpf_map_free_deferred") +int BPF_PROG(trie_free_entry, struct work_struct *work) +{ + struct bpf_map *map = container_of(work, struct bpf_map, work); + char name[BPF_OBJ_NAME_LEN]; + u32 map_type; + __u64 val; + + map_type = BPF_CORE_READ(map, map_type); + if (map_type != BPF_MAP_TYPE_LPM_TRIE) + return 0; + + /* + * Ideally we'd have access to the map ID but that's already + * freed before we enter trie_free(). 
+ */ + BPF_CORE_READ_STR_INTO(&name, map, name); + if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map")) + return 0; + + val = bpf_ktime_get_ns(); + bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY); + + return 0; +} + +SEC("fexit/bpf_map_free_deferred") +int BPF_PROG(trie_free_exit, struct work_struct *work) +{ + struct bpf_map *map = container_of(work, struct bpf_map, work); + __u64 *val; + + val = bpf_map_lookup_elem(&latency_free_start, &map); + if (val) { + __sync_add_and_fetch(&duration_ns, bpf_ktime_get_ns() - *val); + __sync_add_and_fetch(&hits, 1); + bpf_map_delete_elem(&latency_free_start, &map); + } + + return 0; +} + +static void gen_random_key(struct trie_key *key) +{ + key->prefixlen = prefixlen; + key->data = bpf_get_prandom_u32() % nr_entries; +} + +static int lookup(__u32 index, __u32 *unused) +{ + struct trie_key key; + + gen_random_key(&key); + bpf_map_lookup_elem(&trie_map, &key); + return 0; +} + +static int update(__u32 index, __u32 *unused) +{ + struct trie_key key; + u32 val = bpf_get_prandom_u32(); + + gen_random_key(&key); + bpf_map_update_elem(&trie_map, &key, &val, BPF_EXIST); + return 0; +} + +static __u32 deleted_entries; + +static int delete (__u32 index, bool *need_refill) +{ + struct trie_key key = { + .data = deleted_entries, + .prefixlen = prefixlen, + }; + + bpf_map_delete_elem(&trie_map, &key); + + /* Do we need to refill the map? */ + if (++deleted_entries == nr_entries) { + /* + * Atomicity isn't required because DELETE only supports + * one producer running concurrently. What we need is a + * way to track how many entries have been deleted from + * the trie between consecutive invocations of the BPF + * prog because a single bpf_loop() call might not + * delete all entries, e.g. when NR_LOOPS < nr_entries. 
+ */ + deleted_entries = 0; + *need_refill = true; + return 1; + } + + return 0; +} + +SEC("xdp") +int BPF_PROG(run_bench) +{ + bool need_refill = false; + u64 start, delta; + int loops; + + start = bpf_ktime_get_ns(); + + switch (op) { + case 1: + loops = bpf_loop(NR_LOOPS, lookup, NULL, 0); + break; + case 2: + loops = bpf_loop(NR_LOOPS, update, NULL, 0); + break; + case 3: + loops = bpf_loop(NR_LOOPS, delete, &need_refill, 0); + break; + default: + bpf_printk("invalid benchmark operation\n"); + return -1; + } + + delta = bpf_ktime_get_ns() - start; + + __sync_add_and_fetch(&duration_ns, delta); + __sync_add_and_fetch(&hits, loops); + + return need_refill; +} diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_map.c b/tools/testing/selftests/bpf/progs/lpm_trie_map.c new file mode 100644 index 000000000000..2ab43e2cd6c6 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/lpm_trie_map.c @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#define MAX_ENTRIES 100000000 + +struct trie_key { + __u32 prefixlen; + __u32 data; +}; + +struct { + __uint(type, BPF_MAP_TYPE_LPM_TRIE); + __type(key, struct trie_key); + __type(value, __u32); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, MAX_ENTRIES); +} trie_free_map SEC(".maps"); -- 2.34.1
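The throughput and latency figures in the commit message are both derived from the two counters the BPF side accumulates, hits and duration_ns. The sketch below restates that arithmetic in isolation. The function names are hypothetical, and unlike the report functions in the patch it does the division in floating point throughout and skips the per-iteration averaging.

```c
#include <assert.h>

#define NSEC_PER_SEC_F 1000000000.0

/* Hypothetical helper: operations per second from the two counters. */
static double ops_per_sec(unsigned long hits, unsigned long duration_ns)
{
	assert(duration_ns > 0);
	return hits / (duration_ns / NSEC_PER_SEC_F);
}

/* Hypothetical helper: mean time per operation in nanoseconds. */
static double ns_per_op(unsigned long hits, unsigned long duration_ns)
{
	assert(hits > 0);
	return (double)duration_ns / hits;
}
```

The two values are reciprocals up to a unit change: the quoted lookup throughput of 7.423 M ops/s corresponds to roughly 1e9 / 7.423e6, or about 134.7 ns per operation, in line with the reported latency.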

1 week, 4 days


[PATCH v5 00/15] kunit: Introduce UAPI testing framework

by Thomas Weißschuh

Currently, testing of userspace and in-kernel APIs uses two different frameworks: kselftests for the userspace ones and Kunit for the in-kernel ones. Besides their different scopes, both have different strengths and limitations:

Kunit:
* Tests are normal kernel code.
* They use the regular kernel toolchain.
* They can be packaged and distributed as modules conveniently.

Kselftests:
* Tests are normal userspace code
* They need a userspace toolchain. A kernel cross toolchain is likely not enough.
* A fair amount of userland is required to run the tests, which means a full distro or handcrafted rootfs.
* There is no way to conveniently package and run kselftests with a given kernel image.
* The kselftests makefiles are not as powerful as regular kbuild. For example they are missing proper header dependency tracking or more complex compiler option modifications.

Therefore kunit is much easier to run against different kernel configurations and architectures. This series aims to combine kselftests and kunit, avoiding both their limitations. It works by compiling the userspace kselftests as part of the regular kernel build, embedding them into the kunit kernel or module and executing them from there. If the kernel toolchain is not fit to produce userspace because of a missing libc, the kernel's own nolibc can be used instead. The structured TAP output from the kselftest is integrated into the kunit KTAP output transparently, the kunit parser can parse the combined logs together.

Further room for improvements:
* Call each test in its completely dedicated namespace
* Handle additional test files besides the test executable through archives. CPIO, cramfs, etc.
* Compatibility with kselftest_harness.h (in progress)
* Expose the blobs in debugfs
* Provide some convenience wrappers around compat userprogs
* Figure out a migration path/coexistence solution for kunit UAPI and tools/testing/selftests/

Output from the kunit example testcase, note the output of "example_uapi_tests".

$ ./tools/testing/kunit/kunit.py run --kunitconfig lib/kunit example
...
Running tests with:
$ .kunit/linux kunit.filter_glob=example kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:53:53] ================== example (10 subtests) ===================
[11:53:53] [PASSED] example_simple_test
[11:53:53] [SKIPPED] example_skip_test
[11:53:53] [SKIPPED] example_mark_skipped_test
[11:53:53] [PASSED] example_all_expect_macros_test
[11:53:53] [PASSED] example_static_stub_test
[11:53:53] [PASSED] example_static_stub_using_fn_ptr_test
[11:53:53] [PASSED] example_priv_test
[11:53:53] =================== example_params_test ===================
[11:53:53] [SKIPPED] example value 3
[11:53:53] [PASSED] example value 2
[11:53:53] [PASSED] example value 1
[11:53:53] [SKIPPED] example value 0
[11:53:53] =============== [PASSED] example_params_test ===============
[11:53:53] [PASSED] example_slow_test
[11:53:53] ======================= (4 subtests) =======================
[11:53:53] [PASSED] procfs
[11:53:53] [PASSED] userspace test 2
[11:53:53] [SKIPPED] userspace test 3: some reason
[11:53:53] [PASSED] userspace test 4
[11:53:53] ================ [PASSED] example_uapi_test ================
[11:53:53] ===================== [PASSED] example =====================
[11:53:53] ============================================================
[11:53:53] Testing complete. Ran 16 tests: passed: 11, skipped: 5
[11:53:53] Elapsed time: 67.543s total, 1.823s configuring, 65.655s building, 0.058s running

Based on v6.16-rc1.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v5:
- Initialize output variable of kernel_wait()
- Fix .incbin with in-tree builds
- Keep requirement of KTAP tests to have a number which was removed accidentally
- Only synthesize KTAP subtest failure if the outer one is TestStatus.FAILURE
- Use -I instead of -isystem in NOLIBC_USERCFLAGS to populate dependency files
- +To filesystem developers to all patches
- +To Luis Chamberlain for discussions about usage of usermodehelper (see patches 6 and 12)
- Link to v4: https://lore.kernel.org/r/20250626-kunit-kselftests-v4-0-48760534fef5@linut…

Changes in v4:
- Move Kconfig.nolibc from tools/ to init/
- Drop generic userprogs nolibc integration
- Drop generic blob framework
- Pick up review tags from David
- Extend new kunit TAP parser tests
- Add MAINTAINERS entry
- Allow CONFIG_KUNIT_UAPI=m
- Split /proc validation into dedicated UAPI test
- Trim recipient list a bit
- Use KUNIT_FAIL_AND_ABORT() over KUNIT_FAIL()
- Link to v3: https://lore.kernel.org/r/20250611-kunit-kselftests-v3-0-55e3d148cbc6@linut…

Changes in v3:
- Reintroduce CONFIG_CC_CAN_LINK_STATIC
- Enable CONFIG_ARCH_HAS_NOLIBC for m68k and SPARC
- Properly handle 'clean' target for userprogs
- Use ramfs over tmpfs to reduce dependencies
- Inherit userprogs byte order and ABI from kernel
- Drop now unnecessary "#ifndef NOLIBC"
- Pick up review tags
- Drop usage of __private in blob.h, sparse complains and it is not really necessary
- Fix execution on loongarch when using clang
- Drop userprogs libgcc handling, it was ugly and is not yet necessary
- Link to v2: https://lore.kernel.org/r/20250407-kunit-kselftests-v2-0-454114e287fd@linut…

Changes in v2:
- Rebase onto v6.15-rc1
- Add documentation and kernel docs
- Resolve invalid kconfig breakages
- Drop already applied patch "kbuild: implement CONFIG_HEADERS_INSTALL for Usermode Linux"
- Drop userprogs CONFIG_WERROR integration, it doesn't need to be part of this series
- Replace patch prefix "kconfig" with "kbuild"
- Rename kunit_uapi_run_executable() to kunit_uapi_run_kselftest()
- Generate private, conflict-free symbols in the blob framework
- Handle kselftest exit codes
- Handle SIGABRT
- Forward output also to kunit debugfs log
- Install a fd=0 stdin filedescriptor
- Link to v1: https://lore.kernel.org/r/20250217-kunit-kselftests-v1-0-42b4524c3b0a@linut…

---
Thomas Weißschuh (15):
kbuild: userprogs: avoid duplication of flags inherited from kernel
kbuild: userprogs: also inherit byte order and ABI from kernel
kbuild: doc: add label for userprogs section
init: re-add CONFIG_CC_CAN_LINK_STATIC
init: add nolibc build support
fs,fork,exit: export symbols necessary for KUnit UAPI support
kunit: tool: Add test for nested test result reporting
kunit: tool: Don't overwrite test status based on subtest counts
kunit: tool: Parse skipped tests from kselftest.h
kunit: Always descend into kunit directory during build
kunit: qemu_configs: loongarch: Enable LSX/LSAX
kunit: Introduce UAPI testing framework
kunit: uapi: Add example for UAPI tests
kunit: uapi: Introduce preinit executable
kunit: uapi: Validate usability of /proc

Documentation/dev-tools/kunit/api/index.rst | 5 +
Documentation/dev-tools/kunit/api/uapi.rst | 14 +
Documentation/kbuild/makefiles.rst | 2 +
MAINTAINERS | 11 +
Makefile | 7 +-
fs/exec.c | 2 +
fs/file.c | 1 +
fs/filesystems.c | 2 +
fs/fs_struct.c | 1 +
fs/pipe.c | 2 +
include/kunit/uapi.h | 77 ++++++
init/Kconfig | 7 +
init/Kconfig.nolibc | 15 +
init/Makefile.nolibc | 13 +
kernel/exit.c | 3 +
kernel/fork.c | 2 +
lib/Makefile | 4 -
lib/kunit/Kconfig | 14 +
lib/kunit/Makefile | 30 +-
lib/kunit/kunit-example-test.c | 15 +
lib/kunit/kunit-example-uapi.c | 22 ++
lib/kunit/kunit-test-uapi.c | 51 ++++
lib/kunit/kunit-test.c | 23 +-
lib/kunit/kunit-uapi.c | 305 +++++++++++++++++++++
lib/kunit/uapi-preinit.c | 63 +++++
tools/testing/kunit/kunit_parser.py | 11 +-
tools/testing/kunit/kunit_tool_test.py | 11 +
tools/testing/kunit/qemu_configs/loongarch.py | 2 +
.../test_is_test_passed-failure-nested.log | 10 +
.../test_data/test_is_test_passed-kselftest.log | 3 +-
30 files changed, 715 insertions(+), 13 deletions(-)
---
base-commit: 9d5898b413d17510b2a41664a42390a2c79f8bf4
change-id: 20241015-kunit-kselftests-56273bc40442

Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
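As a rough illustration of the structured TAP output the cover letter refers to, the sketch below formats a single result line in the general KTAP shape ("ok <num> <name>", optionally followed by "# SKIP <reason>"). The enum and the helper name ktap_format_result() are hypothetical and not part of the series; the exact text emitted by kselftest.h may differ in detail, which is presumably why the series teaches the kunit parser about its skip format.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not taken from the series. */
enum ktap_status { KTAP_PASS, KTAP_FAIL, KTAP_SKIP };

/*
 * Format one KTAP-style result line into buf, for example
 * "ok 1 procfs" or "ok 3 userspace test 3 # SKIP some reason".
 * Returns the snprintf() result: the length that was (or would
 * have been) written, excluding the terminating NUL.
 */
static int ktap_format_result(char *buf, size_t len, unsigned int num,
			      const char *name, enum ktap_status status,
			      const char *reason)
{
	/* Skipped tests are reported as "ok" with a SKIP directive. */
	const char *verdict = (status == KTAP_FAIL) ? "not ok" : "ok";

	if (status == KTAP_SKIP)
		return snprintf(buf, len, "%s %u %s # SKIP %s", verdict, num,
				name, reason ? reason : "");

	return snprintf(buf, len, "%s %u %s", verdict, num, name);
}
```

A userspace test would print one such line per test case after a plan line such as "1..4"; the parser then maps each line to a PASSED, FAILED or SKIPPED entry like the ones in the log above.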

1 week, 5 days


[PATCH net-next 0/3] bonding: support aggregator selection based on port priority

by Hangbin Liu

This patchset introduces a new per-port bonding option: `ad_actor_port_prio`. It allows users to configure the actor's port priority, which can then be used by the bonding driver for aggregator selection based on port priority.

This provides finer control over LACP aggregator choice, especially in setups with multiple eligible aggregators over 2 switches.

Hangbin Liu (3):
bonding: add support for per-port LACP actor priority
bonding: support aggregator selection based on port priority
selftests: bonding: add test for LACP actor port priority

Documentation/networking/bonding.rst | 18 ++++-
drivers/net/bonding/bond_3ad.c | 31 ++++++++
drivers/net/bonding/bond_netlink.c | 16 ++++
drivers/net/bonding/bond_options.c | 36 +++++++++
include/net/bond_3ad.h | 2 +
include/net/bond_options.h | 1 +
include/uapi/linux/if_link.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_lacp_prio.sh | 73 +++++++++++++++++++
tools/testing/selftests/net/forwarding/lib.sh | 24 ------
tools/testing/selftests/net/lib.sh | 24 ++++++
11 files changed, 203 insertions(+), 26 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_lacp_prio.sh
--
2.46.0

1 week, 5 days


[PATCH v3] selftests/tty: add TIOCSTI test suite

by Abhinav Saxena

TIOCSTI is a TTY ioctl command that allows inserting characters into the terminal input queue, making it appear as if the user typed those characters. This functionality has behavior that varies based on system configuration and process credentials.

The dev.tty.legacy_tiocsti sysctl introduced in commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") controls TIOCSTI usage. When disabled, TIOCSTI requires CAP_SYS_ADMIN capability.

The current implementation checks the current process's credentials via capable(CAP_SYS_ADMIN), but does not validate against the file opener's credentials stored in file->f_cred. This creates different behavior when file descriptors are passed between processes via SCM_RIGHTS.

Add a test suite with 16 test variants using fixture variants to verify TIOCSTI behavior when dev.tty.legacy_tiocsti is enabled/disabled:

- Basic TIOCSTI tests (8 variants): Direct testing with different capability and controlling terminal combinations
- FD passing tests (8 variants): Test behavior when file descriptors are passed between processes with different capabilities

The FD passing tests document this behavior - some tests show different results than expected based on file opener credentials, demonstrating that TIOCSTI uses current process credentials rather than file opener credentials.

The tests validate proper enforcement of the legacy_tiocsti sysctl. Test implementation uses openpty(3) with TIOCSCTTY for isolated PTY environments. See tty_ioctl(4) for details on TIOCSTI behavior and security requirements.
Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com> --- To run all tests: $ sudo ./tools/testing/selftests/tty/tty_tiocsti_test Test Results: - PASSED: 13/16 tests - Different behavior: 3/16 tests (documenting credential checking behavior) All tests validated using: - scripts/checkpatch.pl --strict (clean output) - Functional testing on kernel v6.16-rc2 Changes in v3: - Replaced all printf() calls with TH_LOG() for proper test logging (Kees Cook) - Added struct __test_metadata parameter to helper functions - Moved common legacy_tiocsti availability check to FIXTURE_SETUP() - Implemented sysctl modification/restoration in FIXTURE_SETUP/TEARDOWN - Used openpty() with TIOCSCTTY for reliable PTY testing environment - Fixed child/parent synchronization in FD passing tests - Replaced manual _exit(1) handling with proper ASSERT statements - Switched // comments to /* */ format throughout - Expanded to 16 test variants using fixture variants - Enhanced error handling and test reliability - Link to v2: https://lore.kernel.org/r/20250713-toicsti-bug-v2-1-b183787eea29@gmail.com - Link to v1: https://lore.kernel.org/r/20250622-toicsti-bug-v1-0-f374373b04b2@gmail.com References: - tty_ioctl(4) - documents TIOCSTI ioctl and capability requirements - openpty(3) - pseudo-terminal creation and management - commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") - Documentation/security/credentials.rst - https://github.com/KSPP/linux/issues/156 - https://lore.kernel.org/linux-hardening/Y0m9l52AKmw6Yxi1@hostpad/ - drivers/tty/Kconfig - Documentation/driver-api/tty/ --- tools/testing/selftests/tty/Makefile | 6 +- tools/testing/selftests/tty/config | 1 + tools/testing/selftests/tty/tty_tiocsti_test.c | 650 +++++++++++++++++++++++++ 3 files changed, 656 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/tty/Makefile b/tools/testing/selftests/tty/Makefile index 50d7027b2ae3..7f6fbe5a0cd5 100644 --- a/tools/testing/selftests/tty/Makefile +++ 
b/tools/testing/selftests/tty/Makefile @@ -1,5 +1,9 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS = -O2 -Wall -TEST_GEN_PROGS := tty_tstamp_update +TEST_GEN_PROGS := tty_tstamp_update tty_tiocsti_test +LDLIBS += -lcap include ../lib.mk + +# Add libcap for TIOCSTI test +$(OUTPUT)/tty_tiocsti_test: LDLIBS += -lcap diff --git a/tools/testing/selftests/tty/config b/tools/testing/selftests/tty/config new file mode 100644 index 000000000000..c6373aba6636 --- /dev/null +++ b/tools/testing/selftests/tty/config @@ -0,0 +1 @@ +CONFIG_LEGACY_TIOCSTI=y diff --git a/tools/testing/selftests/tty/tty_tiocsti_test.c b/tools/testing/selftests/tty/tty_tiocsti_test.c new file mode 100644 index 000000000000..1eafef6e36fa --- /dev/null +++ b/tools/testing/selftests/tty/tty_tiocsti_test.c @@ -0,0 +1,650 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * TTY Tests - TIOCSTI + * + * Copyright © 2025 Abhinav Saxena <xandfury(a)gmail.com> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/ioctl.h> +#include <errno.h> +#include <stdbool.h> +#include <string.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <pwd.h> +#include <termios.h> +#include <grp.h> +#include <sys/capability.h> +#include <sys/prctl.h> +#include <pty.h> +#include <utmp.h> + +#include "../kselftest_harness.h" + +enum test_type { + TEST_PTY_TIOCSTI_BASIC, + TEST_PTY_TIOCSTI_FD_PASSING, + /* other tests cases such as serial may be added. */ +}; + +/* + * Test Strategy: + * - Basic tests: Use PTY with/without TIOCSCTTY (controlling terminal for + * current process) + * - FD passing tests: Child creates PTY, parent receives FD (demonstrates + * security issue) + * + * SECURITY VULNERABILITY DEMONSTRATION: + * FD passing tests show that TIOCSTI uses CURRENT process credentials, not + * opener credentials. 
This means privileged processes can be given FDs from + * unprivileged processes and successfully perform TIOCSTI operations that the + * unprivileged process couldn't do directly. + * + * Attack scenario: + * 1. Unprivileged process opens TTY (direct TIOCSTI fails due to lack of + * privileges) + * 2. Unprivileged process passes FD to privileged process via SCM_RIGHTS + * 3. Privileged process can use TIOCSTI on the FD (succeeds due to its + * privileges) + * 4. Result: Effective privilege escalation via file descriptor passing + * + * This matches the kernel logic in tiocsti(): + * 1. if (!tty_legacy_tiocsti && !capable(CAP_SYS_ADMIN)) return -EIO; + * 2. if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN)) + * return -EPERM; + * Note: Both checks use capable() on CURRENT process, not FD opener! + * + * If the file credentials were also checked along with the capable() checks + * then the results for FD pass tests would be consistent with the basic tests. + */ + +FIXTURE(tiocsti) +{ + int pty_master_fd; /* PTY - for basic tests */ + int pty_slave_fd; + bool has_pty; + bool initial_cap_sys_admin; + int original_legacy_tiocsti_setting; + bool can_modify_sysctl; +}; + +FIXTURE_VARIANT(tiocsti) +{ + const enum test_type test_type; + const bool controlling_tty; /* true=current->signal->tty == tty */ + const int legacy_tiocsti; /* 0=restricted, 1=permissive */ + const bool requires_cap; /* true=with CAP_SYS_ADMIN, false=without */ + const int expected_success; /* 0=success, -EIO/-EPERM=specific error */ +}; + +/* + * Tests Controlling Terminal Variants (current->signal->tty == tty) + * + * TIOCSTI Test Matrix: + * + * | legacy_tiocsti | CAP_SYS_ADMIN | Expected Result | Error | + * |----------------|---------------|-----------------|-------| + * | 1 (permissive) | true | SUCCESS | - | + * | 1 (permissive) | false | SUCCESS | - | + * | 0 (restricted) | true | SUCCESS | - | + * | 0 (restricted) | false | FAILURE | -EIO | + */ + +/* clang-format off */ 
+FIXTURE_VARIANT_ADD(tiocsti, basic_pty_permissive_withcap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = true, + .legacy_tiocsti = 1, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, basic_pty_permissive_nocap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = true, + .legacy_tiocsti = 1, + .requires_cap = false, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, basic_pty_restricted_withcap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = true, + .legacy_tiocsti = 0, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, basic_pty_restricted_nocap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = true, + .legacy_tiocsti = 0, + .requires_cap = false, + .expected_success = -EIO, /* FAILURE: legacy restriction */ +}; /* clang-format on */ + +/* + * Note for FD Passing Test Variants + * Since we're testing the scenario where an unprivileged process pass an FD + * to a privileged one, .requires_cap here means the caps of the child process. + * Not the parent; parent would always be privileged. 
+ */ + +/* clang-format off */ +FIXTURE_VARIANT_ADD(tiocsti, fdpass_pty_permissive_withcap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = true, + .legacy_tiocsti = 1, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_pty_permissive_nocap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = true, + .legacy_tiocsti = 1, + .requires_cap = false, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_pty_restricted_withcap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = true, + .legacy_tiocsti = 0, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_pty_restricted_nocap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = true, + .legacy_tiocsti = 0, + .requires_cap = false, + .expected_success = -EIO, +}; /* clang-format on */ + +/* + * Non-Controlling Terminal Variants (current->signal->tty != tty) + * + * TIOCSTI Test Matrix: + * + * | legacy_tiocsti | CAP_SYS_ADMIN | Expected Result | Error | + * |----------------|---------------|-----------------|-------| + * | 1 (permissive) | true | SUCCESS | - | + * | 1 (permissive) | false | FAILURE | -EPERM| + * | 0 (restricted) | true | SUCCESS | - | + * | 0 (restricted) | false | FAILURE | -EIO | + */ + +/* clang-format off */ +FIXTURE_VARIANT_ADD(tiocsti, basic_nopty_permissive_withcap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = false, + .legacy_tiocsti = 1, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, basic_nopty_permissive_nocap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = false, + .legacy_tiocsti = 1, + .requires_cap = false, + .expected_success = -EPERM, +}; + +FIXTURE_VARIANT_ADD(tiocsti, basic_nopty_restricted_withcap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = false, + .legacy_tiocsti = 0, + .requires_cap = true, + .expected_success = 0, +}; + 
+FIXTURE_VARIANT_ADD(tiocsti, basic_nopty_restricted_nocap) { + .test_type = TEST_PTY_TIOCSTI_BASIC, + .controlling_tty = false, + .legacy_tiocsti = 0, + .requires_cap = false, + .expected_success = -EIO, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_nopty_permissive_withcap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = false, + .legacy_tiocsti = 1, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_nopty_permissive_nocap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = false, + .legacy_tiocsti = 1, + .requires_cap = false, + .expected_success = -EPERM, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_nopty_restricted_withcap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = false, + .legacy_tiocsti = 0, + .requires_cap = true, + .expected_success = 0, +}; + +FIXTURE_VARIANT_ADD(tiocsti, fdpass_nopty_restricted_nocap) { + .test_type = TEST_PTY_TIOCSTI_FD_PASSING, + .controlling_tty = false, + .legacy_tiocsti = 0, + .requires_cap = false, + .expected_success = -EIO, +}; /* clang-format on */ + +/* Helper function to send FD via SCM_RIGHTS */ +static int send_fd_via_socket(int socket_fd, int fd_to_send) +{ + struct msghdr msg = { 0 }; + struct cmsghdr *cmsg; + char cmsg_buf[CMSG_SPACE(sizeof(int))]; + char dummy_data = 'F'; + struct iovec iov = { .iov_base = &dummy_data, .iov_len = 1 }; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = cmsg_buf; + msg.msg_controllen = sizeof(cmsg_buf); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + cmsg->cmsg_len = CMSG_LEN(sizeof(int)); + + memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int)); + + return sendmsg(socket_fd, &msg, 0) < 0 ? 
-1 : 0; +} + +/* Helper function to receive FD via SCM_RIGHTS */ +static int recv_fd_via_socket(int socket_fd) +{ + struct msghdr msg = { 0 }; + struct cmsghdr *cmsg; + char cmsg_buf[CMSG_SPACE(sizeof(int))]; + char dummy_data; + struct iovec iov = { .iov_base = &dummy_data, .iov_len = 1 }; + int received_fd = -1; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = cmsg_buf; + msg.msg_controllen = sizeof(cmsg_buf); + + if (recvmsg(socket_fd, &msg, 0) < 0) + return -1; + + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_RIGHTS) { + memcpy(&received_fd, CMSG_DATA(cmsg), sizeof(int)); + break; + } + } + + return received_fd; +} + +static inline bool has_cap_sys_admin(void) +{ + cap_t caps = cap_get_proc(); + + if (!caps) + return false; + + cap_flag_value_t cap_val; + bool has_cap = (cap_get_flag(caps, CAP_SYS_ADMIN, CAP_EFFECTIVE, + &cap_val) == 0) && + (cap_val == CAP_SET); + + cap_free(caps); + return has_cap; +} + +/* + * Drop to nobody user (uid/gid 65534) to lose all capabilities + */ +static inline bool drop_to_nobody(struct __test_metadata *_metadata) +{ + ASSERT_EQ(setgroups(0, NULL), 0); + ASSERT_EQ(setgid(65534), 0); + ASSERT_EQ(setuid(65534), 0); + + ASSERT_FALSE(has_cap_sys_admin()); + return true; +} + +static inline int get_legacy_tiocsti_setting(struct __test_metadata *_metadata) +{ + FILE *fp; + int value = -1; + + fp = fopen("/proc/sys/dev/tty/legacy_tiocsti", "r"); + if (!fp) { + /* legacy_tiocsti sysctl not available (kernel < 6.2) */ + return -1; + } + + if (fscanf(fp, "%d", &value) == 1) { + if (value < 0 || value > 1) + value = -1; /* Invalid value */ + } else { + value = -1; /* Failed to parse */ + } + + fclose(fp); + return value; +} + +static inline bool set_legacy_tiocsti_setting(struct __test_metadata *_metadata, + int value) +{ + FILE *fp; + bool success = false; + + /* Sanity-check the value */ + ASSERT_GE(value, 0); + ASSERT_LE(value, 
1); + + /* + * Try to open for writing; if we lack permission, return false so + * the test harness will skip variants that need to change it + */ + fp = fopen("/proc/sys/dev/tty/legacy_tiocsti", "w"); + if (!fp) + return false; + + /* Write the new setting */ + if (fprintf(fp, "%d\n", value) > 0) + success = true; + else + TH_LOG("Failed to write legacy_tiocsti: %s", strerror(errno)); + + fclose(fp); + return success; +} + +/* + * TIOCSTI injection test function + * @tty_fd: TTY slave file descriptor to test TIOCSTI on + * Returns: 0 on success, -errno on failure + */ +static inline int test_tiocsti_injection(struct __test_metadata *_metadata, + int tty_fd) +{ + int ret; + char inject_char = 'V'; + + errno = 0; + ret = ioctl(tty_fd, TIOCSTI, &inject_char); + return ret == 0 ? 0 : -errno; +} + +FIXTURE_SETUP(tiocsti) +{ + /* Create PTY pair for basic tests */ + self->has_pty = (openpty(&self->pty_master_fd, &self->pty_slave_fd, + NULL, NULL, NULL) == 0); + if (!self->has_pty) { + self->pty_master_fd = -1; + self->pty_slave_fd = -1; + } + + self->initial_cap_sys_admin = has_cap_sys_admin(); + self->original_legacy_tiocsti_setting = + get_legacy_tiocsti_setting(_metadata); + + if (self->original_legacy_tiocsti_setting < 0) + SKIP(return, "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + /* Test if we can modify the sysctl (requires appropriate privileges) */ + self->can_modify_sysctl = set_legacy_tiocsti_setting(_metadata, + self->original_legacy_tiocsti_setting); + if (!self->can_modify_sysctl) + TH_LOG("Warning: Cannot modify legacy_tiocsti sysctl - will skip mismatched variants"); +} + +FIXTURE_TEARDOWN(tiocsti) +{ + /* + * Backup restoration - + * each test should restore its own sysctl changes + */ + if (self->can_modify_sysctl && + self->original_legacy_tiocsti_setting >= 0) { + int current_value = get_legacy_tiocsti_setting(_metadata); + + if (current_value != self->original_legacy_tiocsti_setting) { + TH_LOG("Backup: Restoring legacy_tiocsti from %d 
to %d", + current_value, + self->original_legacy_tiocsti_setting); + set_legacy_tiocsti_setting(_metadata, + self->original_legacy_tiocsti_setting); + } + } + + if (self->has_pty) { + if (self->pty_master_fd >= 0) + close(self->pty_master_fd); + if (self->pty_slave_fd >= 0) + close(self->pty_slave_fd); + } +} + +TEST_F(tiocsti, test) +{ + int saved_legacy_tiocsti = get_legacy_tiocsti_setting(_metadata); + bool need_restore = false; + int status; + pid_t child_pid; + + /* Set legacy_tiocsti sysctl to match variant requirement */ + if (self->can_modify_sysctl) { + if (saved_legacy_tiocsti != variant->legacy_tiocsti) { + if (!set_legacy_tiocsti_setting(_metadata, + variant->legacy_tiocsti)) { + SKIP(return, + "Failed to set legacy_tiocsti sysctl"); + } + need_restore = true; + } + } else { + /* + * Can't modify sysctl + * - check if current value matches variant + */ + if (self->original_legacy_tiocsti_setting != + variant->legacy_tiocsti) { + SKIP(return, + "legacy_tiocsti setting mismatch and cannot modify sysctl"); + } + } + + /* Common skip conditions */ + if (variant->test_type == TEST_PTY_TIOCSTI_BASIC && !self->has_pty) { + SKIP(goto restore_sysctl, + "PTY not available for controlling terminal test"); + } + + if (variant->test_type == TEST_PTY_TIOCSTI_FD_PASSING && + !self->initial_cap_sys_admin) { + SKIP(goto restore_sysctl, + "FD Pass tests require CAP_SYS_ADMIN"); + } + + if (variant->requires_cap && !self->initial_cap_sys_admin) { + SKIP(goto restore_sysctl, + "Test requires initial CAP_SYS_ADMIN"); + } + + if (variant->test_type == TEST_PTY_TIOCSTI_BASIC) { + /* ===== BASIC TIOCSTI TEST ===== */ + child_pid = fork(); + ASSERT_GE(child_pid, 0); + + if (child_pid == 0) { + /* Child process - perform the actual test */ + + /* Handle capability requirements */ + if (self->initial_cap_sys_admin && + !variant->requires_cap) + ASSERT_TRUE(drop_to_nobody(_metadata)); + + if (variant->controlling_tty) { + /* + * Create new session and set PTY as + * controlling 
terminal + */ + pid_t sid = setsid(); + + ASSERT_GE(sid, 0); + ASSERT_EQ(ioctl(self->pty_slave_fd, TIOCSCTTY, + 0), + 0); + } + + /* + * Validate test environment setup and verify final + * capability state matches expectation + * after potential drop. + * + */ + ASSERT_TRUE(self->has_pty); + ASSERT_EQ(has_cap_sys_admin(), variant->requires_cap); + + /* Test TIOCSTI and validate result */ + int result = test_tiocsti_injection(_metadata, + self->pty_slave_fd); + + /* Check against expected result from variant */ + EXPECT_EQ(result, variant->expected_success); + _exit(0); + } + + } else { + /* ===== FD PASSING SECURITY TEST ===== */ + int sockpair[2]; + + ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair), 0); + + child_pid = fork(); + ASSERT_GE(child_pid, 0); + + if (child_pid == 0) { + /* Child process - create PTY and send FD */ + close(sockpair[0]); + signal(SIGHUP, SIG_IGN); + + /* Handle privilege dropping */ + if (!variant->requires_cap && has_cap_sys_admin()) + ASSERT_TRUE(drop_to_nobody(_metadata)); + + /* Create child's PTY */ + int child_master_fd, child_slave_fd; + + ASSERT_EQ(openpty(&child_master_fd, &child_slave_fd, + NULL, NULL, NULL), + 0); + + if (variant->controlling_tty) { + pid_t sid = setsid(); + + ASSERT_GE(sid, 0); + ASSERT_EQ(ioctl(child_slave_fd, TIOCSCTTY, 0), + 0); + } + + /* Test child's direct TIOCSTI for reference */ + int direct_result = test_tiocsti_injection(_metadata, + child_slave_fd); + EXPECT_EQ(direct_result, variant->expected_success); + + /* Send FD to parent */ + ASSERT_EQ(send_fd_via_socket(sockpair[1], + child_slave_fd), + 0); + + /* Wait for parent completion signal */ + char sync_byte; + ssize_t bytes_read = read(sockpair[1], &sync_byte, 1); + + ASSERT_EQ(bytes_read, 1); + + close(child_master_fd); + close(child_slave_fd); + close(sockpair[1]); + _exit(0); + } + + /* Parent process - receive FD and test TIOCSTI */ + close(sockpair[1]); + + int received_fd = recv_fd_via_socket(sockpair[0]); + + ASSERT_GE(received_fd, 
0); + + bool parent_has_cap = self->initial_cap_sys_admin; + + TH_LOG("=== TIOCSTI FD Passing Test Context ==="); + TH_LOG("legacy_tiocsti: %d, Parent CAP_SYS_ADMIN: %s, Child: %s", + variant->legacy_tiocsti, parent_has_cap ? "yes" : "no", + variant->requires_cap ? "kept" : "dropped"); + + /* SECURITY TEST: Try TIOCSTI with FD opened by child */ + int result = test_tiocsti_injection(_metadata, received_fd); + + /* Log security concern if demonstrated */ + if (result == 0 && !variant->requires_cap) { + TH_LOG("*** SECURITY CONCERN DEMONSTRATED ***"); + TH_LOG("Privileged parent can use TIOCSTI on FD from unprivileged child"); + TH_LOG("This shows current process credentials are used, not opener credentials"); + } + + EXPECT_EQ(result, variant->expected_success) + { + TH_LOG("FD passing: expected error %d, got %d", + variant->expected_success, result); + } + + /* Signal child completion */ + char sync_byte = 'D'; + ssize_t bytes_written = write(sockpair[0], &sync_byte, 1); + + ASSERT_EQ(bytes_written, 1); + + close(received_fd); + close(sockpair[0]); + } + + /* Common child process cleanup for both test types */ + ASSERT_EQ(waitpid(child_pid, &status, 0), child_pid); + + if (WIFSIGNALED(status)) { + TH_LOG("Child terminated by signal %d", WTERMSIG(status)); + ASSERT_FALSE(WIFSIGNALED(status)) + { + TH_LOG("Child process failed assertion"); + } + } else { + EXPECT_EQ(WEXITSTATUS(status), 0); + } + +restore_sysctl: + if (need_restore) + set_legacy_tiocsti_setting(_metadata, saved_legacy_tiocsti); +} + +TEST_HARNESS_MAIN --- base-commit: 283564a43383d6f26a55546fe9ae345b5fa95e66 change-id: 20250618-toicsti-bug-7822b8e94a32 Best regards, -- Abhinav Saxena <xandfury(a)gmail.com>
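The two checks quoted in the test's strategy comment can be restated as a small pure function, which makes the expected-result matrices easier to verify by hand. This is a userspace sketch under that assumption, not the kernel's tiocsti() implementation; the function name and parameters are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Sketch of the decision logic described in the test comment:
 *   1. sysctl restricted and caller lacks CAP_SYS_ADMIN     -> -EIO
 *   2. tty is not the caller's controlling terminal and the
 *      caller lacks CAP_SYS_ADMIN                           -> -EPERM
 *   3. otherwise                                            -> 0
 * Both checks look at the calling process, not the FD opener.
 */
static int tiocsti_expected_result(bool legacy_tiocsti,
				   bool is_controlling_tty,
				   bool has_cap_sys_admin)
{
	if (!legacy_tiocsti && !has_cap_sys_admin)
		return -EIO;

	if (!is_controlling_tty && !has_cap_sys_admin)
		return -EPERM;

	return 0;
}
```

Because the -EIO check comes first, the restricted, no-capability case returns -EIO regardless of the controlling terminal, which matches the last row of both matrices in the patch.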

1 week, 5 days


[PATCH -next] selftests/sched_ext: Remove duplicate sched.h header

by Jiapeng Chong

./tools/testing/selftests/sched_ext/hotplug.c: sched.h is included more than once.

Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22941
Signed-off-by: Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
---
 tools/testing/selftests/sched_ext/hotplug.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/hotplug.c b/tools/testing/selftests/sched_ext/hotplug.c
index 1c9ceb661c43..0cfbb111a2d0 100644
--- a/tools/testing/selftests/sched_ext/hotplug.c
+++ b/tools/testing/selftests/sched_ext/hotplug.c
@@ -6,7 +6,6 @@
 #include <bpf/bpf.h>
 #include <sched.h>
 #include <scx/common.h>
-#include <sched.h>
 #include <sys/wait.h>
 #include <unistd.h>
--
2.43.5

1 week, 6 days


[PATCH V9 0/7] Add NUMA mempolicy support for KVM guest-memfd

by Shivank Garg

This series introduces NUMA-aware memory placement support for KVM guests
with guest_memfd memory backends. It builds upon Fuad Tabba's work that
enabled host-mapping for guest_memfd memory [1].

== Background ==

KVM's guest-memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed across host
nodes according to the kernel's default behavior, irrespective of any policy
specified by the VMM. This limitation arises because conventional userspace
NUMA control mechanisms like mbind(2) don't work since the memory isn't
directly mapped to userspace when allocations occur. Fuad's work [1]
provides the necessary mmap capability, and this series leverages it to
enable mbind(2).

== Implementation ==

This series implements proper NUMA policy support for guest-memfd by:

1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
   kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA policy.

With these changes, VMMs can now control guest memory placement by mapping
the guest_memfd file descriptor and using mbind(2) to specify:

- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation

These policies affect only future allocations and do not migrate existing
memory. This matches mbind(2)'s default behavior, which affects only new
allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags
(not supported for guest_memfd, as it is unmovable by design).

== Upstream Plan ==

Phased approach as per David's guest_memfd extension overview [2] and
community calls [3]:

Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work.

Phase 2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [4].

This series provides a clean integration path for NUMA-aware memory
management for guest_memfd and lays the groundwork for future confidential
computing NUMA capabilities.

Please review and provide feedback!

Thanks,
Shivank

== Changelog ==

- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
  suggestions from David [5] and guest_memfd bi-weekly upstream call
  discussion [6]).
- v7: Use inodes to store NUMA policy instead of file [7].
- v8: Rebase on top of Fuad's V12: Host mmapping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments

[1] https://lore.kernel.org/all/20250709105946.4009897-1-tabba@google.com
[2] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[5] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[6] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[7] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com

Ackerley Tng (1):
  KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes

Matthew Wilcox (Oracle) (2):
  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
  mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies

Shivank Garg (4):
  mm/mempolicy: Export memory policy symbols
  KVM: guest_memfd: Add slab-allocated inode cache
  KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
  KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support

 fs/bcachefs/fs-io-buffered.c                 |   2 +-
 fs/btrfs/compression.c                       |   4 +-
 fs/btrfs/verity.c                            |   2 +-
 fs/erofs/zdata.c                             |   2 +-
 fs/f2fs/compress.c                           |   2 +-
 include/linux/pagemap.h                      |  18 +-
 include/uapi/linux/magic.h                   |   1 +
 mm/filemap.c                                 |  23 +-
 mm/mempolicy.c                               |   6 +
 mm/readahead.c                               |   2 +-
 tools/testing/selftests/kvm/Makefile.kvm     |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c | 122 ++++++++-
 virt/kvm/guest_memfd.c                       | 255 ++++++++++++++++--
 virt/kvm/kvm_main.c                          |   7 +-
 virt/kvm/kvm_mm.h                            |  10 +-
 15 files changed, 408 insertions(+), 49 deletions(-)

-- 
2.43.0

---
== Earlier Postings ==
v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
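
For readers unfamiliar with the userspace side, the flow this cover letter describes (map the memory, then call mbind(2) with a policy mode and a node list before pages are faulted in) could look roughly like the sketch below. This is only an illustration under stated assumptions: an anonymous mapping stands in for the guest_memfd mapping, since creating a real one needs a KVM VM file descriptor, and every demo_* name is made up here and is not part of the series. Whether mbind(2) succeeds depends on the host (NUMA support, sandbox restrictions), so the wrapper just reports the result.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/mempolicy.h>	/* MPOL_BIND, MPOL_INTERLEAVE, ... */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper type: a fixed-size node bitmap (not a kernel API). */
#define DEMO_MAX_NODES 64
#define DEMO_BITS_PER_LONG (8 * sizeof(unsigned long))
#define DEMO_MASK_WORDS \
	((DEMO_MAX_NODES + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

struct demo_nodemask {
	unsigned long bits[DEMO_MASK_WORDS];
};

/* Mark one host NUMA node as a target in the bitmap. */
static void demo_nodemask_set(struct demo_nodemask *m, unsigned int node)
{
	assert(node < DEMO_MAX_NODES);
	m->bits[node / DEMO_BITS_PER_LONG] |= 1UL << (node % DEMO_BITS_PER_LONG);
}

/* Thin wrapper around the raw mbind(2) syscall: 0 on success, -1 + errno. */
static long demo_mbind(void *addr, size_t len, int mode,
		       const struct demo_nodemask *m)
{
	/*
	 * Linux decrements maxnode before reading the mask, so pass one
	 * more than the number of bits that should be examined.
	 */
	return syscall(SYS_mbind, addr, (unsigned long)len, (long)mode,
		       m->bits, (unsigned long)DEMO_MAX_NODES + 1, 0UL);
}

/* Request interleaving over node 0 only, then fault the pages in. */
static long demo_interleave_on_node0(void)
{
	size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);
	struct demo_nodemask mask = { { 0 } };
	int saved_errno;
	long rc;
	void *p;

	/* Assumption: stand-in for mapping the guest_memfd file descriptor. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	demo_nodemask_set(&mask, 0);
	rc = demo_mbind(p, len, MPOL_INTERLEAVE, &mask);
	saved_errno = errno;

	/* The policy only affects future allocations: touch pages afterwards. */
	memset(p, 0, len);

	munmap(p, len);
	errno = saved_errno;
	return rc;
}
```

A caller would check the return value of demo_interleave_on_node0() and inspect errno on failure. Swapping MPOL_INTERLEAVE for MPOL_BIND or MPOL_PREFERRED selects the other policy modes listed above.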

1 week, 6 days

6
23

[PATCH v3 00/13] stackleak: Support Clang stack depth tracking

by Kees Cook

v3:
- split up and drop __init vs inline patches that went via arch trees
- apply feedback about preferring __init to __always_inline
- incorporate Ritesh Harjani's patch for __init cleanups in powerpc
- wider build testing on older compilers
v2: https://lore.kernel.org/lkml/20250523043251.it.550-kees@kernel.org/
v1: https://lore.kernel.org/lkml/20250507180852.work.231-kees@kernel.org/

Hi,

As part of looking at what GCC plugins could be replaced with Clang
implementations, this series uses the recently landed stack depth tracking
callback in Clang[1] to implement the stackleak feature. Since the Clang
feature is now landed, I'm moving this out of RFC to a v1.

Since this touches a lot of arch-specific Makefiles, I tried to trim the CC
list down to just mailing lists in those cases, otherwise the CC was giant.

Thanks!

-Kees

[1] https://clang.llvm.org/docs/SanitizerCoverage.html#tracing-stack-depth

Kees Cook (12):
  stackleak: Rename STACKLEAK to KSTACK_ERASE
  stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth
  stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS
  x86: Handle KCOV __init vs inline mismatches
  arm: Handle KCOV __init vs inline mismatches
  arm64: Handle KCOV __init vs inline mismatches
  s390: Handle KCOV __init vs inline mismatches
  mips: Handle KCOV __init vs inline mismatch
  init.h: Disable sanitizer coverage for __init and __head
  kstack_erase: Support Clang stack depth tracking
  configs/hardening: Enable CONFIG_KSTACK_ERASE
  configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON

Ritesh Harjani (IBM) (1):
  powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to
    __init section

 arch/Kconfig                                  |  4 +-
 arch/arm/Kconfig                              |  2 +-
 arch/arm64/Kconfig                            |  2 +-
 arch/riscv/Kconfig                            |  2 +-
 arch/s390/Kconfig                             |  2 +-
 arch/x86/Kconfig                              |  2 +-
 security/Kconfig.hardening                    | 45 +++++++++-------
 Makefile                                      |  1 +
 arch/arm/boot/compressed/Makefile             |  2 +-
 arch/arm/vdso/Makefile                        |  2 +-
 arch/arm64/kernel/pi/Makefile                 |  2 +-
 arch/arm64/kernel/vdso/Makefile               |  3 +-
 arch/arm64/kvm/hyp/nvhe/Makefile              |  2 +-
 arch/riscv/kernel/pi/Makefile                 |  2 +-
 arch/riscv/purgatory/Makefile                 |  2 +-
 arch/sparc/vdso/Makefile                      |  3 +-
 arch/x86/entry/vdso/Makefile                  |  3 +-
 arch/x86/purgatory/Makefile                   |  2 +-
 drivers/firmware/efi/libstub/Makefile         |  8 +--
 drivers/misc/lkdtm/Makefile                   |  2 +-
 kernel/Makefile                               | 10 ++--
 lib/Makefile                                  |  2 +-
 scripts/Makefile.gcc-plugins                  | 16 +-----
 scripts/Makefile.kstack_erase                 | 21 ++++++++
 scripts/gcc-plugins/stackleak_plugin.c        | 52 +++++++++----------
 Documentation/admin-guide/sysctl/kernel.rst   |  4 +-
 Documentation/arch/x86/x86_64/mm.rst          |  2 +-
 Documentation/security/self-protection.rst    |  2 +-
 .../zh_CN/security/self-protection.rst        |  2 +-
 arch/arm64/include/asm/acpi.h                 |  2 +-
 arch/mips/include/asm/time.h                  |  2 +-
 arch/s390/hypfs/hypfs.h                       |  2 +-
 arch/s390/hypfs/hypfs_diag.h                  |  2 +-
 arch/x86/entry/calling.h                      |  4 +-
 arch/x86/include/asm/acpi.h                   |  4 +-
 arch/x86/include/asm/init.h                   |  2 +-
 arch/x86/include/asm/realmode.h               |  2 +-
 include/linux/acpi.h                          |  4 +-
 include/linux/bootconfig.h                    |  2 +-
 include/linux/efi.h                           |  2 +-
 include/linux/init.h                          |  4 +-
 include/linux/{stackleak.h => kstack_erase.h} | 20 +++----
 include/linux/memblock.h                      |  2 +-
 include/linux/mfd/dbx500-prcmu.h              |  2 +-
 include/linux/sched.h                         |  4 +-
 include/linux/smp.h                           |  2 +-
 arch/arm/kernel/entry-common.S                |  2 +-
 arch/arm64/kernel/entry.S                     |  2 +-
 arch/riscv/kernel/entry.S                     |  2 +-
 arch/s390/kernel/entry.S                      |  2 +-
 arch/arm/mm/cache-feroceon-l2.c               |  2 +-
 arch/arm/mm/cache-tauros2.c                   |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c         |  6 +--
 arch/powerpc/mm/book3s64/radix_pgtable.c      |  4 +-
 arch/s390/mm/init.c                           |  2 +-
 arch/x86/kernel/kvm.c                         |  2 +-
 arch/x86/mm/init_64.c                         |  2 +-
 drivers/clocksource/timer-orion.c             |  2 +-
 .../lkdtm/{stackleak.c => kstack_erase.c}     | 26 +++++-----
 drivers/soc/ti/pm33xx.c                       |  2 +-
 fs/proc/base.c                                |  6 +--
 kernel/fork.c                                 |  2 +-
 kernel/kexec_handover.c                       |  4 +-
 kernel/{stackleak.c => kstack_erase.c}        | 22 ++++----
 tools/objtool/check.c                         |  4 +-
 tools/testing/selftests/lkdtm/config          |  2 +-
 MAINTAINERS                                   |  6 ++-
 kernel/configs/hardening.config               |  6 +++
 68 files changed, 204 insertions(+), 172 deletions(-)
 create mode 100644 scripts/Makefile.kstack_erase
 rename include/linux/{stackleak.h => kstack_erase.h} (81%)
 rename drivers/misc/lkdtm/{stackleak.c => kstack_erase.c} (89%)
 rename kernel/{stackleak.c => kstack_erase.c} (87%)

-- 
2.34.1
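
To give a rough feel for what "stack depth tracking" means here, below is a small userspace sketch of the tracking half of the idea: a callback invoked on function entry that remembers the lowest stack address seen so far. It is not the kernel code. In the series the call is inserted by the compiler and the callback is named __sanitizer_cov_stack_depth (per the patch titles above), whereas this sketch calls a hypothetical demo_track_stack() by hand. All demo_* names are invented for illustration, and the sketch assumes a downward-growing stack and the GCC/Clang __builtin_frame_address() builtin.

```c
#include <stdint.h>

/* Lowest stack address observed so far (stack assumed to grow down). */
static uintptr_t demo_lowest_sp = UINTPTR_MAX;

/* Forget any previously recorded depth. */
static void demo_reset(void)
{
	demo_lowest_sp = UINTPTR_MAX;
}

/*
 * Hypothetical stand-in for the compiler-inserted callback: record the
 * current frame address if it is lower than anything seen before.
 */
__attribute__((noinline)) static void demo_track_stack(void)
{
	uintptr_t sp = (uintptr_t)__builtin_frame_address(0);

	if (sp < demo_lowest_sp)
		demo_lowest_sp = sp;
}

/*
 * A function with a sizeable frame. With real instrumentation the
 * compiler would emit the tracking call; here it is written out.
 */
__attribute__((noinline)) static int demo_deep(int depth)
{
	volatile char buf[256];

	demo_track_stack();
	buf[0] = (char)depth;
	if (depth > 0)
		return demo_deep(depth - 1) + buf[0];
	return buf[0];
}

/* Bytes of stack used below a reference address taken in the caller. */
static uintptr_t demo_stack_used(uintptr_t top)
{
	return top - demo_lowest_sp;
}
```

A caller would take the address of one of its own locals as the reference point, run the workload, and then pass that address to demo_stack_used() to learn how far the stack grew. An erasing scheme can use the same recorded low-water mark to bound the region it needs to wipe.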

2 weeks

9
29
