Prior to commit 9245fd6b8531 ("KVM: x86: model canonical checks more
precisely"), KVM_SET_NESTED_STATE would fail if the state was captured
with L2 active, L1 had CR4.LA57 set, L2 did not, and the
VMCS12.HOST_GSBASE field (or another host-state field checked for
canonicality) held an address wider than 48 bits.
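For reference, a minimal sketch (not code from this series) of the
canonicality rule at stake: an address is canonical when bits [63:N]
sign-extend bit N-1, where N is 48 with 4-level paging and 57 when
CR4.LA57 is set, so a value can be canonical for an LA57 L1 while
failing the 48-bit check.

  #include <stdbool.h>
  #include <stdint.h>

  static inline bool is_canonical(uint64_t va, unsigned int vaddr_bits)
  {
          /* Shift the sign bit (bit vaddr_bits - 1) up to bit 63, then
           * arithmetic-shift back; the value survives unchanged iff bits
           * [63:vaddr_bits] already sign-extend it. */
          return ((int64_t)(va << (64 - vaddr_bits)) >> (64 - vaddr_bits)) ==
                 (int64_t)va;
  }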
Add a regression test that reproduces the KVM_SET_NESTED_STATE failure
conditions. To do so, the first three patches add support for 5-level
paging in the selftest L1 VM.
v1 -> v2:
- Ended the page walking loops before visiting 4K mappings [Yosry]
- Changed VM_MODE_PXXV48_4K into VM_MODE_PXXVYY_4K; use 5-level paging
  when possible [Sean]
- Removed the check for non-NULL vmx_pages in guest_code() [Yosry]
Jim Mattson (4):
KVM: selftests: Use a loop to create guest page tables
KVM: selftests: Use a loop to walk guest page tables
KVM: selftests: Change VM_MODE_PXXV48_4K to VM_MODE_PXXVYY_4K
KVM: selftests: Add a VMX test for LA57 nested state
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 4 +-
.../selftests/kvm/include/x86/processor.h | 2 +-
.../selftests/kvm/lib/arm64/processor.c | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 30 ++--
.../testing/selftests/kvm/lib/x86/processor.c | 80 +++++------
tools/testing/selftests/kvm/lib/x86/vmx.c | 6 +-
.../kvm/x86/vmx_la57_nested_state_test.c | 134 ++++++++++++++++++
8 files changed, 197 insertions(+), 62 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/vmx_la57_nested_state_test.c
--
2.51.1.851.g4ebd6896fd-goog
Currently it is not possible to disable streaming mode via ptrace on
SME-only systems: the interface for doing this is to write via
NT_ARM_SVE, but such writes are rejected on a system without SVE
support. Enable this functionality by allowing userspace to write
SVE_PT_REGS_FPSIMD format data via NT_ARM_SVE with the vector length
set to 0 on SME-only systems. Such writes currently fail since we
require that a vector length be specified, which should minimise the
risk that existing software is relying on the current behaviour.
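For illustration, a rough sketch of such a write (types and constants
from the arm64 ptrace UAPI; error handling omitted, and the vl == 0
acceptance of course only applies with this series):

  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <linux/elf.h>          /* NT_ARM_SVE */
  #include <asm/ptrace.h>         /* user_sve_header, SVE_PT_REGS_FPSIMD */

  static long exit_streaming_mode(pid_t pid,
                                  const struct user_fpsimd_state *fp)
  {
          struct {
                  struct user_sve_header hdr;
                  struct user_fpsimd_state fpsimd;
          } buf = {
                  .hdr = {
                          .size  = sizeof(buf),
                          .flags = SVE_PT_REGS_FPSIMD,
                          .vl    = 0,   /* accepted on SME-only systems */
                  },
                  .fpsimd = *fp,
          };
          struct iovec iov = { .iov_base = &buf, .iov_len = sizeof(buf) };

          return ptrace(PTRACE_SETREGSET, pid, NT_ARM_SVE, &iov);
  }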
Reads are not supported since I am not aware of any use case for this,
and there is some risk that an existing userspace application may be
confused if it reads NT_ARM_SVE on a system without SVE. Existing
kernels will return FPSIMD-formatted register state from NT_ARM_SVE if
full SVE state is not stored, for example if the task has not used SVE.
Returning a vector length of 0 would create a risk that software could
try to do things like allocate zero-sized buffers for register state,
while returning a vector length of 128 bits would look like SVE is
supported. It seems safer to just not add read support.
It remains possible for userspace to detect an SME-only system via the
ptrace interface alone, since reads of NT_ARM_SSVE and NT_ARM_ZA will
succeed while reads of NT_ARM_SVE will fail. Read/write access to the
FPSIMD registers in non-streaming mode is available via REGSET_FPR.
The aim is to make a minimally invasive change: no operation that would
previously have succeeded is affected, and we use a previously defined
interface in new circumstances rather than defining completely new ABI.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v2:
- Rebase onto v6.18-rc1
- Link to v1: https://lore.kernel.org/r/20250820-arm64-sme-ptrace-sme-only-v1-0-f7c22b287…
---
Mark Brown (3):
arm64/sme: Support disabling streaming mode via ptrace on SME only systems
kselftest/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
Documentation/arch/arm64/sve.rst | 5 +++
arch/arm64/kernel/ptrace.c | 40 +++++++++++++++---
tools/testing/selftests/arm64/fp/fp-ptrace.c | 5 +--
tools/testing/selftests/arm64/fp/sve-ptrace.c | 61 +++++++++++++++++++++++++++
4 files changed, 100 insertions(+), 11 deletions(-)
---
base-commit: cb6649f6217c0331b885cf787f1d175963e2a1d2
change-id: 20250717-arm64-sme-ptrace-sme-only-1fb850600ea0
Best regards,
--
Mark Brown <broonie@kernel.org>
This patch series introduces LANDLOCK_SCOPE_MEMFD_EXEC, a new Landlock
scoping mechanism that restricts execution of anonymous memory file
descriptors (memfd) created via memfd_create(2). This addresses security
gaps where processes can bypass W^X policies and execute arbitrary code
through anonymous memory objects.
Fixes: https://github.com/landlock-lsm/linux/issues/37
SECURITY PROBLEM
================
Current Landlock filesystem restrictions do not cover memfd objects,
allowing processes to:
1. Read-to-execute bypass: Create writable memfd, inject code,
then execute via mmap(PROT_EXEC) or direct execve()
2. Anonymous execution: Execute code without touching the filesystem via
execve("/proc/self/fd/N") where N is a memfd descriptor
3. Cross-domain access violations: Pass memfd between processes to
bypass domain restrictions
These scenarios can occur in sandboxed environments where filesystem
access is restricted but memfd creation remains possible; a brief
illustration follows.
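For illustration, a hedged sketch of scenarios 1 and 2 (error handling
omitted; elf_image, argv and envp are placeholders):

  #include <sys/mman.h>
  #include <unistd.h>

  static int run_anonymous_payload(const void *elf_image, size_t elf_size,
                                   char *const argv[], char *const envp[])
  {
          int fd = memfd_create("payload", MFD_CLOEXEC);

          write(fd, elf_image, elf_size); /* inject code into anon memory */
          return fexecve(fd, argv, envp); /* execute without touching the
                                           * filesystem; equivalent to
                                           * execve("/proc/self/fd/N") */
  }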
IMPLEMENTATION
==============
The implementation adds hierarchical execution control through domain
scoping:
Core Components:
- is_memfd_file(): Reliable memfd detection via "memfd:" dentry prefix
  (sketched below)
- domain_is_scoped(): Cross-domain hierarchy checking (moved to domain.c)
- LSM hooks: mmap_file, file_mprotect, bprm_creds_for_exec
- Creation-time restrictions: hook_file_alloc_security
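A minimal sketch of the detection idea (memfd_create() names its files
with a "memfd:" prefix); this is an approximation, not the exact helper
from the series:

  #include <linux/fs.h>
  #include <linux/string.h>

  static bool is_memfd_file(const struct file *file)
  {
          /* memfd inodes have no path of their own; the synthetic dentry
           * name is "memfd:<name from memfd_create()>". */
          return file &&
                 !strncmp(file->f_path.dentry->d_name.name, "memfd:", 6);
  }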
Security Matrix:
Execution decisions follow domain hierarchy rules, preventing both
same-domain bypass attempts and cross-domain access violations while
preserving legitimate hierarchical access patterns.
Domain Hierarchy with LANDLOCK_SCOPE_MEMFD_EXEC:
================================================
Root (no domain) - No restrictions
|
+-- Domain A [SCOPE_MEMFD_EXEC]                      Layer 1
|   +-- memfd_A (tagged with Domain A as creator)
|   |
|   +-- Domain A1 (child) [NO SCOPE]                 Layer 2
|   |   +-- Inherits Layer 1 restrictions from parent
|   |   +-- memfd_A1 (can create, inherits restrictions)
|   |   +-- Domain A1a [SCOPE_MEMFD_EXEC]            Layer 3
|   |       +-- memfd_A1a (tagged with Domain A1a)
|   |
|   +-- Domain A2 (child) [SCOPE_MEMFD_EXEC]         Layer 2
|       +-- memfd_A2 (tagged with Domain A2 as creator)
|       +-- CANNOT access memfd_A1 (different subtree)
|
+-- Domain B [SCOPE_MEMFD_EXEC]                      Layer 1
    +-- memfd_B (tagged with Domain B as creator)
    +-- CANNOT access ANY memfd from Domain A subtree
Execution Decision Matrix:
==========================
Executor->  |  A  | A1  | A1a | A2  |  B  | Root
Creator     |     |     |     |     |     |
------------|-----|-----|-----|-----|-----|-----
Domain A    |  X  |  X  |  X  |  X  |  X  |  Y
Domain A1   |  Y  |  X  |  X  |  X  |  X  |  Y
Domain A1a  |  Y  |  Y  |  X  |  X  |  X  |  Y
Domain A2   |  Y  |  X  |  X  |  X  |  X  |  Y
Domain B    |  X  |  X  |  X  |  X  |  X  |  Y
Root        |  Y  |  Y  |  Y  |  Y  |  Y  |  Y
Legend: Y = Execution allowed, X = Execution denied
Scenarios Covered:
- Direct mmap(PROT_EXEC) on memfd files
- Two-stage mmap(PROT_READ) + mprotect(PROT_EXEC) bypass attempts
- execve("/proc/self/fd/N") anonymous execution
- execveat() and fexecve() file descriptor execution
- Cross-process memfd inheritance and IPC passing
TESTING
=======
All patches have been validated with:
- scripts/checkpatch.pl --strict (clean)
- Selftests covering same-domain restrictions, cross-domain
hierarchy enforcement, and regular file isolation
- KUnit tests for memfd detection edge cases
DISCLAIMER
==========
My understanding of Landlock scoping semantics may be limited, but this
implementation reflects my current understanding based on available
documentation and code analysis. I welcome feedback and corrections
regarding the scoping logic and domain hierarchy enforcement.
Signed-off-by: Abhinav Saxena <xandfury@gmail.com>
---
Abhinav Saxena (4):
landlock: add LANDLOCK_SCOPE_MEMFD_EXEC scope
landlock: implement memfd detection
landlock: add memfd exec LSM hooks and scoping
selftests/landlock: add memfd execution tests
include/uapi/linux/landlock.h | 5 +
security/landlock/.kunitconfig | 1 +
security/landlock/audit.c | 4 +
security/landlock/audit.h | 1 +
security/landlock/cred.c | 14 -
security/landlock/domain.c | 67 ++++
security/landlock/domain.h | 4 +
security/landlock/fs.c | 405 ++++++++++++++++++++-
security/landlock/limits.h | 2 +-
security/landlock/task.c | 67 ----
.../selftests/landlock/scoped_memfd_exec_test.c | 325 +++++++++++++++++
11 files changed, 812 insertions(+), 83 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250716-memfd-exec-ac0d582018c3
Best regards,
--
Abhinav Saxena <xandfury@gmail.com>
This patch series adds support for the Zalasr ISA extension, which
supplies native load-acquire/store-release instructions.
The specification can be found here:
https://github.com/riscv/riscv-zalasr/blob/main/chapter2.adoc
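For context, a minimal sketch of the load-acquire side (32-bit case
only; the mnemonic follows the draft spec and assembler support is
assumed here, whereas the series emits the encodings via insn-def.h):

  #define __smp_load_acquire_w(p)                                     \
  ({                                                                  \
          typeof(*(p)) ___ret;                                        \
          /* lw.aq: load with acquire semantics, no separate fence */ \
          __asm__ __volatile__("lw.aq %0, (%1)"                       \
                               : "=r"(___ret)                         \
                               : "r"(p)                               \
                               : "memory");                           \
          ___ret;                                                     \
  })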
This patch series has been tested with LTP on QEMU with Brensan's
Zalasr support patch [1].
checkpatch.pl reports some false-positive spacing errors on these
patches, so I have CCed the checkpatch.pl maintainers as well.
[1] https://lore.kernel.org/all/CAGPSXwJEdtqW=nx71oufZp64nK6tK=0rytVEcz4F-gfvCO…
v4:
- Apply acquire/release semantics to arch_atomic operations. Thanks
to Andrea.
v3:
- Apply acquire/release semantics to arch_xchg/arch_cmpxchg operations
so as to ensure FENCE.TSO ordering between operations which precede the
UNLOCK+LOCK sequence and operations which follow the sequence. Thanks
to Andrea.
- Support hwprobe of Zalasr.
- Allow Zalasr extensions for Guest/VM.
v2:
- Adjust the order of Zalasr and Zalrsc in dt-bindings. Thanks to
Conor.
Xu Lu (10):
riscv: Add ISA extension parsing for Zalasr
dt-bindings: riscv: Add Zalasr ISA extension description
riscv: hwprobe: Export Zalasr extension
riscv: Introduce Zalasr instructions
riscv: Apply Zalasr to smp_load_acquire/smp_store_release
riscv: Apply acquire/release semantics to arch_xchg/arch_cmpxchg
operations
riscv: Apply acquire/release semantics to arch_atomic operations
riscv: Remove arch specific __atomic_acquire/release_fence
RISC-V: KVM: Allow Zalasr extensions for Guest/VM
RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
Documentation/arch/riscv/hwprobe.rst | 5 +-
.../devicetree/bindings/riscv/extensions.yaml | 5 +
arch/riscv/include/asm/atomic.h | 70 ++++++++-
arch/riscv/include/asm/barrier.h | 91 +++++++++--
arch/riscv/include/asm/cmpxchg.h | 144 +++++++++---------
arch/riscv/include/asm/fence.h | 4 -
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/insn-def.h | 79 ++++++++++
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/sys_hwprobe.c | 1 +
arch/riscv/kvm/vcpu_onereg.c | 2 +
.../selftests/kvm/riscv/get-reg-list.c | 4 +
14 files changed, 314 insertions(+), 95 deletions(-)
--
2.20.1
The vector regset uses the maximum possible vlenb (8192) to allocate a
2^18-byte buffer for copying the vector registers, but most platforms
don't support the largest vlenb.
The regset has two users: the ptrace syscall and coredump. When
handling PTRACE_GETREGSET requests from the ptrace syscall, Linux
prepares a kernel buffer whose size is min(user buffer size, limit). A
malicious user process might overwhelm a memory-constrained system when
the buffer limit is very large. Coredump uses regset_get_alloc() to get
the vector register context, but this API allocates the buffer before
checking whether the target process uses the vector extension, wasting
time preparing a large memory buffer.
The buffer limit can be determined after getting the platform vlenb in
the early boot stage, which lets the regset buffer match real hardware
limits. Also add .active callbacks so that coredump skips the vector
part when the target process doesn't use it.
After this patchset, a userspace process needs two ptrace syscalls to
retrieve the vector regset with PTRACE_GETREGSET: the first call reads
only the header to get the vlenb information; a suitably sized buffer
can then be prepared to get the register context, as sketched below.
The new vector ptrace kselftest demonstrates this.
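A rough sketch of the two-call flow (struct name from the riscv ptrace
UAPI; error handling omitted):

  #include <stdlib.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <linux/elf.h>          /* NT_RISCV_VECTOR */
  #include <asm/ptrace.h>         /* struct __riscv_v_regset_state */

  static void read_vector_regset(pid_t pid)
  {
          struct __riscv_v_regset_state hdr;
          struct iovec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };

          /* First call: header only, to learn the real vlenb. */
          ptrace(PTRACE_GETREGSET, pid, NT_RISCV_VECTOR, &iov);

          /* Second call: header plus 32 vector registers of vlenb bytes
           * each. */
          iov.iov_len = sizeof(hdr) + hdr.vlenb * 32;
          iov.iov_base = malloc(iov.iov_len);
          ptrace(PTRACE_GETREGSET, pid, NT_RISCV_VECTOR, &iov);
  }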
---
v2:
- fix issues in vector ptrace kselftest (Andy)
Yong-Xuan Wang (2):
riscv: ptrace: Optimize the allocation of vector regset
selftests: riscv: Add test for the Vector ptrace interface
arch/riscv/include/asm/vector.h | 1 +
arch/riscv/kernel/ptrace.c | 24 +++-
arch/riscv/kernel/vector.c | 2 +
tools/testing/selftests/riscv/vector/Makefile | 5 +-
.../selftests/riscv/vector/vstate_ptrace.c | 134 ++++++++++++++++++
5 files changed, 162 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/riscv/vector/vstate_ptrace.c
--
2.43.0
Introduce kernel version and seccomp filter flag support checks to the
seccomp selftests, along with conditional header inclusion using
__GLIBC_PREREQ.
Gate some tests by kernel version and adjust for flags introduced after
kernel 5.4. This ensures the selftests can run and pass correctly on
kernel versions 5.4 and later, preventing failures due to features not
present in older kernels.
The use of __GLIBC_PREREQ ensures proper compilation and functionality
across different glibc versions in a mainline Linux kernel context.
While it might appear redundant in specific build environments due to
global overrides, it is crucial for upstream correctness and portability.
Signed-off-by: Wake Liu <wakel@google.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 108 ++++++++++++++++--
1 file changed, 99 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..9b660cff5a4a 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -13,12 +13,14 @@
* we need to use the kernel's siginfo.h file and trick glibc
* into accepting it.
*/
+#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
#if !__GLIBC_PREREQ(2, 26)
# include <asm/siginfo.h>
# define __have_siginfo_t 1
# define __have_sigval_t 1
# define __have_sigevent_t 1
#endif
+#endif
#include <errno.h>
#include <linux/filter.h>
@@ -300,6 +302,26 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
}
#endif
+int seccomp_flag_supported(int flag)
+{
+ /*
+ * Probes if a seccomp filter flag is supported by the kernel.
+ *
+ * When an unsupported flag is passed to seccomp(SECCOMP_SET_MODE_FILTER, ...),
+ * the kernel returns EINVAL.
+ *
+ * When a supported flag is passed, the kernel proceeds to validate the
+ * filter program pointer. By passing NULL for the filter program,
+ * the kernel attempts to dereference a bad address, resulting in EFAULT.
+ *
+ * Therefore, checking for EFAULT indicates that the flag itself was
+ * recognized and supported by the kernel.
+ */
+ if (seccomp(SECCOMP_SET_MODE_FILTER, flag, NULL) == -1 && errno == EFAULT)
+ return 1;
+ return 0;
+}
+
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
@@ -2436,13 +2458,12 @@ TEST(detect_seccomp_filter_flags)
ASSERT_NE(ENOSYS, errno) {
TH_LOG("Kernel does not support seccomp syscall!");
}
- EXPECT_EQ(-1, ret);
- EXPECT_EQ(EFAULT, errno) {
- TH_LOG("Failed to detect that a known-good filter flag (0x%X) is supported!",
- flag);
- }
- all_flags |= flag;
+ if (seccomp_flag_supported(flag))
+ all_flags |= flag;
+ else
+ TH_LOG("Filter flag (0x%X) is not found to be supported!",
+ flag);
}
/*
@@ -2870,6 +2891,12 @@ TEST_F(TSYNC, two_siblings_with_one_divergence)
TEST_F(TSYNC, two_siblings_with_one_divergence_no_tid_in_err)
{
+ /* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+ if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+ SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+ return;
+ }
+
long ret, flags;
void *status;
@@ -3475,6 +3502,11 @@ TEST(user_notification_basic)
TEST(user_notification_with_tsync)
{
+ /* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+ if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+ SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+ return;
+ }
int ret;
unsigned int flags;
@@ -3966,6 +3998,13 @@ TEST(user_notification_filter_empty)
TEST(user_ioctl_notification_filter_empty)
{
+ /* Depends on 95036a7 (seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV
+ * when all users have exited) */
+ if (!ksft_min_kernel_version(6, 11)) {
+ SKIP(return, "Kernel version < 6.11");
+ return;
+ }
+
pid_t pid;
long ret;
int status, p[2];
@@ -4119,6 +4158,12 @@ int get_next_fd(int prev_fd)
TEST(user_notification_addfd)
{
+ /* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+ if (!ksft_min_kernel_version(5, 14)) {
+ SKIP(return, "Kernel version < 5.14");
+ return;
+ }
+
pid_t pid;
long ret;
int status, listener, memfd, fd, nextfd;
@@ -4281,6 +4326,12 @@ TEST(user_notification_addfd)
TEST(user_notification_addfd_rlimit)
{
+ /* Depends on 7cf97b1 (seccomp: Introduce addfd ioctl to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 9)) {
+ SKIP(return, "Kernel version < 5.9");
+ return;
+ }
+
pid_t pid;
long ret;
int status, listener, memfd;
@@ -4326,9 +4377,12 @@ TEST(user_notification_addfd_rlimit)
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
EXPECT_EQ(errno, EMFILE);
- addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
- EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
- EXPECT_EQ(errno, EMFILE);
+ /* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+ if (ksft_min_kernel_version(5, 14)) {
+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EMFILE);
+ }
addfd.newfd = 100;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
@@ -4356,6 +4410,12 @@ TEST(user_notification_addfd_rlimit)
TEST(user_notification_sync)
{
+ /* Depends on 48a1084 (seccomp: add the synchronous mode for seccomp_unotify) */
+ if (!ksft_min_kernel_version(6, 6)) {
+ SKIP(return, "Kernel version < 6.6");
+ return;
+ }
+
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
int status, listener;
@@ -4520,6 +4580,12 @@ static char get_proc_stat(struct __test_metadata *_metadata, pid_t pid)
TEST(user_notification_fifo)
{
+ /* Depends on 4cbf6f6 (seccomp: Use FIFO semantics to order notifications) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct seccomp_notif_resp resp = {};
struct seccomp_notif req = {};
int i, status, listener;
@@ -4623,6 +4689,12 @@ static long get_proc_syscall(struct __test_metadata *_metadata, int pid)
/* Ensure non-fatal signals prior to receive are unmodified */
TEST(user_notification_wait_killable_pre_notification)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct sigaction new_action = {
.sa_handler = signal_handler,
};
@@ -4693,6 +4765,12 @@ TEST(user_notification_wait_killable_pre_notification)
/* Ensure non-fatal signals after receive are blocked */
TEST(user_notification_wait_killable)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct sigaction new_action = {
.sa_handler = signal_handler,
};
@@ -4772,6 +4850,12 @@ TEST(user_notification_wait_killable)
/* Ensure fatal signals after receive are not blocked */
TEST(user_notification_wait_killable_fatal)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct seccomp_notif req = {};
int listener, status;
pid_t pid;
@@ -4854,6 +4938,12 @@ static void *tsync_vs_dead_thread_leader_sibling(void *_args)
*/
TEST(tsync_vs_dead_thread_leader)
{
+ /* Depends on bfafe5e (seccomp: release task filters when the task exits) */
+ if (!ksft_min_kernel_version(6, 11)) {
+ SKIP(return, "Kernel version < 6.11");
+ return;
+ }
+
int status;
pid_t pid;
long ret;
--
2.50.1.703.g449372360f-goog
Hello,
IIUC this is the first time guest_memfd's in-place conversion work has
been posted as an independent patch series! Happy to finally bring this
out on its own.
Previous versions of this feature, posted as part of other series, are
available at [1][2][3].
Many prior discussions have led up to the main features of this series,
and these are the points I'd most like feedback on:
1. Having private/shared status stored in a maple tree (thanks, Michael,
   for your support of using maple trees over xarrays for performance!
   [4]). A short sketch of this follows the list.
2. Having a new guest_memfd ioctl (not a vm ioctl) that performs conversions.
3. Using ioctls/structs/input attribute similar to the existing vm ioctl
KVM_SET_MEMORY_ATTRIBUTES to perform conversions.
4. Storing requested attributes directly in the maple tree.
5. Using a KVM module-wide param to toggle between setting memory attributes
   via vm and guest_memfd ioctls (making them mutually exclusive: a single
   loaded KVM module can only do one of the two).
6. Skipping LRU for guest_memfd folios: make guest_memfd folios not
   participate in the LRU so that LRU refcounts cannot interfere with
   conversions.
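As a sketch of point 1 (illustrative only: the tree and helper names
here are made up, while DEFINE_MTREE()/mtree_store_range() and
KVM_MEMORY_ATTRIBUTE_PRIVATE are existing kernel APIs):

  #include <linux/maple_tree.h>
  #include <linux/xarray.h>

  static DEFINE_MTREE(gmem_attributes);

  /* Record that GFN range [start, end) is private: a conversion is a
   * single range store, and a lookup is a simple mtree load. */
  static int gmem_mark_private(unsigned long start, unsigned long end)
  {
          return mtree_store_range(&gmem_attributes, start, end - 1,
                                   xa_mk_value(KVM_MEMORY_ATTRIBUTE_PRIVATE),
                                   GFP_KERNEL);
  }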
This series is based on kvm/next, followed by
+ v12 of NUMA mempolicy support patches [5]
+ 3 cleanup patches from Sean [6][7][8]
Everything is stitched together here for your convenience
https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-co…
Thank you all for helping with this series!
If I missed your comment from a previous series, it's not intentional!
Please do raise it again.
TODOs:
+ There might be an issue with memory failure handling when guest_memfd
  folios stop participating in the LRU: from a preliminary analysis,
  HWPoisonHandlable() is only true if PageLRU() is true. This needs
  further investigation.
[1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.172600…
[2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
[3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.174726…
[4] https://lore.kernel.org/all/20250529054227.hh2f4jmyqf6igd3i@amd.com/
[5] https://lore.kernel.org/all/20251007221420.344669-1-seanjc@google.com/T/
[6] https://lore.kernel.org/all/20250924174255.2141847-1-seanjc@google.com/
[7] https://lore.kernel.org/all/20251007224515.374516-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20251007223625.369939-1-seanjc@google.com/
Ackerley Tng (19):
KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
KVM: guest_memfd: Don't set FGP_ACCESSED when getting folios
KVM: guest_memfd: Skip LRU for guest_memfd folios
KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES
KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2
KVM: selftests: guest_memfd: Test basic single-page conversion flow
KVM: selftests: guest_memfd: Test conversion flow when INIT_SHARED
KVM: selftests: guest_memfd: Test indexing in guest_memfd
KVM: selftests: guest_memfd: Test conversion before allocation
KVM: selftests: guest_memfd: Convert with allocated folios in
different layouts
KVM: selftests: guest_memfd: Test precision of conversion
KVM: selftests: guest_memfd: Test that truncation does not change
shared/private status
KVM: selftests: guest_memfd: Test conversion with elevated page
refcount
KVM: selftests: Reset shared memory after hole-punching
KVM: selftests: Provide function to look up guest_memfd details from
gpa
KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
KVM: selftests: Update private_mem_conversions_test to mmap()
guest_memfd
KVM: selftests: Add script to exercise private_mem_conversions_test
Sean Christopherson (18):
KVM: guest_memfd: Introduce per-gmem attributes, use to guard user
mappings
KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem
is defined
KVM: Stub in ability to disable per-VM memory attribute tracking
KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem
attributes
KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
KVM: Let userspace disable per-VM mem attributes, enable per-gmem
attributes
KVM: selftests: Create gmem fd before "regular" fd when adding memslot
KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
KVM: selftests: Add support for mmap() on guest_memfd in core library
KVM: selftests: Add helpers for calling ioctls on guest_memfd
KVM: selftests: guest_memfd: Test that shared/private status is
consistent across processes
KVM: selftests: Add selftests global for guest memory attributes
capability
KVM: selftests: Provide common function to set memory attributes
KVM: selftests: Check fd/flags provided to mmap() when setting up
memslot
KVM: selftests: Update pre-fault test to work with per-guest_memfd
attributes
KVM: selftests: Update private memory exits test work with per-gmem
attributes
Documentation/virt/kvm/api.rst | 72 ++-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/Kconfig | 15 +-
arch/x86/kvm/mmu/mmu.c | 4 +-
arch/x86/kvm/x86.c | 13 +-
include/linux/kvm_host.h | 44 +-
include/trace/events/kvm.h | 4 +-
include/uapi/linux/kvm.h | 17 +
mm/filemap.c | 1 +
mm/memcontrol.c | 2 +
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../kvm/guest_memfd_conversions_test.c | 498 ++++++++++++++++++
.../testing/selftests/kvm/include/kvm_util.h | 127 ++++-
.../testing/selftests/kvm/include/test_util.h | 29 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 128 +++--
tools/testing/selftests/kvm/lib/test_util.c | 7 -
.../selftests/kvm/pre_fault_memory_test.c | 2 +-
.../kvm/x86/private_mem_conversions_test.c | 55 +-
.../kvm/x86/private_mem_conversions_test.py | 159 ++++++
.../kvm/x86/private_mem_kvm_exits_test.c | 36 +-
virt/kvm/Kconfig | 4 +-
virt/kvm/guest_memfd.c | 414 +++++++++++++--
virt/kvm/kvm_main.c | 104 +++-
24 files changed, 1554 insertions(+), 185 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.py
--
2.51.0.858.gf9c4a03a3a-goog
Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA)
during a guest abort, today KVM directly injects an asynchronous SError
into the VCPU and then resumes it. The injected SError usually results
in an unpleasant guest kernel panic.
One of the major situations producing a guest SEA is a VCPU consuming a
recoverable uncorrected memory error (UER), which is not uncommon at
all in modern datacenter servers with large amounts of physical memory.
Although an SError and guest panic are sufficient to stop the
propagation of corrupted memory, there is room to recover from a UER in
a more graceful manner.
Proposed Solution
=================
The idea is, we can replay the SEA to the faulting VCPU. If the memory
error consumption or the fault that causes the SEA is not from the
guest kernel, the blast radius can be limited to the poison-consuming
guest process, while the VM can keep running.
In addition, instead of handling this under the hood without involving
userspace, there are benefits to redirecting the SEA to the VMM:
- VM customers care about the disruptions caused by memory errors, and
VMM usually has the responsibility to start the process of notifying
  the customers of memory error events in their VMs. For example, some
  cloud providers emit a critical log in their observability UI [1] and
  provide a playbook for customers on how to mitigate disruptions to
  their workloads.
- VMM can protect future memory error consumption by unmapping the poisoned
pages from stage-2 page table with KVM userfault [2], or by splitting the
memslot that contains the poisoned pages.
- VMM can keep track of SEA events in the VM. When VMM thinks the status
  on the host or the VM is bad enough, e.g. the number of distinct SEAs
  exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with the x86 architecture. When a machine check
  exception (MCE) is caused by a VCPU, the kernel or KVM signals SIGBUS
  to userspace to let the VMM either recover from the MCE or terminate
  itself along with the VM. The prior RFC proposed to implement SIGBUS
  on arm64 as well, but Marc preferred a KVM exit over a signal [3].
  Implementation aside, however, returning the SEA to the VMM is on par
  with returning an MCE to the VMM.
Once the SEA is redirected to the VMM, among other actions, the VMM is
encouraged to inject an external abort into the faulting VCPU.
New UAPIs
=========
This patchset introduces the following userspace-visible changes to
empower the VMM to control what happens on an SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking an SEA, if userspace has
  enabled this new capability at VM creation, and the SEA is not on
  kernel-allocated memory, return KVM_EXIT_ARM_SEA to userspace instead
  of injecting an SError.
- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. The
  details about the SEA are provided in arm_sea as much as possible,
  including the sanitized ESR value at EL2 and the faulting guest
  virtual and physical addresses if available.
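To sketch the intended VMM side (the arm_sea field names are taken from
this series' UAPI, which is not in mainline; log_sea_event() and the
recovery steps are placeholders):

  /* In the VMM's vcpu loop, after ioctl(vcpu_fd, KVM_RUN, 0): */
  switch (run->exit_reason) {
  case KVM_EXIT_ARM_SEA:
          /* Sanitized ESR plus the faulting addresses, when known. */
          log_sea_event(run->arm_sea.esr,
                        run->arm_sea.gva, run->arm_sea.gpa);
          /* e.g. unmap the poisoned page from stage-2, notify the
           * customer, then inject an external abort into the VCPU. */
          break;
  }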
* From v3 [4]
- Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
- In selftest, print a message if GVA or GPA expects to be valid.
* From v2 [5]:
- Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6]
and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI
mappings when vPE allocation is disabled")
- Took the host_owns_sea implementation from Oliver [7, 8].
- Excluded the guest SEA injection patches.
- Updated selftest.
* From v1 [9]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.…
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (3):
KVM: arm64: VM exit to userspace to handle SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
Documentation: kvm: new UAPI for handling SEA
Documentation/virt/kvm/api.rst | 61 ++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +
arch/arm64/kvm/mmu.c | 68 +++-
include/uapi/linux/kvm.h | 10 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
9 files changed, 480 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.51.0.760.g7b8bcc2412-goog
From: Jack Thomson <jackabt(a)amazon.com>
This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY
feature, which was previously only available on x86 [1]. This allows us
to reduce the number of stage-2 faults during execution. This benefits
post-copy migration scenarios, particularly memory-intensive
applications, where we see high latencies due to stage-2 faults.
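For reference, a rough usage sketch of the existing x86 UAPI that this
series wires up for arm64 (the GPA range and vcpu_fd are illustrative):

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
  {
          struct kvm_pre_fault_memory range = {
                  .gpa  = gpa,
                  .size = size,
          };

          /* KVM advances gpa and shrinks size as stage-2 is populated,
           * so loop until the whole range has been mapped. */
          while (range.size) {
                  if (ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range) < 0 &&
                      errno != EINTR)
                          break;
          }
  }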
Patch Overview:
- The first patch adds support for the KVM_PRE_FAULT_MEMORY ioctl
on arm64.
- The second patch fixes an issue with unaligned mmap allocations
in the selftests.
- The third patch updates the pre_fault_memory_test to support
arm64.
- The last patch extends the pre_fault_memory_test to cover
different vm memory backings.
=== Changes Since v1 [2] ===
Addressing feedback from Oliver:
- No pre-fault flag is passed to user_mem_abort() or gmem_abort() now
  that aborts are synthesized.
- Remove retry loop from kvm_arch_vcpu_pre_fault_memory()
[1]: https://lore.kernel.org/kvm/20240710174031.312055-1-pbonzini@redhat.com
[2]: https://lore.kernel.org/all/20250911134648.58945-1-jackabt.amazon@gmail.com
Jack Thomson (4):
KVM: arm64: Add pre_fault_memory implementation
KVM: selftests: Fix unaligned mmap allocations
KVM: selftests: Enable pre_fault_memory_test for arm64
KVM: selftests: Add option for different backing in pre-fault tests
Documentation/virt/kvm/api.rst | 3 +-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/mmu.c | 73 +++++++++++-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +-
.../selftests/kvm/pre_fault_memory_test.c | 110 +++++++++++++-----
7 files changed, 163 insertions(+), 38 deletions(-)
base-commit: 42188667be387867d2bf763d028654cbad046f7b
--
2.43.0
sched_ext tasks can be starved by long-running RT tasks, especially
since RT throttling was replaced by deadline servers, which boost only
SCHED_NORMAL tasks.
Several users in the community have reported issues with RT stalling
sched_ext tasks. This is fairly common on distributions or environments
where applications like video compositors, audio services, etc. run as RT
tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland
compositor running as an RT task):
runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
...
CPU 0 : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
curr=sway[994] class=rt_sched_class
R kworker/0:0[106377] -5043ms
scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality
schedulers can't do much: RT tasks run outside their control and can
potentially consume 100% of the CPU bandwidth.
Fix this by adding a sched_ext deadline server, so that sched_ext tasks are
also boosted and do not suffer starvation.
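Conceptually, the wiring looks roughly like this (a sketch modeled on
the existing fair dl_server; rq->ext_server comes from this series and
the callback names are assumptions):

  /* Register a per-rq deadline server whose pick callback returns
   * runnable sched_ext tasks, so the deadline class reserves bandwidth
   * for them even under sustained RT load. */
  static void ext_dl_server_init(struct rq *rq)
  {
          dl_server_init(&rq->ext_server, rq,
                         ext_server_has_tasks, ext_server_pick_task);
  }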
Two kselftests are also provided to verify that the starvation fix and
the bandwidth allocation are correct.
== Highlights in this version ==
- wait for inactive_task_timer() to fire before removing the bandwidth
reservation (Juri/Peter: please check if this new
dl_server_remove_params() implementation makes sense to you)
- removed the explicit dl_server_stop() from dequeue_task_scx() and rely
on the delayed stop behavior (Juri/Peter: ditto)
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v10:
- reordered patches to better isolate sched_ext changes vs sched/deadline
changes (Andrea Righi)
- define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
- add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
- wait for inactive_task_timer to fire before removing the bandwidth
reservation (Juri Lelli)
- remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer
reprogramming overhead (Juri Lelli)
- do not restart pick_task() when invoked by the dl_server (Tejun Heo)
- rename rq_dl_server to dl_server (Peter Zijlstra)
- fixed a missing dl_server start in dl_server_on() (Christian Loehle)
- add a comment to the rt_stall selftest to better explain the 4%
threshold (Emil Tsalapatis)
Changes in v9:
- Drop the ->balance() logic as its functionality is now integrated into
->pick_task(), allowing dl_server to call pick_task_scx() directly
- Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/
Changes in v8:
- Add tj's patch to de-couple balance and pick_task and avoid changing
sched/core callbacks to propagate @rf
- Simplify dl_se->dl_server check (suggested by PeterZ)
- Small coding style fixes in the kselftests
- Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/
Changes in v7:
- Rebased to Linus master
- Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/
Changes in v6:
- Added Acks to few patches
- Fixes to few nits suggested by Tejun
- Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/
Changes in v5:
- Added a kselftest (total_bw) to sched_ext to verify bandwidth values
from debugfs
- Address comment from Andrea about redundant rq clock invalidation
- Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/
Changes in v4:
- Fixed issues with hotplugged CPUs having their DL server bandwidth
altered due to loading SCX
- Fixed other issues
- Rebased on Linus master
- All sched_ext kselftests reliably pass now, also verified that the
total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
- Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/
Changes in v3:
- Removed code duplication in debugfs. Made ext interface separate
- Fixed issue where rq_lock_irqsave was not used in the relinquish patch
- Fixed running bw accounting issue in dl_server_remove_params
- Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
Changes in v2:
- Fixed a hang related to using rq_lock instead of rq_lock_irqsave
- Added support to remove BW of DL servers when they are switched to/from EXT
- Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Andrea Righi (5):
sched/deadline: Add support to initialize and remove dl_server bandwidth
sched_ext: Add a DL server for sched_ext tasks
sched/deadline: Account ext server bandwidth
sched_ext: Selectively enable ext and fair DL servers
selftests/sched_ext: Add test for sched_ext dl_server
Joel Fernandes (6):
sched/debug: Fix updating of ppos on server write ops
sched/debug: Stop and start server based on if it was active
sched/deadline: Clear the defer params
sched/deadline: Add a server arg to dl_server_update_idle_time()
sched/debug: Add support to change sched_ext server params
selftests/sched_ext: Add test for DL server total_bw consistency
kernel/sched/core.c | 3 +
kernel/sched/deadline.c | 169 +++++++++++---
kernel/sched/debug.c | 171 +++++++++++---
kernel/sched/ext.c | 144 +++++++++++-
kernel/sched/fair.c | 2 +-
kernel/sched/idle.c | 2 +-
kernel/sched/sched.h | 8 +-
kernel/sched/topology.c | 5 +
tools/testing/selftests/sched_ext/Makefile | 2 +
tools/testing/selftests/sched_ext/rt_stall.bpf.c | 23 ++
tools/testing/selftests/sched_ext/rt_stall.c | 222 ++++++++++++++++++
tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++++++
12 files changed, 955 insertions(+), 77 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
create mode 100644 tools/testing/selftests/sched_ext/total_bw.c