Prior to commit 9245fd6b8531 ("KVM: x86: model canonical checks more
precisely"), KVM_SET_NESTED_STATE would fail if the state was captured
with L2 active, L1 had CR4.LA57 set, L2 did not, and the
VMCS12.HOST_GSBASE field (or another host-state field checked for
canonicality) held an address wider than 48 bits.
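For reference, a minimal sketch (not code from this series) of the
canonicality rule at stake: an address is canonical when bits [63:N]
sign-extend bit N-1, where N is 48 with 4-level paging and 57 when
CR4.LA57 is set, so a value can be canonical for an LA57 L1 while
failing the 48-bit check.

  #include <stdbool.h>
  #include <stdint.h>

  static inline bool is_canonical(uint64_t va, unsigned int vaddr_bits)
  {
          /* Shift the sign bit (bit vaddr_bits - 1) up to bit 63, then
           * arithmetic-shift back; the value survives unchanged iff bits
           * [63:vaddr_bits] already sign-extend it. */
          return ((int64_t)(va << (64 - vaddr_bits)) >> (64 - vaddr_bits)) ==
                 (int64_t)va;
  }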
Add a regression test that reproduces the KVM_SET_NESTED_STATE failure
conditions. To do so, the first three patches add support for 5-level
paging in the selftest L1 VM.
v1 -> v2:
- Ended the page walking loops before visiting 4K mappings [Yosry]
- Changed VM_MODE_PXXV48_4K into VM_MODE_PXXVYY_4K; use 5-level paging
  when possible [Sean]
- Removed the check for non-NULL vmx_pages in guest_code() [Yosry]
Jim Mattson (4):
KVM: selftests: Use a loop to create guest page tables
KVM: selftests: Use a loop to walk guest page tables
KVM: selftests: Change VM_MODE_PXXV48_4K to VM_MODE_PXXVYY_4K
KVM: selftests: Add a VMX test for LA57 nested state
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 4 +-
.../selftests/kvm/include/x86/processor.h | 2 +-
.../selftests/kvm/lib/arm64/processor.c | 2 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 30 ++--
.../testing/selftests/kvm/lib/x86/processor.c | 80 +++++------
tools/testing/selftests/kvm/lib/x86/vmx.c | 6 +-
.../kvm/x86/vmx_la57_nested_state_test.c | 134 ++++++++++++++++++
8 files changed, 197 insertions(+), 62 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/vmx_la57_nested_state_test.c
--
2.51.1.851.g4ebd6896fd-goog
Currently it is not possible to disable streaming mode via ptrace on
SME-only systems: the interface for doing this is to write via
NT_ARM_SVE, but such writes are rejected on a system without SVE
support. Enable this functionality by allowing userspace to write
SVE_PT_REGS_FPSIMD format data via NT_ARM_SVE with the vector length
set to 0 on SME-only systems. Such writes currently fail since we
require that a vector length be specified, which should minimise the
risk that existing software is relying on the current behaviour.
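For illustration, a rough sketch of such a write (types and constants
from the arm64 ptrace UAPI; error handling omitted, and the vl == 0
acceptance of course only applies with this series):

  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <linux/elf.h>          /* NT_ARM_SVE */
  #include <asm/ptrace.h>         /* user_sve_header, SVE_PT_REGS_FPSIMD */

  static long exit_streaming_mode(pid_t pid,
                                  const struct user_fpsimd_state *fp)
  {
          struct {
                  struct user_sve_header hdr;
                  struct user_fpsimd_state fpsimd;
          } buf = {
                  .hdr = {
                          .size  = sizeof(buf),
                          .flags = SVE_PT_REGS_FPSIMD,
                          .vl    = 0,   /* accepted on SME-only systems */
                  },
                  .fpsimd = *fp,
          };
          struct iovec iov = { .iov_base = &buf, .iov_len = sizeof(buf) };

          return ptrace(PTRACE_SETREGSET, pid, NT_ARM_SVE, &iov);
  }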
Reads are not supported since I am not aware of any use case for this,
and there is some risk that an existing userspace application may be
confused if it reads NT_ARM_SVE on a system without SVE. Existing
kernels will return FPSIMD-formatted register state from NT_ARM_SVE if
full SVE state is not stored, for example if the task has not used SVE.
Returning a vector length of 0 would create a risk that software could
try to do things like allocate zero-sized buffers for register state,
while returning a vector length of 128 bits would look like SVE is
supported. It seems safer to just not add read support.
It remains possible for userspace to detect an SME-only system via the
ptrace interface alone, since reads of NT_ARM_SSVE and NT_ARM_ZA will
succeed while reads of NT_ARM_SVE will fail. Read/write access to the
FPSIMD registers in non-streaming mode is available via REGSET_FPR.
The aim is to make a minimally invasive change: no operation that would
previously have succeeded is affected, and we use a previously defined
interface in new circumstances rather than defining completely new ABI.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v2:
- Rebase onto v6.18-rc1
- Link to v1: https://lore.kernel.org/r/20250820-arm64-sme-ptrace-sme-only-v1-0-f7c22b287…
---
Mark Brown (3):
arm64/sme: Support disabling streaming mode via ptrace on SME only systems
kselftest/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
Documentation/arch/arm64/sve.rst | 5 +++
arch/arm64/kernel/ptrace.c | 40 +++++++++++++++---
tools/testing/selftests/arm64/fp/fp-ptrace.c | 5 +--
tools/testing/selftests/arm64/fp/sve-ptrace.c | 61 +++++++++++++++++++++++++++
4 files changed, 100 insertions(+), 11 deletions(-)
---
base-commit: cb6649f6217c0331b885cf787f1d175963e2a1d2
change-id: 20250717-arm64-sme-ptrace-sme-only-1fb850600ea0
Best regards,
--
Mark Brown <broonie@kernel.org>
This patch series introduces LANDLOCK_SCOPE_MEMFD_EXEC, a new Landlock
scoping mechanism that restricts execution of anonymous memory file
descriptors (memfd) created via memfd_create(2). This addresses security
gaps where processes can bypass W^X policies and execute arbitrary code
through anonymous memory objects.
Fixes: https://github.com/landlock-lsm/linux/issues/37
SECURITY PROBLEM
================
Current Landlock filesystem restrictions do not cover memfd objects,
allowing processes to:
1. Read-to-execute bypass: Create writable memfd, inject code,
then execute via mmap(PROT_EXEC) or direct execve()
2. Anonymous execution: Execute code without touching the filesystem via
execve("/proc/self/fd/N") where N is a memfd descriptor
3. Cross-domain access violations: Pass memfd between processes to
bypass domain restrictions
These scenarios can occur in sandboxed environments where filesystem
access is restricted but memfd creation remains possible; a brief
illustration follows.
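For illustration, a hedged sketch of scenarios 1 and 2 (error handling
omitted; elf_image, argv and envp are placeholders):

  #include <sys/mman.h>
  #include <unistd.h>

  static int run_anonymous_payload(const void *elf_image, size_t elf_size,
                                   char *const argv[], char *const envp[])
  {
          int fd = memfd_create("payload", MFD_CLOEXEC);

          write(fd, elf_image, elf_size); /* inject code into anon memory */
          return fexecve(fd, argv, envp); /* execute without touching the
                                           * filesystem; equivalent to
                                           * execve("/proc/self/fd/N") */
  }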
IMPLEMENTATION
==============
The implementation adds hierarchical execution control through domain
scoping:
Core Components:
- is_memfd_file(): Reliable memfd detection via "memfd:" dentry prefix
  (sketched below)
- domain_is_scoped(): Cross-domain hierarchy checking (moved to domain.c)
- LSM hooks: mmap_file, file_mprotect, bprm_creds_for_exec
- Creation-time restrictions: hook_file_alloc_security
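A minimal sketch of the detection idea (memfd_create() names its files
with a "memfd:" prefix); this is an approximation, not the exact helper
from the series:

  #include <linux/fs.h>
  #include <linux/string.h>

  static bool is_memfd_file(const struct file *file)
  {
          /* memfd inodes have no path of their own; the synthetic dentry
           * name is "memfd:<name from memfd_create()>". */
          return file &&
                 !strncmp(file->f_path.dentry->d_name.name, "memfd:", 6);
  }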
Security Matrix:
Execution decisions follow domain hierarchy rules, preventing both
same-domain bypass attempts and cross-domain access violations while
preserving legitimate hierarchical access patterns.
Domain Hierarchy with LANDLOCK_SCOPE_MEMFD_EXEC:
================================================
Root (no domain) - No restrictions
|
+-- Domain A [SCOPE_MEMFD_EXEC]                      Layer 1
|   +-- memfd_A (tagged with Domain A as creator)
|   |
|   +-- Domain A1 (child) [NO SCOPE]                 Layer 2
|   |   +-- Inherits Layer 1 restrictions from parent
|   |   +-- memfd_A1 (can create, inherits restrictions)
|   |   +-- Domain A1a [SCOPE_MEMFD_EXEC]            Layer 3
|   |       +-- memfd_A1a (tagged with Domain A1a)
|   |
|   +-- Domain A2 (child) [SCOPE_MEMFD_EXEC]         Layer 2
|       +-- memfd_A2 (tagged with Domain A2 as creator)
|       +-- CANNOT access memfd_A1 (different subtree)
|
+-- Domain B [SCOPE_MEMFD_EXEC]                      Layer 1
    +-- memfd_B (tagged with Domain B as creator)
    +-- CANNOT access ANY memfd from Domain A subtree
Execution Decision Matrix:
==========================
Executor->  |  A  | A1  | A1a | A2  |  B  | Root
Creator     |     |     |     |     |     |
------------|-----|-----|-----|-----|-----|-----
Domain A    |  X  |  X  |  X  |  X  |  X  |  Y
Domain A1   |  Y  |  X  |  X  |  X  |  X  |  Y
Domain A1a  |  Y  |  Y  |  X  |  X  |  X  |  Y
Domain A2   |  Y  |  X  |  X  |  X  |  X  |  Y
Domain B    |  X  |  X  |  X  |  X  |  X  |  Y
Root        |  Y  |  Y  |  Y  |  Y  |  Y  |  Y
Legend: Y = Execution allowed, X = Execution denied
Scenarios Covered:
- Direct mmap(PROT_EXEC) on memfd files
- Two-stage mmap(PROT_READ) + mprotect(PROT_EXEC) bypass attempts
- execve("/proc/self/fd/N") anonymous execution
- execveat() and fexecve() file descriptor execution
- Cross-process memfd inheritance and IPC passing
TESTING
=======
All patches have been validated with:
- scripts/checkpatch.pl --strict (clean)
- Selftests covering same-domain restrictions, cross-domain
hierarchy enforcement, and regular file isolation
- KUnit tests for memfd detection edge cases
DISCLAIMER
==========
My understanding of Landlock scoping semantics may be limited, but this
implementation reflects my current understanding based on available
documentation and code analysis. I welcome feedback and corrections
regarding the scoping logic and domain hierarchy enforcement.
Signed-off-by: Abhinav Saxena <xandfury@gmail.com>
---
Abhinav Saxena (4):
landlock: add LANDLOCK_SCOPE_MEMFD_EXEC scope
landlock: implement memfd detection
landlock: add memfd exec LSM hooks and scoping
selftests/landlock: add memfd execution tests
include/uapi/linux/landlock.h | 5 +
security/landlock/.kunitconfig | 1 +
security/landlock/audit.c | 4 +
security/landlock/audit.h | 1 +
security/landlock/cred.c | 14 -
security/landlock/domain.c | 67 ++++
security/landlock/domain.h | 4 +
security/landlock/fs.c | 405 ++++++++++++++++++++-
security/landlock/limits.h | 2 +-
security/landlock/task.c | 67 ----
.../selftests/landlock/scoped_memfd_exec_test.c | 325 +++++++++++++++++
11 files changed, 812 insertions(+), 83 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250716-memfd-exec-ac0d582018c3
Best regards,
--
Abhinav Saxena <xandfury@gmail.com>
This patch series adds support for the Zalasr ISA extension, which
supplies native load-acquire/store-release instructions.
The specification can be found here:
https://github.com/riscv/riscv-zalasr/blob/main/chapter2.adoc
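For context, a minimal sketch of the load-acquire side (32-bit case
only; the mnemonic follows the draft spec and assembler support is
assumed here, whereas the series emits the encodings via insn-def.h):

  #define __smp_load_acquire_w(p)                                     \
  ({                                                                  \
          typeof(*(p)) ___ret;                                        \
          /* lw.aq: load with acquire semantics, no separate fence */ \
          __asm__ __volatile__("lw.aq %0, (%1)"                       \
                               : "=r"(___ret)                         \
                               : "r"(p)                               \
                               : "memory");                           \
          ___ret;                                                     \
  })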
This patch series has been tested with LTP on QEMU with Brensan's
Zalasr support patch [1].
checkpatch.pl reports some false-positive spacing errors on these
patches, so I have CCed the checkpatch.pl maintainers as well.
[1] https://lore.kernel.org/all/CAGPSXwJEdtqW=nx71oufZp64nK6tK=0rytVEcz4F-gfvCO…
v4:
- Apply acquire/release semantics to arch_atomic operations. Thanks
to Andrea.
v3:
- Apply acquire/release semantics to arch_xchg/arch_cmpxchg operations
so as to ensure FENCE.TSO ordering between operations which precede the
UNLOCK+LOCK sequence and operations which follow the sequence. Thanks
to Andrea.
- Support hwprobe of Zalasr.
- Allow Zalasr extensions for Guest/VM.
v2:
- Adjust the order of Zalasr and Zalrsc in dt-bindings. Thanks to
Conor.
Xu Lu (10):
riscv: Add ISA extension parsing for Zalasr
dt-bindings: riscv: Add Zalasr ISA extension description
riscv: hwprobe: Export Zalasr extension
riscv: Introduce Zalasr instructions
riscv: Apply Zalasr to smp_load_acquire/smp_store_release
riscv: Apply acquire/release semantics to arch_xchg/arch_cmpxchg
operations
riscv: Apply acquire/release semantics to arch_atomic operations
riscv: Remove arch specific __atomic_acquire/release_fence
RISC-V: KVM: Allow Zalasr extensions for Guest/VM
RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
Documentation/arch/riscv/hwprobe.rst | 5 +-
.../devicetree/bindings/riscv/extensions.yaml | 5 +
arch/riscv/include/asm/atomic.h | 70 ++++++++-
arch/riscv/include/asm/barrier.h | 91 +++++++++--
arch/riscv/include/asm/cmpxchg.h | 144 +++++++++---------
arch/riscv/include/asm/fence.h | 4 -
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/insn-def.h | 79 ++++++++++
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/sys_hwprobe.c | 1 +
arch/riscv/kvm/vcpu_onereg.c | 2 +
.../selftests/kvm/riscv/get-reg-list.c | 4 +
14 files changed, 314 insertions(+), 95 deletions(-)
--
2.20.1
The vector regset uses the maximum possible vlenb (8192) to allocate a
2^18-byte buffer for copying the vector registers, but most platforms
don't support the largest vlenb.
The regset has two users: the ptrace syscall and coredump. When
handling PTRACE_GETREGSET requests from the ptrace syscall, Linux
prepares a kernel buffer whose size is min(user buffer size, limit). A
malicious user process might overwhelm a memory-constrained system when
the buffer limit is very large. Coredump uses regset_get_alloc() to get
the vector register context, but this API allocates the buffer before
checking whether the target process uses the vector extension, wasting
time preparing a large memory buffer.
The buffer limit can be determined after getting the platform vlenb in
the early boot stage, which lets the regset buffer match real hardware
limits. Also add .active callbacks so that coredump skips the vector
part when the target process doesn't use it.
After this patchset, a userspace process needs two ptrace syscalls to
retrieve the vector regset with PTRACE_GETREGSET: the first call reads
only the header to get the vlenb information; a suitably sized buffer
can then be prepared to get the register context, as sketched below.
The new vector ptrace kselftest demonstrates this.
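A rough sketch of the two-call flow (struct name from the riscv ptrace
UAPI; error handling omitted):

  #include <stdlib.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <linux/elf.h>          /* NT_RISCV_VECTOR */
  #include <asm/ptrace.h>         /* struct __riscv_v_regset_state */

  static void read_vector_regset(pid_t pid)
  {
          struct __riscv_v_regset_state hdr;
          struct iovec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };

          /* First call: header only, to learn the real vlenb. */
          ptrace(PTRACE_GETREGSET, pid, NT_RISCV_VECTOR, &iov);

          /* Second call: header plus 32 vector registers of vlenb bytes
           * each. */
          iov.iov_len = sizeof(hdr) + hdr.vlenb * 32;
          iov.iov_base = malloc(iov.iov_len);
          ptrace(PTRACE_GETREGSET, pid, NT_RISCV_VECTOR, &iov);
  }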
---
v2:
- fix issues in vector ptrace kselftest (Andy)
Yong-Xuan Wang (2):
riscv: ptrace: Optimize the allocation of vector regset
selftests: riscv: Add test for the Vector ptrace interface
arch/riscv/include/asm/vector.h | 1 +
arch/riscv/kernel/ptrace.c | 24 +++-
arch/riscv/kernel/vector.c | 2 +
tools/testing/selftests/riscv/vector/Makefile | 5 +-
.../selftests/riscv/vector/vstate_ptrace.c | 134 ++++++++++++++++++
5 files changed, 162 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/riscv/vector/vstate_ptrace.c
--
2.43.0
Introduce kernel version and seccomp filter flag support checks to the
seccomp selftests, along with conditional header inclusion using
__GLIBC_PREREQ.
Gate some tests by kernel version and adjust for flags introduced after
kernel 5.4. This ensures the selftests can run and pass correctly on
kernel versions 5.4 and later, preventing failures due to features not
present in older kernels.
The use of __GLIBC_PREREQ ensures proper compilation and functionality
across different glibc versions in a mainline Linux kernel context.
While it might appear redundant in specific build environments due to
global overrides, it is crucial for upstream correctness and portability.
Signed-off-by: Wake Liu <wakel@google.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 108 ++++++++++++++++--
1 file changed, 99 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..9b660cff5a4a 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -13,12 +13,14 @@
* we need to use the kernel's siginfo.h file and trick glibc
* into accepting it.
*/
+#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
#if !__GLIBC_PREREQ(2, 26)
# include <asm/siginfo.h>
# define __have_siginfo_t 1
# define __have_sigval_t 1
# define __have_sigevent_t 1
#endif
+#endif
#include <errno.h>
#include <linux/filter.h>
@@ -300,6 +302,26 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
}
#endif
+int seccomp_flag_supported(int flag)
+{
+ /*
+ * Probes if a seccomp filter flag is supported by the kernel.
+ *
+ * When an unsupported flag is passed to seccomp(SECCOMP_SET_MODE_FILTER, ...),
+ * the kernel returns EINVAL.
+ *
+ * When a supported flag is passed, the kernel proceeds to validate the
+ * filter program pointer. By passing NULL for the filter program,
+ * the kernel attempts to dereference a bad address, resulting in EFAULT.
+ *
+ * Therefore, checking for EFAULT indicates that the flag itself was
+ * recognized and supported by the kernel.
+ */
+ if (seccomp(SECCOMP_SET_MODE_FILTER, flag, NULL) == -1 && errno == EFAULT)
+ return 1;
+ return 0;
+}
+
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
@@ -2436,13 +2458,12 @@ TEST(detect_seccomp_filter_flags)
ASSERT_NE(ENOSYS, errno) {
TH_LOG("Kernel does not support seccomp syscall!");
}
- EXPECT_EQ(-1, ret);
- EXPECT_EQ(EFAULT, errno) {
- TH_LOG("Failed to detect that a known-good filter flag (0x%X) is supported!",
- flag);
- }
- all_flags |= flag;
+ if (seccomp_flag_supported(flag))
+ all_flags |= flag;
+ else
+ TH_LOG("Filter flag (0x%X) is not found to be supported!",
+ flag);
}
/*
@@ -2870,6 +2891,12 @@ TEST_F(TSYNC, two_siblings_with_one_divergence)
TEST_F(TSYNC, two_siblings_with_one_divergence_no_tid_in_err)
{
+ /* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+ if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+ SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+ return;
+ }
+
long ret, flags;
void *status;
@@ -3475,6 +3502,11 @@ TEST(user_notification_basic)
TEST(user_notification_with_tsync)
{
+ /* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+ if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+ SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+ return;
+ }
int ret;
unsigned int flags;
@@ -3966,6 +3998,13 @@ TEST(user_notification_filter_empty)
TEST(user_ioctl_notification_filter_empty)
{
+ /* Depends on 95036a7 (seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV
+ * when all users have exited) */
+ if (!ksft_min_kernel_version(6, 11)) {
+ SKIP(return, "Kernel version < 6.11");
+ return;
+ }
+
pid_t pid;
long ret;
int status, p[2];
@@ -4119,6 +4158,12 @@ int get_next_fd(int prev_fd)
TEST(user_notification_addfd)
{
+ /* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+ if (!ksft_min_kernel_version(5, 14)) {
+ SKIP(return, "Kernel version < 5.14");
+ return;
+ }
+
pid_t pid;
long ret;
int status, listener, memfd, fd, nextfd;
@@ -4281,6 +4326,12 @@ TEST(user_notification_addfd)
TEST(user_notification_addfd_rlimit)
{
+ /* Depends on 7cf97b1 (seccomp: Introduce addfd ioctl to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 9)) {
+ SKIP(return, "Kernel version < 5.9");
+ return;
+ }
+
pid_t pid;
long ret;
int status, listener, memfd;
@@ -4326,9 +4377,12 @@ TEST(user_notification_addfd_rlimit)
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
EXPECT_EQ(errno, EMFILE);
- addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
- EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
- EXPECT_EQ(errno, EMFILE);
+ /* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+ if (ksft_min_kernel_version(5, 14)) {
+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EMFILE);
+ }
addfd.newfd = 100;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
@@ -4356,6 +4410,12 @@ TEST(user_notification_addfd_rlimit)
TEST(user_notification_sync)
{
+ /* Depends on 48a1084 (seccomp: add the synchronous mode for seccomp_unotify) */
+ if (!ksft_min_kernel_version(6, 6)) {
+ SKIP(return, "Kernel version < 6.6");
+ return;
+ }
+
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
int status, listener;
@@ -4520,6 +4580,12 @@ static char get_proc_stat(struct __test_metadata *_metadata, pid_t pid)
TEST(user_notification_fifo)
{
+ /* Depends on 4cbf6f6 (seccomp: Use FIFO semantics to order notifications) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct seccomp_notif_resp resp = {};
struct seccomp_notif req = {};
int i, status, listener;
@@ -4623,6 +4689,12 @@ static long get_proc_syscall(struct __test_metadata *_metadata, int pid)
/* Ensure non-fatal signals prior to receive are unmodified */
TEST(user_notification_wait_killable_pre_notification)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct sigaction new_action = {
.sa_handler = signal_handler,
};
@@ -4693,6 +4765,12 @@ TEST(user_notification_wait_killable_pre_notification)
/* Ensure non-fatal signals after receive are blocked */
TEST(user_notification_wait_killable)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct sigaction new_action = {
.sa_handler = signal_handler,
};
@@ -4772,6 +4850,12 @@ TEST(user_notification_wait_killable)
/* Ensure fatal signals after receive are not blocked */
TEST(user_notification_wait_killable_fatal)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct seccomp_notif req = {};
int listener, status;
pid_t pid;
@@ -4854,6 +4938,12 @@ static void *tsync_vs_dead_thread_leader_sibling(void *_args)
*/
TEST(tsync_vs_dead_thread_leader)
{
+ /* Depends on bfafe5e (seccomp: release task filters when the task exits) */
+ if (!ksft_min_kernel_version(6, 11)) {
+ SKIP(return, "Kernel version < 6.11");
+ return;
+ }
+
int status;
pid_t pid;
long ret;
--
2.50.1.703.g449372360f-goog
Hello,
IIUC this is the first time guest_memfd's in-place conversion work has
been posted as an independent patch series! Happy to finally bring this
out on its own.
Previous versions of this feature, posted as part of other series, are
available at [1][2][3].
Many prior discussions have led up to the main features of this series,
and these are the points I'd most like feedback on:
1. Having private/shared status stored in a maple tree (thanks, Michael,
   for your support of using maple trees over xarrays for performance!
   [4]). A short sketch of this follows the list.
2. Having a new guest_memfd ioctl (not a vm ioctl) that performs conversions.
3. Using ioctls/structs/input attribute similar to the existing vm ioctl
KVM_SET_MEMORY_ATTRIBUTES to perform conversions.
4. Storing requested attributes directly in the maple tree.
5. Using a KVM module-wide param to toggle between setting memory attributes
   via vm and guest_memfd ioctls (making them mutually exclusive: a single
   loaded KVM module can only do one of the two).
6. Skipping LRU for guest_memfd folios: make guest_memfd folios not
   participate in the LRU so that LRU refcounts cannot interfere with
   conversions.
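As a sketch of point 1 (illustrative only: the tree and helper names
here are made up, while DEFINE_MTREE()/mtree_store_range() and
KVM_MEMORY_ATTRIBUTE_PRIVATE are existing kernel APIs):

  #include <linux/maple_tree.h>
  #include <linux/xarray.h>

  static DEFINE_MTREE(gmem_attributes);

  /* Record that GFN range [start, end) is private: a conversion is a
   * single range store, and a lookup is a simple mtree load. */
  static int gmem_mark_private(unsigned long start, unsigned long end)
  {
          return mtree_store_range(&gmem_attributes, start, end - 1,
                                   xa_mk_value(KVM_MEMORY_ATTRIBUTE_PRIVATE),
                                   GFP_KERNEL);
  }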
This series is based on kvm/next, followed by
+ v12 of NUMA mempolicy support patches [5]
+ 3 cleanup patches from Sean [6][7][8]
Everything is stitched together here for your convenience
https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-co…
Thank you all for helping with this series!
If I missed your comment from a previous series, it's not intentional!
Please do raise it again.
TODOs:
+ There might be an issue with memory failure handling when guest_memfd
  folios stop participating in the LRU: from a preliminary analysis,
  HWPoisonHandlable() is only true if PageLRU() is true. This needs
  further investigation.
[1] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.172600…
[2] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
[3] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.174726…
[4] https://lore.kernel.org/all/20250529054227.hh2f4jmyqf6igd3i@amd.com/
[5] https://lore.kernel.org/all/20251007221420.344669-1-seanjc@google.com/T/
[6] https://lore.kernel.org/all/20250924174255.2141847-1-seanjc@google.com/
[7] https://lore.kernel.org/all/20251007224515.374516-1-seanjc@google.com/
[8] https://lore.kernel.org/all/20251007223625.369939-1-seanjc@google.com/
Ackerley Tng (19):
KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
KVM: guest_memfd: Don't set FGP_ACCESSED when getting folios
KVM: guest_memfd: Skip LRU for guest_memfd folios
KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES
KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2
KVM: selftests: guest_memfd: Test basic single-page conversion flow
KVM: selftests: guest_memfd: Test conversion flow when INIT_SHARED
KVM: selftests: guest_memfd: Test indexing in guest_memfd
KVM: selftests: guest_memfd: Test conversion before allocation
KVM: selftests: guest_memfd: Convert with allocated folios in
different layouts
KVM: selftests: guest_memfd: Test precision of conversion
KVM: selftests: guest_memfd: Test that truncation does not change
shared/private status
KVM: selftests: guest_memfd: Test conversion with elevated page
refcount
KVM: selftests: Reset shared memory after hole-punching
KVM: selftests: Provide function to look up guest_memfd details from
gpa
KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
KVM: selftests: Update private_mem_conversions_test to mmap()
guest_memfd
KVM: selftests: Add script to exercise private_mem_conversions_test
Sean Christopherson (18):
KVM: guest_memfd: Introduce per-gmem attributes, use to guard user
mappings
KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem
is defined
KVM: Stub in ability to disable per-VM memory attribute tracking
KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem
attributes
KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
KVM: Let userspace disable per-VM mem attributes, enable per-gmem
attributes
KVM: selftests: Create gmem fd before "regular" fd when adding memslot
KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
KVM: selftests: Add support for mmap() on guest_memfd in core library
KVM: selftests: Add helpers for calling ioctls on guest_memfd
KVM: selftests: guest_memfd: Test that shared/private status is
consistent across processes
KVM: selftests: Add selftests global for guest memory attributes
capability
KVM: selftests: Provide common function to set memory attributes
KVM: selftests: Check fd/flags provided to mmap() when setting up
memslot
KVM: selftests: Update pre-fault test to work with per-guest_memfd
attributes
KVM: selftests: Update private memory exits test work with per-gmem
attributes
Documentation/virt/kvm/api.rst | 72 ++-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/Kconfig | 15 +-
arch/x86/kvm/mmu/mmu.c | 4 +-
arch/x86/kvm/x86.c | 13 +-
include/linux/kvm_host.h | 44 +-
include/trace/events/kvm.h | 4 +-
include/uapi/linux/kvm.h | 17 +
mm/filemap.c | 1 +
mm/memcontrol.c | 2 +
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../kvm/guest_memfd_conversions_test.c | 498 ++++++++++++++++++
.../testing/selftests/kvm/include/kvm_util.h | 127 ++++-
.../testing/selftests/kvm/include/test_util.h | 29 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 128 +++--
tools/testing/selftests/kvm/lib/test_util.c | 7 -
.../selftests/kvm/pre_fault_memory_test.c | 2 +-
.../kvm/x86/private_mem_conversions_test.c | 55 +-
.../kvm/x86/private_mem_conversions_test.py | 159 ++++++
.../kvm/x86/private_mem_kvm_exits_test.c | 36 +-
virt/kvm/Kconfig | 4 +-
virt/kvm/guest_memfd.c | 414 +++++++++++++--
virt/kvm/kvm_main.c | 104 +++-
24 files changed, 1554 insertions(+), 185 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.py
--
2.51.0.858.gf9c4a03a3a-goog
Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA)
during a guest abort, today KVM directly injects an asynchronous SError
into the VCPU and then resumes it. The injected SError usually results
in an unpleasant guest kernel panic.
One of the major situations producing a guest SEA is a VCPU consuming a
recoverable uncorrected memory error (UER), which is not uncommon at
all in modern datacenter servers with large amounts of physical memory.
Although an SError and guest panic are sufficient to stop the
propagation of corrupted memory, there is room to recover from a UER in
a more graceful manner.
Proposed Solution
=================
The idea is, we can replay the SEA to the faulting VCPU. If the memory
error consumption or the fault that causes the SEA is not from the
guest kernel, the blast radius can be limited to the poison-consuming
guest process, while the VM can keep running.
In addition, instead of handling this under the hood without involving
userspace, there are benefits to redirecting the SEA to the VMM:
- VM customers care about the disruptions caused by memory errors, and
VMM usually has the responsibility to start the process of notifying
  the customers of memory error events in their VMs. For example, some
  cloud providers emit a critical log in their observability UI [1] and
  provide a playbook for customers on how to mitigate disruptions to
  their workloads.
- VMM can protect future memory error consumption by unmapping the poisoned
pages from stage-2 page table with KVM userfault [2], or by splitting the
memslot that contains the poisoned pages.
- VMM can keep track of SEA events in the VM. When VMM thinks the status
  on the host or the VM is bad enough, e.g. the number of distinct SEAs
  exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with the x86 architecture. When a machine check
  exception (MCE) is caused by a VCPU, the kernel or KVM signals SIGBUS
  to userspace to let the VMM either recover from the MCE or terminate
  itself along with the VM. The prior RFC proposed to implement SIGBUS
  on arm64 as well, but Marc preferred a KVM exit over a signal [3].
  Implementation aside, however, returning the SEA to the VMM is on par
  with returning an MCE to the VMM.
Once the SEA is redirected to the VMM, among other actions, the VMM is
encouraged to inject an external abort into the faulting VCPU.
New UAPIs
=========
This patchset introduces the following userspace-visible changes to
empower the VMM to control what happens on an SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking an SEA, if userspace has
  enabled this new capability at VM creation, and the SEA is not on
  kernel-allocated memory, return KVM_EXIT_ARM_SEA to userspace instead
  of injecting an SError.
- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. The
  details about the SEA are provided in arm_sea as much as possible,
  including the sanitized ESR value at EL2 and the faulting guest
  virtual and physical addresses if available.
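To sketch the intended VMM side (the arm_sea field names are taken from
this series' UAPI, which is not in mainline; log_sea_event() and the
recovery steps are placeholders):

  /* In the VMM's vcpu loop, after ioctl(vcpu_fd, KVM_RUN, 0): */
  switch (run->exit_reason) {
  case KVM_EXIT_ARM_SEA:
          /* Sanitized ESR plus the faulting addresses, when known. */
          log_sea_event(run->arm_sea.esr,
                        run->arm_sea.gva, run->arm_sea.gpa);
          /* e.g. unmap the poisoned page from stage-2, notify the
           * customer, then inject an external abort into the VCPU. */
          break;
  }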
* From v3 [4]
- Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
- In selftest, print a message if GVA or GPA expects to be valid.
* From v2 [5]:
- Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6]
and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI
mappings when vPE allocation is disabled")
- Took the host_owns_sea implementation from Oliver [7, 8].
- Excluded the guest SEA injection patches.
- Updated selftest.
* From v1 [9]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.…
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (3):
KVM: arm64: VM exit to userspace to handle SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
Documentation: kvm: new UAPI for handling SEA
Documentation/virt/kvm/api.rst | 61 ++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +
arch/arm64/kvm/mmu.c | 68 +++-
include/uapi/linux/kvm.h | 10 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
9 files changed, 480 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.51.0.760.g7b8bcc2412-goog
From: Jack Thomson <jackabt(a)amazon.com>
This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY
feature, which was previously only available on x86 [1]. This allows us
to reduce the number of stage-2 faults during execution. This benefits
post-copy migration scenarios, particularly memory-intensive
applications, where we see high latencies due to stage-2 faults.
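For reference, a rough usage sketch of the existing x86 UAPI that this
series wires up for arm64 (the GPA range and vcpu_fd are illustrative):

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
  {
          struct kvm_pre_fault_memory range = {
                  .gpa  = gpa,
                  .size = size,
          };

          /* KVM advances gpa and shrinks size as stage-2 is populated,
           * so loop until the whole range has been mapped. */
          while (range.size) {
                  if (ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range) < 0 &&
                      errno != EINTR)
                          break;
          }
  }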
Patch Overview:
- The first patch adds support for the KVM_PRE_FAULT_MEMORY ioctl
on arm64.
- The second patch fixes an issue with unaligned mmap allocations
in the selftests.
- The third patch updates the pre_fault_memory_test to support
arm64.
- The last patch extends the pre_fault_memory_test to cover
different vm memory backings.
=== Changes Since v1 [2] ===
Addressing feedback from Oliver:
- No pre-fault flag is passed to user_mem_abort() or gmem_abort() now
  that aborts are synthesized.
- Remove retry loop from kvm_arch_vcpu_pre_fault_memory()
[1]: https://lore.kernel.org/kvm/20240710174031.312055-1-pbonzini@redhat.com
[2]: https://lore.kernel.org/all/20250911134648.58945-1-jackabt.amazon@gmail.com
Jack Thomson (4):
KVM: arm64: Add pre_fault_memory implementation
KVM: selftests: Fix unaligned mmap allocations
KVM: selftests: Enable pre_fault_memory_test for arm64
KVM: selftests: Add option for different backing in pre-fault tests
Documentation/virt/kvm/api.rst | 3 +-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/mmu.c | 73 +++++++++++-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +-
.../selftests/kvm/pre_fault_memory_test.c | 110 +++++++++++++-----
7 files changed, 163 insertions(+), 38 deletions(-)
base-commit: 42188667be387867d2bf763d028654cbad046f7b
--
2.43.0
sched_ext tasks can be starved by long-running RT tasks, especially
since RT throttling was replaced by deadline servers, which boost only
SCHED_NORMAL tasks.
Several users in the community have reported issues with RT stalling
sched_ext tasks. This is fairly common on distributions or environments
where applications like video compositors, audio services, etc. run as RT
tasks by default.
Example trace (showing a per-CPU kthread stalled due to the sway Wayland
compositor running as an RT task):
runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
...
CPU 0 : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
curr=sway[994] class=rt_sched_class
R kworker/0:0[106377] -5043ms
scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000
cpus=01
This is often perceived as a bug in the BPF schedulers, but in reality
schedulers can't do much: RT tasks run outside their control and can
potentially consume 100% of the CPU bandwidth.
Fix this by adding a sched_ext deadline server, so that sched_ext tasks are
also boosted and do not suffer starvation.
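Conceptually, the wiring looks roughly like this (a sketch modeled on
the existing fair dl_server; rq->ext_server comes from this series and
the callback names are assumptions):

  /* Register a per-rq deadline server whose pick callback returns
   * runnable sched_ext tasks, so the deadline class reserves bandwidth
   * for them even under sustained RT load. */
  static void ext_dl_server_init(struct rq *rq)
  {
          dl_server_init(&rq->ext_server, rq,
                         ext_server_has_tasks, ext_server_pick_task);
  }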
Two kselftests are also provided to verify that the starvation fix and
the bandwidth allocation are correct.
== Highlights in this version ==
- wait for inactive_task_timer() to fire before removing the bandwidth
reservation (Juri/Peter: please check if this new
dl_server_remove_params() implementation makes sense to you)
- removed the explicit dl_server_stop() from dequeue_task_scx() and rely
on the delayed stop behavior (Juri/Peter: ditto)
This patchset is also available in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
Changes in v10:
- reordered patches to better isolate sched_ext changes vs sched/deadline
changes (Andrea Righi)
- define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
- add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
- wait for inactive_task_timer to fire before removing the bandwidth
reservation (Juri Lelli)
- remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer
reprogramming overhead (Juri Lelli)
- do not restart pick_task() when invoked by the dl_server (Tejun Heo)
- rename rq_dl_server to dl_server (Peter Zijlstra)
- fixed a missing dl_server start in dl_server_on() (Christian Loehle)
- add a comment to the rt_stall selftest to better explain the 4%
threshold (Emil Tsalapatis)
Changes in v9:
- Drop the ->balance() logic as its functionality is now integrated into
->pick_task(), allowing dl_server to call pick_task_scx() directly
- Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/
Changes in v8:
- Add tj's patch to de-couple balance and pick_task and avoid changing
sched/core callbacks to propagate @rf
- Simplify dl_se->dl_server check (suggested by PeterZ)
- Small coding style fixes in the kselftests
- Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/
Changes in v7:
- Rebased to Linus master
- Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/
Changes in v6:
- Added Acks to few patches
- Fixes to few nits suggested by Tejun
- Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/
Changes in v5:
- Added a kselftest (total_bw) to sched_ext to verify bandwidth values
from debugfs
- Address comment from Andrea about redundant rq clock invalidation
- Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/
Changes in v4:
- Fixed issues with hotplugged CPUs having their DL server bandwidth
altered due to loading SCX
- Fixed other issues
- Rebased on Linus master
- All sched_ext kselftests reliably pass now, also verified that the
total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
- Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/
Changes in v3:
- Removed code duplication in debugfs. Made ext interface separate
- Fixed issue where rq_lock_irqsave was not used in the relinquish patch
- Fixed running bw accounting issue in dl_server_remove_params
- Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
Changes in v2:
- Fixed a hang related to using rq_lock instead of rq_lock_irqsave
- Added support to remove BW of DL servers when they are switched to/from EXT
- Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Andrea Righi (5):
sched/deadline: Add support to initialize and remove dl_server bandwidth
sched_ext: Add a DL server for sched_ext tasks
sched/deadline: Account ext server bandwidth
sched_ext: Selectively enable ext and fair DL servers
selftests/sched_ext: Add test for sched_ext dl_server
Joel Fernandes (6):
sched/debug: Fix updating of ppos on server write ops
sched/debug: Stop and start server based on if it was active
sched/deadline: Clear the defer params
sched/deadline: Add a server arg to dl_server_update_idle_time()
sched/debug: Add support to change sched_ext server params
selftests/sched_ext: Add test for DL server total_bw consistency
kernel/sched/core.c | 3 +
kernel/sched/deadline.c | 169 +++++++++++---
kernel/sched/debug.c | 171 +++++++++++---
kernel/sched/ext.c | 144 +++++++++++-
kernel/sched/fair.c | 2 +-
kernel/sched/idle.c | 2 +-
kernel/sched/sched.h | 8 +-
kernel/sched/topology.c | 5 +
tools/testing/selftests/sched_ext/Makefile | 2 +
tools/testing/selftests/sched_ext/rt_stall.bpf.c | 23 ++
tools/testing/selftests/sched_ext/rt_stall.c | 222 ++++++++++++++++++
tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++++++
12 files changed, 955 insertions(+), 77 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
create mode 100644 tools/testing/selftests/sched_ext/total_bw.c