Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA)
during guest abort, today KVM directly injects an asynchronous SError
into the VCPU then resumes it. The injected SError usually results in
unpleasant guest kernel panic.
One of the major situation of guest SEA is when VCPU consumes recoverable
uncorrected memory error (UER), which is not uncommon at all in modern
datacenter servers with large amounts of physical memory. Although SError
and guest panic is sufficient to stop the propagation of corrupted memory,
there is room to recover from an UER in a more graceful manner.
Proposed Solution
=================
The idea is, we can replay the SEA to the faulting VCPU. If the memory
error consumption or the fault that cause SEA is not from guest kernel,
the blast radius can be limited to the poison-consuming guest process,
while the VM can keep running.
In addition, instead of doing under the hood without involving userspace,
there are benefits to redirect the SEA to VMM:
- VM customers care about the disruptions caused by memory errors, and
VMM usually has the responsibility to start the process of notifying
the customers of memory error events in their VMs. For example some
cloud provider emits a critical log in their observability UI [1], and
provides a playbook for customers on how to mitigate disruptions to
their workloads.
- VMM can protect future memory error consumption by unmapping the poisoned
pages from stage-2 page table with KVM userfault [2], or by splitting the
memslot that contains the poisoned pages.
- VMM can keep track of SEA events in the VM. When VMM thinks the status
on the host or the VM is bad enough, e.g. number of distinct SEAs
exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with x86 architecture. When machine check exception
(MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
let VMM either recover from the MCE, or terminate itself with VM.
The prior RFC proposes to implement SIGBUS on arm64 as well, but
Marc preferred KVM exit over signal [3]. However, implementation
aside, returning SEA to VMM is on par with returning MCE to VMM.
Once SEA is redirected to VMM, among other actions, VMM is encouraged
to inject external aborts into the faulting VCPU.
New UAPIs
=========
This patchset introduces following userspace-visible changes to empower
VMM to control what happens for SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking SEA, if userspace has enabled
this new capability at VM creation, and the SEA is not owned by kernel
allocated memory, instead of injecting SError, return KVM_EXIT_ARM_SEA
to userspace.
- KVM_EXIT_ARM_SEA. This is the VM exit reason VMM gets. The details
about the SEA is provided in arm_sea as much as possible, including
sanitized ESR value at EL2, faulting guest virtual and physical
addresses if available.
* From v3 [4]
- Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
- In selftest, print a message if GVA or GPA expects to be valid.
* From v2 [5]:
- Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6]
and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI
mappings when vPE allocation is disabled")
- Took the host_owns_sea implementation from Oliver [7, 8].
- Excluded the guest SEA injection patches.
- Updated selftest.
* From v1 [9]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.…
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (3):
KVM: arm64: VM exit to userspace to handle SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
Documentation: kvm: new UAPI for handling SEA
Documentation/virt/kvm/api.rst | 61 ++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/arm.c | 5 +
arch/arm64/kvm/mmu.c | 68 +++-
include/uapi/linux/kvm.h | 10 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
9 files changed, 480 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.51.0.760.g7b8bcc2412-goog
Changelog:
v9:
Added review-bys and addressed comments from Mike Rapoport and
Pratyush Yadav.
Dropped patch that moves abort/finalize to public header per Mike's
request.
Added patch from Zhu Yanjun to output errors by name.
This series appliyes against akpm's mm-unstable branch.
This series refactors the KHO framework to better support in-kernel
users like the upcoming LUO. The current design, which relies on a
notifier chain and debugfs for control, is too restrictive for direct
programmatic use.
The core of this rework is the removal of the notifier chain in favor of
a direct registration API. This decouples clients from the shutdown-time
finalization sequence, allowing them to manage their preserved state
more flexibly and at any time.
In support of this new model, this series also:
- Makes the debugfs interface optional.
- Introduces APIs to unpreserve memory and fixes a bug in the abort
path where client state was being incorrectly discarded. Note that
this is an interim step, as a more comprehensive fix is planned as
part of the stateless KHO work [1].
- Moves all KHO code into a new kernel/liveupdate/ directory to
consolidate live update components.
[1] https://lore.kernel.org/all/20251020100306.2709352-1-jasonmiu@google.com
Mike Rapoport (Microsoft) (1):
kho: drop notifiers
Pasha Tatashin (7):
kho: make debugfs interface optional
kho: add interfaces to unpreserve folios, page ranges, and vmalloc
memblock: Unpreserve memory in case of error
test_kho: Unpreserve memory in case of error
kho: don't unpreserve memory during abort
liveupdate: kho: move to kernel/liveupdate
MAINTAINERS: update KHO maintainers
Zhu Yanjun (1):
liveupdate: kho: Use %pe format specifier for error pointer printing
Documentation/core-api/kho/concepts.rst | 2 +-
MAINTAINERS | 4 +-
include/linux/kexec_handover.h | 46 +-
init/Kconfig | 2 +
kernel/Kconfig.kexec | 24 -
kernel/Makefile | 3 +-
kernel/kexec_handover_internal.h | 16 -
kernel/liveupdate/Kconfig | 39 ++
kernel/liveupdate/Makefile | 5 +
kernel/{ => liveupdate}/kexec_handover.c | 532 +++++++-----------
.../{ => liveupdate}/kexec_handover_debug.c | 0
kernel/liveupdate/kexec_handover_debugfs.c | 221 ++++++++
kernel/liveupdate/kexec_handover_internal.h | 56 ++
lib/test_kho.c | 128 +++--
mm/memblock.c | 93 +--
tools/testing/selftests/kho/vmtest.sh | 1 +
16 files changed, 690 insertions(+), 482 deletions(-)
delete mode 100644 kernel/kexec_handover_internal.h
create mode 100644 kernel/liveupdate/Kconfig
create mode 100644 kernel/liveupdate/Makefile
rename kernel/{ => liveupdate}/kexec_handover.c (80%)
rename kernel/{ => liveupdate}/kexec_handover_debug.c (100%)
create mode 100644 kernel/liveupdate/kexec_handover_debugfs.c
create mode 100644 kernel/liveupdate/kexec_handover_internal.h
base-commit: 9ef7b034116354ee75502d1849280a4d2ff98a7c
--
2.51.1.930.gacf6e81ea2-goog
Hi Linus,
Please pull this kselftest fixes update for Linux 6.18-rc6
Fixes event-filter-function.tc tracing test failure caused when a first
run to sample events triggers kmem_cache_free which interferes with the
rest of the test. Fix this calling sample_events twice to eliminate the
kmem_cache_free related noise from the sampling.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 920aa3a7705a061cb3004572d8b7932b54463dbf:
selftests: cachestat: Fix warning on declaration under label (2025-10-22 09:23:18 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.18-rc6
for you to fetch changes up to dd4adb986a86727ed8f56c48b6d0695f1e211e65:
selftests/tracing: Run sample events to clear page cache events (2025-11-10 18:00:07 -0700)
----------------------------------------------------------------
linux_kselftest-fixes-6.18-rc6
Fixes event-filter-function.tc tracing test failure caused when a first
run to sample events triggers kmem_cache_free which interferes with the
rest of the test. Fix this calling sample_events twice to eliminate the
kmem_cache_free related noise from the sampling.
----------------------------------------------------------------
Steven Rostedt (1):
selftests/tracing: Run sample events to clear page cache events
tools/testing/selftests/ftrace/test.d/filter/event-filter-function.tc | 4 ++++
1 file changed, 4 insertions(+)
----------------------------------------------------------------
The current shmem_allocate_area() implementation uses a hardcoded virtual
base address (BASE_PMD_ADDR) as a hint for mmap() when creating shmem-backed
test areas. This approach is fragile and may fail on systems with ASLR or
different virtual memory layouts, where the chosen address is unavailable.
Replace the static base address with a dynamically reserved address range
obtained via mmap(NULL, ..., PROT_NONE). The memfd-backed areas and their
alias are then mapped into that reserved region using MAP_FIXED, preserving
the original layout and aliasing semantics while avoiding collisions with
unrelated mappings.
This change improves robustness and portability of the test suite without
altering its behavior or coverage.
Suggested-by: Mike Rapoport <rppt(a)kernel.org>
Signed-off-by: Mehdi Ben Hadj Khelifa <mehdi.benhadjkhelifa(a)gmail.com>
---
Testing(Retested):
A diff between running the mm selftests on 6.18-rc5 from before and after
the change show no regression on x86_64 architecture with 32GB DDR5 RAM.
ChangeLog:
Changes from v1:
-Implemented Mike's suggestions to make cleanup code more clear.
Link:https://lore.kernel.org/all/20251111205739.420009-1-mehdi.benhadjkheli…
tools/testing/selftests/mm/uffd-common.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c
index 994fe8c03923..edd02328f77b 100644
--- a/tools/testing/selftests/mm/uffd-common.c
+++ b/tools/testing/selftests/mm/uffd-common.c
@@ -10,7 +10,6 @@
uffd_test_ops_t *uffd_test_ops;
uffd_test_case_ops_t *uffd_test_case_ops;
-#define BASE_PMD_ADDR ((void *)(1UL << 30))
/* pthread_mutex_t starts at page offset 0 */
pthread_mutex_t *area_mutex(char *area, unsigned long nr, uffd_global_test_opts_t *gopts)
@@ -142,30 +141,37 @@ static int shmem_allocate_area(uffd_global_test_opts_t *gopts, void **alloc_area
unsigned long offset = is_src ? 0 : bytes;
char *p = NULL, *p_alias = NULL;
int mem_fd = uffd_mem_fd_create(bytes * 2, false);
+ size_t region_size = bytes * 2 + hpage_size;
- /* TODO: clean this up. Use a static addr is ugly */
- p = BASE_PMD_ADDR;
- if (!is_src)
- /* src map + alias + interleaved hpages */
- p += 2 * (bytes + hpage_size);
+ void *reserve = mmap(NULL, region_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS,
+ -1, 0);
+ if (reserve == MAP_FAILED) {
+ close(mem_fd);
+ return -errno;
+ }
+
+ p = reserve;
p_alias = p;
p_alias += bytes;
p_alias += hpage_size; /* Prevent src/dst VMA merge */
- *alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
+ *alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED,
mem_fd, offset);
if (*alloc_area == MAP_FAILED) {
*alloc_area = NULL;
+ munmap(reserve, region_size);
+ close(mem_fd);
return -errno;
}
if (*alloc_area != p)
err("mmap of memfd failed at %p", p);
- area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
+ area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED,
mem_fd, offset);
if (area_alias == MAP_FAILED) {
- munmap(*alloc_area, bytes);
*alloc_area = NULL;
+ munmap(reserve, region_size);
+ close(mem_fd);
return -errno;
}
if (area_alias != p_alias)
--
2.51.2
The current shmem_allocate_area() implementation uses a hardcoded virtual
base address(BASE_PMD_ADDR) as a hint for mmap() when creating shmem-backed
test areas. This approach is fragile and may fail on systems with ASLR or
different virtual memory layouts, where the chosen address is unavailable.
Replace the static base address with a dynamically reserved address range
obtained via mmap(NULL, ..., PROT_NONE). The memfd-backed areas and their
alias are then mapped into that reserved region using MAP_FIXED, preserving
the original layout and aliasing semantics while avoiding collisions with
unrelated mappings.
This change improves robustness and portability of the test suite without
altering its behavior or coverage.
Signed-off-by: Mehdi Ben Hadj Khelifa <mehdi.benhadjkhelifa(a)gmail.com>
---
Testing:
A diff between running the mm selftests on 6.18-rc5 from before and after
the change show no regression on x86_64 architecture with 32GB DDR5 RAM.
tools/testing/selftests/mm/uffd-common.c | 25 +++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c
index 994fe8c03923..492b21c960bb 100644
--- a/tools/testing/selftests/mm/uffd-common.c
+++ b/tools/testing/selftests/mm/uffd-common.c
@@ -6,11 +6,11 @@
*/
#include "uffd-common.h"
+#include "asm-generic/mman-common.h"
uffd_test_ops_t *uffd_test_ops;
uffd_test_case_ops_t *uffd_test_case_ops;
-#define BASE_PMD_ADDR ((void *)(1UL << 30))
/* pthread_mutex_t starts at page offset 0 */
pthread_mutex_t *area_mutex(char *area, unsigned long nr, uffd_global_test_opts_t *gopts)
@@ -142,30 +142,37 @@ static int shmem_allocate_area(uffd_global_test_opts_t *gopts, void **alloc_area
unsigned long offset = is_src ? 0 : bytes;
char *p = NULL, *p_alias = NULL;
int mem_fd = uffd_mem_fd_create(bytes * 2, false);
+ size_t region_size = bytes * 2 + hpage_size;
- /* TODO: clean this up. Use a static addr is ugly */
- p = BASE_PMD_ADDR;
- if (!is_src)
- /* src map + alias + interleaved hpages */
- p += 2 * (bytes + hpage_size);
+ void *reserve = mmap(NULL, region_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS,
+ -1, 0);
+ if (reserve == MAP_FAILED) {
+ close(mem_fd);
+ return -errno;
+ }
+
+ p = (char *)reserve;
p_alias = p;
p_alias += bytes;
p_alias += hpage_size; /* Prevent src/dst VMA merge */
- *alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
+ *alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED,
mem_fd, offset);
if (*alloc_area == MAP_FAILED) {
+ munmap(reserve, region_size);
*alloc_area = NULL;
+ close(mem_fd);
return -errno;
}
if (*alloc_area != p)
err("mmap of memfd failed at %p", p);
- area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
+ area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED,
mem_fd, offset);
if (area_alias == MAP_FAILED) {
- munmap(*alloc_area, bytes);
+ munmap(reserve, region_size);
*alloc_area = NULL;
+ close(mem_fd);
return -errno;
}
if (area_alias != p_alias)
--
2.51.2
Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and
Podman) due to recursive type definitions that create reference loops
not representable in C. These recursive typedefs trigger a failure in
the BTF deduplication algorithm.
This patch extends btf_dedup_struct_types() to properly handle potential
recursion for BTF_KIND_TYPEDEF, similar to how recursion is already
handled for BTF_KIND_STRUCT. This allows pahole to successfully
generate BTF for Go binaries using recursive types without impacting
existing C-based workflows.
Changes in v3:
1. Patch 1: Adjusted the comment of btf_dedup_ref_type() to refer to
typedef as well.
2. Patch 2: Update of the "dedup: recursive typedef" test to include a
duplicated version of the types to make sure deduplication still happens
in this case.
Changes in v2:
1. Patch 1: Refactored code to prevent copying existing logic. Instead of
adding a new function we modify the existing btf_dedup_struct_type()
function to handle the BTF_KIND_TYPEDEF case. Calls to btf_hash_struct()
and btf_shallow_equal_struct() are replaced with calls to functions that
select btf_hash_struct() / btf_hash_typedef() based on the type.
2. Patch 2: Added tests
v2: https://lore.kernel.org/lkml/cover.1762956564.git.paul.houssel@orange.com/
v1: https://lore.kernel.org/lkml/20251107153408.159342-1-paulhoussel2@gmail.com/
Paul Houssel (2):
libbpf: fix BTF dedup to support recursive typedef definitions
selftests/bpf: add BTF dedup tests for recursive typedef definitions
tools/lib/bpf/btf.c | 73 +++++++++++++++-----
tools/testing/selftests/bpf/prog_tests/btf.c | 65 +++++++++++++++++
2 files changed, 121 insertions(+), 17 deletions(-)
--
2.51.0
From: Jack Thomson <jackabt(a)amazon.com>
This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY
feature, which was previously only available on x86 [1]. This allows us
to reduce the number of stage-2 faults during execution. This is of
benefit in post-copy migration scenarios, particularly in memory
intensive applications, where we are experiencing high latencies due to
the stage-2 faults.
Patch Overview:
- The first patch adds support for the KVM_PRE_FAULT_MEMORY ioctl
on arm64.
- The second patch fixes an issue with unaligned mmap allocations
in the selftests.
- The third patch updates the pre_fault_memory_test to support
arm64.
- The last patch extends the pre_fault_memory_test to cover
different vm memory backings.
=== Changes Since v1 [2] ===
Addressing feedback from Oliver:
- No pre-fault flag is passed to user_mem_abort() or gmem_abort() now
aborts are synthesized.
- Remove retry loop from kvm_arch_vcpu_pre_fault_memory()
[1]: https://lore.kernel.org/kvm/20240710174031.312055-1-pbonzini@redhat.com
[2]: https://lore.kernel.org/all/20250911134648.58945-1-jackabt.amazon@gmail.com
Jack Thomson (4):
KVM: arm64: Add pre_fault_memory implementation
KVM: selftests: Fix unaligned mmap allocations
KVM: selftests: Enable pre_fault_memory_test for arm64
KVM: selftests: Add option for different backing in pre-fault tests
Documentation/virt/kvm/api.rst | 3 +-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/mmu.c | 73 +++++++++++-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +-
.../selftests/kvm/pre_fault_memory_test.c | 110 +++++++++++++-----
7 files changed, 163 insertions(+), 38 deletions(-)
base-commit: 42188667be387867d2bf763d028654cbad046f7b
--
2.43.0