On 25.09.25 17:52, Roy, Patrick wrote:
On Thu, 2025-09-25 at 12:00 +0100, David Hildenbrand wrote:
On 24.09.25 17:22, Roy, Patrick wrote:
Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD() ioctl. When set, guest_memfd folios will be removed from the direct map after preparation, with direct map entries only restored when the folios are freed.
To ensure these folios do not end up in places where the kernel cannot deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on guest_memfd itself being supported, but also on whether Linux supports manipulating the direct map at page granularity at all (possible most of the time, outliers being arm64, where it is impossible if the direct map has been set up using hugepages, as arm64 cannot break these apart due to break-before-make semantics, and powerpc, which does not select ARCH_HAS_SET_DIRECT_MAP, though also doesn't support guest_memfd anyway).
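The dependency described above (the capability is only advertised when both guest_memfd and page-granular direct map manipulation are available) can be modelled roughly as follows. This is a standalone userspace sketch, not the kernel code: struct demo_host, the demo_* helpers and the DEMO_FLAG_* bit values are all invented for illustration and do not correspond to the real uapi values.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for this sketch only; NOT the real uapi values. */
#define DEMO_FLAG_MMAP          (UINT64_C(1) << 0)
#define DEMO_FLAG_NO_DIRECT_MAP (UINT64_C(1) << 1)

/* Hypothetical description of what a host kernel/arch combination offers. */
struct demo_host {
	bool has_guest_memfd;     /* guest_memfd itself is supported */
	bool gmem_mmap_supported; /* GUEST_MEMFD_FLAG_MMAP is supported */
	bool can_set_direct_map;  /* direct map can be edited per page */
};

/* What a capability check for the new flag would report. */
static bool demo_cap_no_direct_map(const struct demo_host *h)
{
	return h->has_guest_memfd && h->can_set_direct_map;
}

/* Set of flags that creation would accept on this host. */
static uint64_t demo_valid_flags(const struct demo_host *h)
{
	uint64_t valid = 0;

	if (!h->has_guest_memfd)
		return 0;
	if (h->gmem_mmap_supported)
		valid |= DEMO_FLAG_MMAP;
	if (demo_cap_no_direct_map(h))
		valid |= DEMO_FLAG_NO_DIRECT_MAP;
	return valid;
}

/* Reject any flag the host does not support, accept everything else. */
static int demo_check_create_flags(const struct demo_host *h, uint64_t flags)
{
	if (flags & ~demo_valid_flags(h))
		return -EINVAL;
	return 0;
}
```

In this model, a host whose direct map cannot be split (for example arm64 with a hugepage-backed direct map) neither reports the capability nor accepts the flag, while the mmap flag keeps working.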
Note that this flag causes removal of direct map entries for all guest_memfd folios independent of whether they are "shared" or "private" (although current guest_memfd only supports either all folios in the "shared" state, or all folios in the "private" state if GUEST_MEMFD_FLAG_MMAP is not set). The use case for also removing direct map entries of the shared parts of guest_memfd is a special type of non-CoCo VM where host userspace is trusted to have access to all of guest memory, but where Spectre-style transient execution attacks through the host kernel's direct map should still be mitigated. In this setup, KVM retains access to guest memory via userspace mappings of guest_memfd, which are reflected back into KVM's memslots via userspace_addr. This is needed for things like MMIO emulation on x86_64 to work.
Direct map entries are zapped right before guest or userspace mappings of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where a gmem folio can be allocated without being mapped anywhere is kvm_gmem_populate(), where handling potential failures of direct map removal is not possible (by the time direct map removal is attempted, the folio is already marked as prepared, meaning attempting to re-try kvm_gmem_populate() would just result in -EEXIST without fixing up the direct map state). These folios are then removed from the direct map upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
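The ordering described above (populate leaves the folio in the direct map, the first mapping zaps it, freeing restores it) can be pictured with a toy state model. This is only a sketch under simplifying assumptions: struct toy_folio and the toy_* functions are invented names, real folios carry no such booleans, and error handling is left out entirely.

```c
#include <stdbool.h>

/* Toy stand-in for a guest_memfd folio; not the kernel's struct folio. */
struct toy_folio {
	bool prepared;      /* folio contents have been prepared */
	bool in_direct_map; /* kernel direct map entry is present */
};

/* Idempotent: zapping an already-zapped folio changes nothing. */
static void toy_zap_direct_map(struct toy_folio *f)
{
	f->in_direct_map = false;
}

/*
 * Models the kvm_gmem_populate() path: the folio becomes prepared, but
 * the direct map entry is left alone because a failure could not be
 * handled at this point.
 */
static void toy_populate(struct toy_folio *f)
{
	f->prepared = true;
}

/*
 * Models the kvm_gmem_get_pfn() / kvm_gmem_fault_user_mapping() paths:
 * the zap happens right before the folio is mapped anywhere.
 */
static void toy_map(struct toy_folio *f)
{
	f->prepared = true;
	toy_zap_direct_map(f);
}

/* Models freeing: the direct map entry is restored. */
static void toy_free(struct toy_folio *f)
{
	f->prepared = false;
	f->in_direct_map = true;
}
```

A populated folio therefore stays reachable through the direct map until it is first mapped into the guest or userspace, which is the window the commit message accepts.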
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
 Documentation/virt/kvm/api.rst    |  5 +++
 arch/arm64/include/asm/kvm_host.h | 12 ++++++
 include/linux/kvm_host.h          |  6 +++
 include/uapi/linux/kvm.h          |  2 +
 virt/kvm/guest_memfd.c            | 61 ++++++++++++++++++++++++++++++-
 virt/kvm/kvm_main.c               |  5 +++
 6 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c17a87a0a5ac..b52c14d58798 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
 supports GUEST_MEMFD_FLAG_MMAP. Setting this flag on guest_memfd creation
 enables mmap() and faulting of guest_memfd memory to host userspace.
+When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
+supports GUEST_MEMFG_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
+instance behave similarly to memfd_secret, and unmaps the memory backing it from
+the kernel's address space after allocation.
Do we want to document what the implication of that is? Meaning, limitations etc. I recall that we would need the user mapping for gmem slots to be properly set up.
Is that still the case in this patch set?
The ->userspace_addr thing is the general requirement for non-CoCo VMs, and not specific for direct map removal (e.g. I expect direct map removal to just work out of the box for CoCo setups, where KVM already cannot access guest memory, ignoring the question of whether direct map removal is even useful for CoCo VMs). So I don't think it should be documented as part of KVM_CAP_GMEM_NO_DIRECT_MAP/GUEST_MEMFG_FLAG_NO_DIRECT_MAP (heh, there's a typo I just noticed.
Okay, I was rather wondering whether this will be the first patch set where it is actually required to be set. In the basic mmap series, I am not sure yet if we really depend on it (IIRC we did document it, but do no sanity checks etc.).
"MEMFG". Also "GMEM" needs to be "GUEST_MEMFD".
Will fix that), but rather as part of GUEST_MEMFD_FLAG_MMAP. I can add a patch for it there (or maybe send it separately, since FLAG_MMAP is already in -next?).
Yes, it's in kvm/next and will go upstream soon.
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
 guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
 consumed from guest_memfd, regardless of whether it is a shared or a private
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2f2394cce24e..0bfd8e5fd9de 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/maple_tree.h>
 #include <linux/percpu.h>
 #include <linux/psci.h>
+#include <linux/set_memory.h>
 #include <asm/arch_gicv3.h>
 #include <asm/barrier.h>
 #include <asm/cpufeature.h>
@@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
 void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
 void check_feature_map(void);
+#ifdef CONFIG_KVM_GUEST_MEMFD
+static inline bool kvm_arch_gmem_supports_no_direct_map(void)
+{
+	/*
+	 * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
+	 * as it calls dcache_clean_inval_poc().
+	 */
+	return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
+}
+#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
+#endif /* CONFIG_KVM_GUEST_MEMFD */
I strongly assume that the aarch64 support should be moved to a separate patch -- if possible, see below.
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1d0585616aa3..73a15cade54a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
 #endif
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
+#endif
Hm, wouldn't it be better to have an opt-in per arch, and really only unlock the ones we know work (tested etc.), explicitly in separate patches?
Ack, can definitely do that. Something like
#ifndef kvm_arch_gmem_supports_no_direct_map
static inline bool kvm_arch_gmem_supports_no_direct_map(void)
{
	return false;
}
#endif
and then actual definitions (in separate patches) in the arm64 and x86 headers?
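For readers unfamiliar with the idiom, the opt-in scheme proposed above relies on an arch header defining both the function and a same-named macro, so that the generic header's #ifndef can detect the override. A minimal standalone sketch of the pattern follows; DEMO_ARCH_OPTS_IN and demo_no_direct_map_flag_allowed() are hypothetical names used only here, and the arch-side body is a placeholder rather than a real check.

```c
#include <stdbool.h>

/*
 * Stand-in for an arch header. Building with -DDEMO_ARCH_OPTS_IN
 * (a made-up switch for this sketch) simulates an arch that opts in.
 */
#ifdef DEMO_ARCH_OPTS_IN
static inline bool kvm_arch_gmem_supports_no_direct_map(void)
{
	/* Placeholder; a real arch would check e.g. can_set_direct_map(). */
	return true;
}
/* The same-named macro makes the override visible to #ifndef below. */
#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
#endif

/* Stand-in for the generic header: default to "not supported". */
#ifndef kvm_arch_gmem_supports_no_direct_map
static inline bool kvm_arch_gmem_supports_no_direct_map(void)
{
	return false;
}
#endif

/* Hypothetical caller that gates the flag on the arch hook. */
static inline bool demo_no_direct_map_flag_allowed(void)
{
	return kvm_arch_gmem_supports_no_direct_map();
}
```

Built without DEMO_ARCH_OPTS_IN, the hook falls back to the generic "false" default, which is the behaviour architectures would get until a dedicated patch opts them in.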
On a related note, maybe PATCH 2 should only export set_direct_map_valid_noflush() for the architectures on which we actually need it? Which would only be x86, since arm64 doesn't allow building KVM as a module, and nothing else supports guest_memfd right now.
Yes, that's probably best. Could be done in the same arch patch then.