On 26.02.25 09:48, Patrick Roy wrote:
On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
On 21.02.25 17:07, Patrick Roy wrote:
Add KVM_GMEM_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD() ioctl. When set, guest_memfd folios will be removed from the direct map after preparation, with direct map entries only restored when the folios are freed.
To ensure these folios do not end up in places where the kernel cannot deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct address_space if KVM_GMEM_NO_DIRECT_MAP is requested.
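(For illustration, a rough sketch of how that could be wired up in the guest_memfd creation path; AS_NO_DIRECT_MAP is the address_space flag this series proposes, and the surrounding code is hypothetical:)

	/* in the guest_memfd inode setup, after the flags have been validated */
	if (flags & KVM_GMEM_NO_DIRECT_MAP)
		set_bit(AS_NO_DIRECT_MAP, &inode->i_mapping->flags);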
Note that this flag causes removal of direct map entries for all guest_memfd folios, independent of whether they are "shared" or "private" (although current guest_memfd only supports either all folios in the "shared" state, or all folios in the "private" state if !IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)). The use case for removing direct map entries of the shared parts of guest_memfd as well is a special type of non-CoCo VM where host userspace is trusted to have access to all of guest memory, but where Spectre-style transient execution attacks through the host kernel's direct map should still be mitigated.
Note that KVM retains access to guest memory via userspace mappings of guest_memfd, which are reflected back into KVM's memslots via userspace_addr. This is needed for things like MMIO emulation on x86_64 to work. Previous iterations instead attempted to have KVM temporarily restore direct map entries whenever such an access to guest memory was needed, but this turned out to have a significant performance impact, as well as additional complexity due to the need to refcount direct map reinsertion operations and make them play nicely with gmem truncations.
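(For illustration, a userspace sketch of this double mapping; KVM_CREATE_GUEST_MEMFD and KVM_SET_USER_MEMORY_REGION2 are existing uAPI, while mmap() of guest_memfd and the KVM_GMEM_NO_DIRECT_MAP flag assume this series:)

	struct kvm_create_guest_memfd gmem = {
		.size = size,
		.flags = KVM_GMEM_NO_DIRECT_MAP,	/* proposed by this series */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	void *uaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);

	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = size,
		/* lets KVM reach guest memory for e.g. MMIO emulation */
		.userspace_addr = (__u64)uaddr,
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);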
This iteration also doesn't have KVM perform TLB flushes after direct map manipulations. This is because TLB flushes resulted in an up to 40x elongation of page faults in guest_memfd (scaling with the number of CPU cores), or a 5x elongation of memory population. On the one hand, TLB flushes are not needed for functional correctness (the virt->phys mapping technically stays "correct", the kernel should simply not use it for a while), so this is a correct optimization to make. On the other hand, it means that the desired protection from Spectre-style attacks is not perfect, as an attacker could try to prevent a stale TLB entry from getting evicted, keeping it alive until the page it refers to is used by the guest for some sensitive data, and then targeting it with a Spectre gadget.
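(For context, a sketch of the flushing variant that was benchmarked and rejected; flushing the folio's direct-map alias like this is what drove the slowdowns quoted above:)

	struct page *page = folio_page(folio, 0);
	unsigned long start = (unsigned long)page_address(page);

	set_direct_map_valid_noflush(page, folio_nr_pages(folio), false);
	/* this flush IPIs every core, hence the up to 40x fault elongation */
	flush_tlb_kernel_range(start, start + folio_size(folio));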
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
...
+static bool kvm_gmem_test_no_direct_map(struct inode *inode)
+{
+	return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
+}
+
 static inline void kvm_gmem_mark_prepared(struct folio *folio)
 {
+	struct inode *inode = folio_inode(folio);
+
+	if (kvm_gmem_test_no_direct_map(inode)) {
+		int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+						     false);
Will this work if KVM is built as a module, or is this another good reason why we might want the guest_memfd core to be part of core-mm?
Mh, I'm admittedly not too familiar with the differences that would come from building KVM as a module vs. not. I do remember something about the direct map accessors not being available to modules, so this would indeed not work. Does that mean moving gmem into core-mm will be a prerequisite for the direct map removal stuff?
Likely, we'd need some shim.
Maybe for the time being it could be fenced using #if IS_BUILTIN() ... but that sure won't win in a beauty contest.
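(Something like the below, say; IS_BUILTIN() is the existing kconfig.h macro, the fencing itself is just a sketch:)

#if IS_BUILTIN(CONFIG_KVM)
	if (kvm_gmem_test_no_direct_map(inode))
		/* error handling as in the hunk above */
		set_direct_map_valid_noflush(folio_page(folio, 0),
					     folio_nr_pages(folio), false);
#endif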