On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
On 4/24/2025 12:25 PM, Yan Zhao wrote:
On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
Yan Zhao <yan.y.zhao@intel.com> writes:
On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
+/*
+ * Allocates and then caches a folio in the filemap. Returns a folio with
+ * refcount of 2: 1 after allocation, and 1 taken by the filemap.
+ */
+static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
+                                                            pgoff_t index)
+{
+        struct kvm_gmem_hugetlb *hgmem;
+        pgoff_t aligned_index;
+        struct folio *folio;
+        int nr_pages;
+        int ret;
+
+        hgmem = kvm_gmem_hgmem(inode);
+        folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+        if (IS_ERR(folio))
+                return folio;
+
+        nr_pages = 1UL << huge_page_order(hgmem->h);
+        aligned_index = round_down(index, nr_pages);
Maybe a gap here.
When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the corresponding GFN is not 2M/1G aligned.
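To make the gap concrete, here is a minimal user-space sketch (not part of the patch) of the index-to-GFN arithmetic, where index = gfn - slot->base_gfn + slot->gmem.pgoff; the slot values are hypothetical:

#include <stdio.h>

int main(void)
{
        unsigned long base_gfn = 2;     /* hypothetical slot->base_gfn: GPA 8KB, not 2M-aligned */
        unsigned long gmem_pgoff = 0;   /* hypothetical slot->gmem.pgoff */
        unsigned long index = 512;      /* a 2M-aligned guest_memfd index (512 * 4KB = 2MB) */

        /* invert index = gfn - slot->base_gfn + slot->gmem.pgoff */
        unsigned long gfn = base_gfn + (index - gmem_pgoff);

        printf("index %lu 2M-aligned? %d\n", index, index % 512 == 0);  /* prints 1 */
        printf("gfn   %lu 2M-aligned? %d\n", gfn, gfn % 512 == 0);      /* prints 0 */
        return 0;
}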
Thanks for looking into this.
In 1G page support for guest_memfd, the offset and size are always aligned to the hugepage size requested at guest_memfd creation time, and it is true that when binding to a memslot, slot->base_gfn and slot->npages may not be hugepage aligned.
However, TDX requires that private huge pages be 2M aligned in GFN.
IIUC other factors also contribute to determining the mapping level in the guest page tables, like lpage_info and .private_max_mapping_level() in kvm_x86_ops.
If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info will track that and not allow faulting into guest page tables at higher granularity.
lpage_info only checks the alignments of slot->base_gfn and slot->base_gfn + npages. e.g.,
if slot->base_gfn is 8K, npages is 8M, then for this slot,
lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
Should it be:
lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
Right. Good catch. Thanks!
Let me update the example as below: slot->base_gfn is 2 (for GPA 8KB), npages is 2048 (for an 8MB range)
lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
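For reference, a small user-space sketch (illustrative only, mirroring rather than copying the boundary checks KVM performs when populating lpage_info at memslot creation) reproduces this breakdown for the example slot:

#include <stdio.h>

#define PAGES_PER_2M    512UL   /* 4KB pages per 2MB hugepage */

int main(void)
{
        unsigned long base_gfn = 2;     /* GPA 8KB */
        unsigned long npages = 2048;    /* an 8MB slot */
        unsigned long end_gfn = base_gfn + npages;
        /* number of 2M-level lpage_info entries covering [base_gfn, end_gfn) */
        unsigned long nr = (end_gfn - 1) / PAGES_PER_2M - base_gfn / PAGES_PER_2M + 1;
        unsigned long i;

        for (i = 0; i < nr; i++) {
                int disallow = 0;

                /* only the first/last entries can be disallowed, when a slot boundary is not 2M-aligned */
                if (i == 0 && (base_gfn & (PAGES_PER_2M - 1)))
                        disallow = 1;
                if (i == nr - 1 && (end_gfn & (PAGES_PER_2M - 1)))
                        disallow = 1;

                printf("lpage_info[2M][%lu].disallow_lpage = %d\n", i, disallow);
        }
        return 0;
}

It prints disallow_lpage = 1 only for the first and last 2M-level entries, matching the list above.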
lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two different 2MB folios, whose physical addresses may not be contiguous.
Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB, KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB). However, guest_memfd just allocates the same 2MB folio for both faults.
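The folio selection in the two cases above can be reproduced with a short sketch of the aligned_index arithmetic, again using the hypothetical slot (base_gfn = 2, gmem.pgoff = 0) and 2MB hugepages:

#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGES_PER_2M    512UL

/* hypothetical slot from the example above: base_gfn = 2 (GPA 8KB), gmem.pgoff = 0 */
static unsigned long gpa_to_aligned_index(unsigned long gpa)
{
        unsigned long base_gfn = 2, pgoff = 0;
        unsigned long gfn = gpa >> PAGE_SHIFT;
        unsigned long index = gfn - base_gfn + pgoff;

        return index & ~(PAGES_PER_2M - 1);     /* round_down(index, 512) */
}

int main(void)
{
        /* GPA 2MB+8KB and GPA 4MB fall into the same 2MB folio (index 512)... */
        printf("GPA 2M+8K  -> folio at index %lu\n", gpa_to_aligned_index((2UL << 20) + 8192));
        printf("GPA 4M     -> folio at index %lu\n", gpa_to_aligned_index(4UL << 20));
        /* ...while GPA 4MB and GPA 4MB+16KB land in two different 2MB folios (512 vs 1024). */
        printf("GPA 4M+16K -> folio at index %lu\n", gpa_to_aligned_index((4UL << 20) + 16384));
        return 0;
}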
|      |    |       |    |       |    |       |    |
8K     2M   2M+8K   4M   4M+8K   6M   6M+8K   8M   8M+8K
For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge page is allowed. Also, they have the same aligned_index 2 in guest_memfd. So, guest_memfd allocates the same huge folio of 2M order for them.
Sorry, sent too fast this morning. The example is not right. The correct one is:
For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So, KVM will create a 2M mapping for them.
However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the same 2M folio, and their physical addresses may not be contiguous.
However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio. It's also weird for a 2M mapping in KVM to straddle 2 huge folios.
Hence I think it is okay to leave it to KVM to fault pages into the guest correctly. guest_memfd will just maintain the invariant that offset and size are hugepage aligned, but not require that slot->base_gfn and slot->npages be hugepage aligned. This behavior is consistent with other backing memory for guests, like regular shmem or HugeTLB.
+        ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
+                                                 aligned_index,
+                                                 htlb_alloc_mask(hgmem->h));
+        WARN_ON(ret);
+
+        spin_lock(&inode->i_lock);
+        inode->i_blocks += blocks_per_huge_page(hgmem->h);
+        spin_unlock(&inode->i_lock);
+
-        return page_folio(requested_page);
+        return folio;
+}