Since 5.16 and prior to 6.13 KVM can't be used with FSDAX guest memory
(PMD pages). To reproduce the issue, reserve guest memory with the
`memmap=` kernel cmdline parameter, then create and mount a filesystem
in DAX mode (tested with both XFS and ext4), see the doc link below.
ndctl command used for the test:

  ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M

Then pass the memory object to QEMU like:

  -m 8G -object memory-backend-file,id=ram0,size=8G,\
  mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \
  -numa node,memdev=ram0,cpus=0-1

QEMU fails to run the guest with the error:

  kvm run failed Bad address

and there are two warnings in dmesg:

  WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page()
  WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63)
It looks like in the past an assumption was made that the pfn would not
change between faultin_pfn() and release_pfn_clean(), see e.g. commit
4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level").
But the kvm_page_fault structure made pfn part of mutable state, so
release_pfn_clean() can now be handed a hugepage-adjusted pfn. That
works for all tested backends (/dev/shm, hugetlb, devdax) except fsdax:
apparently in fsdax mode the faultin pfn and the adjusted pfn may refer
to different folios, so we get a get_page()/put_page() imbalance.
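To make the adjustment step concrete, here is a minimal standalone
sketch of the alignment arithmetic (the macro name and the sample pfn
are made up for illustration, not taken from kernel headers): for a 2M
mapping the faulted-in pfn is masked down to a 512-page boundary, so
the pfn that later gets released can sit many base pages away from the
one that was actually refcounted.

#include <stdio.h>
#include <stdint.h>

/* 2M huge page divided by 4K base pages; illustrative constant name. */
#define PAGES_PER_2M_HPAGE 512ULL

int main(void)
{
	/* Example pfn as returned at fault-in time. */
	uint64_t faultin_pfn  = 0x140235;
	/* Mask down to the 2M-aligned boundary, as hugepage adjustment does. */
	uint64_t adjusted_pfn = faultin_pfn & ~(PAGES_PER_2M_HPAGE - 1);

	printf("faultin pfn:  %#llx\n", (unsigned long long)faultin_pfn);
	printf("adjusted pfn: %#llx (%llu pages lower)\n",
	       (unsigned long long)adjusted_pfn,
	       (unsigned long long)(faultin_pfn - adjusted_pfn));
	return 0;
}

With the example value above the two pfns differ by 0x35 pages, and on
fsdax they need not even belong to the same folio.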
To solve this, preserve the faultin pfn in a separate kvm_page_fault
field and pass it to kvm_release_pfn_clean(). The patch was tested with
all the guest memory backends mentioned above, with tdp_mmu={0,1}.
There is no bug upstream, since it was solved fundamentally by commit
8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly
releasing pfns") and the related patch series.
Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html
Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault")
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
---
 arch/x86/kvm/mmu/mmu.c          | 5 +++--
 arch/x86/kvm/mmu/mmu_internal.h | 2 ++
 arch/x86/kvm/mmu/paging_tmpl.h  | 2 +-
 3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 294775b7383b..2105f3bc2e59 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4321,6 +4321,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	smp_rmb();

 	ret = __kvm_faultin_pfn(vcpu, fault);
+	fault->faultin_pfn = fault->pfn;
 	if (ret != RET_PF_CONTINUE)
 		return ret;
@@ -4398,7 +4399,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	kvm_release_pfn_clean(fault->faultin_pfn);
 	return r;
 }
@@ -4474,7 +4475,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 out_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	kvm_release_pfn_clean(fault->faultin_pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index decc1f153669..a016b51f9c62 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -236,6 +236,8 @@ struct kvm_page_fault {
 	/* Outputs of kvm_faultin_pfn. */
 	unsigned long mmu_seq;
 	kvm_pfn_t pfn;
+	/* pfn copy for kvm_release_pfn_clean(), constant after kvm_faultin_pfn() */
+	kvm_pfn_t faultin_pfn;
 	hva_t hva;
 	bool map_writable;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..b945dde6e3be 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -848,7 +848,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	kvm_release_pfn_clean(fault->faultin_pfn);
 	return r;
 }
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: The upstream commit ID must be specified with a separate line above the commit text.
Subject: [PATCH 6.6] KVM: x86/mmu: Ensure that kvm_release_pfn_clean() takes exact pfn from kvm_faultin_pfn()
Link: https://lore.kernel.org/stable/20241130132702.1974626-1-kniv%40yandex-team.r...
Please ignore this mail if the patch is not relevant for upstream.
On Sat, Nov 30, 2024, Nikolay Kuratov wrote:
> Since 5.16 and prior to 6.13 KVM can't be used with FSDAX guest memory
> (PMD pages). To reproduce the issue, reserve guest memory with the
> `memmap=` kernel cmdline parameter, then create and mount a filesystem
> in DAX mode (tested with both XFS and ext4), see the doc link below.
> ndctl command used for the test:
>
>   ndctl create-namespace -v -e namespace1.0 --map=dev --mode=fsdax -a 2M
>
> Then pass the memory object to QEMU like:
>
>   -m 8G -object memory-backend-file,id=ram0,size=8G,\
>   mem-path=/mnt/pmem/guestmem,share=on,prealloc=on,dump=off,align=2097152 \
>   -numa node,memdev=ram0,cpus=0-1
>
> QEMU fails to run the guest with the error:
>
>   kvm run failed Bad address
>
> and there are two warnings in dmesg:
>
>   WARN_ON_ONCE(!page_count(page)) in kvm_is_zone_device_page()
>   WARN_ON_ONCE(folio_ref_count(folio) <= 0) in try_grab_folio() (v6.6.63)
> It looks like in the past an assumption was made that the pfn would not
> change between faultin_pfn() and release_pfn_clean(), see e.g. commit
> 4cd071d13c5c ("KVM: x86/mmu: Move calls to thp_adjust() down a level").
> But the kvm_page_fault structure made pfn part of mutable state, so
> release_pfn_clean() can now be handed a hugepage-adjusted pfn. That
> works for all tested backends (/dev/shm, hugetlb, devdax) except fsdax:
> apparently in fsdax mode the faultin pfn and the adjusted pfn may refer
> to different folios, so we get a get_page()/put_page() imbalance.
Oof. Yeah, KVM definitely doesn't expect huge page mappings to straddle multiple struct pages.
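To spell out the resulting imbalance (a toy userspace model, not kernel
code; the two counters simply stand in for the refcounts of the two
folios involved): the reference is taken on the folio backing the
faulted-in pfn but dropped on the folio backing the adjusted pfn, so one
folio leaks a reference while the other gives up one it never took,
which is what the folio_ref_count(folio) <= 0 warning later trips over.

#include <stdio.h>

int main(void)
{
	/* Pretend each folio starts with one reference held elsewhere. */
	int faultin_folio_refs  = 1;	/* folio containing the faulted-in pfn */
	int adjusted_folio_refs = 1;	/* different folio after 2M alignment  */

	faultin_folio_refs++;		/* reference taken at fault-in time */
	adjusted_folio_refs--;		/* release performed on the adjusted pfn */

	printf("faultin folio refs:  %d (reference leaked)\n", faultin_folio_refs);
	printf("adjusted folio refs: %d (dropped a reference it never took)\n",
	       adjusted_folio_refs);
	return 0;
}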
> To solve this, preserve the faultin pfn in a separate kvm_page_fault
> field and pass it to kvm_release_pfn_clean(). The patch was tested with
> all the guest memory backends mentioned above, with tdp_mmu={0,1}.
> There is no bug upstream, since it was solved fundamentally by commit
> 8dd861cc07e2 ("KVM: x86/mmu: Put refcounted pages instead of blindly
> releasing pfns") and the related patch series.
> Link: https://nvdimm.docs.kernel.org/2mib_fs_dax.html
> Fixes: 2f6305dd5676 ("KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault")
> Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 5 +++--
>  arch/x86/kvm/mmu/mmu_internal.h | 2 ++
>  arch/x86/kvm/mmu/paging_tmpl.h  | 2 +-
>  3 files changed, 6 insertions(+), 3 deletions(-)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 294775b7383b..2105f3bc2e59 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4321,6 +4321,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	smp_rmb();
>
>  	ret = __kvm_faultin_pfn(vcpu, fault);
> +	fault->faultin_pfn = fault->pfn;
This fix should go to 6.12 and 6.1 as well, and so I'm leaning toward using a local variable to snapshot the original PFN, even though (a) it's more churn and (b) it's theoretically less robust.
6.12 has an unaffected kvm_release_pfn_clean(fault->pfn). Using a local variable means the same fix+code can live in all relevant LTS kernels without creating weirdness (leaving behind the unaffected call). For me, that's worth the churn.
As for it potentially being less robust: if something modifies fault->pfn between kvm_faultin_pfn() and acquiring mmu_lock, KVM has far bigger problems.
(untested)
---
 arch/x86/kvm/mmu/mmu.c         | 10 ++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h |  5 ++++-
 2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e853a5fc867..f5a7d89ea68d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4580,6 +4580,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
+	kvm_pfn_t orig_pfn;
 	int r;
 	/* Dummy roots are used only for shadowing bad guest roots. */
@@ -4601,6 +4602,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (r != RET_PF_CONTINUE)
 		return r;
+	orig_pfn = fault->pfn;
+
 	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
@@ -4615,7 +4618,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	kvm_release_pfn_clean(orig_pfn);
 	return r;
 }
@@ -4675,6 +4678,7 @@ EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 				  struct kvm_page_fault *fault)
 {
+	kvm_pfn_t orig_pfn;
 	int r;
 	if (page_fault_handle_page_track(vcpu, fault))
@@ -4692,6 +4696,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	if (r != RET_PF_CONTINUE)
 		return r;
+	orig_pfn = fault->pfn;
+
 	r = RET_PF_RETRY;
 	read_lock(&vcpu->kvm->mmu_lock);
@@ -4702,7 +4708,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 out_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	kvm_release_pfn_clean(orig_pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index ae7d39ff2d07..b08017683920 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -778,6 +778,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct guest_walker walker;
+	kvm_pfn_t orig_pfn;
 	int r;
 	WARN_ON_ONCE(fault->is_tdp);
@@ -836,6 +837,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		walker.pte_access &= ~ACC_EXEC_MASK;
 	}
+	orig_pfn = fault->pfn;
+
 	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
@@ -849,7 +852,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	kvm_release_pfn_clean(orig_pfn);
 	return r;
 }
base-commit: adc218676eef25575469234709c2d87185ca223a
--
[ Sasha's backport helper bot ]
Hi,
No upstream commit was identified. Using temporary commit for testing.
Results of testing on various branches:
| Branch                    | Patch Apply | Build Test |
|---------------------------|-------------|------------|
| stable/linux-6.6.y        | Success     | Success    |