On Mon, Dec 13, 2021, Sean Christopherson wrote:
On Mon, Dec 13, 2021, Paolo Bonzini wrote:
kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page tables, which flushes the accessed and dirty bits out to the Linux "struct page"s. Missing some of the roots has catastrophic effects, because kvm_tdp_mmu_zap_all is called when the MMU notifier is being removed and any PTEs left behind might become dangling by the time kvm_arch_destroy_vm tears down the roots for good.
Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn uses kvm_tdp_mmu_get_root to skip invalid roots. If the current root is invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain in place but will later be zapped during kvm_arch_destroy_vm.
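To make the claim above concrete, here is a toy userspace model of a "zap all" walk whose iterator skips invalid roots. It is only a sketch: struct toy_root, toy_zap_all and the skip_invalid flag are hypothetical names invented for illustration, and none of this is the kernel's actual for_each_tdp_mmu_root_yield_safe or kvm_tdp_mmu_get_root code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a TDP MMU root (not the kernel's data structure). */
struct toy_root {
    bool invalid;      /* root was invalidated before the walk started */
    int present_ptes;  /* leaf PTEs still installed under this root */
};

/*
 * Zap every root the iterator is willing to visit and return the number of
 * PTEs left behind. With skip_invalid set, invalid roots are passed over,
 * which models an iterator that refuses to take a reference on them.
 */
static int toy_zap_all(struct toy_root *roots, size_t n, bool skip_invalid)
{
    int leftover = 0;

    assert(roots != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (skip_invalid && roots[i].invalid) {
            /* Skipped: its PTEs survive the "zap all". */
            leftover += roots[i].present_ptes;
            continue;
        }
        roots[i].present_ptes = 0;
    }
    return leftover;
}
```

In this model, a walk over one valid and one invalid root with skip_invalid set leaves the invalid root's PTEs in place, and the same walk without the filter leaves none.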
As stated in the bug report thread[*], it should be impossible for the MMU notifier to be unregistered while kvm_mmu_zap_all_fast() is running.
I do believe there's a race between set_nx_huge_pages() and kvm_mmu_notifier_release(), but that would result in the use-after-free in kvm_set_pfn_dirty() tracing back to set_nx_huge_pages(), not kvm_destroy_vm(). And for that, I would much prefer we elevate mm->users while changing the NX hugepage setting.
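For readers unfamiliar with the idea of elevating mm->users, a rough userspace sketch of the pattern follows. The kernel has mmget_not_zero() and mmput() helpers for this kind of pinning, but whether the eventual patch uses them is not stated in this thread. Everything below (struct toy_mm, toy_mmget_not_zero, toy_mmput, toy_change_setting) is a hypothetical model built on C11 atomics, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of an address space with a user count. */
struct toy_mm {
    atomic_int users;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool toy_mmget_not_zero(struct toy_mm *mm)
{
    int old = atomic_load(&mm->users);

    assert(old >= 0);
    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if this was the last user. */
static bool toy_mmput(struct toy_mm *mm)
{
    return atomic_fetch_sub(&mm->users, 1) == 1;
}

/*
 * Change a setting only while holding a reference, so that teardown (which
 * in this model starts when users reaches zero) cannot run concurrently
 * with the change.
 */
static bool toy_change_setting(struct toy_mm *mm, int *setting, int value)
{
    if (!toy_mmget_not_zero(mm))
        return false;  /* already being torn down: do nothing */
    *setting = value;
    /* A real caller would run teardown here if this was the last reference. */
    toy_mmput(mm);
    return true;
}
```

The point of the pattern is that the update either runs with the address space pinned or does not run at all, so it can never overlap with the release path.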
Mwhahaha, race confirmed with a bit of hacking to force the issue. I'll get a patch out.