This is a backport of the CR0.WP KVM series[1] to Linux v5.15. It differs from the v6.1 backport in needing additional prerequisite patches from Lai Jiangshan (and fixes for those) to ensure the assumption that it's safe to let CR0.WP be a guest-owned bit still stands.
I used 'ssdd 10 50000' from rt-tests[2] as a micro-benchmark, running on a grsecurity L1 VM. The table below shows the results (runtime in seconds, lower is better):
                 legacy     TDP  shadow
Linux v5.15.106   9.94s   66.1s   64.9s
      + patches   4.81s   4.79s   64.6s
It's interesting to see that using the TDP MMU is even slower than shadow paging on a vanilla kernel, making the impact of this backport even more significant.
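For reference, that's a ~13.8x speedup for the TDP MMU case (66.1s -> 4.79s) and ~2.1x for legacy paging (9.94s -> 4.81s), while shadow paging is, as expected, unaffected.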
The KVM unit test suite showed no regressions.
Please consider applying.
Thanks,
Mathias
[1] https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
[2] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Lai Jiangshan (3):
  KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0
  KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE
  KVM: x86/mmu: Reconstruct shadow page root if the guest PDPTEs is changed

Mathias Krause (3):
  KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
  KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
  KVM: VMX: Make CR0.WP a guest owned bit

Paolo Bonzini (1):
  KVM: x86/mmu: Avoid indirect call for get_cr3

Sean Christopherson (1):
  KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults
 arch/x86/kvm/kvm_cache_regs.h  |  2 +-
 arch/x86/kvm/mmu.h             | 42 ++++++++++++++++++++++++++++++----
 arch/x86/kvm/mmu/mmu.c         | 27 +++++++++++++++++-----
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 arch/x86/kvm/pmu.c             |  4 ++--
 arch/x86/kvm/vmx/nested.c      |  4 ++--
 arch/x86/kvm/vmx/vmx.c         |  6 ++---
 arch/x86/kvm/vmx/vmx.h         | 18 +++++++++++++++
 arch/x86/kvm/x86.c             | 27 +++++++++++++++++++---
 9 files changed, 110 insertions(+), 22 deletions(-)
From: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 2fdcc1b324189b5fb20655baebd40cd82e2bdf0c ]
Most of the time, calls to get_guest_pgd result in calling kvm_read_cr3 (the exception is only nested TDP). Hardcode the default instead of using the get_cr3 function pointer, avoiding a retpoline when retpolines are enabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net> # backport to v5.15.x
---
 arch/x86/kvm/mmu.h             | 11 +++++++++++
 arch/x86/kvm/mmu/mmu.c         | 12 ++++++------
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 arch/x86/kvm/x86.c             |  2 +-
 4 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7bb165c23233..baafc5d8bb9e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -115,6 +115,17 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 					      vcpu->arch.mmu->shadow_root_level);
 }
 
+unsigned long get_guest_cr3(struct kvm_vcpu *vcpu);
+
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+						  struct kvm_mmu *mmu)
+{
+	if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+		return kvm_read_cr3(vcpu);
+
+	return mmu->get_guest_pgd(vcpu);
+}
+
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		       bool prefault);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4724289c8a7f..7c3b809f24b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3472,7 +3472,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	unsigned i;
 	int r;
 
-	root_pgd = mmu->get_guest_pgd(vcpu);
+	root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
 	root_gfn = root_pgd >> PAGE_SHIFT;
 
 	if (mmu_check_root(vcpu, root_gfn))
@@ -3899,7 +3899,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	arch.token = alloc_apf_token(vcpu);
 	arch.gfn = gfn;
 	arch.direct_map = vcpu->arch.mmu->direct_map;
-	arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
+	arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
 
 	return kvm_setup_async_pf(vcpu, cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
@@ -4203,7 +4203,7 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
 
-static unsigned long get_cr3(struct kvm_vcpu *vcpu)
+unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
 {
 	return kvm_read_cr3(vcpu);
 }
@@ -4756,7 +4756,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	context->invlpg = NULL;
 	context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
 	context->direct_map = true;
-	context->get_guest_pgd = get_cr3;
+	context->get_guest_pgd = get_guest_cr3;
 	context->get_pdptr = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
 	context->root_level = role_regs_to_root_level(&regs);
@@ -4933,7 +4933,7 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
 
 	kvm_init_shadow_mmu(vcpu, &regs);
 
-	context->get_guest_pgd = get_cr3;
+	context->get_guest_pgd = get_guest_cr3;
 	context->get_pdptr = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
 }
@@ -4965,7 +4965,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 		return;
 
 	g_context->mmu_role.as_u64 = new_role.as_u64;
-	g_context->get_guest_pgd = get_cr3;
+	g_context->get_guest_pgd = get_guest_cr3;
 	g_context->get_pdptr = kvm_pdptr_read;
 	g_context->inject_page_fault = kvm_inject_page_fault;
 	g_context->root_level = new_role.base.level;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index a1811f51eda9..dbb54f65d1df 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -359,7 +359,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	walker->level = mmu->root_level;
-	pte           = mmu->get_guest_pgd(vcpu);
+	pte           = kvm_mmu_get_guest_pgd(vcpu, mmu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
 #if PTTYPE == 64
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5cb4af42ba64..018f6a394d44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12155,7 +12155,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 		return;
 
 	if (!vcpu->arch.mmu->direct_map &&
-	    work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+	    work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
 		return;
 
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
[ Upstream commit 01b31714bd90be2784f7145bf93b7f78f3d081e1 ]
There is no need to unload the MMU roots with TDP enabled when only CR0.WP has changed -- the paging structures are still valid, only the permission bitmap needs to be updated.
One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to implement kernel W^X.
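In rough terms, such a feature wraps writes to otherwise read-only kernel memory in a CR0.WP toggle, along the lines of the following sketch (illustrative only -- the names are made up, and mainline's write_cr0() nowadays pins WP, so a real implementation needs its own primitive, as grsecurity's does):

  /*
   * Illustrative sketch of a CR0.WP-based kernel W^X write: drop write
   * protection, perform the write, restore write protection.  When CR0.WP
   * is intercepted by the hypervisor, both CR0 writes trigger a VMEXIT,
   * which is the overhead this series eliminates.
   */
  static void wx_write(unsigned long *dst, unsigned long val)
  {
  	unsigned long cr0 = read_cr0();

  	write_cr0(cr0 & ~X86_CR0_WP);	/* allow writes to r/o memory */
  	WRITE_ONCE(*dst, val);
  	write_cr0(cr0);			/* restore write protection */
  }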
The optimization brings a huge performance gain for this case as the following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM shows (runtime in seconds, lower is better):
                     legacy     TDP  shadow
kvm-x86/next@d8708b   8.43s   9.45s   70.3s
             +patch   5.39s   5.63s   70.2s
For the legacy MMU this is ~36% faster, for the TDP MMU even ~40% faster. TDP and legacy MMU now also have similar runtimes, which eliminates the need to disable the TDP MMU for grsecurity.
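(Derived from the table above: legacy (8.43 - 5.39) / 8.43 ≈ 36%, TDP (9.45 - 5.63) / 9.45 ≈ 40%.)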
Shadow MMU sees no measurable difference and is still slow, as expected.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/x86.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 018f6a394d44..27900d4017a7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -878,6 +878,18 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
+	/*
+	 * CR0.WP is incorporated into the MMU role, but only for non-nested,
+	 * indirect shadow MMUs.  If TDP is enabled, the MMU's metadata needs
+	 * to be updated, e.g. so that emulating guest translations does the
+	 * right thing, but there's no need to unload the root as CR0.WP
+	 * doesn't affect SPTEs.
+	 */
+	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
+		kvm_init_mmu(vcpu);
+		return;
+	}
+
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
[ Upstream commit 74cdc836919bf34684ef66f995273f35e2189daf ]
Make use of the kvm_read_cr{0,4}_bits() helper functions when we only want to know the state of certain bits instead of the whole register.
This not only makes the intent cleaner, it also avoids a potential VMREAD in case the tested bits aren't guest owned.
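For reference, the CR0 helper in arch/x86/kvm/kvm_cache_regs.h reads from the register cache and only invokes the vendor callback -- a VMREAD on VMX -- when one of the requested bits is actually guest owned and the cache is stale; it looks roughly like this (quoted from memory, so details may differ slightly):

  static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
  {
  	ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;

  	/* Only guest-owned bits can be stale in vcpu->arch.cr0 ... */
  	if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
  	    !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
  		static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);

  	/* ... everything else is served from the cache directly. */
  	return vcpu->arch.cr0 & mask;
  }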
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-5-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net> # backport to v5.15.x
---
 arch/x86/kvm/pmu.c     | 4 ++--
 arch/x86/kvm/vmx/vmx.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 62333f9756a3..5c2b9ff8e014 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -366,9 +366,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	if (!pmc)
 		return 1;
 
-	if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
+	if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
 	    (static_call(kvm_x86_get_cpl)(vcpu) != 0) &&
-	    (kvm_read_cr0(vcpu) & X86_CR0_PE))
+	    (kvm_read_cr0_bits(vcpu, X86_CR0_PE)))
 		return 1;
 
 	*data = pmc_read_counter(pmc) & mask;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c95c3675e8d5..566367409598 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5128,7 +5128,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
 			break;
 		case 3: /* lmsw */
 			val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f;
-			trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val);
+			trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val));
 			kvm_lmsw(vcpu, val);
 
 			return kvm_skip_emulated_instruction(vcpu);
@@ -7149,7 +7149,7 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 		goto exit;
 	}
 
-	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
+	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		ipat = VMX_EPT_IPAT_BIT;
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 			cache = MTRR_TYPE_WRBACK;
[ Upstream commit fb509f76acc8d42bed11bca308404f81c2be856a ]
Guests like grsecurity that make heavy use of CR0.WP to implement kernel level W^X will suffer from the implied VMEXITs.
With EPT there is no need to intercept a guest change of CR0.WP, so simply make it a guest-owned bit if we can do so.

This implies that a read of a guest's CR0.WP bit might need a VMREAD. However, the only potentially affected user seems to be kvm_init_mmu(), which is a heavy operation to begin with. Moreover, most callers already cache the full value of CR0 anyway, so no additional VMREAD is needed. The only exception is nested_vmx_load_cr3().

This change is VMX-specific, as SVM has no such fine-grained control register intercept control.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-7-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net> # backport to v5.15.x
---
 arch/x86/kvm/kvm_cache_regs.h |  2 +-
 arch/x86/kvm/vmx/nested.c     |  4 ++--
 arch/x86/kvm/vmx/vmx.c        |  2 +-
 arch/x86/kvm/vmx/vmx.h        | 18 ++++++++++++++++++
 4 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 90e1ffdc05b7..dd536243f653 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,7 @@
 
 #include <linux/kvm_host.h>
 
-#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
+#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP)
 #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \
 	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index e4e4c1d3aa17..2bebb0d43666 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4308,7 +4308,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	 * CR0_GUEST_HOST_MASK is already set in the original vmcs01
 	 * (KVM doesn't change it);
 	 */
-	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmx_set_cr0(vcpu, vmcs12->host_cr0);
 
 	/* Same as above - no reason to call set_cr4_guest_host_mask(). */
@@ -4459,7 +4459,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	 */
 	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
 
-	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
 
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 566367409598..cab0ee27db74 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4450,7 +4450,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	/* 22.2.1, 20.8.1 */
 	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
 
-	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
 
 	set_cr4_guest_host_mask(vmx);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 20f1213a9368..cd73fe0c05b9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -531,6 +531,24 @@ static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.regs_dirty = 0;
 }
 
+static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
+{
+	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+
+	/*
+	 * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
+	 * in order to construct shadow PTEs with the correct protections.
+	 * Note!  CR0.WP technically can be passed through to the guest if
+	 * paging is disabled, but checking CR0.PG would generate a cyclical
+	 * dependency of sorts due to forcing the caller to ensure CR0 holds
+	 * the correct value prior to determining which CR0 bits can be owned
+	 * by L1.  Keep it simple and limit the optimization to EPT.
+	 */
+	if (!enable_ept)
+		bits &= ~X86_CR0_WP;
+	return bits;
+}
+
 static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
 {
 	return container_of(kvm, struct kvm_vmx, kvm);
From: Lai Jiangshan <laijs@linux.alibaba.com>
[ Upstream commit 552617382c197949ff965a3559da8952bf3c1fa5 ]
X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. All that's required is to flush the guest TLB.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/x86.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27900d4017a7..515033d01b12 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1081,9 +1081,10 @@ static bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
-	if (((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) ||
-	    (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
+	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
+	else if (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE))
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
From: Lai Jiangshan <laijs@linux.alibaba.com>
[ Upstream commit a91a7c7096005113d8e749fd8dfdd3e1eecee263 ]
X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. All that's required is to flush the guest TLB.

It is also inconsistent to keep X86_CR4_PGE in KVM_MMU_CR4_ROLE_BITS when kvm_mmu_role doesn't use it, so remove X86_CR4_PGE from KVM_MMU_CR4_ROLE_BITS as well.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/mmu.h | 5 ++---
 arch/x86/kvm/x86.c | 3 ++-
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index baafc5d8bb9e..03a9e37e446a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,9 +44,8 @@
 #define PT32_ROOT_LEVEL 2
 #define PT32E_ROOT_LEVEL 3
 
-#define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | \
-			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
-			       X86_CR4_LA57)
+#define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
+			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
 
 #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
 #define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 515033d01b12..ea3bdc4f2284 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1083,7 +1083,8 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
 	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
-	else if (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE))
+	else if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
+		 (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
From: Lai Jiangshan <laijs@linux.alibaba.com>
[ Upstream commit 6b123c3a89a90ac6418e4d64b1e23f09d458a77d ]
For shadow paging, the page table needs to be reconstructed before the next VMENTER if the guest PDPTEs have changed.

But not all paths that call load_pdptrs() cause the page tables to be reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots() are used to trigger the later reconstruction.

Commit d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR0.CD and CR0.NW.

Commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after load_pdptrs() when rewriting CR3 with the same value.

Commit a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR4.PGE.

Guests like Linux keep the PDPTEs unchanged for every instance of a page table, so this missing reconstruction is no problem for Linux guests.
Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/x86.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ea3bdc4f2284..a9f80a544ff1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -865,6 +865,13 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 	}
 	ret = 1;
 
+	/*
+	 * Marking VCPU_EXREG_PDPTR dirty doesn't work for !tdp_enabled.
+	 * Shadow page roots need to be reconstructed instead.
+	 */
+	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
+		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
+
 	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
 	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
 	kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
From: Sean Christopherson <seanjc@google.com>
[ Upstream commit cf9f4c0eb1699d306e348b1fd0225af7b2c282d3 ]
Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for permission faults when emulating a guest memory access and CR0.WP may be guest owned. If the guest toggles only CR0.WP and triggers emulation of a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a stale CR0.WP, i.e. use stale protection bits metadata.
Note, KVM passes through CR0.WP if and only if EPT is enabled, as CR0.WP is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't support per-bit interception controls for CR0. Don't bother checking for EPT vs. NPT as the "old == new" check will always be true under NPT, i.e. the only cost is the read of vcpu->arch.cr0 (SVM unconditionally grabs CR0 from the VMCB on VM-Exit).
Reported-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity....
Fixes: fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit")
Tested-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net> # backport to v5.15.x
---
- the MMU role wasn't folded into the CPU role yet in this kernel
  version and the "not_smap" handling was done slightly differently,
  however independently of the permission bitmap handling, so refreshing
  the bitmap prior to determining the fault state is still sufficient

 arch/x86/kvm/mmu.h     | 26 +++++++++++++++++++++++++-
 arch/x86/kvm/mmu/mmu.c | 15 +++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 03a9e37e446a..a3c0dc07fc96 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -76,6 +76,8 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 			  u64 fault_address, char *insn, int insn_len);
+void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+					struct kvm_mmu *mmu);
 
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
 void kvm_mmu_unload(struct kvm_vcpu *vcpu);
@@ -176,6 +178,24 @@ static inline bool is_writable_pte(unsigned long pte)
 	return pte & PT_WRITABLE_MASK;
 }
 
+static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+						    struct kvm_mmu *mmu)
+{
+	/*
+	 * When EPT is enabled, KVM may passthrough CR0.WP to the guest, i.e.
+	 * @mmu's snapshot of CR0.WP and thus all related paging metadata may
+	 * be stale.  Refresh CR0.WP and the metadata on-demand when checking
+	 * for permission faults.  Exempt nested MMUs, i.e. MMUs for shadowing
+	 * nEPT and nNPT, as CR0.WP is ignored in both cases.  Note, KVM does
+	 * need to refresh nested_mmu, a.k.a. the walker used to translate L2
+	 * GVAs to GPAs, as that "MMU" needs to honor L2's CR0.WP.
+	 */
+	if (!tdp_enabled || mmu == &vcpu->arch.guest_mmu)
+		return;
+
+	__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+}
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
@@ -207,8 +227,12 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
 	int index = (pfec >> 1) +
 		    (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
-	bool fault = (mmu->permissions[index] >> pte_access) & 1;
 	u32 errcode = PFERR_PRESENT_MASK;
+	bool fault;
+
+	kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+
+	fault = (mmu->permissions[index] >> pte_access) & 1;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkru_mask)) {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7c3b809f24b3..0e50b4dd01e5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4713,6 +4713,21 @@ static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
 	return role;
 }
 
+void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+					struct kvm_mmu *mmu)
+{
+	const bool cr0_wp = !!kvm_read_cr0_bits(vcpu, X86_CR0_WP);
+
+	BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP);
+	BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS));
+
+	if (is_cr0_wp(mmu) == cr0_wp)
+		return;
+
+	mmu->mmu_role.base.cr0_wp = cr0_wp;
+	reset_guest_paging_metadata(vcpu, mmu);
+}
+
 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 {
 	/* tdp_root_level is architecture forced level, use it if nonzero */
On Mon, May 08, 2023, Mathias Krause wrote:
> This is a backport of the CR0.WP KVM series[1] to Linux v5.15. It differs from the v6.1 backport in needing additional prerequisite patches from Lai Jiangshan (and fixes for those) to ensure the assumption that it's safe to let CR0.WP be a guest-owned bit still stands.
NAK.
The CR0.WP changes also very subtly rely on commit 2ba676774dfc ("KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging"), which hardcodes WP=1 in the mmu role. Without that, KVM will end up in a weird state when reinitializing the MMU context without reloading the root, as KVM will effectively change the role of an active root. E.g. child pages in the legacy MMU will have a mix of WP=0 and WP=1 in their role.
The inconsistency may or may not cause functional problems (I honestly don't know), but this missed dependency is exactly the type of problem that I am/was worried about with respect to backporting these changes all the way to 5.15. I'm simply not comfortable backporting these changes due to the number of modifications and enhancements that we've made to the TDP MMU, and to KVM's MMU handling in general, between 5.15 and 6.1.
[Paolo, can you please take a look at this as well?]
On 11.05.23 23:15, Sean Christopherson wrote:
> On Mon, May 08, 2023, Mathias Krause wrote:
>> This is a backport of the CR0.WP KVM series[1] to Linux v5.15. It differs from the v6.1 backport in needing additional prerequisite patches from Lai Jiangshan (and fixes for those) to ensure the assumption that it's safe to let CR0.WP be a guest-owned bit still stands.
>
> NAK.
>
> The CR0.WP changes also very subtly rely on commit 2ba676774dfc ("KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging"), which hardcodes WP=1 in the mmu role.
Well, that commit has the MMU role split into two (mmu_role and cpu_role), which was not the case for 5.15 and below. In fact, that split was more confusing than helpful, so commit 7a458f0e1ba1 ("KVM: x86/mmu: remove extended bits from mmu_role, rename field") /degraded/ mmu_role to root_role and made clear that bit test helpers like is_cr0_wp() look at cpu_role instead. In that regard, the backport should be equivalent to what we have in 6.4, as it's using mmu_role for the older kernels instead of cpu_role (which does not exist there yet).
> Without that, KVM will end up in a weird state when reinitializing the MMU context without reloading the root, as KVM will effectively change the role of an active root. E.g. child pages in the legacy MMU will have a mix of WP=0 and WP=1 in their role.
But does that really matter? Or, asked differently, don't we have that very same situation for 6.4 with cpu_role.base.cr0_wp being the bit looked at and not root_role.cr0_wp?
Either way, with EPT this should only matter if emulation is required and patch 8 makes sure we'll use the proper value of CR0.WP prior to starting the guest page table walk by refreshing the relevant cr0_wp bit. Or am I missing something?
> The inconsistency may or may not cause functional problems (I honestly don't know), but this missed dependency is exactly the type of problem that I am/was worried about with respect to backporting these changes all the way to 5.15.
While doing the backports I carefully looked at the changes and differences between the trees and, honestly, I don't think I missed this dependency as I did account for the mmu_role->{cpu,root}_role split (or rather unification regarding the backport) as explained above.
> I'm simply not comfortable backporting these changes due to the number of modifications and enhancements that we've made to the TDP MMU, and to KVM's MMU handling in general, between 5.15 and 6.1.
I understand your concerns, fiddling with the MMU implementation is not easy at all, as it has so many combinations to keep in mind and quite some implicit assumptions throughout the code. Moreover, I'm a newcomer to this part of the kernel. However, I tried very hard to look at the changes for the individual backports and double- or even triple-checked the code to make sure the changes are still consistent with the rest of the code base. I'm only human, so I might have missed something, but a vague bad feeling doesn't convince me that I did something wrong. Less so as KUT supports me in not having messed up things too badly.
I know your backlog is huge and your review time is a scarce resource. But could you or Paolo please take another look?
Thanks,
Mathias