From: Wanpeng Li wanpengli@tencent.com
[ Upstream commit 0361bdfddca20c8855ea3bdbbbc9c999912b10ff ]
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to host-side polling after suspend/resume. Non-bootstrap CPUs are restored correctly by the haltpoll driver because they are hot-unplugged during suspend and hot-plugged during resume; however, the BSP is not hotpluggable and remains in host-sde polling mode after the guest resume. The makes the guest pay for the cost of vmexits every time the guest enters idle.
Fix it by recording BSP's haltpoll state and resuming it during guest resume.
Cc: Marcelo Tosatti mtosatti@redhat.com Signed-off-by: Wanpeng Li wanpengli@tencent.com Message-Id: 1650267752-46796-1-git-send-email-wanpengli@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kernel/kvm.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 18e952fed021..6c3d38b5a8ad 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -66,6 +66,7 @@ static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __align DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible; static int has_steal_clock = 0;
+static int has_guest_poll = 0; /* * No need for any "IO delay" on KVM */ @@ -624,14 +625,26 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
static int kvm_suspend(void) { + u64 val = 0; + kvm_guest_cpu_offline(false);
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL + if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) + rdmsrl(MSR_KVM_POLL_CONTROL, val); + has_guest_poll = !(val & 1); +#endif return 0; }
static void kvm_resume(void) { kvm_cpu_online(raw_smp_processor_id()); + +#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL + if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL) && has_guest_poll) + wrmsrl(MSR_KVM_POLL_CONTROL, 0); +#endif }
static struct syscore_ops kvm_syscore_ops = {
From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit d22a81b304a27fca6124174a8e842e826c193466 ]
Emulating writes to SELF_IPI with a write to ICR has an unwanted side effect: the value of ICR in vAPIC page gets changed. The lists SELF_IPI as write-only, with no associated MMIO offset, so any write should have no visible side effect in the vAPIC page.
Reported-by: Chao Gao chao.gao@intel.com Reviewed-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kvm/lapic.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index de11149e28e0..e45ebf0870b6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2106,10 +2106,9 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) break;
case APIC_SELF_IPI: - if (apic_x2apic_mode(apic)) { - kvm_lapic_reg_write(apic, APIC_ICR, - APIC_DEST_SELF | (val & APIC_VECTOR_MASK)); - } else + if (apic_x2apic_mode(apic)) + kvm_apic_send_ipi(apic, APIC_DEST_SELF | (val & APIC_VECTOR_MASK), 0); + else ret = 1; break; default:
On 4/27/22 17:54, Sasha Levin wrote:
From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit d22a81b304a27fca6124174a8e842e826c193466 ]
Emulating writes to SELF_IPI with a write to ICR has an unwanted side effect: the value of ICR in vAPIC page gets changed. The lists SELF_IPI as write-only, with no associated MMIO offset, so any write should have no visible side effect in the vAPIC page.
Reported-by: Chao Gao chao.gao@intel.com Reviewed-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org
arch/x86/kvm/lapic.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index de11149e28e0..e45ebf0870b6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2106,10 +2106,9 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) break; case APIC_SELF_IPI:
if (apic_x2apic_mode(apic)) {
kvm_lapic_reg_write(apic, APIC_ICR,
APIC_DEST_SELF | (val & APIC_VECTOR_MASK));
} else
if (apic_x2apic_mode(apic))
kvm_apic_send_ipi(apic, APIC_DEST_SELF | (val & APIC_VECTOR_MASK), 0);
break; default:else ret = 1;
Acked-by: Paolo Bonzini pbonzini@redhat.com
From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit 9191b8f0745e63edf519e4a54a4aaae1d3d46fbd ]
WARN and bail if KVM attempts to free a root that isn't backed by a shadow page. KVM allocates a bare page for "special" roots, e.g. when using PAE paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa will be valid but won't be backed by a shadow page. It's all too easy to blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of crashing KVM and possibly the kernel.
Reviewed-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kvm/mmu/mmu.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 99ea1ec12ffe..70ef5b542681 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3140,6 +3140,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, return;
sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK); + if (WARN_ON(!sp)) + return;
if (kvm_mmu_put_root(kvm, sp)) { if (sp->tdp_mmu_page)
On 4/27/22 17:54, Sasha Levin wrote:
From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit 9191b8f0745e63edf519e4a54a4aaae1d3d46fbd ]
WARN and bail if KVM attempts to free a root that isn't backed by a shadow page. KVM allocates a bare page for "special" roots, e.g. when using PAE paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa will be valid but won't be backed by a shadow page. It's all too easy to blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of crashing KVM and possibly the kernel.
Reviewed-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org
arch/x86/kvm/mmu/mmu.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 99ea1ec12ffe..70ef5b542681 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3140,6 +3140,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa, return; sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
- if (WARN_ON(!sp))
return;
if (kvm_mmu_put_root(kvm, sp)) { if (sp->tdp_mmu_page)
Acked-by: Paolo Bonzini pbonzini@redhat.com
From: Wanpeng Li wanpengli@tencent.com
[ Upstream commit 1714a4eb6fb0cb79f182873cd011a8ed60ac65e8 ]
As commit 0c5f81dad46 ("KVM: LAPIC: Inject timer interrupt via posted interrupt") mentioned that the host admin should well tune the guest setup, so that vCPUs are placed on isolated pCPUs, and with several pCPUs surplus for *busy* housekeeping. In this setup, it is preferrable to disable mwait/hlt/pause vmexits to keep the vCPUs in non-root mode.
However, if only some guests isolated and others not, they would not have any benefit from posted timer interrupts, and at the same time lose VMX preemption timer fast paths because kvm_can_post_timer_interrupt() returns true and therefore forces kvm_can_use_hv_timer() to false.
By guaranteeing that posted-interrupt timer is only used if MWAIT or HLT are done without vmexit, KVM can make a better choice and use the VMX preemption timer and the corresponding fast paths.
Reported-by: Aili Yao yaoaili@kingsoft.com Reviewed-by: Sean Christopherson seanjc@google.com Cc: Aili Yao yaoaili@kingsoft.com Cc: Sean Christopherson seanjc@google.com Signed-off-by: Wanpeng Li wanpengli@tencent.com Message-Id: 1643112538-36743-1-git-send-email-wanpengli@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kvm/lapic.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e45ebf0870b6..a3ef793fce5f 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -113,7 +113,8 @@ static inline u32 kvm_x2apic_id(struct kvm_lapic *apic)
static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu) { - return pi_inject_timer && kvm_vcpu_apicv_active(vcpu); + return pi_inject_timer && kvm_vcpu_apicv_active(vcpu) && + (kvm_mwait_in_guest(vcpu->kvm) || kvm_hlt_in_guest(vcpu->kvm)); }
bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu)
On 4/27/22 17:54, Sasha Levin wrote:
From: Wanpeng Li wanpengli@tencent.com
[ Upstream commit 1714a4eb6fb0cb79f182873cd011a8ed60ac65e8 ]
As commit 0c5f81dad46 ("KVM: LAPIC: Inject timer interrupt via posted interrupt") mentioned that the host admin should well tune the guest setup, so that vCPUs are placed on isolated pCPUs, and with several pCPUs surplus for *busy* housekeeping. In this setup, it is preferrable to disable mwait/hlt/pause vmexits to keep the vCPUs in non-root mode.
However, if only some guests isolated and others not, they would not have any benefit from posted timer interrupts, and at the same time lose VMX preemption timer fast paths because kvm_can_post_timer_interrupt() returns true and therefore forces kvm_can_use_hv_timer() to false.
By guaranteeing that posted-interrupt timer is only used if MWAIT or HLT are done without vmexit, KVM can make a better choice and use the VMX preemption timer and the corresponding fast paths.
Reported-by: Aili Yao yaoaili@kingsoft.com Reviewed-by: Sean Christopherson seanjc@google.com Cc: Aili Yao yaoaili@kingsoft.com Cc: Sean Christopherson seanjc@google.com Signed-off-by: Wanpeng Li wanpengli@tencent.com Message-Id: 1643112538-36743-1-git-send-email-wanpengli@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org
arch/x86/kvm/lapic.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e45ebf0870b6..a3ef793fce5f 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -113,7 +113,8 @@ static inline u32 kvm_x2apic_id(struct kvm_lapic *apic) static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu) {
- return pi_inject_timer && kvm_vcpu_apicv_active(vcpu);
- return pi_inject_timer && kvm_vcpu_apicv_active(vcpu) &&
}(kvm_mwait_in_guest(vcpu->kvm) || kvm_hlt_in_guest(vcpu->kvm));
bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu)
Acked-by: Paolo Bonzini pbonzini@redhat.com
On 4/27/22 17:54, Sasha Levin wrote:
From: Wanpeng Li wanpengli@tencent.com
[ Upstream commit 0361bdfddca20c8855ea3bdbbbc9c999912b10ff ]
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to host-side polling after suspend/resume. Non-bootstrap CPUs are restored correctly by the haltpoll driver because they are hot-unplugged during suspend and hot-plugged during resume; however, the BSP is not hotpluggable and remains in host-sde polling mode after the guest resume. The makes the guest pay for the cost of vmexits every time the guest enters idle.
Fix it by recording BSP's haltpoll state and resuming it during guest resume.
Cc: Marcelo Tosatti mtosatti@redhat.com Signed-off-by: Wanpeng Li wanpengli@tencent.com Message-Id: 1650267752-46796-1-git-send-email-wanpengli@tencent.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org
arch/x86/kernel/kvm.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 18e952fed021..6c3d38b5a8ad 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -66,6 +66,7 @@ static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __align DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible; static int has_steal_clock = 0; +static int has_guest_poll = 0; /*
- No need for any "IO delay" on KVM
*/ @@ -624,14 +625,26 @@ static int kvm_cpu_down_prepare(unsigned int cpu) static int kvm_suspend(void) {
- u64 val = 0;
- kvm_guest_cpu_offline(false);
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
- if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
rdmsrl(MSR_KVM_POLL_CONTROL, val);
- has_guest_poll = !(val & 1);
+#endif return 0; } static void kvm_resume(void) { kvm_cpu_online(raw_smp_processor_id());
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
- if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL) && has_guest_poll)
wrmsrl(MSR_KVM_POLL_CONTROL, 0);
+#endif } static struct syscore_ops kvm_syscore_ops = {
Acked-by: Paolo Bonzini pbonzini@redhat.com
linux-stable-mirror@lists.linaro.org