On Fri, Dec 20, 2024 at 04:46:34PM +0000, Mark Brown wrote:
The SVE portion of kvm_vcpu_put() is quite large, especially given the comments required. When we add similar handling for SME the function will get even larger, in order to keep things managable factor the SVE portion out of the main kvm_vcpu_put().
While investigating some problems with SVE I spotted a latent bug in this area where I suspect the fix will conflict with / supersede this rework. Details below; IIUC the bug was introduced in commit:
8c8010d69c132273 ("KVM: arm64: Save/restore SVE state for nVHE")
Signed-off-by: Mark Brown broonie@kernel.org
arch/arm64/kvm/fpsimd.c | 67 +++++++++++++++++++++++++++---------------------- 1 file changed, 37 insertions(+), 30 deletions(-)
diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index 09b65abaf9db60cc57dbc554ad2108a80c2dc46b..3c2e0b96877ac5b4f3b9d8dfa38975f11b74b60d 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -151,6 +151,41 @@ void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu) } } +static void kvm_vcpu_put_sve(struct kvm_vcpu *vcpu) +{
- u64 zcr;
- if (!vcpu_has_sve(vcpu))
return;
- zcr = read_sysreg_el1(SYS_ZCR);
- /*
* If the vCPU is in the hyp context then ZCR_EL1 is loaded
* with its vEL2 counterpart.
*/
- __vcpu_sys_reg(vcpu, vcpu_sve_zcr_elx(vcpu)) = zcr;
- /*
* Restore the VL that was saved when bound to the CPU, which
* is the maximum VL for the guest. Because the layout of the
* data when saving the sve state depends on the VL, we need
* to use a consistent (i.e., the maximum) VL. Note that this
* means that at guest exit ZCR_EL1 is not necessarily the
* same as on guest entry.
*
* ZCR_EL2 holds the guest hypervisor's VL when running a
* nested guest, which could be smaller than the max for the
* vCPU. Similar to above, we first need to switch to a VL
* consistent with the layout of the vCPU's SVE state. KVM
* support for NV implies VHE, so using the ZCR_EL1 alias is
* safe.
*/
- if (!has_vhe() || (vcpu_has_nv(vcpu) && !is_hyp_ctxt(vcpu)))
sve_cond_update_zcr_vq(vcpu_sve_max_vq(vcpu) - 1,
SYS_ZCR_EL1);
+}
/*
- Write back the vcpu FPSIMD regs if they are dirty, and invalidate the
- cpu FPSIMD regs so that they can't be spuriously reused if this vcpu
@@ -179,38 +214,10 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu) }
A little before this context, kvm_arch_vcpu_put_fp() calls local_irq_save(), which indicates that IRQs can be taken before this point, which is deeply suspicious.
If IRQs are enabled, then it's possible to take an IRQ and potentially run a softirq handler which uses kernel-mode NEON. That means kernel_neon_begin() will try to save the live FPSIMD/SVE/SME state via fpsimd_save_user_state(), using the live value of ZCR_ELx.LEN, which would not be correct per the comment.
Looking at kvm_arch_vcpu_ioctl_run(), the relevant logic is:
vcpu_load(vcpu); // calls kvm_arch_vcpu_load_fp()
while (ret > 0) { preempt_disable(); local_irq_disable();
kvm_arch_vcpu_ctxflush_fp(); ret = kvm_arm_vcpu_enter_exit(vcpu); kvm_arch_vcpu_ctxsync_fp(vcpu);
local_irq_enable(); preempt_enable(); }
vcpu_put(vcpu); // calls kvm_arch_vcpu_put_fp()
... and the problem can occur at any point after the vCPU has run where IRQs are enabled, i.e, between local_irq_enable() and either local_irq_disable() or vcpu_put()'s call to kvm_arch_vcpu_put_fp().
Note that kernel_neon_begin() calls:
fpsimd_save_user_state(); ... fpsimd_flush_cpu_state();
... and fpsimd_save_user_state() will see that the SVE VL is wrong:
if (WARN_ON(sve_get_vl() != vl)) { force_signal_inject(SIGKILL, SI_KERNEL, 0, 0); return; }
... pending a SIGKILL for the VMM thread without saving the vCPU's state before unbinding the live state via fpsimd_flush_cpu_state(), which'll set TIF_FOREIGN_FPSTATE.
AFAICT it's possible to re-enter the vCPU after that happens, whereupon stale vCPU FPSIMD/SVE state will be restored. Upon return to userspace the SIGKILL will be delivered, killing the VMM.
As above, it looks like that's been broken since the nVHE SVE save/restore was introduced in commit:
8c8010d69c132273 ("KVM: arm64: Save/restore SVE state for nVHE")
The TL;DR summary is that it's not sufficient for kvm_arch_vcpu_put_fp() to fix up ZCR_ELx. Either:
* That needs to be fixed up while IRQs are masked, e.g. by saving/restoring the host and guest ZCR_EL1 and/or ZCR_ELx values in kvm_arch_vcpu_ctxflush_fp() and kvm_arch_vcpu_ctxsync_fp()
* The lazy save logic in fpsimd_save_user_state() needs to handle KVM explicitly, saving the guest's ZCR_EL1 and restoring the host's ZCR_ELx.
I think we need to fix that before we extend this logic for SME.
Mark.
if (guest_owns_fp_regs()) {
if (vcpu_has_sve(vcpu)) {
u64 zcr = read_sysreg_el1(SYS_ZCR);
/*
* If the vCPU is in the hyp context then ZCR_EL1 is
* loaded with its vEL2 counterpart.
*/
__vcpu_sys_reg(vcpu, vcpu_sve_zcr_elx(vcpu)) = zcr;
/*
* Restore the VL that was saved when bound to the CPU,
* which is the maximum VL for the guest. Because the
* layout of the data when saving the sve state depends
* on the VL, we need to use a consistent (i.e., the
* maximum) VL.
* Note that this means that at guest exit ZCR_EL1 is
* not necessarily the same as on guest entry.
*
* ZCR_EL2 holds the guest hypervisor's VL when running
* a nested guest, which could be smaller than the
* max for the vCPU. Similar to above, we first need to
* switch to a VL consistent with the layout of the
* vCPU's SVE state. KVM support for NV implies VHE, so
* using the ZCR_EL1 alias is safe.
*/
if (!has_vhe() || (vcpu_has_nv(vcpu) && !is_hyp_ctxt(vcpu)))
sve_cond_update_zcr_vq(vcpu_sve_max_vq(vcpu) - 1,
SYS_ZCR_EL1);
}
kvm_vcpu_put_sve(vcpu);
/*
* Flush (save and invalidate) the fpsimd/sve state so that if
* Flush (save and invalidate) the FP state so that if
- the host tries to use fpsimd/sve, it's not using stale data
- from the guest.
-- 2.39.5