On Tue, Jun 07, 2022, Peter Xu wrote:
On Tue, Jun 07, 2022 at 03:04:27PM +0000, Sean Christopherson wrote:
On Tue, Jun 07, 2022, Paolo Bonzini wrote:
On 6/6/22 23:27, Peter Xu wrote:
On Mon, Jun 06, 2022 at 06:18:12PM +0200, Paolo Bonzini wrote:
However there seems to be something missing at least to me, on why it'll fail a migration from 5.15 (without this patch) to 5.18 (with this patch). In my test case, user_xfeatures will be 0x7 (FP|SSE|YMM) if without this patch, but 0x0 if with it.
What CPU model are you using for the VM?
I didn't specify it, assuming it's qemu64 with no extra parameters.
Ok, so indeed it lacks AVX and this patch can have an effect.
For example, if the source lacks this patch but the destination has it, the source will transmit YMM registers, but the destination will fail to set them if they are not available for the selected CPU model.
See the commit message: "As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2."
IIUC you meant we should have failed KVM_SET_XSAVE when they're not aligned (probably by failing validate_user_xstate_header when checking against the user_xfeatures on dest host). But that's probably not my case, because here KVM_SET_XSAVE succeeded, it's just that the guest gets a double fault after the precopy migration completes (or for postcopy when the switchover is done).
Difficult to say what's happening without seeing at least the guest code around the double fault (above you said "fail a migration" and I thought that was a different scenario than the double fault), and possibly which was the first exception that contributed to the double fault.
Regardless of why the guest explodes in the way it does, is someone planning on bisecting this (if necessary?) and sending a backport to v5.15? There's another bug report that is more than likely hitting the same bug.
What's the bisection you mentioned? I actually did a bisection and I also checked reverting Leo's change can also fix this issue. Or do you mean something else?
Oooooh, sorry! I got completely turned around. You ran into a bug with the fix. I thought that you were hitting the same issues as Mike where migrating between hosts with different capabilities is broken in v5.15, but works in v5.18.
https://lore.kernel.org/all/48353e0d-e771-8a97-21d4-c65ff3bc4192@sentex.net
That is kvm64, and I agree it could be the same problem since both qemu64 and kvm64 models do not have any xsave feature bit declared in cpuid 0xd, so potentially we could be migrating some fpu states to it even with user_xfeatures==0 on dest host.
So today I continued the investigation, and I think what's really missing is qemu seems to be ignoring the user_xfeatures check for KVM_SET_XSAVE and continues even if it returns -EINVAL. IOW, I'm wondering whether we should fail properly and start to check kvm_arch_put_registers() retcode. But that'll be a QEMU fix, and it'll at least not causing random faults (e.g. double faults) in guest but we should fail the migration gracefully.
Sean: a side note is that I can also easily trigger one WARN_ON_ONCE() in your commit 98c25ead5eda5 in kvm_arch_vcpu_ioctl_run():
WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
It'll be great if you'd like to check that up.
Ugh, userspace can force KVM_MP_STATE_UNINITIALIZED via KVM_SET_MP_STATE. Looks like QEMU does that when emulating RESET.
Logically, a full RESET of the xAPIC seems like the right thing to do. I think we can get away with that without breaking ABI? And kvm_lapic_reset() has a related bug where it stops the HR timer but not doesn't handle the HV timer :-/
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e69b83708f05..948aba894245 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2395,7 +2395,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) return;
/* Stop the timer in case it's a reset to an active apic */ - hrtimer_cancel(&apic->lapic_timer.timer); + cancel_apic_timer(&apic->lapic_timer.timer);
/* The xAPIC ID is set at RESET even if the APIC was already enabled. */ if (!init_event) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 540651cd28d7..ed2c7cb1642d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10912,6 +10912,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED)) goto out;
+ if (mp_state->mp_state == KVM_MP_STATE_UNINITIALIZED) + kvm_lapic_reset(vcpu, false); + if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) { vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED; set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);