On Tue, Mar 01, 2022 at 09:22:10PM +0100, Paolo Bonzini wrote:
On 3/1/22 21:13, Sasha Levin wrote:
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed08..6ac01f9828530 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1563,7 +1563,10 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 		fpregs_restore_userregs();
 
 	newfps->xfeatures = curfps->xfeatures | xfeatures;
-	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
+	if (!guest_fpu)
+		newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
 	curfps = fpu_install_fpstate(fpu, newfps);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bf18679757c70..875dce4aa2d28 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -276,6 +276,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
 
+	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0;
+
 	kvm_update_pv_runtime(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
Leonardo, was this also buggy in 5.16? (I should have asked for a Fixes tag...).
I stumbled over this patch during some migration tests over the past few days.
In short, I was migrating a VM from a 5.15 host to a 5.18 host, and the guest triggers a double fault immediately after the switch-over (I think that's when it tries to do vmenter and a VECTOR_DF gets injected), with either precopy or postcopy.
After I upgraded the 5.15 src host to 5.18, the problem went away. I bisected on the dest and, surprisingly, it points to this commit.
Side note: I'm using two hosts that have the same processor model, so no case of missing features on either side - they just match.
I'm not really sure whether this is a bug or by design - do we require this patch to be applied to all stable branches so that the guest does not crash after migration, or is this unexpected?
FWICT, this patch modifies user_xfeatures, which we did not do before. It sounds reasonable to me at first glance: if the guest didn't enable some of the fpu features, we don't need to migrate those fpu state chunks, since we migrate things based on user_xfeatures. It also sounds good for solving the migration issue from a "has-pkru" host to a "no-pkru" host, as described in the patch commit message.
However, there seems to be something missing, at least to me, about why a migration from 5.15 (without this patch) to 5.18 (with this patch) fails. In my test case, user_xfeatures is 0x7 (FP|SSE|YMM) without this patch, but 0x0 with it.
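To make the two masks concrete, here is a minimal sketch that decodes an xfeatures value into component names. The bit positions follow the architectural XCR0 layout (FP = bit 0, SSE = bit 1, YMM = bit 2, PKRU = bit 9); the helper xfeatures_to_str() and its name table are made up for illustration and are not kernel code.

```c
#include <stdint.h>
#include <stdio.h>

/* XCR0 state-component bits (architectural layout). */
#define XFEATURE_MASK_FP	(1ULL << 0)	/* x87 state */
#define XFEATURE_MASK_SSE	(1ULL << 1)	/* XMM registers */
#define XFEATURE_MASK_YMM	(1ULL << 2)	/* AVX upper halves */
#define XFEATURE_MASK_PKRU	(1ULL << 9)	/* protection keys */

/* Hypothetical name table, only for this illustration. */
struct xfeature_name {
	uint64_t bit;
	const char *name;
};

static const struct xfeature_name xfeature_names[] = {
	{ XFEATURE_MASK_FP,   "FP"   },
	{ XFEATURE_MASK_SSE,  "SSE"  },
	{ XFEATURE_MASK_YMM,  "YMM"  },
	{ XFEATURE_MASK_PKRU, "PKRU" },
};

/*
 * Hypothetical helper: render the known bits of @mask as "A|B|C" into
 * @buf, or "none" if no known bit is set. Returns @buf.
 */
static const char *xfeatures_to_str(uint64_t mask, char *buf, size_t len)
{
	size_t off = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(xfeature_names) / sizeof(xfeature_names[0]); i++) {
		int n;

		if (!(mask & xfeature_names[i].bit))
			continue;

		n = snprintf(buf + off, len - off, "%s%s",
			     off ? "|" : "", xfeature_names[i].name);
		if (n < 0 || (size_t)n >= len - off)
			return buf;	/* output truncated, stop here */
		off += (size_t)n;
	}

	if (off == 0)
		snprintf(buf, len, "none");
	return buf;
}
```

With a 64-byte buffer, xfeatures_to_str(0x7, buf, sizeof(buf)) gives "FP|SSE|YMM" and xfeatures_to_str(0x0, buf, sizeof(buf)) gives "none", which are the two user_xfeatures values seen without and with the patch.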
I think what should be happening is that user_xfeatures is set to 0x7 on src (old kernel), so we should have migrated some more chunks to dest. But I don't yet understand why that is a problem, because fundamentally when we restore the fpu state (fpu_swap_kvm_fpstate) we use the max feature bitmask anyway, and the dest hardware should support all of those features. I don't see how that could trigger a double fault.
I'll probably continue digging next week. Before that, any thoughts?