On Thu, 2018-12-06 at 10:49 -0800, Andy Lutomirski wrote:
On Dec 6, 2018, at 9:36 AM, Andrew Cooper andrew.cooper3@citrix.com wrote: Basically - what is happening is that xen_load_tls() is invalidating the %gs selector while %gs is still non-NUL.
If this happens to intersect with a vcpu reschedule, %gs (being non-NUL) takes precedence over KERNGSBASE, and faults when Xen tries to reload it. This results in the failsafe callback being invoked.
I think the correct course of action is to use xen_load_gs_index(0) (poorly named - it is a hypercall which does swapgs; mov to %gs; swapgs) before using update_descriptor() to invalidate the segment.
That will reset %gs to 0 without touching KERNGSBASE, and can be queued in the same multicall as the update_descriptor() hypercall.
Sounds good to me as long as we skip it on native.
Like this? The other option is just to declare that we don't care. On the rare occasion that it does happen to preempt and then take the trap on reloading, xen_failsafe_callback is actually doing the right thing and just leaving %gs as zero. We'd just need to fix the comments so they explicitly note this case is handled there too. At the moment it just says 'Retry the IRET', as I noted before.
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index ef05bea7010d..e8b383b24246 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -520,4 +520,15 @@ MULTI_stack_switch(struct multicall_entry *mcl, trace_xen_mc_entry(mcl, 2); }
+static inline void +MULTI_set_segment_base(struct multicall_entry *mcl, + int reg, unsigned long value) +{ + mcl->op = __HYPERVISOR_set_segment_base; + mcl->args[0] = reg; + mcl->args[1] = value; + + trace_xen_mc_entry(mcl, 2); +} + #endif /* _ASM_X86_XEN_HYPERCALL_H */ diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c index 2f6787fc7106..722f1f51e20c 100644 --- a/arch/x86/xen/enlighten_pv.c +++ b/arch/x86/xen/enlighten_pv.c @@ -527,6 +527,8 @@ static void load_TLS_descriptor(struct thread_struct *t,
static void xen_load_tls(struct thread_struct *t, unsigned int cpu) { + xen_mc_batch(); + /* * XXX sleazy hack: If we're being called in a lazy-cpu zone * and lazy gs handling is enabled, it means we're in a @@ -537,24 +539,24 @@ static void xen_load_tls(struct thread_struct *t, unsigned int cpu) * This will go away as soon as Xen has been modified to not * save/restore %gs for normal hypercalls. * - * On x86_64, this hack is not used for %gs, because gs points - * to KERNEL_GS_BASE (and uses it for PDA references), so we - * must not zero %gs on x86_64 - * * For x86_64, we need to zero %fs, otherwise we may get an * exception between the new %fs descriptor being loaded and - * %fs being effectively cleared at __switch_to(). + * %fs being effectively cleared at __switch_to(). We can't + * just zero %gs, but we do need to clear the selector in + * case of a Xen vCPU context switch before it gets reloaded + * which would also cause a fault. */ if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_CPU) { #ifdef CONFIG_X86_32 lazy_load_gs(0); #else + struct multicall_space mc = __xen_mc_entry(0); + MULTI_set_segment_base(mc.mc, SEGBASE_GS_USER_SEL, 0); + loadsegment(fs, 0); #endif }
- xen_mc_batch(); - load_TLS_descriptor(t, cpu, 0); load_TLS_descriptor(t, cpu, 1); load_TLS_descriptor(t, cpu, 2);