On Wed, 2018-08-22 at 09:19 +0200, gregkh@linuxfoundation.org wrote:
This is a note to let you know that I've just added the patch titled
x86/entry/64: Remove %ebx handling from error_entry/exit
to the 4.9-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=su...
The filename of the patch is: x86-entry-64-remove-ebx-handling-from-error_entry-exit.patch and it can be found in the queue-4.9 subdirectory.
Can we have it for 4.4 too, please?
[ Note to stable maintainers: this should probably get applied to all kernels. If you're nervous about that, a more conservative fix to add xorl %ebx,%ebx; incl %ebx before the jump to error_exit should also fix the problem. ]
Can we assume it's always from kernel? The Xen code definitely seems to handle invoking this from both kernel and userspace contexts.
Shouldn't %ebx get set to !(regs->rsp & 3) ?
Either way, let's just do it in the stable tree exactly the same way it's done upstream.
- On entry, EBX is a "return to kernel mode" flag:
Re-introduce the typo 'EBS' here, to make the patch apply cleanly to 4.4. It's only removing that line anyway.
Or just cherry-pick upstream commit 75ca5b22260ef7 first.
On Wed, Nov 28, 2018 at 02:56:32PM +0000, David Woodhouse wrote:
[...]
Can we have it for 4.4 too, please?
[...]
Re-introduce the typo 'EBS' here, to make the patch apply cleanly to 4.4. It's only removing that line anyway.
Or just cherry-pick upstream commit 75ca5b22260ef7 first.
Queued for 4.4. I've just grabbed the extra spellcheck fix as well.
--
Thanks,
Sasha
On Nov 28, 2018, at 6:56 AM, David Woodhouse dwmw2@infradead.org wrote:
[...]
Can we assume it's always from kernel? The Xen code definitely seems to handle invoking this from both kernel and userspace contexts.
I learned that my comment here was wrong shortly after the patch landed :(
On Wed, 2018-11-28 at 08:44 -0800, Andy Lutomirski wrote:
Can we assume it's always from kernel? The Xen code definitely seems to handle invoking this from both kernel and userspace contexts.
I learned that my comment here was wrong shortly after the patch landed :(
Turns out the only place I see it getting called from is under __context_switch().
 #7 [ffff8801144a7cf0] new_xen_failsafe_callback at ffffffffa028028a [kmod_ebxfix]
 #8 [ffff8801144a7d90] xen_hypercall_update_descriptor at ffffffff8100114a
 #9 [ffff8801144a7db8] xen_hypercall_update_descriptor at ffffffff8100114a
#10 [ffff8801144a7df0] xen_mc_flush at ffffffff81006ab9
#11 [ffff8801144a7e30] xen_end_context_switch at ffffffff81004e12
#12 [ffff8801144a7e48] __switch_to at ffffffff81016582
#13 [ffff8801144a7ea0] __schedule at ffffffff815d2b37
That …114a in xen_hypercall_update_descriptor is the 'pop' instruction right after the syscall; the callback fires when Xen preempts the domain mid-hypercall and then faults while reloading the segment registers to run that vCPU again later.
[ 44185.225289] WARN: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000abbd76060
The update_descriptor hypercall args (rdi, rsi) were 0xabbd76060 and 0 respectively; it was setting the descriptor at that address to zero.
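Decoded, that amounts to the following (my reconstruction of the equivalent single hypercall, not code pulled from the dump):

	/*
	 * The trapped multicall entry, as a plain hypercall: write an
	 * 8-byte descriptor value of 0, i.e. invalidate the GDT entry
	 * at that machine address (the RDI value from the log above).
	 */
	HYPERVISOR_update_descriptor(0xabbd76060ULL, 0);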
Xen then failed to load the selector 0x63 into the %gs register (since that descriptor has just been wiped?), leaving it zero.
[ 44185.225256] WARN: xen_failsafe_callback from xen_hypercall_update_descriptor+0xa/0x40
[ 44185.225263] WARN: DS: 2b/2b ES: 2b/2b FS: 0/0 GS: 0/63
This is on context switch from a 32-bit task to idle. So xen_failsafe_callback is returning to the "faulting" instruction, under a comment saying "Retry the IRET", but in fact it just continues on its merry way with %gs unexpectedly set to zero.
In fact I think this is probably fine in practice, since it's about to get explicitly set a few lines further down in __context_switch(). But it's odd enough, and far enough away from what's actually said by the comments, that I'm utterly unsure.
In xen_load_tls() we explicitly only do the lazy_load_gs(0) for the 32-bit kernel. Is that really right?
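For reference, the hunk in question currently looks roughly like this (quoting from memory, so check the tree):

	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_CPU) {
#ifdef CONFIG_X86_32
		/* 32-bit clears the user %gs selector here... */
		lazy_load_gs(0);
#else
		/* ...but 64-bit only ever clears %fs */
		loadsegment(fs, 0);
#endif
	}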
On 06/12/2018 17:10, David Woodhouse wrote:
[...]
In xen_load_tls() we explicitly only do the lazy_load_gs(0) for the 32-bit kernel. Is that really right?
Basically - what is happening is that xen_load_tls() is invalidating the %gs selector while %gs is still non-NUL.
If this happens to intersect with a vcpu reschedule, %gs (being non-NUL) takes precedence over KERNGSBASE, and faults when Xen tries to reload it. This results in the failsafe callback being invoked.
I think the correct course of action is to use xen_load_gs_index(0) (poorly named - it is a hypercall which does swapgs; mov to %gs; swapgs) before using update_descriptor() to invalidate the segment.
That will reset %gs to 0 without touching KERNGSBASE, and can be queued in the same multicall as the update_descriptor() hypercall.
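In non-batched form, that is roughly the following (my paraphrase using the existing Linux wrapper; xen_load_gs_index(0) ends up here too):

	/*
	 * Clear the user %gs selector via Xen without touching
	 * KERNGSBASE.
	 */
	HYPERVISOR_set_segment_base(SEGBASE_GS_USER_SEL, 0);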
~Andrew
On Dec 6, 2018, at 9:36 AM, Andrew Cooper andrew.cooper3@citrix.com wrote:
[...]
Basically - what is happening is that xen_load_tls() is invalidating the %gs selector while %gs is still non-NUL.
If this happens to intersect with a vcpu reschedule, %gs (being non-NUL) takes precedence over KERNGSBASE, and faults when Xen tries to reload it. This results in the failsafe callback being invoked.
I think the correct course of action is to use xen_load_gs_index(0) (poorly named - it is a hypercall which does swapgs; mov to %gs; swapgs) before using update_descriptor() to invalidate the segment.
That will reset %gs to 0 without touching KERNGSBASE, and can be queued in the same multicall as the update_descriptor() hypercall.
Sounds good to me as long as we skip it on native.
On Thu, 2018-12-06 at 10:49 -0800, Andy Lutomirski wrote:
[...]
Sounds good to me as long as we skip it on native.
Like this? The other option is just to declare that we don't care. On the rare occasion that it does happen to preempt and then take the trap on reloading, xen_failsafe_callback is actually doing the right thing and just leaving %gs as zero. We'd just need to fix the comments so they explicitly note this case is handled there too. At the moment it just says 'Retry the IRET', as I noted before.
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index ef05bea7010d..e8b383b24246 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -520,4 +520,15 @@ MULTI_stack_switch(struct multicall_entry *mcl,
 	trace_xen_mc_entry(mcl, 2);
 }
 
+static inline void
+MULTI_set_segment_base(struct multicall_entry *mcl,
+		       int reg, unsigned long value)
+{
+	mcl->op = __HYPERVISOR_set_segment_base;
+	mcl->args[0] = reg;
+	mcl->args[1] = value;
+
+	trace_xen_mc_entry(mcl, 2);
+}
+
 #endif /* _ASM_X86_XEN_HYPERCALL_H */
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 2f6787fc7106..722f1f51e20c 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -527,6 +527,8 @@ static void load_TLS_descriptor(struct thread_struct *t,
 
 static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 {
+	xen_mc_batch();
+
 	/*
 	 * XXX sleazy hack: If we're being called in a lazy-cpu zone
 	 * and lazy gs handling is enabled, it means we're in a
@@ -537,24 +539,24 @@ static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 	 * This will go away as soon as Xen has been modified to not
 	 * save/restore %gs for normal hypercalls.
 	 *
-	 * On x86_64, this hack is not used for %gs, because gs points
-	 * to KERNEL_GS_BASE (and uses it for PDA references), so we
-	 * must not zero %gs on x86_64
-	 *
 	 * For x86_64, we need to zero %fs, otherwise we may get an
 	 * exception between the new %fs descriptor being loaded and
-	 * %fs being effectively cleared at __switch_to().
+	 * %fs being effectively cleared at __switch_to(). We can't
+	 * just zero %gs, but we do need to clear the selector in
+	 * case of a Xen vCPU context switch before it gets reloaded
+	 * which would also cause a fault.
 	 */
 	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_CPU) {
 #ifdef CONFIG_X86_32
 		lazy_load_gs(0);
 #else
+		struct multicall_space mc = __xen_mc_entry(0);
+		MULTI_set_segment_base(mc.mc, SEGBASE_GS_USER_SEL, 0);
+
 		loadsegment(fs, 0);
 #endif
 	}
 
-	xen_mc_batch();
-
 	load_TLS_descriptor(t, cpu, 0);
 	load_TLS_descriptor(t, cpu, 1);
 	load_TLS_descriptor(t, cpu, 2);
On Thu, 2018-12-06 at 20:27 +0000, David Woodhouse wrote:
[...]
Like this?
 #else
+	struct multicall_space mc = __xen_mc_entry(0);
+	MULTI_set_segment_base(mc.mc, SEGBASE_GS_USER_SEL, 0);
+
 	loadsegment(fs, 0);
 #endif
That seems to boot and run, at least.
I'm going to experiment with sticking a SCHEDOP_yield in the batch *after* the update_descriptor requests, to see if I can trigger the original problem a bit quicker than my current method — which involves running a hundred machines for a day or two.
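Roughly like this, queued as a test-only hack (my sketch against the multicall API; untested until I actually run it):

	/*
	 * Test hack, not for merging: yield in the same multicall
	 * batch, straight after the update_descriptor entries, so
	 * the vCPU gets descheduled at the worst possible moment.
	 */
	struct multicall_space mc = __xen_mc_entry(0);

	mc.mc->op = __HYPERVISOR_sched_op;
	mc.mc->args[0] = SCHEDOP_yield;
	mc.mc->args[1] = 0;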
Still wondering if the better fix is just to adjust the comments to admit that xen_failsafe_callback catches the race condition and fixes it up perfectly, by just letting the %gs selector be zero for a while?
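If we go that way, the comment could say something like this (wording is mine, not a patch yet):

	/*
	 * Xen failed to reload one or more segment registers on the
	 * way back into the guest. For %gs this can legitimately race
	 * with xen_load_tls() invalidating the descriptor; the selector
	 * is simply left as 0 and we resume at the "faulting"
	 * instruction, which is harmless because __context_switch()
	 * reloads %gs shortly afterwards anyway.
	 */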
On Fri, 2018-12-07 at 12:18 +0000, David Woodhouse wrote:
[...]
That seems to boot and run, at least.
I'm going to experiment with sticking a SCHEDOP_yield in the batch *after* the update_descriptor requests, to see if I can trigger the original problem a bit quicker than my current method — which involves running a hundred machines for a day or two.
That SCHEDOP_yield, plus some debug output in xen_failsafe_callback, shows that the problem is now nice and easy to reproduce without the above MULTI_set_segment_base() call. It stopped happening when I added the MULTI_set_segment_base() call back again. I'll call that 'tested'.
But now we're making a hypercall to clear user %gs even in the case where none of the descriptors have changed; we should probably stop doing that...
Testing this (incremental) variant now.
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -688,7 +688,7 @@ static inline bool desc_equal(const struct desc_struct *d1,
 }
 
 static void load_TLS_descriptor(struct thread_struct *t,
-				unsigned int cpu, unsigned int i)
+				unsigned int cpu, unsigned int i, int flush_gs)
 {
 	struct desc_struct *shadow = &per_cpu(shadow_tls_desc, cpu).desc[i];
 	struct desc_struct *gdt;
@@ -698,6 +698,18 @@ static void load_TLS_descriptor(struct thread_struct *t,
 	if (desc_equal(shadow, &t->tls_array[i]))
 		return;
 
+	/*
+	 * If the current user %gs points to a descriptor we're changing,
+	 * zero it first to avoid taking a fault if Xen preempts this
+	 * vCPU between now and the time that %gs is later loaded with
+	 * the new value.
+	 */
+	if ((flush_gs >> 3) == GDT_ENTRY_TLS_MIN + i) {
+		struct multicall_space mc = __xen_mc_entry(0);
+
+		MULTI_set_segment_base(mc.mc, SEGBASE_GS_USER_SEL, 0);
+	}
+
 	*shadow = t->tls_array[i];
 
 	gdt = get_cpu_gdt_table(cpu);
@@ -709,7 +721,7 @@ static void load_TLS_descriptor(struct thread_struct *t,
 
 static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 {
-	xen_mc_batch();
+	u16 flush_gs = 0;
 
 	/*
 	 * XXX sleazy hack: If we're being called in a lazy-cpu zone
@@ -723,17 +735,19 @@ static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 	 *
 	 * For x86_64, we need to zero %fs, otherwise we may get an
 	 * exception between the new %fs descriptor being loaded and
-	 * %fs being effectively cleared at __switch_to(). We can't
-	 * just zero %gs, but we do need to clear the selector in
-	 * case of a Xen vCPU context switch before it gets reloaded
-	 * which would also cause a fault.
+	 * %fs being effectively cleared at __switch_to().
+	 *
+	 * We may also need to zero %gs, if it refers to a descriptor
+	 * which we are clearing. Otherwise a Xen vCPU context switch
+	 * before it gets reloaded could also cause a fault. Since
+	 * clearing the user %gs is another hypercall, do that only if
+	 * it's necessary.
 	 */
 	if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_CPU) {
 #ifdef CONFIG_X86_32
 		lazy_load_gs(0);
 #else
-		struct multicall_space mc = __xen_mc_entry(0);
-		MULTI_set_segment_base(mc.mc, SEGBASE_GS_USER_SEL, 0);
+		savesegment(gs, flush_gs);
 
 		loadsegment(fs, 0);
 #endif
@@ -741,9 +755,9 @@ static void xen_load_tls(struct thread_struct *t, unsigned int cpu)
 
 	xen_mc_batch();
 
-	load_TLS_descriptor(t, cpu, 0);
-	load_TLS_descriptor(t, cpu, 1);
-	load_TLS_descriptor(t, cpu, 2);
+	load_TLS_descriptor(t, cpu, 0, flush_gs);
+	load_TLS_descriptor(t, cpu, 1, flush_gs);
+	load_TLS_descriptor(t, cpu, 2, flush_gs);
 
 	{
 		struct multicall_space mc = __xen_mc_entry(0);