hi Michael,
On 11/20/18 8:34 AM, Michael Ellerman wrote:
Hi Breno,
Thanks for chasing this one down.
Breno Leitao leitao@debian.org writes:
On a signal handler return, the user could set a context with MSR[TS] bits set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set, several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when calling __get_user(). If it happens, the process MSR[TS] bits were already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with MSR[TS] active and without tm_recheckpoint() being called. More importantly, without TEXAR[FS] bit set also.
Since TEXASR might not have the FS bit set, and when the process is scheduled back, it will try to reclaim, which will be aborted because of the CPU is not in the suspended state, and, then, recheckpoint. This recheckpoint will restore thread->texasr into TEXASR SPR, which might be zero, hitting a BUG_ON().
[ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446!
As Mikey said, would be good to have at least the stack trace & NIP here, if not the full oops.
Ack!
This patch simply delays the MSR[TS] set, so, if there is any page fault in the __get_user() section, it does not have regs->msr[TS] set, since the TM structures are still invalid, thus avoiding doing TM operations for in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in invalid state.
To make this safe when PREEMPT is enabled we need to preempt_disable() / enable() around the setting of regs->msr and the recheckpoint.
That could also serve as nice documentation.
I guess the other question is whether it should be the job of tm_recheckpoint() to set regs->msr, given that it already hard disables interrupts.
eg. we'd set the TM flags in a local msr variable and pass the to tm_recheckpoint(), it would then assign that to regs->msr in the IRQ disabled section. Though there's many callers of tm_recheckpoint() that don't need that behaviour, so it would probably need to be factored out.
Right, that might be doable, but I am wondering if it isn't better to create a new function that does it as below:
void tm_set_msr_and_recheckpoint(struct pt_regs *regs, u64 msr) { preempt_disable(); regs->msr = msr; tm_recheckpoint(); preempt_enable(); }
I understand that preempt_disable() does a better work disabling preemption compared to disabling IRQ. Also, it does not change the API just for this very specific signal case.
It is not possible to move tm_recheckpoint to happen earlier, because it is required to get the checkpointed registers from userspace, with __get_user(), thus, the only way to avoid this undesired behavior is delaying the MSR[TS] set, as done in this patch.
I think the root cause here is that we're copying into the live regs of current. That has obviously worked in the past, because the register state wasn't used until we returned back to userspace. But that's no longer true with TM. And even so it's quite subtle. I also suspect some of our FP/VEC handling may not work correctly if we're scheduled part way through restoring the regs.
I got your point and I think we have a risk of corrupting the regs if MSR[TS] is set and we do page_fault->recheckpoint/recheclaim in the middle of __get_user() chunks.
What might work better is if we copy all the regs into temporary space and then with interrupts disabled we copy them into the task. That way we should never be scheduled with a half-populated set of regs.
Yes, that seems to be the best strategy, and I might be interested in working on it.
That's obviously a much bigger patch though and something we'll have to do later.
Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals") Cc: stable@vger.kernel.org (v3.9+) Signed-off-by: Breno Leitao leitao@debian.org
arch/powerpc/kernel/signal_64.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 83d51bf586c7..15b153bdd826 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, if (MSR_TM_RESV(msr)) return -EINVAL;
- /* pull in MSR TS bits from user context */
- regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
- /*
* Ensure that TM is enabled in regs->msr before we leave the signal
* handler. It could be the case that (a) user disabled the TM bit
* through the manipulation of the MSR bits in uc_mcontext or (b) the
* TM bit was disabled because a sufficient number of context switches
* happened whilst in the signal handler and load_tm overflowed,
* disabling the TM bit. In either case we can end up with an illegal
* TM state leading to a TM Bad Thing when we return to userspace.
*/
- regs->msr |= MSR_TM;
- /* pull in MSR LE from user context */ regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
@@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, tm_enable(); /* Make sure the transaction is marked as failed */ tsk->thread.tm_texasr |= TEXASR_FS;
preempt_disable();
- /* pull in MSR TS bits from user context */
- regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
- /*
* Ensure that TM is enabled in regs->msr before we leave the signal
* handler. It could be the case that (a) user disabled the TM bit
* through the manipulation of the MSR bits in uc_mcontext or (b) the
* TM bit was disabled because a sufficient number of context switches
* happened whilst in the signal handler and load_tm overflowed,
* disabling the TM bit. In either case we can end up with an illegal
* TM state leading to a TM Bad Thing when we return to userspace.
*/
- regs->msr |= MSR_TM;
- /* This loads the checkpointed FP/VEC state, if used */ tm_recheckpoint(&tsk->thread);
preempt_enable();
Although looking at the code that follows, it probably won't cope with being preempted either. So the preempt_enable() should probably go at the end of the function.
Why? I confess I do not understand the preempt mechanism a lot. Does a preemption save/restore FP and vector states when it is preempted?
What is the code/IRQ that is executed when there is a preemption? I am wondering if a preempt happens because the Decrementer (0x900) IRQ happens and a syscall is being executed, thus, saving the whole context and then restoring it. If my guess is correct, the IRQ might not save the FP/VEC states, thus, your concern about ahving load_fp_state() after a preemption.
Let me paste the "patched" code here to help with the discussion:
regs->msr |= MSR_TM;
/* This loads the checkpointed FP/VEC state, if used */ tm_recheckpoint(&tsk->thread);
msr_check_and_set(msr & (MSR_FP | MSR_VEC)); if (msr & MSR_FP) { load_fp_state(&tsk->thread.fp_state); regs->msr |= (MSR_FP | tsk->thread.fpexc_mode); } if (msr & MSR_VEC) { load_vr_state(&tsk->thread.vr_state); regs->msr |= MSR_VEC; }
Thanks Breno