----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to transferring to a KVM guest, which is roughly equivalent to an exit to userspace and processes many of the same pending actions. While the task cannot be in an rseq critical section as the KVM path is reachable only via ioctl(KVM_RUN), the side effects that apply to rseq outside of a critical section still apply, e.g. the CPU ID needs to be updated if the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults and other badness in userspace VMMs that use rseq in combination with KVM, e.g. due to the CPU ID being stale after task migration.
I agree with the problem assessment, but I would recommend a small change to this fix.
Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") Reported-by: Peter Foley pefoley@google.com Bisected-by: Doug Evans dje@google.com Cc: Shakeel Butt shakeelb@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com
kernel/entry/kvm.c | 4 +++- kernel/rseq.c | 4 ++-- 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 49972ee99aff..049fd06b4c3d 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work) if (ti_work & _TIF_NEED_RESCHED) schedule();
if (ti_work & _TIF_NOTIFY_RESUME)
if (ti_work & _TIF_NOTIFY_RESUME) { tracehook_notify_resume(NULL);
rseq_handle_notify_resume(NULL, NULL);
}
ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); if (ret)
diff --git a/kernel/rseq.c b/kernel/rseq.c index 35f7bd0fced0..58c79a7918cd 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
static int rseq_ip_fixup(struct pt_regs *regs) {
- unsigned long ip = instruction_pointer(regs);
- unsigned long ip = regs ? instruction_pointer(regs) : 0; struct task_struct *t = current; struct rseq_cs rseq_cs; int ret;
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) * If not nested over a rseq critical section, restart is useless. * Clear the rseq_cs pointer and return. */
- if (!in_rseq_cs(ip, &rseq_cs))
- if (!regs || !in_rseq_cs(ip, &rseq_cs))
I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work.
If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns.
However, clearing the thread's rseq_cs will prevent this from happening.
So I would favor an approach where we simply do:
if (!regs) return 0;
Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in a ioctl system call.
Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section.
Thanks for looking into this !
Mathieu
return clear_rseq_cs(t);
ret = rseq_need_restart(t, rseq_cs.flags); if (ret <= 0) -- 2.33.0.rc1.237.g0d66db33f3-goog