----- On Aug 19, 2021, at 7:48 PM, Sean Christopherson seanjc@google.com wrote:
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:
@@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) * If not nested over a rseq critical section, restart is useless. * Clear the rseq_cs pointer and return. */
- if (!in_rseq_cs(ip, &rseq_cs))
- if (!regs || !in_rseq_cs(ip, &rseq_cs))
I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work.
If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns.
However, clearing the thread's rseq_cs will prevent this from happening.
So I would favor an approach where we simply do:
if (!regs) return 0;
Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in a ioctl system call.
Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section.
Ha, that's effectively what I implemented first, but I changed it because of the comment in clear_rseq_cs() that says:
The rseq_cs field is set to NULL on preemption or signal delivery ... as well as well as on top of code outside of the rseq assembly block.
Which makes it sound like something might rely on clearing rseq_cs?
This comment is describing succinctly the lazy clear scheme for rseq_cs.
Without the lazy clear scheme, a rseq c.s. would look like:
* init(rseq_cs) * cpu = TLS->rseq::cpu_id_start * [1] TLS->rseq::rseq_cs = rseq_cs * [start_ip] ---------------------------- * [2] if (cpu != TLS->rseq::cpu_id) * goto abort_ip; * [3] <last_instruction_in_cs> * [post_commit_ip] ---------------------------- * [4] TLS->rseq::rseq_cs = NULL
But as a fast-path optimization, [4] is not entirely needed because the rseq_cs descriptor contains information about the instruction pointer range of the critical section. Therefore, userspace can omit [4], but if the kernel never clears it, it means that it will have to re-read the rseq_cs descriptor's content each time it needs to check it to confirm that it is not nested over a rseq c.s..
So making the kernel lazily clear the rseq_cs pointer is just an optimization which ensures that the kernel won't do useless work the next time it needs to check rseq_cs, given that it has already validated that the userspace code is currently not within the rseq c.s. currently advertised by the rseq_cs field.
Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an rseq critical section, and because syscalls in critical sections are illegal, by definition clearing rseq_cs is a nop unless userspace is misbehaving.
Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ code executed when returning from ioctl to userspace will be able to validate that it is not nested within a rseq critical section.
If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is it not worth the extra code to detect an error that will likely be caught anyways?
The error will indeed already be caught on return from ioctl to userspace, so I don't see any added value in duplicating this check.
Thanks,
Mathieu
diff --git a/kernel/rseq.c b/kernel/rseq.c index 35f7bd0fced0..28b8342290b0 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING)) return;
if (!regs) {
+#ifdef CONFIG_DEBUG_RSEQ
if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
goto error;
+#endif
return;
} ret = rseq_ip_fixup(regs); if (unlikely(ret < 0)) goto error;
Thanks for looking into this !
Mathieu
return clear_rseq_cs(t);
ret = rseq_need_restart(t, rseq_cs.flags); if (ret <= 0) -- 2.33.0.rc1.237.g0d66db33f3-goog
-- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com