----- On Aug 26, 2021, at 7:54 PM, Sean Christopherson seanjc@google.com wrote:
On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
errno, strerror(errno));
atomic_inc(&seq_cnt);
CPU_CLR(cpu, &allowed_mask);
/*
* Let the read-side get back into KVM_RUN to improve the odds
* of task migration coinciding with KVM's run loop.
This comment should be about increasing the odds of letting the seqlock read-side complete. Otherwise, the delay between the two back-to-back atomic_inc is so small that the seqlock read-side may never have time to complete the reading the rseq cpu id and the sched_getcpu() call, and can retry forever.
Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't possible (though that syscall would have to be screaming fast),
I don't think we have the same understanding of the livelock scenario. AFAIU the livelock would be caused by a too-small delay between the two consecutive atomic_inc() from one loop iteration to the next compared to the time it takes to perform a read-side critical section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I doubt that the sched_getcpu implementation have good odds to be fast enough to complete in that narrow window, leading to lots of read seqlock retry.
Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in the same loop iteration and completely ignoring the next iteration. Yes, I 100% agree that a delay to ensure forward progress is needed. An assertion in main() that the reader complete at least some reasonable number of KVM_RUNs is also probably a good idea, e.g. to rule out a false pass due to the reader never making forward progress.
Agreed.
FWIW, the do-while loop does make forward progress without a delay, but at ~50% throughput, give or take.
I did not expect absolutely no progress, but a significant slow down of the read-side.
but the primary motivation is very much to allow the read-side enough time to get back into KVM proper.
I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we have:
migration thread KVM_RUN/read-side thread
- ioctl(KVM_RUN)
atomic_inc_seq_cst(&seqcnt)
sched_setaffinity
atomic_inc_seq_cst(&seqcnt) - a = atomic_load(&seqcnt) & ~1 - smp_rmb() - b = LOAD_ONCE(__rseq_abi->cpu_id); - sched_getcpu() - smp_rmb() - re-load seqcnt/compare (succeeds) - Can only succeed if entire read-side happens while the seqcnt is in an even numbered state. - if (a != b) abort() /* no delay. Even counter state is very short. */
atomic_inc_seq_cst(&seqcnt) /* Let's suppose the lack of delay causes the setaffinity to complete too early compared with KVM_RUN ioctl */
sched_setaffinity
atomic_inc_seq_cst(&seqcnt)
/* no delay. Even counter state is very short. */
atomic_inc_seq_cst(&seqcnt) /* Then a setaffinity from a following migration thread loop will run concurrently with KVM_RUN */ - ioctl(KVM_RUN)
sched_setaffinity
atomic_inc_seq_cst(&seqcnt)
As pointed out here, if the first setaffinity runs too early compared with KVM_RUN, a following setaffinity will run concurrently with it. However, the fact that the even counter state is very short may very well hurt progress of the read seqlock.
*sigh*
Several hours later, I think I finally have my head wrapped around everything.
Due to the way the test is written and because of how KVM's run loop works, TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM actually enters the guest, otherwise KVM will exit to userspace without touching the flag, i.e. it will be handled by the normal exit_to_user_mode_loop().
Where I got lost was trying to figure out why I couldn't make the bug reproduce by causing the guest to exit to KVM, but not userspace, in which case KVM should easily trigger the problematic flow as the window for sched_getcpu() to collide with KVM would be enormous. The reason I didn't go down this route for the "official" test is that, unless there's something clever I'm overlooking, it requires arch specific guest code, and ialso I don't know that forcing an exit would even be necessary/sufficient on other architectures.
Anyways, I was trying to confirm that the bug was being hit without a delay, while still retaining the sequence retry in the test. The test doesn't fail because the back-to-back atomic_inc() changes seqcnt too fast. The read-side makes forward progress, but it never observes failure because the do-while loop only ever completes after another sched_setaffinity(), never after the one that collides with KVM because it takes too long to get out of ioctl(KVM_RUN) and back to the test. I.e. the atomic_inc() in the next loop iteration (makes seq_cnt odd) always completes before the check, and so the check ends up spinning until another migration, which correctly updates rseq. This was expected and didn't confuse me.
What confused me is that I was trying to confirm the bug was being hit from within the kernel by confirming KVM observed TIF_NOTIFY_RESUME, but I misunderstood when TIF_NOTIFY_RESUME would get set. KVM can observe TIF_NOTIFY_RESUME directly, but it's rare, and I suspect happens iff sched_setaffinity() hits the small window where it collides with KVM_RUN before KVM enters the guest.
More commonly, the bug occurs when KVM sees TIF_NEED_RESCHED. In that case, KVM calls xfer_to_guest_mode_work(), which does schedule() and _that_ sets TIF_NOTIFY_RESUME. xfer_to_guest_mode_work() then mishandles TIF_NOTIFY_RESUME and the bug is hit, but my confirmation logic in KVM never fired.
So there are effectively three reasons we want a delay:
To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM can enter the guest so that the guest doesn't need an arch-specific VM-Exit source.
To let ioctl(KVM_RUN) make its way back to the test before the next round of migration.
To ensure the read-side can make forward progress, e.g. if sched_getcpu() involves a syscall.
After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test could be tweaked to be overtly x86-specific. But since a delay is needed for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that this test provides coverage for arm64 and/or s390 if they're ever converted to use the common xfer_to_guest_mode_work().
Now that we have this understanding of why we need the delay, it would be good to write this down in a comment within the test.
Does it reproduce if we randomize the delay to have it picked randomly from 0us to 100us (with 1us step) ? It would remove a lot of the needs for arch-specific magic delay value.
Thanks,
Mathieu