On Mon, 2024-04-22 at 15:02 -0700, Chen, Zide wrote:
the selftest works for me, and I ran the test for 1000+ iterations, w/ or w/o TSC scaling, the TEST_ASSERT(delta_corrected <= ±1) never got hit. This is awesome!
I think that with further care we can get even better than that.
Let's look at where that ±1ns tolerance comes from.
Consider a 3GHz TSC. That gives us three ticks per nanosecond. Each TSC value can be seen as (3n) (3n+1) or (3n+2) for a given nanosecond n.
If we take a new reference point at a (3n+2) TSC value and calculate the KVM clock from that, we *know* we're going to round down and lose two-thirds of a nanosecond.
So then we set the new KVM clock parameters to use that new reference point, and that's why we have to allow a disruption of up to a single nanosecond. In fact, I don't think it's ±1 ns, is it? It'll only ever be in the same direction (rounding down)?
But if we're careful which *which* TSC value we use as the reference point, we can reduce that error.
The TSC value we use should be *around* the current time, but what if we were to evaluate maybe the previous 100 TSC values. Pass *each* of them through the conversion to nanoseconds and use the one that comes *closest* to a precise nanosecond (nnnnnnnn.000).
It's even fairly easy to calculate those, because of the way the KVM clock ABI has us multiply and then shift right by 32 bits. We just need to look at those low 32 bits (the fractional nanosecond) *before* shifting them out of existence. Something like...
uint64_t tsc_candidate, tsc_candidate_last, best_tsc; uint32_t frac_ns_min = 0xffffffff; uint64_t frac_ns;
best_tsc = tsc_candidate = rdtsc(); tsc_candidate_last = tsc_candidate - 100;
while (tsc_candidate-- > tsc_candidate_last) { uint64_t guest_tsc = kvm_scale_tsc(tsc_candidate, ...); frac_ns = guest_tsc * hvclock->tsc_to_system_mul; /* Shift *after* multiplication, not before as pvclock_scale_cycles() does. */ if (hvclock->tsc_shift < 0) frac_ns >>= -hvclock->tsc_shift; else frac_ns <<= hvclock->tsc_shift;
if ( (uint32_t)frac_ns <= frac_ns_min ) { frac_ns_min = frac_ns; best_tsc = tsc_candidate; } } printk("Best TSC to use for reference point is %lld", best_tsc);
And then you calculate your CLOCK_MONOTONIC_RAW and guest KVM clock from *that* host TSC value, and thus minimise the discrepancies due to rounding down?
Aside from the fact that I literally just typed that into email and it's barely even thought through let alone entirely untested... I'm just not sure it's even worth the runtime cost, for that ±1 ns on a rare case.
A slop of ±1ns is probably sufficient because over the past few years we've already shifted the definition of the KVM clock to *not* be NTP- corrected, and we leave guests to do fine-grained synchronization through other means anyway.
But I see talk of people offering a PPS signal to *hundreds* of guests on the same host simultaneously, just for them all to use it to calibrate the same underlying oscillator. Which is a little bit insane.
We *should* have a way for the host to do that once and then expose the precise time to its guests, in a much saner way than the KVM clock does. I'll look at adding something along those lines to this series too, which can be driven from the host's adjtimex() adjustments (which KVM already consumes), and fed into each guest's timekeeping as a PTP/PPS device or something.