On Tue, Oct 15, 2024, Vitaly Kuznetsov wrote:
Nikolas Wipper nik.wipper@gmx.de writes:
On 10.10.24 10:57, Vitaly Kuznetsov wrote:
...
int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
+static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) +{
- return vcpu->arch.hyperv_enabled &&
READ_ONCE(vcpu->arch.hyperv->suspended);
I don't think READ_ONCE() means anything here, does it?
It does prevent compiler optimisations and is actually required[1]. Also it makes clear that this variable is shared, and may be accessed from remote CPUs.
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variab...
It certainly does no harm but I think if we follow 'Loads from and stores to shared (but non-atomic) variables should be protected with the READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them all over KVM/kernel ;-) And personally, this makes reading the code harder.
To my (very limited) knowledge, we really need READ_ONCE()s when we need to have some sort of a serialization, e.g. the moment when this read happens actually makes a difference. If we can e.g. use a local variable in the beginning of a function and replace all READ_ONCE()s with reading this local variable -- then we don't need READ_ONCE()s and are OK with possible compiler optimizations. Similar (reversed) thoughts go to WRITE_ONCE().
I think it's OK to keep them but it would be nice (not mandatory IMO, but nice) to have a comment describing which particular synchronization we are achieving (== the compiler optimization scenario we are protecting against).
In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly looked at all kvm_hv_vcpu_suspended() call sites (there are three) in your series but couldn't think of a place where the READ_ONCE() makes a real difference. kvm_hv_hypercall_complete() looks pretty safe anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified significantly if we merge 'suspended' with 'waiting_on': instead of
kvm_for_each_vcpu(i, v, vcpu->kvm) { vcpu_hv = to_hv_vcpu(v); if (kvm_hv_vcpu_suspended(v) && READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
...
you will have just
kvm_for_each_vcpu(i, v, vcpu->kvm) { vcpu_hv = to_hv_vcpu(v); if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) {
... (and yes, I also think that READ_ONCE() is superfluous here, as real (non-speculative) write below can't happen _before_ the check )
The last one, kvm_vcpu_running(), should also be indifferent to READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of course, but I hope you got my line of thought.
I don't think you're missing anything. In general, all of this code is more than a bit heavy-handed and lacks any kind of precision, which makes it *really* hard to see what actually guarantees a vCPU won't get stuck blocking.
Writers synchronize SRCU and readers are required to acquire SRCU, but there's no actual data tagged as being protected by SRCU, i.e. tlb_flush_inhibit should be __rcu.
All of the {READ,WRITE}_ONCE() stuff provides some implicit compiler barriers, but the actual protection to ensure a vCPU either observes inhibit=false or a wake event is provided by the smp_wmb() in __kvm_make_request().
And from a performance perspective, synchronizing on kvm->srcu is going to be susceptible to random slowdowns, because writers will have to wait until all vCPUs drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs are faulting in memory from swap, uninhibiting a TLB flushes could be stalled unnecessarily for an extended duration.
Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely more appropriate.
Before we spend too much time cleaning things up, I want to first settle on the overall design, because it's not clear to me that punting HvTranslateVirtualAddress to userspace is a net positive. We agreed that VTLs should be modeled primarily in userspace, but that doesn't automatically make punting everything to userspace the best option, especially given the discussion at KVM Forum with respect to mplementing VTLs, VMPLs, TD partitions, etc.
The cover letters for this series and KVM_TRANSLATE2 simply say they're needed for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and KVM. And it very much is a split, because there are obviously a lot of details around TlbFlushInhibit that bleed into KVM.
Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The TLFS just says
After the memory intercept routine performs instruction completion, it should clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register.
but I can't find anything that says _how_ it clears TlbFlushInhibit.
[*] https://lore.kernel.org/all/20240609154945.55332-8-nsaenz@amazon.com