Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state

15 Oct 2024

      On Tue, Oct 15, 2024, Vitaly Kuznetsov wrote:
...
Nikolas Wipper nik.wipper@gmx.de writes:
...
On 10.10.24 10:57, Vitaly Kuznetsov wrote:
...
...
...
...
int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);

+static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
+{

return vcpu->arch.hyperv_enabled &&
      READ_ONCE(vcpu->arch.hyperv->suspended);

I don't think READ_ONCE() means anything here, does it?
It does prevent compiler optimisations and is actually required[1]. Also
it makes clear that this variable is shared, and may be accessed from
remote CPUs.
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variab...
It certainly does no harm but I think if we follow 'Loads from and
stores to shared (but non-atomic) variables should be protected with the
READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them
all over KVM/kernel ;-) And personally, this makes reading the code
harder.
To my (very limited) knowledge, we really need READ_ONCE()s when we need
to have some sort of a serialization, e.g. the moment when this read
happens actually makes a difference. If we can e.g. use a local variable
in the beginning of a function and replace all READ_ONCE()s with
reading this local variable -- then we don't need READ_ONCE()s and are
OK with possible compiler optimizations. Similar (reversed) thoughts go
to WRITE_ONCE().
I think it's OK to keep them but it would be nice (not mandatory IMO,
but nice) to have a comment describing which particular synchronization
we are achieving (== the compiler optimization scenario we are protecting
against).
In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly
looked at all kvm_hv_vcpu_suspended() call sites (there are three) in
your series but couldn't think of a place where the READ_ONCE() makes a
real difference. kvm_hv_hypercall_complete() looks pretty safe
anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified
significantly if we merge 'suspended' with 'waiting_on': instead of
  kvm_for_each_vcpu(i, v, vcpu->kvm) {
          vcpu_hv = to_hv_vcpu(v);

          if (kvm_hv_vcpu_suspended(v) &&
              READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {

...
you will have just
  kvm_for_each_vcpu(i, v, vcpu->kvm) {
          vcpu_hv = to_hv_vcpu(v);

          if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) {

...
(and yes, I also think that READ_ONCE() is superfluous here, as real
(non-speculative) write below can't happen _before_ the check )
The last one, kvm_vcpu_running(), should also be indifferent to
READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of
course, but I hope you got my line of thought.
I don't think you're missing anything.  In general, all of this code is more than
a bit heavy-handed and lacks any kind of precision, which makes it *really* hard
to see what actually guarantees a vCPU won't get stuck blocking.
Writers synchronize SRCU and readers are required to acquire SRCU, but there's
no actual data tagged as being protected by SRCU, i.e. tlb_flush_inhibit should
be __rcu.
All of the {READ,WRITE}_ONCE() stuff provides some implicit compiler barriers,
but the actual protection to ensure a vCPU either observes inhibit=false or a wake
event is provided by the smp_wmb() in __kvm_make_request().
And from a performance perspective, synchronizing on kvm->srcu is going to be
susceptible to random slowdowns, because writers will have to wait until all vCPUs
drop SRCU, even if they have nothing to do with PV TLB flushes.  E.g. if vCPUs
are faulting in memory from swap, uninhibiting a TLB flushes could be stalled
unnecessarily for an extended duration.
Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically
misleading (there is no event to process).  At a glance, KVM_REQ_UNBLOCK is likely
more appropriate.
Before we spend too much time cleaning things up, I want to first settle on the
overall design, because it's not clear to me that punting HvTranslateVirtualAddress
to userspace is a net positive.  We agreed that VTLs should be modeled primarily
in userspace, but that doesn't automatically make punting everything to userspace
the best option, especially given the discussion at KVM Forum with respect to
mplementing VTLs, VMPLs, TD partitions, etc.
The cover letters for this series and KVM_TRANSLATE2 simply say they're needed
for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt
HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and
KVM.  And it very much is a split, because there are obviously a lot of details
around TlbFlushInhibit that bleed into KVM.
Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit?  The
TLFS just says
After the memory intercept routine performs instruction completion, it should
  clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register.
but I can't find anything that says _how_ it clears TlbFlushInhibit.
[*] https://lore.kernel.org/all/20240609154945.55332-8-nsaenz@amazon.com

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state