On Wed, Jan 25, 2023 at 1:46 PM Paolo Bonzini pbonzini@redhat.com wrote:
On 1/25/23 17:47, Jim Mattson wrote:
Part of the definition of the API is that you can take KVM_GET_SUPPORTED_CPUID and pass it to KVM_SET_CPUID2 for all vCPUs. Returning host topology information for a random host vCPU definitely violates the contract.
You are attempting to rewrite history. Leaf 0xB was added to > KVM_GET_SUPPORTED_CPUID in commit 0771671749b5 ("KVM: Enhance guest cpuid management"), and the only documentation of the KVM_GET_SUPPORTED_CPUID ioctl at that time was in the commit message:
- KVM_GET_SUPPORTED_CPUID: get all cpuid entries the host (and kvm) supports
[...] the intention was to return the host topology information for the current logical processor.
The handling of unknown features is so naive in that commit, that I don't think it is possible to read anything from the implementation; and it certainly should not be a paragon for a future-proof implementation of KVM_GET_SUPPORTED_CPUID.
For example, it only hid _known_ CPUID leaves or features and passed the unknown ones through, which you'll agree is completely broken. It also didn't try to handle all leaves for which ECX might turn out to be significant---which happened for EAX=7 so the commit returns a wrong output for CPUID[EAX=7,ECX=0].EAX.
In other words, the only reason it handles 0xB is because it was known to have subleaves.
We can get more information about how userspace was intended to use it from the qemu-kvm fork, which at the time was practically the only KVM userspace. As of 2009 it was only checking a handful of leaves:
https://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git/tree/target-i386/kvm.c?...
so shall we say that userspace is supposed to build each CPUID leaf one by one and use KVM_GET_SUPPORTED_CPUID2 for validation only? I think the first committed documentation agrees: "Userspace can use the information returned by this ioctl to construct cpuid information (for KVM_SET_CPUID2) that is consistent with hardware, kernel, and userspace capabilities, and with user requirements".
However, that's the theory. "Do not break userspace" also involves looking at how userspace *really* uses the API, and make compromises to cater to those uses; which is different from rewriting history.
And in practice, people basically stopped reading after "(for KVM_SET_CPUID2)".
For example in kvmtool:
kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES; if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0) die_perror("KVM_GET_SUPPORTED_CPUID failed"); filter_cpuid(kvm_cpuid, vcpu->cpu_id); if (ioctl(vcpu->vcpu_fd, KVM_SET_CPUID2, kvm_cpuid) < 0) die_perror("KVM_SET_CPUID2 failed");
where filter_cpuid only does minor adjustments that do not include 0xB, 0x1F and 0x8000001E. The result is a topology that makes no sense if host #vCPUs != guest #vCPUs, and which doesn't include the correct APIC id in EDX.
https://github.com/kvmtool/kvmtool/blob/5657dd3e48b41bc6db38fa657994bc0e030f...
crosvm does optionally attempt to pass through leaves 0xB and 0x1F, but it fails to adjust the APIC id in EDX. On the other hand it also passes through 0x8000001E if ctx.cpu_config.host_cpu_topology is false, incorrectly. So on one hand this patch breaks host_cpu_topology == true, on the other hand it fixes host_cpu_topology == false on AMD processors.
https://github.com/google/crosvm/blob/cc79897fc0813ee8412e6395648593898962ec...
The rust-vmm reference hypervisor adjusts the APIC id in EDX for 0xB but not for 0x1F. Apart from that it passes through the host topology leaves, again resulting in nonsensical topology for host #vCPUs != guest #vCPUs.
https://github.com/rust-vmm/vmm-reference/blob/5cde58bc955afca8a180585a9f01c...
Firecracker, finally, ignores KVM_GET_SUPPORTED_CPUID's output for 0xb and 0x8000001E (good!) but fails to do the same for 0x1F, so this patch is again a fix of sorts---having all zeroes in 0x1F is better than having a value that is valid but inconsistent with 0xB.
https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f17... https://github.com/firecracker-microvm/firecracker/blob/cdf4fef3011c51206f17...
So basically the only open source userspace that is penalized (but not broken, and also partly fixed) by the patch is crosvm. QEMU doesn't care, while firecracker/kvmtool/vmm-reference are a net positive.
Paolo
The topology leaves returned by KVM_GET_SUPPORTED_CPUID *for over a decade* have been passed through unmodified from the host. They have never made sense for KVM_SET_CPUID2, with the unlikely exception of a whole-host VM.
Our VMM populates the topology of the guest CPUID table on its own, as any VMM must. However, it uses the host topology (which KVM_GET_SUPPORTED_CPUID has been providing pass-through *for over a decade*) to see if the requested guest topology is possible.
Changing a long-established ABI in a way that breaks userspace applications is a bad practice. I didn't think we, as a community, did that. I didn't realize that we were only catering to open source implementations here.