On Wed, Jun 04, 2025 at 08:10:27PM +0000, Colton Lewis wrote:
Thank you Oliver for the additional explanation.
Oliver Upton oliver.upton@linux.dev writes:
On Tue, Jun 03, 2025 at 09:32:41PM +0000, Colton Lewis wrote:
Oliver Upton oliver.upton@linux.dev writes:
On Mon, Jun 02, 2025 at 07:26:51PM +0000, Colton Lewis wrote:
static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu) {
- u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
- preempt_disable();
/* * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK * to disable guest access to the profiling and trace buffers */
- vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
*host_data_ptr(nr_event_counters));
- vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
- vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
- vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
MDCR_EL2_TPM |
This isn't safe, as there's no guarantee that kvm_arch::arm_pmu is pointing that the PMU for this CPU. KVM needs to derive HPMN from some per-CPU state, not anything tied to the VM/vCPU.
I'm confused. Isn't this function preparing to run the vCPU on this CPU? Why would it be pointing at a different PMU?
Because arm64 is a silly ecosystem and system designers can glue together heterogenous CPU implementations. The arm_pmu that KVM is pointing at might only match a subset of CPUs, but vCPUs migrate at the whim of the scheduler (and userspace).
That means the arm_pmu field might at any time point to data that doesn't represent the current CPU. I'm surprised that's not swapped out anywhere. Seems like it would be useful to have an arch struct be a reliable source of information about the current arch.
There's no way to accomplish that. It is per-VM data, and you could have vCPUs on a mix of physical CPUs.
This is mitigated somewhat when the VMM explicitly selects a PMU implementation, as we prevent vCPUs from actually entering the guest on an unsupported CPU (see ON_SUPPORTED_CPU flag).
There are two *very* distinct functions w.r.t. partitioning:
- Partitioning of a particular arm_pmu that says how many counters the
host can use
- VMM intentions to present a subset of the KVM-owned counter
partition to its guest
#1 is modifying *global* state, we really can't mess with that in the context of a single VM...
I see the distinction more clearly now. Since KVM can only control the number of counters presented to the guest through HPMN, why would the VMM ever choose a subset? If the host PMU is globally partitioned to not use anything in the guest range, presenting fewer counters to a guest is just leaving some counters in the middle of the range unused.
You may not want to give a 'full' PMU to all VMs running on a system, but some OSes (Windows) expect to have at least the fixed CPU cycle counter present. In this case the VMM would deliberately expose fewer counters. FEAT_HPMN0 didn't get added to the architecture by accident...
Thanks, Oliver