Overview:
This series implements a new PMU scheme on ARM: a partitioned PMU that exists alongside the existing emulated PMU and may be enabled either by the kernel command-line parameter kvm.reserved_host_counters or by the vcpu ioctl KVM_ARM_PARTITION_PMU. This is a continuation of the RFC posted earlier this year [1].
At a high level, and as the name suggests, this implementation takes advantage of recent CPU features to partition the PMU counters into a host-reserved set and a guest-reserved set. Guests are allowed untrapped hardware access to the most frequently used PMU registers and features, for the guest-reserved counters only.
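As a rough illustration of the partition, the sketch below splits the general-purpose counters into two bitmasks. It relies on the architectural behaviour of MDCR_EL2.HPMN, where counters below HPMN are accessible to the guest and counters at or above HPMN are reserved for EL2. The struct, the function names, and the assumption that HPMN is simply the counter count minus the number of reserved host counters are hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Bitmask covering general-purpose counters in the range [lo, hi). */
static uint32_t counter_range_mask(unsigned int lo, unsigned int hi)
{
	assert(lo <= hi && hi <= 31);
	return (uint32_t)(((UINT64_C(1) << hi) - 1) & ~((UINT64_C(1) << lo) - 1));
}

/* Hypothetical description of one partition; not a structure from the series. */
struct pmu_partition {
	unsigned int hpmn;	/* value that would be programmed into MDCR_EL2.HPMN */
	uint32_t guest_mask;	/* counters 0 .. hpmn-1, usable by the guest */
	uint32_t host_mask;	/* counters hpmn .. nr_counters-1, kept by the host */
};

/*
 * Assumption: the host keeps the top reserved_host counters and the guest
 * gets the rest. hpmn == 0 is only meaningful on CPUs with FEAT_HPMN0.
 */
static struct pmu_partition partition_pmu(unsigned int nr_counters,
					  unsigned int reserved_host)
{
	struct pmu_partition p;

	assert(nr_counters <= 31 && reserved_host <= nr_counters);
	p.hpmn = nr_counters - reserved_host;
	p.guest_mask = counter_range_mask(0, p.hpmn);
	p.host_mask = counter_range_mask(p.hpmn, nr_counters);
	return p;
}
```

With 10 counters and 4 reserved for the host, this gives the guest counters 0-5 and the host counters 6-9.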
This untrapped hardware access significantly reduces the overhead of using performance monitoring capabilities such as the `perf` tool inside a guest VM. Register accesses that do not trap to KVM mean less time spent in the host kernel and more time on the workloads guests care about. The benefit is most pronounced at high `perf` sample rates or with large numbers of events that require multiplexing hardware counters.
Performance:
The following tests were carried out on identical ARM machines with 10 general purpose counters, running identical guest images on QEMU. The only difference was whether the guest used the partitioned PMU from this series or the existing emulated PMU. Some arguments have been simplified here to clarify the purpose of each test:
1) time perf record -e ${FIFTEEN_HW_EVENTS} -F 1000 -- \
       gzip -c tmpfs/random.64M.img >/dev/null
On the emulated PMU this command took 4.143s real time with 0.159s system time. On the partitioned PMU it took 3.139s real time with 0.110s system time, reductions of 24.23% in real time and 30.82% in system time.
2) time perf stat -dd -- \
       automated_specint2017.sh
On emulated PMU this benchmark completed in 3789.16s real time with 224.45s system time and a final benchmark score of 4.28. On partitioned PMU this benchmark completed in 3525.67s real time with 15.98s system time and a final benchmark score of 4.56. That is a 6.95% reduction in runtime, 92.88% reduction in system time, and 6.54% improvement in overall benchmark score.
Seeing these improvements on something as lightweight as perf stat is remarkable and implies there would have been a much greater improvement with perf record. I did not test that because I was not confident it would even finish in a reasonable time on the emulated PMU.
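For anyone wanting to check the percentages quoted for tests 1 and 2, they follow from the usual relative-change formula. The helper names below are made up for illustration; only the input figures come from the measurements above.

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Percentage by which `after` is larger than `before`. */
static double pct_improvement(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For example, pct_reduction(224.45, 15.98) is the system-time reduction for test 2, and pct_improvement(4.28, 4.56) is the change in benchmark score.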
Test 3 was slightly different. I ran the workload in a VM with a single VCPU pinned to a physical CPU and used mpstat on the host to analyze where that physical CPU spent its time.
3) perf record -e ${FIFTEEN_HW_EVENTS} -F 4000 -- \
       stress-ng --cpu 0 --timeout 30
Over a period of 30s, the CPU running with the emulated PMU spent 34.96% of its time in the host kernel and 55.85% of its time in the guest. The CPU running with the partitioned PMU spent 0.97% of its time in the host kernel and 91.06% of its time in the guest.
Taken together, these tests represent a remarkable performance improvement for anything perf related using this new PMU implementation.
Caveats:
Because the most consistent and performant thing to do was untrap PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect the guest.
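To make the relationship concrete, the sketch below computes the PMCR_EL0 value a guest would observe by substituting MDCR_EL2.HPMN into the N field. The field positions (PMCR_EL0.N in bits [15:11], MDCR_EL2.HPMN in bits [4:0]) are from the Arm architecture; the macro and function names are local to this sketch and are not the kernel's definitions. In the series itself the hardware provides this view directly, so no such software step is implied.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch-local field definitions, not the kernel's macros. */
#define PMCR_N_SHIFT	11
#define PMCR_N_MASK	(UINT64_C(0x1f) << PMCR_N_SHIFT)
#define MDCR_HPMN_MASK	UINT64_C(0x1f)

/* Extract the counter count advertised in a PMCR_EL0 value. */
static unsigned int pmcr_n(uint64_t pmcr)
{
	return (unsigned int)((pmcr & PMCR_N_MASK) >> PMCR_N_SHIFT);
}

/*
 * Model of what the guest sees: every field of the hardware PMCR_EL0
 * except N, which is replaced by the HPMN value from MDCR_EL2.
 */
static uint64_t guest_visible_pmcr(uint64_t hw_pmcr, uint64_t mdcr_el2)
{
	uint64_t hpmn = mdcr_el2 & MDCR_HPMN_MASK;

	/* HPMN must not exceed the number of implemented counters. */
	assert(hpmn <= pmcr_n(hw_pmcr));
	return (hw_pmcr & ~PMCR_N_MASK) | (hpmn << PMCR_N_SHIFT);
}
```

On a machine with 10 counters and HPMN set to 6, the guest would therefore read N as 6 regardless of any value userspace tried to set for it.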
These improvements come at a cost of 7-35 new registers that must be swapped at every vcpu_load and vcpu_put if the feature is enabled. I have been informed KVM would like to avoid paying this cost when possible.
One solution is to make the trapping changes and context swapping lazy, so that they only take place after the guest has actually accessed the PMU. Guests that never access the PMU would then never pay the cost.
This is not done here because it is not crucial to the primary functionality and I thought review would be more productive as soon as I had something complete enough for reviewers to easily play with.
However, this or any better ideas are on the table for inclusion in future re-rolls.
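The lazy scheme described above could look roughly like the following state tracking. This is only a sketch of the idea and is not part of this series: the struct, its fields, and the function names are all hypothetical, and the swap counter stands in for the actual register save/restore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state; none of these names come from the series. */
struct vcpu_pmu_sketch {
	bool partitioned;	/* partitioned PMU enabled for this vCPU */
	bool guest_used_pmu;	/* set once the guest has touched the PMU */
	bool regs_loaded;	/* guest PMU registers are live on hardware */
	unsigned int swaps;	/* stand-in for the real register swaps */
};

/* First guest PMU access still traps; after it, accesses go untrapped. */
static void sketch_first_pmu_trap(struct vcpu_pmu_sketch *v)
{
	assert(v != NULL);
	if (!v->partitioned || v->guest_used_pmu)
		return;
	v->guest_used_pmu = true;
	v->regs_loaded = true;	/* vCPU is running, so load guest state now */
	v->swaps++;
}

/* Only pay for the register load if the guest has ever used the PMU. */
static void sketch_vcpu_load(struct vcpu_pmu_sketch *v)
{
	assert(v != NULL);
	if (!v->partitioned || !v->guest_used_pmu)
		return;
	v->regs_loaded = true;
	v->swaps++;
}

/* Only pay for the register save if guest state is actually loaded. */
static void sketch_vcpu_put(struct vcpu_pmu_sketch *v)
{
	assert(v != NULL);
	if (!v->regs_loaded)
		return;
	v->regs_loaded = false;
	v->swaps++;
}
```

A vCPU that never touches the PMU goes through load and put with no swaps at all, which is the cost KVM would like to avoid.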
[1] https://lore.kernel.org/kvmarm/20250213180317.3205285-1-coltonlewis@google.c...
Colton Lewis (16):
  arm64: cpufeature: Add cpucap for HPMN0
  arm64: Generate sign macro for sysreg Enums
  arm64: cpufeature: Add cpucap for PMICNTR
  KVM: arm64: Reorganize PMU functions
  KVM: arm64: Introduce method to partition the PMU
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Keep out of guest counter partition
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Writethrough trapped PMEVTYPER register
  KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
  KVM: arm64: Writethrough trapped PMOVS register
  KVM: arm64: Context switch Partitioned PMU guest registers
  perf: pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Inject recorded guest interrupts
  KVM: arm64: Add ioctl to partition the PMU when supported
  KVM: arm64: selftests: Add test case for partitioned PMU

Marc Zyngier (1):
  KVM: arm64: Cleanup PMU includes
 Documentation/virt/kvm/api.rst                |  16 +
 arch/arm/include/asm/arm_pmuv3.h              |  24 +
 arch/arm64/include/asm/arm_pmuv3.h            |  36 +-
 arch/arm64/include/asm/kvm_host.h             | 208 +++++-
 arch/arm64/include/asm/kvm_pmu.h              |  82 +++
 arch/arm64/kernel/cpufeature.c                |  15 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |  24 +-
 arch/arm64/kvm/debug.c                        |  13 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  65 +-
 arch/arm64/kvm/pmu-emul.c                     | 629 +----------------
 arch/arm64/kvm/pmu-part.c                     | 358 ++++++++++
 arch/arm64/kvm/pmu.c                          | 630 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     |  54 +-
 arch/arm64/tools/cpucaps                      |   2 +
 arch/arm64/tools/gen-sysreg.awk               |   1 +
 arch/arm64/tools/sysreg                       |   6 +-
 drivers/perf/arm_pmuv3.c                      |  55 +-
 include/kvm/arm_pmu.h                         | 199 ------
 include/linux/perf/arm_pmu.h                  |  15 +-
 include/linux/perf/arm_pmuv3.h                |  14 +-
 include/uapi/linux/kvm.h                      |   4 +
 tools/include/uapi/linux/kvm.h                |   2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c |  40 +-
 virt/kvm/kvm_main.c                           |   1 +
 25 files changed, 1616 insertions(+), 879 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_pmu.h
 create mode 100644 arch/arm64/kvm/pmu-part.c
 delete mode 100644 include/kvm/arm_pmu.h
base-commit: 1b85d923ba8c9e6afaf19e26708411adde94fba8
--
2.49.0.1204.g71687c7c1d-goog