Overview:
This series implements a new PMU scheme on ARM, a partitioned PMU that exists alongside the existing emulated PMU and may be enabled by the kernel command line parameter kvm.reserved_host_counters or by the vcpu ioctl KVM_ARM_PARTITION_PMU. This is a continuation of the RFC posted earlier this year. [1]
The high level overview and reason for the name is that this implementation takes advantage of recent CPU features to partition the PMU counters into a host-reserved set and a guest-reserved set. Guests are allowed untrapped hardware access to the most frequently used PMU registers and features for the guest-reserved counters only.
This untrapped hardware access significantly reduces the overhead of using performance monitoring capabilities such as the `perf` tool inside a guest VM. Register accesses that do not trap to KVM mean less time spent in the host kernel and more time on the workloads guests care about. This optimization especially shines at high `perf` sample rates or with large numbers of events that require multiplexing hardware counters.
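To make the partition concrete, here is a minimal sketch of how the two counter sets can be described as bitmasks. It relies on the architectural meaning of MDCR_EL2.HPMN (counters 0 to HPMN-1 are usable by the guest, counters HPMN and above are reserved for the host). The helper names are hypothetical and not taken from this series, and how kvm.reserved_host_counters maps onto HPMN is an assumption (HPMN = total counters minus reserved host counters).

```c
#include <assert.h>
#include <stdint.h>

/* Bitmask covering counter indices [lo, hi); empty when lo >= hi. */
static uint64_t counter_range_mask(unsigned int lo, unsigned int hi)
{
	uint64_t mask = 0;

	for (unsigned int i = lo; i < hi; i++)
		mask |= UINT64_C(1) << i;
	return mask;
}

/* Hypothetical helper: counters the guest may access untrapped. */
static uint64_t guest_counter_mask(unsigned int hpmn)
{
	return counter_range_mask(0, hpmn);
}

/* Hypothetical helper: counters that stay reserved for the host. */
static uint64_t host_counter_mask(unsigned int nr_counters, unsigned int hpmn)
{
	/* PMCR_EL0.N is a 5-bit field, so at most 31 event counters exist. */
	assert(hpmn <= nr_counters && nr_counters <= 31);
	return counter_range_mask(hpmn, nr_counters);
}
```

With 10 general purpose counters and HPMN set to 6, the guest mask covers counters 0 to 5 and the host mask covers counters 6 to 9.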
Performance:
The following tests were carried out on identical ARM machines with 10 general purpose counters, running identical guest images on QEMU. The only difference was whether my PMU implementation or the existing one was in use. Some arguments have been simplified here to clarify the purpose of each test:
1) time perf record -e ${FIFTEEN_HW_EVENTS} -F 1000 -- \
       gzip -c tmpfs/random.64M.img >/dev/null
On emulated PMU this command took 4.143s real time with 0.159s system time. On partitioned PMU this command took 3.139s real time with 0.110s system time, runtime reductions of 24.23% and 30.82%.
2) time perf stat -dd -- \
       automated_specint2017.sh
On emulated PMU this benchmark completed in 3789.16s real time with 224.45s system time and a final benchmark score of 4.28. On partitioned PMU this benchmark completed in 3525.67s real time with 15.98s system time and a final benchmark score of 4.56. That is a 6.95% reduction in runtime, 92.88% reduction in system time, and 6.54% improvement in overall benchmark score.
Seeing these improvements on something as lightweight as perf stat is remarkable and implies there would have been a much greater improvement with perf record. I did not test that because I was not confident it would even finish in a reasonable time on the emulated PMU.
Test 3 was slightly different: I ran the workload in a VM with a single VCPU pinned to a physical CPU and, from the host, used mpstat to analyze where that physical CPU spent its time.
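The percentages quoted for tests 1 and 2 follow directly from the raw timings and scores. A small sketch of the arithmetic is below; the function names are illustrative only.

```c
#include <assert.h>

/* Relative reduction of a cost metric (time), as a percentage of the old value. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Relative gain of a higher-is-better metric (benchmark score). */
static double percent_improvement(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For instance, percent_reduction(4.143, 3.139) gives the real-time figure for test 1, and percent_improvement(4.28, 4.56) gives the score figure for test 2.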
3) perf record -e ${FIFTEEN_HW_EVENTS} -F 4000 -- \
       stress-ng --cpu 0 --timeout 30
Over a period of 30s the cpu running with the emulated PMU spent 34.96% of the time in the host kernel and 55.85% of the time in the guest. The cpu running the partitioned PMU spent 0.97% of its time in the host kernel and 91.06% of its time in the guest.
Taken together, these tests represent a remarkable performance improvement for anything perf related using this new PMU implementation.
Caveats:
Because the most consistent and performant thing to do was untrap PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect the guest.
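As an illustration of the first caveat, the sketch below models what a guest would read from PMCR_EL0 when the N field (bits 15:11) reflects MDCR_EL2.HPMN instead of the full hardware counter count. This is a plain model of the bit manipulation, not code from this series, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PMCR_N_SHIFT	11			/* PMCR_EL0.N lives in bits 15:11 */
#define PMCR_N_MASK	UINT64_C(0x1f)

/* Extract the number of event counters advertised by a PMCR_EL0 value. */
static unsigned int pmcr_n(uint64_t pmcr)
{
	return (unsigned int)((pmcr >> PMCR_N_SHIFT) & PMCR_N_MASK);
}

/* Hypothetical model: same PMCR_EL0 value, but with N replaced by HPMN. */
static uint64_t pmcr_as_seen_by_guest(uint64_t host_pmcr, unsigned int hpmn)
{
	assert(hpmn <= PMCR_N_MASK);
	host_pmcr &= ~(PMCR_N_MASK << PMCR_N_SHIFT);
	return host_pmcr | ((uint64_t)hpmn << PMCR_N_SHIFT);
}
```

On a host with 10 counters and HPMN set to 6, the guest-visible value reports 6 counters while every other bit is unchanged.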
These improvements come at the cost of 7-35 new registers that must be swapped at every vcpu_load and vcpu_put if the feature is enabled. I have been informed KVM would like to avoid paying this cost when possible.
One solution is to make the trapping changes and context swapping lazy, so that they only take place after the guest has actually accessed the PMU and guests that never access the PMU never pay the cost.
This is not done here because it is not crucial to the primary functionality and I thought review would be more productive as soon as I had something complete enough for reviewers to easily play with.
However, this or any better ideas are on the table for inclusion in future re-rolls.
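For discussion purposes, the lazy approach described above could look roughly like the sketch below. Everything here is hypothetical: the structure, the field names and the functions are stand-ins and do not exist in this series. The register swap is reduced to a counter so the control flow is visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU state; not a structure from this series. */
struct lazy_pmu_state {
	bool guest_used_pmu;	/* set on the first trapped guest PMU access */
	unsigned int swaps;	/* stand-in for the real register save/restore */
};

/* Would run from the trap handler the first time the guest touches the PMU. */
static void on_first_pmu_trap(struct lazy_pmu_state *s)
{
	assert(s);
	s->guest_used_pmu = true;	/* from here on, accesses could be untrapped */
}

/* Would run where vcpu_load and vcpu_put swap the PMU registers. */
static void maybe_swap_pmu_context(struct lazy_pmu_state *s)
{
	assert(s);
	if (!s->guest_used_pmu)
		return;		/* guest never used the PMU: pay nothing */
	s->swaps++;		/* placeholder for swapping the 7-35 registers */
}
```

The design trade-off is one extra trap on the first PMU access in exchange for skipping the swap entirely on guests that never use the PMU.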
[1] https://lore.kernel.org/kvmarm/20250213180317.3205285-1-coltonlewis@google.c...
Colton Lewis (16):
  arm64: cpufeature: Add cpucap for HPMN0
  arm64: Generate sign macro for sysreg Enums
  arm64: cpufeature: Add cpucap for PMICNTR
  KVM: arm64: Reorganize PMU functions
  KVM: arm64: Introduce method to partition the PMU
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Keep out of guest counter partition
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Writethrough trapped PMEVTYPER register
  KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
  KVM: arm64: Writethrough trapped PMOVS register
  KVM: arm64: Context switch Partitioned PMU guest registers
  perf: pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Inject recorded guest interrupts
  KVM: arm64: Add ioctl to partition the PMU when supported
  KVM: arm64: selftests: Add test case for partitioned PMU

Marc Zyngier (1):
  KVM: arm64: Cleanup PMU includes
 Documentation/virt/kvm/api.rst                |  16 +
 arch/arm/include/asm/arm_pmuv3.h              |  24 +
 arch/arm64/include/asm/arm_pmuv3.h            |  36 +-
 arch/arm64/include/asm/kvm_host.h             | 208 +++++-
 arch/arm64/include/asm/kvm_pmu.h              |  82 +++
 arch/arm64/kernel/cpufeature.c                |  15 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |  24 +-
 arch/arm64/kvm/debug.c                        |  13 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  65 +-
 arch/arm64/kvm/pmu-emul.c                     | 629 +----------------
 arch/arm64/kvm/pmu-part.c                     | 358 ++++++++++
 arch/arm64/kvm/pmu.c                          | 630 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     |  54 +-
 arch/arm64/tools/cpucaps                      |   2 +
 arch/arm64/tools/gen-sysreg.awk               |   1 +
 arch/arm64/tools/sysreg                       |   6 +-
 drivers/perf/arm_pmuv3.c                      |  55 +-
 include/kvm/arm_pmu.h                         | 199 ------
 include/linux/perf/arm_pmu.h                  |  15 +-
 include/linux/perf/arm_pmuv3.h                |  14 +-
 include/uapi/linux/kvm.h                      |   4 +
 tools/include/uapi/linux/kvm.h                |   2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c |  40 +-
 virt/kvm/kvm_main.c                           |   1 +
 25 files changed, 1616 insertions(+), 879 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_pmu.h
 create mode 100644 arch/arm64/kvm/pmu-part.c
 delete mode 100644 include/kvm/arm_pmu.h
base-commit: 1b85d923ba8c9e6afaf19e26708411adde94fba8
--
2.49.0.1204.g71687c7c1d-goog
Add a capability for FEAT_HPMN0, whether MDCR_EL2.HPMN can specify 0 counters reserved for the guest.
This required changing HPMN0 to an UnsignedEnum in tools/sysreg because otherwise not all the appropriate macros are generated to add it to arm64_cpu_capabilities_arm64_features.
Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 arch/arm64/kernel/cpufeature.c | 8 ++++++++
 arch/arm64/tools/cpucaps       | 1 +
 arch/arm64/tools/sysreg        | 6 +++---
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index a3da020f1d1c..578eea321a60 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -541,6 +541,7 @@ static const struct arm64_ftr_bits ftr_id_mmfr0[] = {
 };
 
 static const struct arm64_ftr_bits ftr_id_aa64dfr0[] = {
+	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_HPMN0_SHIFT, 4, 0),
 	S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_DoubleLock_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_PMSVer_SHIFT, 4, 0),
 	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_CTX_CMPs_SHIFT, 4, 0),
@@ -2884,6 +2885,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, FGT, FGT2)
 	},
+	{
+		.desc = "Hypervisor PMU Partitioning 0 Guest Counters",
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.capability = ARM64_HAS_HPMN0,
+		.matches = has_cpuid_feature,
+		ARM64_CPUID_FIELDS(ID_AA64DFR0_EL1, HPMN0, IMP)
+	},
 #ifdef CONFIG_ARM64_SME
 	{
 		.desc = "Scalable Matrix Extension",
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 10effd4cff6b..5b196ba21629 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -39,6 +39,7 @@ HAS_GIC_CPUIF_SYSREGS
 HAS_GIC_PRIO_MASKING
 HAS_GIC_PRIO_RELAXED_SYNC
 HAS_HCR_NV1
+HAS_HPMN0
 HAS_HCX
 HAS_LDAPR
 HAS_LPA2
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 8a8cf6874298..d29742481754 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1531,9 +1531,9 @@ EndEnum
 EndSysreg
 
 Sysreg	ID_AA64DFR0_EL1	3	0	0	5	0
-Enum	63:60	HPMN0
-	0b0000	UNPREDICTABLE
-	0b0001	DEF
+UnsignedEnum	63:60	HPMN0
+	0b0000	NI
+	0b0001	IMP
 EndEnum
 UnsignedEnum	59:56	ExtTrcBuff
 	0b0000	NI
Hi Colton,
On Mon, Jun 02, 2025 at 07:26:46PM +0000, Colton Lewis wrote:
> Add a capability for FEAT_HPMN0, whether MDCR_EL2.HPMN can specify 0
> counters reserved for the guest.
>
> This required changing HPMN0 to an UnsignedEnum in tools/sysreg
> because otherwise not all the appropriate macros are generated to add
> it to arm64_cpu_capabilities_arm64_features.
>
> Signed-off-by: Colton Lewis <coltonlewis@google.com>
> ---
>  arch/arm64/kernel/cpufeature.c | 8 ++++++++
>  arch/arm64/tools/cpucaps       | 1 +
>  arch/arm64/tools/sysreg        | 6 +++---
>  3 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index a3da020f1d1c..578eea321a60 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -541,6 +541,7 @@ static const struct arm64_ftr_bits ftr_id_mmfr0[] = {
>  };
>
>  static const struct arm64_ftr_bits ftr_id_aa64dfr0[] = {
> +	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_HPMN0_SHIFT, 4, 0),
>  	S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_DoubleLock_SHIFT, 4, 0),
>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_PMSVer_SHIFT, 4, 0),
>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_CTX_CMPs_SHIFT, 4, 0),
> @@ -2884,6 +2885,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
>  		.matches = has_cpuid_feature,
>  		ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, FGT, FGT2)
>  	},
> +	{
> +		.desc = "Hypervisor PMU Partitioning 0 Guest Counters",
nit: just use the FEAT_xxx name for the description (i.e. "HPMN0").

Thanks,
Oliver
Hi Oliver. Thanks for the speedy response.
Oliver Upton <oliver.upton@linux.dev> writes:

> Hi Colton,
>
> On Mon, Jun 02, 2025 at 07:26:46PM +0000, Colton Lewis wrote:
>> Add a capability for FEAT_HPMN0, whether MDCR_EL2.HPMN can specify 0
>> counters reserved for the guest.
>>
>> This required changing HPMN0 to an UnsignedEnum in tools/sysreg
>> because otherwise not all the appropriate macros are generated to add
>> it to arm64_cpu_capabilities_arm64_features.
>>
>> Signed-off-by: Colton Lewis <coltonlewis@google.com>
>> ---
>>  arch/arm64/kernel/cpufeature.c | 8 ++++++++
>>  arch/arm64/tools/cpucaps       | 1 +
>>  arch/arm64/tools/sysreg        | 6 +++---
>>  3 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index a3da020f1d1c..578eea321a60 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -541,6 +541,7 @@ static const struct arm64_ftr_bits ftr_id_mmfr0[] = {
>>  };
>>
>>  static const struct arm64_ftr_bits ftr_id_aa64dfr0[] = {
>> +	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_HPMN0_SHIFT, 4, 0),
>>  	S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_DoubleLock_SHIFT, 4, 0),
>>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_PMSVer_SHIFT, 4, 0),
>>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64DFR0_EL1_CTX_CMPs_SHIFT, 4, 0),
>> @@ -2884,6 +2885,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
>>  		.matches = has_cpuid_feature,
>>  		ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, FGT, FGT2)
>>  	},
>> +	{
>> +		.desc = "Hypervisor PMU Partitioning 0 Guest Counters",
>
> nit: just use the FEAT_xxx name for the description (i.e. "HPMN0").
Okay
There's no reason Enums shouldn't be equivalent to UnsignedEnums and explicitly specify they are unsigned. This will avoid the annoyance I had with HPMN0.
Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 arch/arm64/tools/gen-sysreg.awk | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/tools/gen-sysreg.awk b/arch/arm64/tools/gen-sysreg.awk
index f2a1732cb1f6..fa21a632d9b7 100755
--- a/arch/arm64/tools/gen-sysreg.awk
+++ b/arch/arm64/tools/gen-sysreg.awk
@@ -308,6 +308,7 @@ $1 == "Enum" && (block_current() == "Sysreg" || block_current() == "SysregFields
 	parse_bitdef(reg, field, $2)
 
 	define_field(reg, field, msb, lsb)
+	define_field_sign(reg, field, "false")
 
 	next
 }
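To illustrate why the signedness of an ID register field matters for capability checks, here is a standalone sketch that reads a 4-bit field both ways and compares it against a minimum value. It models the idea only, using ID_AA64DFR0_EL1.HPMN0 (bits 63:60) as the example. The helper names are hypothetical and are not the kernel's cpufeature helpers.

```c
#include <assert.h>
#include <stdint.h>

#define HPMN0_SHIFT	60	/* ID_AA64DFR0_EL1.HPMN0 occupies bits 63:60 */
#define FIELD_WIDTH	4

/* Read a 4-bit ID register field as an unsigned value (0..15). */
static unsigned int field_unsigned(uint64_t reg, unsigned int shift)
{
	assert(shift + FIELD_WIDTH <= 64);
	return (unsigned int)((reg >> shift) & 0xf);
}

/* Read the same field as a signed two's-complement value (-8..7). */
static int field_signed(uint64_t reg, unsigned int shift)
{
	int v = (int)field_unsigned(reg, shift);

	return v >= 8 ? v - 16 : v;
}

/* Hypothetical check: is the field, read with the given sign, >= min? */
static int field_at_least(uint64_t reg, unsigned int shift, int is_signed, int min)
{
	int v = is_signed ? field_signed(reg, shift)
			  : (int)field_unsigned(reg, shift);

	return v >= min;
}
```

A field value of 0b1111 is 15 when read unsigned but -1 when read signed, so the two readings disagree on whether a minimum of 1 (IMP) is met. That is why every field needs an explicit sign definition.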
Add a cpucap for FEAT_PMUv3_PMICNTR, meaning there is a dedicated instruction counter as well as the cycle counter.
Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 arch/arm64/kernel/cpufeature.c | 7 +++++++
 arch/arm64/tools/cpucaps       | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 578eea321a60..e798a706d8fb 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2892,6 +2892,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64DFR0_EL1, HPMN0, IMP)
 	},
+	{
+		.desc = "PMU Dedicated Instruction Counter",
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.capability = ARM64_HAS_PMICNTR,
+		.matches = has_cpuid_feature,
+		ARM64_CPUID_FIELDS(ID_AA64DFR1_EL1, PMICNTR, IMP)
+	},
 #ifdef CONFIG_ARM64_SME
 	{
 		.desc = "Scalable Matrix Extension",
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 5b196ba21629..6dd72fcdd612 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -47,6 +47,7 @@ HAS_LSE_ATOMICS
 HAS_MOPS
 HAS_NESTED_VIRT
 HAS_PAN
+HAS_PMICNTR
 HAS_PMUV3
 HAS_S1PIE
 HAS_S1POE
From: Marc Zyngier <maz@kernel.org>
asm/kvm_host.h includes kvm/arm_pmu.h which includes perf/arm_pmuv3.h which includes asm/arm_pmuv3.h which includes asm/kvm_host.h. This causes compilation problems when trying to use anything defined in any of the headers in any other headers.
Reorganize these tangled headers. In particular:
* Move the declarations defining the interface between KVM and PMU to its own header asm/kvm_pmu.h that can be used without the problem described above.
* Delete kvm/arm_pmu.h. These functions are mostly internal to KVM and should go in asm/kvm_host.h.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 arch/arm64/include/asm/arm_pmuv3.h      |   2 +-
 arch/arm64/include/asm/kvm_host.h       | 190 ++++++++++++++++++++--
 arch/arm64/include/asm/kvm_pmu.h        |  38 +++++
 arch/arm64/kvm/arm.c                    |   1 -
 arch/arm64/kvm/debug.c                  |   1 +
 arch/arm64/kvm/hyp/include/hyp/switch.h |   1 +
 arch/arm64/kvm/pmu-emul.c               |  30 ++--
 arch/arm64/kvm/pmu.c                    |   2 +
 arch/arm64/kvm/sys_regs.c               |   1 +
 include/kvm/arm_pmu.h                   | 199 ------------------------
 include/linux/perf/arm_pmu.h            |  14 +-
 virt/kvm/kvm_main.c                     |   1 +
 12 files changed, 246 insertions(+), 234 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_pmu.h
 delete mode 100644 include/kvm/arm_pmu.h
diff --git a/arch/arm64/include/asm/arm_pmuv3.h b/arch/arm64/include/asm/arm_pmuv3.h index 8a777dec8d88..32c003a7b810 100644 --- a/arch/arm64/include/asm/arm_pmuv3.h +++ b/arch/arm64/include/asm/arm_pmuv3.h @@ -6,7 +6,7 @@ #ifndef __ASM_PMUV3_H #define __ASM_PMUV3_H
-#include <asm/kvm_host.h> +#include <asm/kvm_pmu.h>
#include <asm/cpufeature.h> #include <asm/sysreg.h> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index d941abc6b5ee..f5d97cd8e177 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -14,6 +14,7 @@ #include <linux/arm-smccc.h> #include <linux/bitmap.h> #include <linux/types.h> +#include <linux/irq_work.h> #include <linux/jump_label.h> #include <linux/kvm_types.h> #include <linux/maple_tree.h> @@ -35,7 +36,6 @@
#include <kvm/arm_vgic.h> #include <kvm/arm_arch_timer.h> -#include <kvm/arm_pmu.h>
#define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
@@ -782,6 +782,33 @@ struct vcpu_reset_state {
struct vncr_tlb;
+#if IS_ENABLED(CONFIG_HW_PERF_EVENTS) + +#define KVM_ARMV8_PMU_MAX_COUNTERS 32 + +struct kvm_pmc { + u8 idx; /* index into the pmu->pmc array */ + struct perf_event *perf_event; +}; + +struct kvm_pmu_events { + u64 events_host; + u64 events_guest; +}; + +struct kvm_pmu { + struct irq_work overflow_work; + struct kvm_pmu_events events; + struct kvm_pmc pmc[KVM_ARMV8_PMU_MAX_COUNTERS]; + int irq_num; + bool created; + bool irq_level; +}; +#else +struct kvm_pmu { +}; +#endif + struct kvm_vcpu_arch { struct kvm_cpu_context ctxt;
@@ -1469,25 +1496,11 @@ void kvm_arch_vcpu_ctxflush_fp(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu);
-static inline bool kvm_pmu_counter_deferred(struct perf_event_attr *attr) -{ - return (!has_vhe() && attr->exclude_host); -} - #ifdef CONFIG_KVM -void kvm_set_pmu_events(u64 set, struct perf_event_attr *attr); -void kvm_clr_pmu_events(u64 clr); -bool kvm_set_pmuserenr(u64 val); void kvm_enable_trbe(void); void kvm_disable_trbe(void); void kvm_tracing_set_el1_configuration(u64 trfcr_while_in_guest); #else -static inline void kvm_set_pmu_events(u64 set, struct perf_event_attr *attr) {} -static inline void kvm_clr_pmu_events(u64 clr) {} -static inline bool kvm_set_pmuserenr(u64 val) -{ - return false; -} static inline void kvm_enable_trbe(void) {} static inline void kvm_disable_trbe(void) {} static inline void kvm_tracing_set_el1_configuration(u64 trfcr_while_in_guest) {} @@ -1658,5 +1671,152 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt); void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1); void check_feature_map(void);
+#define kvm_vcpu_has_pmu(vcpu) \ + (vcpu_has_feature(vcpu, KVM_ARM_VCPU_PMU_V3)) + +#if IS_ENABLED(CONFIG_HW_PERF_EVENTS) + +bool kvm_supports_guest_pmuv3(void); +u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx); +void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); +void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); +u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu); +u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1); +void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu); +void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu); +void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu); +void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val); +void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu); +void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu); +bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu); +void kvm_pmu_update_run(struct kvm_vcpu *vcpu); +void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val); +void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val); +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data, + u64 select_idx); +void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu); +int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, + struct kvm_device_attr *attr); +int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, + struct kvm_device_attr *attr); +int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, + struct kvm_device_attr *attr); +int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu); + +struct kvm_pmu_events *kvm_get_pmu_events(void); +void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu); +void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu); + +/* + * Updates the vcpu's view of the pmu events for this cpu. + * Must be called before every vcpu run after disabling interrupts, to ensure + * that an interrupt cannot fire and update the structure. 
+ */ +#define kvm_pmu_update_vcpu_events(vcpu) \ + do { \ + if (!has_vhe() && system_supports_pmuv3()) \ + vcpu->arch.pmu.events = *kvm_get_pmu_events(); \ + } while (0) + +u8 kvm_arm_pmu_get_pmuver_limit(void); +u64 kvm_pmu_evtyper_mask(struct kvm *kvm); +int kvm_arm_set_default_pmu(struct kvm *kvm); +u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm); + +u64 kvm_vcpu_read_pmcr(struct kvm_vcpu *vcpu); +bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx); +void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu); +#else +static inline bool kvm_arm_support_pmu_v3(void) +{ + return false; +} + +static inline u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, + u64 select_idx) +{ + return 0; +} +static inline void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, + u64 select_idx, u64 val) {} +static inline u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu) +{ + return 0; +} +static inline void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) {} +static inline void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) {} +static inline void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {} +static inline void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) {} +static inline void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu) {} +static inline void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {} +static inline bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu) +{ + return false; +} +static inline void kvm_pmu_update_run(struct kvm_vcpu *vcpu) {} +static inline void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val) {} +static inline void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val) {} +static inline void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, + u64 data, u64 select_idx) {} +static inline int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, + struct kvm_device_attr *attr) +{ + return -ENXIO; +} +static inline int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, + struct kvm_device_attr *attr) +{ + return -ENXIO; +} +static inline 
int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, + struct kvm_device_attr *attr) +{ + return -ENXIO; +} +static inline int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu) +{ + return 0; +} +static inline u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1) +{ + return 0; +} + +static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {} +static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {} +static inline void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) {} +static inline void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) {} +static inline u8 kvm_arm_pmu_get_pmuver_limit(void) +{ + return 0; +} +static inline u64 kvm_pmu_evtyper_mask(struct kvm *kvm) +{ + return 0; +} + +static inline int kvm_arm_set_default_pmu(struct kvm *kvm) +{ + return -ENODEV; +} + +static inline u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm) +{ + return 0; +} + +static inline u64 kvm_vcpu_read_pmcr(struct kvm_vcpu *vcpu) +{ + return 0; +} + +static inline bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx) +{ + return false; +} + +static inline void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) {} + +#endif
#endif /* __ARM64_KVM_HOST_H__ */ diff --git a/arch/arm64/include/asm/kvm_pmu.h b/arch/arm64/include/asm/kvm_pmu.h new file mode 100644 index 000000000000..613cddbdbdd8 --- /dev/null +++ b/arch/arm64/include/asm/kvm_pmu.h @@ -0,0 +1,38 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef __KVM_PMU_H +#define __KVM_PMU_H + +/* + * Define the interface between the PMUv3 driver and KVM. + */ +struct perf_event_attr; +struct arm_pmu; + +#define kvm_pmu_counter_deferred(attr) \ + ({ \ + !has_vhe() && (attr)->exclude_host; \ + }) + +#ifdef CONFIG_KVM + +void kvm_set_pmu_events(u64 set, struct perf_event_attr *attr); +void kvm_clr_pmu_events(u64 clr); +bool kvm_set_pmuserenr(u64 val); +void kvm_vcpu_pmu_resync_el0(void); +void kvm_host_pmu_init(struct arm_pmu *pmu); + +#else + +static inline void kvm_set_pmu_events(u64 set, struct perf_event_attr *attr) {} +static inline void kvm_clr_pmu_events(u64 clr) {} +static inline bool kvm_set_pmuserenr(u64 val) +{ + return false; +} +static inline void kvm_vcpu_pmu_resync_el0(void) {} +static inline void kvm_host_pmu_init(struct arm_pmu *pmu) {} + +#endif + +#endif diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 36cfcffb40d8..3b9c003f2ea6 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -43,7 +43,6 @@ #include <asm/sections.h>
#include <kvm/arm_hypercalls.h> -#include <kvm/arm_pmu.h> #include <kvm/arm_psci.h>
#include "sys_regs.h" diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c index 0e4c805e7e89..7fb1d9e7180f 100644 --- a/arch/arm64/kvm/debug.c +++ b/arch/arm64/kvm/debug.c @@ -9,6 +9,7 @@
#include <linux/kvm_host.h> #include <linux/hw_breakpoint.h> +#include <linux/perf/arm_pmuv3.h>
#include <asm/debug-monitors.h> #include <asm/kvm_asm.h> diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index eef310cdbdbd..d407e716df1b 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -14,6 +14,7 @@ #include <linux/kvm_host.h> #include <linux/types.h> #include <linux/jump_label.h> +#include <linux/perf/arm_pmuv3.h> #include <uapi/linux/psci.h>
#include <kvm/arm_psci.h> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index 25c29107f13f..472a2ab6938f 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -8,11 +8,10 @@ #include <linux/kvm.h> #include <linux/kvm_host.h> #include <linux/list.h> -#include <linux/perf_event.h> #include <linux/perf/arm_pmu.h> +#include <linux/perf/arm_pmuv3.h> #include <linux/uaccess.h> #include <asm/kvm_emulate.h> -#include <kvm/arm_pmu.h> #include <kvm/arm_vgic.h>
#define PERF_ATTR_CFG1_COUNTER_64BIT BIT(0) @@ -24,6 +23,8 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc); static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc); static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc);
+#define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS) + bool kvm_supports_guest_pmuv3(void) { guard(mutex)(&arm_pmus_lock); @@ -258,6 +259,16 @@ void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) pmu->pmc[i].idx = i; }
+static u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu) +{ + u64 val = FIELD_GET(ARMV8_PMU_PMCR_N, kvm_vcpu_read_pmcr(vcpu)); + + if (val == 0) + return BIT(ARMV8_PMU_CYCLE_IDX); + else + return GENMASK(val - 1, 0) | BIT(ARMV8_PMU_CYCLE_IDX); +} + /** * kvm_pmu_vcpu_destroy - free perf event of PMU for cpu * @vcpu: The vcpu pointer @@ -315,16 +326,6 @@ u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu) return mask & ~kvm_pmu_hyp_counter_mask(vcpu); }
-u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu) -{ - u64 val = FIELD_GET(ARMV8_PMU_PMCR_N, kvm_vcpu_read_pmcr(vcpu)); - - if (val == 0) - return BIT(ARMV8_PMU_CYCLE_IDX); - else - return GENMASK(val - 1, 0) | BIT(ARMV8_PMU_CYCLE_IDX); -} - static void kvm_pmc_enable_perf_event(struct kvm_pmc *pmc) { if (!pmc->perf_event) { @@ -784,6 +785,11 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data, kvm_pmu_create_perf_event(pmc); }
+struct arm_pmu_entry { + struct list_head entry; + struct arm_pmu *arm_pmu; +}; + void kvm_host_pmu_init(struct arm_pmu *pmu) { struct arm_pmu_entry *entry; diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c index 6b48a3d16d0d..8bfc6b0a85f6 100644 --- a/arch/arm64/kvm/pmu.c +++ b/arch/arm64/kvm/pmu.c @@ -8,6 +8,8 @@ #include <linux/perf/arm_pmu.h> #include <linux/perf/arm_pmuv3.h>
+#include <asm/kvm_pmu.h> + static DEFINE_PER_CPU(struct kvm_pmu_events, kvm_pmu_events);
/* diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 707c651aff03..d368eeb4f88e 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -18,6 +18,7 @@ #include <linux/printk.h> #include <linux/uaccess.h> #include <linux/irqchip/arm-gic-v3.h> +#include <linux/perf/arm_pmuv3.h>
#include <asm/arm_pmuv3.h> #include <asm/cacheflush.h> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h deleted file mode 100644 index 96754b51b411..000000000000 --- a/include/kvm/arm_pmu.h +++ /dev/null @@ -1,199 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2015 Linaro Ltd. - * Author: Shannon Zhao shannon.zhao@linaro.org - */ - -#ifndef __ASM_ARM_KVM_PMU_H -#define __ASM_ARM_KVM_PMU_H - -#include <linux/perf_event.h> -#include <linux/perf/arm_pmuv3.h> - -#define KVM_ARMV8_PMU_MAX_COUNTERS 32 - -#if IS_ENABLED(CONFIG_HW_PERF_EVENTS) && IS_ENABLED(CONFIG_KVM) -struct kvm_pmc { - u8 idx; /* index into the pmu->pmc array */ - struct perf_event *perf_event; -}; - -struct kvm_pmu_events { - u64 events_host; - u64 events_guest; -}; - -struct kvm_pmu { - struct irq_work overflow_work; - struct kvm_pmu_events events; - struct kvm_pmc pmc[KVM_ARMV8_PMU_MAX_COUNTERS]; - int irq_num; - bool created; - bool irq_level; -}; - -struct arm_pmu_entry { - struct list_head entry; - struct arm_pmu *arm_pmu; -}; - -bool kvm_supports_guest_pmuv3(void); -#define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS) -u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx); -void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); -void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); -u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu); -u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu); -u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1); -void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu); -void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu); -void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val); -void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu); -void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu); -bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu); -void kvm_pmu_update_run(struct kvm_vcpu *vcpu); -void 
kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val);
-void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val);
-void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
-				    u64 select_idx);
-void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu);
-int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu,
-			    struct kvm_device_attr *attr);
-int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu,
-			    struct kvm_device_attr *attr);
-int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu,
-			    struct kvm_device_attr *attr);
-int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
-
-struct kvm_pmu_events *kvm_get_pmu_events(void);
-void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu);
-void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu);
-void kvm_vcpu_pmu_resync_el0(void);
-
-#define kvm_vcpu_has_pmu(vcpu) \
-	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_PMU_V3))
-
-/*
- * Updates the vcpu's view of the pmu events for this cpu.
- * Must be called before every vcpu run after disabling interrupts, to ensure
- * that an interrupt cannot fire and update the structure.
- */
-#define kvm_pmu_update_vcpu_events(vcpu) \
-	do { \
-		if (!has_vhe() && system_supports_pmuv3()) \
-			vcpu->arch.pmu.events = *kvm_get_pmu_events(); \
-	} while (0)
-
-u8 kvm_arm_pmu_get_pmuver_limit(void);
-u64 kvm_pmu_evtyper_mask(struct kvm *kvm);
-int kvm_arm_set_default_pmu(struct kvm *kvm);
-u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm);
-
-u64 kvm_vcpu_read_pmcr(struct kvm_vcpu *vcpu);
-bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx);
-void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu);
-#else
-struct kvm_pmu {
-};
-
-static inline bool kvm_supports_guest_pmuv3(void)
-{
-	return false;
-}
-
-#define kvm_arm_pmu_irq_initialized(v) (false)
-static inline u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu,
-					    u64 select_idx)
-{
-	return 0;
-}
-static inline void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu,
-					     u64 select_idx, u64 val) {}
-static inline void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu,
-						  u64 select_idx, u64 val) {}
-static inline u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu)
-{
-	return 0;
-}
-static inline u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu)
-{
-	return 0;
-}
-static inline void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu) {}
-static inline void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {}
-static inline void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) {}
-static inline void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu) {}
-static inline void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {}
-static inline bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu)
-{
-	return false;
-}
-static inline void kvm_pmu_update_run(struct kvm_vcpu *vcpu) {}
-static inline void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val) {}
-static inline void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val) {}
-static inline void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu,
-						  u64 data, u64 select_idx) {}
-static inline int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu,
-					  struct kvm_device_attr *attr)
-{
-	return -ENXIO;
-}
-static inline int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu,
-					  struct kvm_device_attr *attr)
-{
-	return -ENXIO;
-}
-static inline int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu,
-					  struct kvm_device_attr *attr)
-{
-	return -ENXIO;
-}
-static inline int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu)
-{
-	return 0;
-}
-static inline u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
-{
-	return 0;
-}
-
-#define kvm_vcpu_has_pmu(vcpu) ({ false; })
-static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {}
-static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {}
-static inline void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) {}
-static inline void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) {}
-static inline u8 kvm_arm_pmu_get_pmuver_limit(void)
-{
-	return 0;
-}
-static inline u64 kvm_pmu_evtyper_mask(struct kvm *kvm)
-{
-	return 0;
-}
-static inline void kvm_vcpu_pmu_resync_el0(void) {}
-
-static inline int kvm_arm_set_default_pmu(struct kvm *kvm)
-{
-	return -ENODEV;
-}
-
-static inline u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm)
-{
-	return 0;
-}
-
-static inline u64 kvm_vcpu_read_pmcr(struct kvm_vcpu *vcpu)
-{
-	return 0;
-}
-
-static inline bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx)
-{
-	return false;
-}
-
-static inline void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) {}
-
-#endif
-
-#endif
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 6dc5e0cd76ca..1de206b09616 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -13,6 +13,9 @@
 #include <linux/platform_device.h>
 #include <linux/sysfs.h>
 #include <asm/cputype.h>
+#ifdef CONFIG_ARM64
+#include <asm/kvm_pmu.h>
+#endif

 #ifdef CONFIG_ARM_PMU

@@ -25,6 +28,11 @@
 #else
 #define ARMPMU_MAX_HWEVENTS 33
 #endif
+
+#ifdef CONFIG_ARM
+#define kvm_host_pmu_init(_x) { (void)_x; }
+#endif
+
 /*
  * ARM PMU hw_event flags
  */
@@ -170,12 +178,6 @@ int arm_pmu_acpi_probe(armpmu_init_fn init_fn);
 static inline int arm_pmu_acpi_probe(armpmu_init_fn init_fn) { return 0; }
 #endif

-#ifdef CONFIG_KVM
-void kvm_host_pmu_init(struct arm_pmu *pmu);
-#else
-#define kvm_host_pmu_init(x) do { } while(0)
-#endif
-
 bool arm_pmu_irq_is_nmi(void);

 /* Internal functions only for core arm_pmu code */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e85b33a92624..d2263b5a0789 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
 #include <linux/suspend.h>
+#include <linux/perf_event.h>

 #include <asm/processor.h>
 #include <asm/ioctl.h>
On Mon, Jun 02, 2025, Colton Lewis wrote:
> - Delete kvm/arm_pmu.h. These functions are mostly internal to KVM and should go in asm/kvm_host.h.
Ha! I'm a hair too late, as usual. I _just_ resurrected a patch[*] to move and rename all of the <kvm/arm_xxx.h> headers to <asm/kvm_xxx.h>. If only I had posted on Friday when they were ready :-)
It's a relatively small series (mostly arm64 code movement), but it does touch all architectures due to giving the same treatment to kvm/iodev.h (and purging include/kvm entirely).
Any preference/thoughts on how to proceed? My stuff obviously isn't urgent since I sat on the patches for almost two years. On the other hand, the almost pure code movement would be a nice precursor to this patch, e.g. move and rename to asm/kvm_pmu.h before extracting chunks of code into asm/kvm_host.h.
[*] https://lore.kernel.org/all/20230916003118.2540661-15-seanjc@google.com
Diff stats for context:

---
Anish Ghulati (1):
  KVM: arm64: Move arm_{psci,hypercalls}.h to an internal KVM path

Sean Christopherson (7):
  KVM: arm64: Include KVM headers to get forward declarations
  KVM: arm64: Move ARM specific headers in include/kvm to arch directory
  KVM: Move include/kvm/iodev.h to include/linux as kvm_iodev.h
  KVM: MIPS: Stop adding virt/kvm to the arch include path
  KVM: PPC: Stop adding virt/kvm to the arch include path
  KVM: s390: Stop adding virt/kvm to the arch include path
  KVM: Standardize include paths across all architectures

 MAINTAINERS | 1 -
 .../arm64/include/asm/kvm_arch_timer.h | 2 ++
 arch/arm64/include/asm/kvm_host.h | 7 +++----
 include/kvm/arm_pmu.h => arch/arm64/include/asm/kvm_pmu.h | 2 ++
 .../kvm/arm_vgic.h => arch/arm64/include/asm/kvm_vgic.h | 2 +-
 arch/arm64/kvm/Makefile | 2 --
 arch/arm64/kvm/arch_timer.c | 5 ++---
 arch/arm64/kvm/arm.c | 6 +++---
 {include => arch/arm64}/kvm/arm_hypercalls.h | 0
 {include => arch/arm64}/kvm/arm_psci.h | 0
 arch/arm64/kvm/guest.c | 2 +-
 arch/arm64/kvm/handle_exit.c | 2 +-
 arch/arm64/kvm/hyp/Makefile | 6 +++---
 arch/arm64/kvm/hyp/include/hyp/switch.h | 4 ++--
 arch/arm64/kvm/hyp/nvhe/switch.c | 4 ++--
 arch/arm64/kvm/hyp/vhe/switch.c | 4 ++--
 arch/arm64/kvm/hypercalls.c | 4 ++--
 arch/arm64/kvm/pmu-emul.c | 4 ++--
 arch/arm64/kvm/psci.c | 4 ++--
 arch/arm64/kvm/pvtime.c | 2 +-
 arch/arm64/kvm/reset.c | 3 +--
 arch/arm64/kvm/trace_arm.h | 2 +-
 arch/arm64/kvm/trng.c | 2 +-
 arch/arm64/kvm/vgic/vgic-debug.c | 2 +-
 arch/arm64/kvm/vgic/vgic-init.c | 2 +-
 arch/arm64/kvm/vgic/vgic-irqfd.c | 2 +-
 arch/arm64/kvm/vgic/vgic-kvm-device.c | 2 +-
 arch/arm64/kvm/vgic/vgic-mmio-v2.c | 4 ++--
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 4 ++--
 arch/arm64/kvm/vgic/vgic-mmio.c | 6 +++---
 arch/arm64/kvm/vgic/vgic-v2.c | 2 +-
 arch/arm64/kvm/vgic/vgic-v3-nested.c | 3 +--
 arch/arm64/kvm/vgic/vgic-v3.c | 2 +-
 arch/loongarch/include/asm/kvm_eiointc.h | 2 +-
 arch/loongarch/include/asm/kvm_ipi.h | 2 +-
 arch/loongarch/include/asm/kvm_pch_pic.h | 2 +-
 arch/mips/include/asm/kvm_host.h | 3 +--
 arch/mips/kvm/Makefile | 2 --
 arch/powerpc/kvm/Makefile | 2 --
 arch/powerpc/kvm/mpic.c | 2 +-
 arch/riscv/kvm/Makefile | 2 --
 arch/riscv/kvm/aia_aplic.c | 2 +-
 arch/riscv/kvm/aia_imsic.c | 2 +-
 arch/s390/kvm/Makefile | 2 --
 arch/x86/kvm/Makefile | 1 -
 arch/x86/kvm/i8254.h | 2 +-
 arch/x86/kvm/ioapic.h | 2 +-
 arch/x86/kvm/irq.h | 2 +-
 arch/x86/kvm/lapic.h | 2 +-
 include/{kvm/iodev.h => linux/kvm_iodev.h} | 0
 virt/kvm/Makefile.kvm | 2 ++
 virt/kvm/coalesced_mmio.c | 3 +--
 virt/kvm/eventfd.c | 2 +-
 virt/kvm/kvm_main.c | 3 +--
 54 files changed, 64 insertions(+), 77 deletions(-)
 rename include/kvm/arm_arch_timer.h => arch/arm64/include/asm/kvm_arch_timer.h (98%)
 rename include/kvm/arm_pmu.h => arch/arm64/include/asm/kvm_pmu.h (99%)
 rename include/kvm/arm_vgic.h => arch/arm64/include/asm/kvm_vgic.h (99%)
 rename {include => arch/arm64}/kvm/arm_hypercalls.h (100%)
 rename {include => arch/arm64}/kvm/arm_psci.h (100%)
 rename include/{kvm/iodev.h => linux/kvm_iodev.h} (100%)

base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
--
Sean Christopherson <seanjc@google.com> writes:
> On Mon, Jun 02, 2025, Colton Lewis wrote:
>> - Delete kvm/arm_pmu.h. These functions are mostly internal to KVM and should go in asm/kvm_host.h.
> Ha! I'm a hair too late, as usual. I _just_ resurrected a patch[*] to move and rename all of the <kvm/arm_xxx.h> headers to <asm/kvm_xxx.h>. If only I had posted on Friday when they were ready :-)
Great minds think alike :) (In this case the other one was Marc)
> It's a relatively small series (mostly arm64 code movement), but it does touch all architectures due to giving the same treatment to kvm/iodev.h (and purging include/kvm entirely).
> Any preference/thoughts on how to proceed? My stuff obviously isn't urgent since I sat on the patches for almost two years. On the other hand, the almost pure code movement would be a nice precursor to this patch, e.g. move and rename to asm/kvm_pmu.h before extracting chunks of code into asm/kvm_host.h.
Letting the rename go first is fine and won't inconvenience me. I'm expecting this series to take a while to be accepted, and Oliver told me I'll probably need a reroll to make my context switching lazy. Thanks for asking.
> [*] https://lore.kernel.org/all/20230916003118.2540661-15-seanjc@google.com
A lot of functions in pmu-emul.c aren't specific to the emulated PMU implementation. Move them to the more appropriate pmu.c file where shared PMU functions should live.
Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 arch/arm64/include/asm/kvm_host.h |   2 +
 arch/arm64/kvm/pmu-emul.c         | 611 +-----------------------------
 arch/arm64/kvm/pmu.c              | 610 +++++++++++++++++++++++++++++
 3 files changed, 613 insertions(+), 610 deletions(-)
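Among the code relocated to pmu.c is the KVM_ARM_VCPU_PMU_V3_FILTER handling, whose in-code comment states that the default depends on the first applied filter: if it allows events, the default is to deny, and vice versa. A minimal userspace sketch of that rule follows. It assumes a plain byte array in place of the kernel bitmap, and the names sketch_filter, sketch_apply_filter and SKETCH_NR_EVENTS are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for illustration; not kernel types. */
#define SKETCH_NR_EVENTS 64u

enum sketch_action { SKETCH_ALLOW, SKETCH_DENY };

struct sketch_filter {
	bool initialized;                  /* plays the role of "pmu_filter != NULL" */
	uint8_t allowed[SKETCH_NR_EVENTS]; /* 1 = event exposed to the guest */
};

/* Returns 0 on success, -1 when the requested range exceeds the event space. */
static int sketch_apply_filter(struct sketch_filter *f, enum sketch_action action,
			       uint16_t base_event, uint16_t nevents)
{
	if ((uint32_t)base_event + nevents > SKETCH_NR_EVENTS)
		return -1;

	if (!f->initialized) {
		/*
		 * The first filter decides the default for every other event:
		 * an ALLOW filter starts from "deny all", a DENY filter from
		 * "allow all".
		 */
		memset(f->allowed, action == SKETCH_ALLOW ? 0 : 1, sizeof(f->allowed));
		f->initialized = true;
	}

	/* Then the requested range itself is set (ALLOW) or cleared (DENY). */
	memset(&f->allowed[base_event], action == SKETCH_ALLOW ? 1 : 0, nevents);
	return 0;
}
```

In this sketch, an initial ALLOW of events 8..11 leaves every other event denied, and a later DENY of event 9 clears only that entry. The real code additionally rejects changes once the VM has run and validates the action value, which the sketch omits.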
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f5d97cd8e177..3482d7602a5b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1706,6 +1706,7 @@ int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
 struct kvm_pmu_events *kvm_get_pmu_events(void);
 void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu);
 void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu);
+bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu);

 /*
  * Updates the vcpu's view of the pmu events for this cpu.
@@ -1719,6 +1720,7 @@ void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu);
 	} while (0)

 u8 kvm_arm_pmu_get_pmuver_limit(void);
+u32 kvm_pmu_event_mask(struct kvm *kvm);
 u64 kvm_pmu_evtyper_mask(struct kvm *kvm);
 int kvm_arm_set_default_pmu(struct kvm *kvm);
 u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm);
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 472a2ab6938f..ff86c66e1b48 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -16,21 +16,10 @@

 #define PERF_ATTR_CFG1_COUNTER_64BIT BIT(0)

-static LIST_HEAD(arm_pmus);
-static DEFINE_MUTEX(arm_pmus_lock);
-
 static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc);
 static void kvm_pmu_release_perf_event(struct kvm_pmc *pmc);
 static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc);

-#define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS)
-
-bool kvm_supports_guest_pmuv3(void)
-{
-	guard(mutex)(&arm_pmus_lock);
-	return !list_empty(&arm_pmus);
-}
-
 static struct kvm_vcpu *kvm_pmc_to_vcpu(const struct kvm_pmc *pmc)
 {
 	return container_of(pmc, struct kvm_vcpu, arch.pmu.pmc[pmc->idx]);
@@ -41,46 +30,6 @@ static struct kvm_pmc *kvm_vcpu_idx_to_pmc(struct kvm_vcpu *vcpu, int cnt_idx)
 	return &vcpu->arch.pmu.pmc[cnt_idx];
 }

-static u32 __kvm_pmu_event_mask(unsigned int pmuver)
-{
-	switch (pmuver) {
-	case ID_AA64DFR0_EL1_PMUVer_IMP:
-		return GENMASK(9, 0);
-	case ID_AA64DFR0_EL1_PMUVer_V3P1:
-	case ID_AA64DFR0_EL1_PMUVer_V3P4:
-	case ID_AA64DFR0_EL1_PMUVer_V3P5:
-	case ID_AA64DFR0_EL1_PMUVer_V3P7:
-		return GENMASK(15, 0);
-	default: /* Shouldn't be here, just for sanity */
-		WARN_ONCE(1, "Unknown PMU version %d\n", pmuver);
-		return 0;
-	}
-}
-
-static u32 kvm_pmu_event_mask(struct kvm *kvm)
-{
-	u64 dfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64DFR0_EL1);
-	u8 pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, dfr0);
-
-	return __kvm_pmu_event_mask(pmuver);
-}
-
-u64 kvm_pmu_evtyper_mask(struct kvm *kvm)
-{
-	u64 mask = ARMV8_PMU_EXCLUDE_EL1 | ARMV8_PMU_EXCLUDE_EL0 |
-		   kvm_pmu_event_mask(kvm);
-
-	if (kvm_has_feat(kvm, ID_AA64PFR0_EL1, EL2, IMP))
-		mask |= ARMV8_PMU_INCLUDE_EL2;
-
-	if (kvm_has_feat(kvm, ID_AA64PFR0_EL1, EL3, IMP))
-		mask |= ARMV8_PMU_EXCLUDE_NS_EL0 |
-			ARMV8_PMU_EXCLUDE_NS_EL1 |
-			ARMV8_PMU_EXCLUDE_EL3;
-
-	return mask;
-}
-
 /**
  * kvm_pmc_is_64bit - determine if counter is 64bit
  * @pmc: counter context
@@ -371,7 +320,7 @@ void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val)
  * counter where the values of the global enable control, PMOVSSET_EL0[n], and
  * PMINTENSET_EL1[n] are all 1.
  */
-static bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
+bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
 {
 	u64 reg = __vcpu_sys_reg(vcpu, PMOVSSET_EL0);

@@ -394,24 +343,6 @@ static bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
 	return reg;
 }

-static void kvm_pmu_update_state(struct kvm_vcpu *vcpu)
-{
-	struct kvm_pmu *pmu = &vcpu->arch.pmu;
-	bool overflow;
-
-	overflow = kvm_pmu_overflow_status(vcpu);
-	if (pmu->irq_level == overflow)
-		return;
-
-	pmu->irq_level = overflow;
-
-	if (likely(irqchip_in_kernel(vcpu->kvm))) {
-		int ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu,
-					      pmu->irq_num, overflow, pmu);
-		WARN_ON(ret);
-	}
-}
-
 bool kvm_pmu_should_notify_user(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = &vcpu->arch.pmu;
@@ -437,43 +368,6 @@ void kvm_pmu_update_run(struct kvm_vcpu *vcpu)
 	regs->device_irq_level |= KVM_ARM_DEV_PMU;
 }

-/**
- * kvm_pmu_flush_hwstate - flush pmu state to cpu
- * @vcpu: The vcpu pointer
- *
- * Check if the PMU has overflowed while we were running in the host, and inject
- * an interrupt if that was the case.
- */
-void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu)
-{
-	kvm_pmu_update_state(vcpu);
-}
-
-/**
- * kvm_pmu_sync_hwstate - sync pmu state from cpu
- * @vcpu: The vcpu pointer
- *
- * Check if the PMU has overflowed while we were running in the guest, and
- * inject an interrupt if that was the case.
- */
-void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu)
-{
-	kvm_pmu_update_state(vcpu);
-}
-
-/*
- * When perf interrupt is an NMI, we cannot safely notify the vcpu corresponding
- * to the event.
- * This is why we need a callback to do it once outside of the NMI context.
- */
-static void kvm_pmu_perf_overflow_notify_vcpu(struct irq_work *work)
-{
-	struct kvm_vcpu *vcpu;
-
-	vcpu = container_of(work, struct kvm_vcpu, arch.pmu.overflow_work);
-	kvm_vcpu_kick(vcpu);
-}
-
 /*
  * Perform an increment on any of the counters described in @mask,
  * generating the overflow if required, and propagate it as a chained
@@ -785,137 +679,6 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
 	kvm_pmu_create_perf_event(pmc);
 }

-struct arm_pmu_entry { - struct list_head entry; - struct arm_pmu *arm_pmu; -}; - -void kvm_host_pmu_init(struct arm_pmu *pmu) -{ - struct arm_pmu_entry *entry; - - /* - * Check the sanitised PMU version for the system, as KVM does not - * support implementations where PMUv3 exists on a subset of CPUs. - */ - if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit())) - return; - - guard(mutex)(&arm_pmus_lock); - - entry = kmalloc(sizeof(*entry), GFP_KERNEL); - if (!entry) - return; - - entry->arm_pmu = pmu; - list_add_tail(&entry->entry, &arm_pmus); -} - -static struct arm_pmu *kvm_pmu_probe_armpmu(void) -{ - struct arm_pmu_entry *entry; - struct arm_pmu *pmu; - int cpu; - - guard(mutex)(&arm_pmus_lock); - - /* - * It is safe to use a stale cpu to iterate the list of PMUs so long as - * the same value is used for the entirety of the loop. Given this, and - * the fact that no percpu data is used for the lookup there is no need - * to disable preemption. - * - * It is still necessary to get a valid cpu, though, to probe for the - * default PMU instance as userspace is not required to specify a PMU - * type. In order to uphold the preexisting behavior KVM selects the - * PMU instance for the core during vcpu init. A dependent use - * case would be a user with disdain of all things big.LITTLE that - * affines the VMM to a particular cluster of cores. - * - * In any case, userspace should just do the sane thing and use the UAPI - * to select a PMU type directly. But, be wary of the baggage being - * carried here. 
- */ - cpu = raw_smp_processor_id(); - list_for_each_entry(entry, &arm_pmus, entry) { - pmu = entry->arm_pmu; - - if (cpumask_test_cpu(cpu, &pmu->supported_cpus)) - return pmu; - } - - return NULL; -} - -static u64 __compute_pmceid(struct arm_pmu *pmu, bool pmceid1) -{ - u32 hi[2], lo[2]; - - bitmap_to_arr32(lo, pmu->pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS); - bitmap_to_arr32(hi, pmu->pmceid_ext_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS); - - return ((u64)hi[pmceid1] << 32) | lo[pmceid1]; -} - -static u64 compute_pmceid0(struct arm_pmu *pmu) -{ - u64 val = __compute_pmceid(pmu, 0); - - /* always support SW_INCR */ - val |= BIT(ARMV8_PMUV3_PERFCTR_SW_INCR); - /* always support CHAIN */ - val |= BIT(ARMV8_PMUV3_PERFCTR_CHAIN); - return val; -} - -static u64 compute_pmceid1(struct arm_pmu *pmu) -{ - u64 val = __compute_pmceid(pmu, 1); - - /* - * Don't advertise STALL_SLOT*, as PMMIR_EL0 is handled - * as RAZ - */ - val &= ~(BIT_ULL(ARMV8_PMUV3_PERFCTR_STALL_SLOT - 32) | - BIT_ULL(ARMV8_PMUV3_PERFCTR_STALL_SLOT_FRONTEND - 32) | - BIT_ULL(ARMV8_PMUV3_PERFCTR_STALL_SLOT_BACKEND - 32)); - return val; -} - -u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1) -{ - struct arm_pmu *cpu_pmu = vcpu->kvm->arch.arm_pmu; - unsigned long *bmap = vcpu->kvm->arch.pmu_filter; - u64 val, mask = 0; - int base, i, nr_events; - - if (!pmceid1) { - val = compute_pmceid0(cpu_pmu); - base = 0; - } else { - val = compute_pmceid1(cpu_pmu); - base = 32; - } - - if (!bmap) - return val; - - nr_events = kvm_pmu_event_mask(vcpu->kvm) + 1; - - for (i = 0; i < 32; i += 8) { - u64 byte; - - byte = bitmap_get_value8(bmap, base + i); - mask |= byte << i; - if (nr_events >= (0x4000 + base + 32)) { - byte = bitmap_get_value8(bmap, 0x4000 + base + i); - mask |= byte << (32 + i); - } - } - - return val & mask; -} - void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) { u64 mask = kvm_pmu_implemented_counter_mask(vcpu); @@ -927,378 +690,6 @@ void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) 
kvm_pmu_reprogram_counter_mask(vcpu, mask); }
-int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu) -{ - if (!vcpu->arch.pmu.created) - return -EINVAL; - - /* - * A valid interrupt configuration for the PMU is either to have a - * properly configured interrupt number and using an in-kernel - * irqchip, or to not have an in-kernel GIC and not set an IRQ. - */ - if (irqchip_in_kernel(vcpu->kvm)) { - int irq = vcpu->arch.pmu.irq_num; - /* - * If we are using an in-kernel vgic, at this point we know - * the vgic will be initialized, so we can check the PMU irq - * number against the dimensions of the vgic and make sure - * it's valid. - */ - if (!irq_is_ppi(irq) && !vgic_valid_spi(vcpu->kvm, irq)) - return -EINVAL; - } else if (kvm_arm_pmu_irq_initialized(vcpu)) { - return -EINVAL; - } - - return 0; -} - -static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu) -{ - if (irqchip_in_kernel(vcpu->kvm)) { - int ret; - - /* - * If using the PMU with an in-kernel virtual GIC - * implementation, we require the GIC to be already - * initialized when initializing the PMU. - */ - if (!vgic_initialized(vcpu->kvm)) - return -ENODEV; - - if (!kvm_arm_pmu_irq_initialized(vcpu)) - return -ENXIO; - - ret = kvm_vgic_set_owner(vcpu, vcpu->arch.pmu.irq_num, - &vcpu->arch.pmu); - if (ret) - return ret; - } - - init_irq_work(&vcpu->arch.pmu.overflow_work, - kvm_pmu_perf_overflow_notify_vcpu); - - vcpu->arch.pmu.created = true; - return 0; -} - -/* - * For one VM the interrupt type must be same for each vcpu. - * As a PPI, the interrupt number is the same for all vcpus, - * while as an SPI it must be a separate number per vcpu. 
- */ -static bool pmu_irq_is_valid(struct kvm *kvm, int irq) -{ - unsigned long i; - struct kvm_vcpu *vcpu; - - kvm_for_each_vcpu(i, vcpu, kvm) { - if (!kvm_arm_pmu_irq_initialized(vcpu)) - continue; - - if (irq_is_ppi(irq)) { - if (vcpu->arch.pmu.irq_num != irq) - return false; - } else { - if (vcpu->arch.pmu.irq_num == irq) - return false; - } - } - - return true; -} - -/** - * kvm_arm_pmu_get_max_counters - Return the max number of PMU counters. - * @kvm: The kvm pointer - */ -u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm) -{ - struct arm_pmu *arm_pmu = kvm->arch.arm_pmu; - - /* - * PMUv3 requires that all event counters are capable of counting any - * event, though the same may not be true of non-PMUv3 hardware. - */ - if (cpus_have_final_cap(ARM64_WORKAROUND_PMUV3_IMPDEF_TRAPS)) - return 1; - - /* - * The arm_pmu->cntr_mask considers the fixed counter(s) as well. - * Ignore those and return only the general-purpose counters. - */ - return bitmap_weight(arm_pmu->cntr_mask, ARMV8_PMU_MAX_GENERAL_COUNTERS); -} - -static void kvm_arm_set_nr_counters(struct kvm *kvm, unsigned int nr) -{ - kvm->arch.nr_pmu_counters = nr; - - /* Reset MDCR_EL2.HPMN behind the vcpus' back... */ - if (test_bit(KVM_ARM_VCPU_HAS_EL2, kvm->arch.vcpu_features)) { - struct kvm_vcpu *vcpu; - unsigned long i; - - kvm_for_each_vcpu(i, vcpu, kvm) { - u64 val = __vcpu_sys_reg(vcpu, MDCR_EL2); - val &= ~MDCR_EL2_HPMN; - val |= FIELD_PREP(MDCR_EL2_HPMN, kvm->arch.nr_pmu_counters); - __vcpu_sys_reg(vcpu, MDCR_EL2) = val; - } - } -} - -static void kvm_arm_set_pmu(struct kvm *kvm, struct arm_pmu *arm_pmu) -{ - lockdep_assert_held(&kvm->arch.config_lock); - - kvm->arch.arm_pmu = arm_pmu; - kvm_arm_set_nr_counters(kvm, kvm_arm_pmu_get_max_counters(kvm)); -} - -/** - * kvm_arm_set_default_pmu - No PMU set, get the default one. 
- * @kvm: The kvm pointer - * - * The observant among you will notice that the supported_cpus - * mask does not get updated for the default PMU even though it - * is quite possible the selected instance supports only a - * subset of cores in the system. This is intentional, and - * upholds the preexisting behavior on heterogeneous systems - * where vCPUs can be scheduled on any core but the guest - * counters could stop working. - */ -int kvm_arm_set_default_pmu(struct kvm *kvm) -{ - struct arm_pmu *arm_pmu = kvm_pmu_probe_armpmu(); - - if (!arm_pmu) - return -ENODEV; - - kvm_arm_set_pmu(kvm, arm_pmu); - return 0; -} - -static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id) -{ - struct kvm *kvm = vcpu->kvm; - struct arm_pmu_entry *entry; - struct arm_pmu *arm_pmu; - int ret = -ENXIO; - - lockdep_assert_held(&kvm->arch.config_lock); - mutex_lock(&arm_pmus_lock); - - list_for_each_entry(entry, &arm_pmus, entry) { - arm_pmu = entry->arm_pmu; - if (arm_pmu->pmu.type == pmu_id) { - if (kvm_vm_has_ran_once(kvm) || - (kvm->arch.pmu_filter && kvm->arch.arm_pmu != arm_pmu)) { - ret = -EBUSY; - break; - } - - kvm_arm_set_pmu(kvm, arm_pmu); - cpumask_copy(kvm->arch.supported_cpus, &arm_pmu->supported_cpus); - ret = 0; - break; - } - } - - mutex_unlock(&arm_pmus_lock); - return ret; -} - -static int kvm_arm_pmu_v3_set_nr_counters(struct kvm_vcpu *vcpu, unsigned int n) -{ - struct kvm *kvm = vcpu->kvm; - - if (!kvm->arch.arm_pmu) - return -EINVAL; - - if (n > kvm_arm_pmu_get_max_counters(kvm)) - return -EINVAL; - - kvm_arm_set_nr_counters(kvm, n); - return 0; -} - -int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) -{ - struct kvm *kvm = vcpu->kvm; - - lockdep_assert_held(&kvm->arch.config_lock); - - if (!kvm_vcpu_has_pmu(vcpu)) - return -ENODEV; - - if (vcpu->arch.pmu.created) - return -EBUSY; - - switch (attr->attr) { - case KVM_ARM_VCPU_PMU_V3_IRQ: { - int __user *uaddr = (int __user *)(long)attr->addr; - int irq; - - if 
(!irqchip_in_kernel(kvm)) - return -EINVAL; - - if (get_user(irq, uaddr)) - return -EFAULT; - - /* The PMU overflow interrupt can be a PPI or a valid SPI. */ - if (!(irq_is_ppi(irq) || irq_is_spi(irq))) - return -EINVAL; - - if (!pmu_irq_is_valid(kvm, irq)) - return -EINVAL; - - if (kvm_arm_pmu_irq_initialized(vcpu)) - return -EBUSY; - - kvm_debug("Set kvm ARM PMU irq: %d\n", irq); - vcpu->arch.pmu.irq_num = irq; - return 0; - } - case KVM_ARM_VCPU_PMU_V3_FILTER: { - u8 pmuver = kvm_arm_pmu_get_pmuver_limit(); - struct kvm_pmu_event_filter __user *uaddr; - struct kvm_pmu_event_filter filter; - int nr_events; - - /* - * Allow userspace to specify an event filter for the entire - * event range supported by PMUVer of the hardware, rather - * than the guest's PMUVer for KVM backward compatibility. - */ - nr_events = __kvm_pmu_event_mask(pmuver) + 1; - - uaddr = (struct kvm_pmu_event_filter __user *)(long)attr->addr; - - if (copy_from_user(&filter, uaddr, sizeof(filter))) - return -EFAULT; - - if (((u32)filter.base_event + filter.nevents) > nr_events || - (filter.action != KVM_PMU_EVENT_ALLOW && - filter.action != KVM_PMU_EVENT_DENY)) - return -EINVAL; - - if (kvm_vm_has_ran_once(kvm)) - return -EBUSY; - - if (!kvm->arch.pmu_filter) { - kvm->arch.pmu_filter = bitmap_alloc(nr_events, GFP_KERNEL_ACCOUNT); - if (!kvm->arch.pmu_filter) - return -ENOMEM; - - /* - * The default depends on the first applied filter. - * If it allows events, the default is to deny. - * Conversely, if the first filter denies a set of - * events, the default is to allow. 
- */ - if (filter.action == KVM_PMU_EVENT_ALLOW) - bitmap_zero(kvm->arch.pmu_filter, nr_events); - else - bitmap_fill(kvm->arch.pmu_filter, nr_events); - } - - if (filter.action == KVM_PMU_EVENT_ALLOW) - bitmap_set(kvm->arch.pmu_filter, filter.base_event, filter.nevents); - else - bitmap_clear(kvm->arch.pmu_filter, filter.base_event, filter.nevents); - - return 0; - } - case KVM_ARM_VCPU_PMU_V3_SET_PMU: { - int __user *uaddr = (int __user *)(long)attr->addr; - int pmu_id; - - if (get_user(pmu_id, uaddr)) - return -EFAULT; - - return kvm_arm_pmu_v3_set_pmu(vcpu, pmu_id); - } - case KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS: { - unsigned int __user *uaddr = (unsigned int __user *)(long)attr->addr; - unsigned int n; - - if (get_user(n, uaddr)) - return -EFAULT; - - return kvm_arm_pmu_v3_set_nr_counters(vcpu, n); - } - case KVM_ARM_VCPU_PMU_V3_INIT: - return kvm_arm_pmu_v3_init(vcpu); - } - - return -ENXIO; -} - -int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) -{ - switch (attr->attr) { - case KVM_ARM_VCPU_PMU_V3_IRQ: { - int __user *uaddr = (int __user *)(long)attr->addr; - int irq; - - if (!irqchip_in_kernel(vcpu->kvm)) - return -EINVAL; - - if (!kvm_vcpu_has_pmu(vcpu)) - return -ENODEV; - - if (!kvm_arm_pmu_irq_initialized(vcpu)) - return -ENXIO; - - irq = vcpu->arch.pmu.irq_num; - return put_user(irq, uaddr); - } - } - - return -ENXIO; -} - -int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) -{ - switch (attr->attr) { - case KVM_ARM_VCPU_PMU_V3_IRQ: - case KVM_ARM_VCPU_PMU_V3_INIT: - case KVM_ARM_VCPU_PMU_V3_FILTER: - case KVM_ARM_VCPU_PMU_V3_SET_PMU: - case KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS: - if (kvm_vcpu_has_pmu(vcpu)) - return 0; - } - - return -ENXIO; -} - -u8 kvm_arm_pmu_get_pmuver_limit(void) -{ - unsigned int pmuver; - - pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, - read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1)); - - /* - * Spoof a barebones PMUv3 implementation if the system supports IMPDEF - * 
traps of the PMUv3 sysregs - */ - if (cpus_have_final_cap(ARM64_WORKAROUND_PMUV3_IMPDEF_TRAPS)) - return ID_AA64DFR0_EL1_PMUVer_IMP; - - /* - * Otherwise, treat IMPLEMENTATION DEFINED functionality as - * unimplemented - */ - if (pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) - return 0; - - return min(pmuver, ID_AA64DFR0_EL1_PMUVer_V3P5); -} - /** * kvm_vcpu_read_pmcr - Read PMCR_EL0 register for the vCPU * @vcpu: The vcpu pointer diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c index 8bfc6b0a85f6..4f0152e67ff3 100644 --- a/arch/arm64/kvm/pmu.c +++ b/arch/arm64/kvm/pmu.c @@ -8,10 +8,21 @@ #include <linux/perf/arm_pmu.h> #include <linux/perf/arm_pmuv3.h>
+#include <asm/kvm_emulate.h>
 #include <asm/kvm_pmu.h>

+static LIST_HEAD(arm_pmus);
+static DEFINE_MUTEX(arm_pmus_lock);
 static DEFINE_PER_CPU(struct kvm_pmu_events, kvm_pmu_events);

+#define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS)
+
+bool kvm_supports_guest_pmuv3(void)
+{
+	guard(mutex)(&arm_pmus_lock);
+	return !list_empty(&arm_pmus);
+}
+
 /*
  * Given the perf event attributes and system type, determine
  * if we are going to need to switch counters at guest entry/exit.
@@ -211,3 +222,602 @@ void kvm_vcpu_pmu_resync_el0(void)
kvm_make_request(KVM_REQ_RESYNC_PMU_EL0, vcpu); } + +struct arm_pmu_entry { + struct list_head entry; + struct arm_pmu *arm_pmu; +}; + +void kvm_host_pmu_init(struct arm_pmu *pmu) +{ + struct arm_pmu_entry *entry; + + /* + * Check the sanitised PMU version for the system, as KVM does not + * support implementations where PMUv3 exists on a subset of CPUs. + */ + if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit())) + return; + + guard(mutex)(&arm_pmus_lock); + + entry = kmalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) + return; + + entry->arm_pmu = pmu; + list_add_tail(&entry->entry, &arm_pmus); +} + +static struct arm_pmu *kvm_pmu_probe_armpmu(void) +{ + struct arm_pmu_entry *entry; + struct arm_pmu *pmu; + int cpu; + + guard(mutex)(&arm_pmus_lock); + + /* + * It is safe to use a stale cpu to iterate the list of PMUs so long as + * the same value is used for the entirety of the loop. Given this, and + * the fact that no percpu data is used for the lookup there is no need + * to disable preemption. + * + * It is still necessary to get a valid cpu, though, to probe for the + * default PMU instance as userspace is not required to specify a PMU + * type. In order to uphold the preexisting behavior KVM selects the + * PMU instance for the core during vcpu init. A dependent use + * case would be a user with disdain of all things big.LITTLE that + * affines the VMM to a particular cluster of cores. + * + * In any case, userspace should just do the sane thing and use the UAPI + * to select a PMU type directly. But, be wary of the baggage being + * carried here. 
+ */ + cpu = raw_smp_processor_id(); + list_for_each_entry(entry, &arm_pmus, entry) { + pmu = entry->arm_pmu; + + if (cpumask_test_cpu(cpu, &pmu->supported_cpus)) + return pmu; + } + + return NULL; +} + +static u64 __compute_pmceid(struct arm_pmu *pmu, bool pmceid1) +{ + u32 hi[2], lo[2]; + + bitmap_to_arr32(lo, pmu->pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS); + bitmap_to_arr32(hi, pmu->pmceid_ext_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS); + + return ((u64)hi[pmceid1] << 32) | lo[pmceid1]; +} + +static u64 compute_pmceid0(struct arm_pmu *pmu) +{ + u64 val = __compute_pmceid(pmu, 0); + + /* always support SW_INCR */ + val |= BIT(ARMV8_PMUV3_PERFCTR_SW_INCR); + /* always support CHAIN */ + val |= BIT(ARMV8_PMUV3_PERFCTR_CHAIN); + return val; +} + +static u64 compute_pmceid1(struct arm_pmu *pmu) +{ + u64 val = __compute_pmceid(pmu, 1); + + /* + * Don't advertise STALL_SLOT*, as PMMIR_EL0 is handled + * as RAZ + */ + val &= ~(BIT_ULL(ARMV8_PMUV3_PERFCTR_STALL_SLOT - 32) | + BIT_ULL(ARMV8_PMUV3_PERFCTR_STALL_SLOT_FRONTEND - 32) | + BIT_ULL(ARMV8_PMUV3_PERFCTR_STALL_SLOT_BACKEND - 32)); + return val; +} + +u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1) +{ + struct arm_pmu *cpu_pmu = vcpu->kvm->arch.arm_pmu; + unsigned long *bmap = vcpu->kvm->arch.pmu_filter; + u64 val, mask = 0; + int base, i, nr_events; + + if (!pmceid1) { + val = compute_pmceid0(cpu_pmu); + base = 0; + } else { + val = compute_pmceid1(cpu_pmu); + base = 32; + } + + if (!bmap) + return val; + + nr_events = kvm_pmu_event_mask(vcpu->kvm) + 1; + + for (i = 0; i < 32; i += 8) { + u64 byte; + + byte = bitmap_get_value8(bmap, base + i); + mask |= byte << i; + if (nr_events >= (0x4000 + base + 32)) { + byte = bitmap_get_value8(bmap, 0x4000 + base + i); + mask |= byte << (32 + i); + } + } + + return val & mask; +} + +/* + * When perf interrupt is an NMI, we cannot safely notify the vcpu corresponding + * to the event. + * This is why we need a callback to do it once outside of the NMI context. 
+ */ +static void kvm_pmu_perf_overflow_notify_vcpu(struct irq_work *work) +{ + struct kvm_vcpu *vcpu; + + vcpu = container_of(work, struct kvm_vcpu, arch.pmu.overflow_work); + kvm_vcpu_kick(vcpu); +} + +static u32 __kvm_pmu_event_mask(unsigned int pmuver) +{ + switch (pmuver) { + case ID_AA64DFR0_EL1_PMUVer_IMP: + return GENMASK(9, 0); + case ID_AA64DFR0_EL1_PMUVer_V3P1: + case ID_AA64DFR0_EL1_PMUVer_V3P4: + case ID_AA64DFR0_EL1_PMUVer_V3P5: + case ID_AA64DFR0_EL1_PMUVer_V3P7: + return GENMASK(15, 0); + default: /* Shouldn't be here, just for sanity */ + WARN_ONCE(1, "Unknown PMU version %d\n", pmuver); + return 0; + } +} + +u32 kvm_pmu_event_mask(struct kvm *kvm) +{ + u64 dfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64DFR0_EL1); + u8 pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, dfr0); + + return __kvm_pmu_event_mask(pmuver); +} + +u64 kvm_pmu_evtyper_mask(struct kvm *kvm) +{ + u64 mask = ARMV8_PMU_EXCLUDE_EL1 | ARMV8_PMU_EXCLUDE_EL0 | + kvm_pmu_event_mask(kvm); + + if (kvm_has_feat(kvm, ID_AA64PFR0_EL1, EL2, IMP)) + mask |= ARMV8_PMU_INCLUDE_EL2; + + if (kvm_has_feat(kvm, ID_AA64PFR0_EL1, EL3, IMP)) + mask |= ARMV8_PMU_EXCLUDE_NS_EL0 | + ARMV8_PMU_EXCLUDE_NS_EL1 | + ARMV8_PMU_EXCLUDE_EL3; + + return mask; +} + +static void kvm_pmu_update_state(struct kvm_vcpu *vcpu) +{ + struct kvm_pmu *pmu = &vcpu->arch.pmu; + bool overflow; + + overflow = kvm_pmu_overflow_status(vcpu); + if (pmu->irq_level == overflow) + return; + + pmu->irq_level = overflow; + + if (likely(irqchip_in_kernel(vcpu->kvm))) { + int ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu, + pmu->irq_num, overflow, pmu); + WARN_ON(ret); + } +} + +/** + * kvm_pmu_flush_hwstate - flush pmu state to cpu + * @vcpu: The vcpu pointer + * + * Check if the PMU has overflowed while we were running in the host, and inject + * an interrupt if that was the case. 
+ */ +void kvm_pmu_flush_hwstate(struct kvm_vcpu *vcpu) +{ + kvm_pmu_update_state(vcpu); +} + +/** + * kvm_pmu_sync_hwstate - sync pmu state from cpu + * @vcpu: The vcpu pointer + * + * Check if the PMU has overflowed while we were running in the guest, and + * inject an interrupt if that was the case. + */ +void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) +{ + kvm_pmu_update_state(vcpu); +} + +int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu) +{ + if (!vcpu->arch.pmu.created) + return -EINVAL; + + /* + * A valid interrupt configuration for the PMU is either to have a + * properly configured interrupt number and using an in-kernel + * irqchip, or to not have an in-kernel GIC and not set an IRQ. + */ + if (irqchip_in_kernel(vcpu->kvm)) { + int irq = vcpu->arch.pmu.irq_num; + /* + * If we are using an in-kernel vgic, at this point we know + * the vgic will be initialized, so we can check the PMU irq + * number against the dimensions of the vgic and make sure + * it's valid. + */ + if (!irq_is_ppi(irq) && !vgic_valid_spi(vcpu->kvm, irq)) + return -EINVAL; + } else if (kvm_arm_pmu_irq_initialized(vcpu)) { + return -EINVAL; + } + + return 0; +} + +static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu) +{ + if (irqchip_in_kernel(vcpu->kvm)) { + int ret; + + /* + * If using the PMU with an in-kernel virtual GIC + * implementation, we require the GIC to be already + * initialized when initializing the PMU. + */ + if (!vgic_initialized(vcpu->kvm)) + return -ENODEV; + + if (!kvm_arm_pmu_irq_initialized(vcpu)) + return -ENXIO; + + ret = kvm_vgic_set_owner(vcpu, vcpu->arch.pmu.irq_num, + &vcpu->arch.pmu); + if (ret) + return ret; + } + + init_irq_work(&vcpu->arch.pmu.overflow_work, + kvm_pmu_perf_overflow_notify_vcpu); + + vcpu->arch.pmu.created = true; + return 0; +} + +/* + * For one VM the interrupt type must be same for each vcpu. + * As a PPI, the interrupt number is the same for all vcpus, + * while as an SPI it must be a separate number per vcpu. 
+ */ +static bool pmu_irq_is_valid(struct kvm *kvm, int irq) +{ + unsigned long i; + struct kvm_vcpu *vcpu; + + kvm_for_each_vcpu(i, vcpu, kvm) { + if (!kvm_arm_pmu_irq_initialized(vcpu)) + continue; + + if (irq_is_ppi(irq)) { + if (vcpu->arch.pmu.irq_num != irq) + return false; + } else { + if (vcpu->arch.pmu.irq_num == irq) + return false; + } + } + + return true; +} + +/** + * kvm_arm_pmu_get_max_counters - Return the max number of PMU counters. + * @kvm: The kvm pointer + */ +u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm) +{ + struct arm_pmu *arm_pmu = kvm->arch.arm_pmu; + + /* + * PMUv3 requires that all event counters are capable of counting any + * event, though the same may not be true of non-PMUv3 hardware. + */ + if (cpus_have_final_cap(ARM64_WORKAROUND_PMUV3_IMPDEF_TRAPS)) + return 1; + + /* + * The arm_pmu->cntr_mask considers the fixed counter(s) as well. + * Ignore those and return only the general-purpose counters. + */ + return bitmap_weight(arm_pmu->cntr_mask, ARMV8_PMU_MAX_GENERAL_COUNTERS); +} + +static void kvm_arm_set_nr_counters(struct kvm *kvm, unsigned int nr) +{ + kvm->arch.nr_pmu_counters = nr; + + /* Reset MDCR_EL2.HPMN behind the vcpus' back... */ + if (test_bit(KVM_ARM_VCPU_HAS_EL2, kvm->arch.vcpu_features)) { + struct kvm_vcpu *vcpu; + unsigned long i; + + kvm_for_each_vcpu(i, vcpu, kvm) { + u64 val = __vcpu_sys_reg(vcpu, MDCR_EL2); + + val &= ~MDCR_EL2_HPMN; + val |= FIELD_PREP(MDCR_EL2_HPMN, kvm->arch.nr_pmu_counters); + __vcpu_sys_reg(vcpu, MDCR_EL2) = val; + } + } +} + +static void kvm_arm_set_pmu(struct kvm *kvm, struct arm_pmu *arm_pmu) +{ + lockdep_assert_held(&kvm->arch.config_lock); + + kvm->arch.arm_pmu = arm_pmu; + kvm_arm_set_nr_counters(kvm, kvm_arm_pmu_get_max_counters(kvm)); +} + +/** + * kvm_arm_set_default_pmu - No PMU set, get the default one. 
+ * @kvm: The kvm pointer + * + * The observant among you will notice that the supported_cpus + * mask does not get updated for the default PMU even though it + * is quite possible the selected instance supports only a + * subset of cores in the system. This is intentional, and + * upholds the preexisting behavior on heterogeneous systems + * where vCPUs can be scheduled on any core but the guest + * counters could stop working. + */ +int kvm_arm_set_default_pmu(struct kvm *kvm) +{ + struct arm_pmu *arm_pmu = kvm_pmu_probe_armpmu(); + + if (!arm_pmu) + return -ENODEV; + + kvm_arm_set_pmu(kvm, arm_pmu); + return 0; +} + +static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id) +{ + struct kvm *kvm = vcpu->kvm; + struct arm_pmu_entry *entry; + struct arm_pmu *arm_pmu; + int ret = -ENXIO; + + lockdep_assert_held(&kvm->arch.config_lock); + mutex_lock(&arm_pmus_lock); + + list_for_each_entry(entry, &arm_pmus, entry) { + arm_pmu = entry->arm_pmu; + if (arm_pmu->pmu.type == pmu_id) { + if (kvm_vm_has_ran_once(kvm) || + (kvm->arch.pmu_filter && kvm->arch.arm_pmu != arm_pmu)) { + ret = -EBUSY; + break; + } + + kvm_arm_set_pmu(kvm, arm_pmu); + cpumask_copy(kvm->arch.supported_cpus, &arm_pmu->supported_cpus); + ret = 0; + break; + } + } + + mutex_unlock(&arm_pmus_lock); + return ret; +} + +static int kvm_arm_pmu_v3_set_nr_counters(struct kvm_vcpu *vcpu, unsigned int n) +{ + struct kvm *kvm = vcpu->kvm; + + if (!kvm->arch.arm_pmu) + return -EINVAL; + + if (n > kvm_arm_pmu_get_max_counters(kvm)) + return -EINVAL; + + kvm_arm_set_nr_counters(kvm, n); + return 0; +} + +int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) +{ + struct kvm *kvm = vcpu->kvm; + + lockdep_assert_held(&kvm->arch.config_lock); + + if (!kvm_vcpu_has_pmu(vcpu)) + return -ENODEV; + + if (vcpu->arch.pmu.created) + return -EBUSY; + + switch (attr->attr) { + case KVM_ARM_VCPU_PMU_V3_IRQ: { + int __user *uaddr = (int __user *)(long)attr->addr; + int irq; + + if 
(!irqchip_in_kernel(kvm)) + return -EINVAL; + + if (get_user(irq, uaddr)) + return -EFAULT; + + /* The PMU overflow interrupt can be a PPI or a valid SPI. */ + if (!(irq_is_ppi(irq) || irq_is_spi(irq))) + return -EINVAL; + + if (!pmu_irq_is_valid(kvm, irq)) + return -EINVAL; + + if (kvm_arm_pmu_irq_initialized(vcpu)) + return -EBUSY; + + kvm_debug("Set kvm ARM PMU irq: %d\n", irq); + vcpu->arch.pmu.irq_num = irq; + return 0; + } + case KVM_ARM_VCPU_PMU_V3_FILTER: { + u8 pmuver = kvm_arm_pmu_get_pmuver_limit(); + struct kvm_pmu_event_filter __user *uaddr; + struct kvm_pmu_event_filter filter; + int nr_events; + + /* + * Allow userspace to specify an event filter for the entire + * event range supported by PMUVer of the hardware, rather + * than the guest's PMUVer for KVM backward compatibility. + */ + nr_events = __kvm_pmu_event_mask(pmuver) + 1; + + uaddr = (struct kvm_pmu_event_filter __user *)(long)attr->addr; + + if (copy_from_user(&filter, uaddr, sizeof(filter))) + return -EFAULT; + + if (((u32)filter.base_event + filter.nevents) > nr_events || + (filter.action != KVM_PMU_EVENT_ALLOW && + filter.action != KVM_PMU_EVENT_DENY)) + return -EINVAL; + + if (kvm_vm_has_ran_once(kvm)) + return -EBUSY; + + if (!kvm->arch.pmu_filter) { + kvm->arch.pmu_filter = bitmap_alloc(nr_events, GFP_KERNEL_ACCOUNT); + if (!kvm->arch.pmu_filter) + return -ENOMEM; + + /* + * The default depends on the first applied filter. + * If it allows events, the default is to deny. + * Conversely, if the first filter denies a set of + * events, the default is to allow. 
+ */ + if (filter.action == KVM_PMU_EVENT_ALLOW) + bitmap_zero(kvm->arch.pmu_filter, nr_events); + else + bitmap_fill(kvm->arch.pmu_filter, nr_events); + } + + if (filter.action == KVM_PMU_EVENT_ALLOW) + bitmap_set(kvm->arch.pmu_filter, filter.base_event, filter.nevents); + else + bitmap_clear(kvm->arch.pmu_filter, filter.base_event, filter.nevents); + + return 0; + } + case KVM_ARM_VCPU_PMU_V3_SET_PMU: { + int __user *uaddr = (int __user *)(long)attr->addr; + int pmu_id; + + if (get_user(pmu_id, uaddr)) + return -EFAULT; + + return kvm_arm_pmu_v3_set_pmu(vcpu, pmu_id); + } + case KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS: { + unsigned int __user *uaddr = (unsigned int __user *)(long)attr->addr; + unsigned int n; + + if (get_user(n, uaddr)) + return -EFAULT; + + return kvm_arm_pmu_v3_set_nr_counters(vcpu, n); + } + case KVM_ARM_VCPU_PMU_V3_INIT: + return kvm_arm_pmu_v3_init(vcpu); + } + + return -ENXIO; +} + +int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) +{ + switch (attr->attr) { + case KVM_ARM_VCPU_PMU_V3_IRQ: { + int __user *uaddr = (int __user *)(long)attr->addr; + int irq; + + if (!irqchip_in_kernel(vcpu->kvm)) + return -EINVAL; + + if (!kvm_vcpu_has_pmu(vcpu)) + return -ENODEV; + + if (!kvm_arm_pmu_irq_initialized(vcpu)) + return -ENXIO; + + irq = vcpu->arch.pmu.irq_num; + return put_user(irq, uaddr); + } + } + + return -ENXIO; +} + +int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) +{ + switch (attr->attr) { + case KVM_ARM_VCPU_PMU_V3_IRQ: + case KVM_ARM_VCPU_PMU_V3_INIT: + case KVM_ARM_VCPU_PMU_V3_FILTER: + case KVM_ARM_VCPU_PMU_V3_SET_PMU: + case KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS: + if (kvm_vcpu_has_pmu(vcpu)) + return 0; + } + + return -ENXIO; +} + +u8 kvm_arm_pmu_get_pmuver_limit(void) +{ + unsigned int pmuver; + + pmuver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMUVer, + read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1)); + + /* + * Spoof a barebones PMUv3 implementation if the system supports IMPDEF + * 
traps of the PMUv3 sysregs + */ + if (cpus_have_final_cap(ARM64_WORKAROUND_PMUV3_IMPDEF_TRAPS)) + return ID_AA64DFR0_EL1_PMUVer_IMP; + + /* + * Otherwise, treat IMPLEMENTATION DEFINED functionality as + * unimplemented + */ + if (pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF) + return 0; + + return min(pmuver, ID_AA64DFR0_EL1_PMUVer_V3P5); +}
For PMUv3, the register field MDCR_EL2.HPMN partitions the PMU counters into two ranges where counters 0..HPMN-1 are accessible by EL1 and, if allowed, EL0 while counters HPMN..N are only accessible by EL2.
Track HPMN in a variable in struct arm_pmu because both KVM and the PMUv3 driver will need to know it to handle guests correctly. Introduce the function kvm_pmu_partition() to set this variable and modify the PMU driver's cntr_mask of available counters to exclude the counters being reserved for the guest. Finally, make sure HPMN is set with this value when setting up the MDCR_EL2 register.
Create a module parameter reserved_host_counters to set a default value. A more flexible uAPI will be added in a later commit.
Due to the difficulty this feature would create for the driver running at EL1 on the host, partitioning is only allowed in VHE mode. Supporting nVHE mode would require a hypercall for every counter access in the driver because the counters reserved for the host by HPMN are only accessible to EL2.
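For illustration only, the pivot arithmetic described above can be modelled in plain userspace C. This is a sketch, not the kernel implementation: the names reservation_is_valid(), compute_hpmn() and counter_is_guest() are made up here, and nr_counters and has_hpmn0 are passed in as parameters instead of being read from per-CPU state or CPU capabilities.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not kernel code.
 * nr_counters: general-purpose counters implemented on this CPU.
 * has_hpmn0:   assumed flag saying whether HPMN == 0 is a legal value.
 */
static bool reservation_is_valid(uint8_t nr_counters, uint8_t host_counters,
				 bool has_hpmn0)
{
	return host_counters < nr_counters ||
	       (host_counters == nr_counters && has_hpmn0);
}

/* Pivot: counters below HPMN go to the guest, HPMN and above stay with the host. */
static uint8_t compute_hpmn(uint8_t nr_counters, uint8_t host_counters,
			    bool has_hpmn0)
{
	if (!reservation_is_valid(nr_counters, host_counters, has_hpmn0))
		return nr_counters;	/* invalid request: no partition */

	return nr_counters - host_counters;
}

/* True if counter idx falls in the guest-accessible range for this pivot. */
static bool counter_is_guest(uint8_t hpmn, uint8_t idx)
{
	return idx < hpmn;
}
```

With 10 counters and 4 reserved for the host, the pivot is 6, so counters 0..5 are guest-accessible and counter 6 is the first host-only one.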
Signed-off-by: Colton Lewis <coltonlewis@google.com>
---
 arch/arm64/include/asm/kvm_pmu.h |  19 +++++
 arch/arm64/kvm/Makefile          |   2 +-
 arch/arm64/kvm/debug.c           |   9 ++-
 arch/arm64/kvm/pmu-part.c        | 117 +++++++++++++++++++++++++++++++
 arch/arm64/kvm/pmu.c             |  13 ++++
 include/linux/perf/arm_pmu.h     |   1 +
 6 files changed, 157 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/kvm/pmu-part.c

diff --git a/arch/arm64/include/asm/kvm_pmu.h b/arch/arm64/include/asm/kvm_pmu.h
index 613cddbdbdd8..83b81e7829bf 100644
--- a/arch/arm64/include/asm/kvm_pmu.h
+++ b/arch/arm64/include/asm/kvm_pmu.h
@@ -22,6 +22,10 @@ bool kvm_set_pmuserenr(u64 val);
 void kvm_vcpu_pmu_resync_el0(void);
 void kvm_host_pmu_init(struct arm_pmu *pmu);

+bool kvm_pmu_partition_supported(void);
+u8 kvm_pmu_hpmn(u8 host_counters);
+int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters);
+
 #else

 static inline void kvm_set_pmu_events(u64 set, struct perf_event_attr *attr) {}
@@ -33,6 +37,21 @@ static inline bool kvm_set_pmuserenr(u64 val)
 static inline void kvm_vcpu_pmu_resync_el0(void) {}
 static inline void kvm_host_pmu_init(struct arm_pmu *pmu) {}

+static inline bool kvm_pmu_partition_supported(void)
+{
+	return false;
+}
+
+static inline u8 kvm_pmu_hpmn(u8 nr_counters)
+{
+	return -1;
+}
+
+static inline int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
+{
+	return -EPERM;
+}
+
 #endif

 #endif
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 7c329e01c557..8161dfb123d7 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,7 +25,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
 	 vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o

-kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
+kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu-part.o pmu.o
 kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
 kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o

diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 7fb1d9e7180f..41746a498a45 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -9,6 +9,7 @@

 #include <linux/kvm_host.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/perf/arm_pmu.h>
 #include <linux/perf/arm_pmuv3.h>

 #include <asm/debug-monitors.h>
@@ -31,15 +32,17 @@
  */
 static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
+	u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
+
 	preempt_disable();

 	/*
 	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
 	 * to disable guest access to the profiling and trace buffers
 	 */
-	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
-					 *host_data_ptr(nr_event_counters));
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
+				MDCR_EL2_TPM |
 				MDCR_EL2_TPMS |
 				MDCR_EL2_TTRF |
 				MDCR_EL2_TPMCR |
diff --git a/arch/arm64/kvm/pmu-part.c b/arch/arm64/kvm/pmu-part.c
new file mode 100644
index 000000000000..7252a58f085c
--- /dev/null
+++ b/arch/arm64/kvm/pmu-part.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Google LLC
+ * Author: Colton Lewis <coltonlewis@google.com>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/perf/arm_pmu.h>
+#include <linux/perf/arm_pmuv3.h>
+
+#include <asm/kvm_pmu.h>
+#include <asm/arm_pmuv3.h>
+
+/**
+ * kvm_pmu_reservation_is_valid() - Determine if reservation is allowed
+ * @host_counters: Number of host counters to reserve
+ *
+ * Determine if the number of host counters in the argument is
+ * allowed. It is allowed if it will produce a valid value for
+ * register field MDCR_EL2.HPMN.
+ *
+ * Return: True if reservation allowed, false otherwise
+ */
+static bool kvm_pmu_reservation_is_valid(u8 host_counters)
+{
+	u8 nr_counters = *host_data_ptr(nr_event_counters);
+
+	return host_counters < nr_counters ||
+		(host_counters == nr_counters
+		 && cpus_have_final_cap(ARM64_HAS_HPMN0));
+}
+
+/**
+ * kvm_pmu_hpmn() - Compute HPMN value
+ * @host_counters: Number of host counters to reserve
+ *
+ * This function computes the value of HPMN, the partition pivot
+ * value, such that counters 0..HPMN-1 are reserved for the guest and
+ * counters HPMN..N are reserved for the host.
+ *
+ * If the requested @host_counters would create an invalid partition,
+ * return the value of HPMN that creates no partition.
+ *
+ * Return: Value of HPMN
+ */
+u8 kvm_pmu_hpmn(u8 host_counters)
+{
+	u8 nr_counters = *host_data_ptr(nr_event_counters);
+
+	if (likely(kvm_pmu_reservation_is_valid(host_counters)))
+		return nr_counters - host_counters;
+	else
+		return nr_counters;
+}
+
+/**
+ * kvm_pmu_partition_supported() - Determine if partitioning is possible
+ *
+ * Partitioning is only supported in VHE mode where we have PMUv3 and
+ * Fine Grain Traps (FGT).
+ *
+ * Return: True if partitioning is possible, false otherwise
+ */
+bool kvm_pmu_partition_supported(void)
+{
+	return has_vhe()
+		&& pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit())
+		&& cpus_have_final_cap(ARM64_HAS_FGT);
+}
+
+/**
+ * kvm_pmu_partition() - Partition the PMU
+ * @pmu: Pointer to pmu being partitioned
+ * @host_counters: Number of host counters to reserve
+ *
+ * Partition the given PMU by taking a number of host counters to
+ * reserve and, if it is a valid reservation, recording the
+ * corresponding HPMN value in the hpmn field of the PMU and clearing
+ * the guest-reserved counters from the counter mask.
+ *
+ * Passing 0 for @host_counters has the effect of disabling partitioning.
+ *
+ * Return: 0 on success, -ERROR otherwise
+ */
+int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
+{
+	u8 nr_counters;
+	u8 hpmn;
+
+	if (!kvm_pmu_reservation_is_valid(host_counters))
+		return -EINVAL;
+
+	nr_counters = *host_data_ptr(nr_event_counters);
+	hpmn = kvm_pmu_hpmn(host_counters);
+
+	if (hpmn < nr_counters) {
+		pmu->hpmn = hpmn;
+		/* Inform host driver of available counters */
+		bitmap_clear(pmu->cntr_mask, 0, hpmn);
+		bitmap_set(pmu->cntr_mask, hpmn, nr_counters);
+		clear_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			clear_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Partitioned PMU with HPMN %u", hpmn);
+	} else {
+		pmu->hpmn = nr_counters;
+		bitmap_set(pmu->cntr_mask, 0, nr_counters);
+		set_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			set_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Unpartitioned PMU");
+	}
+
+	return 0;
+}
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 4f0152e67ff3..2dcfac3ea9c6 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -15,6 +15,12 @@ static LIST_HEAD(arm_pmus);
 static DEFINE_MUTEX(arm_pmus_lock);
 static DEFINE_PER_CPU(struct kvm_pmu_events, kvm_pmu_events);

+static u8 reserved_host_counters __read_mostly;
+
+module_param(reserved_host_counters, byte, 0);
+MODULE_PARM_DESC(reserved_host_counters,
+		 "Partition the PMU into host and guest counters");
+
 #define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS)

 bool kvm_supports_guest_pmuv3(void)
@@ -239,6 +245,13 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 	if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit()))
 		return;

+	if (reserved_host_counters) {
+		if (kvm_pmu_partition_supported())
+			WARN_ON(kvm_pmu_partition(pmu, reserved_host_counters));
+		else
+			kvm_err("PMU Partition is not supported");
+	}
+
 	guard(mutex)(&arm_pmus_lock);

 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 1de206b09616..3843d66b7328 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -130,6 +130,7 @@ struct arm_pmu {

 	/* Only to be used by ACPI probing code */
 	unsigned long acpi_cpuid;
+	u8 hpmn; /* MDCR_EL2.HPMN: counter partition pivot */
 };

 #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
On Mon, Jun 02, 2025 at 07:26:51PM +0000, Colton Lewis wrote:
 static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
+	u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
+
 	preempt_disable();

 	/*
 	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
 	 * to disable guest access to the profiling and trace buffers
 	 */
-	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
-					 *host_data_ptr(nr_event_counters));
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
+				MDCR_EL2_TPM |

This isn't safe, as there's no guarantee that kvm_arch::arm_pmu is pointing at the PMU for this CPU. KVM needs to derive HPMN from some per-CPU state, not anything tied to the VM/vCPU.
+/**
+ * kvm_pmu_partition() - Partition the PMU
+ * @pmu: Pointer to pmu being partitioned
+ * @host_counters: Number of host counters to reserve
+ *
+ * Partition the given PMU by taking a number of host counters to
+ * reserve and, if it is a valid reservation, recording the
+ * corresponding HPMN value in the hpmn field of the PMU and clearing
+ * the guest-reserved counters from the counter mask.
+ *
+ * Passing 0 for @host_counters has the effect of disabling partitioning.
+ *
+ * Return: 0 on success, -ERROR otherwise
+ */
+int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
+{
+	u8 nr_counters;
+	u8 hpmn;
+
+	if (!kvm_pmu_reservation_is_valid(host_counters))
+		return -EINVAL;
+
+	nr_counters = *host_data_ptr(nr_event_counters);
+	hpmn = kvm_pmu_hpmn(host_counters);
+
+	if (hpmn < nr_counters) {
+		pmu->hpmn = hpmn;
+		/* Inform host driver of available counters */
+		bitmap_clear(pmu->cntr_mask, 0, hpmn);
+		bitmap_set(pmu->cntr_mask, hpmn, nr_counters);
+		clear_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			clear_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Partitioned PMU with HPMN %u", hpmn);
+	} else {
+		pmu->hpmn = nr_counters;
+		bitmap_set(pmu->cntr_mask, 0, nr_counters);
+		set_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			set_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Unpartitioned PMU");
+	}
+
+	return 0;
+}
Hmm... Just in terms of code organization I'm not sure I like having KVM twiddling with *host* support for PMUv3. Feels like the ARM PMU driver should own partitioning and KVM just takes what it can get.
@@ -239,6 +245,13 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 	if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit()))
 		return;

+	if (reserved_host_counters) {
+		if (kvm_pmu_partition_supported())
+			WARN_ON(kvm_pmu_partition(pmu, reserved_host_counters));
+		else
+			kvm_err("PMU Partition is not supported");
+	}
Hasn't the ARM PMU been registered with perf at this point? Surely the driver wouldn't be very pleased with us ripping counters out from under its feet.
Thanks, Oliver
Oliver Upton <oliver.upton@linux.dev> writes:
On Mon, Jun 02, 2025 at 07:26:51PM +0000, Colton Lewis wrote:
 static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
+	u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
+
 	preempt_disable();

 	/*
 	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
 	 * to disable guest access to the profiling and trace buffers
 	 */
-	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
-					 *host_data_ptr(nr_event_counters));
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
+				MDCR_EL2_TPM |

This isn't safe, as there's no guarantee that kvm_arch::arm_pmu is pointing at the PMU for this CPU. KVM needs to derive HPMN from some per-CPU state, not anything tied to the VM/vCPU.
I'm confused. Isn't this function preparing to run the vCPU on this CPU? Why would it be pointing at a different PMU?
And HPMN is something that we only want set when running a vCPU, so there isn't any per-CPU state saying it should be anything but the default value (number of counters) outside that context.
Unless you just mean I should check the number of counters again and make sure HPMN is not an invalid value.
+/**
+ * kvm_pmu_partition() - Partition the PMU
+ * @pmu: Pointer to pmu being partitioned
+ * @host_counters: Number of host counters to reserve
+ *
+ * Partition the given PMU by taking a number of host counters to
+ * reserve and, if it is a valid reservation, recording the
+ * corresponding HPMN value in the hpmn field of the PMU and clearing
+ * the guest-reserved counters from the counter mask.
+ *
+ * Passing 0 for @host_counters has the effect of disabling partitioning.
+ *
+ * Return: 0 on success, -ERROR otherwise
+ */
+int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
+{
+	u8 nr_counters;
+	u8 hpmn;
+
+	if (!kvm_pmu_reservation_is_valid(host_counters))
+		return -EINVAL;
+
+	nr_counters = *host_data_ptr(nr_event_counters);
+	hpmn = kvm_pmu_hpmn(host_counters);
+
+	if (hpmn < nr_counters) {
+		pmu->hpmn = hpmn;
+		/* Inform host driver of available counters */
+		bitmap_clear(pmu->cntr_mask, 0, hpmn);
+		bitmap_set(pmu->cntr_mask, hpmn, nr_counters);
+		clear_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			clear_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Partitioned PMU with HPMN %u", hpmn);
+	} else {
+		pmu->hpmn = nr_counters;
+		bitmap_set(pmu->cntr_mask, 0, nr_counters);
+		set_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			set_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Unpartitioned PMU");
+	}
+
+	return 0;
+}
Hmm... Just in terms of code organization I'm not sure I like having KVM twiddling with *host* support for PMUv3. Feels like the ARM PMU driver should own partitioning and KVM just takes what it can get.
Okay. I can move the code.
@@ -239,6 +245,13 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 	if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit()))
 		return;

+	if (reserved_host_counters) {
+		if (kvm_pmu_partition_supported())
+			WARN_ON(kvm_pmu_partition(pmu, reserved_host_counters));
+		else
+			kvm_err("PMU Partition is not supported");
+	}
Hasn't the ARM PMU been registered with perf at this point? Surely the driver wouldn't be very pleased with us ripping counters out from under its feet.
AFAICT nothing in perf registration cares about the number of counters the PMU has. The PMUv3 driver tracks its own available counters through cntr_mask and I modify that during partition.
Since this is still initialization of the PMU, I don't believe anything has had a chance to use a counter yet that will be ripped away.
Aesthetically it makes sense to change this if I move the partitioning code to the PMUv3 driver, but I think it's inconsequential to the function.
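As a rough illustration of the cntr_mask adjustment discussed above, the sketch below models the counter mask as a single 64-bit word in userspace C. The helpers mask_set() and host_cntr_mask() are hypothetical stand-ins, not driver code, and the cycle and instruction counter bits are left out. One assumption worth stating: like the kernel's bitmap_set(), mask_set() takes a start bit and a bit count, not an end index.

```c
#include <stdint.h>

/*
 * Hypothetical helper: set nbits bits starting at bit 'start'.
 * Assumes start < 64 and start + nbits <= 64.
 */
static uint64_t mask_set(uint64_t map, unsigned int start, unsigned int nbits)
{
	uint64_t bits = nbits >= 64 ? ~0ULL : ((1ULL << nbits) - 1);

	return map | (bits << start);
}

/*
 * Counters the host driver may still allocate for a given pivot.
 * With hpmn < nr_counters the host keeps counters hpmn..nr_counters-1;
 * otherwise the PMU is treated as unpartitioned and the host keeps all.
 */
static uint64_t host_cntr_mask(unsigned int nr_counters, unsigned int hpmn)
{
	if (hpmn < nr_counters)
		return mask_set(0, hpmn, nr_counters - hpmn);

	return mask_set(0, 0, nr_counters);
}
```

For 10 counters and a pivot of 6, this leaves bits 6..9 set, that is 0x3C0.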
On Tue, Jun 03, 2025 at 09:32:41PM +0000, Colton Lewis wrote:
Oliver Upton <oliver.upton@linux.dev> writes:
On Mon, Jun 02, 2025 at 07:26:51PM +0000, Colton Lewis wrote:
 static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
+	u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
+
 	preempt_disable();

 	/*
 	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
 	 * to disable guest access to the profiling and trace buffers
 	 */
-	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
-					 *host_data_ptr(nr_event_counters));
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
+				MDCR_EL2_TPM |

This isn't safe, as there's no guarantee that kvm_arch::arm_pmu is pointing at the PMU for this CPU. KVM needs to derive HPMN from some per-CPU state, not anything tied to the VM/vCPU.
I'm confused. Isn't this function preparing to run the vCPU on this CPU? Why would it be pointing at a different PMU?
Because arm64 is a silly ecosystem and system designers can glue together heterogeneous CPU implementations. The arm_pmu that KVM is pointing at might only match a subset of CPUs, but vCPUs migrate at the whim of the scheduler (and userspace).
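To make the heterogeneous case concrete, here is a minimal userspace sketch of deriving the pivot from per-CPU state. It is only one possible reading of the suggestion: struct cpu_pmu_state, its fields and hpmn_on_cpu() are hypothetical names, and the table stands in for whatever per-CPU data the kernel would really use.

```c
#include <stdint.h>

/* Hypothetical per-CPU record; one entry per physical CPU. */
struct cpu_pmu_state {
	uint8_t nr_counters;	/* counters implemented on this CPU */
	uint8_t guest_counters;	/* counters this CPU's PMU leaves to guests */
};

/*
 * Pick the pivot for the CPU the vCPU is about to run on. The VM-wide
 * wish (vm_wanted) is capped by what this particular CPU offers, so a
 * vCPU that migrates between clusters never gets a pivot that is
 * larger than the local guest range.
 */
static uint8_t hpmn_on_cpu(const struct cpu_pmu_state *percpu,
			   unsigned int cpu, uint8_t vm_wanted)
{
	uint8_t limit = percpu[cpu].guest_counters;

	return vm_wanted < limit ? vm_wanted : limit;
}
```

With a little cluster offering 4 guest counters and a big cluster offering 8, a VM asking for 6 would get a pivot of 4 on the former and 6 on the latter.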
And HPMN is something that we only want set when running a vCPU, so there isn't any per-CPU state saying it should be anything but the default value (number of counters) outside that context.
Unless you just mean I should check the number of counters again and make sure HPMN is not an invalid value.
As you've implemented it the host cannot schedule events in the guest range of counters regardless of context. You need to reconcile that global limit with the desires of the VMM on how many counters it wants presented to this particular guest.
+/**
+ * kvm_pmu_partition() - Partition the PMU
+ * @pmu: Pointer to pmu being partitioned
+ * @host_counters: Number of host counters to reserve
+ *
+ * Partition the given PMU by taking a number of host counters to
+ * reserve and, if it is a valid reservation, recording the
+ * corresponding HPMN value in the hpmn field of the PMU and clearing
+ * the guest-reserved counters from the counter mask.
+ *
+ * Passing 0 for @host_counters has the effect of disabling partitioning.
+ *
+ * Return: 0 on success, -ERROR otherwise
+ */
+int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
+{
+	u8 nr_counters;
+	u8 hpmn;
+
+	if (!kvm_pmu_reservation_is_valid(host_counters))
+		return -EINVAL;
+
+	nr_counters = *host_data_ptr(nr_event_counters);
+	hpmn = kvm_pmu_hpmn(host_counters);
+
+	if (hpmn < nr_counters) {
+		pmu->hpmn = hpmn;
+		/* Inform host driver of available counters */
+		bitmap_clear(pmu->cntr_mask, 0, hpmn);
+		bitmap_set(pmu->cntr_mask, hpmn, nr_counters);
+		clear_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			clear_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Partitioned PMU with HPMN %u", hpmn);
+	} else {
+		pmu->hpmn = nr_counters;
+		bitmap_set(pmu->cntr_mask, 0, nr_counters);
+		set_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			set_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Unpartitioned PMU");
+	}
+
+	return 0;
+}
Hmm... Just in terms of code organization I'm not sure I like having KVM twiddling with *host* support for PMUv3. Feels like the ARM PMU driver should own partitioning and KVM just takes what it can get.
Okay. I can move the code.
@@ -239,6 +245,13 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 	if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit()))
 		return;

+	if (reserved_host_counters) {
+		if (kvm_pmu_partition_supported())
+			WARN_ON(kvm_pmu_partition(pmu, reserved_host_counters));
+		else
+			kvm_err("PMU Partition is not supported");
+	}
Hasn't the ARM PMU been registered with perf at this point? Surely the driver wouldn't be very pleased with us ripping counters out from under its feet.
AFAICT nothing in perf registration cares about the number of counters the PMU has. The PMUv3 driver tracks its own available counters through cntr_mask and I modify that during partition.
Since this is still initialization of the PMU, I don't believe anything has had a chance to use a counter yet that will be ripped away.
Given that kvm_pmu_partition() is called from an ioctl, it is entirely possible that events have been scheduled prior to applying the partition.
Aesthetically it makes sense to change this if I move the partitioning code to the PMUv3 driver, but I think it's inconsequential to the function.
There are two *very* distinct functions w.r.t. partitioning:
1) Partitioning of a particular arm_pmu that says how many counters the host can use
2) VMM intentions to present a subset of the KVM-owned counter partition to its guest
#1 is modifying *global* state, we really can't mess with that in the context of a single VM...
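One way to picture the two levels is the userspace sketch below. It is an assumption-laden model, not proposed kernel code: struct pmu_partition, guest_partition_size() and vm_visible_counters() are invented names, and the per-VM request is simply capped by the global guest partition.

```c
#include <stdint.h>

/* Level 1 (global, per arm_pmu): how the hardware counters are split. */
struct pmu_partition {
	uint8_t nr_counters;	/* counters implemented by the PMU */
	uint8_t host_counters;	/* counters kept for host perf */
};

/* Size of the guest partition; assumes host_counters <= nr_counters. */
static uint8_t guest_partition_size(const struct pmu_partition *p)
{
	return p->nr_counters - p->host_counters;
}

/*
 * Level 2 (per VM): the VMM may ask for any number of counters, but a
 * single VM can only ever see a subset of the guest partition and must
 * never change the global split.
 */
static uint8_t vm_visible_counters(const struct pmu_partition *p,
				   uint8_t vmm_requested)
{
	uint8_t limit = guest_partition_size(p);

	return vmm_requested < limit ? vmm_requested : limit;
}
```

The point of keeping the two functions apart is that only the first depends on global state; the second is a pure per-VM computation.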
Thanks, Oliver
Thank you Oliver for the additional explanation.
Oliver Upton <oliver.upton@linux.dev> writes:
On Tue, Jun 03, 2025 at 09:32:41PM +0000, Colton Lewis wrote:
Oliver Upton <oliver.upton@linux.dev> writes:
On Mon, Jun 02, 2025 at 07:26:51PM +0000, Colton Lewis wrote:
 static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
+	u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
+
 	preempt_disable();

 	/*
 	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
 	 * to disable guest access to the profiling and trace buffers
 	 */
-	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
-					 *host_data_ptr(nr_event_counters));
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
+				MDCR_EL2_TPM |
This isn't safe, as there's no guarantee that kvm_arch::arm_pmu is pointing at the PMU for this CPU. KVM needs to derive HPMN from some per-CPU state, not anything tied to the VM/vCPU.
I'm confused. Isn't this function preparing to run the vCPU on this CPU? Why would it be pointing at a different PMU?
Because arm64 is a silly ecosystem and system designers can glue together heterogeneous CPU implementations. The arm_pmu that KVM is pointing at might only match a subset of CPUs, but vCPUs migrate at the whim of the scheduler (and userspace).
That means the arm_pmu field might at any time point to data that doesn't represent the current CPU. I'm surprised that's not swapped out anywhere. Seems like it would be useful to have an arch struct be a reliable source of information about the current arch.
And HPMN is something that we only want set when running a vCPU, so there isn't any per-CPU state saying it should be anything but the default value (number of counters) outside that context.
Unless you just mean I should check the number of counters again and make sure HPMN is not an invalid value.
As you've implemented it the host cannot schedule events in the guest range of counters regardless of context. You need to reconcile that global limit with the desires of the VMM on how many counters it wants presented to this particular guest.
It's true that's the current implementation. I was assuming the VMM would control that with the new partition API. Given that partitioning untraps access to counters, there is no other way besides HPMN to control how many counters are exposed to the guest.
+/**
+ * kvm_pmu_partition() - Partition the PMU
+ * @pmu: Pointer to pmu being partitioned
+ * @host_counters: Number of host counters to reserve
+ *
+ * Partition the given PMU by taking a number of host counters to
+ * reserve and, if it is a valid reservation, recording the
+ * corresponding HPMN value in the hpmn field of the PMU and clearing
+ * the guest-reserved counters from the counter mask.
+ *
+ * Passing 0 for @host_counters has the effect of disabling partitioning.
+ *
+ * Return: 0 on success, -ERROR otherwise
+ */
+int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
+{
+	u8 nr_counters;
+	u8 hpmn;
+
+	if (!kvm_pmu_reservation_is_valid(host_counters))
+		return -EINVAL;
+
+	nr_counters = *host_data_ptr(nr_event_counters);
+	hpmn = kvm_pmu_hpmn(host_counters);
+
+	if (hpmn < nr_counters) {
+		pmu->hpmn = hpmn;
+		/* Inform host driver of available counters */
+		bitmap_clear(pmu->cntr_mask, 0, hpmn);
+		bitmap_set(pmu->cntr_mask, hpmn, nr_counters);
+		clear_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			clear_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Partitioned PMU with HPMN %u", hpmn);
+	} else {
+		pmu->hpmn = nr_counters;
+		bitmap_set(pmu->cntr_mask, 0, nr_counters);
+		set_bit(ARMV8_PMU_CYCLE_IDX, pmu->cntr_mask);
+		if (pmuv3_has_icntr())
+			set_bit(ARMV8_PMU_INSTR_IDX, pmu->cntr_mask);
+
+		kvm_debug("Unpartitioned PMU");
+	}
+
+	return 0;
+}
Hmm... Just in terms of code organization I'm not sure I like having KVM twiddling with *host* support for PMUv3. Feels like the ARM PMU driver should own partitioning and KVM just takes what it can get.
Okay. I can move the code.
@@ -239,6 +245,13 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 	if (!pmuv3_implemented(kvm_arm_pmu_get_pmuver_limit()))
 		return;

+	if (reserved_host_counters) {
+		if (kvm_pmu_partition_supported())
+			WARN_ON(kvm_pmu_partition(pmu, reserved_host_counters));
+		else
+			kvm_err("PMU Partition is not supported");
+	}
Hasn't the ARM PMU been registered with perf at this point? Surely the driver wouldn't be very pleased with us ripping counters out from under its feet.
AFAICT nothing in perf registration cares about the number of counters the PMU has. The PMUv3 driver tracks its own available counters through cntr_mask and I modify that during partition.
Since this is still initialization of the PMU, I don't believe anything has had a chance to use a counter yet that will be ripped away.
Given that kvm_pmu_partition() is called from an ioctl, it is entirely possible that events have been scheduled prior to applying the partition.
That's true for the ioctl call. I was only saying it's not true here.
Aesthetically it makes sense to change this if I move the partitioning code to the PMUv3 driver, but I think it's inconsequential to the function.
There are two *very* distinct functions w.r.t. partitioning:
1) Partitioning of a particular arm_pmu that says how many counters the host can use
2) VMM intentions to present a subset of the KVM-owned counter partition to its guest
#1 is modifying *global* state, we really can't mess with that in the context of a single VM...
I see the distinction more clearly now. Since KVM can only control the number of counters presented to the guest through HPMN, why would the VMM ever choose a subset? If the host PMU is globally partitioned to not use anything in the guest range, presenting fewer counters to a guest is just leaving some counters in the middle of the range unused.
Thanks, Oliver
On Wed, Jun 04, 2025 at 08:10:27PM +0000, Colton Lewis wrote:
Thank you Oliver for the additional explanation.
Oliver Upton oliver.upton@linux.dev writes:
On Tue, Jun 03, 2025 at 09:32:41PM +0000, Colton Lewis wrote:
Oliver Upton oliver.upton@linux.dev writes:
On Mon, Jun 02, 2025 at 07:26:51PM +0000, Colton Lewis wrote:
 static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
+	u8 hpmn = vcpu->kvm->arch.arm_pmu->hpmn;
+
 	preempt_disable();

 	/*
 	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
 	 * to disable guest access to the profiling and trace buffers
 	 */
-	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
-					 *host_data_ptr(nr_event_counters));
-	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
+	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
+	vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD |
+				MDCR_EL2_TPM |
This isn't safe, as there's no guarantee that kvm_arch::arm_pmu is pointing at the PMU for this CPU. KVM needs to derive HPMN from some per-CPU state, not anything tied to the VM/vCPU.
I'm confused. Isn't this function preparing to run the vCPU on this CPU? Why would it be pointing at a different PMU?
Because arm64 is a silly ecosystem and system designers can glue together heterogeneous CPU implementations. The arm_pmu that KVM is pointing at might only match a subset of CPUs, but vCPUs migrate at the whim of the scheduler (and userspace).
That means the arm_pmu field might at any time point to data that doesn't represent the current CPU. I'm surprised that's not swapped out anywhere. Seems like it would be useful to have an arch struct be a reliable source of information about the current arch.
There's no way to accomplish that. It is per-VM data, and you could have vCPUs on a mix of physical CPUs.
This is mitigated somewhat when the VMM explicitly selects a PMU implementation, as we prevent vCPUs from actually entering the guest on an unsupported CPU (see ON_SUPPORTED_CPU flag).
There are two *very* distinct functions w.r.t. partitioning:
1) Partitioning of a particular arm_pmu that says how many counters the host can use
2) VMM intentions to present a subset of the KVM-owned counter partition to its guest
#1 is modifying *global* state, we really can't mess with that in the context of a single VM...
I see the distinction more clearly now. Since KVM can only control the number of counters presented to the guest through HPMN, why would the VMM ever choose a subset? If the host PMU is globally partitioned to not use anything in the guest range, presenting fewer counters to a guest is just leaving some counters in the middle of the range unused.
You may not want to give a 'full' PMU to all VMs running on a system, but some OSes (Windows) expect to have at least the fixed CPU cycle counter present. In this case the VMM would deliberately expose fewer counters. FEAT_HPMN0 didn't get added to the architecture by accident...
Thanks, Oliver
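The distinction drawn in this exchange (a global host/guest split per arm_pmu, and a per-VM choice of how many of the guest-side counters to expose) can be sketched as a small standalone helper. This is only an illustration under stated assumptions: guest_hpmn() and its parameters are hypothetical names, not functions from this series, and HPMN == 0 is treated as unusable unless FEAT_HPMN0 is present.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the series: pick the HPMN value to
 * program for one vCPU on the CPU it is about to run on.
 *
 * nr_counters:   event counters implemented by this CPU (PMCR_EL0.N)
 * host_reserved: counters kept for the host driver (the global partition)
 * vmm_wanted:    counters the VMM wants this guest to see
 * has_hpmn0:     whether FEAT_HPMN0 is implemented on this CPU
 *
 * Returns the HPMN value, or -1 if the request cannot be satisfied.
 */
int guest_hpmn(uint8_t nr_counters, uint8_t host_reserved,
	       uint8_t vmm_wanted, bool has_hpmn0)
{
	uint8_t guest_max;

	if (host_reserved > nr_counters)
		return -1;

	/* Counters 0..guest_max-1 are what the global partition leaves to KVM. */
	guest_max = nr_counters - host_reserved;

	/* A VMM may ask for fewer counters than guest_max, never more. */
	if (vmm_wanted > guest_max)
		return -1;

	/* Assumption of this sketch: HPMN == 0 is only usable with FEAT_HPMN0. */
	if (vmm_wanted == 0 && !has_hpmn0)
		return -1;

	return vmm_wanted;
}
```

With 10 counters and 4 reserved for the host, a VMM asking for 6 gets HPMN 6, while a request for 0 (cycle counter only) succeeds only when has_hpmn0 is true. Taking nr_counters as an argument reflects the point that the value has to come from the CPU the vCPU is running on rather than from per-VM data.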
The OVSR bitmasks are valid for enable and interrupt registers as well as overflow registers. Generalize the names.
Signed-off-by: Colton Lewis coltonlewis@google.com
---
 drivers/perf/arm_pmuv3.c       |  4 ++--
 include/linux/perf/arm_pmuv3.h | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index e506d59654e7..bbcbc8e0c62a 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -502,7 +502,7 @@ static void armv8pmu_pmcr_write(u64 val)

 static int armv8pmu_has_overflowed(u64 pmovsr)
 {
-	return !!(pmovsr & ARMV8_PMU_OVERFLOWED_MASK);
+	return !!(pmovsr & ARMV8_PMU_CNT_MASK_ALL);
 }

 static int armv8pmu_counter_has_overflowed(u64 pmnc, int idx)
@@ -738,7 +738,7 @@ static u64 armv8pmu_getreset_flags(void)
 	value = read_pmovsclr();

 	/* Write to clear flags */
-	value &= ARMV8_PMU_OVERFLOWED_MASK;
+	value &= ARMV8_PMU_CNT_MASK_ALL;
 	write_pmovsclr(value);

 	return value;
diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
index d698efba28a2..fd2a34b4a64d 100644
--- a/include/linux/perf/arm_pmuv3.h
+++ b/include/linux/perf/arm_pmuv3.h
@@ -224,14 +224,14 @@
 				 ARMV8_PMU_PMCR_LC | ARMV8_PMU_PMCR_LP)

 /*
- * PMOVSR: counters overflow flag status reg
+ * Counter bitmask layouts for overflow, enable, and interrupts
  */
-#define ARMV8_PMU_OVSR_P		GENMASK(30, 0)
-#define ARMV8_PMU_OVSR_C		BIT(31)
-#define ARMV8_PMU_OVSR_F		BIT_ULL(32) /* arm64 only */
-/* Mask for writable bits is both P and C fields */
-#define ARMV8_PMU_OVERFLOWED_MASK	(ARMV8_PMU_OVSR_P | ARMV8_PMU_OVSR_C | \
-					ARMV8_PMU_OVSR_F)
+#define ARMV8_PMU_CNT_MASK_P		GENMASK(30, 0)
+#define ARMV8_PMU_CNT_MASK_C		BIT(31)
+#define ARMV8_PMU_CNT_MASK_F		BIT_ULL(32) /* arm64 only */
+#define ARMV8_PMU_CNT_MASK_ALL		(ARMV8_PMU_CNT_MASK_P | \
+					 ARMV8_PMU_CNT_MASK_C | \
+					 ARMV8_PMU_CNT_MASK_F)

 /*
  * PMXEVTYPER: Event selection reg
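As a quick illustration of why the rename is reasonable, the same bit layout can be applied to any of the overflow, enable, or interrupt-enable register values. The sketch below is standalone userspace C; the CNT_MASK_* macros are local stand-ins that mirror the layout in the patch, and the helper names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring the layout in the patch (illustrative names). */
#define CNT_MASK_P	0x7fffffffULL	/* bits 30:0, event counters */
#define CNT_MASK_C	(1ULL << 31)	/* bit 31, cycle counter */
#define CNT_MASK_F	(1ULL << 32)	/* bit 32, instruction counter (arm64 only) */
#define CNT_MASK_ALL	(CNT_MASK_P | CNT_MASK_C | CNT_MASK_F)

/* True if any counter bit is set in an overflow/enable/interrupt value. */
bool any_counter_set(uint64_t reg)
{
	return (reg & CNT_MASK_ALL) != 0;
}

/* True if the cycle counter's bit is set in the same kind of value. */
bool cycle_counter_set(uint64_t reg)
{
	return (reg & CNT_MASK_C) != 0;
}

/* True if event counter idx (0..30) has its bit set. */
bool event_counter_set(uint64_t reg, unsigned int idx)
{
	return idx <= 30 && (reg & CNT_MASK_P & (1ULL << idx)) != 0;
}
```

Nothing in these helpers depends on which of the three register families the value came from, which is the property the generalized names are meant to express.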
If the PMU is partitioned, keep the driver out of the guest counter partition and only use the host counter partition. Partitioning is defined by the MDCR_EL2.HPMN register field and saved in cpu_pmu->hpmn. The range 0..HPMN-1 is accessible by EL1 and EL0 while HPMN..PMCR.N-1 is reserved for EL2.
Define some functions that take HPMN as an argument and construct mutually exclusive bitmaps for testing which partition a particular counter is in. Note that despite their different position in the bitmap, the cycle and instruction counters are always in the guest partition.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm/include/asm/arm_pmuv3.h | 18 ++++++++ arch/arm64/include/asm/kvm_pmu.h | 23 ++++++++++ arch/arm64/kvm/pmu-part.c | 73 ++++++++++++++++++++++++++++++++ drivers/perf/arm_pmuv3.c | 36 ++++++++++++++-- 4 files changed, 146 insertions(+), 4 deletions(-)
diff --git a/arch/arm/include/asm/arm_pmuv3.h b/arch/arm/include/asm/arm_pmuv3.h index 2ec0e5e83fc9..1687b4031ec2 100644 --- a/arch/arm/include/asm/arm_pmuv3.h +++ b/arch/arm/include/asm/arm_pmuv3.h @@ -227,6 +227,24 @@ static inline bool kvm_set_pmuserenr(u64 val) }
static inline void kvm_vcpu_pmu_resync_el0(void) {} +static inline void kvm_pmu_host_counters_enable(void) {} +static inline void kvm_pmu_host_counters_disable(void) {} + +static inline bool kvm_pmu_is_partitioned(struct arm_pmu *pmu) +{ + return false; +} + +static inline u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu) +{ + return ~0; +} + +static inline u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu) +{ + return ~0; +} +
/* PMU Version in DFR Register */ #define ARMV8_PMU_DFR_VER_NI 0 diff --git a/arch/arm64/include/asm/kvm_pmu.h b/arch/arm64/include/asm/kvm_pmu.h index 83b81e7829bf..4098d4ad03d9 100644 --- a/arch/arm64/include/asm/kvm_pmu.h +++ b/arch/arm64/include/asm/kvm_pmu.h @@ -25,6 +25,11 @@ void kvm_host_pmu_init(struct arm_pmu *pmu); bool kvm_pmu_partition_supported(void); u8 kvm_pmu_hpmn(u8 host_counters); int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters); +bool kvm_pmu_is_partitioned(struct arm_pmu *pmu); +u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu); +u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu); +void kvm_pmu_host_counters_enable(void); +void kvm_pmu_host_counters_disable(void);
#else
@@ -52,6 +57,24 @@ static inline int kvm_pmu_partition(struct arm_pmu *pmu) return -EPERM; }
+static inline bool kvm_pmu_is_partitioned(struct arm_pmu *pmu) +{ + return false; +} + +static inline u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu) +{ + return ~0; +} + +static inline u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu) +{ + return ~0; +} + +static inline void kvm_pmu_host_counters_enable(void) {} +static inline void kvm_pmu_host_counters_disable(void) {} + #endif
#endif diff --git a/arch/arm64/kvm/pmu-part.c b/arch/arm64/kvm/pmu-part.c index 7252a58f085c..33eeaa8faf7f 100644 --- a/arch/arm64/kvm/pmu-part.c +++ b/arch/arm64/kvm/pmu-part.c @@ -115,3 +115,76 @@ int kvm_pmu_partition(struct arm_pmu *pmu, u8 host_counters)
return 0; } + +/** + * kvm_pmu_is_partitioned() - Determine if given PMU is partitioned + * @pmu: Pointer to arm_pmu struct + * + * Determine if given PMU is partitioned by looking at hpmn field. The + * PMU is partitioned if this field is less than the number of + * counters in the system. + * + * Return: True if the PMU is partitioned, false otherwise + */ +bool kvm_pmu_is_partitioned(struct arm_pmu *pmu) +{ + return pmu->hpmn < *host_data_ptr(nr_event_counters); +} + +/** + * kvm_pmu_host_counter_mask() - Compute bitmask of host-reserved counters + * @pmu: Pointer to arm_pmu struct + * + * Compute the bitmask that selects the host-reserved counters in the + * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers. These are the counters + * in HPMN..N + * + * Return: Bitmask + */ +u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu) +{ + u8 nr_counters = *host_data_ptr(nr_event_counters); + + return GENMASK(nr_counters - 1, pmu->hpmn); +} + +/** kvm_pmu_guest_counter_mask() - Compute bitmask of guest-reserved counters + * + * Compute the bitmask that selects the guest-reserved counters in the + * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers. These are the counters + * in 0..HPMN and the cycle and instruction counters. + * + * Return: Bitmask + */ +u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu) +{ + return ARMV8_PMU_CNT_MASK_ALL & ~kvm_pmu_host_counter_mask(pmu); +} + +/** kvm_pmu_host_counters_enable() - Enable host-reserved counters + * + * When partitioned the enable bit for host-reserved counters is + * MDCR_EL2.HPME instead of the typical PMCR_EL0.E, which now + * exclusively controls the guest-reserved counters. Enable that bit. 
+ */ +void kvm_pmu_host_counters_enable(void) +{ + u64 mdcr = read_sysreg(mdcr_el2); + + mdcr |= MDCR_EL2_HPME; + write_sysreg(mdcr, mdcr_el2); +} + +/** kvm_pmu_host_counters_disable() - Disable host-reserved counters + * + * When partitioned the disable bit for host-reserved counters is + * MDCR_EL2.HPME instead of the typical PMCR_EL0.E, which now + * exclusively controls the guest-reserved counters. Disable that bit. + */ +void kvm_pmu_host_counters_disable(void) +{ + u64 mdcr = read_sysreg(mdcr_el2); + + mdcr &= ~MDCR_EL2_HPME; + write_sysreg(mdcr, mdcr_el2); +} diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c index bbcbc8e0c62a..f447a0f10e2b 100644 --- a/drivers/perf/arm_pmuv3.c +++ b/drivers/perf/arm_pmuv3.c @@ -823,12 +823,18 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu) kvm_vcpu_pmu_resync_el0();
/* Enable all counters */ + if (kvm_pmu_is_partitioned(cpu_pmu)) + kvm_pmu_host_counters_enable(); + armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E); }
static void armv8pmu_stop(struct arm_pmu *cpu_pmu) { /* Disable all counters */ + if (kvm_pmu_is_partitioned(cpu_pmu)) + kvm_pmu_host_counters_disable(); + armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E); }
@@ -939,6 +945,7 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
/* Always prefer to place a cycle counter into the cycle counter. */ if ((evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) && + !kvm_pmu_is_partitioned(cpu_pmu) && !armv8pmu_event_get_threshold(&event->attr)) { if (!test_and_set_bit(ARMV8_PMU_CYCLE_IDX, cpuc->used_mask)) return ARMV8_PMU_CYCLE_IDX; @@ -954,6 +961,7 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc, * may not know how to handle it. */ if ((evtype == ARMV8_PMUV3_PERFCTR_INST_RETIRED) && + !kvm_pmu_is_partitioned(cpu_pmu) && !armv8pmu_event_get_threshold(&event->attr) && test_bit(ARMV8_PMU_INSTR_IDX, cpu_pmu->cntr_mask) && !armv8pmu_event_want_user_access(event)) { @@ -965,7 +973,7 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc, * Otherwise use events counters */ if (armv8pmu_event_is_chained(event)) - return armv8pmu_get_chain_idx(cpuc, cpu_pmu); + return armv8pmu_get_chain_idx(cpuc, cpu_pmu); else return armv8pmu_get_single_idx(cpuc, cpu_pmu); } @@ -1057,6 +1065,14 @@ static int armv8pmu_set_event_filter(struct hw_perf_event *event, return 0; }
+static void armv8pmu_reset_host_counters(struct arm_pmu *cpu_pmu) +{ + int idx; + + for_each_set_bit(idx, cpu_pmu->cntr_mask, ARMV8_PMU_MAX_GENERAL_COUNTERS) + armv8pmu_write_evcntr(idx, 0); +} + static void armv8pmu_reset(void *info) { struct arm_pmu *cpu_pmu = (struct arm_pmu *)info; @@ -1064,6 +1080,9 @@ static void armv8pmu_reset(void *info)
bitmap_to_arr64(&mask, cpu_pmu->cntr_mask, ARMPMU_MAX_HWEVENTS);
+ if (kvm_pmu_is_partitioned(cpu_pmu)) + mask &= kvm_pmu_host_counter_mask(cpu_pmu); + /* The counter and interrupt enable registers are unknown at reset. */ armv8pmu_disable_counter(mask); armv8pmu_disable_intens(mask); @@ -1071,11 +1090,20 @@ static void armv8pmu_reset(void *info) /* Clear the counters we flip at guest entry/exit */ kvm_clr_pmu_events(mask);
+ + pmcr = ARMV8_PMU_PMCR_LC; + /* - * Initialize & Reset PMNC. Request overflow interrupt for - * 64 bit cycle counter but cheat in armv8pmu_write_counter(). + * Initialize & Reset PMNC. Request overflow interrupt for 64 + * bit cycle counter but cheat in armv8pmu_write_counter(). + * + * When partitioned, there is no single bit to reset only the + * host counters. so reset them individually. */ - pmcr = ARMV8_PMU_PMCR_P | ARMV8_PMU_PMCR_C | ARMV8_PMU_PMCR_LC; + if (kvm_pmu_is_partitioned(cpu_pmu)) + armv8pmu_reset_host_counters(cpu_pmu); + else + pmcr = ARMV8_PMU_PMCR_P | ARMV8_PMU_PMCR_C;
/* Enable long event counter support where available */ if (armv8pmu_has_long_event(cpu_pmu))
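The two mutually exclusive bitmaps described in this patch can be sketched outside the kernel as follows. This is a minimal sketch, not the series' code: genmask64() is a local stand-in for the kernel's GENMASK_ULL(), the function names are illustrative, and returning an empty host mask when hpmn >= nr_counters is a choice made here to avoid building a mask with its high bit below its low bit.

```c
#include <stdint.h>

#define CNT_MASK_ALL	0x1ffffffffULL	/* bits 32:0: events, cycle, instruction */

/* Bits [hi:lo] set, for 0 <= lo <= hi <= 63; stand-in for GENMASK_ULL(). */
static uint64_t genmask64(unsigned int hi, unsigned int lo)
{
	return (~0ULL >> (63 - hi)) & (~0ULL << lo);
}

/* Host-reserved counters are HPMN..N-1. */
uint64_t host_counter_mask(unsigned int nr_counters, unsigned int hpmn)
{
	if (hpmn >= nr_counters)
		return 0;	/* sketch's choice: no host-reserved range */

	return genmask64(nr_counters - 1, hpmn);
}

/* Everything else, including the cycle and instruction counter bits. */
uint64_t guest_counter_mask(unsigned int nr_counters, unsigned int hpmn)
{
	return CNT_MASK_ALL & ~host_counter_mask(nr_counters, hpmn);
}
```

For example, with 10 counters and HPMN 6 the host mask covers bits 6..9 and the guest mask covers bits 0..5 plus every remaining bit of the layout, including bits 31 and 32. The two masks never overlap and together cover the whole layout.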
In order to gain a real performance benefit from partitioning the PMU, utilize fine grain traps (FEAT_FGT and FEAT_FGT2) to avoid trapping common PMU register accesses by the guest to remove that overhead.
There should be no information leaks between guests as all these registers are context switched by a later patch in this series.
Untrapped:
* PMCR_EL0
* PMUSERENR_EL0
* PMSELR_EL0
* PMCCNTR_EL0
* PMINTEN_EL0
* PMEVCNTRn_EL0
* PMICNTR_EL0

Trapped:
* PMOVS_EL0
* PMEVTYPERn_EL0
* PMICFILTR_EL0
* PMCCFILTR_EL0
PMOVS remains trapped so KVM can track overflow IRQs that will need to be injected into the guest.
PMEVTYPERn remains trapped so KVM can limit which events guests can count, such as disallowing counting at EL2. PMCCFILTR and PMICFILTR are trapped for the same reason.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm64/include/asm/kvm_host.h | 11 +++++ arch/arm64/kvm/debug.c | 5 +- arch/arm64/kvm/hyp/include/hyp/switch.h | 64 +++++++++++++++++++++++-- arch/arm64/kvm/pmu-part.c | 14 ++++++ 4 files changed, 88 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 3482d7602a5b..4ea045098bfa 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -1703,6 +1703,12 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
+bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu); + +#if defined(__KVM_NVHE_HYPERVISOR__) +#define kvm_vcpu_pmu_is_partitioned(_) false +#endif + struct kvm_pmu_events *kvm_get_pmu_events(void); void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu); void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu); @@ -1819,6 +1825,11 @@ static inline bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int id
static inline void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) {}
+static inline bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu) +{ + return false; +} + #endif
#endif /* __ARM64_KVM_HOST_H__ */ diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c index 41746a498a45..cbe36825e41f 100644 --- a/arch/arm64/kvm/debug.c +++ b/arch/arm64/kvm/debug.c @@ -42,13 +42,14 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu) */ vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn); vcpu->arch.mdcr_el2 |= (MDCR_EL2_HPMD | - MDCR_EL2_TPM | MDCR_EL2_TPMS | MDCR_EL2_TTRF | - MDCR_EL2_TPMCR | MDCR_EL2_TDRA | MDCR_EL2_TDOSA);
+ if (!kvm_vcpu_pmu_is_partitioned(vcpu)) + vcpu->arch.mdcr_el2 |= MDCR_EL2_TPM | MDCR_EL2_TPMCR; + /* Is the VM being debugged by userspace? */ if (vcpu->guest_debug) /* Route all software debug exceptions to EL2 */ diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index d407e716df1b..c3c34a471ace 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -133,6 +133,10 @@ static inline void __activate_traps_fpsimd32(struct kvm_vcpu *vcpu) case HDFGWTR_EL2: \ id = HDFGRTR_GROUP; \ break; \ + case HDFGRTR2_EL2: \ + case HDFGWTR2_EL2: \ + id = HDFGRTR2_GROUP; \ + break; \ case HAFGRTR_EL2: \ id = HAFGRTR_GROUP; \ break; \ @@ -143,10 +147,6 @@ static inline void __activate_traps_fpsimd32(struct kvm_vcpu *vcpu) case HFGITR2_EL2: \ id = HFGITR2_GROUP; \ break; \ - case HDFGRTR2_EL2: \ - case HDFGWTR2_EL2: \ - id = HDFGRTR2_GROUP; \ - break; \ default: \ BUILD_BUG_ON(1); \ } \ @@ -191,6 +191,59 @@ static inline bool cpu_has_amu(void) ID_AA64PFR0_EL1_AMU_SHIFT); }
+/** + * __activate_pmu_fgt() - Activate fine grain traps for partitioned PMU + * @vcpu: Pointer to struct kvm_vcpu + * + * Clear the most commonly accessed registers for a partitioned + * PMU. Trap the rest. + */ +static inline void __activate_pmu_fgt(struct kvm_vcpu *vcpu) +{ + struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt); + struct kvm *kvm = kern_hyp_va(vcpu->kvm); + u64 set; + u64 clr; + + set = HDFGRTR_EL2_PMOVS + | HDFGRTR_EL2_PMCCFILTR_EL0 + | HDFGRTR_EL2_PMEVTYPERn_EL0; + clr = HDFGRTR_EL2_PMUSERENR_EL0 + | HDFGRTR_EL2_PMSELR_EL0 + | HDFGRTR_EL2_PMINTEN + | HDFGRTR_EL2_PMCNTEN + | HDFGRTR_EL2_PMCCNTR_EL0 + | HDFGRTR_EL2_PMEVCNTRn_EL0; + + update_fgt_traps_cs(hctxt, vcpu, kvm, HDFGRTR_EL2, clr, set); + + set = HDFGWTR_EL2_PMOVS + | HDFGWTR_EL2_PMCCFILTR_EL0 + | HDFGWTR_EL2_PMEVTYPERn_EL0; + clr = HDFGWTR_EL2_PMUSERENR_EL0 + | HDFGWTR_EL2_PMCR_EL0 + | HDFGWTR_EL2_PMSELR_EL0 + | HDFGWTR_EL2_PMINTEN + | HDFGWTR_EL2_PMCNTEN + | HDFGWTR_EL2_PMCCNTR_EL0 + | HDFGWTR_EL2_PMEVCNTRn_EL0; + + update_fgt_traps_cs(hctxt, vcpu, kvm, HDFGWTR_EL2, clr, set); + + if (!cpus_have_final_cap(ARM64_HAS_FGT2)) + return; + + set = HDFGRTR2_EL2_nPMICFILTR_EL0; + clr = HDFGRTR2_EL2_nPMICNTR_EL0; + + update_fgt_traps_cs(hctxt, vcpu, kvm, HDFGRTR2_EL2, clr, set); + + set = HDFGWTR2_EL2_nPMICFILTR_EL0; + clr = HDFGWTR2_EL2_nPMICNTR_EL0; + + update_fgt_traps_cs(hctxt, vcpu, kvm, HDFGWTR2_EL2, clr, set); +} + static inline void __activate_traps_hfgxtr(struct kvm_vcpu *vcpu) { struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt); @@ -210,6 +263,9 @@ static inline void __activate_traps_hfgxtr(struct kvm_vcpu *vcpu) if (cpu_has_amu()) update_fgt_traps(hctxt, vcpu, kvm, HAFGRTR_EL2);
+ if (kvm_vcpu_pmu_is_partitioned(vcpu)) + __activate_pmu_fgt(vcpu); + if (!cpus_have_final_cap(ARM64_HAS_FGT2)) return;
diff --git a/arch/arm64/kvm/pmu-part.c b/arch/arm64/kvm/pmu-part.c index 33eeaa8faf7f..179a4144cfd0 100644 --- a/arch/arm64/kvm/pmu-part.c +++ b/arch/arm64/kvm/pmu-part.c @@ -131,6 +131,20 @@ bool kvm_pmu_is_partitioned(struct arm_pmu *pmu) return pmu->hpmn < *host_data_ptr(nr_event_counters); }
+/** + * kvm_vcpu_pmu_is_partitioned() - Determine if given VCPU has a partitioned PMU + * @vcpu: Pointer to kvm_vcpu struct + * + * Determine if given VCPU has a partitioned PMU by extracting that + * field and passing it to :c:func:`kvm_pmu_is_partitioned` + * + * Return: True if the VCPU PMU is partitioned, false otherwise + */ +bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu) +{ + return kvm_pmu_is_partitioned(vcpu->kvm->arch.arm_pmu); +} + /** * kvm_pmu_host_counter_mask() - Compute bitmask of host-reserved counters * @pmu: Pointer to arm_pmu struct
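The trap policy from the commit message can be summarized as a small classification function. This is a sketch only: the enum and function are hypothetical, no real FGT bit positions are modelled, and PMCNTEN is listed as untrapped because the diff clears its trap bits even though the commit message list omits it.

```c
#include <stdbool.h>

/* Hypothetical register identifiers for illustration; not kernel names. */
enum pmu_reg {
	PMU_PMCR,
	PMU_PMUSERENR,
	PMU_PMSELR,
	PMU_PMCCNTR,
	PMU_PMCNTEN,
	PMU_PMINTEN,
	PMU_PMEVCNTRN,
	PMU_PMICNTR,
	PMU_PMOVS,
	PMU_PMEVTYPERN,
	PMU_PMCCFILTR,
	PMU_PMICFILTR,
};

/* True if a guest access should still trap to KVM on a partitioned PMU. */
bool pmu_reg_traps_when_partitioned(enum pmu_reg reg)
{
	switch (reg) {
	case PMU_PMOVS:		/* KVM tracks overflow IRQs to inject */
	case PMU_PMEVTYPERN:	/* KVM limits which events may be counted */
	case PMU_PMCCFILTR:	/* same reasoning as PMEVTYPERn */
	case PMU_PMICFILTR:	/* same reasoning as PMEVTYPERn */
		return true;
	default:
		return false;
	}
}
```

Keeping the trapped set this small is what lets the frequently accessed counter and control registers go straight to hardware.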
With FGT in place, the remaining trapped registers need to be written through to the underlying physical registers as well as the virtual ones. Failing to do this means delaying when guest writes take effect.
Signed-off-by: Colton Lewis coltonlewis@google.com
---
 arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index d368eeb4f88e..afd06400429a 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -18,6 +18,7 @@
 #include <linux/printk.h>
 #include <linux/uaccess.h>
 #include <linux/irqchip/arm-gic-v3.h>
+#include <linux/perf/arm_pmu.h>
 #include <linux/perf/arm_pmuv3.h>

 #include <asm/arm_pmuv3.h>
@@ -942,7 +943,11 @@ static bool pmu_counter_idx_valid(struct kvm_vcpu *vcpu, u64 idx)
 {
 	u64 pmcr, val;

-	pmcr = kvm_vcpu_read_pmcr(vcpu);
+	if (kvm_vcpu_pmu_is_partitioned(vcpu))
+		pmcr = read_pmcr();
+	else
+		pmcr = kvm_vcpu_read_pmcr(vcpu);
+
 	val = FIELD_GET(ARMV8_PMU_PMCR_N, pmcr);
 	if (idx >= val && idx != ARMV8_PMU_CYCLE_IDX) {
 		kvm_inject_undefined(vcpu);
@@ -1037,6 +1042,22 @@ static bool access_pmu_evcntr(struct kvm_vcpu *vcpu,
 	return true;
 }

+static void writethrough_pmevtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+				   u64 reg, u64 idx)
+{
+	u64 evmask = kvm_pmu_evtyper_mask(vcpu->kvm);
+	u64 val = p->regval & evmask;
+
+	__vcpu_sys_reg(vcpu, reg) = val;
+
+	if (idx == ARMV8_PMU_CYCLE_IDX)
+		write_pmccfiltr(val);
+	else if (idx == ARMV8_PMU_INSTR_IDX)
+		write_pmicfiltr(val);
+	else
+		write_pmevtypern(idx, val);
+}
+
 static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 			       const struct sys_reg_desc *r)
 {
@@ -1063,7 +1084,9 @@ static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 	if (!pmu_counter_idx_valid(vcpu, idx))
 		return false;

-	if (p->is_write) {
+	if (kvm_vcpu_pmu_is_partitioned(vcpu) && p->is_write) {
+		writethrough_pmevtyper(vcpu, p, reg, idx);
+	} else if (p->is_write) {
 		kvm_pmu_set_counter_event_type(vcpu, p->regval, idx);
 		kvm_vcpu_pmu_restore_guest(vcpu);
 	} else {
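The write-through idea in this patch can be modelled in a few lines of standalone C. This is a toy model under stated assumptions: struct toy_evtyper and writethrough() are hypothetical names, and the hw field stands in for the physical register that the real code writes with a system register accessor.

```c
#include <stdint.h>

/* Toy model of one trapped event-type register (illustrative only). */
struct toy_evtyper {
	uint64_t shadow;	/* value KVM keeps for the vCPU */
	uint64_t hw;		/* stand-in for the physical register */
};

/*
 * Apply a trapped guest write: mask it down to the bits the guest is
 * allowed to set, then update both copies so the write takes effect
 * immediately instead of at the next guest entry.
 */
void writethrough(struct toy_evtyper *r, uint64_t guest_val,
		  uint64_t allowed_mask)
{
	uint64_t val = guest_val & allowed_mask;

	r->shadow = val;
	r->hw = val;
}
```

Updating only the shadow copy would leave the hardware counting the old event until something else synchronized the two, which is the delay the commit message refers to.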
On Mon, Jun 02, 2025 at 07:26:55PM +0000, Colton Lewis wrote:
With FGT in place, the remaining trapped registers need to be written through to the underlying physical registers as well as the virtual ones. Failing to do this means delaying when guest writes take effect.
Signed-off-by: Colton Lewis coltonlewis@google.com
arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index d368eeb4f88e..afd06400429a 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -18,6 +18,7 @@ #include <linux/printk.h> #include <linux/uaccess.h> #include <linux/irqchip/arm-gic-v3.h> +#include <linux/perf/arm_pmu.h> #include <linux/perf/arm_pmuv3.h> #include <asm/arm_pmuv3.h> @@ -942,7 +943,11 @@ static bool pmu_counter_idx_valid(struct kvm_vcpu *vcpu, u64 idx) { u64 pmcr, val;
-	pmcr = kvm_vcpu_read_pmcr(vcpu);
+	if (kvm_vcpu_pmu_is_partitioned(vcpu))
+		pmcr = read_pmcr();
Reading PMCR_EL0 from EL2 is not going to have the desired effect. PMCR_EL0.N only returns HPMN when read from the guest.
+	else
+		pmcr = kvm_vcpu_read_pmcr(vcpu);
+
 	val = FIELD_GET(ARMV8_PMU_PMCR_N, pmcr);
 	if (idx >= val && idx != ARMV8_PMU_CYCLE_IDX) {
 		kvm_inject_undefined(vcpu);
@@ -1037,6 +1042,22 @@ static bool access_pmu_evcntr(struct kvm_vcpu *vcpu,
 	return true;
 }

+static void writethrough_pmevtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+				   u64 reg, u64 idx)
+{
+	u64 evmask = kvm_pmu_evtyper_mask(vcpu->kvm);
+	u64 val = p->regval & evmask;
+
+	__vcpu_sys_reg(vcpu, reg) = val;
+
+	if (idx == ARMV8_PMU_CYCLE_IDX)
+		write_pmccfiltr(val);
+	else if (idx == ARMV8_PMU_INSTR_IDX)
+		write_pmicfiltr(val);
+	else
+		write_pmevtypern(idx, val);
+}
How are you preventing the VM from configuring an event counter to count at EL2?
I see that you're setting MDCR_EL2.HPMD (which assumes FEAT_PMUv3p1) but due to an architecture bug there's no control to prohibit the cycle counter until FEAT_PMUv3p5 (MDCR_EL2.HCCD).
Since you're already trapping PMCCFILTR you could potentially configure the hardware value in such a way that it filters EL2.
static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) { @@ -1063,7 +1084,9 @@ static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p, if (!pmu_counter_idx_valid(vcpu, idx)) return false;
-	if (p->is_write) {
+	if (kvm_vcpu_pmu_is_partitioned(vcpu) && p->is_write) {
+		writethrough_pmevtyper(vcpu, p, reg, idx);
What about the vPMU event filter?
+	} else if (p->is_write) {
 		kvm_pmu_set_counter_event_type(vcpu, p->regval, idx);
 		kvm_vcpu_pmu_restore_guest(vcpu);
 	} else {
-- 2.49.0.1204.g71687c7c1d-goog
Thanks, Oliver
Oliver Upton oliver.upton@linux.dev writes:
On Mon, Jun 02, 2025 at 07:26:55PM +0000, Colton Lewis wrote:
With FGT in place, the remaining trapped registers need to be written through to the underlying physical registers as well as the virtual ones. Failing to do this means delaying when guest writes take effect.
Signed-off-by: Colton Lewis coltonlewis@google.com
arch/arm64/kvm/sys_regs.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index d368eeb4f88e..afd06400429a 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -18,6 +18,7 @@ #include <linux/printk.h> #include <linux/uaccess.h> #include <linux/irqchip/arm-gic-v3.h> +#include <linux/perf/arm_pmu.h> #include <linux/perf/arm_pmuv3.h>
#include <asm/arm_pmuv3.h> @@ -942,7 +943,11 @@ static bool pmu_counter_idx_valid(struct kvm_vcpu *vcpu, u64 idx) { u64 pmcr, val;
-	pmcr = kvm_vcpu_read_pmcr(vcpu);
+	if (kvm_vcpu_pmu_is_partitioned(vcpu))
+		pmcr = read_pmcr();
Reading PMCR_EL0 from EL2 is not going to have the desired effect. PMCR_EL0.N only returns HPMN when read from the guest.
Okay. I'll change that.
+	else
+		pmcr = kvm_vcpu_read_pmcr(vcpu);
+
 	val = FIELD_GET(ARMV8_PMU_PMCR_N, pmcr);
 	if (idx >= val && idx != ARMV8_PMU_CYCLE_IDX) {
 		kvm_inject_undefined(vcpu);
@@ -1037,6 +1042,22 @@ static bool access_pmu_evcntr(struct kvm_vcpu *vcpu, return true; }
+static void writethrough_pmevtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+				   u64 reg, u64 idx)
+{
+	u64 evmask = kvm_pmu_evtyper_mask(vcpu->kvm);
+	u64 val = p->regval & evmask;
+
+	__vcpu_sys_reg(vcpu, reg) = val;
+
+	if (idx == ARMV8_PMU_CYCLE_IDX)
+		write_pmccfiltr(val);
+	else if (idx == ARMV8_PMU_INSTR_IDX)
+		write_pmicfiltr(val);
+	else
+		write_pmevtypern(idx, val);
+}
How are you preventing the VM from configuring an event counter to count at EL2?
I had thought that's what kvm_pmu_evtyper_mask() did since masking with that is what kvm_pmu_set_counter_event_type() writes to the vCPU register.
I see that you're setting MDCR_EL2.HPMD (which assumes FEAT_PMUv3p1) but due to an architecture bug there's no control to prohibit the cycle counter until FEAT_PMUv3p5 (MDCR_EL2.HCCD).
I'll fix that.
Since you're already trapping PMCCFILTR you could potentially configure the hardware value in such a way that it filters EL2.
Sure.
static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) { @@ -1063,7 +1084,9 @@ static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p, if (!pmu_counter_idx_valid(vcpu, idx)) return false;
- if (p->is_write) {
+ if (kvm_vcpu_pmu_is_partitioned(vcpu) && p->is_write) {
+     writethrough_pmevtyper(vcpu, p, reg, idx);
What about the vPMU event filter?
I'll check that too.
+ } else if (p->is_write) {
kvm_pmu_set_counter_event_type(vcpu, p->regval, idx);
kvm_vcpu_pmu_restore_guest(vcpu);
} else {
-- 2.49.0.1204.g71687c7c1d-goog
Thanks, Oliver
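The suggestion above to filter EL2 through the already-trapped PMCCFILTR write could look roughly like the sketch below. This is an illustration under assumptions, not code from the series: it clears the NSH bit (bit 27 of the filter registers, named ARMV8_PMU_INCLUDE_EL2 in the kernel headers) before the value reaches hardware. The helper name `guest_filter_to_hw` and the local macro are hypothetical.

```c
#include <stdint.h>

/* Bit 27 (NSH) of PMCCFILTR_EL0 / PMEVTYPERn_EL0 enables counting at EL2. */
#define FILTER_NSH (UINT64_C(1) << 27)

/*
 * Hypothetical helper: derive the value written to the physical filter
 * register from the guest's value, with EL2 counting always disabled.
 * The guest's own view (the virtual register) can keep the bit as written.
 */
static uint64_t guest_filter_to_hw(uint64_t guest_val)
{
	return guest_val & ~FILTER_NSH;
}
```

Whether the virtual register should also hide the bit from the guest is a separate policy question that the thread leaves open.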
Because PMXEVTYPER is trapped and PMSELR is not, it is not appropriate to use the virtual PMSELR register when it could be outdated and lead to an invalid write. Use the physical register.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm64/include/asm/arm_pmuv3.h | 7 ++++++- arch/arm64/kvm/sys_regs.c | 9 +++++++-- 2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/arm_pmuv3.h b/arch/arm64/include/asm/arm_pmuv3.h index 32c003a7b810..8eee8cb218ea 100644 --- a/arch/arm64/include/asm/arm_pmuv3.h +++ b/arch/arm64/include/asm/arm_pmuv3.h @@ -72,11 +72,16 @@ static inline u64 read_pmcr(void) return read_sysreg(pmcr_el0); }
-static inline void write_pmselr(u32 val) +static inline void write_pmselr(u64 val) { write_sysreg(val, pmselr_el0); }
+static inline u64 read_pmselr(void) +{ + return read_sysreg(pmselr_el0); +} + static inline void write_pmccntr(u64 val) { write_sysreg(val, pmccntr_el0); diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index afd06400429a..377fa7867152 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1061,14 +1061,19 @@ static void writethrough_pmevtyper(struct kvm_vcpu *vcpu, struct sys_reg_params static bool access_pmu_evtyper(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) { - u64 idx, reg; + u64 idx, reg, pmselr;
if (pmu_access_el0_disabled(vcpu)) return false;
if (r->CRn == 9 && r->CRm == 13 && r->Op2 == 1) { /* PMXEVTYPER_EL0 */ - idx = SYS_FIELD_GET(PMSELR_EL0, SEL, __vcpu_sys_reg(vcpu, PMSELR_EL0)); + if (kvm_vcpu_pmu_is_partitioned(vcpu)) + pmselr = read_pmselr(); + else + pmselr = __vcpu_sys_reg(vcpu, PMSELR_EL0); + + idx = SYS_FIELD_GET(PMSELR_EL0, SEL, pmselr); reg = PMEVTYPER0_EL0 + idx; } else if (r->CRn == 14 && (r->CRm & 12) == 12) { idx = ((r->CRm & 3) << 3) | (r->Op2 & 7);
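To illustrate why the physical register matters here, the following toy model (all names hypothetical, not kernel code) keeps a hardware value that the guest writes untrapped and a shadow copy that is only refreshed at save time. PMSELR_EL0.SEL occupies bits [4:0], so the index is a simple mask of whichever copy is consulted.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMSELR_SEL_MASK UINT64_C(0x1f) /* PMSELR_EL0.SEL is bits [4:0] */

/* Toy model of the two copies of PMSELR_EL0 that exist when partitioned. */
struct pmselr_state {
	uint64_t hw;     /* last value the guest wrote, untrapped */
	uint64_t shadow; /* in-memory copy, possibly stale until the next save */
};

/* Pick the counter index the way the patched PMXEVTYPER handler does. */
static unsigned int selected_counter(const struct pmselr_state *s,
				     bool partitioned)
{
	uint64_t pmselr = partitioned ? s->hw : s->shadow;

	return (unsigned int)(pmselr & PMSELR_SEL_MASK);
}
```

With `hw = 3` and a stale `shadow = 0`, the partitioned path selects counter 3 while the shadow copy would have pointed the write at counter 0.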
With FGT in place, the remaining trapped registers need to be written through to the underlying physical registers as well as the virtual ones. Failing to do this means delaying when guest writes take effect.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm64/include/asm/arm_pmuv3.h | 10 ++++++++++ arch/arm64/kvm/sys_regs.c | 17 ++++++++++++++++- 2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/arm_pmuv3.h b/arch/arm64/include/asm/arm_pmuv3.h index 8eee8cb218ea..5d01ed25c4ef 100644 --- a/arch/arm64/include/asm/arm_pmuv3.h +++ b/arch/arm64/include/asm/arm_pmuv3.h @@ -142,6 +142,16 @@ static inline u64 read_pmicfiltr(void) return read_sysreg_s(SYS_PMICFILTR_EL0); }
+static inline void write_pmovsset(u64 val) +{ + write_sysreg(val, pmovsset_el0); +} + +static inline u64 read_pmovsset(void) +{ + return read_sysreg(pmovsset_el0); +} + static inline void write_pmovsclr(u64 val) { write_sysreg(val, pmovsclr_el0); diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 377fa7867152..81a4ba7e6038 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -1169,6 +1169,19 @@ static bool access_pminten(struct kvm_vcpu *vcpu, struct sys_reg_params *p, return true; }
+static void writethrough_pmovs(struct kvm_vcpu *vcpu, struct sys_reg_params *p, bool set) +{ + u64 mask = kvm_pmu_accessible_counter_mask(vcpu); + + if (set) { + __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= (p->regval & mask); + write_pmovsset(p->regval & mask); + } else { + __vcpu_sys_reg(vcpu, PMOVSSET_EL0) &= ~(p->regval & mask); + write_pmovsclr(p->regval & mask); + } +} + static bool access_pmovs(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) { @@ -1177,7 +1190,9 @@ static bool access_pmovs(struct kvm_vcpu *vcpu, struct sys_reg_params *p, if (pmu_access_el0_disabled(vcpu)) return false;
- if (p->is_write) { + if (kvm_vcpu_pmu_is_partitioned(vcpu) && p->is_write) { + writethrough_pmovs(vcpu, p, r->CRm & 0x2); + } else if (p->is_write) { if (r->CRm & 0x2) /* accessing PMOVSSET_EL0 */ __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= (p->regval & mask);
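PMOVSSET_EL0 and PMOVSCLR_EL0 are write-one-to-set and write-one-to-clear views of the same overflow state, so a write-through passes the masked guest value to either register unchanged. A small model of that behavior, with hypothetical names and no real register access:

```c
#include <stdint.h>

/* Toy overflow status register with separate set and clear views. */
struct ovs_reg {
	uint64_t bits;
};

/* Writing 1s to the SET view raises those flags. */
static void ovs_write_set(struct ovs_reg *r, uint64_t val)
{
	r->bits |= val;
}

/* Writing 1s to the CLR view lowers those flags and no others. */
static void ovs_write_clr(struct ovs_reg *r, uint64_t val)
{
	r->bits &= ~val;
}

/* Apply a guest write, restricted to the counters the guest may touch. */
static void guest_write_ovs(struct ovs_reg *r, uint64_t regval,
			    uint64_t mask, int set)
{
	uint64_t val = regval & mask;

	if (set)
		ovs_write_set(r, val);
	else
		ovs_write_clr(r, val);
}
```

Flags outside `mask` are never modified in either direction, which is the property the host-reserved counters rely on.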
Save and restore newly untrapped registers that will be directly accessed by the guest when the PMU is partitioned.
* PMEVCNTRn_EL0
* PMCCNTR_EL0
* PMICNTR_EL0
* PMUSERENR_EL0
* PMSELR_EL0
* PMCR_EL0
* PMCNTEN_EL0
* PMINTEN_EL1
If the PMU is not partitioned or MDCR_EL2.TPM is set, all PMU registers are trapped so return immediately.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm64/include/asm/arm_pmuv3.h | 17 ++++- arch/arm64/include/asm/kvm_host.h | 4 + arch/arm64/kvm/arm.c | 2 + arch/arm64/kvm/pmu-part.c | 117 +++++++++++++++++++++++++++++ 4 files changed, 139 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/arm_pmuv3.h b/arch/arm64/include/asm/arm_pmuv3.h index 5d01ed25c4ef..a00845cffb3f 100644 --- a/arch/arm64/include/asm/arm_pmuv3.h +++ b/arch/arm64/include/asm/arm_pmuv3.h @@ -107,6 +107,11 @@ static inline void write_pmcntenset(u64 val) write_sysreg(val, pmcntenset_el0); }
+static inline u64 read_pmcntenset(void) +{ + return read_sysreg(pmcntenset_el0); +} + static inline void write_pmcntenclr(u64 val) { write_sysreg(val, pmcntenclr_el0); @@ -117,6 +122,11 @@ static inline void write_pmintenset(u64 val) write_sysreg(val, pmintenset_el1); }
+static inline u64 read_pmintenset(void) +{ + return read_sysreg(pmintenset_el1); +} + static inline void write_pmintenclr(u64 val) { write_sysreg(val, pmintenclr_el1); @@ -162,11 +172,16 @@ static inline u64 read_pmovsclr(void) return read_sysreg(pmovsclr_el0); }
-static inline void write_pmuserenr(u32 val) +static inline void write_pmuserenr(u64 val) { write_sysreg(val, pmuserenr_el0); }
+static inline u64 read_pmuserenr(void) +{ + return read_sysreg(pmuserenr_el0); +} + static inline void write_pmuacr(u64 val) { write_sysreg_s(val, SYS_PMUACR_EL1); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 4ea045098bfa..955359f20161 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -453,9 +453,11 @@ enum vcpu_sysreg { PMEVCNTR0_EL0, /* Event Counter Register (0-30) */ PMEVCNTR30_EL0 = PMEVCNTR0_EL0 + 30, PMCCNTR_EL0, /* Cycle Counter Register */ + PMICNTR_EL0, /* Instruction Counter Register */ PMEVTYPER0_EL0, /* Event Type Register (0-30) */ PMEVTYPER30_EL0 = PMEVTYPER0_EL0 + 30, PMCCFILTR_EL0, /* Cycle Count Filter Register */ + PMICFILTR_EL0, /* Instruction Count Filter Register */ PMCNTENSET_EL0, /* Count Enable Set Register */ PMINTENSET_EL1, /* Interrupt Enable Set Register */ PMOVSSET_EL0, /* Overflow Flag Status Set Register */ @@ -1713,6 +1715,8 @@ struct kvm_pmu_events *kvm_get_pmu_events(void); void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu); void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu); bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu); +void kvm_pmu_load(struct kvm_vcpu *vcpu); +void kvm_pmu_put(struct kvm_vcpu *vcpu);
/* * Updates the vcpu's view of the pmu events for this cpu. diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 3b9c003f2ea6..4a1cc7b72295 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -615,6 +615,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_vcpu_load_vhe(vcpu); kvm_arch_vcpu_load_fp(vcpu); kvm_vcpu_pmu_restore_guest(vcpu); + kvm_pmu_load(vcpu); if (kvm_arm_is_pvtime_enabled(&vcpu->arch)) kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
@@ -657,6 +658,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) kvm_timer_vcpu_put(vcpu); kvm_vgic_put(vcpu); kvm_vcpu_pmu_restore_host(vcpu); + kvm_pmu_put(vcpu); if (vcpu_has_nv(vcpu)) kvm_vcpu_put_hw_mmu(vcpu); kvm_arm_vmid_clear_active(); diff --git a/arch/arm64/kvm/pmu-part.c b/arch/arm64/kvm/pmu-part.c index 179a4144cfd0..40c72caef34e 100644 --- a/arch/arm64/kvm/pmu-part.c +++ b/arch/arm64/kvm/pmu-part.c @@ -8,6 +8,7 @@ #include <linux/perf/arm_pmu.h> #include <linux/perf/arm_pmuv3.h>
+#include <asm/kvm_emulate.h> #include <asm/kvm_pmu.h> #include <asm/arm_pmuv3.h>
@@ -202,3 +203,119 @@ void kvm_pmu_host_counters_disable(void) mdcr &= ~MDCR_EL2_HPME; write_sysreg(mdcr, mdcr_el2); } + +/** + * kvm_pmu_load() - Load untrapped PMU registers + * @vcpu: Pointer to struct kvm_vcpu + * + * Load all untrapped PMU registers from the VCPU into the PCPU. Mask + * to only bits belonging to guest-reserved counters and leave + * host-reserved counters alone in bitmask registers. + */ +void kvm_pmu_load(struct kvm_vcpu *vcpu) +{ + struct arm_pmu *pmu = vcpu->kvm->arch.arm_pmu; + u64 mask = kvm_pmu_guest_counter_mask(pmu); + u8 i; + u64 val; + + /* + * If the PMU is not partitioned, don't bother. + * + * If we have MDCR_EL2_TPM, every PMU access is trapped which + * implies we are using the emulated PMU instead of direct + * access. + */ + if (!kvm_pmu_is_partitioned(pmu) || (vcpu->arch.mdcr_el2 & MDCR_EL2_TPM)) + return; + + for (i = 0; i < pmu->hpmn; i++) { + val = __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i); + write_pmevcntrn(i, val); + } + + val = __vcpu_sys_reg(vcpu, PMCCNTR_EL0); + write_pmccntr(val); + + if (cpus_have_final_cap(ARM64_HAS_PMICNTR)) { + val = __vcpu_sys_reg(vcpu, PMICNTR_EL0); + write_pmicntr(val); + } + + val = __vcpu_sys_reg(vcpu, PMUSERENR_EL0); + write_pmuserenr(val); + + val = __vcpu_sys_reg(vcpu, PMSELR_EL0); + write_pmselr(val); + + val = __vcpu_sys_reg(vcpu, PMCR_EL0); + write_pmcr(val); + + /* + * Loading these registers is tricky because of + * 1. Applying only the bits for guest counters (indicated by mask) + * 2. Setting and clearing are different registers + */ + val = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); + write_pmcntenset(val & mask); + write_pmcntenclr(~val & mask); + + val = __vcpu_sys_reg(vcpu, PMINTENSET_EL1); + write_pmintenset(val & mask); + write_pmintenclr(~val & mask); +} + +/** + * kvm_pmu_put() - Put untrapped PMU registers + * @vcpu: Pointer to struct kvm_vcpu + * + * Put all untrapped PMU registers from the VCPU into the PCPU. 
Mask + * to only bits belonging to guest-reserved counters and leave + * host-reserved counters alone in bitmask registers. + */ +void kvm_pmu_put(struct kvm_vcpu *vcpu) +{ + struct arm_pmu *pmu = vcpu->kvm->arch.arm_pmu; + u64 mask = kvm_pmu_guest_counter_mask(pmu); + u8 i; + u64 val; + + /* + * If the PMU is not partitioned, don't bother. + * + * If we have MDCR_EL2_TPM, every PMU access is trapped which + * implies we are using the emulated PMU instead of direct + * access. + */ + if (!kvm_pmu_is_partitioned(pmu) || (vcpu->arch.mdcr_el2 & MDCR_EL2_TPM)) + return; + + for (i = 0; i < pmu->hpmn; i++) { + val = read_pmevcntrn(i); + __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) = val; + } + + val = read_pmccntr(); + __vcpu_sys_reg(vcpu, PMCCNTR_EL0) = val; + + if (this_cpu_has_cap(ARM64_HAS_PMICNTR)) { + val = read_pmicntr(); + __vcpu_sys_reg(vcpu, PMICNTR_EL0) = val; + } + + val = read_pmuserenr(); + __vcpu_sys_reg(vcpu, PMUSERENR_EL0) = val; + + val = read_pmselr(); + __vcpu_sys_reg(vcpu, PMSELR_EL0) = val; + + val = read_pmcr(); + __vcpu_sys_reg(vcpu, PMCR_EL0) = val; + + /* Mask these to only save the guest relevant bits. */ + val = read_pmcntenset(); + __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) = val & mask; + + val = read_pmintenset(); + __vcpu_sys_reg(vcpu, PMINTENSET_EL1) = val & mask; +}
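The masked load of the enable registers in kvm_pmu_load() can be checked against a small model. The sketch below uses hypothetical names and a plain struct in place of hardware. It shows that writing `val & mask` to the set view and `~val & mask` to the clear view makes the guest-owned bits equal `val` while leaving host-owned bits alone.

```c
#include <stdint.h>

/* Toy enable register with write-one-to-set / write-one-to-clear views. */
struct enable_reg {
	uint64_t bits;
};

static void enable_write_set(struct enable_reg *r, uint64_t v)
{
	r->bits |= v;
}

static void enable_write_clr(struct enable_reg *r, uint64_t v)
{
	r->bits &= ~v;
}

/* Force the guest-owned bits of hw to match val; host bits are untouched. */
static void load_guest_bits(struct enable_reg *hw, uint64_t val,
			    uint64_t guest_mask)
{
	enable_write_set(hw, val & guest_mask);
	enable_write_clr(hw, ~val & guest_mask);
}

/* Save only the guest-owned bits, mirroring the masking in kvm_pmu_put(). */
static uint64_t save_guest_bits(const struct enable_reg *hw,
				uint64_t guest_mask)
{
	return hw->bits & guest_mask;
}
```

Starting from `0xA5` with a guest mask of `0x0F`, loading `0x0A` yields `0xAA`: the upper (host) nibble is preserved and the lower (guest) nibble is replaced.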
Guest counters will still trigger interrupts that need to be handled by the host PMU interrupt handler. Clear the overflow flags in hardware to handle the interrupt as normal, but record which guest overflow flags were set in the virtual overflow register for later injecting the interrupt into the guest.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm/include/asm/arm_pmuv3.h | 6 ++++++ arch/arm64/include/asm/kvm_pmu.h | 2 ++ arch/arm64/kvm/pmu-part.c | 17 +++++++++++++++++ drivers/perf/arm_pmuv3.c | 15 +++++++++++---- 4 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/arch/arm/include/asm/arm_pmuv3.h b/arch/arm/include/asm/arm_pmuv3.h index 1687b4031ec2..26e149bdc8b0 100644 --- a/arch/arm/include/asm/arm_pmuv3.h +++ b/arch/arm/include/asm/arm_pmuv3.h @@ -180,6 +180,11 @@ static inline void write_pmintenset(u32 val) write_sysreg(val, PMINTENSET); }
+static inline u32 read_pmintenset(void) +{ + return read_sysreg(PMINTENSET); +} + static inline void write_pmintenclr(u32 val) { write_sysreg(val, PMINTENCLR); @@ -245,6 +250,7 @@ static inline u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu) return ~0; }
+static inline void kvm_pmu_handle_guest_irq(u64 govf) {}
/* PMU Version in DFR Register */ #define ARMV8_PMU_DFR_VER_NI 0 diff --git a/arch/arm64/include/asm/kvm_pmu.h b/arch/arm64/include/asm/kvm_pmu.h index 4098d4ad03d9..4cefd9fcf52b 100644 --- a/arch/arm64/include/asm/kvm_pmu.h +++ b/arch/arm64/include/asm/kvm_pmu.h @@ -30,6 +30,7 @@ u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu); u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu); void kvm_pmu_host_counters_enable(void); void kvm_pmu_host_counters_disable(void); +void kvm_pmu_handle_guest_irq(u64 govf);
#else
@@ -74,6 +75,7 @@ static inline u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu)
static inline void kvm_pmu_host_counters_enable(void) {} static inline void kvm_pmu_host_counters_disable(void) {} +static inline void kvm_pmu_handle_guest_irq(u64 govf) {}
#endif
diff --git a/arch/arm64/kvm/pmu-part.c b/arch/arm64/kvm/pmu-part.c index 40c72caef34e..0e1a2235e992 100644 --- a/arch/arm64/kvm/pmu-part.c +++ b/arch/arm64/kvm/pmu-part.c @@ -319,3 +319,20 @@ void kvm_pmu_put(struct kvm_vcpu *vcpu) val = read_pmintenset(); __vcpu_sys_reg(vcpu, PMINTENSET_EL1) = val & mask; } + +/** + * kvm_pmu_handle_guest_irq() - Record IRQs in guest counters + * @govf: Bitmask of guest overflowed counters + * + * Record IRQs from overflows in guest-reserved counters in the VCPU + * register for the guest to clear later. + */ +void kvm_pmu_handle_guest_irq(u64 govf) +{ + struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); + + if (!vcpu) + return; + + __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= govf; +} diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c index f447a0f10e2b..20d9b35260d9 100644 --- a/drivers/perf/arm_pmuv3.c +++ b/drivers/perf/arm_pmuv3.c @@ -739,6 +739,8 @@ static u64 armv8pmu_getreset_flags(void)
/* Write to clear flags */ value &= ARMV8_PMU_CNT_MASK_ALL; + /* Only reset interrupt enabled counters. */ + value &= read_pmintenset(); write_pmovsclr(value);
return value; @@ -841,6 +843,7 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu) static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu) { u64 pmovsr; + u64 govf; struct perf_sample_data data; struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events); struct pt_regs *regs; @@ -867,19 +870,17 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu) * to prevent skews in group events. */ armv8pmu_stop(cpu_pmu); + for_each_set_bit(idx, cpu_pmu->cntr_mask, ARMPMU_MAX_HWEVENTS) { struct perf_event *event = cpuc->events[idx]; struct hw_perf_event *hwc;
/* Ignore if we don't have an event. */ - if (!event) - continue; - /* * We have a single interrupt for all counters. Check that * each counter has overflowed before we process it. */ - if (!armv8pmu_counter_has_overflowed(pmovsr, idx)) + if (!event || !armv8pmu_counter_has_overflowed(pmovsr, idx)) continue;
hwc = &event->hw; @@ -896,6 +897,12 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu) if (perf_event_overflow(event, &data, regs)) cpu_pmu->disable(event); } + + govf = pmovsr & kvm_pmu_guest_counter_mask(cpu_pmu); + + if (kvm_pmu_is_partitioned(cpu_pmu) && govf) + kvm_pmu_handle_guest_irq(govf); + armv8pmu_start(cpu_pmu);
return IRQ_HANDLED;
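The bookkeeping in this patch amounts to splitting the overflow status by ownership and accumulating the guest part in the virtual register. A standalone sketch of that split, with hypothetical names and plain integers standing in for the registers:

```c
#include <stdint.h>

/* Overflow flags separated by which side owns the counter. */
struct overflow_split {
	uint64_t host;
	uint64_t guest;
};

/* Split the hardware overflow status using the guest counter mask. */
static struct overflow_split split_overflow(uint64_t pmovsr,
					    uint64_t guest_mask)
{
	struct overflow_split s = {
		.host = pmovsr & ~guest_mask,
		.guest = pmovsr & guest_mask,
	};

	return s;
}

/* Guest overflows accumulate in the virtual register until the guest clears them. */
static void record_guest_overflow(uint64_t *virt_pmovsset, uint64_t govf)
{
	*virt_pmovsset |= govf;
}
```

The host part continues through the normal perf overflow path, while the guest part is only recorded for later injection.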
When we re-enter the VM after handling a PMU interrupt, calculate whether it was any of the guest counters that overflowed and inject an interrupt into the guest if so.
Signed-off-by: Colton Lewis coltonlewis@google.com --- arch/arm64/include/asm/kvm_host.h | 3 ++- arch/arm64/kvm/pmu-emul.c | 4 ++-- arch/arm64/kvm/pmu-part.c | 22 +++++++++++++++++++++- arch/arm64/kvm/pmu.c | 7 ++++++- 4 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 955359f20161..0af8cc4c340f 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -1714,9 +1714,10 @@ bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu); struct kvm_pmu_events *kvm_get_pmu_events(void); void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu); void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu); -bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu); +bool kvm_pmu_emul_overflow_status(struct kvm_vcpu *vcpu); void kvm_pmu_load(struct kvm_vcpu *vcpu); void kvm_pmu_put(struct kvm_vcpu *vcpu); +bool kvm_pmu_part_overflow_status(struct kvm_vcpu *vcpu);
/* * Updates the vcpu's view of the pmu events for this cpu. diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index ff86c66e1b48..0ffabada1dad 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -320,7 +320,7 @@ void kvm_pmu_reprogram_counter_mask(struct kvm_vcpu *vcpu, u64 val) * counter where the values of the global enable control, PMOVSSET_EL0[n], and * PMINTENSET_EL1[n] are all 1. */ -bool kvm_pmu_overflow_status(struct kvm_vcpu *vcpu) +bool kvm_pmu_emul_overflow_status(struct kvm_vcpu *vcpu) { u64 reg = __vcpu_sys_reg(vcpu, PMOVSSET_EL0);
@@ -457,7 +457,7 @@ static void kvm_pmu_perf_overflow(struct perf_event *perf_event, kvm_pmu_counter_increment(vcpu, BIT(idx + 1), ARMV8_PMUV3_PERFCTR_CHAIN);
- if (kvm_pmu_overflow_status(vcpu)) { + if (kvm_pmu_emul_overflow_status(vcpu)) { kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
if (!in_nmi()) diff --git a/arch/arm64/kvm/pmu-part.c b/arch/arm64/kvm/pmu-part.c index 0e1a2235e992..1d85e7ce76c8 100644 --- a/arch/arm64/kvm/pmu-part.c +++ b/arch/arm64/kvm/pmu-part.c @@ -252,7 +252,7 @@ void kvm_pmu_load(struct kvm_vcpu *vcpu) write_pmcr(val);
/* - * Loading these registers is tricky because of + * Loading these registers is more intricate because of * 1. Applying only the bits for guest counters (indicated by mask) * 2. Setting and clearing are different registers */ @@ -336,3 +336,23 @@ void kvm_pmu_handle_guest_irq(u64 govf)
__vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= govf; } + +/** + * kvm_pmu_part_overflow_status() - Determine if any guest counters have overflowed + * @vcpu: Pointer to struct kvm_vcpu + * + * Determine if any guest counters have overflowed and therefore an + * IRQ needs to be injected into the guest. + * + * Return: True if there was an overflow, false otherwise + */ +bool kvm_pmu_part_overflow_status(struct kvm_vcpu *vcpu) +{ + struct arm_pmu *pmu = vcpu->kvm->arch.arm_pmu; + u64 mask = kvm_pmu_guest_counter_mask(pmu); + u64 pmovs = __vcpu_sys_reg(vcpu, PMOVSSET_EL0); + u64 pmint = read_pmintenset(); + u64 pmcr = read_pmcr(); + + return (pmcr & ARMV8_PMU_PMCR_E) && (mask & pmovs & pmint); +} diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c index 2dcfac3ea9c6..6c3151dec25a 100644 --- a/arch/arm64/kvm/pmu.c +++ b/arch/arm64/kvm/pmu.c @@ -425,7 +425,11 @@ static void kvm_pmu_update_state(struct kvm_vcpu *vcpu) struct kvm_pmu *pmu = &vcpu->arch.pmu; bool overflow;
- overflow = kvm_pmu_overflow_status(vcpu); + if (kvm_vcpu_pmu_is_partitioned(vcpu)) + overflow = kvm_pmu_part_overflow_status(vcpu); + else + overflow = kvm_pmu_emul_overflow_status(vcpu); + if (pmu->irq_level == overflow) return;
@@ -694,6 +698,7 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) return -EBUSY;
kvm_debug("Set kvm ARM PMU irq: %d\n", irq); + vcpu->arch.pmu.irq_num = irq; return 0; }
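The condition computed by kvm_pmu_part_overflow_status() can be exercised in isolation. The sketch below restates it over plain integers; the function name is hypothetical, and PMCR_EL0.E is taken to be bit 0 as in the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMCR_E UINT64_C(1) /* PMCR_EL0.E, the global enable, is bit 0 */

/*
 * An interrupt is due when the PMU is globally enabled and some
 * guest-owned counter has both its overflow flag and its interrupt
 * enable set.
 */
static bool guest_irq_pending(uint64_t pmcr, uint64_t pmovs,
			      uint64_t pminten, uint64_t guest_mask)
{
	return (pmcr & PMCR_E) && (guest_mask & pmovs & pminten);
}
```

An overflow on a host-reserved counter, or on a guest counter whose interrupt enable is clear, does not raise the guest interrupt line.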
Add KVM_ARM_PARTITION_PMU to partition the PMU for a given vCPU with a specified number of reserved host counters. Add a corresponding KVM_CAP_ARM_PARTITION_PMU to check for this ability.
This capability is allowed on an initialized vCPU where PMUv3, VHE, and FGT are supported.
If the ioctl is never called, partitioning will fall back on kernel command line kvm.reserved_host_counters as before.
Signed-off-by: Colton Lewis coltonlewis@google.com --- Documentation/virt/kvm/api.rst | 16 ++++++++++++++++ arch/arm64/kvm/arm.c | 21 +++++++++++++++++++++ include/uapi/linux/kvm.h | 4 ++++ 3 files changed, 41 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index fe3d6b5d2acc..88b851cb6f66 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6464,6 +6464,22 @@ the capability to be present.
`flags` must currently be zero.
+4.144 KVM_ARM_PARTITION_PMU +--------------------------- + +:Capability: KVM_CAP_ARM_PARTITION_PMU +:Architectures: arm64 +:Type: vcpu ioctl +:Parameters: arg[0] is the number of counters to reserve for the host + +This API controls the ability to partition the PMU counters into two +sets, one set reserved for the host and one set reserved for the +guest. When partitioned, KVM will allow the guest direct hardware +access to the most commonly used PMU capabilities for those counters, +bypassing the KVM traps in the standard emulated PMU implementation +and reducing the overhead of any guest software that uses PMU +capabilities such as `perf`. The host PMU driver will not access any +of the counters or bits reserved for the guest.
.. _kvm_run:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 4a1cc7b72295..1c44160d3b2d 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -21,6 +21,7 @@ #include <linux/irqbypass.h> #include <linux/sched/stat.h> #include <linux/psci.h> +#include <linux/perf/arm_pmu.h> #include <trace/events/kvm.h>
#define CREATE_TRACE_POINTS @@ -38,6 +39,7 @@ #include <asm/kvm_emulate.h> #include <asm/kvm_mmu.h> #include <asm/kvm_nested.h> +#include <asm/kvm_pmu.h> #include <asm/kvm_pkvm.h> #include <asm/kvm_ptrauth.h> #include <asm/sections.h> @@ -382,6 +384,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ARM_PMU_V3: r = kvm_supports_guest_pmuv3(); break; + case KVM_CAP_ARM_PARTITION_PMU: + r = kvm_pmu_partition_supported(); + break; case KVM_CAP_ARM_INJECT_SERROR_ESR: r = cpus_have_final_cap(ARM64_HAS_RAS_EXTN); break; @@ -1809,6 +1814,22 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
return kvm_arm_vcpu_finalize(vcpu, what); } + case KVM_ARM_PARTITION_PMU: { + struct arm_pmu *pmu; + u8 host_counters; + + if (unlikely(!kvm_vcpu_initialized(vcpu))) + return -ENOEXEC; + + if (!kvm_pmu_partition_supported()) + return -EPERM; + + if (copy_from_user(&host_counters, argp, sizeof(host_counters))) + return -EFAULT; + + pmu = vcpu->kvm->arch.arm_pmu; + return kvm_pmu_partition(pmu, host_counters); + } default: r = -EINVAL; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index c9d4a908976e..f7387c0696d5 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -932,6 +932,7 @@ struct kvm_enable_cap { #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239 #define KVM_CAP_ARM_EL2 240 #define KVM_CAP_ARM_EL2_E2H0 241 +#define KVM_CAP_ARM_PARTITION_PMU 242
struct kvm_irq_routing_irqchip { __u32 irqchip; @@ -1410,6 +1411,9 @@ struct kvm_enc_region { #define KVM_GET_SREGS2 _IOR(KVMIO, 0xcc, struct kvm_sregs2) #define KVM_SET_SREGS2 _IOW(KVMIO, 0xcd, struct kvm_sregs2)
+/* Available with KVM_CAP_ARM_PARTITION_PMU */ +#define KVM_ARM_PARTITION_PMU _IOWR(KVMIO, 0xce, u8) + #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0) #define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1)
On Mon, Jun 02, 2025 at 07:27:01PM +0000, Colton Lewis wrote:
- case KVM_ARM_PARTITION_PMU: {
This should be a vCPU attribute similar to the other PMUv3 controls we already have. Ideally a single attribute where userspace tells us it wants partitioning and specifies the PMU ID to use. None of this can be changed after INIT'ing the PMU.
struct arm_pmu *pmu;
u8 host_counters;
if (unlikely(!kvm_vcpu_initialized(vcpu)))
return -ENOEXEC;
if (!kvm_pmu_partition_supported())
return -EPERM;
if (copy_from_user(&host_counters, argp, sizeof(host_counters)))
return -EFAULT;
pmu = vcpu->kvm->arch.arm_pmu;
return kvm_pmu_partition(pmu, host_counters);
Yeah, we really can't be changing the counters available to the ARM PMU driver at this point. What happens to host events already scheduled on the CPU?
Either the partition of host / KVM-owned counters needs to be computed up front (prior to scheduling events) or KVM needs a way to direct perf to reschedule events on the PMU based on the new operating constraints.
Thanks, Oliver
Oliver Upton oliver.upton@linux.dev writes:
On Mon, Jun 02, 2025 at 07:27:01PM +0000, Colton Lewis wrote:
- case KVM_ARM_PARTITION_PMU: {
This should be a vCPU attribute similar to the other PMUv3 controls we already have. Ideally a single attribute where userspace tells us it wants partitioning and specifies the PMU ID to use. None of this can be changed after INIT'ing the PMU.
Okay
struct arm_pmu *pmu;
u8 host_counters;
if (unlikely(!kvm_vcpu_initialized(vcpu)))
return -ENOEXEC;
if (!kvm_pmu_partition_supported())
return -EPERM;
if (copy_from_user(&host_counters, argp, sizeof(host_counters)))
return -EFAULT;
pmu = vcpu->kvm->arch.arm_pmu;
return kvm_pmu_partition(pmu, host_counters);
Yeah, we really can't be changing the counters available to the ARM PMU driver at this point. What happens to host events already scheduled on the CPU?
Okay. I remember talking about this before.
Either the partition of host / KVM-owned counters needs to be computed up front (prior to scheduling events) or KVM needs a way to direct perf to reschedule events on the PMU based on the new operating constraints.
Yes. I will think about it.
It would be cool to have perf reschedule events. I'm not positive how to do that, but it does not look too hard. Can someone comment on the correctness and feasibility of the following?
1. Scan perf events and call event_sched_out on all events using the counters KVM wants.
2. Do the PMU surgery to change the available counters.
3. Call ctx_resched to reschedule events with the available counters.
There is a second option to avoid a permanent partition up front. We know which counters are in use through used_mask. We could check if the partition would claim any counters in use and fail with an error if it would.
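The second option can be sketched as a pure check over bitmasks. The code below is a sketch under assumptions, not a proposal for the actual interface: it assumes the guest takes the low-numbered counters `[0, hpmn)`, as the loops over `pmu->hpmn` in this series suggest. The function names and the choice of `-EBUSY`/`-EINVAL` are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Mask of counters the guest would claim: indices [0, nr - host). */
static uint64_t guest_claim_mask(unsigned int nr_counters,
				 unsigned int host_counters)
{
	unsigned int hpmn;

	if (host_counters >= nr_counters)
		return 0;

	hpmn = nr_counters - host_counters;
	return (hpmn >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << hpmn) - 1);
}

/*
 * Refuse to partition if any counter the guest would claim is already
 * in use by a host event (used_mask models the driver's in-use bitmap).
 */
static int try_partition(uint64_t used_mask, unsigned int nr_counters,
			 unsigned int host_counters)
{
	if (host_counters > nr_counters)
		return -EINVAL;

	if (used_mask & guest_claim_mask(nr_counters, host_counters))
		return -EBUSY;

	return 0;
}
```

With 10 counters and 2 reserved for the host, host events on counters 8 and 9 do not block partitioning, while a host event on counter 0 does.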
Run separate test cases for a partitioned PMU in vpmu_counter_access. Notably, partitioning the PMU untraps PMCR_EL0.N, so that is no longer settable by KVM.
Add a boolean argument to run_access_test() that, if true, partitions the PMU by reserving one host counter and then runs the test for the PMCR_EL0.N value that implies: one less than the number of counters on the host system.
Signed-off-by: Colton Lewis coltonlewis@google.com --- tools/include/uapi/linux/kvm.h | 2 + .../selftests/kvm/arm64/vpmu_counter_access.c | 40 ++++++++++++++++--- 2 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h index b6ae8ad8934b..cb72b57b9b6c 100644 --- a/tools/include/uapi/linux/kvm.h +++ b/tools/include/uapi/linux/kvm.h @@ -930,6 +930,7 @@ struct kvm_enable_cap { #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237 #define KVM_CAP_X86_GUEST_MODE 238 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239 +#define KVM_CAP_ARM_PARTITION_PMU 242
struct kvm_irq_routing_irqchip { __u32 irqchip; @@ -1356,6 +1357,7 @@ struct kvm_vfio_spapr_tce { #define KVM_S390_SET_CMMA_BITS _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log) /* Memory Encryption Commands */ #define KVM_MEMORY_ENCRYPT_OP _IOWR(KVMIO, 0xba, unsigned long) +#define KVM_ARM_PARTITION_PMU _IOWR(KVMIO, 0xce, u8)
struct kvm_enc_region { __u64 addr; diff --git a/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c b/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c index f16b3b27e32e..e06448c1fbb5 100644 --- a/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c +++ b/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c @@ -369,6 +369,7 @@ static void guest_code(uint64_t expected_pmcr_n) pmcr = read_sysreg(pmcr_el0); pmcr_n = get_pmcr_n(pmcr);
+	/* __GUEST_ASSERT(0, "Expect PMCR: %lx", pmcr); */
 	/* Make sure that PMCR_EL0.N indicates the value userspace set */
 	__GUEST_ASSERT(pmcr_n == expected_pmcr_n,
 			"Expected PMCR.N: 0x%lx, PMCR.N: 0x%lx",
@@ -508,16 +509,18 @@ static void test_create_vpmu_vm_with_pmcr_n(uint64_t pmcr_n, bool expect_fail)
  * Create a guest with one vCPU, set the PMCR_EL0.N for the vCPU to @pmcr_n,
  * and run the test.
  */
-static void run_access_test(uint64_t pmcr_n)
+static void run_access_test(uint64_t pmcr_n, bool partition)
 {
 	uint64_t sp;
 	struct kvm_vcpu *vcpu;
 	struct kvm_vcpu_init init;
+	uint8_t host_counters = (uint8_t)partition;
 
 	pr_debug("Test with pmcr_n %lu\n", pmcr_n);
 
 	test_create_vpmu_vm_with_pmcr_n(pmcr_n, false);
 	vcpu = vpmu_vm.vcpu;
+	vcpu_ioctl(vcpu, KVM_ARM_PARTITION_PMU, &host_counters);
 
 	/* Save the initial sp to restore them later to run the guest again */
 	sp = vcpu_get_reg(vcpu, ARM64_CORE_REG(sp_el1));
@@ -529,6 +532,8 @@ static void run_access_test(uint64_t pmcr_n)
 	 * check if PMCR_EL0.N is preserved.
 	 */
 	vm_ioctl(vpmu_vm.vm, KVM_ARM_PREFERRED_TARGET, &init);
+	vcpu_ioctl(vcpu, KVM_ARM_PARTITION_PMU, &host_counters);
+
 	init.features[0] |= (1 << KVM_ARM_VCPU_PMU_V3);
 	aarch64_vcpu_setup(vcpu, &init);
 	vcpu_init_descriptor_tables(vcpu);
@@ -609,7 +614,7 @@ static void run_pmregs_validity_test(uint64_t pmcr_n)
  */
 static void run_error_test(uint64_t pmcr_n)
 {
-	pr_debug("Error test with pmcr_n %lu (larger than the host)\n", pmcr_n);
+	pr_debug("Error test with pmcr_n %lu (larger than the host allows)\n", pmcr_n);
 
 	test_create_vpmu_vm_with_pmcr_n(pmcr_n, true);
 	destroy_vpmu_vm();
@@ -629,20 +634,45 @@ static uint64_t get_pmcr_n_limit(void)
 	return get_pmcr_n(pmcr);
 }
 
-int main(void)
+void test_emulated_pmu(void)
 {
 	uint64_t i, pmcr_n;
 
-	TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_PMU_V3));
+	pr_info("Testing Emulated PMU\n");
 
 	pmcr_n = get_pmcr_n_limit();
 	for (i = 0; i <= pmcr_n; i++) {
-		run_access_test(i);
+		run_access_test(i, false);
 		run_pmregs_validity_test(i);
 	}
 
 	for (i = pmcr_n + 1; i < ARMV8_PMU_MAX_COUNTERS; i++)
 		run_error_test(i);
+}
+
+void test_partitioned_pmu(void)
+{
+	uint64_t i, pmcr_n;
+
+	pr_info("Testing Partitioned PMU\n");
+
+	pmcr_n = get_pmcr_n_limit();
+	run_access_test(pmcr_n - 1, true);
+
+	/* Partitioning implies only one PMCR.N allowed */
+	for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++)
+		if (i != pmcr_n)
+			run_error_test(i);
+}
+
+int main(void)
+{
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_PMU_V3));
+
+	test_emulated_pmu();
+
+	if (kvm_has_cap(KVM_CAP_ARM_PARTITION_PMU))
+		test_partitioned_pmu();
 
 	return 0;
 }
On Mon, Jun 02, 2025 at 07:26:45PM +0000, Colton Lewis wrote:
> Caveats:
> Because the most consistent and performant thing to do was untrap PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect the guest.
> These improvements come at the cost of 7-35 new registers that must be swapped at every vcpu_load and vcpu_put if the feature is enabled. I have been informed KVM would like to avoid paying this cost when possible.
> One solution is to make the trapping changes and context swapping lazy, so that they only take place after the guest has actually accessed the PMU and guests that never access the PMU never pay the cost.
You should try to model this on how we manage the debug breakpoints/watchpoints. In that case the debug register context is loaded if either:
(1) Self-hosted debug is actively in use by the guest, or
(2) The guest has accessed a debug register since the last vcpu_load()
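Applied to the PMU, the same two-condition rule could look roughly like the sketch below. This is only an illustration of the idea, not code from the series or from KVM: `struct vcpu_pmu_state`, its fields, and all three helper names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-vCPU bookkeeping; not a structure from KVM or this series. */
struct vcpu_pmu_state {
	bool guest_pmu_in_use;    /* guest is actively using its counters */
	bool accessed_since_load; /* a PMU access trapped since the last vcpu_load() */
};

/* Load the PMU register context only if one of the two conditions holds. */
static bool pmu_context_needed(const struct vcpu_pmu_state *s)
{
	return s->guest_pmu_in_use || s->accessed_since_load;
}

/* Hypothetical hook: each vcpu_load() starts with "no access seen yet". */
static void reset_on_vcpu_load(struct vcpu_pmu_state *s)
{
	s->accessed_since_load = false;
}

/* Hypothetical hook: the first trapped PMU access flips the vCPU to "loaded". */
static void note_pmu_trap(struct vcpu_pmu_state *s)
{
	s->accessed_since_load = true;
}
```

With this shape, a guest that never touches the PMU keeps both flags clear, so the 7-35 register swap at vcpu_load/vcpu_put would be skipped for it.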
> This is not done here because it is not crucial to the primary functionality and I thought review would be more productive as soon as I had something complete enough for reviewers to easily play with.
> However, this or any better ideas are on the table for inclusion in future re-rolls.
One of the other things that I'd like to see is if we can pare down the number of CPU feature dependencies for a partitioned PMU. Annoyingly, there aren't a lot of machines out there with FEAT_FGT yet, and you should be able to make all of this work in VHE + FEAT_PMUv3p1.
That "just" comes at the cost of extra traps (leaving TPM and potentially TPMCR set). You can mitigate the cost of this by emulating accesses in the fast path that don't need to go out to a kernel context to be serviced. Same goes for requiring FEAT_HPMN0 to expose 0 event counters: we can fall back to TPM traps if needed.
Taking perf out of the picture should still give you a significant reduction in vPMU overheads.
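As a rough illustration of that fallback policy, the decision could be sketched as below. This is one reading of the suggestion under stated assumptions, not the series' implementation: `struct host_pmu_caps` and `needs_tpm_traps()` are hypothetical names, and TPMCR handling is left out entirely.

```c
#include <stdbool.h>

/* Hypothetical summary of host CPU features; not a KVM structure. */
struct host_pmu_caps {
	bool has_fgt;   /* FEAT_FGT: fine-grained traps are available */
	bool has_hpmn0; /* FEAT_HPMN0: MDCR_EL2.HPMN == 0 is permitted */
};

/*
 * Decide whether MDCR_EL2.TPM has to stay set, so that guest PMU accesses
 * trap and get emulated (ideally in the fast path), instead of relying on
 * fine-grained traps and an HPMN-based partition alone.
 */
static bool needs_tpm_traps(struct host_pmu_caps caps, unsigned int guest_counters)
{
	/* Without FEAT_FGT, only the coarse TPM trap is available. */
	if (!caps.has_fgt)
		return true;

	/* Exposing zero event counters needs HPMN == 0, which needs FEAT_HPMN0. */
	if (guest_counters == 0 && !caps.has_hpmn0)
		return true;

	return false;
}
```

The point of keeping the check this small is that the partitioned PMU stays usable on VHE + FEAT_PMUv3p1 machines, with the missing features costing extra traps rather than disabling the feature.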
Last thing, let's table guest support for FEAT_PMUv3_ICNTR for the time being. Yes, it falls in the KVM-owned range, but we can just handle it with a fine-grained undef for now. Once the core infrastructure has landed upstream we can start layering new features into the partitioned implementation.
Thanks, Oliver
Oliver Upton <oliver.upton@linux.dev> writes:
> On Mon, Jun 02, 2025 at 07:26:45PM +0000, Colton Lewis wrote:
>> Caveats:
>> Because the most consistent and performant thing to do was untrap PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect the guest.
>> These improvements come at the cost of 7-35 new registers that must be swapped at every vcpu_load and vcpu_put if the feature is enabled. I have been informed KVM would like to avoid paying this cost when possible.
>> One solution is to make the trapping changes and context swapping lazy, so that they only take place after the guest has actually accessed the PMU and guests that never access the PMU never pay the cost.
> You should try to model this on how we manage the debug breakpoints/watchpoints. In that case the debug register context is loaded if either:
> (1) Self-hosted debug is actively in use by the guest, or
> (2) The guest has accessed a debug register since the last vcpu_load()
Okay
>> This is not done here because it is not crucial to the primary functionality and I thought review would be more productive as soon as I had something complete enough for reviewers to easily play with.
>> However, this or any better ideas are on the table for inclusion in future re-rolls.
> One of the other things that I'd like to see is if we can pare down the number of CPU feature dependencies for a partitioned PMU. Annoyingly, there aren't a lot of machines out there with FEAT_FGT yet, and you should be able to make all of this work in VHE + FEAT_PMUv3p1.
> That "just" comes at the cost of extra traps (leaving TPM and potentially TPMCR set). You can mitigate the cost of this by emulating accesses in the fast path that don't need to go out to a kernel context to be serviced. Same goes for requiring FEAT_HPMN0 to expose 0 event counters: we can fall back to TPM traps if needed.
> Taking perf out of the picture should still give you a significant reduction in vPMU overheads.
Okay
> Last thing, let's table guest support for FEAT_PMUv3_ICNTR for the time being. Yes, it falls in the KVM-owned range, but we can just handle it with a fine-grained undef for now. Once the core infrastructure has landed upstream we can start layering new features into the partitioned implementation.
Sure
> Thanks, Oliver