With joint effort from the upstream KVM community, we come up with the
4th version of mediated vPMU for x86. We have made the following changes
on top of the previous RFC v3.
v3 -> v4
- Rebase whole patchset on 6.14-rc3 base.
- Address Peter's comments on Perf part.
- Address Sean's comments on KVM part.
* Change key word "passthrough" to "mediated" in all patches
* Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY.
* Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR
save/retore list support for GLOBAL_CTRL, thus the support of mediated
vPMU is constrained to SapphireRapids and later CPUs on Intel side.
* Merge some small changes into a single patch.
- Address Sandipan's comment on invalid pmu pointer.
- Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid to directly
manipulate pmc->eventsel and pmu->fixed_ctr_ctrl.
Testing (Intel side):
- Perf-based legacy vPMU (force emulation on/off)
* Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test pass.
* KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
* Basic perf counting/sampling tests in 3 scenarios, guest-only,
host-only and host-guest coexistence all pass.
- Mediated vPMU (force emulation on/off)
* Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test pass.
* KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
* Basic perf counting/sampling tests in 3 scenarios, guest-only,
host-only and host-guest coexistence all pass.
- Failures. All above tests passed on Intel Granite Rapids as well
except a failure on KUT/pmu_pebs.
* GP counter 0 (0xfffffffffffe): PEBS record (written seq 0)
is verified (including size, counters and cfg).
* The pebs_data_cfg (0xb500000000) doesn't match with the
effective MSR_PEBS_DATA_CFG (0x0).
* This failure has nothing to do with this mediated vPMU patch set. The
failure is caused by Granite Rapids supported timed PEBS which needs
extra support on Qemu and KUT/pmu_pebs. These extra support would be
sent in separate patches later.
Testing (AMD side):
- Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test all pass
- legacy guest with KUT/pmu:
* qmeu option: -cpu host, -perfctr-core
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perfmon-v1 guest with KUT/pmu:
* qmeu option: -cpu host, -perfmon-v2
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perfmon-v2 guest with KUT/pmu:
* qmeu option: -cpu host
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perf_fuzzer (perfmon-v2):
* fails with soft lockup in guest in current version.
* culprit could be between 6.13 ~ 6.14-rc3 within KVM
* Series tested on 6.12 and 6.13 without issue.
Note: a QEMU series is needed to run mediated vPMU v4:
- https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.c…
History:
- RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
- RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
- RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.int…
Dapeng Mi (18):
KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
KVM: x86/pmu: Check PMU cpuid configuration from user space
KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
KVM: VMX: Add macros to wrap around
{secondary,tertiary}_exec_controls_changebit()
KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
vm_exit/entry_ctrl
KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
KVM: x86/pmu: Setup PMU MSRs' interception mode
KVM: x86/pmu: Handle PMU MSRs interception and event filtering
KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
KVM: x86/pmu: Handle emulated instruction for mediated vPMU
KVM: nVMX: Add macros to simplify nested MSR interception setting
KVM: selftests: Add mediated vPMU supported for pmu tests
KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
Kan Liang (8):
perf: Support get/put mediated PMU interfaces
perf: Skip pmu_ctx based on event_type
perf: Clean up perf ctx time
perf: Add a EVENT_GUEST flag
perf: Add generic exclude_guest support
perf: Add switch_guest_ctx() interface
perf/x86: Support switch_guest_ctx interface
perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
Mingwei Zhang (5):
perf/x86: Forbid PMI handler when guest own PMU
perf/x86/core: Plumb mediated PMU capability from x86_pmu to
x86_pmu_cap
KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
KVM: nVMX: Add nested virtualization support for mediated PMU
Sandipan Das (4):
perf/x86/core: Do not set bit width for unavailable counters
KVM: x86/pmu: Add AMD PMU registers to direct access list
KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
write to event selectors
perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
Xiong Zhang (3):
x86/irq: Factor out common code for installing kvm irq handler
perf: core/x86: Register a new vector for KVM GUEST PMI
KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
arch/x86/events/amd/core.c | 2 +
arch/x86/events/core.c | 40 +-
arch/x86/events/intel/core.c | 5 +
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/include/asm/irq.h | 2 +-
arch/x86/include/asm/irq_vectors.h | 5 +-
arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 10 +
arch/x86/include/asm/msr-index.h | 18 +-
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kernel/idt.c | 1 +
arch/x86/kernel/irq.c | 39 +-
arch/x86/kvm/cpuid.c | 15 +
arch/x86/kvm/pmu.c | 254 ++++++++-
arch/x86/kvm/pmu.h | 45 ++
arch/x86/kvm/svm/pmu.c | 148 ++++-
arch/x86/kvm/svm/svm.c | 26 +
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 11 +-
arch/x86/kvm/vmx/nested.c | 68 ++-
arch/x86/kvm/vmx/pmu_intel.c | 224 ++++++--
arch/x86/kvm/vmx/vmx.c | 89 +--
arch/x86/kvm/vmx/vmx.h | 11 +-
arch/x86/kvm/x86.c | 63 ++-
arch/x86/kvm/x86.h | 2 +
include/linux/perf_event.h | 47 +-
kernel/events/core.c | 519 ++++++++++++++----
.../beauty/arch/x86/include/asm/irq_vectors.h | 5 +-
.../selftests/kvm/include/kvm_test_harness.h | 13 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/include/x86/processor.h | 8 +
tools/testing/selftests/kvm/lib/kvm_util.c | 23 +
.../selftests/kvm/x86/pmu_counters_test.c | 24 +-
.../selftests/kvm/x86/pmu_event_filter_test.c | 8 +-
.../selftests/kvm/x86/vmx_pmu_caps_test.c | 2 +-
37 files changed, 1480 insertions(+), 258 deletions(-)
base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
--
2.49.0.395.g12beb8f557-goog
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v11.
For more details of DualPI2, please refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
---
v11 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 prefix for enums (Jakub Kicinski <kuba(a)kernel.org>)
v10 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS w/ STEP_THRESH_PKTS and STEP_THRESH_US of net-next patch (Jakub Kicinski <kuba(a)kernel.org>)
v9 (13-Jun-2025)
- Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Add the num of enum used by DualPI2 in pkt_sched.h
v8 (09-May-2025)
- Update pkt_sched.h with the one in nex-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
v7 (05-May-2025)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-2025)
- Update JSON file output due to spec modification in tc.yaml of net-next
v5 (25-Mar-2025)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-2025)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.
v3 (21-Feb-2025)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-2024)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
---
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 68 +++++
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 ++++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 528 +++++++++++++++++++++++++++++++++
8 files changed, 888 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
From: Benjamin Berg <benjamin.berg(a)intel.com>
Hi,
This patchset adds signal handling to nolibc. Initially, I would like to
use this for tests. But in the long run, the goal is to use nolibc for
the UML kernel itself. In both cases, signal handling will be needed.
With v3 everything is now included in nolibc instead of trying to use
the messy kernel headers.
Benjamin
Benjamin Berg (4):
selftests/nolibc: fix EXPECT_NZ macro
selftests/nolibc: remove outdated comment about construct order
tools/nolibc: add more generic bitmask macros for FD_*
tools/nolibc: add signal support
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/arch-s390.h | 4 +-
tools/include/nolibc/asm-signal.h | 237 +++++++++++++++++++
tools/include/nolibc/signal.h | 179 ++++++++++++++
tools/include/nolibc/sys.h | 2 +-
tools/include/nolibc/sys/wait.h | 1 +
tools/include/nolibc/time.h | 2 +-
tools/include/nolibc/types.h | 81 ++++---
tools/testing/selftests/nolibc/nolibc-test.c | 139 ++++++++++-
9 files changed, 608 insertions(+), 38 deletions(-)
create mode 100644 tools/include/nolibc/asm-signal.h
--
2.50.1
With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, code refactoring patch to simplify the main change and the actual
transition to per-vma lock.
[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
Suren Baghdasaryan (3):
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
fs/proc/task_mmu: factor out proc_maps_private fields used by
PROCMAP_QUERY
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 15 +-
fs/proc/task_mmu.c | 149 ++++++++++++------
tools/testing/selftests/proc/proc-maps-race.c | 65 ++++++++
3 files changed, 174 insertions(+), 55 deletions(-)
base-commit: 01da54f10fddf3b01c5a3b80f6b16bbad390c302
--
2.50.1.565.gc32cd1483b-goog
The step_after_suspend_test verifies that the system successfully
suspended and resumed by setting a timerfd and checking whether the
timer fully expired. However, this method is unreliable due to timing
races.
In practice, the system may take time to enter suspend, during which the
timer may expire just before or during the transition. As a result,
the remaining time after resume may show non-zero nanoseconds, even if
suspend/resume completed successfully. This leads to false test failures.
Replace the timer-based check with a read from
/sys/power/suspend_stats/success. This counter is incremented only
after a full suspend/resume cycle, providing a reliable and race-free
indicator.
Also remove the unused file descriptor for /sys/power/state, which
remained after switching to a system() call to trigger suspend [1].
[1] https://lore.kernel.org/all/20240930224025.2858767-1-yifei.l.liu@oracle.com/
Fixes: c66be905cda2 ("selftests: breakpoints: use remaining time to check if suspend succeed")
Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com>
---
.../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++-----
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index 8d275f03e977..8d233ac95696 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -127,22 +127,42 @@ int run_test(int cpu)
return KSFT_PASS;
}
+/*
+ * Reads the suspend success count from sysfs.
+ * Returns the count on success or exits on failure.
+ */
+static int get_suspend_success_count_or_fail(void)
+{
+ FILE *fp;
+ int val;
+
+ fp = fopen("/sys/power/suspend_stats/success", "r");
+ if (!fp)
+ ksft_exit_fail_msg(
+ "Failed to open suspend_stats/success: %s\n",
+ strerror(errno));
+
+ if (fscanf(fp, "%d", &val) != 1) {
+ fclose(fp);
+ ksft_exit_fail_msg(
+ "Failed to read suspend success count\n");
+ }
+
+ fclose(fp);
+ return val;
+}
+
void suspend(void)
{
- int power_state_fd;
int timerfd;
int err;
+ int count_before;
+ int count_after;
struct itimerspec spec = {};
if (getuid() != 0)
ksft_exit_skip("Please run the test as root - Exiting.\n");
- power_state_fd = open("/sys/power/state", O_RDWR);
- if (power_state_fd < 0)
- ksft_exit_fail_msg(
- "open(\"/sys/power/state\") failed %s)\n",
- strerror(errno));
-
timerfd = timerfd_create(CLOCK_BOOTTIME_ALARM, 0);
if (timerfd < 0)
ksft_exit_fail_msg("timerfd_create() failed\n");
@@ -152,14 +172,15 @@ void suspend(void)
if (err < 0)
ksft_exit_fail_msg("timerfd_settime() failed\n");
+ count_before = get_suspend_success_count_or_fail();
+
system("(echo mem > /sys/power/state) 2> /dev/null");
- timerfd_gettime(timerfd, &spec);
- if (spec.it_value.tv_sec != 0 || spec.it_value.tv_nsec != 0)
+ count_after = get_suspend_success_count_or_fail();
+ if (count_after <= count_before)
ksft_exit_fail_msg("Failed to enter Suspend state\n");
close(timerfd);
- close(power_state_fd);
}
int main(int argc, char **argv)
--
2.43.0
This patch fixes unstable LACP negotiation when bonding is configured in
passive mode (`lacp_active=off`).
Previously, the actor would stop sending LACPDUs after initial negotiation
succeeded, leading to the partner timing out and restarting the negotiation
cycle. This resulted in continuous LACP state flapping.
The fix ensures the passive actor starts sending periodic LACPDUs after
receiving the first LACPDU from the partner, in accordance with IEEE
802.1AX-2020 section 6.4.1.
Out of topic:
Although this patch addresses a functional bug and could be considered for
`net`, I'm slightly concerned about potential regressions, as it changes
the current bonding LACP protocol behavior.
It might be safer to merge this through `net-next` first to allow broader
testing. Thoughts?
Hangbin Liu (2):
bonding: send LACPDUs periodically in passive mode after receiving
partner's LACPDU
selftests: bonding: add test for passive LACP mode
drivers/net/bonding/bond_3ad.c | 72 ++++++++++----
drivers/net/bonding/bond_options.c | 1 +
include/net/bond_3ad.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_passive_lacp.sh | 93 +++++++++++++++++++
5 files changed, 151 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh
--
2.46.0