From: Jesper Dangaard Brouer hawk@kernel.org
We frequently use Jesper's out-of-tree page_pool benchmark to evaluate page_pool changes.
Import the benchmark into the upstream Linux kernel tree so that (a) we are all running the same version, (b) we pave the way for shared improvements, and (c) we can maybe one day integrate it with nipa.
Import bench_page_pool_simple from commit 35b1716d0c30 ("Add page_bench06_walk_all"), from this repository: https://github.com/netoptimizer/prototype-kernel.git
Changes done during upstreaming:
- Fix checkpatch issues.
- Remove the tasklet logic, which is not needed.
- Move under tools/testing
- Create ksft for the benchmark.
- Changed slightly how the benchmark gets built. Out of tree, time_bench is built as an independent .ko. Here it is included in bench_page_pool.ko (see the kbuild lines below).
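For reference, these are the two kbuild lines in the new page_pool/Makefile (quoted from the patch below) that fold time_bench.o into the single bench_page_pool.ko module:

```
obj-m += bench_page_pool.o
bench_page_pool-y += bench_page_pool_simple.o time_bench.o
```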
Steps to run:
```
mkdir -p /tmp/run-pp-bench
make -C ./tools/testing/selftests/net/bench
make -C ./tools/testing/selftests/net/bench install INSTALL_PATH=/tmp/run-pp-bench
rsync --delete -avz --progress /tmp/run-pp-bench mina@$SERVER:~/
ssh mina@$SERVER << EOF
cd ~/run-pp-bench && sudo ./test_bench_page_pool.sh
EOF
```
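The module also takes run_flags and loops parameters (see bench_page_pool_simple.c below), so a single test can be selected, e.g. for perf-record analysis. A hedged example based on the parameter comment in the module source; the bit values come from the benchmark_bit enum there:

```
# run only the ptr_ring test (bit 2 of run_flags), with fewer iterations
sudo insmod bench_page_pool.ko run_flags=$((2#100)) loops=1000000
```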
Output:
```
(benchmark dmesg logs)
Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
ptr_ring results: no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
slow path results: no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
```
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Mina Almasry almasrymina@google.com
---
v2:
- Move under tools/selftests (Jakub)
- Create ksft for it.
- Remove the tasklet logic no longer needed (Jesper + Toke)
RFC discussion points:
- Desirable to import it?
- Can the benchmark be imported as-is for an initial version? Or needs lots of modifications?
- Code location. I retained the location in Jesper's tree, but a path like net/core/bench/ may make more sense.
---
 tools/testing/selftests/net/bench/Makefile          |   7 +
 .../selftests/net/bench/page_pool/Makefile          |  17 +
 .../bench/page_pool/bench_page_pool_simple.c        | 275 ++++++++++++
 .../bench/page_pool/test_bench_page_pool.sh         |  32 ++
 .../net/bench/page_pool/time_bench.c                | 406 ++++++++++++++++++
 .../net/bench/page_pool/time_bench.h                | 259 +++++
 6 files changed, 996 insertions(+)
 create mode 100644 tools/testing/selftests/net/bench/Makefile
 create mode 100644 tools/testing/selftests/net/bench/page_pool/Makefile
 create mode 100644 tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
 create mode 100755 tools/testing/selftests/net/bench/page_pool/test_bench_page_pool.sh
 create mode 100644 tools/testing/selftests/net/bench/page_pool/time_bench.c
 create mode 100644 tools/testing/selftests/net/bench/page_pool/time_bench.h
diff --git a/tools/testing/selftests/net/bench/Makefile b/tools/testing/selftests/net/bench/Makefile new file mode 100644 index 000000000000..4ebce5d71b18 --- /dev/null +++ b/tools/testing/selftests/net/bench/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 + +TEST_GEN_MODS_DIR := page_pool + +TEST_PROGS += page_pool/test_bench_page_pool.sh + +include ../../lib.mk diff --git a/tools/testing/selftests/net/bench/page_pool/Makefile b/tools/testing/selftests/net/bench/page_pool/Makefile new file mode 100644 index 000000000000..0549a16ba275 --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/Makefile @@ -0,0 +1,17 @@ +BENCH_PAGE_POOL_SIMPLE_TEST_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST))))) +KDIR ?= /lib/modules/$(shell uname -r)/build + +ifeq ($(V),1) +Q = +else +Q = @ +endif + +obj-m += bench_page_pool.o +bench_page_pool-y += bench_page_pool_simple.o time_bench.o + +all: + +$(Q)make -C $(KDIR) M=$(BENCH_PAGE_POOL_SIMPLE_TEST_DIR) modules + +clean: + +$(Q)make -C $(KDIR) M=$(BENCH_PAGE_POOL_SIMPLE_TEST_DIR) clean diff --git a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c new file mode 100644 index 000000000000..53d168cce27d --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c @@ -0,0 +1,275 @@ +/* + * Benchmark module for page_pool. + * + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/module.h> +#include <linux/mutex.h> + +#include <linux/version.h> +#include <net/page_pool/helpers.h> + +#include <linux/interrupt.h> +#include <linux/limits.h> + +#include "time_bench.h" + +static int verbose = 1; +#define MY_POOL_SIZE 1024 + +static inline void _page_pool_put_page(struct page_pool *pool, + struct page *page, bool allow_direct) +{ + page_pool_put_page(pool, page, -1, allow_direct); +} + +/* Makes tests selectable. Useful for perf-record to analyze a single test. 
+ * Hint: Bash shells support writing binary number like: $((2#101010) + * + * # modprobe bench_page_pool_simple run_flags=$((2#100)) + */ +static unsigned long run_flags = 0xFFFFFFFF; +module_param(run_flags, ulong, 0); +MODULE_PARM_DESC(run_flags, "Limit which bench test that runs"); +/* Count the bit number from the enum */ +enum benchmark_bit { + bit_run_bench_baseline, + bit_run_bench_no_softirq01, + bit_run_bench_no_softirq02, + bit_run_bench_no_softirq03, +}; +#define bit(b) (1 << (b)) +#define enabled(b) ((run_flags & (bit(b)))) + +/* notice time_bench is limited to U32_MAX nr loops */ +static unsigned long loops = 10000000; +module_param(loops, ulong, 0); +MODULE_PARM_DESC(loops, "Specify loops bench will run"); + +/* Timing at the nanosec level, we need to know the overhead + * introduced by the for loop itself */ +static int time_bench_for_loop(struct time_bench_record *rec, void *data) +{ + uint64_t loops_cnt = 0; + int i; + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + loops_cnt++; + barrier(); /* avoid compiler to optimize this loop */ + } + time_bench_stop(rec, loops_cnt); + return loops_cnt; +} + +static int time_bench_atomic_inc(struct time_bench_record *rec, void *data) +{ + uint64_t loops_cnt = 0; + atomic_t cnt; + int i; + + atomic_set(&cnt, 0); + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + atomic_inc(&cnt); + barrier(); /* avoid compiler to optimize this loop */ + } + loops_cnt = atomic_read(&cnt); + time_bench_stop(rec, loops_cnt); + return loops_cnt; +} + +/* The ptr_ping in page_pool uses a spinlock. We need to know the minimum + * overhead of taking+releasing a spinlock, to know the cycles that can be saved + * by e.g. amortizing this via bulking. 
+ */ +static int time_bench_lock(struct time_bench_record *rec, void *data) +{ + uint64_t loops_cnt = 0; + spinlock_t lock; + int i; + + spin_lock_init(&lock); + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + spin_lock(&lock); + loops_cnt++; + barrier(); /* avoid compiler to optimize this loop */ + spin_unlock(&lock); + } + time_bench_stop(rec, loops_cnt); + return loops_cnt; +} + +/* Helper for filling some page's into ptr_ring */ +static void pp_fill_ptr_ring(struct page_pool *pp, int elems) +{ + gfp_t gfp_mask = GFP_ATOMIC; /* GFP_ATOMIC needed when under run softirq */ + struct page **array; + int i; + + array = kzalloc(sizeof(struct page *) * elems, gfp_mask); + + for (i = 0; i < elems; i++) { + array[i] = page_pool_alloc_pages(pp, gfp_mask); + } + for (i = 0; i < elems; i++) { + _page_pool_put_page(pp, array[i], false); + } + + kfree(array); +} + +enum test_type { type_fast_path, type_ptr_ring, type_page_allocator }; + +/* Depends on compile optimizing this function */ +static __always_inline int time_bench_page_pool(struct time_bench_record *rec, + void *data, enum test_type type, + const char *func) +{ + uint64_t loops_cnt = 0; + gfp_t gfp_mask = GFP_ATOMIC; /* GFP_ATOMIC is not really needed */ + int i, err; + + struct page_pool *pp; + struct page *page; + + struct page_pool_params pp_params = { + .order = 0, + .flags = 0, + .pool_size = MY_POOL_SIZE, + .nid = NUMA_NO_NODE, + .dev = NULL, /* Only use for DMA mapping */ + .dma_dir = DMA_BIDIRECTIONAL, + }; + + pp = page_pool_create(&pp_params); + if (IS_ERR(pp)) { + err = PTR_ERR(pp); + pr_warn("%s: Error(%d) creating page_pool\n", func, err); + goto out; + } + pp_fill_ptr_ring(pp, 64); + + if (in_serving_softirq()) + pr_warn("%s(): in_serving_softirq fast-path\n", func); + else + pr_warn("%s(): Cannot use page_pool fast-path\n", func); + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + /* Common fast-path alloc, that depend on in_serving_softirq() */ + page = page_pool_alloc_pages(pp, gfp_mask); + if (!page) + break; + loops_cnt++; + barrier(); /* avoid compiler to optimize this loop */ + + /* The benchmarks purpose it to test different return paths. + * Compiler should inline optimize other function calls out + */ + if (type == type_fast_path) { + /* Fast-path recycling e.g. 
XDP_DROP use-case */ + page_pool_recycle_direct(pp, page); + + } else if (type == type_ptr_ring) { + /* Normal return path */ + _page_pool_put_page(pp, page, false); + + } else if (type == type_page_allocator) { + /* Test if not pages are recycled, but instead + * returned back into systems page allocator + */ + get_page(page); /* cause no-recycling */ + _page_pool_put_page(pp, page, false); + put_page(page); + } else { + BUILD_BUG(); + } + } + time_bench_stop(rec, loops_cnt); +out: + page_pool_destroy(pp); + return loops_cnt; +} + +static int time_bench_page_pool01_fast_path(struct time_bench_record *rec, + void *data) +{ + return time_bench_page_pool(rec, data, type_fast_path, __func__); +} + +static int time_bench_page_pool02_ptr_ring(struct time_bench_record *rec, + void *data) +{ + return time_bench_page_pool(rec, data, type_ptr_ring, __func__); +} + +static int time_bench_page_pool03_slow(struct time_bench_record *rec, + void *data) +{ + return time_bench_page_pool(rec, data, type_page_allocator, __func__); +} + +static int run_benchmark_tests(void) +{ + uint32_t nr_loops = loops; + int passed_count = 0; + + /* Baseline tests */ + if (enabled(bit_run_bench_baseline)) { + time_bench_loop(nr_loops * 10, 0, "for_loop", NULL, + time_bench_for_loop); + time_bench_loop(nr_loops * 10, 0, "atomic_inc", NULL, + time_bench_atomic_inc); + time_bench_loop(nr_loops, 0, "lock", NULL, time_bench_lock); + } + + /* This test cannot activate correct code path, due to no-softirq ctx */ + if (enabled(bit_run_bench_no_softirq01)) + time_bench_loop(nr_loops, 0, "no-softirq-page_pool01", NULL, + time_bench_page_pool01_fast_path); + if (enabled(bit_run_bench_no_softirq02)) + time_bench_loop(nr_loops, 0, "no-softirq-page_pool02", NULL, + time_bench_page_pool02_ptr_ring); + if (enabled(bit_run_bench_no_softirq03)) + time_bench_loop(nr_loops, 0, "no-softirq-page_pool03", NULL, + time_bench_page_pool03_slow); + + return passed_count; +} + +static int __init bench_page_pool_simple_module_init(void) +{ + if (verbose) + pr_info("Loaded\n"); + + if (loops > U32_MAX) { + pr_err("Module param loops(%lu) exceeded U32_MAX(%u)\n", loops, + U32_MAX); + return -ECHRNG; + } + + run_benchmark_tests(); + + return 0; +} +module_init(bench_page_pool_simple_module_init); + +static void __exit bench_page_pool_simple_module_exit(void) +{ + if (verbose) + pr_info("Unloaded\n"); +} +module_exit(bench_page_pool_simple_module_exit); + +MODULE_DESCRIPTION("Benchmark of page_pool simple cases"); +MODULE_AUTHOR("Jesper Dangaard Brouer netoptimizer@brouer.com"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/net/bench/page_pool/test_bench_page_pool.sh b/tools/testing/selftests/net/bench/page_pool/test_bench_page_pool.sh new file mode 100755 index 000000000000..5eb48f28b659 --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/test_bench_page_pool.sh @@ -0,0 +1,32 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# + +set -e + +DRIVER="./page_pool/bench_page_pool.ko" +result="" + +function run_test() +{ + rmmod "bench_page_pool.ko" || true + insmod $DRIVER > /dev/null 2>&1 + result=$(dmesg | tail -10) + echo "$result" + + echo + echo "Fast path results:" + echo ${result} | grep -o -E "no-softirq-page_pool01 Per elem: ([0-9]+) cycles(tsc) ([0-9]+.[0-9]+) ns" + + echo + echo "ptr_ring results:" + echo ${result} | grep -o -E "no-softirq-page_pool02 Per elem: ([0-9]+) cycles(tsc) ([0-9]+.[0-9]+) ns" + + echo + echo "slow path results:" + echo ${result} | grep -o -E "no-softirq-page_pool03 Per elem: ([0-9]+) 
cycles(tsc) ([0-9]+.[0-9]+) ns" +} + +run_test + +exit 0 diff --git a/tools/testing/selftests/net/bench/page_pool/time_bench.c b/tools/testing/selftests/net/bench/page_pool/time_bench.c new file mode 100644 index 000000000000..257b1515c64e --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/time_bench.c @@ -0,0 +1,406 @@ +/* + * Benchmarking code execution time inside the kernel + * + * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer + * for licensing details see kernel-base/COPYING + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/module.h> +#include <linux/time.h> + +#include <linux/perf_event.h> /* perf_event_create_kernel_counter() */ + +/* For concurrency testing */ +#include <linux/completion.h> +#include <linux/sched.h> +#include <linux/workqueue.h> +#include <linux/kthread.h> + +#include "time_bench.h" + +static int verbose = 1; + +/** TSC (Time-Stamp Counter) based ** + * See: linux/time_bench.h + * tsc_start_clock() and tsc_stop_clock() + */ + +/** Wall-clock based ** + */ + +/** PMU (Performance Monitor Unit) based ** + */ +#define PERF_FORMAT \ + (PERF_FORMAT_GROUP | PERF_FORMAT_ID | PERF_FORMAT_TOTAL_TIME_ENABLED | \ + PERF_FORMAT_TOTAL_TIME_RUNNING) + +struct raw_perf_event { + uint64_t config; /* event */ + uint64_t config1; /* umask */ + struct perf_event *save; + char *desc; +}; + +/* if HT is enable a maximum of 4 events (5 if one is instructions + * retired can be specified, if HT is disabled a maximum of 8 (9 if + * one is instructions retired) can be specified. + * + * From Table 19-1. Architectural Performance Events + * Architectures Software Developer’s Manual Volume 3: System Programming Guide + */ +struct raw_perf_event perf_events[] = { + { 0x3c, 0x00, NULL, "Unhalted CPU Cycles" }, + { 0xc0, 0x00, NULL, "Instruction Retired" } +}; + +#define NUM_EVTS (sizeof(perf_events) / sizeof(struct raw_perf_event)) + +/* WARNING: PMU config is currently broken! + */ +bool time_bench_PMU_config(bool enable) +{ + int i; + struct perf_event_attr perf_conf; + struct perf_event *perf_event; + int cpu; + + preempt_disable(); + cpu = smp_processor_id(); + pr_info("DEBUG: cpu:%d\n", cpu); + preempt_enable(); + + memset(&perf_conf, 0, sizeof(struct perf_event_attr)); + perf_conf.type = PERF_TYPE_RAW; + perf_conf.size = sizeof(struct perf_event_attr); + perf_conf.read_format = PERF_FORMAT; + perf_conf.pinned = 1; + perf_conf.exclude_user = 1; /* No userspace events */ + perf_conf.exclude_kernel = 0; /* Only kernel events */ + + for (i = 0; i < NUM_EVTS; i++) { + perf_conf.disabled = enable; + //perf_conf.disabled = (i == 0) ? 
1 : 0; + perf_conf.config = perf_events[i].config; + perf_conf.config1 = perf_events[i].config1; + if (verbose) + pr_info("%s() enable PMU counter: %s\n", + __func__, perf_events[i].desc); + perf_event = perf_event_create_kernel_counter(&perf_conf, cpu, + NULL /* task */, + NULL /* overflow_handler*/, + NULL /* context */); + if (perf_event) { + perf_events[i].save = perf_event; + pr_info("%s():DEBUG perf_event success\n", __func__); + + perf_event_enable(perf_event); + } else { + pr_info("%s():DEBUG perf_event is NULL\n", __func__); + } + } + + return true; +} + +/** Generic functions ** + */ + +/* Calculate stats, store results in record */ +bool time_bench_calc_stats(struct time_bench_record *rec) +{ +#define NANOSEC_PER_SEC 1000000000 /* 10^9 */ + uint64_t ns_per_call_tmp_rem = 0; + uint32_t ns_per_call_remainder = 0; + uint64_t pmc_ipc_tmp_rem = 0; + uint32_t pmc_ipc_remainder = 0; + uint32_t pmc_ipc_div = 0; + uint32_t invoked_cnt_precision = 0; + uint32_t invoked_cnt = 0; /* 32-bit due to div_u64_rem() */ + + if (rec->flags & TIME_BENCH_LOOP) { + if (rec->invoked_cnt < 1000) { + pr_err("ERR: need more(>1000) loops(%llu) for timing\n", + rec->invoked_cnt); + return false; + } + if (rec->invoked_cnt > ((1ULL << 32) - 1)) { + /* div_u64_rem() can only support div with 32bit*/ + pr_err("ERR: Invoke cnt(%llu) too big overflow 32bit\n", + rec->invoked_cnt); + return false; + } + invoked_cnt = (uint32_t)rec->invoked_cnt; + } + + /* TSC (Time-Stamp Counter) records */ + if (rec->flags & TIME_BENCH_TSC) { + rec->tsc_interval = rec->tsc_stop - rec->tsc_start; + if (rec->tsc_interval == 0) { + pr_err("ABORT: timing took ZERO TSC time\n"); + return false; + } + /* Calculate stats */ + if (rec->flags & TIME_BENCH_LOOP) + rec->tsc_cycles = rec->tsc_interval / invoked_cnt; + else + rec->tsc_cycles = rec->tsc_interval; + } + + /* Wall-clock time calc */ + if (rec->flags & TIME_BENCH_WALLCLOCK) { + rec->time_start = rec->ts_start.tv_nsec + + (NANOSEC_PER_SEC * rec->ts_start.tv_sec); + rec->time_stop = rec->ts_stop.tv_nsec + + (NANOSEC_PER_SEC * rec->ts_stop.tv_sec); + rec->time_interval = rec->time_stop - rec->time_start; + if (rec->time_interval == 0) { + pr_err("ABORT: timing took ZERO wallclock time\n"); + return false; + } + /* Calculate stats */ + /*** Division in kernel it tricky ***/ + /* Orig: time_sec = (time_interval / NANOSEC_PER_SEC); */ + /* remainder only correct because NANOSEC_PER_SEC is 10^9 */ + rec->time_sec = div_u64_rem(rec->time_interval, NANOSEC_PER_SEC, + &rec->time_sec_remainder); + //TODO: use existing struct timespec records instead of div? + + if (rec->flags & TIME_BENCH_LOOP) { + /*** Division in kernel it tricky ***/ + /* Orig: ns = ((double)time_interval / invoked_cnt); */ + /* First get quotient */ + rec->ns_per_call_quotient = + div_u64_rem(rec->time_interval, invoked_cnt, + &ns_per_call_remainder); + /* Now get decimals .xxx precision (incorrect roundup)*/ + ns_per_call_tmp_rem = ns_per_call_remainder; + invoked_cnt_precision = invoked_cnt / 1000; + if (invoked_cnt_precision > 0) { + rec->ns_per_call_decimal = + div_u64_rem(ns_per_call_tmp_rem, + invoked_cnt_precision, + &ns_per_call_remainder); + } + } + } + + /* Performance Monitor Unit (PMU) counters */ + if (rec->flags & TIME_BENCH_PMU) { + //FIXME: Overflow handling??? 
+ rec->pmc_inst = rec->pmc_inst_stop - rec->pmc_inst_start; + rec->pmc_clk = rec->pmc_clk_stop - rec->pmc_clk_start; + + /* Calc Instruction Per Cycle (IPC) */ + /* First get quotient */ + rec->pmc_ipc_quotient = div_u64_rem(rec->pmc_inst, rec->pmc_clk, + &pmc_ipc_remainder); + /* Now get decimals .xxx precision (incorrect roundup)*/ + pmc_ipc_tmp_rem = pmc_ipc_remainder; + pmc_ipc_div = rec->pmc_clk / 1000; + if (pmc_ipc_div > 0) { + rec->pmc_ipc_decimal = div_u64_rem(pmc_ipc_tmp_rem, + pmc_ipc_div, + &pmc_ipc_remainder); + } + } + + return true; +} + +/* Generic function for invoking a loop function and calculating + * execution time stats. The function being called/timed is assumed + * to perform a tight loop, and update the timing record struct. + */ +bool time_bench_loop(uint32_t loops, int step, char *txt, void *data, + int (*func)(struct time_bench_record *record, void *data)) +{ + struct time_bench_record rec; + + /* Setup record */ + memset(&rec, 0, sizeof(rec)); /* zero func might not update all */ + rec.version_abi = 1; + rec.loops = loops; + rec.step = step; + rec.flags = (TIME_BENCH_LOOP|TIME_BENCH_TSC|TIME_BENCH_WALLCLOCK); +// rec.flags = (TIME_BENCH_LOOP|TIME_BENCH_TSC| +// TIME_BENCH_WALLCLOCK|TIME_BENCH_PMU); + //TODO: Add/copy txt to rec + + /*** Loop function being timed ***/ + if (!func(&rec, data)) { + pr_err("ABORT: function being timed failed\n"); + return false; + } + + if (rec.invoked_cnt < loops) + pr_warn("WARNING: Invoke count(%llu) smaller than loops(%d)\n", + rec.invoked_cnt, loops); + + /* Calculate stats */ + time_bench_calc_stats(&rec); + + pr_info("Type:%s Per elem: %llu cycles(tsc) %llu.%03llu ns (step:%d)" + " - (measurement period time:%llu.%09u sec time_interval:%llu)" + " - (invoke count:%llu tsc_interval:%llu)\n", + txt, rec.tsc_cycles, + rec.ns_per_call_quotient, rec.ns_per_call_decimal, rec.step, + rec.time_sec, rec.time_sec_remainder, rec.time_interval, + rec.invoked_cnt, rec.tsc_interval); +/* pr_info("DEBUG check is %llu/%llu == %llu.%03llu ?\n", + rec.time_interval, rec.invoked_cnt, + rec.ns_per_call_quotient, rec.ns_per_call_decimal); +*/ + if (rec.flags & TIME_BENCH_PMU) { + pr_info("Type:%s PMU inst/clock" + "%llu/%llu = %llu.%03llu IPC (inst per cycle)\n", + txt, rec.pmc_inst, rec.pmc_clk, + rec.pmc_ipc_quotient, rec.pmc_ipc_decimal); + } + return true; +} + +/* Function getting invoked by kthread */ +static int invoke_test_on_cpu_func(void *private) +{ + struct time_bench_cpu *cpu = private; + struct time_bench_sync *sync = cpu->sync; + cpumask_t newmask = CPU_MASK_NONE; + void *data = cpu->data; + + /* Restrict CPU */ + cpumask_set_cpu(cpu->rec.cpu, &newmask); + set_cpus_allowed_ptr(current, &newmask); + + /* Synchronize start of concurrency test */ + atomic_inc(&sync->nr_tests_running); + wait_for_completion(&sync->start_event); + + /* Start benchmark function */ + if (!cpu->bench_func(&cpu->rec, data)) { + pr_err("ERROR: function being timed failed on CPU:%d(%d)\n", + cpu->rec.cpu, smp_processor_id()); + } else { + if (verbose) + pr_info("SUCCESS: ran on CPU:%d(%d)\n", cpu->rec.cpu, + smp_processor_id()); + } + cpu->did_bench_run = true; + + /* End test */ + atomic_dec(&sync->nr_tests_running); + /* Wait for kthread_stop() telling us to stop */ + while (!kthread_should_stop()) { + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +void time_bench_print_stats_cpumask(const char *desc, + struct time_bench_cpu *cpu_tasks, + const struct cpumask *mask) +{ + uint64_t average = 0; 
+ int cpu; + int step = 0; + struct sum { + uint64_t tsc_cycles; + int records; + } sum = { 0 }; + + /* Get stats */ + for_each_cpu(cpu, mask) { + struct time_bench_cpu *c = &cpu_tasks[cpu]; + struct time_bench_record *rec = &c->rec; + + /* Calculate stats */ + time_bench_calc_stats(rec); + + pr_info("Type:%s CPU(%d) %llu cycles(tsc) %llu.%03llu ns" + " (step:%d)" + " - (measurement period time:%llu.%09u sec time_interval:%llu)" + " - (invoke count:%llu tsc_interval:%llu)\n", + desc, cpu, rec->tsc_cycles, rec->ns_per_call_quotient, + rec->ns_per_call_decimal, rec->step, rec->time_sec, + rec->time_sec_remainder, rec->time_interval, + rec->invoked_cnt, rec->tsc_interval); + + /* Collect average */ + sum.records++; + sum.tsc_cycles += rec->tsc_cycles; + step = rec->step; + } + + if (sum.records) /* avoid div-by-zero */ + average = sum.tsc_cycles / sum.records; + pr_info("Sum Type:%s Average: %llu cycles(tsc) CPUs:%d step:%d\n", desc, + average, sum.records, step); +} + +void time_bench_run_concurrent( + uint32_t loops, int step, void *data, + const struct cpumask *mask, /* Support masking outsome CPUs*/ + struct time_bench_sync *sync, struct time_bench_cpu *cpu_tasks, + int (*func)(struct time_bench_record *record, void *data)) +{ + int cpu, running = 0; + + if (verbose) // DEBUG + pr_warn("%s() Started on CPU:%d\n", __func__, + smp_processor_id()); + + /* Reset sync conditions */ + atomic_set(&sync->nr_tests_running, 0); + init_completion(&sync->start_event); + + /* Spawn off jobs on all CPUs */ + for_each_cpu(cpu, mask) { + struct time_bench_cpu *c = &cpu_tasks[cpu]; + + running++; + c->sync = sync; /* Send sync variable along */ + c->data = data; /* Send opaque along */ + + /* Init benchmark record */ + memset(&c->rec, 0, sizeof(struct time_bench_record)); + c->rec.version_abi = 1; + c->rec.loops = loops; + c->rec.step = step; + c->rec.flags = (TIME_BENCH_LOOP|TIME_BENCH_TSC| + TIME_BENCH_WALLCLOCK); + c->rec.cpu = cpu; + c->bench_func = func; + c->task = kthread_run(invoke_test_on_cpu_func, c, + "time_bench%d", cpu); + if (IS_ERR(c->task)) { + pr_err("%s(): Failed to start test func\n", __func__); + return; /* Argh, what about cleanup?! 
*/ + } + } + + /* Wait until all processes are running */ + while (atomic_read(&sync->nr_tests_running) < running) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(10); + } + /* Kick off all CPU concurrently on completion event */ + complete_all(&sync->start_event); + + /* Wait for CPUs to finish */ + while (atomic_read(&sync->nr_tests_running)) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(10); + } + + /* Stop the kthreads */ + for_each_cpu(cpu, mask) { + struct time_bench_cpu *c = &cpu_tasks[cpu]; + kthread_stop(c->task); + } + + if (verbose) // DEBUG - happens often, finish on another CPU + pr_warn("%s() Finished on CPU:%d\n", __func__, + smp_processor_id()); +} diff --git a/tools/testing/selftests/net/bench/page_pool/time_bench.h b/tools/testing/selftests/net/bench/page_pool/time_bench.h new file mode 100644 index 000000000000..7331b5789490 --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/time_bench.h @@ -0,0 +1,259 @@ +/* + * Benchmarking code execution time inside the kernel + * + * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer + * for licensing details see kernel-base/COPYING + */ +#ifndef _LINUX_TIME_BENCH_H +#define _LINUX_TIME_BENCH_H + +/* Main structure used for recording a benchmark run */ +struct time_bench_record { + uint32_t version_abi; + uint32_t loops; /* Requested loop invocations */ + uint32_t step; /* option for e.g. bulk invocations */ + + uint32_t flags; /* Measurements types enabled */ +#define TIME_BENCH_LOOP (1<<0) +#define TIME_BENCH_TSC (1<<1) +#define TIME_BENCH_WALLCLOCK (1<<2) +#define TIME_BENCH_PMU (1<<3) + + uint32_t cpu; /* Used when embedded in time_bench_cpu */ + + /* Records */ + uint64_t invoked_cnt; /* Returned actual invocations */ + uint64_t tsc_start; + uint64_t tsc_stop; + struct timespec64 ts_start; + struct timespec64 ts_stop; + /** PMU counters for instruction and cycles + * instructions counter including pipelined instructions */ + uint64_t pmc_inst_start; + uint64_t pmc_inst_stop; + /* CPU unhalted clock counter */ + uint64_t pmc_clk_start; + uint64_t pmc_clk_stop; + + /* Result records */ + uint64_t tsc_interval; + uint64_t time_start, time_stop, time_interval; /* in nanosec */ + uint64_t pmc_inst, pmc_clk; + + /* Derived result records */ + uint64_t tsc_cycles; // +decimal? + uint64_t ns_per_call_quotient, ns_per_call_decimal; + uint64_t time_sec; + uint32_t time_sec_remainder; + uint64_t pmc_ipc_quotient, pmc_ipc_decimal; /* inst per cycle */ +}; + +/* For synchronizing parallel CPUs to run concurrently */ +struct time_bench_sync { + atomic_t nr_tests_running; + struct completion start_event; +}; + +/* Keep track of CPUs executing our bench function. + * + * Embed a time_bench_record for storing info per cpu + */ +struct time_bench_cpu { + struct time_bench_record rec; + struct time_bench_sync *sync; /* back ptr */ + struct task_struct *task; + /* "data" opaque could have been placed in time_bench_sync, + * but to avoid any false sharing, place it per CPU + */ + void *data; + /* Support masking outsome CPUs, mark if it ran */ + bool did_bench_run; + /* int cpu; // note CPU stored in time_bench_record */ + int (*bench_func)(struct time_bench_record *record, void *data); +}; + +/* + * Below TSC assembler code is not compatible with other archs, and + * can also fail on guests if cpu-flags are not correct. + * + * The way TSC reading is used, many iterations, does not require as + * high accuracy as described below (in Intel Doc #324264). 
+ * + * Considering changing to use get_cycles() (#include <asm/timex.h>). + */ + +/** TSC (Time-Stamp Counter) based ** + * Recommend reading, to understand details of reading TSC accurately: + * Intel Doc #324264, "How to Benchmark Code Execution Times on Intel" + * + * Consider getting exclusive ownership of CPU by using: + * unsigned long flags; + * preempt_disable(); + * raw_local_irq_save(flags); + * _your_code_ + * raw_local_irq_restore(flags); + * preempt_enable(); + * + * Clobbered registers: "%rax", "%rbx", "%rcx", "%rdx" + * RDTSC only change "%rax" and "%rdx" but + * CPUID clears the high 32-bits of all (rax/rbx/rcx/rdx) + */ +static __always_inline uint64_t tsc_start_clock(void) +{ + /* See: Intel Doc #324264 */ + unsigned hi, lo; + asm volatile("CPUID\n\t" + "RDTSC\n\t" + "mov %%edx, %0\n\t" + "mov %%eax, %1\n\t" + : "=r"(hi), "=r"(lo)::"%rax", "%rbx", "%rcx", "%rdx"); + //FIXME: on 32bit use clobbered %eax + %edx + return ((uint64_t)lo) | (((uint64_t)hi) << 32); +} + +static __always_inline uint64_t tsc_stop_clock(void) +{ + /* See: Intel Doc #324264 */ + unsigned hi, lo; + asm volatile("RDTSCP\n\t" + "mov %%edx, %0\n\t" + "mov %%eax, %1\n\t" + "CPUID\n\t" + : "=r"(hi), "=r"(lo)::"%rax", "%rbx", "%rcx", "%rdx"); + return ((uint64_t)lo) | (((uint64_t)hi) << 32); +} + +/* Notes for RDTSC and RDTSCP + * + * Hannes found out that __builtin_ia32_rdtsc and + * __builtin_ia32_rdtscp are undocumented available in gcc, so there + * is no need to write inline assembler functions for them any more. + * + * unsigned long long __builtin_ia32_rdtscp(unsigned int *foo); + * (where foo is set to: numa_node << 12 | cpu) + * and + * unsigned long long __builtin_ia32_rdtsc(void); + * + * Above we combine the calls with CPUID, thus I don't see how this is + * directly appreciable. + */ + +/* +inline uint64_t rdtsc(void) +{ + uint32_t low, high; + asm volatile("rdtsc" : "=a" (low), "=d" (high)); + return low | (((uint64_t )high ) << 32); +} +*/ + +/** Wall-clock based ** + * + * use: getnstimeofday() + * getnstimeofday(&rec->ts_start); + * getnstimeofday(&rec->ts_stop); + * + * API changed see: Documentation/core-api/timekeeping.rst + * https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#c.getnstime... + * + * We should instead use: ktime_get_real_ts64() is a direct + * replacement, but consider using monotonic time (ktime_get_ts64()) + * and/or a ktime_t based interface (ktime_get()/ktime_get_real()). + */ + +/** PMU (Performance Monitor Unit) based ** + * + * Needed for calculating: Instructions Per Cycle (IPC) + * - The IPC number tell how efficient the CPU pipelining were + */ +//lookup: perf_event_create_kernel_counter() + +bool time_bench_PMU_config(bool enable); + +/* Raw reading via rdpmc() using fixed counters + * + * From: https://github.com/andikleen/simple-pmu + */ +enum { + FIXED_SELECT = (1U << 30), /* == 0x40000000 */ + FIXED_INST_RETIRED_ANY = 0, + FIXED_CPU_CLK_UNHALTED_CORE = 1, + FIXED_CPU_CLK_UNHALTED_REF = 2, +}; + +static __always_inline unsigned long long p_rdpmc(unsigned in) +{ + unsigned d, a; + + asm volatile("rdpmc" : "=d"(d), "=a"(a) : "c"(in) : "memory"); + return ((unsigned long long)d << 32) | a; +} + +/* These PMU counter needs to be enabled, but I don't have the + * configure code implemented. 
My current hack is running: + * sudo perf stat -e cycles:k -e instructions:k insmod lib/ring_queue_test.ko + */ +/* Reading all pipelined instruction */ +static __always_inline unsigned long long pmc_inst(void) +{ + return p_rdpmc(FIXED_SELECT | FIXED_INST_RETIRED_ANY); +} + +/* Reading CPU clock cycles */ +static __always_inline unsigned long long pmc_clk(void) +{ + return p_rdpmc(FIXED_SELECT | FIXED_CPU_CLK_UNHALTED_CORE); +} + +/* Raw reading via MSR rdmsr() is likely wrong + * FIXME: How can I know which raw MSR registers are conf for what? + */ +#define MSR_IA32_PCM0 0x400000C1 /* PERFCTR0 */ +#define MSR_IA32_PCM1 0x400000C2 /* PERFCTR1 */ +#define MSR_IA32_PCM2 0x400000C3 +static inline uint64_t msr_inst(unsigned long long *msr_result) +{ + return rdmsrl_safe(MSR_IA32_PCM0, msr_result); +} + +/** Generic functions ** + */ +bool time_bench_loop(uint32_t loops, int step, char *txt, void *data, + int (*func)(struct time_bench_record *rec, void *data)); +bool time_bench_calc_stats(struct time_bench_record *rec); + +void time_bench_run_concurrent( + uint32_t loops, int step, void *data, + const struct cpumask *mask, /* Support masking outsome CPUs*/ + struct time_bench_sync *sync, struct time_bench_cpu *cpu_tasks, + int (*func)(struct time_bench_record *record, void *data)); +void time_bench_print_stats_cpumask(const char *desc, + struct time_bench_cpu *cpu_tasks, + const struct cpumask *mask); + +//FIXME: use rec->flags to select measurement, should be MACRO +static __always_inline void time_bench_start(struct time_bench_record *rec) +{ + //getnstimeofday(&rec->ts_start); + ktime_get_real_ts64(&rec->ts_start); + if (rec->flags & TIME_BENCH_PMU) { + rec->pmc_inst_start = pmc_inst(); + rec->pmc_clk_start = pmc_clk(); + } + rec->tsc_start = tsc_start_clock(); +} + +static __always_inline void time_bench_stop(struct time_bench_record *rec, + uint64_t invoked_cnt) +{ + rec->tsc_stop = tsc_stop_clock(); + if (rec->flags & TIME_BENCH_PMU) { + rec->pmc_inst_stop = pmc_inst(); + rec->pmc_clk_stop = pmc_clk(); + } + //getnstimeofday(&rec->ts_stop); + ktime_get_real_ts64(&rec->ts_stop); + rec->invoked_cnt = invoked_cnt; +} + +#endif /* _LINUX_TIME_BENCH_H */
base-commit: ea15e046263b19e91ffd827645ae5dfa44ebd044
Mina Almasry almasrymina@google.com writes:
From: Jesper Dangaard Brouer hawk@kernel.org
We frequently use Jesper's out-of-tree page_pool benchmark to evaluate page_pool changes.
Import the benchmark into the upstream Linux kernel tree so that (a) we are all running the same version, (b) we pave the way for shared improvements, and (c) we can maybe one day integrate it with nipa.
Import bench_page_pool_simple from commit 35b1716d0c30 ("Add page_bench06_walk_all"), from this repository: https://github.com/netoptimizer/prototype-kernel.git
Changes done during upstreaming:
- Fix checkpatch issues.
- Remove the tasklet logic, which is not needed.
- Move under tools/testing
- Create ksft for the benchmark.
- Changed slightly how the benchmark gets built. Out of tree, time_bench is built as an independent .ko. Here it is included in bench_page_pool.ko
Steps to run:
mkdir -p /tmp/run-pp-bench
make -C ./tools/testing/selftests/net/bench
make -C ./tools/testing/selftests/net/bench install INSTALL_PATH=/tmp/run-pp-bench
rsync --delete -avz --progress /tmp/run-pp-bench mina@$SERVER:~/
ssh mina@$SERVER << EOF
cd ~/run-pp-bench && sudo ./test_bench_page_pool.sh
EOF
Output:
(benchmark dmesg logs)
Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
ptr_ring results: no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
slow path results: no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Mina Almasry almasrymina@google.com
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
-Toke
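For readers unfamiliar with the kfunc approach sketched above: exposing a benchmark helper to BPF would look roughly like the snippet below. This is only an illustrative sketch, not part of the patch; bench_pp_fast_path() is a hypothetical helper, and the exact registration macros (BTF_KFUNCS_START/BTF_KFUNCS_END here, the older BTF_SET8_* variants on earlier kernels) depend on the kernel version.

```c
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/module.h>

/* hypothetical helper: would run the tight alloc/recycle loop and
 * return the elapsed cycle count to the calling BPF program
 */
__bpf_kfunc u64 bench_pp_fast_path(u32 loops)
{
	return 0;
}

BTF_KFUNCS_START(bench_pp_kfunc_ids)
BTF_ID_FLAGS(func, bench_pp_fast_path)
BTF_KFUNCS_END(bench_pp_kfunc_ids)

static const struct btf_kfunc_id_set bench_pp_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &bench_pp_kfunc_ids,
};

static int __init bench_pp_kfunc_init(void)
{
	/* a BPF_PROG_TYPE_SYSCALL test-runner program could then call it */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
					 &bench_pp_kfunc_set);
}
module_init(bench_pp_kfunc_init);

MODULE_LICENSE("GPL");
```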
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
ptr_ring results: no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
slow path results: no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge. I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Mina Almasry almasrymina@google.com writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
ptr_ring results: no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
slow path results: no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge. I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
-Toke
On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
Mina Almasry almasrymina@google.com writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge.
agreed
I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
tldr; I'd advise to merge it as-is, then kfunc'ify parts of it and use it from a 'perf bench' suite.
Yeah, the model would be what I did for uprobes, but even then there is a selftests based uprobes benchmark ;-)
The 'perf bench' part, that calls into the skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
While this one is just to generate BPF load to measure the impact on uprobes, for your case it would involve using a ring buffer to communicate from the skel (BPF/kernel side) to the userspace part, similar to what is done in various other BPF based perf tooling available in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
Like at this line (BPF skel part):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tre...
The simplest part is in the canonical, standalone runqslower tool, also hosted in the kernel sources:
BPF skel sending stuff to userspace:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The userspace part that reads it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
This is a callback that gets called for every event that the BPF skel produces, called from this loop:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
That handle_event callback was associated via:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
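The pattern Arnaldo walks through above looks roughly like the two sketches below. They are illustrative only: bench_event, the "events" map, run_bench() and handle_event() are made-up names, not taken from runqslower or from this patch. First, the BPF-skel side reserving and submitting a record to a ring buffer:

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bench_event {
	__u64 tsc_cycles;
	__u64 loops;
};

/* ring buffer map shared with userspace */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("syscall")
int run_bench(void *ctx)
{
	struct bench_event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	/* a kfunc exported by the benchmark module would be called here */
	e->tsc_cycles = 0;
	e->loops = 0;

	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

And the userspace side that consumes those records via the libbpf ring buffer API:

```c
#include <bpf/libbpf.h>
#include <stdio.h>

struct bench_event {
	unsigned long long tsc_cycles;
	unsigned long long loops;
};

/* invoked for every record the BPF side submits to the ring buffer */
static int handle_event(void *ctx, void *data, size_t len)
{
	const struct bench_event *e = data;

	printf("per elem: %llu cycles over %llu loops\n",
	       e->tsc_cycles, e->loops);
	return 0;
}

static int consume_events(struct bpf_object *obj)
{
	struct ring_buffer *rb;
	int map_fd, err;

	map_fd = bpf_object__find_map_fd_by_name(obj, "events");
	if (map_fd < 0)
		return map_fd;

	rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
	if (!rb)
		return -1;

	/* poll until interrupted or an error; handle_event() runs per record */
	while ((err = ring_buffer__poll(rb, 100 /* ms */)) >= 0)
		;

	ring_buffer__free(rb);
	return err;
}
```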
There is a dissection I did about this process a long time ago, but still relevant, I think:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
The part explaining the interaction userspace/kernel starts here:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
(yeah, its http, but then, its _old_vger ;-)
Doing it in perf is interesting because it gets widely packaged, so whatever you add to it gets visibility for people using 'perf bench' and also gets available in most places, it would add to this collection:
root@number:~# perf bench
Usage: perf bench [<common options>] <collection> <benchmark> [<options>]
# List of all available benchmark collections:
        sched: Scheduler and IPC benchmarks
      syscall: System call benchmarks
          mem: Memory access benchmarks
         numa: NUMA scheduling and MM benchmarks
        futex: Futex stressing benchmarks
        epoll: Epoll stressing benchmarks
    internals: Perf-internals benchmarks
   breakpoint: Breakpoint benchmarks
       uprobe: uprobe benchmarks
          all: All benchmarks
root@number:~#
the 'perf bench' that uses BPF skel:
root@number:~# perf bench uprobe baseline
# Running 'uprobe/baseline' benchmark:
# Executed 1,000 usleep(1000) calls
  Total time: 1,050,383 usecs

  1,050.383 usecs/op
root@number:~# perf trace --summary perf bench uprobe trace_printk
# Running 'uprobe/trace_printk' benchmark:
# Executed 1,000 usleep(1000) calls
  Total time: 1,053,082 usecs

  1,053.082 usecs/op
Summary of events:
uprobe-trace_pr (1247691), 3316 events, 96.9%
   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)      (%)
   --------------- --------  ------ -------- --------- --------- --------- ------
   clock_nanosleep     1000       0 1101.236     1.007     1.101    50.939   4.53%
   close                 98       0   32.979     0.001     0.337    32.821  99.52%
   perf_event_open        1       0   18.691    18.691    18.691    18.691   0.00%
   mmap                 209       0    0.567     0.001     0.003     0.007   2.59%
   bpf                   38       2    0.380     0.000     0.010     0.092  28.38%
   openat                65       0    0.171     0.001     0.003     0.012   7.14%
   mprotect              56       0    0.141     0.001     0.003     0.008   6.86%
   read                  68       0    0.082     0.001     0.001     0.010  11.60%
   fstat                 65       0    0.056     0.001     0.001     0.003   5.40%
   brk                   10       0    0.050     0.001     0.005     0.012  24.29%
   pread64                8       0    0.042     0.001     0.005     0.021  49.29%
   <SNIP other syscalls>
root@number:~#
- Arnaldo
Arnaldo Carvalho de Melo acme@kernel.org writes:
On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
Mina Almasry almasrymina@google.com writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge.
agreed
I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
tldr; I'd advise to merge it as-is, then kfunc'ify parts of it and use it from a 'perf bench' suite.
Yeah, the model would be what I did for uprobes, but even then there is a selftests based uprobes benchmark ;-)
The 'perf bench' part, that calls into the skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
While this one is just to generate BPF load to measure the impact on uprobes, for your case it would involve using a ring buffer to communicate from the skel (BPF/kernel side) to the userspace part, similar to what is done in various other BPF based perf tooling available in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
Like at this line (BPF skel part):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tre...
The simplest part is in the canonical, standalone runqslower tool, also hosted in the kernel sources:
BPF skel sending stuff to userspace:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The userspace part that reads it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
This is a callback that gets called for every event that the BPF skel produces, called from this loop:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
That handle_event callback was associated via:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
There is a dissection I did about this process a long time ago, but still relevant, I think:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
The part explaining the interaction userspace/kernel starts here:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
(yeah, its http, but then, its _old_vger ;-)
Doing it in perf is interesting because it gets widely packaged, so whatever you add to it gets visibility for people using 'perf bench' and also gets available in most places, it would add to this collection:
root@number:~# perf bench
Usage: perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

        sched: Scheduler and IPC benchmarks
      syscall: System call benchmarks
          mem: Memory access benchmarks
         numa: NUMA scheduling and MM benchmarks
        futex: Futex stressing benchmarks
        epoll: Epoll stressing benchmarks
    internals: Perf-internals benchmarks
   breakpoint: Breakpoint benchmarks
       uprobe: uprobe benchmarks
          all: All benchmarks
root@number:~#
the 'perf bench' that uses BPF skel:
root@number:~# perf bench uprobe baseline
# Running 'uprobe/baseline' benchmark:
# Executed 1,000 usleep(1000) calls
  Total time: 1,050,383 usecs

  1,050.383 usecs/op
root@number:~# perf trace --summary perf bench uprobe trace_printk
# Running 'uprobe/trace_printk' benchmark:
# Executed 1,000 usleep(1000) calls
  Total time: 1,053,082 usecs

  1,053.082 usecs/op
Summary of events:
uprobe-trace_pr (1247691), 3316 events, 96.9%
   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)      (%)
   clock_nanosleep     1000       0 1101.236     1.007     1.101    50.939   4.53%
   close                 98       0   32.979     0.001     0.337    32.821  99.52%
   perf_event_open        1       0   18.691    18.691    18.691    18.691   0.00%
   mmap                 209       0    0.567     0.001     0.003     0.007   2.59%
   bpf                   38       2    0.380     0.000     0.010     0.092  28.38%
   openat                65       0    0.171     0.001     0.003     0.012   7.14%
   mprotect              56       0    0.141     0.001     0.003     0.008   6.86%
   read                  68       0    0.082     0.001     0.001     0.010  11.60%
   fstat                 65       0    0.056     0.001     0.001     0.003   5.40%
   brk                   10       0    0.050     0.001     0.005     0.012  24.29%
   pread64                8       0    0.042     0.001     0.005     0.021  49.29%
<SNIP other syscalls>
root@number:~#
Cool, thanks for the pointers! Guess we'd need to restructure the functions to be benchmarked a bit, but that should be doable I guess.
-Toke
Hi all,
This is very useful.
On Wed, 28 May 2025 at 16:51, Arnaldo Carvalho de Melo acme@kernel.org wrote:
On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
Mina Almasry almasrymina@google.com writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge.
agreed
I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
tldr; I'd advise to merge it as-is, then kfunc'ify parts of it and use it from a 'perf bench' suite.
Yeah, the model would be what I did for uprobes, but even then there is a selftests based uprobes benchmark ;-)
The 'perf bench' part, that calls into the skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
While this one is just to generate BPF load to measure the impact on uprobes, for your case it would involve using a ring buffer to communicate from the skel (BPF/kernel side) to the userspace part, similar to what is done in various other BPF based perf tooling available in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
Like at this line (BPF skel part):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tre...
The simplest part is in the canonical, standalone runqslower tool, also hosted in the kernel sources:
BPF skel sending stuff to userspace:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The userspace part that reads it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
This is a callback that gets called for every event that the BPF skel produces, called from this loop:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
That handle_event callback was associated via:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
There is a dissection I did about this process a long time ago, but still relevant, I think:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
The part explaining the interaction userspace/kernel starts here:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
(yeah, its http, but then, its _old_vger ;-)
Doing it in perf is interesting because it gets widely packaged, so whatever you add to it gets visibility for people using 'perf bench' and also gets available in most places, it would add to this collection:
root@number:~# perf bench
Usage: perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

        sched: Scheduler and IPC benchmarks
      syscall: System call benchmarks
          mem: Memory access benchmarks
         numa: NUMA scheduling and MM benchmarks
        futex: Futex stressing benchmarks
        epoll: Epoll stressing benchmarks
    internals: Perf-internals benchmarks
   breakpoint: Breakpoint benchmarks
       uprobe: uprobe benchmarks
          all: All benchmarks
root@number:~#
the 'perf bench' that uses BPF skel:
root@number:~# perf bench uprobe baseline
# Running 'uprobe/baseline' benchmark:
# Executed 1,000 usleep(1000) calls
  Total time: 1,050,383 usecs

  1,050.383 usecs/op
root@number:~# perf trace --summary perf bench uprobe trace_printk
# Running 'uprobe/trace_printk' benchmark:
# Executed 1,000 usleep(1000) calls
  Total time: 1,053,082 usecs

  1,053.082 usecs/op
Summary of events:
uprobe-trace_pr (1247691), 3316 events, 96.9%
   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)      (%)
   clock_nanosleep     1000       0 1101.236     1.007     1.101    50.939   4.53%
   close                 98       0   32.979     0.001     0.337    32.821  99.52%
   perf_event_open        1       0   18.691    18.691    18.691    18.691   0.00%
   mmap                 209       0    0.567     0.001     0.003     0.007   2.59%
   bpf                   38       2    0.380     0.000     0.010     0.092  28.38%
   openat                65       0    0.171     0.001     0.003     0.012   7.14%
   mprotect              56       0    0.141     0.001     0.003     0.008   6.86%
   read                  68       0    0.082     0.001     0.001     0.010  11.60%
   fstat                 65       0    0.056     0.001     0.001     0.003   5.40%
   brk                   10       0    0.050     0.001     0.005     0.012  24.29%
   pread64                8       0    0.042     0.001     0.005     0.021  49.29%
<SNIP other syscalls>
root@number:~#
Thanks for all the pointers here. Overall I agree we should merge this. Yes it's not ideal, but we've been pointing people to run it over several years before accepting patches. Having it out of tree doesn't help much. It's a test, it's a bit ugly now, but it serves our purpose and the maintenance burden is minimal.
Acked-by: Ilias Apalodimas ilias.apalodimas@linaro.org
- Arnaldo
On Wed, May 28, 2025 at 2:28 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Mina Almasry almasrymina@google.com writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
ptr_ring results: no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
slow path results: no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge. I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
Thanks for the pliability. Reviewed-bys and comments welcome.
Additionally, a Signed-off-by from Jesper is needed, I think. Since most of this code is his, I retained his authorship. Jesper, whenever this looks good to you, a Signed-off-by would be good and I would carry it to future versions. Changing authorship to me is also fine by me, but I would think you want to retain the credit.
On 28/05/2025 21.46, Mina Almasry wrote:
On Wed, May 28, 2025 at 2:28 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Mina Almasry almasrymina@google.com writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen toke@redhat.com wrote:
Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
ptr_ring results: no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
slow path results: no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
WDYT of that idea? :)
...but this sounds like an enormous amount of effort, for something that is a bit ugly but isn't THAT bad. Especially for me, I'm not that much of an expert that I know how to implement what you're referring to off the top of my head. I normally am open to spending time but this is not that high on my todolist and I have limited bandwidth to resolve this :(
I also feel that this is something that could be improved post merge. I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from nipa. If we could agree on mvp that is appropriate to merge without too much scope creep that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
Thanks for the pliability. Reviewed-bys and comments welcome.
Additionally, a Signed-off-by from Jesper is needed, I think. Since most of this code is his, I retained his authorship. Jesper, whenever this looks good to you, a Signed-off-by would be good and I would carry it to future versions. Changing authorship to me is also fine by me, but I would think you want to retain the credit.
Okay, I think Ilias's comment[1] and ACK convinced me; let us merge this as-is. We have been asking people to run it over several years before accepting patches. We shouldn't be pointing people to out-of-tree tests when accepting patches.
It is not perfect, but it has served us well for benchmarking over roughly the last 10 years (5 years for the page_pool test). It is isolated as a selftest under tools/testing/selftests/net/bench/page_pool/.
Realistically, we are all too busy to invent a new "perfect" benchmark for page_pool. That said, I do encourage others with free cycles to integrate a better benchmark test into `perf bench`. Then we can just remove this module again.
Signed-off-by: Jesper Dangaard Brouer hawk@kernel.org
[1] https://lore.kernel.org/all/CAC_iWjLmO4XZ_+PBaCNxpVCTmGKNBsLGyeeKS2ptRrepn1u...
Thanks Mina for pushing this forward,
--Jesper