On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
Mina Almasry <almasrymina@google.com> writes:
On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
Back when you posted the first RFC, Jesper and I chatted about ways to avoid the ugly "load module and read the output from dmesg" interface to the test.
I agree the existing interface is ugly.
One idea we came up with was to make the module include only the "inner" functions for the benchmark, and expose those to BPF as kfuncs. Then the test runner can be a BPF program that runs the tests, collects the data and passes it to userspace via maps or a ringbuffer or something. That's a nicer and more customisable interface than the printk output. And if they're small enough, maybe we could even include the functions into the page_pool code itself, instead of in a separate benchmark module?
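A minimal sketch of what the module side could look like, using the kernel's kfunc registration machinery; bench_page_pool_cycle() is a made-up name and its body is a placeholder, not the actual benchmark code:

/* Hypothetical sketch: expose one "inner" benchmark function as a kfunc
 * so a BPF program can drive it. Untested illustration only. */
#include <linux/module.h>
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>

__bpf_kfunc_start_defs();

__bpf_kfunc u64 bench_page_pool_cycle(u32 nr_pages)
{
	/* ... one timed alloc/free cycle on a page_pool, returning
	 * the measured cost ... */
	return 0;
}

__bpf_kfunc_end_defs();

BTF_KFUNCS_START(bench_kfunc_ids)
BTF_ID_FLAGS(func, bench_page_pool_cycle)
BTF_KFUNCS_END(bench_kfunc_ids)

static const struct btf_kfunc_id_set bench_kfunc_set = {
	.owner	= THIS_MODULE,
	.set	= &bench_kfunc_ids,
};

static int __init bench_init(void)
{
	/* Make the kfunc callable from BPF_PROG_TYPE_SYSCALL programs,
	 * which can be run on demand via BPF_PROG_TEST_RUN. */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
					 &bench_kfunc_set);
}
module_init(bench_init);
MODULE_LICENSE("GPL");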
WDYT of that idea? :)
...but this sounds like an enormous amount of effort for something that is a bit ugly but isn't THAT bad. Especially for me: I'm not enough of an expert to know how to implement what you're referring to off the top of my head. I'm normally open to spending the time, but this is not that high on my todo list and I have limited bandwidth to resolve it :(
I also feel that this is something that could be improved post merge.
agreed
I think it's very beneficial to have this merged in some form that can be improved later. Byungchul is making a lot of changes to these mm things, and it would be nice to have an easy way to run the benchmark in tree and maybe even get automated results from NIPA. If we could agree on an MVP that is appropriate to merge without too much scope creep, that would be ideal from my side at least.
Right, fair. I guess we can merge it as-is, and then investigate whether we can move it to something BPF-based (or maybe 'perf bench' - Cc acme) later :)
tl;dr: I'd advise merging it as-is, then kfunc'ifying parts of it and using it from a 'perf bench' suite.
Yeah, the model would be what I did for uprobes, but even then there is a selftests based uprobes benchmark ;-)
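For reference, the glue in tools/perf/builtin-bench.c is just a table of collections; a page_pool collection would look roughly like this (a sketch only, the bench_page_pool_fast_path() name is made up, the table layout mirrors the existing collections):

/* Sketch of builtin-bench.c glue for a hypothetical new collection. */
static struct bench page_pool_benchmarks[] = {
	{ "fast_path",	"page_pool fast path recycling cost",	bench_page_pool_fast_path },
	{ "all",	"Run all page_pool benchmarks",		NULL },
	{ NULL,		NULL,					NULL }
};

/* ...plus one line in the collections[] table: */
	{ "page_pool",	"page_pool benchmarks",			page_pool_benchmarks },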
The 'perf bench' part that calls into the skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
While this one is just there to generate BPF load to measure the impact on uprobes, for your case it would involve using a ring buffer to communicate from the skel (BPF/kernel side) to the userspace part, similar to what is done in various other BPF-based perf tooling available in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
Like at this line (BPF skel part):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tre...
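Stripped down, the skel side of that pattern is just a ringbuf map plus reserve/fill/submit. A sketch (struct bench_event and the map/program names are made up, not code from that tree):

// SPDX-License-Identifier: GPL-2.0
/* BPF-side sketch: reserve a ringbuf sample, fill it, submit it. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct bench_event {
	__u64 cycles;
	__u32 step;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("syscall")
int run_bench(void *ctx)
{
	struct bench_event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	/* Stand-in for calling the benchmark kfunc and recording its cost. */
	e->cycles = bpf_ktime_get_ns();
	e->step = 0;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";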
The simplest example is in the canonical, standalone runqslower tool, also hosted in the kernel sources:
BPF skel sending stuff to userspace:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
The userspace part that reads it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
This is a callback that gets invoked for every event the BPF skel produces; it is called from this loop:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
That handle_event callback was associated via:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
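Boiled down, that association plus the poll loop look something like this (a sketch only; runqslower itself may use a different buffer type, this one uses the libbpf ringbuf API to match the BPF side sketched above, and "bench.skel.h"/bench_bpf are made-up skeleton names):

// SPDX-License-Identifier: GPL-2.0
/* Userspace sketch: associate a callback with the ringbuf, then poll. */
#include <stdio.h>
#include <linux/types.h>
#include <bpf/libbpf.h>
#include "bench.skel.h"

struct bench_event {
	__u64 cycles;
	__u32 step;
};

/* Invoked once per sample submitted by the BPF side. */
static int handle_event(void *ctx, void *data, size_t data_sz)
{
	const struct bench_event *e = data;

	printf("step %u: %llu ns\n", e->step, (unsigned long long)e->cycles);
	return 0;
}

int main(void)
{
	struct bench_bpf *skel = bench_bpf__open_and_load();
	struct ring_buffer *rb = NULL;
	int err = 1;

	if (!skel)
		return 1;

	/* Associate the callback with the ringbuf map fd... */
	rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
			      handle_event, NULL, NULL);
	if (!rb)
		goto out;

	/* ...then poll; handle_event() fires for each submitted sample. */
	err = 0;
	while (ring_buffer__poll(rb, 100 /* timeout, ms */) >= 0)
		;
out:
	ring_buffer__free(rb);
	bench_bpf__destroy(skel);
	return err;
}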
There is a dissection I did of this process a long time ago, but it's still relevant, I think:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
The part explaining the interaction userspace/kernel starts here:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-pr...
(yeah, it's http, but then, it's _old_vger ;-)
Doing it in perf is interesting because perf gets widely packaged, so whatever you add to it gets visibility for people using 'perf bench' and also becomes available in most places. It would add to this collection:
root@number:~# perf bench
Usage: perf bench [<common options>] <collection> <benchmark> [<options>]

        # List of all available benchmark collections:

         sched: Scheduler and IPC benchmarks
       syscall: System call benchmarks
           mem: Memory access benchmarks
          numa: NUMA scheduling and MM benchmarks
         futex: Futex stressing benchmarks
         epoll: Epoll stressing benchmarks
     internals: Perf-internals benchmarks
    breakpoint: Breakpoint benchmarks
        uprobe: uprobe benchmarks
           all: All benchmarks
root@number:~#
The 'perf bench' benchmark that uses a BPF skel:
root@number:~# perf bench uprobe baseline
# Running 'uprobe/baseline' benchmark:
# Executed 1,000 usleep(1000) calls
Total time: 1,050,383 usecs

  1,050.383 usecs/op
root@number:~# perf trace --summary perf bench uprobe trace_printk
# Running 'uprobe/trace_printk' benchmark:
# Executed 1,000 usleep(1000) calls
Total time: 1,053,082 usecs

  1,053.082 usecs/op

 Summary of events:

 uprobe-trace_pr (1247691), 3316 events, 96.9%

   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)      (%)
   --------------- --------  ------ -------- --------- --------- --------- ------
   clock_nanosleep     1000      0  1101.236     1.007     1.101    50.939   4.53%
   close                 98      0    32.979     0.001     0.337    32.821  99.52%
   perf_event_open        1      0    18.691    18.691    18.691    18.691   0.00%
   mmap                 209      0     0.567     0.001     0.003     0.007   2.59%
   bpf                   38      2     0.380     0.000     0.010     0.092  28.38%
   openat                65      0     0.171     0.001     0.003     0.012   7.14%
   mprotect              56      0     0.141     0.001     0.003     0.008   6.86%
   read                  68      0     0.082     0.001     0.001     0.010  11.60%
   fstat                 65      0     0.056     0.001     0.001     0.003   5.40%
   brk                   10      0     0.050     0.001     0.005     0.012  24.29%
   pread64                8      0     0.042     0.001     0.005     0.021  49.29%
   <SNIP other syscalls>
root@number:~#
- Arnaldo