On 2024/7/10 7:55, Andrii Nakryiko wrote:
On Mon, Jul 8, 2024 at 6:00 PM Liao Chang <liaochang1@huawei.com> wrote:
Reduce the runtime overhead of managing struct return_instance data in uretprobe. This patch replaces the per-instance dynamic allocation with a preallocated, fixed-size array. It relies on two facts: the nesting depth of uretprobes is limited (max 64), and return_instance follows function-call-style usage (created at entry, freed at exit).
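The stack-like usage described above can be modeled in userspace roughly as follows. This is only a sketch under stated assumptions: `ri_model`, `frame_model`, `frame_push`, `frame_pop` and `frame_destroy` are hypothetical names, the struct fields are illustrative, and `calloc` stands in for the kernel allocator. It is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_DEPTH 64 /* mirrors the nesting limit quoted in the patch */

/* Hypothetical stand-in for struct return_instance; fields are illustrative. */
struct ri_model {
	unsigned long func;
	unsigned long orig_ret_vaddr;
};

/* Hypothetical stand-in for the per-task frame. */
struct frame_model {
	struct ri_model *slots; /* lazily allocated array of MAX_DEPTH slots */
	size_t depth;           /* number of slots currently in use */
};

/* Take the next slot at function entry; NULL on OOM or when the limit is hit. */
static struct ri_model *frame_push(struct frame_model *f)
{
	if (!f->slots) {
		f->slots = calloc(MAX_DEPTH, sizeof(*f->slots));
		if (!f->slots)
			return NULL;
	}
	if (f->depth >= MAX_DEPTH)
		return NULL;
	return &f->slots[f->depth++];
}

/* Release the most recent slot at function return; no free() is needed. */
static void frame_pop(struct frame_model *f)
{
	assert(f->depth > 0);
	f->depth--;
}

/* Release the backing array when the task goes away. */
static void frame_destroy(struct frame_model *f)
{
	free(f->slots);
	f->slots = NULL;
	f->depth = 0;
}
```

Because entries are created and released in strict LIFO order, push and pop reduce to an index increment and decrement. The cost is that the full array is reserved on first use, regardless of how deep the nesting actually gets.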
This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz. Redis benchmarks show a throughput gain of about 2% for the Redis GET and SET commands compared with the current uretprobe implementation:
Test case       | No uretprobes | uretprobes     | uretprobes
                |               | (current)      | (optimized)
==================================================================
Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
Signed-off-by: Liao Chang <liaochang1@huawei.com>
 include/linux/uprobes.h |  10 ++-
 kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
 2 files changed, 105 insertions(+), 67 deletions(-)
[...]
+static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
+				     struct pt_regs *regs)
+{
+	struct return_frame *frame = &utask->frame;
+	struct return_instance *ri = frame->return_instance;
+	enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
+	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
+		ri = next_ret_instance(frame, ri);
+		utask->depth--;
+	}
+	frame->return_instance = ri;
+}
+
+static struct return_instance *alloc_return_instance(struct uprobe_task *task)
+{
+	struct return_frame *frame = &task->frame;
+	if (!frame->vaddr) {
+		frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
+				sizeof(struct return_instance), GFP_KERNEL);
Are you just pre-allocating MAX_URETPROBE_DEPTH instances always? I.e., even if we need just one (because there is no recursion), you'd still waste memory for all 64 ones?
That's right. On my test machines, each struct return_instance is 28 bytes, so 64 * 28 = 1792 bytes are pre-allocated when the first instrumented function is hit.
That seems rather wasteful.
Have you considered using objpool for fast reuse across multiple CPUs? Check lib/objpool.c.
After studying how kretprobe uses objpool, I'm convinced it is the right solution for managing return_instance in uretprobe. However, I need some time to fully understand the objpool code itself and to run some benchmarks to verify its performance.
Thanks for the suggestion.
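To make the trade-off concrete, the pooling idea can be illustrated with a simplified, single-threaded free-list pool in userspace. This is a sketch only, and it is neither the API nor the design of lib/objpool.c: `pooled_ri`, `ri_pool`, `pool_get`, `pool_put` and `pool_destroy` are hypothetical names, and growing on demand with `calloc` is an assumption made for illustration. It shows how idle objects can be reused without reserving all 64 slots up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pooled object; next_free is only meaningful while idle. */
struct pooled_ri {
	struct pooled_ri *next_free;
	unsigned long func;
	unsigned long orig_ret_vaddr;
};

/* Hypothetical single-threaded pool: a LIFO list of idle objects. */
struct ri_pool {
	struct pooled_ri *free_list;
	size_t allocated; /* objects obtained from the allocator so far */
};

/* Reuse an idle object if one exists, otherwise allocate a new one. */
static struct pooled_ri *pool_get(struct ri_pool *p)
{
	struct pooled_ri *ri = p->free_list;

	if (ri) {
		p->free_list = ri->next_free;
		return ri;
	}
	ri = calloc(1, sizeof(*ri));
	if (ri)
		p->allocated++;
	return ri;
}

/* Return an object to the pool instead of freeing it. */
static void pool_put(struct ri_pool *p, struct pooled_ri *ri)
{
	assert(ri != NULL);
	ri->next_free = p->free_list;
	p->free_list = ri;
}

/* Free every idle object; objects still in use must be put back first. */
static void pool_destroy(struct ri_pool *p)
{
	while (p->free_list) {
		struct pooled_ri *ri = p->free_list;

		p->free_list = ri->next_free;
		free(ri);
	}
	p->allocated = 0;
}
```

In this model a task that never nests probes pays for a single object, while the hot path is still a pointer swap in the common case. Whether a real objpool-based version achieves comparable numbers would have to be confirmed by the benchmarks mentioned above.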
+		if (!frame->vaddr)
+			return NULL;
+	}
+	if (!frame->return_instance) {
+		frame->return_instance = frame->vaddr;
+		return frame->return_instance;
+	}
+	return ++frame->return_instance;
+}
+
+static inline bool return_frame_empty(struct uprobe_task *task)
+{
+	return !task->frame.return_instance;
+}
/*
[...]