On Tue, Mar 16, 2021 at 05:22PM +0100, Peter Zijlstra wrote:
On Wed, Mar 10, 2021 at 11:41:34AM +0100, Marco Elver wrote:
Adds bit perf_event_attr::remove_on_exec, to support removing an event from a task on exec.
This option supports the case where an event is supposed to be process-wide only, and should not propagate beyond exec, to limit monitoring to the original process image only.
Signed-off-by: Marco Elver <elver@google.com>
+/*
+ * Removes all events from the current task that have been marked
+ * remove-on-exec, and feeds their values back to parent events.
+ */
+static void perf_event_remove_on_exec(void)
+{
+	int ctxn;
+	for_each_task_context_nr(ctxn) {
+		struct perf_event_context *ctx;
+		struct perf_event *event, *next;
+		ctx = perf_pin_task_context(current, ctxn);
+		if (!ctx)
+			continue;
+		mutex_lock(&ctx->mutex);
+		list_for_each_entry_safe(event, next, &ctx->event_list, event_entry) {
+			if (!event->attr.remove_on_exec)
+				continue;
+			if (!is_kernel_event(event))
+				perf_remove_from_owner(event);
+			perf_remove_from_context(event, DETACH_GROUP);
There's a comment on this in perf_event_exit_event(): if this task happens to have the original event, then DETACH_GROUP will destroy the grouping.
I think this wants to be:
	perf_remove_from_context(event, event->parent ? DETACH_GROUP : 0);
or something.
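For illustration only, the flag choice suggested above can be modelled in plain user-space C. This is a sketch under stated assumptions, not kernel code: `toy_event`, `TOY_DETACH_GROUP` and `toy_detach_flags()` are hypothetical stand-ins, and the only thing modelled is "detach from the group only if the event is an inherited child".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's DETACH_GROUP flag. */
#define TOY_DETACH_GROUP 0x01UL

/* Toy event: only the field that the decision depends on. */
struct toy_event {
	struct toy_event *parent; /* NULL: original event; non-NULL: inherited child */
};

/*
 * Pick the detach flags for removing an event: only an inherited (child)
 * event is pulled out of its group; an original event keeps its grouping.
 */
static unsigned long toy_detach_flags(const struct toy_event *event)
{
	assert(event != NULL);
	return event->parent ? TOY_DETACH_GROUP : 0UL;
}
```

With this model, an event whose `parent` is NULL yields 0, so its grouping is left alone, while an inherited child yields `TOY_DETACH_GROUP`.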
+			/*
+			 * Remove the event and feed back its values to the
+			 * parent event.
+			 */
+			perf_event_exit_event(event, ctx, current);
Oooh, and here we call it... but it will do list_del_event() / perf_group_detach() *again*.
So the problem is that perf_event_exit_task_context() doesn't use perf_remove_from_context(), but instead does task_ctx_sched_out() and then relies on the events not being active.
Whereas above you *DO* use perf_remove_from_context(), but then perf_event_exit_event() will try to remove it again.
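To make the double-removal concern concrete, here is a toy user-space model. It is a sketch under stated assumptions and not the kernel's data structures: `toy_ctx`, `toy_node` and both helpers are hypothetical names. It only shows why removing the same event twice corrupts per-context accounting unless the removal path checks whether the event is still attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names are kernel API. */
struct toy_ctx {
	int nr_events;	/* number of events attached to the context */
};

struct toy_node {
	bool attached;	/* true while the event is on the context's list */
};

/* Unguarded removal: every call decrements the counter. */
static void toy_remove_unguarded(struct toy_ctx *ctx, struct toy_node *event)
{
	assert(ctx != NULL && event != NULL);
	event->attached = false;
	ctx->nr_events--;
}

/* Guarded removal: a second call on the same event is a no-op. */
static void toy_remove_guarded(struct toy_ctx *ctx, struct toy_node *event)
{
	assert(ctx != NULL && event != NULL);
	if (!event->attached)
		return;
	event->attached = false;
	ctx->nr_events--;
}
```

In this model, calling the unguarded helper twice on a context holding one event leaves `nr_events` at -1, whereas the guarded helper leaves it at 0.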
AFAIK, we want to deallocate the events and not just remove them, so doing what perf_event_exit_event() does is the right way forward? Or did you have something else in mind?
I'm still trying to make sense of the zoo of synchronisation mechanisms at play here. No matter what I try, it seems I get stuck on the fact that I can't cleanly "pause" the context to remove the events (warnings in event_function()).
This is what I've been playing with to understand:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 450ea9415ed7..c585cef284a0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4195,6 +4195,88 @@ static void perf_event_enable_on_exec(int ctxn)
 		put_ctx(clone_ctx);
 }
 
+static void perf_remove_from_owner(struct perf_event *event);
+static void perf_event_exit_event(struct perf_event *child_event,
+				  struct perf_event_context *child_ctx,
+				  struct task_struct *child);
+
+/*
+ * Removes all events from the current task that have been marked
+ * remove-on-exec, and feeds their values back to parent events.
+ */
+static void perf_event_remove_on_exec(void)
+{
+	struct perf_event *event, *next;
+	int ctxn;
+
+	/***************** BROKEN BROKEN BROKEN *****************/
+
+	for_each_task_context_nr(ctxn) {
+		struct perf_event_context *ctx;
+		bool removed = false;
+
+		ctx = perf_pin_task_context(current, ctxn);
+		if (!ctx)
+			continue;
+		mutex_lock(&ctx->mutex);
+
+		raw_spin_lock_irq(&ctx->lock);
+		/*
+		 * WIP: Ok, we will unschedule the context, _and_ tell everyone
+		 * still trying to use that it's dead... even though it isn't.
+		 *
+		 * This can't be right...
+		 */
+		task_ctx_sched_out(__get_cpu_context(ctx), ctx, EVENT_ALL);
+		RCU_INIT_POINTER(current->perf_event_ctxp[ctxn], NULL);
+		WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
This code here is obviously bogus, because it removes the context from the task: we might still need it since this task is not dead yet.
What's the right way to pause the context to remove the events from it?
+		raw_spin_unlock_irq(&ctx->lock);
+
+		list_for_each_entry_safe(event, next, &ctx->event_list, event_entry) {
+			if (!event->attr.remove_on_exec)
+				continue;
+			removed = true;
+
+			if (!is_kernel_event(event))
+				perf_remove_from_owner(event);
+
+			/*
+			 * WIP: Want to free the event and feed back its values
+			 * to the parent (if any) ...
+			 */
+			perf_event_exit_event(event, ctx, current);
+		}
+
... need to schedule context back in here?
+
+		mutex_unlock(&ctx->mutex);
+		perf_unpin_context(ctx);
+		put_ctx(ctx);
+	}
+}
+
 struct perf_read_data {
 	struct perf_event *event;
 	bool group;
@@ -7553,6 +7635,8 @@ void perf_event_exec(void)
 				   true);
 	}
 	rcu_read_unlock();
+
+	perf_event_remove_on_exec();
 }
Thanks,
-- Marco