This patch set implements the necessary kernel changes for persistent events.
Persistent events run standalone in the system, without a controlling process holding the event's file descriptor. The events are always enabled and collect data samples in a ring buffer. Processes may connect to existing persistent events using the perf_event_open() syscall. For this, the syscall must be set up with the new PERF_TYPE_PERSISTENT event type and a unique event identifier specified in attr.config. The id is exposed in sysfs or assigned via ioctl (see below).
Persistent event buffers may be accessed with mmap() in the same way as for any other event. Since the buffers may be used by multiple processes at the same time, access to them is read-only. Currently only per-cpu events are supported, so root access is needed too.
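For illustration, the userspace side could look like the following sketch. The helper names are hypothetical; PERF_TYPE_PERSISTENT is taken as 6 per this patch set, and actually running this requires root and a kernel with these patches applied:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Added by this patch set; not in unpatched uapi headers. */
#define PERF_TYPE_PERSISTENT 6

/* mmap length: one metadata page plus nr_pages data pages. */
static size_t persistent_mmap_len(int nr_pages, long page_size)
{
	return (size_t)(nr_pages + 1) * page_size;
}

/* Connect to an existing persistent event identified by 'id'. */
static int open_persistent_event(int cpu, uint64_t id)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size   = sizeof(attr);
	attr.type   = PERF_TYPE_PERSISTENT;
	attr.config = id;	/* unique persistent event id from sysfs */

	/* pid == -1: per-cpu event; group_fd == -1, flags == 0 */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}
```

The returned fd would then be mapped read-only, e.g. mmap(NULL, persistent_mmap_len(128, sysconf(_SC_PAGESIZE)), PROT_READ, MAP_SHARED, fd, 0); a PROT_WRITE mapping is rejected since the buffer is shared.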
Persistent events are visible in sysfs. They are added or removed dynamically. With the information in sysfs, userland knows how to set up the perf_event attributes of a persistent event. Since a persistent event always has the persistent flag set, a way is needed to express this in sysfs, so a new syntax is introduced. With 'attr<num>:<mask>' any bit in the attribute structure may be set in a similar way as with 'config<num>', but <num> is an index that points to the u64 value to change within the attribute structure.
For persistent events the persistent flag (bit 23 of the flags field in struct perf_event_attr) needs to be set, which is expressed in sysfs as "attr5:23". E.g. the mce_record event is described in sysfs as follows:
/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
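The "attr5:23" encoding can be checked against the ABI layout. A minimal userspace sketch (using the unpatched struct perf_event_attr, where bit 23 of the flag word is still reserved; this series names it 'persistent'):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <linux/perf_event.h>

#define PERSISTENT_ATTR_INDEX	5	/* u64 index of the flag word */
#define PERSISTENT_ATTR_BIT	23	/* bit position of 'persistent' */

/* Generic attr<idx>:<bit> setter: treat the attr as an array of u64. */
static void attr_set_bit(struct perf_event_attr *attr, int idx, int bit)
{
	uint64_t *words = (uint64_t *)attr;

	words[idx] |= 1ULL << bit;
}

/*
 * Index 5 follows from the layout: type + size (two u32s, 8 bytes),
 * then the u64 fields config, sample_period, sample_type, read_format,
 * and finally the flag word at byte offset 40, i.e. 40 / 8 == 5.
 */
static int persistent_flag_index(void)
{
	return (offsetof(struct perf_event_attr, read_format)
		+ sizeof(uint64_t)) / sizeof(uint64_t);
}
```

Bit 23 likewise falls out of counting the 23 flag bits (disabled through exclude_callchain_user) that precede 'persistent' in the flag word, which matches the patch shrinking __reserved_1 from 41 to 40 bits.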
Note that perf tools need to support the 'attr<num>' syntax, which is added in a separate patch set. With it we are able to run perf tool commands to read persistent events, e.g.:
 # perf record -e persistent/mce_record/ sleep 10
 # perf top -e persistent/mce_record/
In general, the new syntax is flexible enough to describe via sysfs any event to be set up by perf tools.
There are ioctl functions to control persistent events that can be used to detach an event from or attach it to a process. The PERF_EVENT_IOC_DETACH ioctl call makes an event persistent. The perf_event_open() syscall can then be used to re-open the event from any process. The PERF_EVENT_IOC_ATTACH ioctl attaches the event again, so that it is removed after closing the event's fd.
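A sketch of the intended usage from userspace follows. The numeric ioctl request codes below are placeholders I made up; the real definitions come with the RFC patch and are not shown here:

```c
#include <sys/ioctl.h>

/* Placeholder encodings -- the RFC patch defines the real values. */
#define PERF_EVENT_IOC_ATTACH	_IO('$', 9)
#define PERF_EVENT_IOC_DETACH	_IO('$', 10)

/* Detach the event behind fd: it keeps running after close(fd). */
static int make_persistent(int fd)
{
	return ioctl(fd, PERF_EVENT_IOC_DETACH, 0);
}

/* Re-attach: the event is torn down again when fd is closed. */
static int make_transient(int fd)
{
	return ioctl(fd, PERF_EVENT_IOC_ATTACH, 0);
}
```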
The patches are based on the original work from Borislav Petkov.
This version 3 of the patch set is a complete rework of the code. There are the following major changes:
* new event type PERF_TYPE_PERSISTENT introduced,
* support for all types of events,
* unique event ids,
* improvements in reference counting and locking,
* ioctl functions are added to control persistency,
* the sysfs implementation now uses variable list size.
This should address most issues discussed during the last review of version 2. The following is still unresolved and can be added later on top of these patches, if necessary:
* support for per-task events (also allowing non-root access),
* creation of persistent events for disabled cpus,
* make event persistent with already open (mmap'ed) buffers,
* make event persistent while creating it.
The first patches contain some rework of the perf mmap code to reuse it for persistent events.
Also note that patch 12 (ioctl functions to control persistency) is an RFC and untested. A perf tools implementation for this is missing, and ideas are needed on how this could be integrated, especially into something like perf trace.
All patches can be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v3
Note: I will resend the perf tools patch necessary to use persistent events.
-Robert
Borislav Petkov (1):
  mce, x86: Enable persistent events

Robert Richter (11):
  perf, mmap: Factor out ring_buffer_detach_all()
  perf, mmap: Factor out try_get_event()/put_event()
  perf, mmap: Factor out perf_alloc/free_rb()
  perf, mmap: Factor out perf_get_fd()
  perf: Add persistent events
  perf, persistent: Implementing a persistent pmu
  perf, persistent: Exposing persistent events using sysfs
  perf, persistent: Use unique event ids
  perf, persistent: Implement reference counter for events
  perf, persistent: Dynamically resize list of sysfs entries
  [RFC] perf, persistent: ioctl functions to control persistency
 .../testing/sysfs-bus-event_source-devices-format |  43 +-
 arch/x86/kernel/cpu/mcheck/mce.c                  |  19 +
 include/linux/perf_event.h                        |  12 +-
 include/uapi/linux/perf_event.h                   |   6 +-
 kernel/events/Makefile                            |   2 +-
 kernel/events/core.c                              | 210 +++++---
 kernel/events/internal.h                          |  20 +
 kernel/events/persistent.c                        | 563 +++++++++++++++++++++
 8 files changed, 779 insertions(+), 96 deletions(-)
 create mode 100644 kernel/events/persistent.c
From: Robert Richter <robert.richter@linaro.org>
Factor out a function to detach all events from a ringbuffer. No functional changes.
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/core.c | 82 ++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 38 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 928fae7c..5dcc5fe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3775,6 +3775,49 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }
 
+static void ring_buffer_detach_all(struct ring_buffer *rb)
+{
+	struct perf_event *event;
+again:
+	rcu_read_lock();
+	list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+		if (!atomic_long_inc_not_zero(&event->refcount)) {
+			/*
+			 * This event is en-route to free_event() which will
+			 * detach it and remove it from the list.
+			 */
+			continue;
+		}
+		rcu_read_unlock();
+
+		mutex_lock(&event->mmap_mutex);
+		/*
+		 * Check we didn't race with perf_event_set_output() which can
+		 * swizzle the rb from under us while we were waiting to
+		 * acquire mmap_mutex.
+		 *
+		 * If we find a different rb; ignore this event, a next
+		 * iteration will no longer find it on the list. We have to
+		 * still restart the iteration to make sure we're not now
+		 * iterating the wrong list.
+		 */
+		if (event->rb == rb) {
+			rcu_assign_pointer(event->rb, NULL);
+			ring_buffer_detach(event, rb);
+			ring_buffer_put(rb); /* can't be last, we still have one */
+		}
+		mutex_unlock(&event->mmap_mutex);
+		put_event(event);
+
+		/*
+		 * Restart the iteration; either we're on the wrong list or
+		 * destroyed its integrity by doing a deletion.
+		 */
+		goto again;
+	}
+	rcu_read_unlock();
+}
+
 static void ring_buffer_wakeup(struct perf_event *event)
 {
 	struct ring_buffer *rb;
@@ -3867,44 +3910,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	 * into the now unreachable buffer. Somewhat complicated by the
 	 * fact that rb::event_lock otherwise nests inside mmap_mutex.
 	 */
-again:
-	rcu_read_lock();
-	list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
-		if (!atomic_long_inc_not_zero(&event->refcount)) {
-			/*
-			 * This event is en-route to free_event() which will
-			 * detach it and remove it from the list.
-			 */
-			continue;
-		}
-		rcu_read_unlock();
-
-		mutex_lock(&event->mmap_mutex);
-		/*
-		 * Check we didn't race with perf_event_set_output() which can
-		 * swizzle the rb from under us while we were waiting to
-		 * acquire mmap_mutex.
-		 *
-		 * If we find a different rb; ignore this event, a next
-		 * iteration will no longer find it on the list. We have to
-		 * still restart the iteration to make sure we're not now
-		 * iterating the wrong list.
-		 */
-		if (event->rb == rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* can't be last, we still have one */
-		}
-		mutex_unlock(&event->mmap_mutex);
-		put_event(event);
-
-		/*
-		 * Restart the iteration; either we're on the wrong list or
-		 * destroyed its integrity by doing a deletion.
-		 */
-		goto again;
-	}
-	rcu_read_unlock();
+	ring_buffer_detach_all(rb);
 
 	/*
 	 * It could be there's still a few 0-ref events on the list; they'll
From: Robert Richter <robert.richter@linaro.org>

Implement try_get_event() as counterpart to put_event(). Put both in internal.h to make them available to other perf files.

Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/core.c     |  9 +++------
 kernel/events/internal.h | 12 ++++++++++++
 2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5dcc5fe..c9a5d4c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3242,13 +3242,10 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);
 /*
  * Called when the last reference to the file is gone.
  */
-static void put_event(struct perf_event *event)
+void __put_event(struct perf_event *event)
 {
 	struct task_struct *owner;
 
-	if (!atomic_long_dec_and_test(&event->refcount))
-		return;
-
 	rcu_read_lock();
 	owner = ACCESS_ONCE(event->owner);
 	/*
@@ -3781,7 +3778,7 @@ static void ring_buffer_detach_all(struct ring_buffer *rb)
 again:
 	rcu_read_lock();
 	list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
-		if (!atomic_long_inc_not_zero(&event->refcount)) {
+		if (!try_get_event(event)) {
 			/*
 			 * This event is en-route to free_event() which will
 			 * detach it and remove it from the list.
@@ -7445,7 +7442,7 @@ inherit_event(struct perf_event *parent_event,
 	if (IS_ERR(child_event))
 		return child_event;
 
-	if (!atomic_long_inc_not_zero(&parent_event->refcount)) {
+	if (!try_get_event(parent_event)) {
 		free_event(child_event);
 		return NULL;
 	}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ca65997..96a07d2 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -178,4 +178,16 @@ static inline bool arch_perf_have_user_stack_dump(void)
 #define perf_user_stack_pointer(regs)	0
 #endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
 
+static inline bool try_get_event(struct perf_event *event)
+{
+	return atomic_long_inc_not_zero(&event->refcount) != 0;
+}
+
+extern void __put_event(struct perf_event *event);
+static inline void put_event(struct perf_event *event)
+{
+	if (!atomic_long_dec_and_test(&event->refcount))
+		return;
+	__put_event(event);
+}
+
 #endif /* _KERNEL_EVENTS_INTERNAL_H */
From: Robert Richter <robert.richter@linaro.org>

Factor out code to allocate and deallocate ringbuffers. We need this later to set up the sampling buffer for persistent events.

While at it, replace get_current_user() with get_uid(user).

Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/core.c     | 75 +++++++++++++++++++++++++++++-------------------
 kernel/events/internal.h |  3 ++
 2 files changed, 48 insertions(+), 30 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c9a5d4c..24810d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3124,8 +3124,44 @@ static void free_event_rcu(struct rcu_head *head)
 }
 
 static void ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event, struct ring_buffer *rb);
 static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
 
+/*
+ * Must be called with &event->mmap_mutex held. event->rb must be
+ * NULL. perf_alloc_rb() requires &event->mmap_count to be incremented
+ * on success which corresponds to &rb->mmap_count that is initialized
+ * with 1.
+ */
+int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags)
+{
+	struct ring_buffer *rb;
+
+	rb = rb_alloc(nr_pages,
+		event->attr.watermark ? event->attr.wakeup_watermark : 0,
+		event->cpu, flags);
+	if (!rb)
+		return -ENOMEM;
+
+	atomic_set(&rb->mmap_count, 1);
+	ring_buffer_attach(event, rb);
+	rcu_assign_pointer(event->rb, rb);
+
+	perf_event_update_userpage(event);
+
+	return 0;
+}
+
+/* Must be called with &event->mmap_mutex held. event->rb must be set. */
+void perf_free_rb(struct perf_event *event)
+{
+	struct ring_buffer *rb = event->rb;
+
+	rcu_assign_pointer(event->rb, NULL);
+	ring_buffer_detach(event, rb);
+	ring_buffer_put(rb);
+}
+
 static void unaccount_event_cpu(struct perf_event *event, int cpu)
 {
 	if (event->parent)
@@ -3177,6 +3213,7 @@ static void __free_event(struct perf_event *event)
 
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
+
 static void free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending);
@@ -3184,8 +3221,6 @@ static void free_event(struct perf_event *event)
 	unaccount_event(event);
 
 	if (event->rb) {
-		struct ring_buffer *rb;
-
 		/*
 		 * Can happen when we close an event with re-directed output.
 		 *
 		 * Since we have a 0 refcount, perf_mmap_close() will skip
 		 * over us; possibly making our ring_buffer_put() the last.
 		 */
 		mutex_lock(&event->mmap_mutex);
-		rb = event->rb;
-		if (rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* could be last */
-		}
+		if (event->rb)
+			perf_free_rb(event);
 		mutex_unlock(&event->mmap_mutex);
 	}
 
@@ -3798,11 +3829,8 @@ static void ring_buffer_detach_all(struct ring_buffer *rb)
 		 * still restart the iteration to make sure we're not now
 		 * iterating the wrong list.
 		 */
-		if (event->rb == rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* can't be last, we still have one */
-		}
+		if (event->rb == rb)
+			perf_free_rb(event);
 		mutex_unlock(&event->mmap_mutex);
 		put_event(event);
 
@@ -3938,7 +3966,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
 	unsigned long locked, lock_limit;
-	struct ring_buffer *rb;
 	unsigned long vma_size;
 	unsigned long nr_pages;
 	long user_extra, extra;
@@ -4022,27 +4049,15 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages,
-		event->attr.watermark ? event->attr.wakeup_watermark : 0,
-		event->cpu, flags);
-
-	if (!rb) {
-		ret = -ENOMEM;
+	ret = perf_alloc_rb(event, nr_pages, flags);
+	if (ret)
 		goto unlock;
-	}
 
-	atomic_set(&rb->mmap_count, 1);
-	rb->mmap_locked = extra;
-	rb->mmap_user = get_current_user();
+	event->rb->mmap_locked = extra;
+	event->rb->mmap_user = get_uid(user);
 
 	atomic_long_add(user_extra, &user->locked_vm);
 	vma->vm_mm->pinned_vm += extra;
-
-	ring_buffer_attach(event, rb);
-	rcu_assign_pointer(event->rb, rb);
-
-	perf_event_update_userpage(event);
-
 unlock:
 	if (!ret)
 		atomic_inc(&event->mmap_count);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 96a07d2..8ddaf57 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -190,4 +190,7 @@ static inline void put_event(struct perf_event *event)
 	__put_event(event);
 }
 
+extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
+extern void perf_free_rb(struct perf_event *event);
+
 #endif /* _KERNEL_EVENTS_INTERNAL_H */
From: Robert Richter <robert.richter@linaro.org>

Add a new function that creates a new fd for an event. We need this later to get a fd for a persistent event.

Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/core.c     | 13 ++++++++-----
 kernel/events/internal.h |  1 +
 2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 24810d5..932acc6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4100,6 +4100,11 @@ static const struct file_operations perf_fops = {
 	.fasync			= perf_fasync,
 };
 
+int perf_get_fd(struct perf_event *event)
+{
+	return anon_inode_getfd("[perf_event]", &perf_fops, event, O_RDWR);
+}
+
 /*
  * Perf event wakeup
  *
@@ -6868,7 +6873,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	struct perf_event *event, *sibling;
 	struct perf_event_attr attr;
 	struct perf_event_context *ctx;
-	struct file *event_file = NULL;
 	struct fd group = {NULL, 0};
 	struct task_struct *task = NULL;
 	struct pmu *pmu;
@@ -7025,9 +7029,9 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_context;
 	}
 
-	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, O_RDWR);
-	if (IS_ERR(event_file)) {
-		err = PTR_ERR(event_file);
+	event_fd = perf_get_fd(event);
+	if (event_fd < 0) {
+		err = event_fd;
 		goto err_context;
 	}
 
@@ -7093,7 +7097,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	 * perf_group_detach().
 	 */
 	fdput(group);
-	fd_install(event_fd, event_file);
 	return event_fd;
 
 err_context:
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 8ddaf57..d8708aa 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -192,5 +192,6 @@ static inline void put_event(struct perf_event *event)
 
 extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
 extern void perf_free_rb(struct perf_event *event);
+extern int perf_get_fd(struct perf_event *event);
 
 #endif /* _KERNEL_EVENTS_INTERNAL_H */
From: Robert Richter <robert.richter@linaro.org>
Add the needed pieces for persistent events, which make them process-agnostic. Also, make their buffers read-only when mmap'ing them from userspace.
Add a barebones implementation for registering persistent events with perf. For that, we don't destroy the buffers when they're unmapped; also, we map them read-only so that multiple agents can access them.
Also, we allocate the event buffers at event init time and not at mmap time so that we can log samples into them regardless of whether there are readers in userspace or not.
Multiple events from different cpus may map to a single persistent event entry which has a unique identifier. The identifier allows accessing the persistent event with the perf_event_open() syscall. For this the new event type PERF_TYPE_PERSISTENT must be set, with its id specified in attr.config. Currently there is only support for per-cpu events. Also, root access is required.
Since the buffers are shared, the set_output ioctl may not be used in conjunction with persistent events.
This patch only supports tracepoints; support for all event types is implemented in a later patch.
Based on patch set from Borislav Petkov <bp@alien8.de>.
Cc: Borislav Petkov <bp@alien8.de>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 include/linux/perf_event.h      |  12 ++-
 include/uapi/linux/perf_event.h |   4 +-
 kernel/events/Makefile          |   2 +-
 kernel/events/core.c            |  37 +++++--
 kernel/events/internal.h        |   2 +
 kernel/events/persistent.c      | 221 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 266 insertions(+), 12 deletions(-)
 create mode 100644 kernel/events/persistent.c
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c43f6ea..1a62a25 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -436,6 +436,8 @@ struct perf_event {
 	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
 	int				cgrp_defer_enabled;
 #endif
+	struct list_head		pevent_entry;	/* persistent event */
+	int				pevent_id;
 
 #endif /* CONFIG_PERF_EVENTS */
 };
@@ -765,7 +767,7 @@ extern void perf_event_enable(struct perf_event *event);
 extern void perf_event_disable(struct perf_event *event);
 extern int __perf_event_disable(void *info);
 extern void perf_event_task_tick(void);
-#else
+#else /* !CONFIG_PERF_EVENTS */
 static inline void
 perf_event_task_sched_in(struct task_struct *prev,
 			 struct task_struct *task)			{ }
@@ -805,7 +807,7 @@ static inline void perf_event_enable(struct perf_event *event)		{ }
 static inline void perf_event_disable(struct perf_event *event)	{ }
 static inline int __perf_event_disable(void *info)			{ return -1; }
 static inline void perf_event_task_tick(void)				{ }
-#endif
+#endif /* !CONFIG_PERF_EVENTS */
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
 extern bool perf_event_can_stop_tick(void);
@@ -819,6 +821,12 @@ extern void perf_restore_debug_store(void);
 static inline void perf_restore_debug_store(void)			{ }
 #endif
 
+#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_EVENT_TRACING)
+extern int perf_add_persistent_tp(struct ftrace_event_call *tp);
+#else
+static inline int perf_add_persistent_tp(void *tp) { return -ENOENT; }
+#endif
+
 #define perf_output_put(handle, x) perf_output_copy((handle), &(x), sizeof(x))
 
 /*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 62c25a2..2b84b97 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -32,6 +32,7 @@ enum perf_type_id {
 	PERF_TYPE_HW_CACHE			= 3,
 	PERF_TYPE_RAW				= 4,
 	PERF_TYPE_BREAKPOINT			= 5,
+	PERF_TYPE_PERSISTENT			= 6,
 
 	PERF_TYPE_MAX,				/* non-ABI */
 };
@@ -275,8 +276,9 @@ struct perf_event_attr {
 
 				exclude_callchain_kernel : 1, /* exclude kernel callchains */
 				exclude_callchain_user   : 1, /* exclude user callchains */
+				persistent     :  1, /* always-on event */
 
-				__reserved_1   : 41;
+				__reserved_1   : 40;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 103f5d1..70990d5 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_core.o = -pg
 endif
 
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o persistent.o
 
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 932acc6..d9d6e67 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3982,6 +3982,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
+	if (event->attr.persistent && (vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
 	vma_size = vma->vm_end - vma->vm_start;
 	nr_pages = (vma_size / PAGE_SIZE) - 1;
 
@@ -4007,6 +4010,11 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		goto unlock;
 	}
 
+	if (!event->rb->overwrite && vma->vm_flags & VM_WRITE) {
+		ret = -EACCES;
+		goto unlock;
+	}
+
 	if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
 		/*
 		 * Raced against perf_mmap_close() through
@@ -5845,7 +5853,7 @@ static struct pmu perf_tracepoint = {
 	.event_idx	= perf_swevent_event_idx,
 };
 
-static inline void perf_tp_register(void)
+static inline void perf_register_tp(void)
 {
 	perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
 }
@@ -5875,18 +5883,14 @@ static void perf_event_free_filter(struct perf_event *event)
 
 #else
 
-static inline void perf_tp_register(void)
-{
-}
+static inline void perf_register_tp(void) { }
 
 static int perf_event_set_filter(struct perf_event *event, void __user *arg)
 {
 	return -ENOENT;
 }
 
-static void perf_event_free_filter(struct perf_event *event)
-{
-}
+static void perf_event_free_filter(struct perf_event *event) { }
 
 #endif /* CONFIG_EVENT_TRACING */
 
@@ -6574,6 +6578,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
+	INIT_LIST_HEAD(&event->pevent_entry);
 
 	init_waitqueue_head(&event->waitq);
 	init_irq_work(&event->pending, perf_pending_event);
@@ -6831,6 +6836,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 		goto unlock;
 	}
 
+	/* Don't redirect read-only (persistent) events. */
+	ret = -EACCES;
+	if (old_rb && !old_rb->overwrite)
+		goto unlock;
+	if (rb && !rb->overwrite)
+		goto unlock;
+
 	if (old_rb)
 		ring_buffer_detach(event, old_rb);
 
@@ -6888,6 +6900,14 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (err)
 		return err;
 
+	/* return fd for an existing persistent event */
+	if (attr.type == PERF_TYPE_PERSISTENT)
+		return perf_get_persistent_event_fd(cpu, attr.config);
+
+	/* put event into persistent state (not yet supported) */
+	if (attr.persistent)
+		return -EOPNOTSUPP;
+
 	if (!attr.exclude_kernel) {
 		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
 			return -EACCES;
@@ -7828,7 +7848,8 @@ void __init perf_event_init(void)
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
 	perf_pmu_register(&perf_cpu_clock, NULL, -1);
 	perf_pmu_register(&perf_task_clock, NULL, -1);
-	perf_tp_register();
+	perf_register_tp();
+	perf_register_persistent();
 	perf_cpu_notifier(perf_cpu_notify);
 	register_reboot_notifier(&perf_reboot_notifier);
 
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index d8708aa..94c3f73 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -193,5 +193,7 @@ static inline void put_event(struct perf_event *event)
 extern int perf_alloc_rb(struct perf_event *event, int nr_pages, int flags);
 extern void perf_free_rb(struct perf_event *event);
 extern int perf_get_fd(struct perf_event *event);
+extern int perf_get_persistent_event_fd(int cpu, int id);
+extern void __init perf_register_persistent(void);
 
 #endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
new file mode 100644
index 0000000..926654f
--- /dev/null
+++ b/kernel/events/persistent.c
@@ -0,0 +1,221 @@
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <linux/ftrace_event.h>
+
+#include "internal.h"
+
+/* 512 kiB: default perf tools memory size, see perf_evlist__mmap() */
+#define CPU_BUFFER_NR_PAGES	((512 * 1024) / PAGE_SIZE)
+
+struct pevent {
+	char		*name;
+	int		id;
+};
+
+static DEFINE_PER_CPU(struct list_head, pevents);
+static DEFINE_PER_CPU(struct mutex, pevents_lock);
+
+/* Must be protected with pevents_lock. */
+static struct perf_event *__pevent_find(int cpu, int id)
+{
+	struct perf_event *event;
+
+	list_for_each_entry(event, &per_cpu(pevents, cpu), pevent_entry) {
+		if (event->pevent_id == id)
+			return event;
+	}
+
+	return NULL;
+}
+
+static int pevent_add(struct pevent *pevent, struct perf_event *event)
+{
+	int ret = -EEXIST;
+	int cpu = event->cpu;
+
+	mutex_lock(&per_cpu(pevents_lock, cpu));
+
+	if (__pevent_find(cpu, pevent->id))
+		goto unlock;
+
+	if (event->pevent_id)
+		goto unlock;
+
+	ret = 0;
+	event->pevent_id = pevent->id;
+	list_add_tail(&event->pevent_entry, &per_cpu(pevents, cpu));
+unlock:
+	mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+	return ret;
+}
+
+static struct perf_event *pevent_del(struct pevent *pevent, int cpu)
+{
+	struct perf_event *event;
+
+	mutex_lock(&per_cpu(pevents_lock, cpu));
+
+	event = __pevent_find(cpu, pevent->id);
+	if (event) {
+		list_del(&event->pevent_entry);
+		event->pevent_id = 0;
+	}
+
+	mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+	return event;
+}
+
+static void persistent_event_release(struct perf_event *event)
+{
+	/*
+	 * Safe since we hold &event->mmap_count. The ringbuffer is
+	 * released with put_event() if there are no other references.
+	 * In this case there are also no other mmaps.
+	 */
+	atomic_dec(&event->rb->mmap_count);
+	atomic_dec(&event->mmap_count);
+	put_event(event);
+}
+
+static int persistent_event_open(int cpu, struct pevent *pevent,
+				 struct perf_event_attr *attr, int nr_pages)
+{
+	struct perf_event *event;
+	int ret;
+
+	event = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
+	if (IS_ERR(event))
+		return PTR_ERR(event);
+
+	if (nr_pages < 0)
+		nr_pages = CPU_BUFFER_NR_PAGES;
+
+	ret = perf_alloc_rb(event, nr_pages, 0);
+	if (ret)
+		goto fail;
+
+	ret = pevent_add(pevent, event);
+	if (ret)
+		goto fail;
+
+	atomic_inc(&event->mmap_count);
+
+	/* All workie, enable event now */
+	perf_event_enable(event);
+
+	return ret;
+fail:
+	perf_event_release_kernel(event);
+	return ret;
+}
+
+static void persistent_event_close(int cpu, struct pevent *pevent)
+{
+	struct perf_event *event = pevent_del(pevent, cpu);
+	if (event)
+		persistent_event_release(event);
+}
+
+static int __maybe_unused
+persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
+{
+	struct pevent *pevent;
+	char id_buf[32];
+	int cpu;
+	int ret = 0;
+
+	pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
+	if (!pevent)
+		return -ENOMEM;
+
+	pevent->id = attr->config;
+
+	if (!name) {
+		snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
+		name = id_buf;
+	}
+
+	pevent->name = kstrdup(name, GFP_KERNEL);
+	if (!pevent->name) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	for_each_possible_cpu(cpu) {
+		ret = persistent_event_open(cpu, pevent, attr, nr_pages);
+		if (ret)
+			goto fail;
+	}
+
+	return 0;
+fail:
+	for_each_possible_cpu(cpu)
+		persistent_event_close(cpu, pevent);
+	kfree(pevent->name);
+	kfree(pevent);
+
+	pr_err("%s: Error adding persistent event: %d\n",
+	       __func__, ret);
+
+	return ret;
+}
+
+#ifdef CONFIG_EVENT_TRACING
+
+int perf_add_persistent_tp(struct ftrace_event_call *tp)
+{
+	struct perf_event_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.sample_period	= 1;
+	attr.wakeup_events	= 1;
+	attr.sample_type	= PERF_SAMPLE_RAW;
+	attr.persistent		= 1;
+	attr.config		= tp->event.type;
+	attr.type		= PERF_TYPE_TRACEPOINT;
+	attr.size		= sizeof(attr);
+
+	return persistent_open(tp->name, &attr, -1);
+}
+
+#endif /* CONFIG_EVENT_TRACING */
+
+int perf_get_persistent_event_fd(int cpu, int id)
+{
+	struct perf_event *event;
+	int event_fd = 0;
+
+	if ((unsigned)cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	/* Must be root for persistent events */
+	if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	mutex_lock(&per_cpu(pevents_lock, cpu));
+	event = __pevent_find(cpu, id);
+	if (!event || !try_get_event(event))
+		event_fd = -ENOENT;
+	mutex_unlock(&per_cpu(pevents_lock, cpu));
+
+	if (event_fd)
+		return event_fd;
+
+	event_fd = perf_get_fd(event);
+	if (event_fd < 0)
+		put_event(event);
+
+	return event_fd;
+}
+
+void __init perf_register_persistent(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		INIT_LIST_HEAD(&per_cpu(pevents, cpu));
+		mutex_init(&per_cpu(pevents_lock, cpu));
+	}
+}
From: Borislav Petkov <bp@suse.de>

... for collecting MCEs.
Signed-off-by: Borislav Petkov <bp@suse.de>
[ rric: Fix build error for no-tracepoints configs ]
[ rric: Return proper error code. ]
[ rric: No error message if perf is disabled. ]
Signed-off-by: Robert Richter <rric@kernel.org>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 87a65c9..ffa227b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1990,6 +1990,25 @@ int __init mcheck_init(void)
 	return 0;
 }
 
+#ifdef CONFIG_EVENT_TRACING
+
+int __init mcheck_init_tp(void)
+{
+	int ret = perf_add_persistent_tp(&event_mce_record);
+
+	if (ret && ret != -ENOENT)
+		pr_err("Error adding MCE persistent event: %d\n", ret);
+
+	return ret;
+}
+
+/*
+ * We can't run earlier because persistent events uses anon_inode_getfile and
+ * its anon_inode_mnt gets initialized as a fs_initcall.
+ */
+fs_initcall_sync(mcheck_init_tp);
+
+#endif /* CONFIG_EVENT_TRACING */
+
 /*
  * mce_syscore: PM support
  */
From: Robert Richter <robert.richter@linaro.org>
We want to use the kernel's pmu design to later expose persistent events via sysfs to userland. Initially implement a persistent pmu.
A format syntax is introduced that allows setting bits anywhere in struct perf_event_attr. It is used here to set the persistent flag (attr5:23). The syntax is attr<num>, where <num> is the index of the u64 value to change within struct perf_event_attr; otherwise the syntax is the same as for config<num>.
Patches that implement this functionality for perf tools are sent in a separate patchset.
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/persistent.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c index 926654f..ede95ab 100644 --- a/kernel/events/persistent.c +++ b/kernel/events/persistent.c @@ -12,6 +12,7 @@ struct pevent { int id; };
+static struct pmu persistent_pmu; static DEFINE_PER_CPU(struct list_head, pevents); static DEFINE_PER_CPU(struct mutex, pevents_lock);
@@ -210,10 +211,43 @@ int perf_get_persistent_event_fd(int cpu, int id) return event_fd; }
+PMU_FORMAT_ATTR(persistent, "attr5:23");
+
+static struct attribute *persistent_format_attrs[] = {
+	&format_attr_persistent.attr,
+	NULL,
+};
+
+static struct attribute_group persistent_format_group = {
+	.name	= "format",
+	.attrs	= persistent_format_attrs,
+};
+
+static const struct attribute_group *persistent_attr_groups[] = {
+	&persistent_format_group,
+	NULL,
+};
+
+static int persistent_pmu_init(struct perf_event *event)
+{
+	if (persistent_pmu.type != event->attr.type)
+		return -ENOENT;
+
+	/* Not a persistent event. */
+	return -EFAULT;
+}
+
+static struct pmu persistent_pmu = {
+	.event_init	= persistent_pmu_init,
+	.attr_groups	= persistent_attr_groups,
+};
+
 void __init perf_register_persistent(void)
 {
 	int cpu;
+	perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
+
 	for_each_possible_cpu(cpu) {
 		INIT_LIST_HEAD(&per_cpu(pevents, cpu));
 		mutex_init(&per_cpu(pevents_lock, cpu));
From: Robert Richter <robert.richter@linaro.org>
Expose the system's persistent events to userland via sysfs. Perf tools are able to read existing pmu events from sysfs. The persistent pmu now serves as an event container holding all registered persistent events of the system. This patch adds dynamic registration of persistent events to sysfs, e.g.:
/sys/bus/event_source/devices/persistent/events/mce_record:persistent,config=106
/sys/bus/event_source/devices/persistent/format/persistent:attr5:23
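As an aside, a tool consuming this sysfs layout would read the event file and pick out the config term; a sketch (the helper name is made up, this is not code from the perf tools patches):

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse a sysfs event description such as "persistent,config=106" and
 * return the config value, or -1 if none is present. Illustrative only.
 */
static long long parse_event_desc(const char *desc)
{
        long long config;
        const char *p = strstr(desc, "config=");

        if (p && sscanf(p, "config=%lld", &config) == 1)
                return config;
        return -1;
}
```

For the mce_record line above this would yield 106, the unique id to put into attr.config when reopening the event.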
Perf tools need to support the attr<num> syntax that is added in a separate patch set. With it we are able to run perf tool commands to read persistent events, e.g.:
# perf record -e persistent/mce_record/ sleep 10
# perf top -e persistent/mce_record/
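For illustration, the attr<num>:<bits> format strings use the same bit-list syntax as config<num>; the sketch below (not the actual perf tools implementation) turns e.g. "attr5:23", or bit lists like "attr0:1,6-10,44", into an index and a mask:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "attr<index>:<bits>" where <bits> is a comma separated list of
 * bits and bit ranges, e.g. "attr5:23" or "attr0:1,6-10,44".
 * Illustrative sketch only; returns 0 on success, -1 on parse error.
 */
static int parse_attr_format(const char *str, int *index, uint64_t *mask)
{
        const char *p = str;
        char *end;

        if (strncmp(p, "attr", 4))
                return -1;
        p += 4;
        *index = strtol(p, &end, 10);
        if (end == p || *end != ':')
                return -1;
        *mask = 0;
        p = end + 1;
        while (*p) {
                long lo = strtol(p, &end, 10), hi;
                if (end == p)
                        return -1;
                p = end;
                hi = lo;
                if (*p == '-') {            /* bit range, e.g. 6-10 */
                        hi = strtol(p + 1, &end, 10);
                        p = end;
                }
                for (long b = lo; b <= hi; b++)
                        *mask |= (uint64_t)1 << b;
                if (*p == ',')
                        p++;
        }
        return 0;
}
```

Parsing "attr5:23" gives index 5 and mask 1<<23, i.e. the persistent flag described above.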
[ Jiri: Document attr<index> syntax in sysfs ABI ]
[ Namhyung: Fix sysfs registration with lockdep enabled ]
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 .../testing/sysfs-bus-event_source-devices-format | 43 ++++++++++++----
 kernel/events/persistent.c                        | 60 ++++++++++++++++++++++
 2 files changed, 92 insertions(+), 11 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
index 77f47ff..2dbb911 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format
@@ -1,13 +1,14 @@
-Where:		/sys/bus/event_source/devices/<dev>/format
+Where:		/sys/bus/event_source/devices/<pmu>/format/<name>
 Date:		January 2012
-Kernel Version:	3.3
+Kernel Version:	3.3
+		3.xx (added attr<index>:<bits>)
 Contact:	Jiri Olsa <jolsa@redhat.com>
-Description:
-		Attribute group to describe the magic bits that go into
-		perf_event_attr::config[012] for a particular pmu.
-		Each attribute of this group defines the 'hardware' bitmask
-		we want to export, so that userspace can deal with sane
-		name/value pairs.
+
+Description:	Define formats for bit ranges in perf_event_attr
+
+		Attribute group to describe the magic bits that go
+		into struct perf_event_attr for a particular pmu. Bit
+		range may be any bit mask of an u64 (bits 0 to 63).
 		Userspace must be prepared for the possibility that attributes
 		define overlapping bit ranges. For example:
@@ -15,6 +16,26 @@ Contact:	Jiri Olsa <jolsa@redhat.com>
 		attr2 = 'config:0-7'
 		attr3 = 'config:12-35'
-		Example: 'config1:1,6-10,44'
-		Defines contents of attribute that occupies bits 1,6-10,44 of
-		perf_event_attr::config1.
+		Syntax			Description
+
+		config[012]*:<bits>	Each attribute of this group
+					defines the 'hardware' bitmask
+					we want to export, so that
+					userspace can deal with sane
+					name/value pairs.
+
+		attr<index>:<bits>	Set any field of the event
+					attribute. The index is a
+					decimal number that specifies
+					the u64 value to be set within
+					struct perf_event_attr.
+
+		Examples:
+
+		'config1:1,6-10,44'	Defines contents of attribute
+					that occupies bits 1,6-10,44
+					of perf_event_attr::config1.
+
+		'attr5:23'		Define the persistent event
+					flag (bit 23 of the attribute
+					flags)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index ede95ab..aca1e98 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -8,6 +8,7 @@
 #define CPU_BUFFER_NR_PAGES	((512 * 1024) / PAGE_SIZE)
 struct pevent {
+	struct perf_pmu_events_attr sysfs;
 	char *name;
 	int id;
 };
@@ -119,6 +120,8 @@ static void persistent_event_close(int cpu, struct pevent *pevent)
 	persistent_event_release(event);
 }
+static int pevent_sysfs_register(struct pevent *event);
+
 static int __maybe_unused
 persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 {
@@ -144,12 +147,18 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 		goto fail;
 	}
+	pevent->sysfs.id = pevent->id;
+
 	for_each_possible_cpu(cpu) {
 		ret = persistent_event_open(cpu, pevent, attr, nr_pages);
 		if (ret)
 			goto fail;
 	}
+	ret = pevent_sysfs_register(pevent);
+	if (ret)
+		goto fail;
+
 	return 0;
 fail:
 	for_each_possible_cpu(cpu)
@@ -223,10 +232,61 @@ static struct attribute_group persistent_format_group = {
 	.attrs = persistent_format_attrs,
 };
+#define MAX_EVENTS	16
+
+static struct attribute *pevents_attr[MAX_EVENTS + 1] = { };
+
+static struct attribute_group pevents_group = {
+	.name	= "events",
+	.attrs	= pevents_attr,
+};
+
 static const struct attribute_group *persistent_attr_groups[] = {
 	&persistent_format_group,
+	NULL,	/* placeholder: &pevents_group */
 	NULL,
 };
+#define EVENTS_GROUP_PTR	(&persistent_attr_groups[1])
+
+static ssize_t pevent_sysfs_show(struct device *dev,
+				 struct device_attribute *__attr, char *page)
+{
+	struct perf_pmu_events_attr *attr =
+		container_of(__attr, struct perf_pmu_events_attr, attr);
+
+	return sprintf(page, "persistent,config=%lld",
+		       (unsigned long long)attr->id);
+}
+
+static int pevent_sysfs_register(struct pevent *pevent)
+{
+	struct perf_pmu_events_attr *sysfs = &pevent->sysfs;
+	struct attribute *attr = &sysfs->attr.attr;
+	struct device *dev = persistent_pmu.dev;
+	const struct attribute_group **group = EVENTS_GROUP_PTR;
+	int idx;
+
+	sysfs->id = pevent->id;
+	sysfs->attr = (struct device_attribute)
+		__ATTR(, 0444, pevent_sysfs_show, NULL);
+	attr->name = pevent->name;
+	sysfs_attr_init(attr);
+
+	/* add sysfs attr to events: */
+	for (idx = 0; idx < MAX_EVENTS; idx++) {
+		if (!cmpxchg(pevents_attr + idx, NULL, attr))
+			break;
+	}
+
+	if (idx >= MAX_EVENTS)
+		return -ENOSPC;
+	if (!idx)
+		*group = &pevents_group;
+	if (!dev)
+		return 0;	/* sysfs not yet initialized */
+	if (idx)
+		return sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
+	return sysfs_create_group(&persistent_pmu.dev->kobj, *group);
+}
+
 static int persistent_pmu_init(struct perf_event *event)
 {
From: Robert Richter <robert.richter@linaro.org>
Tracepoints have a unique attr.config value, but this is not sufficient to support all event types. For this we need to generate unique event ids.
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/persistent.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index aca1e98..f23270b 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -1,6 +1,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ftrace_event.h>
+#include <linux/idr.h>
#include "internal.h"
@@ -13,10 +14,37 @@ struct pevent {
 	int id;
 };
+static struct idr event_idr;
+static struct mutex event_lock;
 static struct pmu persistent_pmu;
 static DEFINE_PER_CPU(struct list_head, pevents);
 static DEFINE_PER_CPU(struct mutex, pevents_lock);
+static inline struct pevent *find_event(int id)
+{
+	struct pevent *pevent;
+
+	rcu_read_lock();
+	pevent = idr_find(&event_idr, id);
+	rcu_read_unlock();
+
+	return pevent;
+}
+
+static inline int get_event_id(struct pevent *pevent)
+{
+	int event_id;
+
+	mutex_lock(&event_lock);
+	event_id = idr_alloc(&event_idr, pevent, 1, INT_MAX, GFP_KERNEL);
+	mutex_unlock(&event_lock);
+
+	return event_id;
+}
+
+static inline void put_event_id(int id)
+{
+	mutex_lock(&event_lock);
+	idr_remove(&event_idr, id);
+	mutex_unlock(&event_lock);
+}
+
 /* Must be protected with pevents_lock. */
 static struct perf_event *__pevent_find(int cpu, int id)
 {
@@ -128,13 +156,16 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 	struct pevent *pevent;
 	char id_buf[32];
 	int cpu;
-	int ret = 0;
+	int ret;
 	pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
 	if (!pevent)
 		return -ENOMEM;
-	pevent->id = attr->config;
+	ret = get_event_id(pevent);
+	if (ret < 0)
+		goto fail;
+	pevent->id = ret;
 	if (!name) {
 		snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
@@ -163,6 +194,9 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 fail:
 	for_each_possible_cpu(cpu)
 		persistent_event_close(cpu, pevent);
+
+	if (pevent->id)
+		put_event_id(pevent->id);
 	kfree(pevent->name);
 	kfree(pevent);
@@ -306,6 +340,8 @@ void __init perf_register_persistent(void)
 {
 	int cpu;
+	idr_init(&event_idr);
+	mutex_init(&event_lock);
 	perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
for_each_possible_cpu(cpu) {
From: Robert Richter <robert.richter@linaro.org>
Add a reference count to persistent events. We need this later for proper event removal.
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/persistent.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index f23270b..70446ae 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -9,6 +9,7 @@
 #define CPU_BUFFER_NR_PAGES	((512 * 1024) / PAGE_SIZE)
 struct pevent {
+	atomic_t refcount;
 	struct perf_pmu_events_attr sysfs;
 	char *name;
 	int id;
@@ -130,6 +131,7 @@ static int persistent_event_open(int cpu, struct pevent *pevent,
 	if (ret)
 		goto fail;
+	atomic_inc(&pevent->refcount);
 	atomic_inc(&event->mmap_count);
 	/* All workie, enable event now */
@@ -144,8 +146,11 @@ static int persistent_event_open(int cpu, struct pevent *pevent,
 static void persistent_event_close(int cpu, struct pevent *pevent)
 {
 	struct perf_event *event = pevent_del(pevent, cpu);
-	if (event)
+
+	if (event) {
+		/* Safe, the caller holds &pevent->refcount too. */
+		atomic_dec(&pevent->refcount);
 		persistent_event_release(event);
+	}
 }
 static int pevent_sysfs_register(struct pevent *event);
@@ -162,6 +167,8 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 	if (!pevent)
 		return -ENOMEM;
+	atomic_set(&pevent->refcount, 1);
+
 	ret = get_event_id(pevent);
 	if (ret < 0)
 		goto fail;
@@ -187,21 +194,21 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 	}
 	ret = pevent_sysfs_register(pevent);
-	if (ret)
-		goto fail;
-
-	return 0;
+	if (!ret)
+		goto out;
 fail:
 	for_each_possible_cpu(cpu)
 		persistent_event_close(cpu, pevent);
-	if (pevent->id)
-		put_event_id(pevent->id);
-	kfree(pevent->name);
-	kfree(pevent);
-
 	pr_err("%s: Error adding persistent event: %d\n", __func__, ret);
+out:
+	if (atomic_dec_and_test(&pevent->refcount)) {
+		if (pevent->id)
+			put_event_id(pevent->id);
+		kfree(pevent->name);
+		kfree(pevent);
+	}
 	return ret;
 }
From: Robert Richter <robert.richter@linaro.org>
The total number of persistent events that could be registered in sysfs was limited because the attribute list was not allocated dynamically. This patch reallocates the list whenever an event is added to or removed from it.
While at it, also implement pevent_sysfs_unregister(), which we need later for proper event removal.
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 kernel/events/persistent.c | 115 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 99 insertions(+), 16 deletions(-)
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index 70446ae..a0ef6d4 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -154,6 +154,7 @@ static void persistent_event_close(int cpu, struct pevent *pevent)
 }
 static int pevent_sysfs_register(struct pevent *event);
+static void pevent_sysfs_unregister(struct pevent *event);
 static int __maybe_unused
 persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
@@ -204,6 +205,7 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 		__func__, ret);
 out:
 	if (atomic_dec_and_test(&pevent->refcount)) {
+		pevent_sysfs_unregister(pevent);
 		if (pevent->id)
 			put_event_id(pevent->id);
 		kfree(pevent->name);
@@ -273,13 +275,12 @@ static struct attribute_group persistent_format_group = {
 	.attrs = persistent_format_attrs,
 };
-#define MAX_EVENTS	16
-
-static struct attribute *pevents_attr[MAX_EVENTS + 1] = { };
+static struct mutex sysfs_lock;
+static int sysfs_nr_entries;
 static struct attribute_group pevents_group = {
 	.name	= "events",
-	.attrs	= pevents_attr,
+	.attrs	= NULL,	/* dynamically allocated */
 };
@@ -288,6 +289,7 @@ static const struct attribute_group *persistent_attr_groups[] = {
 	NULL,
 };
 #define EVENTS_GROUP_PTR	(&persistent_attr_groups[1])
+#define EVENTS_ATTRS_PTR	(&pevents_group.attrs)
 static ssize_t pevent_sysfs_show(struct device *dev,
 				 struct device_attribute *__attr, char *page)
@@ -304,7 +306,9 @@ static int pevent_sysfs_register(struct pevent *pevent)
 	struct attribute *attr = &sysfs->attr.attr;
 	struct device *dev = persistent_pmu.dev;
 	const struct attribute_group **group = EVENTS_GROUP_PTR;
-	int idx;
+	struct attribute ***attrs_ptr = EVENTS_ATTRS_PTR;
+	struct attribute **attrs;
+	int ret = 0;
 	sysfs->id = pevent->id;
 	sysfs->attr = (struct device_attribute)
@@ -312,21 +316,99 @@ static int pevent_sysfs_register(struct pevent *pevent)
 	attr->name = pevent->name;
 	sysfs_attr_init(attr);
-	/* add sysfs attr to events: */
-	for (idx = 0; idx < MAX_EVENTS; idx++) {
-		if (!cmpxchg(pevents_attr + idx, NULL, attr))
-			break;
+	mutex_lock(&sysfs_lock);
+
+	/*
+	 * Keep old list if no new one is available. Need this for
+	 * device_remove_attrs() if unregistering pmu.
+	 */
+	attrs = __krealloc(*attrs_ptr, (sysfs_nr_entries + 2) * sizeof(*attrs),
+			   GFP_KERNEL);
+
+	if (!attrs) {
+		ret = -ENOMEM;
+		goto unlock;
 	}
-	if (idx >= MAX_EVENTS)
-		return -ENOSPC;
-	if (!idx)
+	attrs[sysfs_nr_entries++] = attr;
+	attrs[sysfs_nr_entries] = NULL;
+
+	if (!*group)
 		*group = &pevents_group;
+
+	if (!dev)
+		goto out;	/* sysfs not yet initialized */
+
+	if (sysfs_nr_entries == 1)
+		ret = sysfs_create_group(&dev->kobj, *group);
+	else
+		ret = sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
+
+	if (ret) {
+		/* roll back */
+		sysfs_nr_entries--;
+		if (!sysfs_nr_entries)
+			*group = NULL;
+		if (*attrs_ptr != attrs)
+			kfree(attrs);
+		else
+			attrs[sysfs_nr_entries] = NULL;
+		goto unlock;
+	}
+out:
+	if (*attrs_ptr != attrs) {
+		kfree(*attrs_ptr);
+		*attrs_ptr = attrs;
+	}
+unlock:
+	mutex_unlock(&sysfs_lock);
+
+	return ret;
+}
+
+static void pevent_sysfs_unregister(struct pevent *pevent)
+{
+	struct attribute *attr = &pevent->sysfs.attr.attr;
+	struct device *dev = persistent_pmu.dev;
+	const struct attribute_group **group = EVENTS_GROUP_PTR;
+	struct attribute ***attrs_ptr = EVENTS_ATTRS_PTR;
+	struct attribute **attrs, **dest;
+
+	mutex_lock(&sysfs_lock);
+
+	/* Nothing was ever registered, e.g. registration ran out of memory. */
+	if (!*attrs_ptr)
+		goto unlock;
+
+	for (dest = *attrs_ptr; *dest; dest++) {
+		if (*dest == attr)
+			break;
+	}
+
+	if (!*dest)
+		goto unlock;
+
+	sysfs_nr_entries--;
+
+	*dest = (*attrs_ptr)[sysfs_nr_entries];
+	(*attrs_ptr)[sysfs_nr_entries] = NULL;
+
 	if (!dev)
-		return 0;	/* sysfs not yet initialized */
-	if (idx)
-		return sysfs_add_file_to_group(&dev->kobj, attr, (*group)->name);
-	return sysfs_create_group(&persistent_pmu.dev->kobj, *group);
+		goto out;	/* sysfs not yet initialized */
+
+	if (!sysfs_nr_entries)
+		sysfs_remove_group(&dev->kobj, *group);
+	else
+		sysfs_remove_file_from_group(&dev->kobj, attr, (*group)->name);
+out:
+	if (!sysfs_nr_entries)
+		*group = NULL;
+
+	attrs = __krealloc(*attrs_ptr, (sysfs_nr_entries + 1) * sizeof(*attrs),
+			   GFP_KERNEL);
+
+	/* Shrink succeeded with a new buffer; keep the old list otherwise. */
+	if (attrs && *attrs_ptr != attrs) {
+		kfree(*attrs_ptr);
+		*attrs_ptr = attrs;
+	}
+unlock:
+	mutex_unlock(&sysfs_lock);
 }
 static int persistent_pmu_init(struct perf_event *event)
@@ -349,6 +431,7 @@ void __init perf_register_persistent(void)
 	idr_init(&event_idr);
 	mutex_init(&event_lock);
+	mutex_init(&sysfs_lock);
 	perf_pmu_register(&persistent_pmu, "persistent", PERF_TYPE_PERSISTENT);
for_each_possible_cpu(cpu) {
From: Robert Richter <robert.richter@linaro.org>
Implement ioctl functions to control persistent events. There are functions to detach an event from or attach it to a process. The PERF_EVENT_IOC_DETACH ioctl makes an event persistent. After the event's fd is closed, it runs in the background of the system without the need of a controlling process. The perf_event_open() syscall can be used by any process to reopen the event. The PERF_EVENT_IOC_ATTACH ioctl attaches the event again, so that it is removed when the event's fd is closed.
This is for Linux man-pages:
type ...
PERF_TYPE_PERSISTENT (Since Linux 3.xx)
This indicates a persistent event. Each persistent event has a unique identifier that must be specified in the config field of the event attributes. Persistent events are listed under:
/sys/bus/event_source/devices/persistent/
...
persistent     :  1, /* always-on event */
...
persistent: (Since Linux 3.xx)
Put the event into persistent state after opening it. After the event's fd is closed, the event remains persistent in the system and continues to run.
perf_event ioctl calls
PERF_EVENT_IOC_DETACH (Since Linux 3.xx)
Detach the event specified by the file descriptor from the process and make it persistent in the system. After the fd is closed the event will continue to run. A unique identifier for the persistent event is returned, or an error otherwise. The following allows connecting to the event again:
pe.type = PERF_TYPE_PERSISTENT;
pe.config = <pevent_id>;
...
fd = perf_event_open(...);
The event must be reopened on the same cpu.
PERF_EVENT_IOC_ATTACH (Since Linux 3.xx)
Attach the event specified by the file descriptor to the current process. The event is no longer persistent in the system and will be removed after all users have disconnected from it. Thus, if there are no other users, the event is also closed when its file descriptor is closed; the event then no longer exists.
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Robert Richter <robert.richter@linaro.org>
Signed-off-by: Robert Richter <rric@kernel.org>
---
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c            |   6 ++
 kernel/events/internal.h        |   2 +
 kernel/events/persistent.c      | 178 +++++++++++++++++++++++++++++++++------
 4 files changed, 160 insertions(+), 28 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2b84b97..82a8244 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -324,6 +324,8 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
 #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
 #define PERF_EVENT_IOC_ID		_IOR('$', 7, u64 *)
+#define PERF_EVENT_IOC_DETACH		_IO ('$', 8)
+#define PERF_EVENT_IOC_ATTACH		_IO ('$', 9)
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d9d6e67..8d5c6e3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3622,6 +3622,12 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case PERF_EVENT_IOC_SET_FILTER:
 		return perf_event_set_filter(event, (void __user *)arg);
+	case PERF_EVENT_IOC_DETACH:
+		return perf_event_detach(event);
+
+	case PERF_EVENT_IOC_ATTACH:
+		return perf_event_attach(event);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 94c3f73..f9bc15f 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -195,5 +195,7 @@ extern void perf_free_rb(struct perf_event *event);
 extern int perf_get_fd(struct perf_event *event);
 extern int perf_get_persistent_event_fd(int cpu, int id);
 extern void __init perf_register_persistent(void);
+extern int perf_event_detach(struct perf_event *event);
+extern int perf_event_attach(struct perf_event *event);
 #endif /* _KERNEL_EVENTS_INTERNAL_H */
diff --git a/kernel/events/persistent.c b/kernel/events/persistent.c
index a0ef6d4..e156afe 100644
--- a/kernel/events/persistent.c
+++ b/kernel/events/persistent.c
@@ -59,6 +59,49 @@ static struct perf_event *__pevent_find(int cpu, int id)
 	return NULL;
 }
+static void pevent_free(struct pevent *pevent)
+{
+	if (pevent->id)
+		put_event_id(pevent->id);
+
+	kfree(pevent->name);
+	kfree(pevent);
+}
+
+static struct pevent *pevent_alloc(char *name)
+{
+	struct pevent *pevent;
+	char id_buf[32];
+	int ret;
+
+	pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
+	if (!pevent)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&pevent->refcount, 1);
+
+	ret = get_event_id(pevent);
+	if (ret < 0)
+		goto fail;
+	pevent->id = ret;
+
+	if (!name) {
+		snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
+		name = id_buf;
+	}
+
+	pevent->name = kstrdup(name, GFP_KERNEL);
+	if (!pevent->name) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	return pevent;
+fail:
+	pevent_free(pevent);
+	return ERR_PTR(ret);
+}
+
 static int pevent_add(struct pevent *pevent, struct perf_event *event)
 {
 	int ret = -EEXIST;
@@ -74,6 +117,7 @@ static int pevent_add(struct pevent *pevent, struct perf_event *event)
 	ret = 0;
 	event->pevent_id = pevent->id;
+	event->attr.persistent = 1;
 	list_add_tail(&event->pevent_entry, &per_cpu(pevents, cpu));
 unlock:
 	mutex_unlock(&per_cpu(pevents_lock, cpu));
@@ -91,6 +135,7 @@ static struct perf_event *pevent_del(struct pevent *pevent, int cpu)
 	if (event) {
 		list_del(&event->pevent_entry);
 		event->pevent_id = 0;
+		event->attr.persistent = 0;
 	}
 	mutex_unlock(&per_cpu(pevents_lock, cpu));
@@ -160,33 +205,12 @@ static int __maybe_unused
 persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 {
 	struct pevent *pevent;
-	char id_buf[32];
 	int cpu;
 	int ret;
-	pevent = kzalloc(sizeof(*pevent), GFP_KERNEL);
-	if (!pevent)
-		return -ENOMEM;
-
-	atomic_set(&pevent->refcount, 1);
-
-	ret = get_event_id(pevent);
-	if (ret < 0)
-		goto fail;
-	pevent->id = ret;
-
-	if (!name) {
-		snprintf(id_buf, sizeof(id_buf), "%d", pevent->id);
-		name = id_buf;
-	}
-
-	pevent->name = kstrdup(name, GFP_KERNEL);
-	if (!pevent->name) {
-		ret = -ENOMEM;
-		goto fail;
-	}
-
-	pevent->sysfs.id = pevent->id;
+	pevent = pevent_alloc(name);
+	if (IS_ERR(pevent))
+		return PTR_ERR(pevent);
 	for_each_possible_cpu(cpu) {
 		ret = persistent_event_open(cpu, pevent, attr, nr_pages);
@@ -206,10 +230,7 @@ persistent_open(char *name, struct perf_event_attr *attr, int nr_pages)
 out:
 	if (atomic_dec_and_test(&pevent->refcount)) {
 		pevent_sysfs_unregister(pevent);
-		if (pevent->id)
-			put_event_id(pevent->id);
-		kfree(pevent->name);
-		kfree(pevent);
+		pevent_free(pevent);
 	}
 	return ret;
@@ -439,3 +460,104 @@ void __init perf_register_persistent(void)
 		mutex_init(&per_cpu(pevents_lock, cpu));
 	}
 }
+
+/*
+ * Detach an event from a process. The event will remain in the system
+ * after closing the event's fd, it becomes persistent.
+ */
+int perf_event_detach(struct perf_event *event)
+{
+	struct pevent *pevent;
+	int cpu;
+	int ret;
+
+	if (!try_get_event(event))
+		return -ENOENT;
+
+	/* task events not yet supported: */
+	cpu = event->cpu;
+	if ((unsigned)cpu >= nr_cpu_ids) {
+		ret = -EINVAL;
+		goto fail_rb;
+	}
+
+	/*
+	 * Avoid grabbing an id, later checked again in pevent_add()
+	 * with mmap_mutex held.
+	 */
+	if (event->pevent_id) {
+		ret = -EEXIST;
+		goto fail_rb;
+	}
+
+	mutex_lock(&event->mmap_mutex);
+	if (event->rb)
+		ret = -EBUSY;
+	else
+		ret = perf_alloc_rb(event, CPU_BUFFER_NR_PAGES, 0);
+	mutex_unlock(&event->mmap_mutex);
+
+	if (ret)
+		goto fail_rb;
+
+	pevent = pevent_alloc(NULL);
+	if (IS_ERR(pevent)) {
+		ret = PTR_ERR(pevent);
+		goto fail_pevent;
+	}
+
+	ret = pevent_add(pevent, event);
+	if (ret)
+		goto fail_add;
+
+	ret = pevent_sysfs_register(pevent);
+	if (ret)
+		goto fail_sysfs;
+
+	atomic_inc(&event->mmap_count);
+
+	return pevent->id;
+fail_sysfs:
+	pevent_del(pevent, cpu);
+fail_add:
+	pevent_free(pevent);
+fail_pevent:
+	mutex_lock(&event->mmap_mutex);
+	if (event->rb)
+		perf_free_rb(event);
+	mutex_unlock(&event->mmap_mutex);
+fail_rb:
+	put_event(event);
+	return ret;
+}
+
+/*
+ * Attach an event to a process. The event will be removed after all
+ * users disconnected from it, it's no longer persistent in the
+ * system.
+ */
+int perf_event_attach(struct perf_event *event)
+{
+	int cpu = event->cpu;
+	struct pevent *pevent;
+
+	if ((unsigned)cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	pevent = find_event(event->pevent_id);
+	if (!pevent)
+		return -EINVAL;
+
+	event = pevent_del(pevent, cpu);
+	if (!event)
+		return -EINVAL;
+
+	if (atomic_dec_and_test(&pevent->refcount)) {
+		pevent_sysfs_unregister(pevent);
+		pevent_free(pevent);
+	}
+
+	persistent_event_release(event);
+
+	return 0;
+}