November 2022 - Linux-stable-mirror

[for-linus][PATCH 08/13] ftrace: Fix null pointer dereference in ftrace_add_mod()

by Steven Rostedt

From: Xiu Jianfeng <xiujianfeng(a)huawei.com> The @ftrace_mod is allocated by kzalloc(), so both the members {prev,next} of @ftrace_mode->list are NULL, it's not a valid state to call list_del(). If kstrdup() for @ftrace_mod->{func|module} fails, it goes to @out_free tag and calls free_ftrace_mod() to destroy @ftrace_mod, then list_del() will write prev->next and next->prev, where null pointer dereference happens. BUG: kernel NULL pointer dereference, address: 0000000000000008 Oops: 0002 [#1] PREEMPT SMP NOPTI Call Trace: <TASK> ftrace_mod_callback+0x20d/0x220 ? do_filp_open+0xd9/0x140 ftrace_process_regex.isra.51+0xbf/0x130 ftrace_regex_write.isra.52.part.53+0x6e/0x90 vfs_write+0xee/0x3a0 ? __audit_filter_op+0xb1/0x100 ? auditd_test_task+0x38/0x50 ksys_write+0xa5/0xe0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Kernel panic - not syncing: Fatal exception So call INIT_LIST_HEAD() to initialize the list member to fix this issue. Link: https://lkml.kernel.org/r/20221116015207.30858-1-xiujianfeng@huawei.com Cc: stable(a)vger.kernel.org Fixes: 673feb9d76ab ("ftrace: Add :mod: caching infrastructure to trace_array") Signed-off-by: Xiu Jianfeng <xiujianfeng(a)huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> --- kernel/trace/ftrace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 56a168121bfc..33236241f236 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1289,6 +1289,7 @@ static int ftrace_add_mod(struct trace_array *tr, if (!ftrace_mod) return -ENOMEM; + INIT_LIST_HEAD(&ftrace_mod->list); ftrace_mod->func = kstrdup(func, GFP_KERNEL); ftrace_mod->module = kstrdup(module, GFP_KERNEL); ftrace_mod->enable = enable; -- 2.35.1

3 years

1
0
0 0

[for-linus][PATCH 07/13] ring_buffer: Do not deactivate non-existant pages

by Steven Rostedt

From: Daniil Tatianin <d-tatianin(a)yandex-team.ru> rb_head_page_deactivate() expects cpu_buffer to contain a valid list of ->pages, so verify that the list is actually present before calling it. Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool. Link: https://lkml.kernel.org/r/20221114143129.3534443-1-d-tatianin@yandex-team.ru Cc: stable(a)vger.kernel.org Fixes: 77ae365eca895 ("ring-buffer: make lockless") Signed-off-by: Daniil Tatianin <d-tatianin(a)yandex-team.ru> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> --- kernel/trace/ring_buffer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index a19369c4d8df..b21bf14bae9b 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1802,9 +1802,9 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer) free_buffer_page(cpu_buffer->reader_page); - rb_head_page_deactivate(cpu_buffer); - if (head) { + rb_head_page_deactivate(cpu_buffer); + list_for_each_entry_safe(bpage, tmp, head, list) { list_del_init(&bpage->list); free_buffer_page(bpage); -- 2.35.1

3 years

1
0
0 0

[for-linus][PATCH 06/13] ftrace: Optimize the allocation for mcount entries

by Steven Rostedt

From: Wang Wensheng <wangwensheng4(a)huawei.com> If we can't allocate this size, try something smaller with half of the size. Its order should be decreased by one instead of divided by two. Link: https://lkml.kernel.org/r/20221109094434.84046-3-wangwensheng4@huawei.com Cc: <mhiramat(a)kernel.org> Cc: <mark.rutland(a)arm.com> Cc: stable(a)vger.kernel.org Fixes: a79008755497d ("ftrace: Allocate the mcount record pages as groups") Signed-off-by: Wang Wensheng <wangwensheng4(a)huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> --- kernel/trace/ftrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8b13ce2eae70..56a168121bfc 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -3190,7 +3190,7 @@ static int ftrace_allocate_records(struct ftrace_page *pg, int count) /* if we can't allocate this size, try something smaller */ if (!order) return -ENOMEM; - order >>= 1; + order--; goto again; } -- 2.35.1

3 years

1
0
0 0

[for-linus][PATCH 05/13] ftrace: Fix the possible incorrect kernel message

by Steven Rostedt

From: Wang Wensheng <wangwensheng4(a)huawei.com> If the number of mcount entries is an integer multiple of ENTRIES_PER_PAGE, the page count showing on the console would be wrong. Link: https://lkml.kernel.org/r/20221109094434.84046-2-wangwensheng4@huawei.com Cc: <mhiramat(a)kernel.org> Cc: <mark.rutland(a)arm.com> Cc: stable(a)vger.kernel.org Fixes: 5821e1b74f0d0 ("function tracing: fix wrong pos computing when read buffer has been fulfilled") Signed-off-by: Wang Wensheng <wangwensheng4(a)huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> --- kernel/trace/ftrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 7dc023641bf1..8b13ce2eae70 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7391,7 +7391,7 @@ void __init ftrace_init(void) } pr_info("ftrace: allocating %ld entries in %ld pages\n", - count, count / ENTRIES_PER_PAGE + 1); + count, DIV_ROUND_UP(count, ENTRIES_PER_PAGE)); ret = ftrace_process_locs(NULL, __start_mcount_loc, -- 2.35.1

3 years

1
0
0 0

[for-linus][PATCH 03/13] tracing: Fix memory leak in tracing_read_pipe()

by Steven Rostedt

From: Wang Yufen <wangyufen(a)huawei.com> kmemleak reports this issue: unreferenced object 0xffff888105a18900 (size 128): comm "test_progs", pid 18933, jiffies 4336275356 (age 22801.766s) hex dump (first 32 bytes): 25 73 00 90 81 88 ff ff 26 05 00 00 42 01 58 04 %s......&...B.X. 03 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000560143a1>] __kmalloc_node_track_caller+0x4a/0x140 [<000000006af00822>] krealloc+0x8d/0xf0 [<00000000c309be6a>] trace_iter_expand_format+0x99/0x150 [<000000005a53bdb6>] trace_check_vprintf+0x1e0/0x11d0 [<0000000065629d9d>] trace_event_printf+0xb6/0xf0 [<000000009a690dc7>] trace_raw_output_bpf_trace_printk+0x89/0xc0 [<00000000d22db172>] print_trace_line+0x73c/0x1480 [<00000000cdba76ba>] tracing_read_pipe+0x45c/0x9f0 [<0000000015b58459>] vfs_read+0x17b/0x7c0 [<000000004aeee8ed>] ksys_read+0xed/0x1c0 [<0000000063d3d898>] do_syscall_64+0x3b/0x90 [<00000000a06dda7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd iter->fmt alloced in tracing_read_pipe() -> .. ->trace_iter_expand_format(), but not freed, to fix, add free in tracing_release_pipe() Link: https://lkml.kernel.org/r/1667819090-4643-1-git-send-email-wangyufen@huawei… Cc: stable(a)vger.kernel.org Fixes: efbbdaa22bb7 ("tracing: Show real address for trace event arguments") Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org> Signed-off-by: Wang Yufen <wangyufen(a)huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> --- kernel/trace/trace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c6c7a0af3ed2..5bd202d6d79a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6657,6 +6657,7 @@ static int tracing_release_pipe(struct inode *inode, struct file *file) mutex_unlock(&trace_types_lock); free_cpumask_var(iter->started); + kfree(iter->fmt); mutex_destroy(&iter->mutex); kfree(iter); -- 2.35.1

3 years

1
0
0 0

[for-linus][PATCH 01/13] tracing/ring-buffer: Have polling block on watermark

by Steven Rostedt

From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org> Currently the way polling works on the ring buffer is broken. It will return immediately if there's any data in the ring buffer whereas a read will block until the watermark (defined by the tracefs buffer_percent file) is hit. That is, a select() or poll() will return as if there's data available, but then the following read will block. This is broken for the way select()s and poll()s are supposed to work. Have the polling on the ring buffer also block the same way reads and splice does on the ring buffer. Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home Cc: Linux Trace Kernel <linux-trace-kernel(a)vger.kernel.org> Cc: Masami Hiramatsu <mhiramat(a)kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com> Cc: Primiano Tucci <primiano(a)google.com> Cc: stable(a)vger.kernel.org Fixes: 1e0d6714aceb7 ("ring-buffer: Do not wake up a splice waiter when page is not full") Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org> --- include/linux/ring_buffer.h | 2 +- kernel/trace/ring_buffer.c | 55 ++++++++++++++++++++++++------------- kernel/trace/trace.c | 2 +- 3 files changed, 38 insertions(+), 21 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 2504df9a0453..3c7d295746f6 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -100,7 +100,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full); __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, - struct file *filp, poll_table *poll_table); + struct file *filp, poll_table *poll_table, int full); void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu); #define RING_BUFFER_ALL_CPUS -1 diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 9712083832f4..089b1ec9cb3b 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -907,6 +907,21 @@ size_t ring_buffer_nr_dirty_pages(struct trace_buffer *buffer, int cpu) return cnt - read; } +static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int full) +{ + struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu]; + size_t nr_pages; + size_t dirty; + + nr_pages = cpu_buffer->nr_pages; + if (!nr_pages || !full) + return true; + + dirty = ring_buffer_nr_dirty_pages(buffer, cpu); + + return (dirty * 100) > (full * nr_pages); +} + /* * rb_wake_up_waiters - wake up tasks waiting for ring buffer input * @@ -1046,22 +1061,20 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) !ring_buffer_empty_cpu(buffer, cpu)) { unsigned long flags; bool pagebusy; - size_t nr_pages; - size_t dirty; + bool done; if (!full) break; raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; - nr_pages = cpu_buffer->nr_pages; - dirty = ring_buffer_nr_dirty_pages(buffer, cpu); + done = !pagebusy && full_hit(buffer, cpu, full); + if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - if (!pagebusy && - (!nr_pages || (dirty * 100) > full * nr_pages)) + if (done) break; } @@ -1087,6 +1100,7 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) * @cpu: the cpu buffer to wait on * @filp: the file descriptor * @poll_table: The poll descriptor + * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS * * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon * as data is added to any of the @buffer's cpu buffers. Otherwise @@ -1096,14 +1110,15 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) * zero otherwise. */ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, - struct file *filp, poll_table *poll_table) + struct file *filp, poll_table *poll_table, int full) { struct ring_buffer_per_cpu *cpu_buffer; struct rb_irq_work *work; - if (cpu == RING_BUFFER_ALL_CPUS) + if (cpu == RING_BUFFER_ALL_CPUS) { work = &buffer->irq_work; - else { + full = 0; + } else { if (!cpumask_test_cpu(cpu, buffer->cpumask)) return -EINVAL; @@ -1111,8 +1126,14 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, work = &cpu_buffer->irq_work; } - poll_wait(filp, &work->waiters, poll_table); - work->waiters_pending = true; + if (full) { + poll_wait(filp, &work->full_waiters, poll_table); + work->full_waiters_pending = true; + } else { + poll_wait(filp, &work->waiters, poll_table); + work->waiters_pending = true; + } + /* * There's a tight race between setting the waiters_pending and * checking if the ring buffer is empty. Once the waiters_pending bit @@ -1128,6 +1149,9 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, */ smp_mb(); + if (full) + return full_hit(buffer, cpu, full) ? EPOLLIN | EPOLLRDNORM : 0; + if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) || (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu))) return EPOLLIN | EPOLLRDNORM; @@ -3155,10 +3179,6 @@ static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer, static __always_inline void rb_wakeups(struct trace_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) { - size_t nr_pages; - size_t dirty; - size_t full; - if (buffer->irq_work.waiters_pending) { buffer->irq_work.waiters_pending = false; /* irq_work_queue() supplies it's own memory barriers */ @@ -3182,10 +3202,7 @@ rb_wakeups(struct trace_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->last_pages_touch = local_read(&cpu_buffer->pages_touched); - full = cpu_buffer->shortest_full; - nr_pages = cpu_buffer->nr_pages; - dirty = ring_buffer_nr_dirty_pages(buffer, cpu_buffer->cpu); - if (full && nr_pages && (dirty * 100) <= full * nr_pages) + if (!full_hit(buffer, cpu_buffer->cpu, cpu_buffer->shortest_full)) return; cpu_buffer->irq_work.wakeup_full = true; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 47a44b055a1d..c6c7a0af3ed2 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6681,7 +6681,7 @@ trace_poll(struct trace_iterator *iter, struct file *filp, poll_table *poll_tabl return EPOLLIN | EPOLLRDNORM; else return ring_buffer_poll_wait(iter->array_buffer->buffer, iter->cpu_file, - filp, poll_table); + filp, poll_table, iter->tr->buffer_percent); } static __poll_t -- 2.35.1

3 years

1
0
0 0

[PATCH 3/4] io_uring: pass in EPOLL_URING as part of eventfd signaling and wakeups

by Jens Axboe

Pass in EPOLL_URING when signaling eventfd or doing poll related wakups, so that we can check for a circular event dependency between eventfd and epoll. If this flag is set when our wakeup handlers are called, then we know we have a dependency that needs to terminate multishot requests. eventfd and epoll are the only such possible dependencies. Cc: stable(a)vger.kernel.org # 6.0 Signed-off-by: Jens Axboe <axboe(a)kernel.dk> --- io_uring/io_uring.c | 4 ++-- io_uring/io_uring.h | 15 +++++++++++---- io_uring/poll.c | 8 ++++++++ 3 files changed, 21 insertions(+), 6 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 8840cf3e20f2..53d0043b77a5 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -495,7 +495,7 @@ static void io_eventfd_ops(struct rcu_head *rcu) int ops = atomic_xchg(&ev_fd->ops, 0); if (ops & BIT(IO_EVENTFD_OP_SIGNAL_BIT)) - eventfd_signal(ev_fd->cq_ev_fd, 1); + eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING); /* IO_EVENTFD_OP_FREE_BIT may not be set here depending on callback * ordering in a race but if references are 0 we know we have to free @@ -531,7 +531,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx) goto out; if (likely(eventfd_signal_allowed())) { - eventfd_signal(ev_fd->cq_ev_fd, 1); + eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING); } else { atomic_inc(&ev_fd->refs); if (!atomic_fetch_or(BIT(IO_EVENTFD_OP_SIGNAL_BIT), &ev_fd->ops)) diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index cef5ff924e63..f6cf74cd692b 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -4,6 +4,7 @@ #include <linux/errno.h> #include <linux/lockdep.h> #include <linux/io_uring_types.h> +#include <uapi/linux/eventpoll.h> #include "io-wq.h" #include "slist.h" #include "filetable.h" @@ -207,12 +208,18 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx) static inline void __io_cqring_wake(struct io_ring_ctx *ctx) { /* - * wake_up_all() may seem excessive, but io_wake_function() and - * io_should_wake() handle the termination of the loop and only - * wake as many waiters as we need to. + * Trigger waitqueue handler on all waiters on our waitqueue. This + * won't necessarily wake up all the tasks, io_should_wake() will make + * that decision. + * + * Pass in EPOLLIN|EPOLL_URING as the poll wakeup key. The latter set + * in the mask so that if we recurse back into our own poll waitqueue + * handlers, we know we have a dependency between eventfd or epoll and + * should terminate multishot poll at that point. */ if (waitqueue_active(&ctx->cq_wait)) - wake_up_all(&ctx->cq_wait); + __wake_up(&ctx->cq_wait, TASK_NORMAL, 0, + poll_to_key(EPOLL_URING | EPOLLIN)); } static inline void io_cqring_wake(struct io_ring_ctx *ctx) diff --git a/io_uring/poll.c b/io_uring/poll.c index 055632e9092a..b5d9426c60f6 100644 --- a/io_uring/poll.c +++ b/io_uring/poll.c @@ -394,6 +394,14 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, return 0; if (io_poll_get_ownership(req)) { + /* + * If we trigger a multishot poll off our own wakeup path, + * disable multishot as there is a circular dependency between + * CQ posting and triggering the event. + */ + if (mask & EPOLL_URING) + poll->events |= EPOLLONESHOT; + /* optional, saves extra locking for removal in tw handler */ if (mask && poll->events & EPOLLONESHOT) { list_del_init(&poll->wait.entry); -- 2.35.1

3 years

1
0
0 0

[PATCH 2/4] eventfd: provide a eventfd_signal_mask() helper

by Jens Axboe

This is identical to eventfd_signal(), but it allows the caller to pass in a mask to be used for the poll wakeup key. The use case is avoiding repeated multishot triggers if we have a dependency between eventfd and io_uring. If we setup an eventfd context and register that as the io_uring eventfd, and at the same time queue a multishot poll request for the eventfd context, then any CQE posted will repeatedly trigger the multishot request until it terminates when the CQ ring overflows. In preparation for io_uring detecting this circular dependency, add the mentioned helper so that io_uring can pass in EPOLL_URING as part of the poll wakeup key. Cc: stable(a)vger.kernel.org # 6.0 Signed-off-by: Jens Axboe <axboe(a)kernel.dk> --- fs/eventfd.c | 37 +++++++++++++++++++++---------------- include/linux/eventfd.h | 1 + 2 files changed, 22 insertions(+), 16 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index c0ffee99ad23..249ca6c0b784 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -43,21 +43,7 @@ struct eventfd_ctx { int id; }; -/** - * eventfd_signal - Adds @n to the eventfd counter. - * @ctx: [in] Pointer to the eventfd context. - * @n: [in] Value of the counter to be added to the eventfd internal counter. - * The value cannot be negative. - * - * This function is supposed to be called by the kernel in paths that do not - * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX - * value, and we signal this as overflow condition by returning a EPOLLERR - * to poll(2). - * - * Returns the amount by which the counter was incremented. This will be less - * than @n if the counter has overflowed. - */ -__u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) +__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask) { unsigned long flags; @@ -78,12 +64,31 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) n = ULLONG_MAX - ctx->count; ctx->count += n; if (waitqueue_active(&ctx->wqh)) - wake_up_locked_poll(&ctx->wqh, EPOLLIN); + wake_up_locked_poll(&ctx->wqh, EPOLLIN | mask); current->in_eventfd = 0; spin_unlock_irqrestore(&ctx->wqh.lock, flags); return n; } + +/** + * eventfd_signal - Adds @n to the eventfd counter. + * @ctx: [in] Pointer to the eventfd context. + * @n: [in] Value of the counter to be added to the eventfd internal counter. + * The value cannot be negative. + * + * This function is supposed to be called by the kernel in paths that do not + * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX + * value, and we signal this as overflow condition by returning a EPOLLERR + * to poll(2). + * + * Returns the amount by which the counter was incremented. This will be less + * than @n if the counter has overflowed. + */ +__u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) +{ + return eventfd_signal_mask(ctx, n, 0); +} EXPORT_SYMBOL_GPL(eventfd_signal); static void eventfd_free_ctx(struct eventfd_ctx *ctx) diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index 30eb30d6909b..e849329ce1a8 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -40,6 +40,7 @@ struct file *eventfd_fget(int fd); struct eventfd_ctx *eventfd_ctx_fdget(int fd); struct eventfd_ctx *eventfd_ctx_fileget(struct file *file); __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n); +__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask); int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait, __u64 *cnt); void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt); -- 2.35.1

3 years

1
0
0 0

[PATCH 1/4] eventpoll: add EPOLL_URING wakeup flag

by Jens Axboe

We can have dependencies between epoll and io_uring. Consider an epoll context, identified by the epfd file descriptor, and an io_uring file descriptor identified by iofd. If we add iofd to the epfd context, and arm a multishot poll request for epfd with iofd, then the multishot poll request will repeatedly trigger and generate events until terminated by CQ ring overflow. This isn't a desired behavior. Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup key, and io_uring can check for that to detect a potential recursive invocation. Cc: stable(a)vger.kernel.org # 6.0 Signed-off-by: Jens Axboe <axboe(a)kernel.dk> --- fs/eventpoll.c | 18 ++++++++++-------- include/uapi/linux/eventpoll.h | 6 ++++++ 2 files changed, 16 insertions(+), 8 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 52954d4637b5..e864256001e8 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -491,7 +491,8 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi) */ #ifdef CONFIG_DEBUG_LOCK_ALLOC -static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi, + unsigned pollflags) { struct eventpoll *ep_src; unsigned long flags; @@ -522,16 +523,17 @@ static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) } spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests); ep->nests = nests + 1; - wake_up_locked_poll(&ep->poll_wait, EPOLLIN); + wake_up_locked_poll(&ep->poll_wait, EPOLLIN | pollflags); ep->nests = 0; spin_unlock_irqrestore(&ep->poll_wait.lock, flags); } #else -static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi, + unsigned pollflags) { - wake_up_poll(&ep->poll_wait, EPOLLIN); + wake_up_poll(&ep->poll_wait, EPOLLIN | pollflags); } #endif @@ -742,7 +744,7 @@ static void ep_free(struct eventpoll *ep) /* We need to release all tasks waiting for these file */ if (waitqueue_active(&ep->poll_wait)) - ep_poll_safewake(ep, NULL); + ep_poll_safewake(ep, NULL, 0); /* * We need to lock this because we could be hit by @@ -1208,7 +1210,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(ep, epi); + ep_poll_safewake(ep, epi, pollflags & EPOLL_URING); if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; @@ -1553,7 +1555,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(ep, NULL); + ep_poll_safewake(ep, NULL, 0); return 0; } @@ -1629,7 +1631,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(ep, NULL); + ep_poll_safewake(ep, NULL, 0); return 0; } diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 8a3432d0f0dc..532cc7fa75c0 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -41,6 +41,12 @@ #define EPOLLMSG (__force __poll_t)0x00000400 #define EPOLLRDHUP (__force __poll_t)0x00002000 +/* + * Internal flag - wakeup generated by io_uring, used to detect recursion back + * into the io_uring poll handler. + */ +#define EPOLL_URING ((__force __poll_t)(1U << 27)) + /* Set exclusive wakeup mode for the target file descriptor */ #define EPOLLEXCLUSIVE ((__force __poll_t)(1U << 28)) -- 2.35.1

3 years

1
0
0 0

[PATCH v4] PM/devfreq: governor: Add a private governor_data for governor

by Kant Fan

The member void *data in the structure devfreq can be overwrite by governor_userspace. For example: 1. The device driver assigned the devfreq governor to simple_ondemand by the function devfreq_add_device() and init the devfreq member void *data to a pointer of a static structure devfreq_simple_ondemand_data by the function devfreq_add_device(). 2. The user changed the devfreq governor to userspace by the command "echo userspace > /sys/class/devfreq/.../governor". 3. The governor userspace alloced a dynamic memory for the struct userspace_data and assigend the member void *data of devfreq to this memory by the function userspace_init(). 4. The user changed the devfreq governor back to simple_ondemand by the command "echo simple_ondemand > /sys/class/devfreq/.../governor". 5. The governor userspace exited and assigned the member void *data in the structure devfreq to NULL by the function userspace_exit(). 6. The governor simple_ondemand fetched the static information of devfreq_simple_ondemand_data in the function devfreq_simple_ondemand_func() but the member void *data of devfreq was assigned to NULL by the function userspace_exit(). 7. The information of upthreshold and downdifferential is lost and the governor simple_ondemand can't work correctly. The member void *data in the structure devfreq is designed for a static pointer used in a governor and inited by the function devfreq_add_device(). This patch add an element named governor_data in the devfreq structure which can be used by a governor(E.g userspace) who want to assign a private data to do some private things. Fixes: ce26c5bb9569 ("PM / devfreq: Add basic governors") Cc: stable(a)vger.kernel.org # 5.10+ Reviewed-by: Chanwoo Choi <cwchoi00(a)gmail.com> Acked-by: MyungJoo Ham <myungjoo.ham(a)samsung.com> Signed-off-by: Kant Fan <kant(a)allwinnertech.com> --- drivers/devfreq/devfreq.c | 6 ++---- drivers/devfreq/governor_userspace.c | 12 ++++++------ include/linux/devfreq.h | 7 ++++--- 3 files changed, 12 insertions(+), 13 deletions(-) diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c index 63347a5ae599..8c5f6f7fca11 100644 --- a/drivers/devfreq/devfreq.c +++ b/drivers/devfreq/devfreq.c @@ -776,8 +776,7 @@ static void remove_sysfs_files(struct devfreq *devfreq, * @dev: the device to add devfreq feature. * @profile: device-specific profile to run devfreq. * @governor_name: name of the policy to choose frequency. - * @data: private data for the governor. The devfreq framework does not - * touch this value. + * @data: devfreq driver pass to governors, governor should not change it. */ struct devfreq *devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile, @@ -1011,8 +1010,7 @@ static void devm_devfreq_dev_release(struct device *dev, void *res) * @dev: the device to add devfreq feature. * @profile: device-specific profile to run devfreq. * @governor_name: name of the policy to choose frequency. - * @data: private data for the governor. The devfreq framework does not - * touch this value. + * @data: devfreq driver pass to governors, governor should not change it. * * This function manages automatically the memory of devfreq device using device * resource management and simplify the free operation for memory of devfreq diff --git a/drivers/devfreq/governor_userspace.c b/drivers/devfreq/governor_userspace.c index ab9db7adb3ad..d69672ccacc4 100644 --- a/drivers/devfreq/governor_userspace.c +++ b/drivers/devfreq/governor_userspace.c @@ -21,7 +21,7 @@ struct userspace_data { static int devfreq_userspace_func(struct devfreq *df, unsigned long *freq) { - struct userspace_data *data = df->data; + struct userspace_data *data = df->governor_data; if (data->valid) *freq = data->user_frequency; @@ -40,7 +40,7 @@ static ssize_t set_freq_store(struct device *dev, struct device_attribute *attr, int err = 0; mutex_lock(&devfreq->lock); - data = devfreq->data; + data = devfreq->governor_data; sscanf(buf, "%lu", &wanted); data->user_frequency = wanted; @@ -60,7 +60,7 @@ static ssize_t set_freq_show(struct device *dev, int err = 0; mutex_lock(&devfreq->lock); - data = devfreq->data; + data = devfreq->governor_data; if (data->valid) err = sprintf(buf, "%lu\n", data->user_frequency); @@ -91,7 +91,7 @@ static int userspace_init(struct devfreq *devfreq) goto out; } data->valid = false; - devfreq->data = data; + devfreq->governor_data = data; err = sysfs_create_group(&devfreq->dev.kobj, &dev_attr_group); out: @@ -107,8 +107,8 @@ static void userspace_exit(struct devfreq *devfreq) if (devfreq->dev.kobj.sd) sysfs_remove_group(&devfreq->dev.kobj, &dev_attr_group); - kfree(devfreq->data); - devfreq->data = NULL; + kfree(devfreq->governor_data); + devfreq->governor_data = NULL; } static int devfreq_userspace_handler(struct devfreq *devfreq, diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h index 34aab4dd336c..4dc7cda4fd46 100644 --- a/include/linux/devfreq.h +++ b/include/linux/devfreq.h @@ -152,8 +152,8 @@ struct devfreq_stats { * @max_state: count of entry present in the frequency table. * @previous_freq: previously configured frequency value. * @last_status: devfreq user device info, performance statistics - * @data: Private data of the governor. The devfreq framework does not - * touch this. + * @data: devfreq driver pass to governors, governor should not change it. + * @governor_data: private data for governors, devfreq core doesn't touch it. * @user_min_freq_req: PM QoS minimum frequency request from user (via sysfs) * @user_max_freq_req: PM QoS maximum frequency request from user (via sysfs) * @scaling_min_freq: Limit minimum frequency requested by OPP interface @@ -193,7 +193,8 @@ struct devfreq { unsigned long previous_freq; struct devfreq_dev_status last_status; - void *data; /* private data for governors */ + void *data; + void *governor_data; struct dev_pm_qos_request user_min_freq_req; struct dev_pm_qos_request user_max_freq_req; -- 2.29.0

3 years

2
1
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror November 2022