Patches 5-6/9 were forgotten to be marked for-stable. The others are
5 of the 6 that failed to apply, plus their dependencies.
Jens Axboe (3):
  fs: provide locked helper variant of close_fd_get_file()
  io_uring: get rid of intermediate IORING_OP_CLOSE stage
  io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
Pavel Begunkov (6):
  io_uring: fix inconsistent lock state
  io_uring: deduplicate core cancellations sequence
  io_uring: unpark SQPOLL thread for cancelation
  io_uring: deduplicate failing task_work_add
  io_uring/io-wq: return 2-step work swap scheme
  io_uring: don't take uring_lock during iowq cancel
 fs/file.c     |  36 +++++---
 fs/internal.h |   1 +
 fs/io-wq.c    |  17 ++--
 fs/io-wq.h    |   5 +-
 fs/io_uring.c | 241 +++++++++++++++++++++++---------------------------
 5 files changed, 145 insertions(+), 155 deletions(-)
commit 9ae1f8dd372e0e4c020b345cf9e09f519265e981 upstream
WARNING: inconsistent lock state
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&fs->lock);
  <Interrupt>
    lock(&fs->lock);

 *** DEADLOCK ***
1 lock held by syz-executor217/8450:
 #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442
stack backtrace:
CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 [...]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 io_req_clean_work fs/io_uring.c:1398 [inline]
 io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
 io_free_req fs/io_uring.c:2269 [inline]
 io_double_put_req fs/io_uring.c:2392 [inline]
 io_put_req+0xf9/0x570 fs/io_uring.c:2388
 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
 __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
 __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
 run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
 sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
 spin_unlock_irq include/linux/spinlock.h:404 [inline]
 io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
 io_queue_link_head fs/io_uring.c:6650 [inline]
 io_submit_sqe fs/io_uring.c:6697 [inline]
 io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Don't free requests from under hrtimer context (softirq), as the free path may sleep or take spinlocks improperly (e.g. non-irq-safe versions).
Cc: stable@vger.kernel.org # 5.6+
Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 38bfd168ad3b..a1d08b641d0f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6506,9 +6506,10 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) if (prev) { req_set_fail_links(prev); io_async_find_and_cancel(ctx, req, prev->user_data, -ETIME); - io_put_req(prev); + io_put_req_deferred(prev, 1); } else { - io_req_complete(req, -ETIME); + io_cqring_add_event(req, -ETIME, 0); + io_put_req_deferred(req, 1); } return HRTIMER_NORESTART; }
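For context: the fix works because io_put_req_deferred() bounces the final put into task_work, so the actual freeing runs in process context instead of under the hrtimer (softirq) callback. Below is a minimal sketch of that pattern, assuming the 5.10-era io_kiocb layout; the demo_* names are hypothetical and the real fallback path (taken when task_work_add() fails) is omitted, so treat it as an illustration rather than the actual implementation.

/* sketch only -- conceptually what io_free_req_deferred() does */
static void demo_free_req_cb(struct callback_head *cb)
{
	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);

	/* task context: safe to take non-irq spinlocks and free memory */
	__io_free_req(req);
}

static void demo_put_req_from_timer(struct io_kiocb *req)
{
	if (refcount_sub_and_test(1, &req->refs)) {
		/* queue the free onto the owning task instead of doing it here */
		init_task_work(&req->task_work, demo_free_req_cb);
		task_work_add(req->task, &req->task_work, TWA_SIGNAL);
	}
}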
commit 9936c7c2bc76a0b2276f6d19de6d1d92f03deeab upstream
File and task cancellations go over the same steps trying to cancel requests in io-wq, poll, etc. Deduplicate them with a helper.
Note: the new io_uring_try_cancel_requests() is the former __io_uring_cancel_task_requests() with files passed as an argument and with flushing of overflowed requests added.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 85 ++++++++++++++++++++++++---------------------------
 1 file changed, 40 insertions(+), 45 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a1d08b641d0f..cbcd4023cf7a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -996,9 +996,9 @@ enum io_mem_account { ACCT_PINNED, };
-static void __io_uring_cancel_task_requests(struct io_ring_ctx *ctx, - struct task_struct *task); - +static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); static void destroy_fixed_file_ref_node(struct fixed_file_ref_node *ref_node); static struct fixed_file_ref_node *alloc_fixed_file_ref_node( struct io_ring_ctx *ctx); @@ -8780,7 +8780,7 @@ static void io_ring_exit_work(struct work_struct *work) * as nobody else will be looking for them. */ do { - __io_uring_cancel_task_requests(ctx, NULL); + io_uring_try_cancel_requests(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8894,6 +8894,40 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx, } }
+static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files) +{ + struct io_task_cancel cancel = { .task = task, .files = files, }; + + while (1) { + enum io_wq_cancel cret; + bool ret = false; + + if (ctx->io_wq) { + cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, + &cancel, true); + ret |= (cret != IO_WQ_CANCEL_NOTFOUND); + } + + /* SQPOLL thread does its own polling */ + if (!(ctx->flags & IORING_SETUP_SQPOLL) && !files) { + while (!list_empty_careful(&ctx->iopoll_list)) { + io_iopoll_try_reap_events(ctx); + ret = true; + } + } + + ret |= io_poll_remove_all(ctx, task, files); + ret |= io_kill_timeouts(ctx, task, files); + ret |= io_run_task_work(); + io_cqring_overflow_flush(ctx, true, task, files); + if (!ret) + break; + cond_resched(); + } +} + static int io_uring_count_inflight(struct io_ring_ctx *ctx, struct task_struct *task, struct files_struct *files) @@ -8913,7 +8947,6 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { while (!list_empty_careful(&ctx->inflight_list)) { - struct io_task_cancel cancel = { .task = task, .files = files }; DEFINE_WAIT(wait); int inflight;
@@ -8921,13 +8954,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, if (!inflight) break;
- io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, &cancel, true); - io_poll_remove_all(ctx, task, files); - io_kill_timeouts(ctx, task, files); - io_cqring_overflow_flush(ctx, true, task, files); - /* cancellations _may_ trigger task work */ - io_run_task_work(); - + io_uring_try_cancel_requests(ctx, task, files); prepare_to_wait(&task->io_uring->wait, &wait, TASK_UNINTERRUPTIBLE); if (inflight == io_uring_count_inflight(ctx, task, files)) @@ -8936,37 +8963,6 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } }
-static void __io_uring_cancel_task_requests(struct io_ring_ctx *ctx, - struct task_struct *task) -{ - while (1) { - struct io_task_cancel cancel = { .task = task, .files = NULL, }; - enum io_wq_cancel cret; - bool ret = false; - - if (ctx->io_wq) { - cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, - &cancel, true); - ret |= (cret != IO_WQ_CANCEL_NOTFOUND); - } - - /* SQPOLL thread does its own polling */ - if (!(ctx->flags & IORING_SETUP_SQPOLL)) { - while (!list_empty_careful(&ctx->iopoll_list)) { - io_iopoll_try_reap_events(ctx); - ret = true; - } - } - - ret |= io_poll_remove_all(ctx, task, NULL); - ret |= io_kill_timeouts(ctx, task, NULL); - ret |= io_run_task_work(); - if (!ret) - break; - cond_resched(); - } -} - static void io_disable_sqo_submit(struct io_ring_ctx *ctx) { mutex_lock(&ctx->uring_lock); @@ -8996,11 +8992,10 @@ static void io_uring_cancel_task_requests(struct io_ring_ctx *ctx, }
io_cancel_defer_files(ctx, task, files); - io_cqring_overflow_flush(ctx, true, task, files);
io_uring_cancel_files(ctx, task, files); if (!files) - __io_uring_cancel_task_requests(ctx, task); + io_uring_try_cancel_requests(ctx, task, NULL);
if ((ctx->flags & IORING_SETUP_SQPOLL) && ctx->sq_data) { atomic_dec(&task->io_uring->in_idle);
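As a quick editorial sketch of the resulting call sites (nothing beyond what the diff above already adds): both cancellation paths now funnel through the one helper and differ only in the files argument.

	/* per-files cancellation, from io_uring_cancel_files() */
	io_uring_try_cancel_requests(ctx, task, files);

	/*
	 * whole-task cancellation; with files == NULL the helper also
	 * reaps iopoll events for non-SQPOLL rings
	 */
	io_uring_try_cancel_requests(ctx, task, NULL);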
commit 34343786ecc5ff493ca4d1f873b4386759ba52ee upstream
We park the SQPOLL task before going into io_uring_cancel_files(), so it won't run task_works, including those that might be important for the cancellation passes. In this case it's io_poll_remove_one(), which frees requests via io_put_req_deferred().

Unpark it while waiting; that's fine, as we disable submissions beforehand, so no new requests will be generated.
INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
 context_switch kernel/sched/core.c:4327 [inline]
 __schedule+0x90c/0x21a0 kernel/sched/core.c:5078
 schedule+0xcf/0x270 kernel/sched/core.c:5157
 io_uring_cancel_files fs/io_uring.c:8912 [inline]
 io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
 __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
 io_uring_files_cancel include/linux/io_uring.h:51 [inline]
 do_exit+0x2fe/0x2ae0 kernel/exit.c:780
 do_group_exit+0x125/0x310 kernel/exit.c:922
 __do_sys_exit_group kernel/exit.c:933 [inline]
 __se_sys_exit_group kernel/exit.c:931 [inline]
 __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cbcd4023cf7a..842a7c017296 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8955,11 +8955,16 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, break;
io_uring_try_cancel_requests(ctx, task, files); + + if (ctx->sq_data) + io_sq_thread_unpark(ctx->sq_data); prepare_to_wait(&task->io_uring->wait, &wait, TASK_UNINTERRUPTIBLE); if (inflight == io_uring_count_inflight(ctx, task, files)) schedule(); finish_wait(&task->io_uring->wait, &wait); + if (ctx->sq_data) + io_sq_thread_park(ctx->sq_data); } }
commit eab30c4d20dc761d463445e5130421863ff81505 upstream
When io_req_task_work_add() fails, the request is cancelled by enqueueing it via the task_works of io-wq. Extract a function for that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 46 +++++++++++++++++-----------------------------
 1 file changed, 17 insertions(+), 29 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 842a7c017296..bc76929e0031 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2172,6 +2172,16 @@ static int io_req_task_work_add(struct io_kiocb *req) return ret; }
+static void io_req_task_work_add_fallback(struct io_kiocb *req, + void (*cb)(struct callback_head *)) +{ + struct task_struct *tsk = io_wq_get_task(req->ctx->io_wq); + + init_task_work(&req->task_work, cb); + task_work_add(tsk, &req->task_work, TWA_NONE); + wake_up_process(tsk); +} + static void __io_req_task_cancel(struct io_kiocb *req, int error) { struct io_ring_ctx *ctx = req->ctx; @@ -2229,14 +2239,8 @@ static void io_req_task_queue(struct io_kiocb *req) percpu_ref_get(&req->ctx->refs);
ret = io_req_task_work_add(req); - if (unlikely(ret)) { - struct task_struct *tsk; - - init_task_work(&req->task_work, io_req_task_cancel); - tsk = io_wq_get_task(req->ctx->io_wq); - task_work_add(tsk, &req->task_work, TWA_NONE); - wake_up_process(tsk); - } + if (unlikely(ret)) + io_req_task_work_add_fallback(req, io_req_task_cancel); }
static inline void io_queue_next(struct io_kiocb *req) @@ -2354,13 +2358,8 @@ static void io_free_req_deferred(struct io_kiocb *req)
init_task_work(&req->task_work, io_put_req_deferred_cb); ret = io_req_task_work_add(req); - if (unlikely(ret)) { - struct task_struct *tsk; - - tsk = io_wq_get_task(req->ctx->io_wq); - task_work_add(tsk, &req->task_work, TWA_NONE); - wake_up_process(tsk); - } + if (unlikely(ret)) + io_req_task_work_add_fallback(req, io_put_req_deferred_cb); }
static inline void io_put_req_deferred(struct io_kiocb *req, int refs) @@ -3439,15 +3438,8 @@ static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode, /* submit ref gets dropped, acquire a new one */ refcount_inc(&req->refs); ret = io_req_task_work_add(req); - if (unlikely(ret)) { - struct task_struct *tsk; - - /* queue just for cancelation */ - init_task_work(&req->task_work, io_req_task_cancel); - tsk = io_wq_get_task(req->ctx->io_wq); - task_work_add(tsk, &req->task_work, TWA_NONE); - wake_up_process(tsk); - } + if (unlikely(ret)) + io_req_task_work_add_fallback(req, io_req_task_cancel); return 1; }
@@ -5159,12 +5151,8 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, */ ret = io_req_task_work_add(req); if (unlikely(ret)) { - struct task_struct *tsk; - WRITE_ONCE(poll->canceled, true); - tsk = io_wq_get_task(req->ctx->io_wq); - task_work_add(tsk, &req->task_work, TWA_NONE); - wake_up_process(tsk); + io_req_task_work_add_fallback(req, func); } return 1; }
From: Jens Axboe <axboe@kernel.dk>
commit 53dec2ea74f2ef360e8455439be96a780baa6097 upstream
Assumes current->files->file_lock is already held on invocation. Helps the caller check the file before removing the fd, if it needs to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/file.c     | 36 +++++++++++++++++++++++++-----------
 fs/internal.h |  1 +
 2 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/fs/file.c b/fs/file.c index dab120b71e44..f3a4bac2cbe9 100644 --- a/fs/file.c +++ b/fs/file.c @@ -22,6 +22,8 @@ #include <linux/close_range.h> #include <net/sock.h>
+#include "internal.h" + unsigned int sysctl_nr_open __read_mostly = 1024*1024; unsigned int sysctl_nr_open_min = BITS_PER_LONG; /* our min() is unusable in constant expressions ;-/ */ @@ -732,36 +734,48 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags) }
/* - * variant of close_fd that gets a ref on the file for later fput. - * The caller must ensure that filp_close() called on the file, and then - * an fput(). + * See close_fd_get_file() below, this variant assumes current->files->file_lock + * is held. */ -int close_fd_get_file(unsigned int fd, struct file **res) +int __close_fd_get_file(unsigned int fd, struct file **res) { struct files_struct *files = current->files; struct file *file; struct fdtable *fdt;
- spin_lock(&files->file_lock); fdt = files_fdtable(files); if (fd >= fdt->max_fds) - goto out_unlock; + goto out_err; file = fdt->fd[fd]; if (!file) - goto out_unlock; + goto out_err; rcu_assign_pointer(fdt->fd[fd], NULL); __put_unused_fd(files, fd); - spin_unlock(&files->file_lock); get_file(file); *res = file; return 0; - -out_unlock: - spin_unlock(&files->file_lock); +out_err: *res = NULL; return -ENOENT; }
+/* + * variant of close_fd that gets a ref on the file for later fput. + * The caller must ensure that filp_close() called on the file, and then + * an fput(). + */ +int close_fd_get_file(unsigned int fd, struct file **res) +{ + struct files_struct *files = current->files; + int ret; + + spin_lock(&files->file_lock); + ret = __close_fd_get_file(fd, res); + spin_unlock(&files->file_lock); + + return ret; +} + void do_close_on_exec(struct files_struct *files) { unsigned i; diff --git a/fs/internal.h b/fs/internal.h index 77c50befbfbe..c6c85f6ad598 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -132,6 +132,7 @@ extern struct file *do_file_open_root(struct dentry *, struct vfsmount *, const char *, const struct open_flags *); extern struct open_how build_open_how(int flags, umode_t mode); extern int build_open_flags(const struct open_how *how, struct open_flags *op); +extern int __close_fd_get_file(unsigned int fd, struct file **res);
long do_sys_ftruncate(unsigned int fd, loff_t length, int small); int chmod_common(const struct path *path, umode_t mode);
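A condensed sketch of the calling convention the locked variant enables; this is an editorial illustration, not part of the patch, and demo_must_punt() is a hypothetical predicate (the io_uring close path in the next patch uses "file->f_op->flush && force_nonblock" in its place):

static int demo_close_checked(unsigned int fd, struct file **res)
{
	struct files_struct *files = current->files;
	struct fdtable *fdt;
	struct file *file;
	int ret;

	spin_lock(&files->file_lock);
	fdt = files_fdtable(files);
	file = (fd < fdt->max_fds) ? fdt->fd[fd] : NULL;
	if (file && demo_must_punt(file)) {
		/* bail out and retry later from a context that may block */
		spin_unlock(&files->file_lock);
		return -EAGAIN;
	}
	/* detaches the fd and takes a file reference, or returns -ENOENT */
	ret = __close_fd_get_file(fd, &file);
	spin_unlock(&files->file_lock);

	*res = file;	/* caller then does filp_close() + fput() on success */
	return ret;
}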
From: Jens Axboe <axboe@kernel.dk>
commit 9eac1904d3364254d622bf2c771c4f85cd435fc2 upstream
We currently split the close into two, in case we have a ->flush op that we can't safely handle from non-blocking context. This requires us to flag the op as uncancelable if we do need to punt it async, and that means special handling for just this op type.
Use __close_fd_get_file() and grab the files lock so we can get the file and check if we need to go async in one atomic operation. That gets rid of the need for splitting this into two steps, and hence the need for IO_WQ_WORK_NO_CANCEL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 64 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 29 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bc76929e0031..7d03689d0e47 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -411,7 +411,6 @@ struct io_poll_remove {
struct io_close { struct file *file; - struct file *put_file; int fd; };
@@ -908,8 +907,6 @@ static const struct io_op_def io_op_defs[] = { IO_WQ_WORK_FS | IO_WQ_WORK_MM, }, [IORING_OP_CLOSE] = { - .needs_file = 1, - .needs_file_no_error = 1, .work_flags = IO_WQ_WORK_FILES | IO_WQ_WORK_BLKCG, }, [IORING_OP_FILES_UPDATE] = { @@ -4473,13 +4470,6 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock)
static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - /* - * If we queue this for async, it must not be cancellable. That would - * leave the 'file' in an undeterminate state, and here need to modify - * io_wq_work.flags, so initialize io_wq_work firstly. - */ - io_req_init_async(req); - if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->off || sqe->addr || sqe->len || @@ -4489,43 +4479,59 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
req->close.fd = READ_ONCE(sqe->fd); - if ((req->file && req->file->f_op == &io_uring_fops)) - return -EBADF; - - req->close.put_file = NULL; return 0; }
static int io_close(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { + struct files_struct *files = current->files; struct io_close *close = &req->close; + struct fdtable *fdt; + struct file *file; int ret;
- /* might be already done during nonblock submission */ - if (!close->put_file) { - ret = close_fd_get_file(close->fd, &close->put_file); - if (ret < 0) - return (ret == -ENOENT) ? -EBADF : ret; + file = NULL; + ret = -EBADF; + spin_lock(&files->file_lock); + fdt = files_fdtable(files); + if (close->fd >= fdt->max_fds) { + spin_unlock(&files->file_lock); + goto err; + } + file = fdt->fd[close->fd]; + if (!file) { + spin_unlock(&files->file_lock); + goto err; + } + + if (file->f_op == &io_uring_fops) { + spin_unlock(&files->file_lock); + file = NULL; + goto err; }
/* if the file has a flush method, be safe and punt to async */ - if (close->put_file->f_op->flush && force_nonblock) { - /* not safe to cancel at this point */ - req->work.flags |= IO_WQ_WORK_NO_CANCEL; - /* was never set, but play safe */ - req->flags &= ~REQ_F_NOWAIT; - /* avoid grabbing files - we don't need the files */ - req->flags |= REQ_F_NO_FILE_TABLE; + if (file->f_op->flush && force_nonblock) { + spin_unlock(&files->file_lock); return -EAGAIN; }
+ ret = __close_fd_get_file(close->fd, &file); + spin_unlock(&files->file_lock); + if (ret < 0) { + if (ret == -ENOENT) + ret = -EBADF; + goto err; + } + /* No ->flush() or already async, safely close from here */ - ret = filp_close(close->put_file, req->work.identity->files); + ret = filp_close(file, current->files); +err: if (ret < 0) req_set_fail_links(req); - fput(close->put_file); - close->put_file = NULL; + if (file) + fput(file); __io_req_complete(req, ret, 0, cs); return 0; }
From: Jens Axboe <axboe@kernel.dk>
commit 4014d943cb62db892eb023d385a966a3fce5ee4c upstream
It's no longer used, as IORING_OP_CLOSE got rid of the need for flagging requests as uncancelable; kill it off.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io-wq.c    | 1 -
 fs/io-wq.h    | 1 -
 fs/io_uring.c | 5 +----
 3 files changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index a564f36e260c..2e2f14f42bf2 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -944,7 +944,6 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data) */ spin_lock_irqsave(&worker->lock, flags); if (worker->cur_work && - !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL) && match->fn(worker->cur_work, match->data)) { send_sig(SIGINT, worker->task, 1); match->nr_running++; diff --git a/fs/io-wq.h b/fs/io-wq.h index b158f8addcf3..e1ffb80a4a1d 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -9,7 +9,6 @@ enum { IO_WQ_WORK_CANCEL = 1, IO_WQ_WORK_HASHED = 2, IO_WQ_WORK_UNBOUND = 4, - IO_WQ_WORK_NO_CANCEL = 8, IO_WQ_WORK_CONCURRENT = 16,
IO_WQ_WORK_FILES = 32, diff --git a/fs/io_uring.c b/fs/io_uring.c index 7d03689d0e47..5ebc05f41c19 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6388,11 +6388,8 @@ static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work) if (timeout) io_queue_linked_timeout(timeout);
- /* if NO_CANCEL is set, we must still run the work */ - if ((work->flags & (IO_WQ_WORK_CANCEL|IO_WQ_WORK_NO_CANCEL)) == - IO_WQ_WORK_CANCEL) { + if (work->flags & IO_WQ_WORK_CANCEL) ret = -ECANCELED; - }
if (!ret) { do {
commit 5280f7e530f71ba85baf90169393196976ad0e52 upstream
Saving one lock/unlock for io-wq is not super important, but it adds some ugliness to the code. More importantly, atomic decrements that don't bring the refcount to zero don't provide the right ordering/barriers on some archs, so io_steal_work() may pretty easily get subtly and completely broken.

Bring back the 2-step io-wq work exchange and clean it up.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    | 16 ++++++----------
 fs/io-wq.h    |  4 ++--
 fs/io_uring.c | 26 ++++----------------------
 3 files changed, 12 insertions(+), 34 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 2e2f14f42bf2..63ef195b1acb 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -555,23 +555,21 @@ static void io_worker_handle_work(struct io_worker *worker)
/* handle a whole dependent link */ do { - struct io_wq_work *old_work, *next_hashed, *linked; + struct io_wq_work *next_hashed, *linked; unsigned int hash = io_get_work_hash(work);
next_hashed = wq_next_work(work); io_impersonate_work(worker, work); + wq->do_work(work); + io_assign_current_work(worker, NULL);
- old_work = work; - linked = wq->do_work(work); - + linked = wq->free_work(work); work = next_hashed; if (!work && linked && !io_wq_is_hashed(linked)) { work = linked; linked = NULL; } io_assign_current_work(worker, work); - wq->free_work(old_work); - if (linked) io_wqe_enqueue(wqe, linked);
@@ -850,11 +848,9 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *old_work = work; - work->flags |= IO_WQ_WORK_CANCEL; - work = wq->do_work(work); - wq->free_work(old_work); + wq->do_work(work); + work = wq->free_work(work); } while (work); }
diff --git a/fs/io-wq.h b/fs/io-wq.h index e1ffb80a4a1d..e37a0f217cc8 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -106,8 +106,8 @@ static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) return container_of(work->list.next, struct io_wq_work, list); }
-typedef void (free_work_fn)(struct io_wq_work *); -typedef struct io_wq_work *(io_wq_work_fn)(struct io_wq_work *); +typedef struct io_wq_work *(free_work_fn)(struct io_wq_work *); +typedef void (io_wq_work_fn)(struct io_wq_work *);
struct io_wq_data { struct user_struct *user; diff --git a/fs/io_uring.c b/fs/io_uring.c index 5ebc05f41c19..5e9bff1eeaa0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2365,22 +2365,6 @@ static inline void io_put_req_deferred(struct io_kiocb *req, int refs) io_free_req_deferred(req); }
-static struct io_wq_work *io_steal_work(struct io_kiocb *req) -{ - struct io_kiocb *nxt; - - /* - * A ref is owned by io-wq in which context we're. So, if that's the - * last one, it's safe to steal next work. False negatives are Ok, - * it just will be re-punted async in io_put_work() - */ - if (refcount_read(&req->refs) != 1) - return NULL; - - nxt = io_req_find_next(req); - return nxt ? &nxt->work : NULL; -} - static void io_double_put_req(struct io_kiocb *req) { /* drop both submit and complete references */ @@ -6378,7 +6362,7 @@ static int io_issue_sqe(struct io_kiocb *req, bool force_nonblock, return 0; }
-static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work) +static void io_wq_submit_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct io_kiocb *timeout; @@ -6429,8 +6413,6 @@ static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work) if (lock_ctx) mutex_unlock(&lock_ctx->uring_lock); } - - return io_steal_work(req); }
static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, @@ -8062,12 +8044,12 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, return __io_sqe_files_update(ctx, &up, nr_args); }
-static void io_free_work(struct io_wq_work *work) +static struct io_wq_work *io_free_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work);
- /* Consider that io_steal_work() relies on this ref */ - io_put_req(req); + req = io_put_req_find_next(req); + return req ? &req->work : NULL; }
static int io_init_wq_offload(struct io_ring_ctx *ctx,
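Spelled out, the restored contract is: do_work() only runs the request, while free_work() both drops the io-wq reference and hands back any linked work. Every consumer then follows the same two-step loop; the snippet below is an editorial restatement of the io_run_cancel()/io_worker_handle_work() hunks above, not additional code.

	do {
		wq->do_work(work);		/* io_wq_submit_work(): execute the request */
		work = wq->free_work(work);	/* io_free_work(): put ref, return linked work */
	} while (work);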
commit 792bb6eb862333658bf1bd2260133f0507e2da8d upstream
[ 97.866748] a.out/2890 is trying to acquire lock:
[ 97.867829] ffff8881046763e8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_wq_submit_work+0x155/0x240
[ 97.869735]
[ 97.869735] but task is already holding lock:
[ 97.871033] ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 97.873074]
[ 97.873074] other info that might help us debug this:
[ 97.874520]  Possible unsafe locking scenario:
[ 97.874520]
[ 97.875845]        CPU0
[ 97.876440]        ----
[ 97.877048]   lock(&ctx->uring_lock);
[ 97.877961]   lock(&ctx->uring_lock);
[ 97.878881]
[ 97.878881]  *** DEADLOCK ***
[ 97.878881]
[ 97.880341]  May be due to missing lock nesting notation
[ 97.880341]
[ 97.881952] 1 lock held by a.out/2890:
[ 97.882873]  #0: ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[ 97.885108]
[ 97.885108] stack backtrace:
[ 97.890457] Call Trace:
[ 97.891121]  dump_stack+0xac/0xe3
[ 97.891972]  __lock_acquire+0xab6/0x13a0
[ 97.892940]  lock_acquire+0x2c3/0x390
[ 97.894894]  __mutex_lock+0xae/0x9f0
[ 97.901101]  io_wq_submit_work+0x155/0x240
[ 97.902112]  io_wq_cancel_cb+0x162/0x490
[ 97.904126]  io_async_find_and_cancel+0x3b/0x140
[ 97.905247]  io_issue_sqe+0x86d/0x13e0
[ 97.909122]  __io_queue_sqe+0x10b/0x550
[ 97.913971]  io_queue_sqe+0x235/0x470
[ 97.914894]  io_submit_sqes+0xcce/0xf10
[ 97.917872]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[ 97.921424]  do_syscall_64+0x2d/0x40
[ 97.922329]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
While holding uring_lock (e.g. from inline execution), an async cancel request may attempt cancellations through io_wq_submit_work, which may in turn try to grab the lock. Delay it to task_work, so we do it from a clean context and don't have to worry about locking.
Cc: stable@vger.kernel.org # 5.5+
Fixes: c07e6719511e ("io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()")
Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e9bff1eeaa0..241313278e5a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2198,7 +2198,9 @@ static void io_req_task_cancel(struct callback_head *cb) struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); struct io_ring_ctx *ctx = req->ctx;
+ mutex_lock(&ctx->uring_lock); __io_req_task_cancel(req, -ECANCELED); + mutex_unlock(&ctx->uring_lock); percpu_ref_put(&ctx->refs); }
@@ -6372,8 +6374,13 @@ static void io_wq_submit_work(struct io_wq_work *work) if (timeout) io_queue_linked_timeout(timeout);
- if (work->flags & IO_WQ_WORK_CANCEL) - ret = -ECANCELED; + if (work->flags & IO_WQ_WORK_CANCEL) { + /* io-wq is going to take down one */ + refcount_inc(&req->refs); + percpu_ref_get(&req->ctx->refs); + io_req_task_work_add_fallback(req, io_req_task_cancel); + return; + }
if (!ret) { do {
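One editorial note on the reference handling in the cancel hunk above, since it is easy to miss: io-wq still puts one request reference when this work item returns, and io_req_task_cancel() (first hunk) ends with a percpu_ref_put() on the ctx, so the cancel path grabs one of each before bouncing to task_work. A commented restatement of the same hunk:

	if (work->flags & IO_WQ_WORK_CANCEL) {
		/* io-wq drops one request ref once this work returns */
		refcount_inc(&req->refs);
		/* balanced by the percpu_ref_put() at the end of io_req_task_cancel() */
		percpu_ref_get(&req->ctx->refs);
		io_req_task_work_add_fallback(req, io_req_task_cancel);
		return;
	}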
On 10/03/2021 11:30, Pavel Begunkov wrote:
> Patches 5-6/9 were forgotten to be marked for-stable. The others are
> 5 of the 6 that failed to apply, plus their dependencies.
Edit: 3 out of the 4 that came on March 1st, and the 4th was reverted upstream so isn't needed. Doesn't matter, though.
> Jens Axboe (3):
>   fs: provide locked helper variant of close_fd_get_file()
>   io_uring: get rid of intermediate IORING_OP_CLOSE stage
>   io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
>
> Pavel Begunkov (6):
>   io_uring: fix inconsistent lock state
>   io_uring: deduplicate core cancellations sequence
>   io_uring: unpark SQPOLL thread for cancelation
>   io_uring: deduplicate failing task_work_add
>   io_uring/io-wq: return 2-step work swap scheme
>   io_uring: don't take uring_lock during iowq cancel
>
>  fs/file.c     |  36 +++++---
>  fs/internal.h |   1 +
>  fs/io-wq.c    |  17 ++--
>  fs/io-wq.h    |   5 +-
>  fs/io_uring.c | 241 +++++++++++++++++++++++---------------------------
>  5 files changed, 145 insertions(+), 155 deletions(-)
On Wed, Mar 10, 2021 at 11:30:36AM +0000, Pavel Begunkov wrote:
> Patches 5-6/9 were forgotten to be marked for-stable. The others are
> 5 of the 6 that failed to apply, plus their dependencies.
> Jens Axboe (3):
>   fs: provide locked helper variant of close_fd_get_file()
>   io_uring: get rid of intermediate IORING_OP_CLOSE stage
>   io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
>
> Pavel Begunkov (6):
>   io_uring: fix inconsistent lock state
>   io_uring: deduplicate core cancellations sequence
>   io_uring: unpark SQPOLL thread for cancelation
>   io_uring: deduplicate failing task_work_add
>   io_uring/io-wq: return 2-step work swap scheme
>   io_uring: don't take uring_lock during iowq cancel
Thanks for these, now queued up.
greg k-h