An earlier commit:
b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
ensured that we don't get stuck waiting for eventfd reads when the eventfd is registered with the io_uring ring for event notification. But there is still a gap where the task can be waiting on other events in the kernel and needs a bigger nudge to make forward progress.
Ensure that we use signaled notifications for a task that isn't currently running, to be certain the work is seen and processed immediately.
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9b27cdaa735..443eecdfeda9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1712,21 +1712,27 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret, notify = TWA_RESUME;
 
+	ret = __task_work_add(tsk, cb);
+	if (unlikely(ret))
+		return ret;
+
 	/*
 	 * SQPOLL kernel thread doesn't need notification, just a wakeup.
-	 * If we're not using an eventfd, then TWA_RESUME is always fine,
-	 * as we won't have dependencies between request completions for
-	 * other kernel wait conditions.
+	 * For any other work, use signaled wakeups if the task isn't
+	 * running to avoid dependencies between tasks or threads. If
+	 * the issuing task is currently waiting in the kernel on a thread,
+	 * and same thread is waiting for a completion event, then we need
+	 * to ensure that the issuing task processes task_work. TWA_SIGNAL
+	 * is needed for that.
 	 */
 	if (ctx->flags & IORING_SETUP_SQPOLL)
 		notify = 0;
-	else if (ctx->cq_ev_fd)
+	else if (READ_ONCE(tsk->state) != TASK_RUNNING)
 		notify = TWA_SIGNAL;
 
-	ret = task_work_add(tsk, cb, notify);
-	if (!ret)
-		wake_up_process(tsk);
-	return ret;
+	__task_work_notify(tsk, notify);
+	wake_up_process(tsk);
+	return 0;
 }
 
 static void __io_req_task_cancel(struct io_kiocb *req, int error)
On Sat, Aug 08, 2020 at 12:34:39PM -0600, Jens Axboe wrote:
Wait... so the only change here is that you look at tsk->state _after_ doing __task_work_add(), but nothing, neither the changelog nor the comment, explains this.
So you're relying on __task_work_add() acting as an smp_mb() versus the add, and you order this against the smp_mb() in set_current_state()?
This really needs spelling out.
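A sketch of the pairing in question (an illustration of the thread's argument, not kernel code; it assumes the cmpxchg() inside __task_work_add() acts as a full memory barrier on success):

	/*
	 *   io_req_task_work_add()          task going to sleep
	 *   ----------------------          -------------------
	 *   __task_work_add(tsk, cb);       set_current_state(TASK_INTERRUPTIBLE);
	 *     // cmpxchg() = full barrier     // implies smp_mb()
	 *   READ_ONCE(tsk->state);          check for queued task_work,
	 *     // != TASK_RUNNING              schedule() only if none
	 *     //   -> TWA_SIGNAL
	 *
	 * With a full barrier on both sides, either the adder observes the
	 * new ->state and escalates to TWA_SIGNAL, or the sleeper observes
	 * the queued work before calling schedule(); the two stores cannot
	 * both go unseen.
	 */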
On 8/10/20 9:02 AM, Jens Axboe wrote:
I failed to convince myself that the existing construct was safe, so here's an incremental on top of it. Basically we re-check the task state _after_ the initial notification, to protect against the case where we initially find the task running, but between that check and when we do the notification, it has gone to sleep. The window should be pretty slim, but I think it's there.
Hence do a loop around it, if we're using TWA_RESUME.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 44ac103483b6..a4ecb6c7e2b0 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1780,12 +1780,27 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
 	 * to ensure that the issuing task processes task_work. TWA_SIGNAL
 	 * is needed for that.
 	 */
-	if (ctx->flags & IORING_SETUP_SQPOLL)
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		notify = 0;
-	else if (READ_ONCE(tsk->state) != TASK_RUNNING)
-		notify = TWA_SIGNAL;
+	} else {
+		bool notified = false;
 
-	__task_work_notify(tsk, notify);
+		/*
+		 * If the task is running, TWA_RESUME notify is enough. Make
+		 * sure to re-check after we've sent the notification, as not
+		 * to have a race between the check and the notification. This
+		 * only applies for TWA_RESUME, as TWA_SIGNAL is safe with a
+		 * sleeping task
+		 */
+		do {
+			if (READ_ONCE(tsk->state) != TASK_RUNNING)
+				notify = TWA_SIGNAL;
+			else if (notified)
+				break;
+			__task_work_notify(tsk, notify);
+			notified = true;
+		} while (notify != TWA_SIGNAL);
+	}
 	wake_up_process(tsk);
 	return 0;
 }
and I've folded it in here:
https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=8d685b5...
On Mon, Aug 10, 2020 at 01:21:48PM -0600, Jens Axboe wrote:
Could we get a clue as to why TWA_RESUME is enough when it's running? I presume it is because we'll do task_work_run() somewhere before we block, but having an explicit reference here might help someone new to this make sense of it all.
Would it be clearer to write it like so perhaps?
	/*
	 * Optimization; when the task is RUNNING we can do with a
	 * cheaper TWA_RESUME notification because,... <reason goes
	 * here>. Otherwise do the more expensive, but always correct
	 * TWA_SIGNAL.
	 */
	if (READ_ONCE(tsk->state) == TASK_RUNNING) {
		__task_work_notify(tsk, TWA_RESUME);
		if (READ_ONCE(tsk->state) == TASK_RUNNING)
			return;
	}
	__task_work_notify(tsk, TWA_SIGNAL);
	wake_up_process(tsk);
On 8/10/20 2:13 PM, Jens Axboe wrote:
Rewrote it a bit on top of that, turning it into two separate READ_ONCE() checks, and added appropriate comments. For the SQPOLL case, the wake_up_process() is enough, so we can clean up that if/else.
https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=49bc5c1...
On Mon, Aug 10, 2020 at 02:25:48PM -0600, Jens Axboe wrote:
OK, that works for me, thanks!
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
On Mon, Aug 10, 2020 at 10:25 PM Jens Axboe <axboe@kernel.dk> wrote:
I think I'm starting to understand the overall picture here, and if my understanding is correct, your solution isn't going to work properly.
My understanding of the scenario you're trying to address is:
- task A starts up io_uring
- task A tells io_uring to bump the counter of an eventfd E when work has been completed
- task A submits some work ("read a byte from file descriptor X", or something like that)
- io_uring internally starts an asynchronous I/O operation, with a callback C
- task A calls read(E, &counter, sizeof(counter)) to wait for events to be processed
- the async I/O operation finishes, C is invoked, and C schedules task_work for task A
And here you run into a deadlock, because the task_work will only run when task A returns from the syscall, but the syscall will only return once the task_work is executing and has finished the I/O operation.
If that is the scenario you're trying to solve here (where you're trying to force a task that's in the middle of some syscall that's completely unrelated to io_uring to return back to syscall context), I don't think this will work: It might well be that the task has e.g. just started entering the read() syscall, and is *about to* block, but is currently still running.
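For concreteness, the scenario can be written down as a minimal userspace sketch using liburing (io_uring_register_eventfd() and io_uring_prep_read() are real liburing helpers; error handling is elided, and stdin stands in for "file descriptor X"):

#include <liburing.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	uint64_t counter;
	char byte;
	int efd;

	efd = eventfd(0, 0);			/* eventfd E */
	io_uring_queue_init(8, &ring, 0);
	io_uring_register_eventfd(&ring, efd);	/* bump E on completions */

	/* submit "read a byte from file descriptor X" */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, STDIN_FILENO, &byte, 1, 0);
	io_uring_submit(&ring);

	/*
	 * Task A now blocks in read(2) on E. If completing the request
	 * requires task_work that only runs when A exits a syscall, and
	 * the notification doesn't force this read() to return, A sleeps
	 * forever: the deadlock described above.
	 */
	read(efd, &counter, sizeof(counter));

	io_uring_queue_exit(&ring);
	return 0;
}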
On 8/10/20 2:35 PM, Jann Horn wrote:
Your understanding of the scenario appears to be correct, and as far as I can tell, so is your analysis of why the existing approach doesn't fully close the gap. You're right that the task could currently be on its way to blocking, but still running. And for that case, TWA_RESUME isn't going to cut it.
Ugh. This basically means I have to use TWA_SIGNAL regardless of state, unless SQPOLL is used. That's not optimal.
Alternatively:
	if (tsk->state != TASK_RUNNING || task_in_kernel(tsk))
		notify = TWA_SIGNAL;
	else
		notify = TWA_RESUME;
should work as far as I can tell, but I don't even know if there's a reliable way to do task_in_kernel(). But I suspect this kind of check would still save the day, as we're not really expecting the common case to be that the task is in the kernel on the way to blocking. And it'd be kind of annoying to have to cater to that scenario by slowing down the fast path.
Suggestions?
On Mon, Aug 10, 2020 at 11:12 PM Jens Axboe <axboe@kernel.dk> wrote:
Why are you trying to avoid using TWA_SIGNAL? Is there a specific part of handling it that's particularly slow?
On 8/10/20 3:28 PM, Jens Axboe wrote:
Ran some quick testing in a VM, which is the worst case for this kind of thing, as any kind of mucking with interrupts is really slow. And the hit is substantial. With the below, though, we're basically at parity again. Just for discussion...
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 5c0848ca1287..ea2c683c8563 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -42,7 +42,8 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 		set_notify_resume(task);
 		break;
 	case TWA_SIGNAL:
-		if (lock_task_sighand(task, &flags)) {
+		if (!(task->jobctl & JOBCTL_TASK_WORK) &&
+		    lock_task_sighand(task, &flags)) {
 			task->jobctl |= JOBCTL_TASK_WORK;
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
On Tue, Aug 11, 2020 at 12:01 AM Jens Axboe <axboe@kernel.dk> wrote:
I think that should work in theory, but if you want to do a proper unlocked read of task->jobctl here, then I think you'd have to use READ_ONCE() and make all existing writes to ->jobctl use WRITE_ONCE().
Also, I think that to make this work, stuff like get_signal() will need to use memory barriers to ensure that reads from ->task_works are ordered after ->jobctl has been cleared - ideally written such that on the fastpath, the memory barrier doesn't execute.
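What that might look like on the consumer side, sketched (an illustration only, not the actual get_signal() code; the explicit smp_mb() placement is an assumption about where the ordering would need to go):

	/* signal-delivery path, sketched */
	current->jobctl &= ~JOBCTL_TASK_WORK;	/* clear under ->siglock */
	spin_unlock_irq(&sighand->siglock);
	smp_mb();	/* order the flag clear before re-reading the list */
	task_work_run();	/* must observe any work queued by an adder
				 * that saw JOBCTL_TASK_WORK still set */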
On 8/10/20 4:41 PM, Jann Horn wrote:
I wonder if it's possible to just make it safe for the io_uring case, since a bigger change would leave this performance regression in place for this release... It would still require the split add/notification patch, but that one is trivial.
On 08/11, Jann Horn wrote:
Agreed,
> and make all existing writes to ->jobctl use WRITE_ONCE().
->jobctl is always modified with ->siglock held, do we really need WRITE_ONCE() ?
Why? I don't follow.
Afaics, we only need to ensure that task_work_add() checks JOBCTL_TASK_WORK after it adds the new work to the ->task_works list, and we can rely on cmpxchg() ?
Oleg.
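Oleg's argument, sketched on the producer side (the cmpxchg() loop mirrors how mainline task_work_add() publishes work to ->task_works; the unlocked ->jobctl check is the proposed optimization from the patch above):

	struct callback_head *head;

	/* publish the work; cmpxchg() is a full barrier on success */
	do {
		head = READ_ONCE(task->task_works);
		work->next = head;
	} while (cmpxchg(&task->task_works, head, work) != head);

	/*
	 * This read is ordered after the list add by the cmpxchg()
	 * above. If JOBCTL_TASK_WORK is still set, a previous wakeup
	 * is pending and will process this work too, so the sighand
	 * lock and signal_wake_up() can be skipped.
	 */
	if (!(READ_ONCE(task->jobctl) & JOBCTL_TASK_WORK) &&
	    lock_task_sighand(task, &flags)) {
		task->jobctl |= JOBCTL_TASK_WORK;
		signal_wake_up(task, 0);
		unlock_task_sighand(task, &flags);
	}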
On Tue, Aug 11, 2020 at 09:14:02AM +0200, Oleg Nesterov wrote:
Ah, well, that I don't know. Why do we need the READ_ONCE()?
It does:
and the lock_task_sighand() implies barrier(), so I thought the reason for the READ_ONCE() was load-tearing, and then we need WRITE_ONCE() to avoid store-tearing.
On 8/11/20 2:10 AM, Oleg Nesterov wrote:
Thanks, reading through this thread makes me feel better. I agree that we'll need READ_ONCE() just to shut up analyzers.
I'd really like to get this done at the same time as the io_uring change. Are you open to doing the READ_ONCE()-based JOBCTL_TASK_WORK addition for 5.9? Alternatively, we can retain the 1/2 patch from this series and I'll open-code it in io_uring, but that seems pointless, as io_uring is the only user of TWA_SIGNAL in the kernel anyway.
On 08/11, Jens Axboe wrote:
Yes, the patch looks fine to me. In fact I was going to add this optimization from the very beginning, but then decided to make that patch as simple as possible.
And in any case I personally like this change much more than 1/2 + 2/2 which I honestly don't understand ;)
Oleg.
On 8/10/20 3:12 PM, Jens Axboe wrote:
Is there a safe way to make TWA_SIGNAL notification cheaper? I won't pretend to fully understand the ordering; Oleg probably has a much better idea than me...
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 5c0848ca1287..ea2c683c8563 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -42,7 +42,8 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 		set_notify_resume(task);
 		break;
 	case TWA_SIGNAL:
-		if (lock_task_sighand(task, &flags)) {
+		if (!(task->jobctl & JOBCTL_TASK_WORK) &&
+		    lock_task_sighand(task, &flags)) {
 			task->jobctl |= JOBCTL_TASK_WORK;
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
Hi
[This is an automated email]
This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: 5.7+
The bot has tested the following trees: v5.8, v5.7.14.
v5.8: Failed to apply! Possible dependencies:
    3fa5e0f33128 ("io_uring: optimise io_req_find_next() fast check")
    4503b7676a2e ("io_uring: catch -EIO from buffered issue request failure")
    7c86ffeeed30 ("io_uring: deduplicate freeing linked timeouts")
    9b0d911acce0 ("io_uring: kill REQ_F_LINK_NEXT")
    9b5f7bd93272 ("io_uring: replace find_next() out param with ret")
    a1d7c393c471 ("io_uring: enable READ/WRITE to use deferred completions")
    b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
    bcf5a06304d6 ("io_uring: support true async buffered reads, if file provides it")
    c2c4c83c58cb ("io_uring: use new io_req_task_work_add() helper throughout")
    c40f63790ec9 ("io_uring: use task_work for links if possible")
    e1e16097e265 ("io_uring: provide generic io_req_complete() helper")
v5.7.14: Failed to apply! Possible dependencies:
    0cdaf760f42e ("io_uring: remove req->needs_fixed_files")
    310672552f4a ("io_uring: async task poll trigger cleanup")
    3fa5e0f33128 ("io_uring: optimise io_req_find_next() fast check")
    405a5d2b2762 ("io_uring: avoid unnecessary io_wq_work copy for fast poll feature")
    4a38aed2a0a7 ("io_uring: batch reap of dead file registrations")
    4dd2824d6d59 ("io_uring: lazy get task")
    7c86ffeeed30 ("io_uring: deduplicate freeing linked timeouts")
    7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
    7d01bd745a8f ("io_uring: remove obsolete 'state' parameter")
    9b0d911acce0 ("io_uring: kill REQ_F_LINK_NEXT")
    9b5f7bd93272 ("io_uring: replace find_next() out param with ret")
    c2c4c83c58cb ("io_uring: use new io_req_task_work_add() helper throughout")
    c40f63790ec9 ("io_uring: use task_work for links if possible")
    d4c81f38522f ("io_uring: don't arm a timeout through work.func")
    f5fa38c59cb0 ("io_wq: add per-wq work handler instead of per work")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
Hi
[This is an automated email]
This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: 5.7+
The bot has tested the following trees: v5.8.1, v5.7.15.
v5.8.1: Failed to apply! Possible dependencies:
    3fa5e0f33128 ("io_uring: optimise io_req_find_next() fast check")
    4503b7676a2e ("io_uring: catch -EIO from buffered issue request failure")
    7c86ffeeed30 ("io_uring: deduplicate freeing linked timeouts")
    9b0d911acce0 ("io_uring: kill REQ_F_LINK_NEXT")
    9b5f7bd93272 ("io_uring: replace find_next() out param with ret")
    a1d7c393c471 ("io_uring: enable READ/WRITE to use deferred completions")
    b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
    bcf5a06304d6 ("io_uring: support true async buffered reads, if file provides it")
    c2c4c83c58cb ("io_uring: use new io_req_task_work_add() helper throughout")
    c40f63790ec9 ("io_uring: use task_work for links if possible")
    e1e16097e265 ("io_uring: provide generic io_req_complete() helper")
v5.7.15: Failed to apply! Possible dependencies:
    0cdaf760f42e ("io_uring: remove req->needs_fixed_files")
    310672552f4a ("io_uring: async task poll trigger cleanup")
    3fa5e0f33128 ("io_uring: optimise io_req_find_next() fast check")
    405a5d2b2762 ("io_uring: avoid unnecessary io_wq_work copy for fast poll feature")
    4a38aed2a0a7 ("io_uring: batch reap of dead file registrations")
    4dd2824d6d59 ("io_uring: lazy get task")
    7c86ffeeed30 ("io_uring: deduplicate freeing linked timeouts")
    7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
    7d01bd745a8f ("io_uring: remove obsolete 'state' parameter")
    9b0d911acce0 ("io_uring: kill REQ_F_LINK_NEXT")
    9b5f7bd93272 ("io_uring: replace find_next() out param with ret")
    c2c4c83c58cb ("io_uring: use new io_req_task_work_add() helper throughout")
    c40f63790ec9 ("io_uring: use task_work for links if possible")
    d4c81f38522f ("io_uring: don't arm a timeout through work.func")
    f5fa38c59cb0 ("io_wq: add per-wq work handler instead of per work")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
On 8/19/20 4:59 PM, Jens Axboe wrote:
The replacement is:
commit 228bb0e69491f55e502c883c583d7c0d67659e83
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Aug 6 19:41:50 2020 -0600

    io_uring: use TWA_SIGNAL for task_work uncondtionally
So you can ignore this one.