io_uring backports for stable 5.15
Pavel Begunkov (5):
  io_uring/af_unix: defer registered files gc to io_uring release
  io_uring: correct pinned_vm accounting
  io_uring/rw: fix short rw error handling
  io_uring/rw: fix error'ed retry return values
  io_uring/rw: fix unexpected link breakage
 fs/io_uring.c          | 40 ++++++++++++++++++++++++----------------
 include/linux/skbuff.h |  2 ++
 net/unix/garbage.c     | 20 ++++++++++++++++++++
 3 files changed, 46 insertions(+), 16 deletions(-)
[ upstream commit 0091bfc81741b8d3aeb3b7ab8636f911b2de6e80 ]
Instead of putting io_uring's registered files in unix_gc() we want it to be done by io_uring itself. The trick here is to consider io_uring registered files for cycle detection but not actually put them down. Because io_uring can't register other ring instances, this will remove all refs to the ring file, triggering the ->release path and cleaning up with io_ring_ctx_free().
Cc: stable@vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c          |  1 +
 include/linux/skbuff.h |  2 ++
 net/unix/garbage.c     | 20 ++++++++++++++++++++
 3 files changed, 23 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ed6abd74f386..1b0e83fdea7a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8065,6 +8065,7 @@ static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
 	}
 
 	skb->sk = sk;
+	skb->scm_io_uring = 1;
 
 	nr_files = 0;
 	fpl->user = get_uid(current_user());
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cfb889f66c70..19e595cab23a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -725,6 +725,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@csum_level: indicates the number of consecutive checksums found in
  *		the packet minus one that have been verified as
  *		CHECKSUM_UNNECESSARY (max 3)
+ *	@scm_io_uring: SKB holds io_uring registered files
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
@@ -910,6 +911,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
+	__u8			scm_io_uring:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index d45d5366115a..dc2763540393 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -204,6 +204,7 @@ void wait_for_unix_gc(void)
 /* The external entry point: unix_gc() */
 void unix_gc(void)
 {
+	struct sk_buff *next_skb, *skb;
 	struct unix_sock *u;
 	struct unix_sock *next;
 	struct sk_buff_head hitlist;
@@ -297,11 +298,30 @@ void unix_gc(void)
 
 	spin_unlock(&unix_gc_lock);
 
+	/* We need io_uring to clean its registered files, ignore all io_uring
+	 * originated skbs. It's fine as io_uring doesn't keep references to
+	 * other io_uring instances and so killing all other files in the cycle
+	 * will put all io_uring references forcing it to go through normal
+	 * release path eventually putting registered files.
+	 */
+	skb_queue_walk_safe(&hitlist, skb, next_skb) {
+		if (skb->scm_io_uring) {
+			__skb_unlink(skb, &hitlist);
+			skb_queue_tail(&skb->sk->sk_receive_queue, skb);
+		}
+	}
+
 	/* Here we are. Hitlist is filled. Die. */
 	__skb_queue_purge(&hitlist);
 
 	spin_lock(&unix_gc_lock);
 
+	/* There could be io_uring registered files, just push them back to
+	 * the inflight list
+	 */
+	list_for_each_entry_safe(u, next, &gc_candidates, link)
+		list_move_tail(&u->link, &gc_inflight_list);
+
 	/* All candidates should have been detached by now. */
 	BUG_ON(!list_empty(&gc_candidates));
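A note on the unix GC side of the hunk above: skb_queue_walk_safe() caches the next queue entry before the loop body runs, so the current skb can be unlinked without breaking the walk, which is what allows the hitlist to be filtered in place. A commented restatement of the new step (illustrative only, not part of the patch):

	struct sk_buff *next_skb, *skb;

	/* Filter the hitlist that unix_gc() is about to purge.  The _safe
	 * walker is required because the body unlinks the current skb. */
	skb_queue_walk_safe(&hitlist, skb, next_skb) {
		if (skb->scm_io_uring) {
			/* This skb carries io_uring registered files: don't
			 * free it here, hand it back to its socket so the
			 * file references stay alive until io_uring's own
			 * ->release path puts them. */
			__skb_unlink(skb, &hitlist);
			skb_queue_tail(&skb->sk->sk_receive_queue, skb);
		}
	}

Requeueing on sk_receive_queue rather than dropping the skb is the whole point: the registered files are only put once the ring file's ->release runs, as the commit message describes.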
[ upstream commit 42b6419d0aba47c5d8644cdc0b68502254671de5 ]
->mm_account should be released only after we free all registered buffers, otherwise __io_sqe_buffers_unregister() will see a NULL ->mm_account and skip locked_vm accounting.
Cc: Stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6d798f65ed4ab8db3664c4d3397d4af16ca98846.166484993...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1b0e83fdea7a..2da7a490d1c7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -9302,11 +9302,6 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	io_sq_thread_finish(ctx);
 
-	if (ctx->mm_account) {
-		mmdrop(ctx->mm_account);
-		ctx->mm_account = NULL;
-	}
-
 	/* __io_rsrc_put_work() may need uring_lock to progress, wait w/o it */
 	io_wait_rsrc_data(ctx->buf_data);
 	io_wait_rsrc_data(ctx->file_data);
@@ -9342,6 +9337,11 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
+	if (ctx->mm_account) {
+		mmdrop(ctx->mm_account);
+		ctx->mm_account = NULL;
+	}
+
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
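The ordering matters because the buffer teardown path only undoes the pinned_vm charge while ->mm_account is still set. A simplified sketch of the 5.15 unaccounting helper (reconstructed for illustration, not part of the patch):

static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
{
	if (ctx->user)
		__io_unaccount_mem(ctx->user, nr_pages);

	/* With the old ordering ->mm_account was already dropped and NULLed
	 * by the time registered buffers were freed, so this subtraction was
	 * silently skipped and pinned_vm stayed elevated. */
	if (ctx->mm_account)
		atomic64_sub(nr_pages, &ctx->mm_account->pinned_vm);
}

Moving the mmdrop() below the rsrc teardown keeps the pointer valid for exactly as long as there may still be pages to unaccount.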
[ upstream commit 89473c1a9205760c4fa6d158058da7b594a815f0 ]
We have a couple of problems. The first is reports of unexpected link breakage for reads even when cqe->res indicates that the IO was done in full. The reason here is partial IO with retries.
TL;DR: we compare the result in __io_complete_rw_common() against req->cqe.res, but req->cqe.res doesn't store the full length, only the length left to be done. So when we pass the full corrected result via kiocb_done() -> __io_complete_rw_common(), the check fails.
The second problem is that we don't try to correct res in io_complete_rw(), which, for instance, might be a problem for O_DIRECT when a prefix of the data was cached in the page cache. We also definitely don't want to pass a corrected result into io_rw_done().
The fix here is to leave __io_complete_rw_common() alone, always pass the uncorrected result into it, and fix the result up as the last step just before actually finishing the I/O.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2da7a490d1c7..c0d1948fb5a6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2701,6 +2701,20 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
 	return false;
 }
 
+static inline unsigned io_fixup_rw_res(struct io_kiocb *req, unsigned res)
+{
+	struct io_async_rw *io = req->async_data;
+
+	/* add previously done IO, if any */
+	if (io && io->bytes_done > 0) {
+		if (res < 0)
+			res = io->bytes_done;
+		else
+			res += io->bytes_done;
+	}
+	return res;
+}
+
 static void io_req_task_complete(struct io_kiocb *req, bool *locked)
 {
 	unsigned int cflags = io_put_rw_kbuf(req);
@@ -2724,7 +2738,7 @@ static void __io_complete_rw(struct io_kiocb *req, long res, long res2,
 {
 	if (__io_complete_rw_common(req, res))
 		return;
-	__io_req_complete(req, issue_flags, req->result, io_put_rw_kbuf(req));
+	__io_req_complete(req, issue_flags, io_fixup_rw_res(req, res), io_put_rw_kbuf(req));
 }
 
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
@@ -2733,7 +2747,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 
 	if (__io_complete_rw_common(req, res))
 		return;
-	req->result = res;
+	req->result = io_fixup_rw_res(req, res);
 	req->io_task_work.func = io_req_task_complete;
 	io_req_task_work_add(req);
 }
@@ -2979,15 +2993,6 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret,
 		       unsigned int issue_flags)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
-	struct io_async_rw *io = req->async_data;
-
-	/* add previously done IO, if any */
-	if (io && io->bytes_done > 0) {
-		if (ret < 0)
-			ret = io->bytes_done;
-		else
-			ret += io->bytes_done;
-	}
 
 	if (req->flags & REQ_F_CUR_POS)
 		req->file->f_pos = kiocb->ki_pos;
@@ -3004,6 +3009,7 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret,
 			unsigned int cflags = io_put_rw_kbuf(req);
 			struct io_ring_ctx *ctx = req->ctx;
 
+			ret = io_fixup_rw_res(req, ret);
 			req_set_fail(req);
 			if (!(issue_flags & IO_URING_F_NONBLOCK)) {
 				mutex_lock(&ctx->uring_lock);
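To make the bookkeeping concrete (hypothetical numbers; in this 5.15 code req->result plays the role of the req->cqe.res mentioned in the upstream commit message): a 16KiB read that gets 4KiB on the first attempt stashes bytes_done = 4096 and retries the remaining 12KiB. A small user-space model of what the new helper does, written with signed types as in the follow-up type fix below:

#include <stdio.h>

/* User-space model of io_fixup_rw_res(): bytes completed before a retry are
 * stored in bytes_done and must be folded back in just before the request
 * is completed. */
static long fixup_rw_res(long bytes_done, long res)
{
	if (bytes_done > 0) {
		if (res < 0)		/* failed retry: report what did complete */
			res = bytes_done;
		else			/* successful retry: report the total */
			res += bytes_done;
	}
	return res;
}

int main(void)
{
	printf("%ld\n", fixup_rw_res(4096, 12288));	/* 16384: the full 16KiB read */
	printf("%ld\n", fixup_rw_res(4096, -5));	/* 4096: retry failed with -EIO */
	return 0;
}

Doing this fixup only as the very last step keeps __io_complete_rw_common() comparing the raw per-attempt result, which is what the link-breakage check needs.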
[ upstream commit 62bb0647b14646fa6c9aa25ecdf67ad18f13523c ]
Kernel test robot reports that we test negativity of an unsigned in io_fixup_rw_res() after a recent change, which masks error codes and messes up the return value in case a retried I/O fails with an error.
Fixes: 4d9cb92ca41dd ("io_uring/rw: fix short rw error handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9754a0970af1861e7865f9014f735c70dc60bf79.166307158...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c0d1948fb5a6..f667197c8bb9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2701,7 +2701,7 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
 	return false;
 }
 
-static inline unsigned io_fixup_rw_res(struct io_kiocb *req, unsigned res)
+static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
 {
 	struct io_async_rw *io = req->async_data;
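The robot's complaint is plain C integer semantics: with an unsigned parameter the error branch in the helper can never be taken, so a failed retry gets bytes_done added to its error code instead of having the error preserved. A tiny stand-alone illustration (hypothetical, not kernel code):

#include <stdio.h>
#include <errno.h>

int main(void)
{
	unsigned res = -EAGAIN;	/* what the helper might see from a failed retry */

	/* The comparison below is always false for an unsigned type: -11 is
	 * converted to a huge positive value, which is exactly the warning
	 * the kernel test robot reported. */
	printf("res = %u, res < 0 evaluates to %d\n", res, res < 0);
	return 0;
}

Switching the helper to signed types restores the intended "replace an error with the bytes already done" behaviour.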
[ upstream commit bf68b5b34311ee57ed40749a1257a30b46127556 ]
req->cqe.res is set in io_read() to the number of bytes left to be done, which is used to figure out whether to fail a read or not. However, io_read() may do another read attempt without returning, and we stash the previous value into ->bytes_done but forget to update cqe.res. Then we ask a read to do strictly less than cqe.res but expect the return to be exactly cqe.res.
Fix the bug by updating cqe.res for retries.
Cc: stable@vger.kernel.org
Reported-and-Tested-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.166423573...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f667197c8bb9..837dd96228b2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3606,6 +3606,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return -EAGAIN;
 	}
 
+	req->result = iov_iter_count(iter);
 	/*
 	 * Now retry read with the IOCB_WAITQ parts set in the iocb. If
 	 * we get -EIOCBQUEUED, then we'll get a notification when the
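A small user-space model of the failure mode (hypothetical numbers): req->result is what the completion side compares the raw return value against to decide whether a read came up short, so after a retry it has to describe the current, shrunken iterator rather than the original request:

#include <stdio.h>

int main(void)
{
	long result = 16384;	/* full request length, set at submission */
	long retry_len = 12288;	/* bytes the retried attempt is asked for */
	long retry_res = 12288;	/* the retry completes everything asked of it */

	/* Before the fix: result still holds the original length, so a fully
	 * successful retry is mistaken for a short read and breaks the link. */
	printf("without fix: %s\n", retry_res == result ? "ok" : "looks short");

	/* The one-line fix: req->result = iov_iter_count(iter) before retrying. */
	result = retry_len;
	printf("with fix:    %s\n", retry_res == result ? "ok" : "looks short");
	return 0;
}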
On 10/17/22 10:37, Greg KH wrote:
> On Sun, Oct 16, 2022 at 10:42:53PM +0100, Pavel Begunkov wrote:
>> io_uring backports for stable 5.15
>
> How about 5.19?

Looks there were no conflicts for 5.19 and you applied them yesterday.
And thanks for queuing all backports!
If you wait a week or so, 5.19 will be end-of-life, so maybe we can just not worry about it :)
On Mon, Oct 17, 2022 at 02:08:57PM +0100, Pavel Begunkov wrote:
> On 10/17/22 10:37, Greg KH wrote:
>> On Sun, Oct 16, 2022 at 10:42:53PM +0100, Pavel Begunkov wrote:
>>> io_uring backports for stable 5.15
>>
>> How about 5.19?
>
> Looks there were no conflicts for 5.19 and you applied them yesterday.

Ah, you are right, sorry about that, too many patches floating around
right now...
greg k-h