Add support for the POLLFREE flag to force complete iocb inline in aio_poll_wake(). A thread may use it to signal it's exit and/or request to cleanup while pending poll request. In this case, aio_poll_wake() needs to make sure it doesn't keep any reference to the queue entry before returning from wake to avoid possible use after free via poll_cancel() path.
UAF issue was found during binder and aio interactions in certain sequence of events [1].
The POLLFREE flag is no more exclusive to the epoll and is being shared with the aio. Remove comment from poll.h to avoid confusion.
[1] https://lore.kernel.org/r/CAKUd0B_TCXRY4h1hTztfwWbNSFQqsudDLn2S_28csgWZmZAG3...
Fixes: af5c72b1fc7a ("Fix aio_poll() races") Signed-off-by: Ramji Jiyani ramjiyani@google.com Reviewed-by: Jeff Moyer jmoyer@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org # 4.19+ --- Changes since v1: - Removed parenthesis around POLLFREE macro definition as per review. - Updated description to refer UAF issue discussion this patch fixes. - Updated description to remove reference to parenthesis change. - Added Reviewed-by from Jeff Moyer
Changes since v2: - Added Fixes tag. - Added stable tag for backporting on 4.19+ LTS releases
Changes since v3: - Updated patch description - Updated Fixes tag to issue manifestation origin
Changes since v4: - Added Reviewed-by from Christoph Hellwig --- fs/aio.c | 45 ++++++++++++++++++--------------- include/uapi/asm-generic/poll.h | 2 +- 2 files changed, 26 insertions(+), 21 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 51b08ab01dff..5d539c05df42 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1674,6 +1674,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, { struct poll_iocb *req = container_of(wait, struct poll_iocb, wait); struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll); + struct kioctx *ctx = iocb->ki_ctx; __poll_t mask = key_to_poll(key); unsigned long flags;
@@ -1683,29 +1684,33 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
list_del_init(&req->wait.entry);
- if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) { - struct kioctx *ctx = iocb->ki_ctx; + /* + * Use irqsave/irqrestore because not all filesystems (e.g. fuse) + * call this function with IRQs disabled and because IRQs have to + * be disabled before ctx_lock is obtained. + */ + if (mask & POLLFREE) { + /* Force complete iocb inline to remove refs to deleted entry */ + spin_lock_irqsave(&ctx->ctx_lock, flags); + } else if (!(mask && spin_trylock_irqsave(&ctx->ctx_lock, flags))) { + /* Can't complete iocb inline; schedule for later */ + schedule_work(&req->work); + return 1; + }
- /* - * Try to complete the iocb inline if we can. Use - * irqsave/irqrestore because not all filesystems (e.g. fuse) - * call this function with IRQs disabled and because IRQs - * have to be disabled before ctx_lock is obtained. - */ - list_del(&iocb->ki_list); - iocb->ki_res.res = mangle_poll(mask); - req->done = true; - if (iocb->ki_eventfd && eventfd_signal_allowed()) { - iocb = NULL; - INIT_WORK(&req->work, aio_poll_put_work); - schedule_work(&req->work); - } - spin_unlock_irqrestore(&ctx->ctx_lock, flags); - if (iocb) - iocb_put(iocb); - } else { + /* complete iocb inline */ + list_del(&iocb->ki_list); + iocb->ki_res.res = mangle_poll(mask); + req->done = true; + if (iocb->ki_eventfd && eventfd_signal_allowed()) { + iocb = NULL; + INIT_WORK(&req->work, aio_poll_put_work); schedule_work(&req->work); } + spin_unlock_irqrestore(&ctx->ctx_lock, flags); + if (iocb) + iocb_put(iocb); + return 1; }
diff --git a/include/uapi/asm-generic/poll.h b/include/uapi/asm-generic/poll.h index 41b509f410bf..f9c520ce4bf4 100644 --- a/include/uapi/asm-generic/poll.h +++ b/include/uapi/asm-generic/poll.h @@ -29,7 +29,7 @@ #define POLLRDHUP 0x2000 #endif
-#define POLLFREE (__force __poll_t)0x4000 /* currently only for epoll */ +#define POLLFREE (__force __poll_t)0x4000
#define POLL_BUSY_LOOP (__force __poll_t)0x8000
On Wed, Oct 27, 2021 at 01:18:34AM +0000, Ramji Jiyani wrote:
Add support for the POLLFREE flag to force complete iocb inline in aio_poll_wake(). A thread may use it to signal it's exit and/or request to cleanup while pending poll request. In this case, aio_poll_wake() needs to make sure it doesn't keep any reference to the queue entry before returning from wake to avoid possible use after free via poll_cancel() path.
UAF issue was found during binder and aio interactions in certain sequence of events [1].
The POLLFREE flag is no more exclusive to the epoll and is being shared with the aio. Remove comment from poll.h to avoid confusion.
[1] https://lore.kernel.org/r/CAKUd0B_TCXRY4h1hTztfwWbNSFQqsudDLn2S_28csgWZmZAG3...
Fixes: af5c72b1fc7a ("Fix aio_poll() races") Signed-off-by: Ramji Jiyani ramjiyani@google.com Reviewed-by: Jeff Moyer jmoyer@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org # 4.19+
Looks good, feel free to add:
Reviewed-by: Eric Biggers ebiggers@google.com
I'm still not 100% happy with the commit message, but it's good enough. The actual code looks correct.
Who is going to take this patch? This is an important fix; it shouldn't be sitting ignored for months. get_maintainer.pl shows:
$ ./scripts/get_maintainer.pl fs/aio.c Benjamin LaHaise bcrl@kvack.org (supporter:AIO) Alexander Viro viro@zeniv.linux.org.uk (maintainer:FILESYSTEMS (VFS and infrastructure)) linux-aio@kvack.org (open list:AIO) linux-fsdevel@vger.kernel.org (open list:FILESYSTEMS (VFS and infrastructure)) linux-kernel@vger.kernel.org (open list)
- Eric
On Tue, Nov 23, 2021 at 11:49:54AM -0800, Eric Biggers wrote:
On Wed, Oct 27, 2021 at 01:18:34AM +0000, Ramji Jiyani wrote:
Add support for the POLLFREE flag to force complete iocb inline in aio_poll_wake(). A thread may use it to signal it's exit and/or request to cleanup while pending poll request. In this case, aio_poll_wake() needs to make sure it doesn't keep any reference to the queue entry before returning from wake to avoid possible use after free via poll_cancel() path.
UAF issue was found during binder and aio interactions in certain sequence of events [1].
The POLLFREE flag is no more exclusive to the epoll and is being shared with the aio. Remove comment from poll.h to avoid confusion.
[1] https://lore.kernel.org/r/CAKUd0B_TCXRY4h1hTztfwWbNSFQqsudDLn2S_28csgWZmZAG3...
Fixes: af5c72b1fc7a ("Fix aio_poll() races") Signed-off-by: Ramji Jiyani ramjiyani@google.com Reviewed-by: Jeff Moyer jmoyer@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Cc: stable@vger.kernel.org # 4.19+
Looks good, feel free to add:
Reviewed-by: Eric Biggers ebiggers@google.com
I'm still not 100% happy with the commit message, but it's good enough. The actual code looks correct.
Who is going to take this patch? This is an important fix; it shouldn't be sitting ignored for months. get_maintainer.pl shows:
$ ./scripts/get_maintainer.pl fs/aio.c Benjamin LaHaise bcrl@kvack.org (supporter:AIO) Alexander Viro viro@zeniv.linux.org.uk (maintainer:FILESYSTEMS (VFS and infrastructure)) linux-aio@kvack.org (open list:AIO) linux-fsdevel@vger.kernel.org (open list:FILESYSTEMS (VFS and infrastructure)) linux-kernel@vger.kernel.org (open list)
Actually, there is a bug in this patch -- it creates a lock inversion between ctx->ctx_lock (kioctx::ctx_lock) and req->head->lock (wait_queue_head::lock).
Task 1: signalfd_cleanup() -> wake_up_poll() [takes wait_queue_head::lock] -> aio_poll_wake() [takes kioctx::ctx_lock]
Task 2: sys_io_cancel() [takes kioctx::ctx_lock] -> aio_poll_cancel [takes wait_queue_head::lock]
Previously this was okay because the lock operation in aio_poll_wake() was only a trylock. This patch changes it to a regular lock, which causes a deadlock.
I am able to reproduce this deadlock. It also generates a lockdep report, shown below. Unfortunately, I don't know how to fix it. Anyone have any ideas? Al and Christoph, it looks like you wrote most of the aio poll code?
Note, the use-after-free this patch is fixing also affects signalfd, not just binder, since both rely on POLLFREE. (I was testing it with signalfd.) So we really need to fix it one way or another...
====================================================== WARNING: possible circular locking dependency detected 5.16.0-rc2-00001-gf97efc5c03bf #22 Not tainted ------------------------------------------------------ aio/137 is trying to acquire lock: ffff888006170158 (&ctx->ctx_lock){..-.}-{2:2}, at: aio_poll_wake+0x1ac/0x390 fs/aio.c:1693
but task is already holding lock: ffff8880053a91e0 (&sighand->signalfd_wqh){....}-{2:2}, at: __wake_up_common_lock+0x5b/0xb0 kernel/sched/wait.c:137
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sighand->signalfd_wqh){....}-{2:2}: __lock_acquire+0x4b4/0x960 kernel/locking/lockdep.c:5027 lock_acquire kernel/locking/lockdep.c:5637 [inline] lock_acquire+0xc9/0x2e0 kernel/locking/lockdep.c:5602 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:349 [inline] aio_poll.constprop.0+0x15d/0x440 fs/aio.c:1773 __io_submit_one.constprop.0+0x139/0x1b0 fs/aio.c:1847 io_submit_one+0x134/0x640 fs/aio.c:1884 __do_sys_io_submit fs/aio.c:1943 [inline] __se_sys_io_submit fs/aio.c:1913 [inline] __x64_sys_io_submit+0x89/0x260 fs/aio.c:1913 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (&ctx->ctx_lock){..-.}-{2:2}: check_prev_add+0x93/0xbf0 kernel/locking/lockdep.c:3063 check_prevs_add kernel/locking/lockdep.c:3186 [inline] validate_chain+0x585/0x8c0 kernel/locking/lockdep.c:3801 __lock_acquire+0x4b4/0x960 kernel/locking/lockdep.c:5027 lock_acquire kernel/locking/lockdep.c:5637 [inline] lock_acquire+0xc9/0x2e0 kernel/locking/lockdep.c:5602 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x3e/0x60 kernel/locking/spinlock.c:162 aio_poll_wake+0x1ac/0x390 fs/aio.c:1693 __wake_up_common+0x8c/0x1a0 kernel/sched/wait.c:108 __wake_up_common_lock+0x77/0xb0 kernel/sched/wait.c:138 __wake_up+0xe/0x10 kernel/sched/wait.c:157 signalfd_cleanup+0x33/0x40 fs/signalfd.c:48 __cleanup_sighand kernel/fork.c:1613 [inline] __cleanup_sighand+0x27/0x50 kernel/fork.c:1610 __exit_signal+0x236/0x380 kernel/exit.c:159 release_task+0x180/0x3d0 kernel/exit.c:200 wait_task_zombie+0x28a/0x600 kernel/exit.c:1114 wait_consider_task+0x121/0x160 kernel/exit.c:1341 do_wait_thread kernel/exit.c:1404 [inline] do_wait+0x21b/0x380 kernel/exit.c:1521 kernel_wait4+0xaa/0x150 kernel/exit.c:1684 __do_sys_wait4+0x85/0x90 kernel/exit.c:1712 __se_sys_wait4 kernel/exit.c:1708 [inline] __x64_sys_wait4+0x17/0x20 kernel/exit.c:1708 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock(&sighand->signalfd_wqh); lock(&ctx->ctx_lock); lock(&sighand->signalfd_wqh); lock(&ctx->ctx_lock);
*** DEADLOCK ***
2 locks held by aio/137: #0: ffffffff81e06098 (tasklist_lock){.+.+}-{2:2}, at: release_task+0x110/0x3d0 kernel/exit.c:197 #1: ffff8880053a91e0 (&sighand->signalfd_wqh){....}-{2:2}, at: __wake_up_common_lock+0x5b/0xb0 kernel/sched/wait.c:137
stack backtrace: CPU: 3 PID: 137 Comm: aio Not tainted 5.16.0-rc2-00001-gf97efc5c03bf #22 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014 Call Trace: <TASK> show_stack+0x3d/0x3f arch/x86/kernel/dumpstack.c:318 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x49/0x5e lib/dump_stack.c:106 dump_stack+0x10/0x12 lib/dump_stack.c:113 print_circular_bug.cold+0x13e/0x143 kernel/locking/lockdep.c:2021 check_noncircular+0xfe/0x110 kernel/locking/lockdep.c:2143 check_prev_add+0x93/0xbf0 kernel/locking/lockdep.c:3063 check_prevs_add kernel/locking/lockdep.c:3186 [inline] validate_chain+0x585/0x8c0 kernel/locking/lockdep.c:3801 __lock_acquire+0x4b4/0x960 kernel/locking/lockdep.c:5027 lock_acquire kernel/locking/lockdep.c:5637 [inline] lock_acquire+0xc9/0x2e0 kernel/locking/lockdep.c:5602 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x3e/0x60 kernel/locking/spinlock.c:162 aio_poll_wake+0x1ac/0x390 fs/aio.c:1693 __wake_up_common+0x8c/0x1a0 kernel/sched/wait.c:108 __wake_up_common_lock+0x77/0xb0 kernel/sched/wait.c:138 __wake_up+0xe/0x10 kernel/sched/wait.c:157 signalfd_cleanup+0x33/0x40 fs/signalfd.c:48 __cleanup_sighand kernel/fork.c:1613 [inline] __cleanup_sighand+0x27/0x50 kernel/fork.c:1610 __exit_signal+0x236/0x380 kernel/exit.c:159 release_task+0x180/0x3d0 kernel/exit.c:200 wait_task_zombie+0x28a/0x600 kernel/exit.c:1114 wait_consider_task+0x121/0x160 kernel/exit.c:1341 do_wait_thread kernel/exit.c:1404 [inline] do_wait+0x21b/0x380 kernel/exit.c:1521 kernel_wait4+0xaa/0x150 kernel/exit.c:1684 __do_sys_wait4+0x85/0x90 kernel/exit.c:1712 __se_sys_wait4 kernel/exit.c:1708 [inline] __x64_sys_wait4+0x17/0x20 kernel/exit.c:1708 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f23a3b0e9ea Code: ff e9 0a 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 3d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 89 54 24 14 RSP: 002b:00007ffcd0926098 EFLAGS: 00000246 ORIG_RAX: 000000000000003d RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f23a3b0e9ea RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff RBP: 000055944067b000 R08: fffffffe7fffffff R09: fffffffe7fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 00005594406761c0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK>
linux-stable-mirror@lists.linaro.org