This series fixes some bugs in aio poll. It is intended to replace "[PATCH v5] aio: Add support for the POLLFREE" (https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com), which has bugs of its own.
Careful review is appreciated; this code is very hard to work with, and I don't know of an easy way to test it.
Note, it looks like io_uring has these same bugs too. I haven't tried to fix io_uring.
This applies to v5.16-rc3.
Eric Biggers (2):
  aio: keep poll requests on waitqueue until completed
  aio: fix use-after-free due to missing POLLFREE handling
 fs/aio.c                        | 138 +++++++++++++++++++++++++-------
 include/uapi/asm-generic/poll.h |   2 +-
 2 files changed, 111 insertions(+), 29 deletions(-)
base-commit: d58071a8a76d779eedab38033ae4c821c30295a5
From: Eric Biggers <ebiggers@google.com>
Currently, aio_poll_wake() will always remove the poll request from the waitqueue. Then, if aio_poll_complete_work() sees that none of the polled events are ready and the request isn't cancelled, it re-adds the request to the waitqueue. (This can easily happen when polling a file that doesn't pass an event mask when waking up its waitqueue.)
This is fundamentally broken for two reasons:
1. If a wakeup occurs between vfs_poll() and the request being re-added to the waitqueue, it will be missed because the request wasn't on the waitqueue at the time. Therefore, IOCB_CMD_POLL might never complete even if the polled file is ready.
2. When the request isn't on the waitqueue, there is no way to be notified that the waitqueue is being freed (which happens when its lifetime is shorter than the struct file's). This is supposed to happen via the waitqueue entries being woken up with POLLFREE.
Therefore, leave the requests on the waitqueue until they are actually completed or cancelled. To keep track of when aio_poll_complete_work needs to be scheduled, use new fields in struct poll_iocb. Remove the 'done' field which is now redundant.
Note that this is consistent with how sys_poll() and eventpoll work; their wakeup functions do *not* remove the waitqueue entries.
Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/aio.c | 60 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 44 insertions(+), 16 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 9c81cf611d659..b72beeaed0631 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -181,8 +181,9 @@ struct poll_iocb {
         struct file *file;
         struct wait_queue_head *head;
         __poll_t events;
-        bool done;
         bool cancelled;
+        bool work_scheduled;
+        bool work_need_resched;
         struct wait_queue_entry wait;
         struct work_struct work;
 };
@@ -1638,14 +1639,23 @@ static void aio_poll_complete_work(struct work_struct *work)
          * avoid further branches in the fast path.
          */
         spin_lock_irq(&ctx->ctx_lock);
+        spin_lock(&req->head->lock);
         if (!mask && !READ_ONCE(req->cancelled)) {
-                add_wait_queue(req->head, &req->wait);
+                /* Reschedule completion if another wakeup came in. */
+                if (req->work_need_resched) {
+                        schedule_work(&req->work);
+                        req->work_need_resched = false;
+                } else {
+                        req->work_scheduled = false;
+                }
+                spin_unlock(&req->head->lock);
                 spin_unlock_irq(&ctx->ctx_lock);
                 return;
         }
+        list_del_init(&req->wait.entry);
+        spin_unlock(&req->head->lock);
         list_del_init(&iocb->ki_list);
         iocb->ki_res.res = mangle_poll(mask);
-        req->done = true;
         spin_unlock_irq(&ctx->ctx_lock);
 
         iocb_put(iocb);
@@ -1680,20 +1690,24 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
         if (mask && !(mask & req->events))
                 return 0;
 
-        list_del_init(&req->wait.entry);
-
+        /*
+         * Complete the iocb inline if possible. This requires that two
+         * conditions be met:
+         *   1. The event mask must have been passed. If a regular wakeup was
+         *      done instead, then mask == 0 and we have to call vfs_poll() to
+         *      get the events, so inline completion isn't possible.
+         *   2. ctx_lock must not be busy. We have to use trylock because we
+         *      already hold the waitqueue lock, so this inverts the normal
+         *      locking order. Use irqsave/irqrestore because not all
+         *      filesystems (e.g. fuse) call this function with IRQs disabled,
+         *      yet IRQs have to be disabled before ctx_lock is obtained.
+         */
         if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
                 struct kioctx *ctx = iocb->ki_ctx;
 
-                /*
-                 * Try to complete the iocb inline if we can. Use
-                 * irqsave/irqrestore because not all filesystems (e.g. fuse)
-                 * call this function with IRQs disabled and because IRQs
-                 * have to be disabled before ctx_lock is obtained.
-                 */
+                list_del_init(&req->wait.entry);
                 list_del(&iocb->ki_list);
                 iocb->ki_res.res = mangle_poll(mask);
-                req->done = true;
                 if (iocb->ki_eventfd && eventfd_signal_allowed()) {
                         iocb = NULL;
                         INIT_WORK(&req->work, aio_poll_put_work);
@@ -1703,7 +1717,16 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
                 if (iocb)
                         iocb_put(iocb);
         } else {
-                schedule_work(&req->work);
+                /*
+                 * Schedule the completion work if needed. If it was already
+                 * scheduled, record that another wakeup came in.
+                 */
+                if (req->work_scheduled) {
+                        req->work_need_resched = true;
+                } else {
+                        schedule_work(&req->work);
+                        req->work_scheduled = true;
+                }
         }
         return 1;
 }
@@ -1750,8 +1773,9 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
         req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
 
         req->head = NULL;
-        req->done = false;
         req->cancelled = false;
+        req->work_scheduled = false;
+        req->work_need_resched = false;
 
         apt.pt._qproc = aio_poll_queue_proc;
         apt.pt._key = req->events;
@@ -1766,17 +1790,21 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
         spin_lock_irq(&ctx->ctx_lock);
         if (likely(req->head)) {
                 spin_lock(&req->head->lock);
-                if (unlikely(list_empty(&req->wait.entry))) {
+                if (req->work_scheduled) {
+                        /* async completion work was already scheduled */
                         if (apt.error)
                                 cancel = true;
                         apt.error = 0;
                         mask = 0;
                 }
                 if (mask || apt.error) {
+                        /* steal to complete synchronously */
                         list_del_init(&req->wait.entry);
                 } else if (cancel) {
+                        /* cancel if possible (may be too late though) */
                         WRITE_ONCE(req->cancelled, true);
-                } else if (!req->done) { /* actually waiting for an event */
+                } else if (!list_empty(&req->wait.entry)) {
+                        /* actually waiting for an event */
                         list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
                         aiocb->ki_cancel = aio_poll_cancel;
                 }
On Fri, Dec 03, 2021 at 04:23:00PM -0800, Eric Biggers wrote:
> @@ -1680,20 +1690,24 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>          if (mask && !(mask & req->events))
>                  return 0;
>  
> -        list_del_init(&req->wait.entry);
> -
> +        /*
> +         * Complete the iocb inline if possible. This requires that two
> +         * conditions be met:
> +         *   1. The event mask must have been passed. If a regular wakeup was
> +         *      done instead, then mask == 0 and we have to call vfs_poll() to
> +         *      get the events, so inline completion isn't possible.
> +         *   2. ctx_lock must not be busy. We have to use trylock because we
> +         *      already hold the waitqueue lock, so this inverts the normal
> +         *      locking order. Use irqsave/irqrestore because not all
> +         *      filesystems (e.g. fuse) call this function with IRQs disabled,
> +         *      yet IRQs have to be disabled before ctx_lock is obtained.
> +         */
>          if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
>                  struct kioctx *ctx = iocb->ki_ctx;
>  
> -                /*
> -                 * Try to complete the iocb inline if we can. Use
> -                 * irqsave/irqrestore because not all filesystems (e.g. fuse)
> -                 * call this function with IRQs disabled and because IRQs
> -                 * have to be disabled before ctx_lock is obtained.
> -                 */
> +                list_del_init(&req->wait.entry);
>                  list_del(&iocb->ki_list);
>                  iocb->ki_res.res = mangle_poll(mask);
> -                req->done = true;
>                  if (iocb->ki_eventfd && eventfd_signal_allowed()) {
>                          iocb = NULL;
>                          INIT_WORK(&req->work, aio_poll_put_work);
> @@ -1703,7 +1717,16 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>                  if (iocb)
>                          iocb_put(iocb);
>          } else {
I think I missed something here. Now that the request is left on the waitqueue, there needs to be a third condition for completing the iocb inline: the completion work must not have already been scheduled.
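Concretely (just sketching the direction a v2 might take, not the final code), the inline-completion check would become something like:

        if (mask && !req->work_scheduled &&
            spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {

with the comment above it gaining a matching third item about work_scheduled.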
- Eric
From: Eric Biggers <ebiggers@google.com>
signalfd_poll() and binder_poll() are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
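For reference, the signalfd side of that contract looks roughly like this at cleanup time (paraphrased from fs/signalfd.c, not quoted verbatim):

        void signalfd_cleanup(struct sighand_struct *sighand)
        {
                wait_queue_head_t *wqh = &sighand->signalfd_wqh;

                /* Lockless check; racing removers run under rcu_read_lock(),
                   and sighand_cachep is SLAB_TYPESAFE_BY_RCU. */
                if (likely(!waitqueue_active(wqh)))
                        return;

                /* Each wakeup function is expected to dequeue itself on POLLFREE. */
                wake_up_poll(wqh, EPOLLHUP | POLLFREE);
        }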
Unfortunately, only eventpoll handles POLLFREE. A second type of non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't handle POLLFREE. This allows a use-after-free to occur if a signalfd or binder fd is polled with aio poll, and the waitqueue gets freed.
Fix this by making aio poll handle POLLFREE.
A patch by Ramji Jiyani <ramjiyani@google.com> (https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com) tried to do this by making aio_poll_wake() always complete the request inline if POLLFREE is seen. However, that solution had two bugs. First, it introduced a deadlock, as it unconditionally locked the aio context while holding the waitqueue lock, which inverts the normal locking order. Second, it didn't consider that POLLFREE notifications are missed while the request has been temporarily de-queued.
The second problem was solved by my previous patch. This patch then properly fixes the use-after-free by handling POLLFREE in a deadlock-free way. It does this by taking advantage of the fact that freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.
Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/aio.c                        | 102 ++++++++++++++++++++++++--------
 include/uapi/asm-generic/poll.h |   2 +-
 2 files changed, 79 insertions(+), 25 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index b72beeaed0631..8daafb95b6801 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1620,6 +1620,50 @@ static void aio_poll_put_work(struct work_struct *work)
         iocb_put(iocb);
 }
 
+/*
+ * Safely lock the waitqueue which the request is on, taking into account the
+ * fact that the waitqueue may be freed at any time before the lock is taken.
+ *
+ * Returns true on success, meaning that req->head->lock was locked and
+ * req->wait is on req->head. Returns false if the request has already been
+ * removed from its waitqueue (which might no longer exist).
+ */
+static bool poll_iocb_lock_wq(struct poll_iocb *req)
+{
+        /*
+         * Holding req->head->lock prevents req->wait being removed from the
+         * waitqueue, and thus prevents the waitqueue from being freed
+         * (POLLFREE) in the rare cases of ->poll() implementations where the
+         * waitqueue may not live as long as the struct file. However, taking
+         * req->head->lock in the first place can race with POLLFREE.
+         *
+         * We solve this in a similar way as eventpoll does: by taking advantage
+         * of the fact that all users of POLLFREE will RCU-delay the actual
+         * free. If we enter rcu_read_lock() and check that req->wait isn't
+         * already removed, then req->head is guaranteed to be accessible until
+         * rcu_read_unlock(). We can then lock req->head->lock and re-check
+         * whether req->wait has been removed or not.
+         */
+        rcu_read_lock();
+        if (list_empty_careful(&req->wait.entry)) {
+                rcu_read_unlock();
+                return false;
+        }
+        spin_lock(&req->head->lock);
+        if (list_empty(&req->wait.entry)) {
+                spin_unlock(&req->head->lock);
+                rcu_read_unlock();
+                return false;
+        }
+        rcu_read_unlock();
+        return true;
+}
+
+static void poll_iocb_unlock_wq(struct poll_iocb *req)
+{
+        spin_unlock(&req->head->lock);
+}
+
 static void aio_poll_complete_work(struct work_struct *work)
 {
         struct poll_iocb *req = container_of(work, struct poll_iocb, work);
@@ -1639,21 +1683,22 @@ static void aio_poll_complete_work(struct work_struct *work)
          * avoid further branches in the fast path.
          */
         spin_lock_irq(&ctx->ctx_lock);
-        spin_lock(&req->head->lock);
-        if (!mask && !READ_ONCE(req->cancelled)) {
-                /* Reschedule completion if another wakeup came in. */
-                if (req->work_need_resched) {
-                        schedule_work(&req->work);
-                        req->work_need_resched = false;
-                } else {
-                        req->work_scheduled = false;
+        if (poll_iocb_lock_wq(req)) {
+                if (!mask && !READ_ONCE(req->cancelled)) {
+                        /* Reschedule completion if another wakeup came in. */
+                        if (req->work_need_resched) {
+                                schedule_work(&req->work);
+                                req->work_need_resched = false;
+                        } else {
+                                req->work_scheduled = false;
+                        }
+                        poll_iocb_unlock_wq(req);
+                        spin_unlock_irq(&ctx->ctx_lock);
+                        return;
                 }
-                spin_unlock(&req->head->lock);
-                spin_unlock_irq(&ctx->ctx_lock);
-                return;
-        }
-        list_del_init(&req->wait.entry);
-        spin_unlock(&req->head->lock);
+                list_del_init(&req->wait.entry);
+                poll_iocb_unlock_wq(req);
+        } /* else, POLLFREE has freed the waitqueue, so we must complete */
         list_del_init(&iocb->ki_list);
         iocb->ki_res.res = mangle_poll(mask);
         spin_unlock_irq(&ctx->ctx_lock);
@@ -1667,13 +1712,12 @@ static int aio_poll_cancel(struct kiocb *iocb)
         struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
         struct poll_iocb *req = &aiocb->poll;
 
-        spin_lock(&req->head->lock);
-        WRITE_ONCE(req->cancelled, true);
-        if (!list_empty(&req->wait.entry)) {
+        if (poll_iocb_lock_wq(req)) {
+                WRITE_ONCE(req->cancelled, true);
                 list_del_init(&req->wait.entry);
-                schedule_work(&aiocb->poll.work);
-        }
-        spin_unlock(&req->head->lock);
+                schedule_work(&req->work);
+                poll_iocb_unlock_wq(req);
+        } /* else, the request was force-cancelled by POLLFREE already */
 
         return 0;
 }
@@ -1717,6 +1761,14 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
                 if (iocb)
                         iocb_put(iocb);
         } else {
+                if (mask & POLLFREE) {
+                        /*
+                         * The waitqueue is going to be freed, so remove the
+                         * request from the waitqueue and mark it as cancelled.
+                         */
+                        list_del_init(&req->wait.entry);
+                        WRITE_ONCE(req->cancelled, true);
+                }
                 /*
                  * Schedule the completion work if needed. If it was already
                  * scheduled, record that another wakeup came in.
@@ -1789,8 +1841,9 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
         mask = vfs_poll(req->file, &apt.pt) & req->events;
         spin_lock_irq(&ctx->ctx_lock);
         if (likely(req->head)) {
-                spin_lock(&req->head->lock);
-                if (req->work_scheduled) {
+                bool on_queue = poll_iocb_lock_wq(req);
+
+                if (!on_queue || req->work_scheduled) {
                         /* async completion work was already scheduled */
                         if (apt.error)
                                 cancel = true;
@@ -1803,12 +1856,13 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
                 } else if (cancel) {
                         /* cancel if possible (may be too late though) */
                         WRITE_ONCE(req->cancelled, true);
-                } else if (!list_empty(&req->wait.entry)) {
+                } else if (on_queue) {
                         /* actually waiting for an event */
                         list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
                         aiocb->ki_cancel = aio_poll_cancel;
                 }
-                spin_unlock(&req->head->lock);
+                if (on_queue)
+                        poll_iocb_unlock_wq(req);
         }
         if (mask) { /* no async, we'd stolen it */
                 aiocb->ki_res.res = mangle_poll(mask);
diff --git a/include/uapi/asm-generic/poll.h b/include/uapi/asm-generic/poll.h
index 41b509f410bf9..f9c520ce4bf4e 100644
--- a/include/uapi/asm-generic/poll.h
+++ b/include/uapi/asm-generic/poll.h
@@ -29,7 +29,7 @@
 #define POLLRDHUP 0x2000
 #endif
 
-#define POLLFREE (__force __poll_t)0x4000 /* currently only for epoll */
+#define POLLFREE (__force __poll_t)0x4000
 
 #define POLL_BUSY_LOOP (__force __poll_t)0x8000
On Fri, Dec 3, 2021 at 4:23 PM Eric Biggers <ebiggers@kernel.org> wrote:
> require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
Ugh.
I hate POLLFREE, and the more I look at this, the more I think it's broken.
And that
wake_up_poll(wq, EPOLLHUP | POLLFREE);
in particular looks broken - the intent is that it should remove all the wait queue entries (because the wait queue head is going away), but wake_up_poll() itself actually does
__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
where that '1' is the number of exclusive entries it will wake up.
So if there are two exclusive waiters, wake_up_poll() will simply stop waking things up after the first one.
Which defeats the whole POLLFREE thing too.
Maybe I'm missing something, but POLLFREE really is broken.
I'd argue that all of epoll() is broken, but I guess we're stuck with it.
Now, it's very possible that nobody actually uses exclusive waits for those wait queues, and my "nr_exclusive" argument is about something that isn't actually a bug in reality. But I think it's a sign of confusion, and it's just another issue with POLLFREE.
I really wish we could have some way to not have epoll and aio mess with the wait-queue lists and cache the wait queue head pointers that they don't own.
In the meantime, I don't think these patches make things worse, and they may fix things. But see above about "nr_exclusive" and how I think wait queue entries might end up avoiding POLLFREE handling..
Linus
On Mon, Dec 06, 2021 at 11:28:13AM -0800, Linus Torvalds wrote:
> On Fri, Dec 3, 2021 at 4:23 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
> Ugh.
> I hate POLLFREE, and the more I look at this, the more I think it's broken.
> And that
> wake_up_poll(wq, EPOLLHUP | POLLFREE);
> in particular looks broken - the intent is that it should remove all the wait queue entries (because the wait queue head is going away), but wake_up_poll() itself actually does
> __wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
> where that '1' is the number of exclusive entries it will wake up.
> So if there are two exclusive waiters, wake_up_poll() will simply stop waking things up after the first one.
> Which defeats the whole POLLFREE thing too.
> Maybe I'm missing something, but POLLFREE really is broken.
> I'd argue that all of epoll() is broken, but I guess we're stuck with it.
> Now, it's very possible that nobody actually uses exclusive waits for those wait queues, and my "nr_exclusive" argument is about something that isn't actually a bug in reality. But I think it's a sign of confusion, and it's just another issue with POLLFREE.
> I really wish we could have some way to not have epoll and aio mess with the wait-queue lists and cache the wait queue head pointers that they don't own.
> In the meantime, I don't think these patches make things worse, and they may fix things. But see above about "nr_exclusive" and how I think wait queue entries might end up avoiding POLLFREE handling..
> Linus
epoll supports exclusive waits, via the EPOLLEXCLUSIVE flag. So this looks like a real problem.
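For reference, the current definition in include/linux/wait.h is (modulo whitespace):

        #define wake_up_poll(x, m) \
                __wake_up(x, TASK_NORMAL, 1, poll_to_key(m))

where the third argument to __wake_up() is nr_exclusive, so with more than one exclusive waiter on the queue the wakeup stops after the first one, exactly as described above.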
It could be fixed by converting signalfd and binder to use something like this, right?
#define wake_up_pollfree(x) \
        __wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))
As for eliminating POLLFREE entirely, that would require that the waitqueue heads be moved to a location which has a longer lifetime. I'm not sure if that's possible. In the case of signalfd, maybe the waitqueue head could be moved to the file private data (signalfd_ctx), and then sighand_struct would contain a list of signalfd_ctx's which are receiving signals directed to that sighand_struct, rather than the waitqueue head itself. I'm not sure how well that would work. This would probably change user-visible behavior; if a signalfd is inherited by fork(), the child process would be notified about signals sent to the parent process, rather than itself as is currently the case.
- Eric
On Mon, Dec 6, 2021 at 11:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
> It could be fixed by converting signalfd and binder to use something like this, right?
> #define wake_up_pollfree(x) \
>         __wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))
Yeah, that looks better to me.
That said, maybe it would be even better to then make it more explicit about what it does, and not make it look like just another wakeup with an odd parameter.
IOW, maybe that "pollfree()" function itself could very much do the waitqueue entry removal on each entry using list_del_init(), and not expect the wakeup code for the entry to do so.
I think that kind of explicit "this removes all entries from the wait list head" is better than "let's do a wakeup and expect all entries to magically implicitly remove themselves".
After all, that "implicitly remove themselves" was what didn't happen, and caused the bug in the first place.
And all the normal cases, that don't care about POLLFREE at all, because their waitqueues don't go away from under them, wouldn't care, because "list_del_init()" still leaves a valid self-pointing list in place, so if they do list_del() afterwards, nothing happens.
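To make that concrete, here is a userspace toy model of that property (a minimal stand-in for the <linux/list.h> helpers, not the kernel code itself):

        #include <assert.h>

        /* Minimal stand-in for the kernel's doubly-linked list. */
        struct list_head { struct list_head *next, *prev; };

        static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

        static void list_add(struct list_head *new, struct list_head *head)
        {
                new->next = head->next;
                new->prev = head;
                head->next->prev = new;
                head->next = new;
        }

        static void __list_del(struct list_head *prev, struct list_head *next)
        {
                next->prev = prev;
                prev->next = next;
        }

        /* Unlink the entry but leave it pointing at itself. */
        static void list_del_init(struct list_head *entry)
        {
                __list_del(entry->prev, entry->next);
                INIT_LIST_HEAD(entry);
        }

        int main(void)
        {
                struct list_head head, entry;

                INIT_LIST_HEAD(&head);
                list_add(&entry, &head);

                /* The waitqueue owner unlinks the entry up front. */
                list_del_init(&entry);
                assert(head.next == &head && head.prev == &head);

                /* A later unlink by the waiter (what list_del() does, minus the
                   pointer poisoning) only touches the entry's own self-pointing
                   neighbours, so nothing else is affected. */
                __list_del(entry.prev, entry.next);
                assert(entry.next == &entry && entry.prev == &entry);
                return 0;
        }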
I dunno. But yes, that wake_up_pollfree() of yours certainly looks better than what we have now.
> As for eliminating POLLFREE entirely, that would require that the waitqueue heads be moved to a location which has a longer lifetime.
Yeah, the problem with aio and epoll is exactly that they end up using waitqueue heads without knowing what they are.
I'm not at all convinced that there aren't other situations where the thing the waitqueue head is embedded in might not have other lifetimes.
The *common* situation is obviously that it's associated with a file, and the file pointer ends up holding the reference to whatever device or something (global list in a loadable module, or whatever) it is.
Hmm. The poll_wait() callback function actually does get the 'struct file *' that the wait is associated with. I wonder if epoll queueing could actually increment the file ref when it creates its own wait entry, and release it at ep_remove_wait_queue()?
Maybe epoll could avoid having to remove entries entirely that way - simply by virtue of having a ref to the files - and remove the need for having the ->whead pointer entirely (and remove the need for POLLFREE handling)?
And maybe the signalfd case can do the same - instead of expecting exit() to clean up the list when sighand->count goes to zero, maybe the signalfd filp can just hold a ref to that 'struct sighand_struct', and it gets free'd whenever there are no signalfd users left?
That would involve making signalfd_ctx actually tied to one particular 'struct signal', but that might be the right thing to do regardless (instead of making it always act on 'current' like it does now).
So maybe with some re-organization, we could get rid of the need for POLLFREE entirely.. Anybody?
But your patches are certainly simpler in that they just fix the status quo.
Linus
On Mon, Dec 06, 2021 at 01:48:04PM -0800, Linus Torvalds wrote:
> On Mon, Dec 6, 2021 at 11:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
> > It could be fixed by converting signalfd and binder to use something like this, right?
> > #define wake_up_pollfree(x) \
> >         __wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))
> Yeah, that looks better to me.
> That said, maybe it would be even better to then make it more explicit about what it does, and not make it look like just another wakeup with an odd parameter.
> IOW, maybe that "pollfree()" function itself could very much do the waitqueue entry removal on each entry using list_del_init(), and not expect the wakeup code for the entry to do so.
> I think that kind of explicit "this removes all entries from the wait list head" is better than "let's do a wakeup and expect all entries to magically implicitly remove themselves".
> After all, that "implicitly remove themselves" was what didn't happen, and caused the bug in the first place.
> And all the normal cases, that don't care about POLLFREE at all, because their waitqueues don't go away from under them, wouldn't care, because "list_del_init()" still leaves a valid self-pointing list in place, so if they do list_del() afterwards, nothing happens.
> I dunno. But yes, that wake_up_pollfree() of yours certainly looks better than what we have now.
In v2 (https://lore.kernel.org/r/20211207095726.169766-1-ebiggers@kernel.org/), I did end up making wake_up_pollfree() into a function, which calls __wake_up() and also checks that the queue became empty.
However, I didn't make wake_up_pollfree() remove the queue entries itself. I don't think that would be better, given that the purpose of POLLFREE is to provide a notification that the queue is going away, not simply to remove the queue entries. In particular, we might want to cancel the poll request(s); that's what aio poll does. Also, we need to synchronize with other places where the waitqueue is accessed, e.g. to remove an entry which requires taking the waitqueue lock (see either aio poll or epoll).
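Roughly the shape it ended up with (see the linked v2 for the exact code):

        void wake_up_pollfree(struct wait_queue_head *wq_head)
        {
                __wake_up(wq_head, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE));
                /* POLLFREE handlers must have dequeued themselves by now. */
                WARN_ON_ONCE(waitqueue_active(wq_head));
        }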
> > As for eliminating POLLFREE entirely, that would require that the waitqueue heads be moved to a location which has a longer lifetime.
> Yeah, the problem with aio and epoll is exactly that they end up using waitqueue heads without knowing what they are.
> I'm not at all convinced that there aren't other situations where the thing the waitqueue head is embedded in might not have other lifetimes.
> The *common* situation is obviously that it's associated with a file, and the file pointer ends up holding the reference to whatever device or something (global list in a loadable module, or whatever) it is.
> Hmm. The poll_wait() callback function actually does get the 'struct file *' that the wait is associated with. I wonder if epoll queueing could actually increment the file ref when it creates its own wait entry, and release it at ep_remove_wait_queue()?
> Maybe epoll could avoid having to remove entries entirely that way - simply by virtue of having a ref to the files - and remove the need for having the ->whead pointer entirely (and remove the need for POLLFREE handling)?
Well, the documented user-visible behavior of epoll is that when a file is closed, it is automatically removed from all epoll instances. That precludes epoll instances from holding references to the files which they are polling.
And the reason that POLLFREE needs to exist is that holding a file reference isn't enough anyway, since in the case of signalfd and binder the waitqueue may have a shorter lifetime. If the file reference was enough, then aio poll wouldn't need to handle POLLFREE.
> And maybe the signalfd case can do the same - instead of expecting exit() to clean up the list when sighand->count goes to zero, maybe the signalfd filp can just hold a ref to that 'struct sighand_struct', and it gets free'd whenever there are no signalfd users left?
> That would involve making signalfd_ctx actually tied to one particular 'struct signal', but that might be the right thing to do regardless (instead of making it always act on 'current' like it does now).
> So maybe with some re-organization, we could get rid of the need for POLLFREE entirely.. Anybody?
> But your patches are certainly simpler in that they just fix the status quo.
That would be really nice, but unfortunately the current behavior is documented in signalfd(2), for example:
""" fork(2) semantics After a fork(2), the child inherits a copy of the signalfd file de‐ scriptor. A read(2) from the file descriptor in the child will return information about signals queued to the child. """
With your proposed change to signalfd, the child would receive its parent's signals via the signalfd, rather than its own signals.
It's maybe how signalfd should have worked originally. I don't know whether it is actually safe to change or not, but the fact that the current behavior is specifically documented in the man page isn't too promising.
Also, changing signalfd by itself wouldn't allow removal of POLLFREE; binder would have to be changed too. I'm not a binder expert, but fixing binder doesn't look super promising. It looks pretty intentional that binder_poll() operates on per-thread state, while the binder file description itself is a per-process thing.
- Eric