On Mon, Dec 06, 2021 at 11:28:13AM -0800, Linus Torvalds wrote:
On Fri, Dec 3, 2021 at 4:23 PM Eric Biggers <ebiggers@kernel.org> wrote:
require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
Ugh.
I hate POLLFREE, and the more I look at this, the more I think it's broken.
And that
wake_up_poll(wq, EPOLLHUP | POLLFREE);
in particular looks broken - the intent is that it should remove all the wait queue entries (because the wait queue head is going away), but wake_up_poll() itself actually does
__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
where that '1' is the number of exclusive entries it will wake up.
So if there are two exclusive waiters, wake_up_poll() will simply stop waking things up after the first one.
Which defeats the whole POLLFREE thing too.
Maybe I'm missing something, but POLLFREE really is broken.
I'd argue that all of epoll() is broken, but I guess we're stuck with it.
Now, it's very possible that nobody actually uses exclusive waits for those wait queues, and my "nr_exclusive" argument is about something that isn't actually a bug in reality. But I think it's a sign of confusion, and it's just another issue with POLLFREE.
I really wish we could have some way to not have epoll and aio mess with the wait-queue lists and cache the wait queue head pointers that they don't own.
In the meantime, I don't think these patches make things worse, and they may fix things. But see above about "nr_exclusive" and how I think wait queue entries might end up avoiding POLLFREE handling.
Linus
epoll supports exclusive waits, via the EPOLLEXCLUSIVE flag. So this looks like a real problem.
It could be fixed by converting signalfd and binder to use something like this, right?
#define wake_up_pollfree(x) \
	__wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))
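To make the nr_exclusive point concrete, here is a small userspace model of the wake loop. It is only a sketch, not kernel code: model_waiter, MODEL_FLAG_EXCLUSIVE and model_wake_up() are made-up names, and the model assumes every wake callback succeeds. It mirrors the rule that the walk stops once nr_exclusive exclusive entries have been woken, and that a count of 0 never stops it.

```c
#include <stddef.h>

/* Hypothetical stand-in for WQ_FLAG_EXCLUSIVE; not the kernel's value. */
#define MODEL_FLAG_EXCLUSIVE 0x01u

/* Hypothetical, simplified wait queue entry. */
struct model_waiter {
	unsigned int flags;	/* MODEL_FLAG_EXCLUSIVE or 0 */
	int woken;		/* set to 1 once the walk reaches this entry */
};

/*
 * Walk the entries in order and wake each one. Stop after nr_exclusive
 * exclusive entries have been woken. With nr_exclusive == 0 the
 * decrement goes negative and never hits zero, so every entry is
 * reached.
 *
 * Returns the number of entries reached.
 */
static int model_wake_up(struct model_waiter *w, size_t n, int nr_exclusive)
{
	int reached = 0;

	for (size_t i = 0; i < n; i++) {
		w[i].woken = 1;
		reached++;
		if ((w[i].flags & MODEL_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
	return reached;
}
```

With one non-exclusive entry followed by two exclusive ones, calling model_wake_up() with nr_exclusive = 1 stops after the first exclusive entry and never reaches the second, while nr_exclusive = 0 reaches all three. That is the difference between wake_up_poll() and the proposed wake_up_pollfree().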
As for eliminating POLLFREE entirely, that would require that the waitqueue heads be moved to a location which has a longer lifetime. I'm not sure if that's possible. In the case of signalfd, maybe the waitqueue head could be moved to the file private data (signalfd_ctx), and then sighand_struct would contain a list of signalfd_ctx's which are receiving signals directed to that sighand_struct, rather than the waitqueue head itself. I'm not sure how well that would work. This would probably change user-visible behavior; if a signalfd is inherited by fork(), the child process would be notified about signals sent to the parent process, rather than itself as is currently the case.
- Eric