Hi,
While staring at epoll, I noticed ep_events_available() looks wrong. I wrote a small program to confirm, and yes it is definitely wrong.
This series adds a reproducer to kselftest and fixes the bug.
Nam Cao (2):
  selftests/eventpoll: Add test for multiple waiters
  eventpoll: Fix epoll_wait() reporting false negatives
 fs/eventpoll.c                            | 16 ++--------------
 .../filesystems/epoll/epoll_wakeup_test.c | 45 +++++++++++++++++++
 2 files changed, 47 insertions(+), 14 deletions(-)
Add a test which creates 64 threads that all epoll_wait() on the same eventpoll. The source eventfd is written but never read, therefore all the threads should always see an EPOLLIN event.
This test fails because of a kernel bug, which will be fixed by a follow-up commit.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
 .../filesystems/epoll/epoll_wakeup_test.c | 45 +++++++++++++++++++
 1 file changed, 45 insertions(+)
diff --git a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
index 65ede506305c..0852c68d0461 100644
--- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
+++ b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
@@ -3493,4 +3493,49 @@ TEST(epoll64)
 	close(ctx.sfd[1]);
 }
 
+static void *epoll65_wait(void *ctx_)
+{
+	struct epoll_mtcontext *ctx = ctx_;
+	struct epoll_event event;
+
+	for (int i = 0; i < 100000; ++i) {
+		if (!epoll_wait(ctx->efd[0], &event, 1, 0))
+			return (void *)ENODATA;
+	}
+
+	return (void *)0;
+}
+
+TEST(epoll65)
+{
+	struct epoll_mtcontext ctx;
+	struct epoll_event event;
+	int64_t dummy_data = 99;
+	pthread_t threads[64];
+	uintptr_t ret;
+	int i, err;
+
+	ctx.efd[0] = epoll_create(1);
+	ASSERT_GE(ctx.efd[0], 0);
+	ctx.efd[1] = eventfd(0, 0);
+	ASSERT_GE(ctx.efd[1], 0);
+
+	event.events = EPOLLIN;
+	err = epoll_ctl(ctx.efd[0], EPOLL_CTL_ADD, ctx.efd[1], &event);
+	ASSERT_EQ(err, 0);
+
+	write(ctx.efd[1], &dummy_data, sizeof(dummy_data));
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i)
+		ASSERT_EQ(pthread_create(&threads[i], NULL, epoll65_wait, &ctx), 0);
+
+	for (i = 0; i < ARRAY_SIZE(threads); ++i) {
+		ASSERT_EQ(pthread_join(threads[i], (void **)&ret), 0);
+		ASSERT_EQ(ret, 0);
+	}
+
+	close(ctx.efd[0]);
+	close(ctx.efd[1]);
+}
+
 TEST_HARNESS_MAIN
ep_events_available() checks for available events by looking at ep->rdllist and ep->ovflist. However, this is done without a lock, so the returned value is not reliable: it is possible that both checks on ep->rdllist and ep->ovflist come up false while ep_start_scan() or ep_done_scan() is being executed on another CPU, even though events are available.
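For reference, the lockless check in question looks roughly like this in fs/eventpoll.c:

static inline int ep_events_available(struct eventpoll *ep)
{
	return !list_empty_careful(&ep->rdllist) ||
		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
}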
This bug can be observed by:
1. Create an eventpoll with at least one ready level-triggered event
2. Create multiple threads that call epoll_wait() with zero timeout. The threads do not consume the events; therefore, all epoll_wait() calls should return at least one event.
If one thread is executing ep_events_available() while another thread is executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly return no event for the former thread.
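From my reading, one interleaving that produces the false negative looks roughly like this (the scanner holds ep->lock; the waiter reads locklessly):

  waiter (epoll_wait, timeout=0)    scanner (ep_send_events)
  ------------------------------    ------------------------
                                    ep_start_scan():
                                      list_splice_init(&ep->rdllist,
                                                       txlist);
  ep_events_available():
    rdllist observed empty
      (entries just spliced away)
    ovflist observed still
      EP_UNACTIVE_PTR (the store
      below not yet visible)
                                      WRITE_ONCE(ep->ovflist, NULL);
  -> reports "no events", although
     the ready events now sit on
     txlist

The mirror-image window exists in ep_done_scan(), where ovflist is reset to EP_UNACTIVE_PTR before txlist is spliced back onto rdllist.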
This reproducer is implemented as TEST(epoll65) in tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
Fix it by skipping ep_events_available() and calling ep_try_send_events() directly.
epoll_sendevents() (io_uring) suffers from the same problem; fix that as well.
There is still ep_busy_loop(), which uses ep_events_available() without the lock, but that is probably okay (?) for busy-polling.
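(For context, busy-polling rechecks in a loop rather than acting on a single racy read; the loop-termination callback is roughly:

static bool ep_busy_loop_end(void *p, unsigned long start_time)
{
	struct eventpoll *ep = p;

	return ep_events_available(ep) || busy_loop_timeout(start_time);
}

so a transient false negative there only extends the loop by an iteration.)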
Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
---
 fs/eventpoll.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..541481eafc20 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, struct timespec64 *timeout)
 {
-	int res, eavail, timed_out = 0;
+	int res, eavail = 1, timed_out = 0;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 	}
 
-	/*
-	 * This call is racy: We may or may not see events that are being added
-	 * to the ready list under the lock (e.g., in IRQ callbacks). For cases
-	 * with a non-zero timeout, this thread will check the ready list under
-	 * lock and will add to the wait queue. For cases with a zero
-	 * timeout, the user by definition should not care and will have to
-	 * recheck again.
-	 */
-	eavail = ep_events_available(ep);
-
 	while (1) {
 		if (eavail) {
 			res = ep_try_send_events(ep, events, maxevents);
@@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
 	 * Racy call, but that's ok - it should get retried based on
 	 * poll readiness anyway.
 	 */
-	if (ep_events_available(ep))
-		return ep_try_send_events(ep, events, maxevents);
-	return 0;
+	return ep_try_send_events(ep, events, maxevents);
 }
 
 /*
On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
> ep_events_available() checks for available events by looking at ep->rdllist and ep->ovflist. However, this is done without a lock, so the returned value is not reliable: it is possible that both checks on ep->rdllist and ep->ovflist come up false while ep_start_scan() or ep_done_scan() is being executed on another CPU, even though events are available.
>
> This bug can be observed by:
>
> 1. Create an eventpoll with at least one ready level-triggered event
>
> 2. Create multiple threads that call epoll_wait() with zero timeout. The threads do not consume the events; therefore, all epoll_wait() calls should return at least one event.
>
> If one thread is executing ep_events_available() while another thread is executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly return no event for the former thread.
That is the whole point of epoll_wait with a zero timeout. We want to opportunistically poll without much overhead, which will have more false negatives. A caller that calls with a zero timeout should retry later, and will at some point observe the event.
I'm not sure we would want to add much more overhead for higher precision.

Thanks,
Soheil
On Fri, Jul 18, 2025 at 09:38:27AM +0100, Soheil Hassas Yeganeh wrote:
> On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <namcao@linutronix.de> wrote:
> > ep_events_available() checks for available events by looking at ep->rdllist and ep->ovflist. However, this is done without a lock, so the returned value is not reliable: it is possible that both checks on ep->rdllist and ep->ovflist come up false while ep_start_scan() or ep_done_scan() is being executed on another CPU, even though events are available.
> >
> > This bug can be observed by:
> >
> > 1. Create an eventpoll with at least one ready level-triggered event
> >
> > 2. Create multiple threads that call epoll_wait() with zero timeout. The threads do not consume the events; therefore, all epoll_wait() calls should return at least one event.
> >
> > If one thread is executing ep_events_available() while another thread is executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly return no event for the former thread.
>
> That is the whole point of epoll_wait with a zero timeout. We want to opportunistically poll without much overhead, which will have more false negatives. A caller that calls with a zero timeout should retry later, and will at some point observe the event.

Is this a documented behavior that users expect? I do not see this in the man page.

It sounds completely broken to me that an event has been sitting there for some time, but epoll_wait() says there is nothing.

> I'm not sure we would want to add much more overhead for higher precision.

Correctness before performance, please. And I'm not sure what you mean by "much more overhead". While this patch definitely adds some cycles in the case where there is no event, epoll_wait() still returns "instantly".
Nam
On Fri, Jul 18, 2025 at 09:52:29AM +0200, Nam Cao wrote:
> ep_events_available() checks for available events by looking at ep->rdllist and ep->ovflist. However, this is done without a lock, so the returned value is not reliable: it is possible that both checks on ep->rdllist and ep->ovflist come up false while ep_start_scan() or ep_done_scan() is being executed on another CPU, even though events are available.
>
> This bug can be observed by:
>
> 1. Create an eventpoll with at least one ready level-triggered event
>
> 2. Create multiple threads that call epoll_wait() with zero timeout. The threads do not consume the events; therefore, all epoll_wait() calls should return at least one event.
>
> If one thread is executing ep_events_available() while another thread is executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly return no event for the former thread.
>
> This reproducer is implemented as TEST(epoll65) in tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
>
> Fix it by skipping ep_events_available() and calling ep_try_send_events() directly.
>
> epoll_sendevents() (io_uring) suffers from the same problem; fix that as well.
>
> There is still ep_busy_loop(), which uses ep_events_available() without the lock, but that is probably okay (?) for busy-polling.
I'll say upfront I'm not an epoll person, just looked here out of curiosity.
I can agree there is a bug. The event is generated before any of the threads even exist, and they only poll for it; none of them consume it.

However, the commit message fails to explain why the change fixes anything, and I think your patch de facto reverts e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout"). Looking at that diff, the point was to avoid the expensive lock trip when timeout == 0 and there are no events.

As for the bug, from my reading the ep_start_scan()/ep_done_scan() pair transiently disturbs the state checked by ep_events_available(), resulting in false negatives. The locked check then works because, by the time you acquire the lock, the damage is undone.

Given the commits referenced in Fixes:, I suspect the real fix would of course be to stop destroying that state.

But if that's not feasible, I would check if a sequence counter around this would do the trick -- then the racy ep_events_available(ep) upfront would become safe, with smaller overhead than your proposal for the no-event case, but higher overhead when there is something.

My proposal is trivial to implement; I have no idea if it will get buy-in though.
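A minimal sketch of what that could look like (hypothetical; assumes a new seqcount_t scan_seq in struct eventpoll, with write_seqcount_begin()/write_seqcount_end() pairs added around the rdllist/ovflist transitions in ep_start_scan() and ep_done_scan(), both of which already run under ep->lock):

static inline bool ep_events_available_seq(struct eventpoll *ep)
{
	unsigned int seq;
	bool avail;

	do {
		/* Spins while a transition is in flight, then rereads. */
		seq = read_seqcount_begin(&ep->scan_seq);
		avail = !list_empty_careful(&ep->rdllist) ||
			READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
	} while (read_seqcount_retry(&ep->scan_seq, seq));

	return avail;
}

Readers would then briefly spin while a transition is in flight instead of acting on the half-updated state.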
Mateusz Guzik <mjguzik@gmail.com> writes:
> I'll say upfront I'm not an epoll person, just looked here out of curiosity.

Feedback is always welcome.

> I can agree there is a bug. The event is generated before any of the threads even exist, and they only poll for it; none of them consume it.
>
> However, the commit message fails to explain why the change fixes anything, and I think your patch de facto reverts e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout"). Looking at that diff, the point was to avoid the expensive lock trip when timeout == 0 and there are no events.
>
> As for the bug, from my reading the ep_start_scan()/ep_done_scan() pair transiently disturbs the state checked by ep_events_available(), resulting in false negatives. The locked check then works because, by the time you acquire the lock, the damage is undone.
Exactly so. I can add this to the description.
> Given the commits referenced in Fixes:, I suspect the real fix would of course be to stop destroying that state.
>
> But if that's not feasible, I would check if a sequence counter around this would do the trick -- then the racy ep_events_available(ep) upfront would become safe, with smaller overhead than your proposal for the no-event case, but higher overhead when there is something.
>
> My proposal is trivial to implement; I have no idea if it will get buy-in though.

My question is whether the performance of epoll_wait() with zero timeout is really so important that we have to complicate things. If epoll_wait() with zero timeout is called repeatedly in a loop and there is no event, I'm sure there will be a measurable performance drop. But a sane user would just use a timeout in that case.
epoll's data is protected by a lock. Therefore I think the most straightforward solution is just taking the lock before reading the data.
Lockless is hard to get right and may cause hard-to-debug problems. So unless the performance drop actually bothers someone, I would prefer "keep it simple, stupid".
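An untested sketch of that variant (the rdllist/ovflist transitions in ep_start_scan()/ep_done_scan() happen under ep->lock, so a reader holding it can never observe the transient "both look empty" state):

static bool ep_events_available_locked(struct eventpoll *ep)
{
	unsigned long flags;
	bool avail;

	read_lock_irqsave(&ep->lock, flags);
	avail = !list_empty(&ep->rdllist) ||
		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
	read_unlock_irqrestore(&ep->lock, flags);

	return avail;
}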
Best regards,
Nam
On Wed, Sep 17, 2025 at 3:41 PM Nam Cao <namcao@linutronix.de> wrote:
> My question is whether the performance of epoll_wait() with zero timeout is really so important that we have to complicate things. If epoll_wait() with zero timeout is called repeatedly in a loop and there is no event, I'm sure there will be a measurable performance drop. But a sane user would just use a timeout in that case.
>
> epoll's data is protected by a lock. Therefore I think the most straightforward solution is just taking the lock before reading the data.
I have no idea what the original use case is. I see the author of the patch is cc'ed, so hopefully they will answer.
> Lockless is hard to get right and may cause hard-to-debug problems. So unless the performance drop actually bothers someone, I would prefer "keep it simple, stupid".
Well, epoll is known to suffer from lock contention, so I would like to think the lockless games were motivated by a real-world need, but I'm not going to peruse the history to find out.

I can agree the current state concerning ep_events_available() is avoidably error-prone and something(tm) should be done. fwiw the refcount thing is almost free on amd64; I have no idea how this pans out on arm64.
On Wed, Sep 17, 2025 at 6:05 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Wed, Sep 17, 2025 at 3:41 PM Nam Cao <namcao@linutronix.de> wrote:
> > My question is whether the performance of epoll_wait() with zero timeout is really so important that we have to complicate things. If epoll_wait() with zero timeout is called repeatedly in a loop and there is no event, I'm sure there will be a measurable performance drop. But a sane user would just use a timeout in that case.
> >
> > epoll's data is protected by a lock. Therefore I think the most straightforward solution is just taking the lock before reading the data.
>
> I have no idea what the original use case is. I see the author of the patch is cc'ed, so hopefully they will answer.
>
> > Lockless is hard to get right and may cause hard-to-debug problems. So unless the performance drop actually bothers someone, I would prefer "keep it simple, stupid".
>
> Well, epoll is known to suffer from lock contention, so I would like to think the lockless games were motivated by a real-world need, but I'm not going to peruse the history to find out.
>
> I can agree the current state concerning ep_events_available() is avoidably error-prone and something(tm) should be done. fwiw the refcount thing is almost free on amd64; I have no idea how this pans out on arm64.
erm, seqcount
I think the justification for the original comment is: epoll_wait() returns either when events are available or when the timeout is hit, and if the timeout is hit, "events are available" is undefined (put another way: it would be incorrect to interpret hitting the timeout as "no events available"). So one could justify the missed event that way, but it does feel against the spirit of the API, especially since the event may have existed for an arbitrarily long time and still be missed.
Glancing again at this code, ep_events_available() should return true if rdllist is not empty, is actively being modified, or if ovflist is not EP_UNACTIVE_PTR.
One ordering thing that sticks out to me is that ep_start_scan() first splices out rdllist, *then* clears ovflist (from EP_UNACTIVE_PTR to NULL). This creates a short window where rdllist is empty and not being modified, but ovflist is still EP_UNACTIVE_PTR, which we interpret as "no events available", even though a local txlist may hold events. It seems that, for this lockless check to remain accurate, we would need to reverse the order of these two operations *and* ensure the order remains observable. (And for users taking the lock, there should be no observable difference with this change.)
On Wed, Sep 17, 2025 at 11:03 AM Khazhy Kumykov <khazhy@chromium.org> wrote:
> One ordering thing that sticks out to me is

(ordering can't help here, since we'll rapidly be flapping between ovflist in use and inactive... right?)
On Thu, Sep 18, 2025 at 12:34 AM Khazhy Kumykov <khazhy@chromium.org> wrote:
> On Wed, Sep 17, 2025 at 11:03 AM Khazhy Kumykov <khazhy@chromium.org> wrote:
> > One ordering thing that sticks out to me is
>
> (ordering can't help here, since we'll rapidly be flapping between ovflist in use and inactive... right?)

A sequence counter around the shenanigans will sort it out, but I don't know if it is worth it and don't really want to investigate myself.
Mateusz Guzik <mjguzik@gmail.com> writes:
> A sequence counter around the shenanigans will sort it out, but I don't know if it is worth it and don't really want to investigate myself.

The original commit did mention a "1% CPU/RPC reduction in RPC benchmarks". I'm not sure what "RPC" stands for or which benchmark it is. But if it were really important, we would have heard by now.
Nam
On Wed, 17 Sep 2025 18:05:45 +0200 Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Wed, Sep 17, 2025 at 3:41 PM Nam Cao <namcao@linutronix.de> wrote:
> > My question is whether the performance of epoll_wait() with zero timeout is really so important that we have to complicate things. If epoll_wait() with zero timeout is called repeatedly in a loop and there is no event, I'm sure there will be a measurable performance drop. But a sane user would just use a timeout in that case.
> >
> > epoll's data is protected by a lock. Therefore I think the most straightforward solution is just taking the lock before reading the data.
>
> I have no idea what the original use case is. I see the author of the patch is cc'ed, so hopefully they will answer.
>
> > Lockless is hard to get right and may cause hard-to-debug problems. So unless the performance drop actually bothers someone, I would prefer "keep it simple, stupid".
>
> Well, epoll is known to suffer from lock contention, so I would like to think the lockless games were motivated by a real-world need, but I'm not going to peruse the history to find out.
>
> I can agree the current state concerning ep_events_available() is avoidably error-prone and something(tm) should be done. fwiw the refcount thing is almost free on amd64; I have no idea how this pans out on arm64.

Atomic operations are anything but free... They are likely to be similar in cost to an uncontested spinlock acquisition.
David
On Sat, Sep 20, 2025 at 4:42 PM David Laight <david.laight.linux@gmail.com> wrote:
> On Wed, 17 Sep 2025 18:05:45 +0200 Mateusz Guzik <mjguzik@gmail.com> wrote:
> > I can agree the current state concerning ep_events_available() is avoidably error-prone and something(tm) should be done. fwiw the refcount thing is almost free on amd64; I have no idea how this pans out on arm64.
>
> Atomic operations are anything but free... They are likely to be similar in cost to an uncontested spinlock acquisition.

In this context it was supposed to be s/refcount/seqcount/, and on amd64 that's loading the same variable twice plus a branch on the read side. Not *free*, but not in the same galaxy compared to acquiring a spinlock (even assuming it is uncontested).
Nam Cao <namcao@linutronix.de> writes:
> While staring at epoll, I noticed ep_events_available() looks wrong. I wrote a small program to confirm, and yes it is definitely wrong.
>
> This series adds a reproducer to kselftest and fixes the bug.
Friendly reminder that this exists.
This probably only shows up once in a blue moon, and the impact is probably not significant. But it should still be fixed.
Nam