On Tue, Jul 15, 2025 at 02:46:34PM +0200, Nam Cao wrote:
The ready event list of an epoll object is protected by a read-write lock (rwlock_t):
- The consumer (waiter) acquires the write lock and takes items.
- The producer (waker) takes the read lock and adds items.

The point of this design is to let epoll scale well with a large number of producers, as multiple producers can hold the read lock at the same time.
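To make the pattern concrete, here is a minimal sketch of the two paths. The names (ep_lock, rdllist, producer_add(), consumer_take()) are hypothetical and the structure is simplified; this is not the actual fs/eventpoll.c code:

  #include <linux/list.h>
  #include <linux/spinlock.h>             /* also pulls in rwlock_t */

  /* Simplified illustration of the design described above,
   * not the actual fs/eventpoll.c code. */
  struct epitem {
          struct list_head rdllink;       /* linked into the ready list */
  };

  static DEFINE_RWLOCK(ep_lock);          /* protects the ready list */
  static LIST_HEAD(rdllist);              /* the ready event list */

  static bool list_add_tail_lockless(struct list_head *new,
                                     struct list_head *head);

  /* Producer (waker): many can run concurrently under the read lock,
   * so adding to the list needs a lockless helper (sketched further
   * below). */
  static void producer_add(struct epitem *epi)
  {
          unsigned long flags;

          read_lock_irqsave(&ep_lock, flags);
          list_add_tail_lockless(&epi->rdllink, &rdllist);
          read_unlock_irqrestore(&ep_lock, flags);
  }

  /* Consumer (waiter): takes the write lock to move all ready items
   * out in one go. */
  static void consumer_take(struct list_head *txlist)
  {
          write_lock_irq(&ep_lock);
          list_splice_init(&rdllist, txlist);
          write_unlock_irq(&ep_lock);
  }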
Unfortunately, this implementation can cause a scheduling priority inversion. Suppose the consumer has higher scheduling priority than the producer. The consumer needs to acquire the write lock, but may be blocked by the producer holding the read lock. Since the rwlock does not support priority boosting for the readers (even with CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher-priority consumer is blocked by a lower-priority producer. This problem was reported in [1].
Furthermore, it can also cause stalls, as described in [2].
Fix this problem by replacing the rwlock with a spinlock.
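Schematically, the one-to-one replacement turns the sketch above into the following (again simplified, not the literal diff):

  static DEFINE_SPINLOCK(ep_lock);        /* was: DEFINE_RWLOCK(ep_lock) */

  static void producer_add(struct epitem *epi)
  {
          unsigned long flags;

          spin_lock_irqsave(&ep_lock, flags);
          /* Producers are now fully serialized, so the plain list
           * helper suffices and the lockless variants can be
           * deleted. */
          list_add_tail(&epi->rdllink, &rdllist);
          spin_unlock_irqrestore(&ep_lock, flags);
  }

  static void consumer_take(struct list_head *txlist)
  {
          spin_lock_irq(&ep_lock);
          list_splice_init(&rdllist, txlist);
          spin_unlock_irq(&ep_lock);
  }

With CONFIG_PREEMPT_RT=y, spinlock_t is a sleeping lock with priority inheritance, which is what resolves the inversion for the consumer.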
This reduces the event bandwidth, as the producers now have to contend with each other for the spinlock. According to the benchmark from https://github.com/rouming/test-tools/blob/master/stress-epoll.c:
On 12 x86 CPUs:
              Before     After      Diff
    threads   events/ms  events/ms
          8        7162       4956   -31%
         16        8733       5383   -38%
         32        7968       5572   -30%
         64       10652       5739   -46%
        128       11236       5931   -47%

On 4 riscv CPUs:
              Before     After      Diff
    threads   events/ms  events/ms
          8        2958       2833    -4%
         16        3323       3097    -7%
         32        3451       3240    -6%
         64        3554       3178   -11%
        128        3601       3235   -10%
Although the numbers look bad, it should be noted that this benchmark creates multiple threads that do nothing except constantly generate new epoll events, so contention on the spinlock is high. For real workloads, the event rate is likely much lower, and the performance drop is not as bad.
Using another benchmark (perf bench epoll wait) where spinlock contention is lower, an improvement is even observed on x86:
On 12 x86 CPUs:
    Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
    After:  Averaged 114577 operations/sec (+- 2.25%), total secs = 8

On 4 riscv CPUs:
    Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
    After:  Averaged 167396 operations/sec (+- 0.23%), total secs = 8
In conclusion, no one is likely to be upset over this change. After all, a spinlock was used originally for years, and the commit that converted it to an rwlock didn't mention a real workload, only that the benchmark numbers were nice.
This patch is not exactly a revert of commit a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention"), because a git revert conflicts in some places where the resolution is not obvious. This patch is intended to be backported; therefore it goes with the obvious approach:
- Replace rwlock_t with spinlock_t one to one.
- Delete list_add_tail_lockless() and chain_epi_lockless(). These were introduced to allow producers to concurrently add items to the list. Now that the spinlock no longer allows producers to touch the event list concurrently, these two functions are no longer necessary (a sketch of the technique they implemented follows below).
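For reference, a lockless tail-add of this kind is typically built on cmpxchg()/xchg(), roughly along the following lines. This is a simplified illustration of the technique, not necessarily the exact code being deleted:

  /* Precondition: 'new' is self-linked (new->next == new) while it
   * is off the list. */
  static inline bool list_add_tail_lockless(struct list_head *new,
                                            struct list_head *head)
  {
          struct list_head *prev;

          /* Claim 'new' first: if two CPUs race to add the same
           * element, only the one that swaps new->next from 'new'
           * to 'head' wins. */
          if (cmpxchg(&new->next, new, head) != new)
                  return false;

          /* Atomically make 'new' the tail, then stitch up the
           * links. */
          prev = xchg(&head->prev, new);
          WRITE_ONCE(prev->next, new);
          WRITE_ONCE(new->prev, prev);

          return true;
  }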
Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
I forgot to add:
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.... [2]
Christian, do you mind adding those for me, if/when you apply the patch?
Nam