On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 4:54 AM SeongJae Park sjpark@amazon.com wrote:
CC-ing stable@vger.kernel.org and adding some more explanations.
On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park sjpark@amazon.com wrote:
From: SeongJae Park sjpark@amazon.de
The commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the deallocation of 'socket_alloc' happen asynchronously using RCU, in the same way as 'sock.wq'. The following commit 333f7909a857 ("coallocate socket_wq with socket itself") then gave the two the same life cycle.
The changes made the code much simpler, but also made 'socket_alloc' live longer than before. For this reason, user programs that intensively repeat allocations and deallocations of sockets could cause memory pressure on recent kernels.
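As an illustration of the allocation pattern described above, here is a minimal user-space sketch. It is not taken from lebench or from this patch set: the function name `churn_sockets`, the use of UDP sockets, and the single non-blocking poll are all assumptions made for the example.

```python
import select
import socket


def churn_sockets(n_sockets, iterations):
    """Open n_sockets sockets, poll them once, close them all; repeat.

    Hypothetical helper for illustration only. Returns the total number
    of sockets opened over all iterations.
    """
    total_opened = 0
    for _ in range(iterations):
        # Each socket() call allocates one socket inode in the kernel.
        socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
                 for _ in range(n_sockets)]
        poller = select.poll()
        for s in socks:
            poller.register(s, select.POLLIN)
        poller.poll(0)  # timeout of 0 ms: return immediately
        for s in socks:
            s.close()  # the kernel may defer freeing the object until after an RCU grace period
        total_opened += len(socks)
    return total_opened
```

At any moment only `n_sockets` sockets are open, yet the number of objects awaiting deferred freeing can grow with `iterations` if grace periods are long. Note that a large `n_sockets` may require raising the per-process file descriptor limit.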
I found this problem on a production virtual machine with 4 GB of memory while running lebench[1]. The 'poll big' test of lebench opens 1000 sockets, polls them, and closes them. This test is repeated 10,000 times. Therefore it should consume only 1000 'socket_alloc' objects at once. As the size of 'socket_alloc' is about 800 bytes, that is only about 800 KiB. However, on recent kernels, it could consume up to 10,000,000 objects (about 8 GiB). On the test machine, I confirmed that it consumed about 4 GB of system memory and resulted in an OOM.
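The figures above can be checked with a short calculation. This is only a sketch: the 800-byte object size is the approximate value quoted in the message, and `footprint_bytes` is a hypothetical helper name.

```python
SOCKET_ALLOC_BYTES = 800  # "about 800 bytes" per the message; approximate


def footprint_bytes(n_objects, obj_bytes=SOCKET_ALLOC_BYTES):
    """Total memory held by n_objects objects of obj_bytes each."""
    return n_objects * obj_bytes


# Expected case: only the 1000 currently open sockets are alive.
expected = footprint_bytes(1000)
# Worst case: none of the 1000 * 10,000 objects has been freed yet.
worst = footprint_bytes(1000 * 10_000)

print(f"expected: {expected / 1024:.0f} KiB")  # expected: 781 KiB
print(f"worst:    {worst / 2**30:.2f} GiB")    # worst:    7.45 GiB
```

Under these assumptions the worst case is 8 * 10^9 bytes, which is in line with the "about 8 GiB" estimate and well above the 4 GB available on the test machine.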
To be fair, I have not backported Al's patches to Google production kernels, nor have I tried this benchmark.
Why do we have 10,000,000 objects around? Could this be because of some RCU problem?
Mainly because of a long RCU grace period, as you guessed. I have no idea how the grace period became so long in this case.
As my test machine was a virtual machine instance, I guess a problem like RCU reader preemption[1] might have affected this.
[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
Once Al's patches are reverted, do you have 10,000,000 sock_alloc around?
No. Neither the old kernel from before Al's patches nor the recent kernel with Al's patches reverted reproduced the problem.
Thanks, SeongJae Park
Thanks.
To avoid the problem, this commit reverts the changes.
I also tried to write a fixup rather than reverting, but I couldn't easily find a simple one. As the commits 6d7855c54e1e and 333f7909a857 were for code refactoring rather than performance optimization, I thought introducing a complex fixup for this problem would make no sense. Meanwhile, the memory pressure regression could affect real machines. For this reason, I decided to quickly revert the commits first and consider a better refactoring later.
Thanks, SeongJae Park
SeongJae Park (2):
  Revert "coallocate socket_wq with socket itself"
  Revert "sockfs: switch to ->free_inode()"

 drivers/net/tap.c      |  5 +++--
 drivers/net/tun.c      |  8 +++++---
 include/linux/if_tap.h |  1 +
 include/linux/net.h    |  4 ++--
 include/net/sock.h     |  4 ++--
 net/core/sock.c        |  2 +-
 net/socket.c           | 23 ++++++++++++++++-------
 7 files changed, 30 insertions(+), 17 deletions(-)

--
2.17.1