On Tue, 2020-05-05 at 13:54 +0200, SeongJae Park wrote:
CC-ing stable@vger.kernel.org and adding some more explanations.
On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park <sjpark@amazon.com> wrote:
From: SeongJae Park <sjpark@amazon.de>
Commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the deallocation of 'socket_alloc' asynchronous via RCU, the same as 'sock.wq'. The following commit 333f7909a857 ("coallocate socket_wq with socket itself") then gave the two the same life cycle.
These changes made the code much simpler, but also made 'socket_alloc' live longer than before. As a result, user programs that intensively repeat socket allocation and deallocation can cause memory pressure on recent kernels.
I found this problem on a production virtual machine with 4GB of memory while running lebench[1]. The 'poll big' test of lebench opens 1000 sockets, polls them, and closes them. This test is repeated 10,000 times. Therefore it should consume only 1000 'socket_alloc' objects at a time. As the size of socket_alloc is about 800 bytes, that is only about 800 KiB. However, on recent kernels, it could consume up to 10,000,000 objects (about 8 GiB). On the test machine, I confirmed that it consumed about 4GB of system memory and resulted in an OOM.
[1] https://github.com/LinuxPerfStudy/LEBench
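For illustration only, the workload pattern and the worst-case arithmetic described above can be sketched roughly as follows. This is not lebench's actual code: the function names (worst_case_bytes, poll_round, run_workload) are hypothetical, and the use of AF_UNIX stream sockets and a zero poll timeout are assumptions made to keep the sketch small and side-effect free.

```c
/* Sketch of a 'poll big'-style workload: open many sockets, poll them once,
 * close them all, and repeat. All names here are hypothetical. */
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bytes held in the worst case, i.e. if no object is freed before the run
 * ends: nsockets * iterations * obj_size. */
unsigned long long worst_case_bytes(unsigned long long nsockets,
                                    unsigned long long iterations,
                                    unsigned long long obj_size)
{
    return nsockets * iterations * obj_size;
}

/* One round: open up to nsockets sockets, poll them, close them.
 * Returns the number of sockets actually opened, or -1 if the pollfd
 * array could not be allocated. */
int poll_round(int nsockets)
{
    assert(nsockets > 0);

    struct pollfd *fds = calloc((size_t)nsockets, sizeof(*fds));
    if (!fds)
        return -1;

    int opened = 0;
    for (int i = 0; i < nsockets; i++) {
        /* Assumption: AF_UNIX is used only to avoid touching the network. */
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            break; /* e.g. the open-file limit was reached */
        fds[opened].fd = fd;
        fds[opened].events = POLLIN;
        opened++;
    }

    /* Zero timeout: exercise the poll path without blocking. */
    if (opened > 0)
        poll(fds, (nfds_t)opened, 0);

    /* Each close() releases a socket whose backing object the kernel may
     * free only later when the free is deferred. */
    for (int i = 0; i < opened; i++)
        close(fds[i].fd);

    free(fds);
    return opened;
}

/* Repeat the round; returns the total number of sockets opened, or -1. */
long run_workload(int nsockets, int iterations)
{
    long total = 0;
    for (int r = 0; r < iterations; r++) {
        int n = poll_round(nsockets);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```

Calling run_workload(1000, 10000) would approximate the test parameters quoted above, and worst_case_bytes(1000, 10000, 800) evaluates to 8,000,000,000 bytes, which is the upper bound mentioned in the description.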
To avoid the problem, this commit reverts the changes.
I also tried to write a fixup rather than revert, but I couldn't easily find a simple one. As commits 6d7855c54e1e and 333f7909a857 were for code refactoring rather than performance optimization, I thought introducing a complex fixup for this problem would make no sense. Meanwhile, the memory pressure regression could affect real machines. I therefore decided to quickly revert the commits first and consider a better refactoring later.
While lebench might be exercising a rather pathological case, the increase in memory pressure is real. I am concerned that the OOM killer is actually engaging and killing off processes when there are lots of resources already marked for release. This might be true for other lazy/delayed resource deallocation, too. This has obviously just become too lazy currently.
So for both reverts:
Reviewed-by: Stefan Nuernberger <snu@amazon.com>
Thanks, SeongJae Park
SeongJae Park (2):
  Revert "coallocate socket_wq with socket itself"
  Revert "sockfs: switch to ->free_inode()"

 drivers/net/tap.c      |  5 +++--
 drivers/net/tun.c      |  8 +++++---
 include/linux/if_tap.h |  1 +
 include/linux/net.h    |  4 ++--
 include/net/sock.h     |  4 ++--
 net/core/sock.c        |  2 +-
 net/socket.c           | 23 ++++++++++++++++-------
 7 files changed, 30 insertions(+), 17 deletions(-)