CC-ing stable@vger.kernel.org and adding some more explanations.
On Tue, 5 May 2020 10:10:33 +0200 SeongJae Park sjpark@amazon.com wrote:
From: SeongJae Park sjpark@amazon.de
Commit 6d7855c54e1e ("sockfs: switch to ->free_inode()") made the deallocation of 'socket_alloc' asynchronous using RCU, just as is already done for 'sock.wq'. The following commit 333f7909a857 ("coallocate socket_wq with socket itself") then made the two share the same life cycle.
The changes made the code much simpler, but also made 'socket_alloc' objects live longer than before. For this reason, user programs that intensively repeat allocation and deallocation of sockets can cause memory pressure on recent kernels.
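To illustrate the difference, here is a generic sketch of the pattern (my own illustration, not the actual sockfs code; 'struct foo' and its helpers are made-up names):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* A made-up example object; 'socket_alloc' now follows the same pattern. */
struct foo {
	int data;
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu);

	kfree(f);	/* runs only after an RCU grace period has elapsed */
}

static void foo_release(struct foo *f)
{
	/*
	 * Synchronous freeing would return the memory to the allocator
	 * immediately:  kfree(f);
	 *
	 * RCU-deferred freeing keeps the object allocated until the next
	 * grace period completes, so a burst of releases can leave many
	 * objects pending at once if grace periods complete slowly.
	 */
	call_rcu(&f->rcu, foo_free_rcu);
}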
I found this problem on a production virtual machine with 4 GB of memory while running lebench[1]. The 'poll big' test of lebench opens 1,000 sockets, polls them, and closes them, and this test is repeated 10,000 times. It should therefore consume only 1,000 'socket_alloc' objects at a time. As the size of a socket_alloc is about 800 bytes, that is only about 800 KiB. On recent kernels, however, it can keep up to 10,000,000 objects (about 8 GiB) alive. On the test machine, I confirmed it consuming about 4 GB of system memory and resulting in OOM.
[1] https://github.com/LinuxPerfStudy/LEBench
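For context, the core of one 'poll big' iteration looks roughly like the following (a minimal sketch based on the description above, not LEBench's actual code; the socket type and timeout are arbitrary):

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define NSOCKS 1000

/*
 * Open many sockets, poll them once, then close them all.  Repeating
 * this quickly produces a steady stream of socket deallocations, each
 * of which is now deferred until an RCU grace period completes.
 */
static void poll_big_once(void)
{
	struct pollfd fds[NSOCKS];
	int i;

	for (i = 0; i < NSOCKS; i++) {
		fds[i].fd = socket(AF_INET, SOCK_STREAM, 0);
		fds[i].events = POLLIN;
		fds[i].revents = 0;
	}

	poll(fds, NSOCKS, 0);	/* zero timeout: returns immediately */

	for (i = 0; i < NSOCKS; i++)
		close(fds[i].fd);
}

If grace periods lag behind, each of the 10,000 iterations can leave its 1,000 'socket_alloc' objects pending, which is where the 10,000,000-object upper bound above comes from.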
To avoid the problem, this commit reverts the changes.
I also tried to write a fixup rather than reverts, but I could not easily find a simple one. As commits 6d7855c54e1e and 333f7909a857 were code refactorings rather than performance optimizations, I thought introducing a complex fixup for this problem would make little sense. Meanwhile, the memory pressure regression could affect real machines. To that end, I decided to quickly revert the commits first and consider better refactoring later.
Thanks, SeongJae Park
SeongJae Park (2):
  Revert "coallocate socket_wq with socket itself"
  Revert "sockfs: switch to ->free_inode()"

 drivers/net/tap.c      |  5 +++--
 drivers/net/tun.c      |  8 +++++---
 include/linux/if_tap.h |  1 +
 include/linux/net.h    |  4 ++--
 include/net/sock.h     |  4 ++--
 net/core/sock.c        |  2 +-
 net/socket.c           | 23 ++++++++++++++++-------
 7 files changed, 30 insertions(+), 17 deletions(-)
-- 2.17.1
On Tue, 2020-05-05 at 13:54 +0200, SeongJae Park wrote:
[...]
While lebench might be exercising a rather pathological case, the increase in memory pressure is real. I am concerned that the OOM killer actually engages and kills off processes while there are lots of resources already marked for release. This might be true for other lazy/delayed resource deallocation too; the deallocation here has obviously become too lazy.
So for both reverts:
Reviewed-by: Stefan Nuernberger snu@amazon.com
On Tue, May 5, 2020 at 4:54 AM SeongJae Park sjpark@amazon.com wrote:
[...]
To be fair, I have not backported Al's patches to Google production kernels, nor have I tried this benchmark.
Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
Thanks.
On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
[...]
Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
Mainly because of a long RCU grace period, as you guessed. I have no idea how the grace period became so long in this case.
As my test machine was a virtual machine instance, I guess a problem like RCU reader preemption[1] might have affected this.
[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
Yes; neither the old kernel from before Al's patches nor the recent kernel with Al's patches reverted reproduced the problem.
Thanks, SeongJae Park
On 5/5/20 8:07 AM, SeongJae Park wrote:
[...]
I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al's patches.
These objects should be visible in kmalloc-64 kmem cache.
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
[...]
Not exactly 10,000,000, since that is only the possible highest number, but I was able to observe a clear exponential increase in the number of objects using slabtop. Before the start of the problematic workload, the number of 'kmalloc-64' objects was 5,760, and I observed it increase to 1,136,576.
            OBJS   ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
before:     5760     5088  88%     0.06K     90        64        360K  kmalloc-64
after:   1136576  1136576 100%     0.06K  17759        64      71036K  kmalloc-64
Thanks, SeongJae Park
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
[...]
Great, thanks.
How recent is the kernel you are running for your experiment ?
Let's make sure the bug is not in RCU.
After Al's changes, RCU got slightly better under stress.
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
On 5/5/20 8:07 AM, SeongJae Park wrote:
On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
Mainly because of a long RCU grace period, as you guess. I have no idea how the grace period became so long in this case.
As my test machine was a virtual machine instance, I guess RCU readers preemption[1] like problem might affected this.
[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
Yes, both the old kernel that prior to Al's patches and the recent kernel reverting the Al's patches didn't reproduce the problem.
I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
These objects should be visible in kmalloc-64 kmem cache.
Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
before: 5760 5088 88% 0.06K 90 64 360K kmalloc-64 after: 1136576 1136576 100% 0.06K 17759 64 71036K kmalloc-64
Great, thanks.
How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Let's make sure the bug is not in RCU.
One thing I can say right now is that the grace period does pass eventually. I modified the benchmark to repeat the test only 5,000 times instead of 10,000, so that it runs without OOM but still shows easily observable memory pressure. As soon as the benchmark finished, the memory was freed.
If you need more tests, please let me know.
Thanks, SeongJae Park
On 5/5/20 9:13 AM, SeongJae Park wrote:
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
On 5/5/20 8:07 AM, SeongJae Park wrote:
On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
Mainly because of a long RCU grace period, as you guess. I have no idea how the grace period became so long in this case.
As my test machine was a virtual machine instance, I guess RCU readers preemption[1] like problem might affected this.
[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
Yes, both the old kernel that prior to Al's patches and the recent kernel reverting the Al's patches didn't reproduce the problem.
I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
These objects should be visible in kmalloc-64 kmem cache.
Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
before: 5760 5088 88% 0.06K 90 64 360K kmalloc-64 after: 1136576 1136576 100% 0.06K 17759 64 71036K kmalloc-64
Great, thanks.
How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Let's make sure the bug is not in RCU.
One thing I can currently say is that the grace period passes at last. I modified the benchmark to repeat not 10,000 times but only 5,000 times to run the test without OOM but easily observable memory pressure. As soon as the benchmark finishes, the memory were freed.
If you need more tests, please let me know.
I would ask Paul's opinion on this issue, because we have many objects being freed after RCU grace periods.
If the RCU subsystem cannot keep up, I guess other workloads will also suffer.
Sure, we can revert patches here and there trying to work around the issue, but for objects allocated from process context, we should not have these problems.
On 5/5/20 9:25 AM, Eric Dumazet wrote:
[...]
I wonder if simply adjusting rcu_divisor to 6 or 5 would help
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9a49cd6065a20936edbda1b334136ab597cde52..fde833bac0f9f81e8536211b4dad6e7575c1219a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -427,7 +427,7 @@ module_param(qovld, long, 0444);
 static ulong jiffies_till_first_fqs = ULONG_MAX;
 static ulong jiffies_till_next_fqs = ULONG_MAX;
 static bool rcu_kick_kthreads;
-static int rcu_divisor = 7;
+static int rcu_divisor = 6;
 module_param(rcu_divisor, int, 0644);
 
 /* Force an exit from rcu_do_batch() after 3 milliseconds. */
On 5/5/20 9:31 AM, Eric Dumazet wrote:
[...]
To be clear, you can adjust the value without building a new kernel.
echo 6 >/sys/module/rcutree/parameters/rcu_divisor
On Tue, 5 May 2020 09:37:42 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
[...]
I tried the values 6, 5, and 4, but none of them removed the problem.
Thanks, SeongJae Park
On Tue, May 05, 2020 at 07:05:53PM +0200, SeongJae Park wrote:
[...]
Thank you for checking this!
Was your earlier discussion on long RCU readers speculation, or do you have measurements?
Thanx, Paul
On Tue, 5 May 2020 10:30:36 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
[...]
Was your earlier discussion on long RCU readers speculation, or do you have measurements?
It was just a guess without any measurement or dedicated investigation.
Thanks, SeongJae Park
On Tue, May 05, 2020 at 07:56:05PM +0200, SeongJae Park wrote:
[...]
OK, another thing to check is the duration of the low-memory episode. Does this duration exceed the RCU CPU stall warning time? (21 seconds in mainline, 60 in many distros, but check rcupdate.rcu_cpu_stall_timeout to be sure.)
Also, any chance of a .config? Or at least the RCU portions? I am guessing CONFIG_PREEMPT=n, for example.
Thanx, Paul
On Tue, 5 May 2020 11:17:07 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
[...]
The benchmark takes about 36 seconds for 10,000 repeats of the test.
The rcupdate.rcu_cpu_stall_timeout value on the test machine is 60.
So the duration would not have exceeded the warning time, which is why I have not seen the warning message.
As mentioned in the other mail, I will also adjust this value to a shorter one.
Also, any chance of a .config? Or at least the RCU portions? I am guessing CONFIG_PREEMPT=n, for example.
I guess this would be ok.
# CONFIG_PREEMPT is not set
#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem
Thanks, SeongJae Park
On Tue, May 05, 2020 at 08:34:02PM +0200, SeongJae Park wrote:
[...]
Sounds good, thank you!
[...]
And thank you again!
Thanx, Paul
TL;DR: It was not the kernel's fault, but the benchmark program's.
So, the problem is reproducible only with lebench[1]. I carefully read its code again.
Before running the "poll big" sub test, where the problem occurred, lebench executes the "context switch" sub test. For that test, it sets its own CPU affinity[2] to '0' and its process priority[3] to '-20'. However, it does not restore the original values even after the "context switch" test is finished. As a result, the "poll big" sub test also runs bound to CPU 0 with the lowest nice value. This can starve the RCU callback thread for CPU 0, which processes the deferred deallocations of the sockets, and as a result it triggers the OOM.
We confirmed that the problem disappears either by offloading the RCU callbacks from CPU 0 using the rcu_nocbs=0 boot parameter, or by simply restoring the affinity and/or priority.
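For illustration, the kind of calls involved looks roughly like this (a minimal sketch of the pattern described above, not LEBench's actual code; function names are made up):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>

/*
 * Pin the calling process to CPU 0 and give it the lowest nice value,
 * as the "context switch" sub test does.  If this is never undone,
 * later sub tests keep competing with CPU 0's RCU callback processing.
 */
static void pin_and_boost(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* bind to CPU 0 only */
	setpriority(PRIO_PROCESS, 0, -20);		/* nice -20: highest priority */
}

/*
 * What a fix in the benchmark would do afterwards: allow all CPUs
 * again and return to the default nice value.
 */
static void restore_defaults(long ncpus)
{
	cpu_set_t set;
	long i;

	CPU_ZERO(&set);
	for (i = 0; i < ncpus; i++)
		CPU_SET(i, &set);
	sched_setaffinity(0, sizeof(set), &set);
	setpriority(PRIO_PROCESS, 0, 0);
}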
Someone _might_ still argue that this is a kernel problem because the problem did not occur on the old kernels prior to Al's patches. However, setting the affinity and priority was possible only because the program was given permission to do so. It would therefore be more reasonable to blame the system administrators than the kernel.
So, please ignore this patchset, and apologies for the confusion. If you still have doubts or need more tests, please let me know.
[1] https://github.com/LinuxPerfStudy/LEBench [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L82... [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L82...
Thanks, SeongJae Park
On Wed, May 6, 2020 at 5:59 AM SeongJae Park sjpark@amazon.com wrote:
[...]
No harm done, thanks for running more tests and root-causing the issue !
On Wed, May 06, 2020 at 02:59:26PM +0200, SeongJae Park wrote:
[...]
Thank you for chasing this down!
I have had this sort of thing on my list as a potential issue, but given that it is now really showing up, it sounds like it is time to bump up its priority a bit. Of course there are limits, so if userspace is running at any of the real-time priorities, making sufficient CPU time available to RCU's kthreads becomes userspace's responsibility. But if everything is running at SCHED_OTHER (which is the case here, correct?), then it is reasonable for RCU to do some work to avoid this situation.
But still, yes, the immediate job is fixing the benchmark. ;-)
Thanx, Paul
PS. Why not just attack all potential issues on my list? Because I usually learn quite a bit from seeing the problem actually happen. And sometimes other changes in RCU eliminate the potential issue before it has a chance to happen.
On Wed, 6 May 2020 07:41:51 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
[...]
But if everything is running at SCHED_OTHER (which is the case here, correct?),
Correct.
then it is reasonable for RCU to do some work to avoid this situation.
That would also be great!
But still, yes, the immediate job is fixing the benchmark. ;-)
Totally agreed.
PS. Why not just attack all potential issues on my list? Because I usually learn quite a bit from seeing the problem actually happen. And sometimes other changes in RCU eliminate the potential issue before it has a chance to happen.
Sounds interesting, I will try some of those in my spare time ;)
Thanks, SeongJae Park
On Tue, May 05, 2020 at 09:37:42AM -0700, Eric Dumazet wrote:
[...]
echo 6 >/sys/module/rcutree/parameters/rcu_divisor
Worth a try! If that helps significantly, I have some ideas for updating that heuristic, such as checking for sudden increases in the number of pending callbacks.
But I would really also like to know whether there are long readers and whether v5.6 fares better.
Thanx, Paul
On Tue, 5 May 2020 10:28:50 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
[...]
But I would really also like to know whether there are long readers and whether v5.6 fares better.
I will share the results as soon as possible :)
Thanks, SeongJae Park
Thanx, Paul
On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
On 5/5/20 9:13 AM, SeongJae Park wrote:
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
On 5/5/20 8:07 AM, SeongJae Park wrote:
On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
> Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
Mainly because of a long RCU grace period, as you guessed. I have no idea how the grace period became so long in this case.
As my test machine was a virtual machine instance, I guess an RCU reader preemption problem[1] might have affected this.
[1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
If this is the root cause of the problem, then it will be necessary to provide a hint to the hypervisor. Or, in the near term, avoid loading the hypervisor to the point that vCPU preemption is so lengthy.
RCU could also provide some sort of pre-stall-warning notification that some of the CPUs aren't passing through quiescent states, which might allow the guest OS's userspace to take corrective action.
But first, what are you doing to either confirm or invalidate the hypothesis that this might be due to vCPU preemption?
> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
Yes, both the old kernel prior to Al's patches and the recent kernel with Al's patches reverted did not reproduce the problem.
I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
These objects should be visible in kmalloc-64 kmem cache.
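As a rough illustration of what SLAB_TYPESAFE_BY_RCU implies (a minimal, hypothetical sketch, not the actual TCP socket code): only the cache's backing pages wait for a grace period, while an individual object may be recycled immediately, so readers must revalidate whatever they looked up.

#include <linux/init.h>
#include <linux/slab.h>

struct foo {                    /* hypothetical RCU-type-safe object */
	spinlock_t lock;
	int key;
};

static struct kmem_cache *foo_cache;

static int __init foo_cache_init(void)
{
	/* Objects may be reused at any time; only the slab pages are RCU-deferred. */
	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
				      SLAB_TYPESAFE_BY_RCU, NULL);
	return foo_cache ? 0 : -ENOMEM;
}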
Not exactly 10,000,000, as that is only the theoretical maximum, but I was able to observe a clear exponential increase in the number of objects using slabtop. Before the start of the problematic workload, the number of 'kmalloc-64' objects was 5760; by the end I observed it increase to 1,136,576.
            OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
before:     5760     5088  88%    0.06K     90       64       360K kmalloc-64
after:   1136576  1136576 100%    0.06K  17759       64     71036K kmalloc-64
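If it helps with reproducing, the same counts can also be sampled without slabtop (a sketch; assumes root access to /proc/slabinfo):

# first two columns are active_objs and num_objs for the cache
watch -n 1 "grep '^kmalloc-64 ' /proc/slabinfo"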
Great, thanks.
How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Is it possible to retest on v5.6? I have been adding various mechanisms to make RCU keep up better with heavy callback overload.
Also, could you please provide the .config? If either NO_HZ_FULL or RCU_NOCB_CPU, please also provide the kernel boot parameters.
Let's make sure the bug is not in RCU.
One thing I can currently say is that the grace period passes at last. I modified the benchmark to repeat not 10,000 times but only 5,000 times to run the test without OOM but easily observable memory pressure. As soon as the benchmark finishes, the memory were freed.
If you need more tests, please let me know.
I would ask Paul opinion on this issue, because we have many objects being freed after RCU grace periods.
As always, "It depends."
o If the problem is a too-long RCU reader, RCU is prohibited from ending the grace period. The reader duration must be shortened, and until it is shortened, there is nothing RCU can do.
o In some special cases of the above, RCU can and does help, for example, by enlisting the aid of cond_resched(). So perhaps there is a long in-kernel loop that needs a cond_resched() (a short sketch of this pattern follows this list).
And perhaps RCU can help for some types of vCPU preemption.
o As Al suggested offline and as has been discussed in the past, it would not be hard to cause RCU to burn CPU to attain faster grace periods during OOM events. This could be helpful, but only given that RCU readers are completing in reasonable timeframes.
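To illustrate the cond_resched() point above (a purely hypothetical loop and item type, not any specific kernel code):

#include <linux/list.h>
#include <linux/sched.h>

struct item {                   /* hypothetical work item */
	struct list_head node;
	int payload;
};

static void process_many_items(struct list_head *items)
{
	struct item *it;

	list_for_each_entry(it, items, node) {
		/* ... per-item work ... */
		cond_resched();  /* on non-preemptible kernels, gives RCU a quiescent state */
	}
}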
If RCU subsystem can not keep-up, I guess other workloads will also suffer.
If readers are not excessively long, RCU should be able to keep up. (In the absence of misconfigurations, for example, both NO_HZ_FULL and then binding all the rcuo kthreads to a single CPU on a 100-CPU system or some such.)
Sure, we can revert patches there and there trying to work around the issue, but for objects allocated from process context, we should not have these problems.
Agreed, let's get more info on what is happening to RCU.
One approach is to shorten the RCU CPU stall warning timeout (rcupdate.rcu_cpu_stall_timeout=10 for 10 seconds).
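In case a reboot is inconvenient, the same timeout can usually also be tightened at run time (assuming the parameter is writable on the running kernel):

# warn after 10 seconds instead of the default
echo 10 >/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout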
Thanx, Paul
On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
On 5/5/20 9:13 AM, SeongJae Park wrote:
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
On 5/5/20 8:07 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
>> Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
> Mainly because of a long RCU grace period, as you guess. I have no idea how the grace period became so long in this case.
> As my test machine was a virtual machine instance, I guess RCU readers preemption[1] like problem might affected this.
> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
If this is the root cause of the problem, then it will be necessary to provide a hint to the hypervisor. Or, in the near term, avoid loading the hypervisor the point that vCPU preemption is so lengthy.
RCU could also provide some sort of pre-stall-warning notification that some of the CPUs aren't passing through quiescent states, which might allow the guest OS's userspace to take corrective action.
But first, what are you doing to either confirm or invalidate the hypothesis that this might be due to vCPU preemption?
Nothing, I was just guessing. Sorry if this confused you.
>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
> Yes, both the old kernel that prior to Al's patches and the recent kernel reverting the Al's patches didn't reproduce the problem.
I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
These objects should be visible in kmalloc-64 kmem cache.
Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
            OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
before:     5760     5088  88%    0.06K     90       64       360K kmalloc-64
after:   1136576  1136576 100%    0.06K  17759       64     71036K kmalloc-64
Great, thanks.
How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Is it possible to retest on v5.6? I have been adding various mechanisms to make RCU keep up better with heavy callback overload.
I will try soon!
Also, could you please provide the .config? If either NO_HZ_FULL or RCU_NOCB_CPU, please also provide the kernel boot parameters.
NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.
I think I should check whether it's ok to share the full config and boot parameters. Please wait a bit for this.
Let's make sure the bug is not in RCU.
One thing I can currently say is that the grace period passes at last. I modified the benchmark to repeat not 10,000 times but only 5,000 times to run the test without OOM but easily observable memory pressure. As soon as the benchmark finishes, the memory were freed.
If you need more tests, please let me know.
I would ask Paul opinion on this issue, because we have many objects being freed after RCU grace periods.
As always, "It depends."
o If the problem is a too-long RCU reader, RCU is prohibited from ending the grace period. The reader duration must be shortened, and until it is shortened, there is nothing RCU can do.
o In some special cases of the above, RCU can and does help, for example, by enlisting the aid of cond_resched(). So perhaps there is a long in-kernel loop that needs a cond_resched().
And perhaps RCU can help for some types of vCPU preemption.
o As Al suggested offline and as has been discussed in the past, it would not be hard to cause RCU to burn CPU to attain faster grace periods during OOM events. This could be helpful, but only given that RCU readers are completing in reasonable timeframes.
Totally agreed.
If RCU subsystem can not keep-up, I guess other workloads will also suffer.
If readers are not excessively long, RCU should be able to keep up. (In the absence of misconfigurations, for example, both NO_HZ_FULL and then binding all the rcuo kthreads to a single CPU on a 100-CPU system or some such.)
Sure, we can revert patches there and there trying to work around the issue, but for objects allocated from process context, we should not have these problems.
Agreed, let's get more info on what is happening to RCU.
One approach is to shorten the RCU CPU stall warning timeout (rcupdate.rcu_cpu_stall_timeout=10 for 10 seconds).
I will also try this and let you know the results.
Thanks, SeongJae Park
Thanx, Paul
On Tue, May 05, 2020 at 07:49:43PM +0200, SeongJae Park wrote:
On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
On 5/5/20 9:13 AM, SeongJae Park wrote:
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
> On 5/5/20 8:07 AM, SeongJae Park wrote:
>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
>>> Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
>> Mainly because of a long RCU grace period, as you guess. I have no idea how the grace period became so long in this case.
>> As my test machine was a virtual machine instance, I guess RCU readers preemption[1] like problem might affected this.
>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
If this is the root cause of the problem, then it will be necessary to provide a hint to the hypervisor. Or, in the near term, avoid loading the hypervisor the point that vCPU preemption is so lengthy.
RCU could also provide some sort of pre-stall-warning notification that some of the CPUs aren't passing through quiescent states, which might allow the guest OS's userspace to take corrective action.
But first, what are you doing to either confirm or invalidate the hypothesis that this might be due to vCPU preemption?
Nothing, I was just guessing. Sorry if this made you confused.
>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>> Yes, both the old kernel that prior to Al's patches and the recent kernel reverting the Al's patches didn't reproduce the problem.
> I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
> These objects should be visible in kmalloc-64 kmem cache.
Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
            OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
before:     5760     5088  88%    0.06K     90       64       360K kmalloc-64
after:   1136576  1136576 100%    0.06K  17759       64     71036K kmalloc-64
Great, thanks.
How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Is it possible to retest on v5.6? I have been adding various mechanisms to make RCU keep up better with heavy callback overload.
I will try soon!
Also, could you please provide the .config? If either NO_HZ_FULL or RCU_NOCB_CPU, please also provide the kernel boot parameters.
NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.
OK, this is important information.
I think I should check whether it's ok to share the full config and boot parameters. Please wait this.
I probably don't need the whole thing. So, if it makes it easier to gain approval...
The main things I need are CONFIG_PREEMPT and the various Kconfig options having "RCU" in their names. For example, I have no need for any of the options pertaining to device drivers. (As far as I know at the moment, anyway!)
For the boot parameters, I am very interested in rcu_nocbs=. Along with any other boot parameters whose names contain "rcu".
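For example, something like the following should be enough to pull those out (assuming the running kernel's config is installed under /boot):

# RCU-related Kconfig options, plus CONFIG_PREEMPT
grep -E 'CONFIG_PREEMPT|RCU' "/boot/config-$(uname -r)"
# any rcu-related boot parameters
tr ' ' '\n' </proc/cmdline | grep -i rcu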
If rcu_nocbs does designate any CPUs, another thing to check is where the rcuo kthreads are permitted to run. The reason this is important is that any CPU listed in the rcu_nocbs= boot parameter has its RCU callbacks invoked by one of the rcuo kthreads. If you have booted with (say) "rcu_nocbs=1,63" and then bound all of the resulting rcuo kthreads to CPU 0, you have just tied RCU's hands, making it unable to keep up with any reasonable RCU callback load.
This sort of configuration is permitted, but it is intended for tightly controlled real-time or HPC systems whose configurations and workloads avoid tossing out large numbers of callbacks. Which might not be the case for your workload.
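If it is not obvious where the rcuo kthreads are allowed to run, something like this can check (a sketch; assumes util-linux's taskset is available):

# print the CPU affinity of every RCU callback-offload kthread
for pid in $(pgrep '^rcuo'); do taskset -cp "$pid"; done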
Let's make sure the bug is not in RCU.
One thing I can currently say is that the grace period passes at last. I modified the benchmark to repeat not 10,000 times but only 5,000 times to run the test without OOM but easily observable memory pressure. As soon as the benchmark finishes, the memory were freed.
If you need more tests, please let me know.
I would ask Paul opinion on this issue, because we have many objects being freed after RCU grace periods.
As always, "It depends."
o If the problem is a too-long RCU reader, RCU is prohibited from ending the grace period. The reader duration must be shortened, and until it is shortened, there is nothing RCU can do.
o In some special cases of the above, RCU can and does help, for example, by enlisting the aid of cond_resched(). So perhaps there is a long in-kernel loop that needs a cond_resched().
And perhaps RCU can help for some types of vCPU preemption.
o As Al suggested offline and as has been discussed in the past, it would not be hard to cause RCU to burn CPU to attain faster grace periods during OOM events. This could be helpful, but only given that RCU readers are completing in reasonable timeframes.
Totally agreed.
If RCU subsystem can not keep-up, I guess other workloads will also suffer.
If readers are not excessively long, RCU should be able to keep up. (In the absence of misconfigurations, for example, both NO_HZ_FULL and then binding all the rcuo kthreads to a single CPU on a 100-CPU system or some such.)
Sure, we can revert patches there and there trying to work around the issue, but for objects allocated from process context, we should not have these problems.
Agreed, let's get more info on what is happening to RCU.
One approach is to shorten the RCU CPU stall warning timeout (rcupdate.rcu_cpu_stall_timeout=10 for 10 seconds).
I will also try this and let you know the results.
Sounds good, thank you!
Thanx, Paul
On Tue, 5 May 2020 11:27:20 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
On Tue, May 05, 2020 at 07:49:43PM +0200, SeongJae Park wrote:
On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
On 5/5/20 9:13 AM, SeongJae Park wrote:
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
>>>> Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
>>> Mainly because of a long RCU grace period, as you guess. I have no idea how the grace period became so long in this case.
>>> As my test machine was a virtual machine instance, I guess RCU readers preemption[1] like problem might affected this.
>>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
If this is the root cause of the problem, then it will be necessary to provide a hint to the hypervisor. Or, in the near term, avoid loading the hypervisor the point that vCPU preemption is so lengthy.
RCU could also provide some sort of pre-stall-warning notification that some of the CPUs aren't passing through quiescent states, which might allow the guest OS's userspace to take corrective action.
But first, what are you doing to either confirm or invalidate the hypothesis that this might be due to vCPU preemption?
Nothing, I was just guessing. Sorry if this made you confused.
>>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>> Yes, both the old kernel that prior to Al's patches and the recent kernel reverting the Al's patches didn't reproduce the problem.
>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
>> These objects should be visible in kmalloc-64 kmem cache.
> Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
>             OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> before:     5760     5088  88%    0.06K     90       64       360K kmalloc-64
> after:   1136576  1136576 100%    0.06K  17759       64     71036K kmalloc-64
Great, thanks.
How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Is it possible to retest on v5.6? I have been adding various mechanisms to make RCU keep up better with heavy callback overload.
I will try soon!
Also, could you please provide the .config? If either NO_HZ_FULL or RCU_NOCB_CPU, please also provide the kernel boot parameters.
NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.
OK, this is important information.
I think I should check whether it's ok to share the full config and boot parameters. Please wait this.
I probably don't need the whole thing. So, if it makes it easier to gain approval...
The main thing I need are CONFIG_PREEMPT and the various Kconfig options having "RCU" in their names. For example, I have no need for any of the options pertaining to device drivers. (As far as I know at the moment, anyway!)
For the boot parameters, I am very interested in rcu_nocbs=. Along with any other boot parameters whose names contain "rcu".
I guess this would be ok.
It uses no 'rcu_nocbs=' boot parameter.
The configs you asked are as below:
# CONFIG_PREEMPT is not set
#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem
If rcu_nocbs does designate have any CPUs listed, another thing to check is where the rcuo kthreads are permitted to run. The reason that this is important is that any CPU listed in the rcu_nocbs= boot parameter has its RCU callbacks invoked by one of the rcuo kthreads. If you have booted with (say) "rcu_nocbs=1,63" and then bound all of the resulting rcuo kthreads to CPU 0, you just tied RCU's hands, making it unable to keep up with any reasonable RCU callback load.
This sort of configuration is permitted, but it is intended for tightly controlled real-time or HPC systems whose configurations and workloads avoid tossing out large numbers of callbacks. Which might not be the case for your workload.
Let's make sure the bug is not in RCU.
One thing I can currently say is that the grace period passes at last. I modified the benchmark to repeat not 10,000 times but only 5,000 times to run the test without OOM but easily observable memory pressure. As soon as the benchmark finishes, the memory were freed.
If you need more tests, please let me know.
I would ask Paul opinion on this issue, because we have many objects being freed after RCU grace periods.
As always, "It depends."
o If the problem is a too-long RCU reader, RCU is prohibited from ending the grace period. The reader duration must be shortened, and until it is shortened, there is nothing RCU can do.
o In some special cases of the above, RCU can and does help, for example, by enlisting the aid of cond_resched(). So perhaps there is a long in-kernel loop that needs a cond_resched().
And perhaps RCU can help for some types of vCPU preemption.
o As Al suggested offline and as has been discussed in the past, it would not be hard to cause RCU to burn CPU to attain faster grace periods during OOM events. This could be helpful, but only given that RCU readers are completing in reasonable timeframes.
Totally agreed.
If RCU subsystem can not keep-up, I guess other workloads will also suffer.
If readers are not excessively long, RCU should be able to keep up. (In the absence of misconfigurations, for example, both NO_HZ_FULL and then binding all the rcuo kthreads to a single CPU on a 100-CPU system or some such.)
Sure, we can revert patches there and there trying to work around the issue, but for objects allocated from process context, we should not have these problems.
Agreed, let's get more info on what is happening to RCU.
One approach is to shorten the RCU CPU stall warning timeout (rcupdate.rcu_cpu_stall_timeout=10 for 10 seconds).
I will also try this and let you know the results.
Sounds good, thank you!
:)
Thanks, SeongJae Park
On Tue, May 05, 2020 at 08:40:07PM +0200, SeongJae Park wrote:
On Tue, 5 May 2020 11:27:20 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
On Tue, May 05, 2020 at 07:49:43PM +0200, SeongJae Park wrote:
On Tue, 5 May 2020 10:23:58 -0700 "Paul E. McKenney" paulmck@kernel.org wrote:
On Tue, May 05, 2020 at 09:25:06AM -0700, Eric Dumazet wrote:
On 5/5/20 9:13 AM, SeongJae Park wrote:
On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet edumazet@google.com wrote:
> On Tue, May 5, 2020 at 8:47 AM SeongJae Park sjpark@amazon.com wrote:
>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet eric.dumazet@gmail.com wrote:
>>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet edumazet@google.com wrote:
>>>>> Why do we have 10,000,000 objects around ? Could this be because of some RCU problem ?
>>>> Mainly because of a long RCU grace period, as you guess. I have no idea how the grace period became so long in this case.
>>>> As my test machine was a virtual machine instance, I guess RCU readers preemption[1] like problem might affected this.
>>>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
If this is the root cause of the problem, then it will be necessary to provide a hint to the hypervisor. Or, in the near term, avoid loading the hypervisor the point that vCPU preemption is so lengthy.
RCU could also provide some sort of pre-stall-warning notification that some of the CPUs aren't passing through quiescent states, which might allow the guest OS's userspace to take corrective action.
But first, what are you doing to either confirm or invalidate the hypothesis that this might be due to vCPU preemption?
Nothing, I was just guessing. Sorry if this made you confused.
>>>>> Once Al patches reverted, do you have 10,000,000 sock_alloc around ?
>>>> Yes, both the old kernel that prior to Al's patches and the recent kernel reverting the Al's patches didn't reproduce the problem.
>>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
>>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq object that was allocated in sock_alloc_inode() before Al patches.
>>> These objects should be visible in kmalloc-64 kmem cache.
>> Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
>>             OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>> before:     5760     5088  88%    0.06K     90       64       360K kmalloc-64
>> after:   1136576  1136576 100%    0.06K  17759       64     71036K kmalloc-64
> Great, thanks.
> How recent is the kernel you are running for your experiment ?
It's based on 5.4.35.
Is it possible to retest on v5.6? I have been adding various mechanisms to make RCU keep up better with heavy callback overload.
I will try soon!
Also, could you please provide the .config? If either NO_HZ_FULL or RCU_NOCB_CPU, please also provide the kernel boot parameters.
NO_HZ_FULL is not set, but RCU_NOCB_CPU is y.
OK, this is important information.
I think I should check whether it's ok to share the full config and boot parameters. Please wait this.
I probably don't need the whole thing. So, if it makes it easier to gain approval...
The main thing I need are CONFIG_PREEMPT and the various Kconfig options having "RCU" in their names. For example, I have no need for any of the options pertaining to device drivers. (As far as I know at the moment, anyway!)
For the boot parameters, I am very interested in rcu_nocbs=. Along with any other boot parameters whose names contain "rcu".
I guess this would be ok.
It uses no 'rcu_nocbs=' boot parameter.
OK, thank you!
For whatever it is worth, if the 'rcu_nocbs=' boot parameter is never specified, there is no need to build with CONFIG_RCU_NOCB_CPU=y.
The configs you asked are as below:
# CONFIG_PREEMPT is not set
#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem
Looks pretty standard otherwise. ;-)
Thanx, Paul
If rcu_nocbs does designate have any CPUs listed, another thing to check is where the rcuo kthreads are permitted to run. The reason that this is important is that any CPU listed in the rcu_nocbs= boot parameter has its RCU callbacks invoked by one of the rcuo kthreads. If you have booted with (say) "rcu_nocbs=1,63" and then bound all of the resulting rcuo kthreads to CPU 0, you just tied RCU's hands, making it unable to keep up with any reasonable RCU callback load.
This sort of configuration is permitted, but it is intended for tightly controlled real-time or HPC systems whose configurations and workloads avoid tossing out large numbers of callbacks. Which might not be the case for your workload.
> Let's make sure the bug is not in RCU.
One thing I can currently say is that the grace period passes at last. I modified the benchmark to repeat not 10,000 times but only 5,000 times to run the test without OOM but easily observable memory pressure. As soon as the benchmark finishes, the memory were freed.
If you need more tests, please let me know.
I would ask Paul opinion on this issue, because we have many objects being freed after RCU grace periods.
As always, "It depends."
o If the problem is a too-long RCU reader, RCU is prohibited from ending the grace period. The reader duration must be shortened, and until it is shortened, there is nothing RCU can do.
o In some special cases of the above, RCU can and does help, for example, by enlisting the aid of cond_resched(). So perhaps there is a long in-kernel loop that needs a cond_resched().
And perhaps RCU can help for some types of vCPU preemption.
o As Al suggested offline and as has been discussed in the past, it would not be hard to cause RCU to burn CPU to attain faster grace periods during OOM events. This could be helpful, but only given that RCU readers are completing in reasonable timeframes.
Totally agreed.
If RCU subsystem can not keep-up, I guess other workloads will also suffer.
If readers are not excessively long, RCU should be able to keep up. (In the absence of misconfigurations, for example, both NO_HZ_FULL and then binding all the rcuo kthreads to a single CPU on a 100-CPU system or some such.)
Sure, we can revert patches there and there trying to work around the issue, but for objects allocated from process context, we should not have these problems.
Agreed, let's get more info on what is happening to RCU.
One approach is to shorten the RCU CPU stall warning timeout (rcupdate.rcu_cpu_stall_timeout=10 for 10 seconds).
I will also try this and let you know the results.
Sounds good, thank you!
:)
Thanks, SeongJae Park
On Tue, May 05, 2020 at 09:00:44AM -0700, Eric Dumazet wrote:
Not exactly the 10,000,000, as it is only the possible highest number, but I was able to observe clear exponential increase of the number of the objects using slabtop. Before the start of the problematic workload, the number of objects of 'kmalloc-64' was 5760, but I was able to observe the number increase to 1,136,576.
            OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
before:     5760     5088  88%    0.06K     90       64       360K kmalloc-64
after:   1136576  1136576 100%    0.06K  17759       64     71036K kmalloc-64
Great, thanks.
How recent is the kernel you are running for your experiment ?
Let's make sure the bug is not in RCU.
After Al changes, RCU got slightly better under stress.
The thing that worries me here is that this is far from being the only source of RCU-delayed freeing of objects. If we really see bogus OOM kills due to that (IRL, not in an artificial microbenchmark), we'd better do something that would help with all those sources, not just paper over the contributions from one of those. Because there's no chance in hell to get rid of RCU-delayed freeing in general...
Does the problem extend to kfree_rcu()? And there are a lot of RCU callbacks that boil down to kmem_cache_free(); those really look like they should have the exact same issue - sock_free_inode() is one of those, after all.
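For reference, the generic shape of the callbacks Al is referring to looks roughly like this (hypothetical 'blob' object, not the sockfs code; kfree_rcu() is the equivalent shorthand for kmalloc()-backed objects):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct blob {                   /* hypothetical RCU-protected object */
	struct rcu_head rcu;
	int data;
};

static struct kmem_cache *blob_cache;

/* The RCU callback does nothing but return memory to the cache. */
static void blob_free_rcu(struct rcu_head *head)
{
	kmem_cache_free(blob_cache, container_of(head, struct blob, rcu));
}

static void blob_release(struct blob *b)
{
	/* Freeing is deferred until after a grace period, so under a long
	 * grace period many such objects can pile up in the cache. */
	call_rcu(&b->rcu, blob_free_rcu);
}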