Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
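In case it helps anyone reproduce the fork-path contention without our whole service, a plain fork storm should be enough to make queued_spin_lock_slowpath show up in perf top, since with CONFIG_VMAP_STACK=y every fork has to set up a vmalloc'ed stack. A minimal test program along those lines (just an illustrative sketch, not something we actually ran in production):

/*
 * fork_storm.c: hammer fork()/exit to stress the vmap'ed-stack
 * allocation path.  Illustrative sketch only.
 *
 * Build: gcc -O2 -o fork_storm fork_storm.c
 * Run one instance per CPU and watch perf top for
 * queued_spin_lock_slowpath / sync_global_pgds.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long i, iterations = argc > 1 ? atol(argv[1]) : 100000;

        for (i = 0; i < iterations; i++) {
                pid_t pid = fork();

                if (pid < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid == 0)
                        _exit(0);           /* child exits right away */
                waitpid(pid, NULL, 0);      /* parent reaps and forks again */
        }
        return 0;
}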
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
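For reference, my hand-revert boils down to dropping the sync call that the backport added near the top of __purge_vmap_area_lazy() in mm/vmalloc.c. Roughly the following (reconstructed from memory, so please diff it against your own tree rather than applying it blindly):

--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ ... @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
        lockdep_assert_held(&vmap_purge_lock);
 
-       /*
-        * First make sure the mappings are removed from all page-tables
-        * before they are freed.
-        */
-       vmalloc_sync_all();
-
        valist = llist_del_all(&vmap_purge_list);

With that call gone, this vmalloc path no longer goes through sync_global_pgds(), which is (as I understand it) where the pgd_lock contention with pgd_alloc() in the fork path came from.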
Regards ChenYu
Hi,
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
Yes, that is a known problem, with a couple of reports already in the past months. And I posted a fix which I thought was on its way upstream, but apparently it's not:
https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/
Adding Andrew and the x86 maintainers to Cc.
Regards,
Joerg
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
Regards ChenYu
On Thu, Dec 12, 2019 at 12:19:11PM +0100, Joerg Roedel wrote:
Hi,
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
Yes, that is a known problem, with a couple of reports already in the past months. And I posted a fix which I thought was on its way upstream, but apparently it's not:
https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/
Ah, I missed that it is in linux-next already. Sorry for the noise.
Joerg
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
The above commit should resolve the issue for you; can you try it out on 5.4? And is there any reason you have to stick with the old 4.19 kernel?
thanks,
greg k-h
On Thu, Dec 12, 2019 at 7:19 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
The above commit should resolve the issue for you; can you try it out on 5.4? And is there any reason you have to stick with the old 4.19 kernel?
We typically run new kernels on the other server (the one I'm currently doing git bisect on) for a couple weeks before running it on our main server. That one doesn't see nearly as much load though. Also because of the increased memory usage I was seeing in 5.1.21, I wasn't particularly comfortable going directly to 5.4.
I suppose the reason for being overly cautious is that the server is a pain to reboot. The service is monolithic, running on just the one server. And any significant downtime _always_ hits the local newspapers. Combine that with the upcoming election, and conspiracy theories start flying around. :( Now that it looks stable, we probably won't be testing anything new until mid-January.
ChenYu
On Thu, Dec 12, 2019 at 07:31:54PM +0800, Chen-Yu Tsai wrote:
On Thu, Dec 12, 2019 at 7:19 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
The above commit should resolve the issue for you; can you try it out on 5.4? And is there any reason you have to stick with the old 4.19 kernel?
We typically run new kernels on the other server (the one I'm currently doing git bisect on) for a couple weeks before running it on our main server. That one doesn't see nearly as much load though. Also because of the increased memory usage I was seeing in 5.1.21, I wasn't particularly comfortable going directly to 5.4.
I suppose the reason for being overly cautious is that the server is a pain to reboot. The service is monolithic, running on just the one server. And any significant downtime _always_ hits the local newspapers. Combine that with the upcoming election, and conspiracy theories start flying around. :( Now that it looks stable, we probably won't be testing anything new until mid-January.
Fair enough, good luck!
greg k-h
Hi!
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
Sounds like fun :-).
I noticed that there's something vmalloc-related in 4.19.89:

Subject: [PATCH 4.19 210/243] x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
From: Joerg Roedel <jroedel@suse.de>

commit 9a62d20027da3164a22244d9f022c0c987261687 upstream.
But looking at the changelog again, it may not solve the performance problem.
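If I read it right, that patch only narrows the address range that the 32-bit vmalloc_sync_all() walks in arch/x86/mm/fault.c, roughly like this (quoting from memory, so the exact context may differ):

        /*
         * 32-bit vmalloc_sync_all(): the 4.19.89 patch only tightens the
         * upper bound of this walk from FIXADDR_TOP to VMALLOC_END.  The
         * x86-64 side, where sync_global_pgds() runs under pgd_lock, is
         * untouched, so the fork contention reported above should not be
         * affected either way.
         */
        for (address = VMALLOC_START & PMD_MASK;
             address >= TASK_SIZE_MAX && address < VMALLOC_END;
             address += PMD_SIZE) {
                /* ... sync this PMD into every page table ... */
        }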
Best regards, Pavel