Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
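In case it helps anyone reproduce the fork-path contention without our whole service, a plain fork storm should be enough to make queued_spin_lock_slowpath show up in perf top, since with CONFIG_VMAP_STACK=y every fork has to set up a vmalloc'ed stack. A minimal test program along those lines (just an illustrative sketch, not something we actually ran in production):

/*
 * fork_storm.c: hammer fork()/exit to stress the vmap'ed-stack
 * allocation path.  Illustrative sketch only.
 *
 * Build: gcc -O2 -o fork_storm fork_storm.c
 * Run one instance per CPU and watch perf top for
 * queued_spin_lock_slowpath / sync_global_pgds.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long i, iterations = argc > 1 ? atol(argv[1]) : 100000;

        for (i = 0; i < iterations; i++) {
                pid_t pid = fork();

                if (pid < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid == 0)
                        _exit(0);           /* child exits right away */
                waitpid(pid, NULL, 0);      /* parent reaps and forks again */
        }
        return 0;
}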
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
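For reference, my hand-revert boils down to dropping the sync call that the backport added near the top of __purge_vmap_area_lazy() in mm/vmalloc.c. Roughly the following (reconstructed from memory, so please diff it against your own tree rather than applying it blindly):

--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ ... @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
        lockdep_assert_held(&vmap_purge_lock);
 
-       /*
-        * First make sure the mappings are removed from all page-tables
-        * before they are freed.
-        */
-       vmalloc_sync_all();
-
        valist = llist_del_all(&vmap_purge_list);

With that call gone, this vmalloc path no longer goes through sync_global_pgds(), which is (as I understand it) where the pgd_lock contention with pgd_alloc() in the fork path came from.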
Regards ChenYu
Hi,
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
Yes, that is a known problem, with a couple of reports already in the past months. And I posted a fix which I thought was on its way upstream, but apparently it's not:
https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/
Adding Andrew and the x86 maintainers to Cc.
Regards,
Joerg
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
Regards ChenYu
On Thu, Dec 12, 2019 at 12:19:11PM +0100, Joerg Roedel wrote:
Hi,
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
Yes, that is a known problem, with a couple of reports already in the past months. And I posted a fix which I thought was on its way upstream, but apparently it's not:
https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/
Ah, I missed that it is in linux-next already. Sorry for the noise.
Joerg
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
The above commit should resolve the issue for you; can you try it out on 5.4? And is there any reason you have to stick with the old 4.19 kernel?
thanks,
greg k-h
On Thu, Dec 12, 2019 at 7:19 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
The above commit should resolve the issue for you; can you try it out on 5.4? And is there any reason you have to stick with the old 4.19 kernel?
We typically run new kernels on the other server (the one I'm currently doing git bisect on) for a couple weeks before running it on our main server. That one doesn't see nearly as much load though. Also because of the increased memory usage I was seeing in 5.1.21, I wasn't particularly comfortable going directly to 5.4.
I suppose the reason for being overly cautious is that the server is a pain to reboot. The service is monolithic, running on just the one server. And any significant downtime _always_ hits the local newspapers. Combine that with the upcoming election, and conspiracy theories start flying around. :( Now that it looks stable, we probably won't be testing anything new until mid-January.
ChenYu
On Thu, Dec 12, 2019 at 07:31:54PM +0800, Chen-Yu Tsai wrote:
On Thu, Dec 12, 2019 at 7:19 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
On Thu, Dec 12, 2019 at 06:54:12PM +0800, Chen-Yu Tsai wrote:
Hi,
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
(Reconnections happen because a lot of people use mobile apps that wrap the service, but they get disconnected as soon as they are backgrounded.)
With v4.19.88 we saw a lot of contention on pgd_lock in the process fork path with CONFIG_VMAP_STACK=y:
Samples: 937K of event 'cycles:ppp', Event count (approx.): 499112453614
  Children      Self  Command  Shared Object      Symbol
+   31.15%     0.03%  mbbsd    [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   31.12%     0.02%  mbbsd    [kernel.kallsyms]  [k] do_syscall_64
+   28.12%     0.42%  mbbsd    [kernel.kallsyms]  [k] do_raw_spin_lock
-   27.70%    27.62%  mbbsd    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 18.73% __libc_fork
      - 18.33% entry_SYSCALL_64_after_hwframe
           do_syscall_64
         - _do_fork
            - 18.33% copy_process.part.64
               - 11.00% __vmalloc_node_range
                  - 10.93% sync_global_pgds_l4
                       do_raw_spin_lock
                       queued_spin_lock_slowpath
               - 7.27% mm_init.isra.59
                    pgd_alloc
                    do_raw_spin_lock
                    queued_spin_lock_slowpath
   - 8.68% 0x41fd89415541f689
      - __libc_start_main
         + 7.49% main
         + 0.90% main
This hit us pretty hard, with the service dropping below one-third of its original capacity.
With CONFIG_VMAP_STACK=n, the fork code path skips this, but other vmalloc users are still affected. One other area is the tty layer. This also causes problems for us, since there can be as many as 15k users over SSH, some coming and going. So we got a lot of hung sshd processes as well. Unfortunately I don't have any perf reports or kernel logs to go with this.
Now I understand that there is already a fix in -next:
https://lore.kernel.org/patchwork/patch/1137341/
However the code has changed a lot in mainline and I'm not sure how to backport this. For now I just reverted the commit by hand by removing the offending code. Seems to work OK, and based on the commit logs I guess it's safe to do so, as we're not running X86-32 or PTI.
The above commit should resolve the issue for you; can you try it out on 5.4? And is there any reason you have to stick with the old 4.19 kernel?
We typically run new kernels on the other server (the one I'm currently doing git bisect on) for a couple weeks before running it on our main server. That one doesn't see nearly as much load though. Also because of the increased memory usage I was seeing in 5.1.21, I wasn't particularly comfortable going directly to 5.4.
I suppose the reason for being overly cautious is that the server is a pain to reboot. The service is monolithic, running on just the one server. And any significant downtime _always_ hits the local newspapers. Combine that with the upcoming election, and conspiracy theories start flying around. :( Now that it looks stable, we probably won't be testing anything new until mid-January.
Fair enough, good luck!
greg k-h
Hi!
I'd like to report a very severe performance regression due to
mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() in stable kernels
in v4.19.88. I believe this has been included since v4.19.67. It is also in all the other LTS kernels except 3.16.
So today I switched an x86_64 production server from v5.1.21 to v4.19.88, because we kept hitting runaway kcompactd and kswapd. Plus there was a significant increase in memory usage compared to v5.1.5. I'm still bisecting that on another production server.
The service we run is one of the largest forums in Taiwan [1]. It is a terminal-based bulletin board system running over telnet, SSH or a custom WebSocket bridge. The service itself is the one-process-per-user type of design from the old days. This means a lot of forks when there are user spikes or reconnections.
Sounds like fun :-).
I noticed that there's something vmalloc-related in 4.19.89:

Subject: [PATCH 4.19 210/243] x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
From: Joerg Roedel <jroedel@suse.de>

commit 9a62d20027da3164a22244d9f022c0c987261687 upstream.
But looking at the changelog again, it may not solve the performance problem.
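If I read it right, that patch only narrows the address range that the 32-bit vmalloc_sync_all() walks in arch/x86/mm/fault.c, roughly like this (quoting from memory, so the exact context may differ):

        /*
         * 32-bit vmalloc_sync_all(): the 4.19.89 patch only tightens the
         * upper bound of this walk from FIXADDR_TOP to VMALLOC_END.  The
         * x86-64 side, where sync_global_pgds() runs under pgd_lock, is
         * untouched, so the fork contention reported above should not be
         * affected either way.
         */
        for (address = VMALLOC_START & PMD_MASK;
             address >= TASK_SIZE_MAX && address < VMALLOC_END;
             address += PMD_SIZE) {
                /* ... sync this PMD into every page table ... */
        }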
Best regards, Pavel