Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
The existing cgroup slab memory controller is based on the idea of replicating slab allocator internals for each memory cgroup. This approach promises a low memory overhead (one pointer per page), and isn't adding too much code on hot allocation and release paths. But is has a very serious flaw: it leads to a low slab utilization.
Using a drgn* script I've got an estimation of slab utilization on a number of machines running different production workloads. In most cases it was between 45% and 65%, and the best number I've seen was around 85%. Turning kmem accounting off brings it to high 90s. Also it brings back 30-50% of slab memory. It means that the real price of the existing slab memory controller is way bigger than a pointer per page.
The real reason why the existing design leads to a low slab utilization is simple: slab pages are used exclusively by one memory cgroup. If there are only few allocations of certain size made by a cgroup, or if some active objects (e.g. dentries) are left after the cgroup is deleted, or the cgroup contains a single-threaded application which is barely allocating any kernel objects, but does it every time on a new CPU: in all these cases the resulting slab utilization is very low. If kmem accounting is off, the kernel is able to use free space on slab pages for other allocations.
Arguably it wasn't an issue back to days when the kmem controller was introduced and was an opt-in feature, which had to be turned on individually for each memory cgroup. But now it's turned on by default on both cgroup v1 and v2. And modern systemd-based systems tend to create a large number of cgroups.
This patchset provides a new implementation of the slab memory controller, which aims to reach a much better slab utilization by sharing slab pages between multiple memory cgroups. Below is the short description of the new design (more details in commit messages).
Accounting is performed per-object instead of per-page. Slab-related vmstat counters are converted to bytes. Charging is performed on page-basis, with rounding up and remembering leftovers.
Memcg ownership data is stored in a per-slab-page vector: for each slab page a vector of corresponding size is allocated. To keep slab memory reparenting working, instead of saving a pointer to the memory cgroup directly an intermediate object is used. It's simply a pointer to a memcg (which can be easily changed to the parent) with a built-in reference counter. This scheme allows to reparent all allocated objects without walking them over and changing memcg pointer to the parent.
Instead of creating an individual set of kmem_caches for each memory cgroup, two global sets are used: the root set for non-accounted and root-cgroup allocations and the second set for all other allocations. This allows to simplify the lifetime management of individual kmem_caches: they are destroyed with root counterparts. It allows to remove a good amount of code and make things generally simpler.
The patchset* has been tested on a number of different workloads in our production. In all cases it saved significant amount of memory, measured from high hundreds of MBs to single GBs per host. On average, the size of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman
BTW, I replied to a wrong version of this work. I intended to reply to version 7: https://lore.kernel.org/lkml/20200623174037.3951353-1-guro@fb.com/
Nevertheless, the problem is the same.
Thank you, Pasha
On Wed, Aug 12, 2020 at 7:16 PM Pavel Tatashin pasha.tatashin@soleen.com wrote:
Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
The existing cgroup slab memory controller is based on the idea of replicating slab allocator internals for each memory cgroup. This approach promises a low memory overhead (one pointer per page), and isn't adding too much code on hot allocation and release paths. But is has a very serious flaw: it leads to a low slab utilization.
Using a drgn* script I've got an estimation of slab utilization on a number of machines running different production workloads. In most cases it was between 45% and 65%, and the best number I've seen was around 85%. Turning kmem accounting off brings it to high 90s. Also it brings back 30-50% of slab memory. It means that the real price of the existing slab memory controller is way bigger than a pointer per page.
The real reason why the existing design leads to a low slab utilization is simple: slab pages are used exclusively by one memory cgroup. If there are only few allocations of certain size made by a cgroup, or if some active objects (e.g. dentries) are left after the cgroup is deleted, or the cgroup contains a single-threaded application which is barely allocating any kernel objects, but does it every time on a new CPU: in all these cases the resulting slab utilization is very low. If kmem accounting is off, the kernel is able to use free space on slab pages for other allocations.
Arguably it wasn't an issue back to days when the kmem controller was introduced and was an opt-in feature, which had to be turned on individually for each memory cgroup. But now it's turned on by default on both cgroup v1 and v2. And modern systemd-based systems tend to create a large number of cgroups.
This patchset provides a new implementation of the slab memory controller, which aims to reach a much better slab utilization by sharing slab pages between multiple memory cgroups. Below is the short description of the new design (more details in commit messages).
Accounting is performed per-object instead of per-page. Slab-related vmstat counters are converted to bytes. Charging is performed on page-basis, with rounding up and remembering leftovers.
Memcg ownership data is stored in a per-slab-page vector: for each slab page a vector of corresponding size is allocated. To keep slab memory reparenting working, instead of saving a pointer to the memory cgroup directly an intermediate object is used. It's simply a pointer to a memcg (which can be easily changed to the parent) with a built-in reference counter. This scheme allows to reparent all allocated objects without walking them over and changing memcg pointer to the parent.
Instead of creating an individual set of kmem_caches for each memory cgroup, two global sets are used: the root set for non-accounted and root-cgroup allocations and the second set for all other allocations. This allows to simplify the lifetime management of individual kmem_caches: they are destroyed with root counterparts. It allows to remove a good amount of code and make things generally simpler.
The patchset* has been tested on a number of different workloads in our production. In all cases it saved significant amount of memory, measured from high hundreds of MBs to single GBs per host. On average, the size of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman
On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Hi Pavel,
wow, it's a quite complicated deadlock. Thank you for providing a perfect analysis!
Unfortunately, backporting the whole new slab controller isn't an option: it's way too big and invasive. Do you already have a standalone fix?
Thanks!
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
The existing cgroup slab memory controller is based on the idea of replicating slab allocator internals for each memory cgroup. This approach promises a low memory overhead (one pointer per page), and isn't adding too much code on hot allocation and release paths. But is has a very serious flaw: it leads to a low slab utilization.
Using a drgn* script I've got an estimation of slab utilization on a number of machines running different production workloads. In most cases it was between 45% and 65%, and the best number I've seen was around 85%. Turning kmem accounting off brings it to high 90s. Also it brings back 30-50% of slab memory. It means that the real price of the existing slab memory controller is way bigger than a pointer per page.
The real reason why the existing design leads to a low slab utilization is simple: slab pages are used exclusively by one memory cgroup. If there are only few allocations of certain size made by a cgroup, or if some active objects (e.g. dentries) are left after the cgroup is deleted, or the cgroup contains a single-threaded application which is barely allocating any kernel objects, but does it every time on a new CPU: in all these cases the resulting slab utilization is very low. If kmem accounting is off, the kernel is able to use free space on slab pages for other allocations.
Arguably it wasn't an issue back to days when the kmem controller was introduced and was an opt-in feature, which had to be turned on individually for each memory cgroup. But now it's turned on by default on both cgroup v1 and v2. And modern systemd-based systems tend to create a large number of cgroups.
This patchset provides a new implementation of the slab memory controller, which aims to reach a much better slab utilization by sharing slab pages between multiple memory cgroups. Below is the short description of the new design (more details in commit messages).
Accounting is performed per-object instead of per-page. Slab-related vmstat counters are converted to bytes. Charging is performed on page-basis, with rounding up and remembering leftovers.
Memcg ownership data is stored in a per-slab-page vector: for each slab page a vector of corresponding size is allocated. To keep slab memory reparenting working, instead of saving a pointer to the memory cgroup directly an intermediate object is used. It's simply a pointer to a memcg (which can be easily changed to the parent) with a built-in reference counter. This scheme allows to reparent all allocated objects without walking them over and changing memcg pointer to the parent.
Instead of creating an individual set of kmem_caches for each memory cgroup, two global sets are used: the root set for non-accounted and root-cgroup allocations and the second set for all other allocations. This allows to simplify the lifetime management of individual kmem_caches: they are destroyed with root counterparts. It allows to remove a good amount of code and make things generally simpler.
The patchset* has been tested on a number of different workloads in our production. In all cases it saved significant amount of memory, measured from high hundreds of MBs to single GBs per host. On average, the size of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman
On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin guro@fb.com wrote:
On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Hi Roman,
Hi Pavel,
wow, it's a quite complicated deadlock. Thank you for providing a perfect analysis!
Thank you, it indeed took me a while to fully grasp the deadlock.
Unfortunately, backporting the whole new slab controller isn't an option: it's way too big and invasive.
This is what I thought as well, this is why I want to figure out what is the best way forward.
Do you already have a standalone fix?
Not yet, I do not have a standalone fix. I suspect the best fix would be to address fix css_killed_work_fn() stack so we never have: cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse the order would work. If you have suggestions since you worked on this code recently, please let me know.
Thank you, Pasha
Thanks!
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
The existing cgroup slab memory controller is based on the idea of replicating slab allocator internals for each memory cgroup. This approach promises a low memory overhead (one pointer per page), and isn't adding too much code on hot allocation and release paths. But is has a very serious flaw: it leads to a low slab utilization.
Using a drgn* script I've got an estimation of slab utilization on a number of machines running different production workloads. In most cases it was between 45% and 65%, and the best number I've seen was around 85%. Turning kmem accounting off brings it to high 90s. Also it brings back 30-50% of slab memory. It means that the real price of the existing slab memory controller is way bigger than a pointer per page.
The real reason why the existing design leads to a low slab utilization is simple: slab pages are used exclusively by one memory cgroup. If there are only few allocations of certain size made by a cgroup, or if some active objects (e.g. dentries) are left after the cgroup is deleted, or the cgroup contains a single-threaded application which is barely allocating any kernel objects, but does it every time on a new CPU: in all these cases the resulting slab utilization is very low. If kmem accounting is off, the kernel is able to use free space on slab pages for other allocations.
Arguably it wasn't an issue back to days when the kmem controller was introduced and was an opt-in feature, which had to be turned on individually for each memory cgroup. But now it's turned on by default on both cgroup v1 and v2. And modern systemd-based systems tend to create a large number of cgroups.
This patchset provides a new implementation of the slab memory controller, which aims to reach a much better slab utilization by sharing slab pages between multiple memory cgroups. Below is the short description of the new design (more details in commit messages).
Accounting is performed per-object instead of per-page. Slab-related vmstat counters are converted to bytes. Charging is performed on page-basis, with rounding up and remembering leftovers.
Memcg ownership data is stored in a per-slab-page vector: for each slab page a vector of corresponding size is allocated. To keep slab memory reparenting working, instead of saving a pointer to the memory cgroup directly an intermediate object is used. It's simply a pointer to a memcg (which can be easily changed to the parent) with a built-in reference counter. This scheme allows to reparent all allocated objects without walking them over and changing memcg pointer to the parent.
Instead of creating an individual set of kmem_caches for each memory cgroup, two global sets are used: the root set for non-accounted and root-cgroup allocations and the second set for all other allocations. This allows to simplify the lifetime management of individual kmem_caches: they are destroyed with root counterparts. It allows to remove a good amount of code and make things generally simpler.
The patchset* has been tested on a number of different workloads in our production. In all cases it saved significant amount of memory, measured from high hundreds of MBs to single GBs per host. On average, the size of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex -> mem_hotplug_lock ordering that was recently fixed in the mainline but still present in all stable branches. Unfortunately, I do not see a simple fix in how to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series that is too big for stable.
Thanks, Pasha
On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin pasha.tatashin@soleen.com wrote:
On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin guro@fb.com wrote:
On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Hi Roman,
Hi Pavel,
wow, it's a quite complicated deadlock. Thank you for providing a perfect analysis!
Thank you, it indeed took me a while to fully grasp the deadlock.
Unfortunately, backporting the whole new slab controller isn't an option: it's way too big and invasive.
This is what I thought as well, this is why I want to figure out what is the best way forward.
Do you already have a standalone fix?
Not yet, I do not have a standalone fix. I suspect the best fix would be to address fix css_killed_work_fn() stack so we never have: cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse the order would work. If you have suggestions since you worked on this code recently, please let me know.
Thank you, Pasha
Thanks!
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
The existing cgroup slab memory controller is based on the idea of replicating slab allocator internals for each memory cgroup. This approach promises a low memory overhead (one pointer per page), and isn't adding too much code on hot allocation and release paths. But is has a very serious flaw: it leads to a low slab utilization.
Using a drgn* script I've got an estimation of slab utilization on a number of machines running different production workloads. In most cases it was between 45% and 65%, and the best number I've seen was around 85%. Turning kmem accounting off brings it to high 90s. Also it brings back 30-50% of slab memory. It means that the real price of the existing slab memory controller is way bigger than a pointer per page.
The real reason why the existing design leads to a low slab utilization is simple: slab pages are used exclusively by one memory cgroup. If there are only few allocations of certain size made by a cgroup, or if some active objects (e.g. dentries) are left after the cgroup is deleted, or the cgroup contains a single-threaded application which is barely allocating any kernel objects, but does it every time on a new CPU: in all these cases the resulting slab utilization is very low. If kmem accounting is off, the kernel is able to use free space on slab pages for other allocations.
Arguably it wasn't an issue back to days when the kmem controller was introduced and was an opt-in feature, which had to be turned on individually for each memory cgroup. But now it's turned on by default on both cgroup v1 and v2. And modern systemd-based systems tend to create a large number of cgroups.
This patchset provides a new implementation of the slab memory controller, which aims to reach a much better slab utilization by sharing slab pages between multiple memory cgroups. Below is the short description of the new design (more details in commit messages).
Accounting is performed per-object instead of per-page. Slab-related vmstat counters are converted to bytes. Charging is performed on page-basis, with rounding up and remembering leftovers.
Memcg ownership data is stored in a per-slab-page vector: for each slab page a vector of corresponding size is allocated. To keep slab memory reparenting working, instead of saving a pointer to the memory cgroup directly an intermediate object is used. It's simply a pointer to a memcg (which can be easily changed to the parent) with a built-in reference counter. This scheme allows to reparent all allocated objects without walking them over and changing memcg pointer to the parent.
Instead of creating an individual set of kmem_caches for each memory cgroup, two global sets are used: the root set for non-accounted and root-cgroup allocations and the second set for all other allocations. This allows to simplify the lifetime management of individual kmem_caches: they are destroyed with root counterparts. It allows to remove a good amount of code and make things generally simpler.
The patchset* has been tested on a number of different workloads in our production. In all cases it saved significant amount of memory, measured from high hundreds of MBs to single GBs per host. On average, the size of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman
On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex -> mem_hotplug_lock ordering that was recently fixed in the mainline but still present in all stable branches. Unfortunately, I do not see a simple fix in how to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series that is too big for stable.
We too are seeing this on Power systems when stress-testing memory hotplug, but with the following call trace (from hung task timer) instead of Thread #2 above:
__switch_to __schedule schedule percpu_rwsem_wait __percpu_down_read get_online_mems memcg_create_kmem_cache memcg_kmem_cache_create_func process_one_work worker_thread kthread ret_from_kernel_thread
While I understand that Roman's new slab controller patchset will fix this, I also wonder if infinitely looping in the memory unplug path with mem_hotplug_lock held is the right thing to do? Earlier we had a few other exit possibilities in this path (like max retries etc) but those were removed by commits:
72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
Or, is the user-space test is expected to induce a signal back-off when unplug doesn't complete within a reasonable amount of time?
Regards, Bharata.
On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao bharata@linux.ibm.com wrote:
On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex -> mem_hotplug_lock ordering that was recently fixed in the mainline but still present in all stable branches. Unfortunately, I do not see a simple fix in how to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series that is too big for stable.
We too are seeing this on Power systems when stress-testing memory hotplug, but with the following call trace (from hung task timer) instead of Thread #2 above:
__switch_to __schedule schedule percpu_rwsem_wait __percpu_down_read get_online_mems memcg_create_kmem_cache memcg_kmem_cache_create_func process_one_work worker_thread kthread ret_from_kernel_thread
While I understand that Roman's new slab controller patchset will fix this, I also wonder if infinitely looping in the memory unplug path with mem_hotplug_lock held is the right thing to do? Earlier we had a few other exit possibilities in this path (like max retries etc) but those were removed by commits:
72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
Or, is the user-space test is expected to induce a signal back-off when unplug doesn't complete within a reasonable amount of time?
Hi Bharata,
Thank you for your input, it looks like you are experiencing the same problems that I observed.
What I found is that the reason why our machines did not complete hot-remove within the given time is because of this bug: https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@sole...
Could you please try it and see if that helps for your case?
Thank you, Pasha
On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao bharata@linux.ibm.com wrote:
On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex -> mem_hotplug_lock ordering that was recently fixed in the mainline but still present in all stable branches. Unfortunately, I do not see a simple fix in how to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series that is too big for stable.
We too are seeing this on Power systems when stress-testing memory hotplug, but with the following call trace (from hung task timer) instead of Thread #2 above:
__switch_to __schedule schedule percpu_rwsem_wait __percpu_down_read get_online_mems memcg_create_kmem_cache memcg_kmem_cache_create_func process_one_work worker_thread kthread ret_from_kernel_thread
While I understand that Roman's new slab controller patchset will fix this, I also wonder if infinitely looping in the memory unplug path with mem_hotplug_lock held is the right thing to do? Earlier we had a few other exit possibilities in this path (like max retries etc) but those were removed by commits:
72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
Or, is the user-space test is expected to induce a signal back-off when unplug doesn't complete within a reasonable amount of time?
Hi Bharata,
Thank you for your input, it looks like you are experiencing the same problems that I observed.
What I found is that the reason why our machines did not complete hot-remove within the given time is because of this bug: https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@sole...
Could you please try it and see if that helps for your case?
I am on an old codebase that already has the fix that you are proposing, so I might be seeing someother issue which I will debug further.
So looks like the loop in __offline_pages() had a call to drain_all_pages() before it was removed by
c52e75935f8d: mm: remove extra drain pages on pcp list
Regards, Bharata.
I am on an old codebase that already has the fix that you are proposing, so I might be seeing someother issue which I will debug further.
So looks like the loop in __offline_pages() had a call to drain_all_pages() before it was removed by
c52e75935f8d: mm: remove extra drain pages on pcp list
I see, thanks. There is a reason to have the second drain, my fix is a little better as it is performed only on a rare occasion when it is needed, but I should add a FIXES tag. I have not checked alloc_contig_range race.
On 8/28/20 6:47 PM, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for
Why is that hotremove even needed if we're shutting down? Are there any (virtualization?) platforms where it makes some difference over plain shutdown/restart?
some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: systemd ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt to thread #1.
The proper fix for both of the problems is to avoid cgroup_mutex -> mem_hotplug_lock ordering that was recently fixed in the mainline but still present in all stable branches. Unfortunately, I do not see a simple fix in how to remove mem_hotplug_lock from memcg_deactivate_kmem_caches without using Roman's series that is too big for stable.
Thanks, Pasha
On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin pasha.tatashin@soleen.com wrote:
On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin guro@fb.com wrote:
On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
Guys,
There is a convoluted deadlock that I just root caused, and that is fixed by this work (at least based on my code inspection it appears to be fixed); but the deadlock exists in older and stable kernels, and I am not sure whether to create a separate patch for it, or backport this whole thing.
Hi Roman,
Hi Pavel,
wow, it's a quite complicated deadlock. Thank you for providing a perfect analysis!
Thank you, it indeed took me a while to fully grasp the deadlock.
Unfortunately, backporting the whole new slab controller isn't an option: it's way too big and invasive.
This is what I thought as well, this is why I want to figure out what is the best way forward.
Do you already have a standalone fix?
Not yet, I do not have a standalone fix. I suspect the best fix would be to address fix css_killed_work_fn() stack so we never have: cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse the order would work. If you have suggestions since you worked on this code recently, please let me know.
Thank you, Pasha
Thanks!
Thread #1: Hot-removes memory device_offline memory_subsys_offline offline_pages __offline_pages mem_hotplug_lock <- write access waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can migrate it.
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
Thread #3: crashing userland program do_coredump elf_core_dump get_dump_page() -> get page with pfn#9e5113, and increment refcnt dump_emit __kernel_write __vfs_write new_sync_write pipe_write pipe_wait -> waits for Thread #4 systemd-coredump to read the pipe
Thread #4: systemd-coredump ksys_read vfs_read __vfs_read seq_read proc_single_show proc_cgroup_show cgroup_mutex -> waits from Thread #2 for this lock.
In Summary: Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2 waits for Thread#1 for mem_hotplug_lock rwlock.
This work appears to fix this deadlock because cgroup_mutex is not called anymore before mem_hotplug_lock (unless I am missing it), as it removes memcg_deactivate_kmem_caches.
Thank you, Pasha
On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin guro@fb.com wrote:
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote: > The existing cgroup slab memory controller is based on the idea of > replicating slab allocator internals for each memory cgroup. > This approach promises a low memory overhead (one pointer per page), > and isn't adding too much code on hot allocation and release paths. > But is has a very serious flaw: it leads to a low slab utilization. > > Using a drgn* script I've got an estimation of slab utilization on > a number of machines running different production workloads. In most > cases it was between 45% and 65%, and the best number I've seen was > around 85%. Turning kmem accounting off brings it to high 90s. Also > it brings back 30-50% of slab memory. It means that the real price > of the existing slab memory controller is way bigger than a pointer > per page. > > The real reason why the existing design leads to a low slab utilization > is simple: slab pages are used exclusively by one memory cgroup. > If there are only few allocations of certain size made by a cgroup, > or if some active objects (e.g. dentries) are left after the cgroup is > deleted, or the cgroup contains a single-threaded application which is > barely allocating any kernel objects, but does it every time on a new CPU: > in all these cases the resulting slab utilization is very low. > If kmem accounting is off, the kernel is able to use free space > on slab pages for other allocations. > > Arguably it wasn't an issue back to days when the kmem controller was > introduced and was an opt-in feature, which had to be turned on > individually for each memory cgroup. But now it's turned on by default > on both cgroup v1 and v2. And modern systemd-based systems tend to > create a large number of cgroups. > > This patchset provides a new implementation of the slab memory controller, > which aims to reach a much better slab utilization by sharing slab pages > between multiple memory cgroups. Below is the short description of the new > design (more details in commit messages). > > Accounting is performed per-object instead of per-page. Slab-related > vmstat counters are converted to bytes. Charging is performed on page-basis, > with rounding up and remembering leftovers. > > Memcg ownership data is stored in a per-slab-page vector: for each slab page > a vector of corresponding size is allocated. To keep slab memory reparenting > working, instead of saving a pointer to the memory cgroup directly an > intermediate object is used. It's simply a pointer to a memcg (which can be > easily changed to the parent) with a built-in reference counter. This scheme > allows to reparent all allocated objects without walking them over and > changing memcg pointer to the parent. > > Instead of creating an individual set of kmem_caches for each memory cgroup, > two global sets are used: the root set for non-accounted and root-cgroup > allocations and the second set for all other allocations. This allows to > simplify the lifetime management of individual kmem_caches: they are > destroyed with root counterparts. It allows to remove a good amount of code > and make things generally simpler. > > The patchset* has been tested on a number of different workloads in our > production. In all cases it saved significant amount of memory, measured > from high hundreds of MBs to single GBs per host. On average, the size > of slab memory has been reduced by 35-45%.
Here are some numbers from multiple runs of sysbench and kernel compilation with this patchset on a 10 core POWER8 host:
========================================================================== Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for Sysbench oltp_read_write with mysqld running as part of a mem cgroup (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 15859712 4456448 72 memory.usage_in_bytes 337510400 335806464 .5 Slab: (kB) 814336 607296 25
memory.kmem.usage_in_bytes 16187392 4653056 71 memory.usage_in_bytes 318832640 300154880 5 Slab: (kB) 789888 559744 29
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and meminfo:Slab for kernel compilation (make -s -j64) Compilation was done from bash that is in a memory cgroup. (Sampling every 5s)
5.5.0-rc7-mm1 +slab patch %reduction
memory.kmem.usage_in_bytes 338493440 231931904 31 memory.usage_in_bytes 7368015872 6275923968 15 Slab: (kB) 1139072 785408 31
memory.kmem.usage_in_bytes 341835776 236453888 30 memory.usage_in_bytes 6540427264 6072893440 7 Slab: (kB) 1074304 761280 29
memory.kmem.usage_in_bytes 340525056 233570304 31 memory.usage_in_bytes 6406209536 6177357824 3 Slab: (kB) 1244288 739712 40
Slab consumption right after boot
5.5.0-rc7-mm1 +slab patch %reduction
Slab: (kB) 821888 583424 29
Summary:
With sysbench and kernel compilation, memory.kmem.usage_in_bytes shows around 70% and 30% reduction consistently.
Didn't see consistent reduction of memory.usage_in_bytes with sysbench and kernel compilation.
Slab usage (from /proc/meminfo) shows consistent 30% reduction and the same is seen right after boot too.
That's just perfect!
memory.usage_in_bytes was most likely the same because the freed space was taken by pagecache.
Thank you very much for testing!
Roman
Am 02.09.2020 um 11:53 schrieb Vlastimil Babka vbabka@suse.cz:
On 8/28/20 6:47 PM, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for
Why is that hotremove even needed if we're shutting down? Are there any (virtualization?) platforms where it makes some difference over plain shutdown/restart?
If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)
Am 02.09.2020 um 11:53 schrieb Vlastimil Babka vbabka@suse.cz:
On 8/28/20 6:47 PM, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for
Why is that hotremove even needed if we're shutting down? Are there any (virtualization?) platforms where it makes some difference over plain shutdown/restart?
If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)
Hi David,
This is how we are using it at Microsoft: there is a very large number of small memory machines (8G each) with low downtime requirements (reboot must be under a second). There is also a large state ~2G of memory that we need to transfer during reboot, otherwise it is very expensive to recreate the state. We have 2G of system memory memory reserved as a pmem in the device tree, and use it to pass information across reboots. Once the information is not needed we hot-add that memory and use it during runtime, before shutdown we hot-remove the 2G, save the program state on it, and do the reboot.
Pasha
On Wed 02-09-20 08:42:13, Pavel Tatashin wrote:
Am 02.09.2020 um 11:53 schrieb Vlastimil Babka vbabka@suse.cz:
On 8/28/20 6:47 PM, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for
Why is that hotremove even needed if we're shutting down? Are there any (virtualization?) platforms where it makes some difference over plain shutdown/restart?
If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)
Hi David,
This is how we are using it at Microsoft: there is a very large number of small memory machines (8G each) with low downtime requirements (reboot must be under a second). There is also a large state ~2G of memory that we need to transfer during reboot, otherwise it is very expensive to recreate the state. We have 2G of system memory memory reserved as a pmem in the device tree, and use it to pass information across reboots. Once the information is not needed we hot-add that memory and use it during runtime, before shutdown we hot-remove the 2G, save the program state on it, and do the reboot.
I still do not get it. So what does guarantee that the memory is offlineable in the first place? Also what is the difference between offlining and simply shutting the system down so that the memory is not used in the first place. In other words what kind of difference hotremove makes?
This is how we are using it at Microsoft: there is a very large number of small memory machines (8G each) with low downtime requirements (reboot must be under a second). There is also a large state ~2G of memory that we need to transfer during reboot, otherwise it is very expensive to recreate the state. We have 2G of system memory memory reserved as a pmem in the device tree, and use it to pass information across reboots. Once the information is not needed we hot-add that memory and use it during runtime, before shutdown we hot-remove the 2G, save the program state on it, and do the reboot.
I still do not get it. So what does guarantee that the memory is offlineable in the first place?
It is in a movable zone, and we have more than 2G of free memory for successful migrations.
Also what is the difference between offlining and simply shutting the system down so that the memory is not used in the first place. In other words what kind of difference hotremove makes?
For performance reasons during system updates/reboots we do not erase memory content. The memory content is erased only on power cycle, which we do not do in production.
Once we hot-remove the memory, we convert it back into DAXFS PMEM device, format it into EXT4, mount it as DAX file system, and allow programs to serialize their states to it so they can read it back after the reboot.
During startup we mount pmem, programs read the state back, and after that we hotplug the PMEM DAX as a movable zone. This way during normal runtime we have 8G available to programs.
Pasha
For performance reasons during system updates/reboots we do not erase memory content. The memory content is erased only on power cycle, which we do not do in production.
Once we hot-remove the memory, we convert it back into DAXFS PMEM device, format it into EXT4, mount it as DAX file system, and allow programs to serialize their states to it so they can read it back after the reboot.
During startup we mount pmem, programs read the state back, and after that we hotplug the PMEM DAX as a movable zone. This way during normal runtime we have 8G available to programs.
Thanks for sharing the workflow - while it sounds somewhat sub-optimal, I guess it gets the job done using existing tools / mechanisms.
(I remember the persistent tmpfs over kexec RFC, which tries to tackle it by introducing something new)
On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
On 8/28/20 6:47 PM, Pavel Tatashin wrote:
There appears to be another problem that is related to the cgroup_mutex -> mem_hotplug_lock deadlock described above.
In the original deadlock that I described, the workaround is to replace crash dump from piping to Linux traditional save to files method. However, after trying this workaround, I still observed hardware watchdog resets during machine shutdown.
The new problem occurs for the following reason: upon shutdown systemd calls a service that hot-removes memory, and if hot-removing fails for
Why is that hotremove even needed if we're shutting down? Are there any (virtualization?) platforms where it makes some difference over plain shutdown/restart?
Yes this sounds quite dubious.
some reason systemd kills that service after timeout. However, systemd is never able to kill the service, and we get hardware reset caused by watchdog or a hang during shutdown:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
This shouldn't really happen. What does prevent from this to proceed? Did you manage to catch the specific pfn and what is it used for? start_isolate_page_range and scan_movable_pages should fail if there is any memory that cannot be migrated permanently. This is something that we should focus on when debugging.
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Hi Micahl,
This shouldn't really happen. What does prevent from this to proceed? Did you manage to catch the specific pfn and what is it used for?
I did.
start_isolate_page_range and scan_movable_pages should fail if there is any memory that cannot be migrated permanently. This is something that we should focus on when debugging.
I was hitting this issue: mm/memory_hotplug: drain per-cpu pages again during memory offline https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.c...
Once the pcp drain race is fixed, this particular deadlock becomes irrelavent.
The lock ordering, however, cgroup_mutex -> mem_hotplug_lock is bad, and the first race condition that I was hitting and described above is still present. For now I added a temporary workaround by using save to file instead of piping the core during shutdown. I am glad the mainline is fixed, but stables should also have some kind of fix for this problem.
Pasha
On Wed 02-09-20 08:51:06, Pavel Tatashin wrote:
Thread #1: memory hot-remove systemd service Loops indefinitely, because if there is something still to be migrated this loop never terminates. However, this loop can be terminated via signal from systemd after timeout. __offline_pages() do { pfn = scan_movable_pages(pfn, end_pfn); # Returns 0, meaning there is nothing available to # migrate, no page is PageLRU(page) ... ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL, check_pages_isolated_cb); # Returns -EBUSY, meaning there is at least one PFN that # still has to be migrated. } while (ret);
Hi Micahl,
This shouldn't really happen. What does prevent from this to proceed? Did you manage to catch the specific pfn and what is it used for?
I did.
start_isolate_page_range and scan_movable_pages should fail if there is any memory that cannot be migrated permanently. This is something that we should focus on when debugging.
I was hitting this issue: mm/memory_hotplug: drain per-cpu pages again during memory offline https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.c...
I have noticed the patch but didn't have time to think it through (have been few days off and catching up with emails). Will give it a higher priority.
On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
And one more thing. THis has been brought up several times already. Maybe I have forgoten but why do we take hotplug locks in this path in the first place? Memory hotplug notifier takes slab_mutex so this shouldn't be really needed.
On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko mhocko@suse.com wrote:
On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
Thread #2: ccs killer kthread css_killed_work_fn cgroup_mutex <- Grab this Mutex mem_cgroup_css_offline memcg_offline_kmem.part memcg_deactivate_kmem_caches get_online_mems mem_hotplug_lock <- waits for Thread#1 to get read access
And one more thing. THis has been brought up several times already. Maybe I have forgoten but why do we take hotplug locks in this path in the first place? Memory hotplug notifier takes slab_mutex so this shouldn't be really needed.
Good point, it seems this lock can be completely removed from memcg_deactivate_kmem_caches
Pasha
-- Michal Hocko SUSE Labs
On Wed 02-09-20 08:53:49, Pavel Tatashin wrote:
On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko mhocko@suse.com wrote:
On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> Thread #2: ccs killer kthread > css_killed_work_fn > cgroup_mutex <- Grab this Mutex > mem_cgroup_css_offline > memcg_offline_kmem.part > memcg_deactivate_kmem_caches > get_online_mems > mem_hotplug_lock <- waits for Thread#1 to get read access
And one more thing. THis has been brought up several times already. Maybe I have forgoten but why do we take hotplug locks in this path in the first place? Memory hotplug notifier takes slab_mutex so this shouldn't be really needed.
Good point, it seems this lock can be completely removed from memcg_deactivate_kmem_caches
I am pretty sure we have discussed that in the past. But I do not remember the outcome. Either we have concluded that this is indeed the case but nobody came up with a patch or we have hit some obscure issue... Maybe David/Roman rememeber more than I do.
linux-stable-mirror@lists.linaro.org