The size of kvm's shadow page tables corresponds to the size of the guest virtual machines on the system. Large VMs can consume a significant amount of memory as shadow page tables, which cannot simply be left as unaccounted system overhead. So, account shadow page tables to the kmemcg.
Signed-off-by: Shakeel Butt shakeelb@google.com
Cc: Michal Hocko mhocko@kernel.org
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: Vladimir Davydov vdavydov.dev@gmail.com
Cc: Paolo Bonzini pbonzini@redhat.com
Cc: Greg Thelen gthelen@google.com
Cc: Radim Krčmář rkrcmar@redhat.com
Cc: Peter Feiner pfeiner@google.com
Cc: Andrew Morton akpm@linux-foundation.org
Cc: stable@vger.kernel.org
---
Changelog since v1:
- replaced (GFP_KERNEL|__GFP_ACCOUNT) with GFP_KERNEL_ACCOUNT
 arch/x86/kvm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d594690d8b95..6b8f11521c41 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -890,7 +890,7 @@ static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL);
+		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
 		if (!page)
 			return -ENOMEM;
 		cache->objects[cache->nobjs++] = page;
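For readers less familiar with the mm side: GFP_KERNEL_ACCOUNT is simply GFP_KERNEL with __GFP_ACCOUNT set, which makes the page allocator charge each of these cache pages to the allocating task's kmemcg. The definition from include/linux/gfp.h is quoted here only for context, it is not part of this patch:

	/* include/linux/gfp.h */
	#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)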
On Fri 29-06-18 07:02:24, Shakeel Butt wrote:
The size of kvm's shadow page tables corresponds to the size of the guest virtual machines on the system. Large VMs can consume a significant amount of memory as shadow page tables, which cannot simply be left as unaccounted system overhead. So, account shadow page tables to the kmemcg.
Signed-off-by: Shakeel Butt shakeelb@google.com
Cc: Michal Hocko mhocko@kernel.org
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: Vladimir Davydov vdavydov.dev@gmail.com
Cc: Paolo Bonzini pbonzini@redhat.com
Cc: Greg Thelen gthelen@google.com
Cc: Radim Krčmář rkrcmar@redhat.com
Cc: Peter Feiner pfeiner@google.com
Cc: Andrew Morton akpm@linux-foundation.org
Cc: stable@vger.kernel.org
I am not familiar enough with kvm to judge, but if we are going to account this memory we will probably want to let oom_badness know how much memory to account to a specific process. Is this something that we can do? We will probably need a new MM_KERNEL rss_stat stat for that purpose.
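Roughly, I would imagine something along these lines (just a hand-wavy sketch; MM_KERNEL does not exist at this point and the names are made up):

	/* Hypothetical sketch only: there is no MM_KERNEL counter today. */
	enum {
		MM_FILEPAGES,	/* resident pages of file mappings */
		MM_ANONPAGES,	/* resident anonymous pages */
		MM_SWAPENTS,	/* anonymous swap entries */
		MM_SHMEMPAGES,	/* resident shared memory pages */
		MM_KERNEL,	/* kernel memory charged on behalf of this mm (proposed) */
		NR_MM_COUNTERS
	};

	/* A subsystem like kvm would then bump the counter per accounted page: */
	inc_mm_counter(current->mm, MM_KERNEL);

	/*
	 * ...and oom_badness() could fold it into the score, e.g.:
	 *   points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_KERNEL) + ...
	 */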
Just to make it clear: I am not opposing this patch, but considering that shadow page tables might consume a lot of memory, it would be good to know who is responsible for it from the OOM perspective. Something to solve on top of this.
I would also love to see a note in the changelog on how this memory is bound to the owner's lifetime. That would make the review much easier.
Changelog since v1:
- replaced (GFP_KERNEL|__GFP_ACCOUNT) with GFP_KERNEL_ACCOUNT
 arch/x86/kvm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d594690d8b95..6b8f11521c41 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -890,7 +890,7 @@ static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL);
+		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
 		if (!page)
 			return -ENOMEM;
 		cache->objects[cache->nobjs++] = page;
-- 2.18.0.rc2.346.g013aa6912e-goog
On 29/06/2018 16:30, Michal Hocko wrote:
I am not familiar enough with kvm to judge, but if we are going to account this memory we will probably want to let oom_badness know how much memory to account to a specific process. Is this something that we can do? We will probably need a new MM_KERNEL rss_stat stat for that purpose.
Just to make it clear: I am not opposing this patch, but considering that shadow page tables might consume a lot of memory, it would be good to know who is responsible for it from the OOM perspective. Something to solve on top of this.
The amount of memory is generally proportional to the size of the virtual machine memory, which is reflected directly into RSS. Because KVM processes are usually huge, and will probably dwarf everything else in the system (except firefox and chromium of course :)), the general order of magnitude of the oom_badness should be okay.
I would also love to see a note in the changelog on how this memory is bound to the owner's lifetime. That would make the review much easier.
--verbose for people that aren't well versed in linux mm, please...
Paolo
On Fri 29-06-18 16:40:23, Paolo Bonzini wrote:
On 29/06/2018 16:30, Michal Hocko wrote:
I am not familiar enough with kvm to judge, but if we are going to account this memory we will probably want to let oom_badness know how much memory to account to a specific process. Is this something that we can do? We will probably need a new MM_KERNEL rss_stat stat for that purpose.
Just to make it clear: I am not opposing this patch, but considering that shadow page tables might consume a lot of memory, it would be good to know who is responsible for it from the OOM perspective. Something to solve on top of this.
The amount of memory is generally proportional to the size of the virtual machine memory, which is reflected directly into RSS. Because KVM processes are usually huge, and will probably dwarf everything else in the system (except firefox and chromium of course :)), the general order of magnitude of the oom_badness should be okay.
I think we will need MM_KERNEL longterm anyway. As I've said, this is not a must for this patch to go in. But it is better to have a fair comparison and kill larger processes if at all possible. It seems this should be the case here.
I would also love to see a note in the changelog on how this memory is bound to the owner's lifetime. That would make the review much easier.
--verbose for people that aren't well versed in linux mm, please...
Well, if the memory accounted to the memcg hits the hard limit and there is no way to reclaim anything to reduce the charged memory, then we have to kill something. Hopefully the memory hog. If that one dies, it would be great if it released its charges along the way. My remark was just to explain how that would happen for this specific type of memory: bound to a file, has its own tear down, etc. Basically, make the reviewers' life easier by letting them understand the lifetime of charged objects without digging deep into the specific subsystem.
On 29/06/2018 16:55, Michal Hocko wrote:
I would also love to see a note in the changelog on how this memory is bound to the owner's lifetime. That would make the review much easier.
--verbose for people that aren't well versed in linux mm, please...
Well, if the memory accounted to the memcg hits the hard limit and there is no way to reclaim anything to reduce the charged memory, then we have to kill something. Hopefully the memory hog. If that one dies, it would be great if it released its charges along the way. My remark was just to explain how that would happen for this specific type of memory: bound to a file, has its own tear down, etc. Basically, make the reviewers' life easier by letting them understand the lifetime of charged objects without digging deep into the specific subsystem.
Oh I see. Yes, it's all freed when the VM file descriptor (which you get with an ioctl on /dev/kvm) is closed.
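For readers not familiar with the kvm side, the cache pages topped up above are released on that same teardown path; the helper in arch/x86/kvm/mmu.c looks roughly like this (quoted from memory, for context only), and freeing the page also uncharges it from the memcg:

	static void mmu_free_memory_cache_page(struct kvm_mmu_memory_cache *mc)
	{
		/* return every page still cached back to the page allocator */
		while (mc->nobjs)
			free_page((unsigned long)mc->objects[--mc->nobjs]);
	}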
Thanks,
Paolo
On Fri, Jun 29, 2018 at 7:55 AM Michal Hocko mhocko@kernel.org wrote:
On Fri 29-06-18 16:40:23, Paolo Bonzini wrote:
On 29/06/2018 16:30, Michal Hocko wrote:
I am not familiar enough with kvm to judge, but if we are going to account this memory we will probably want to let oom_badness know how much memory to account to a specific process. Is this something that we can do? We will probably need a new MM_KERNEL rss_stat stat for that purpose.
Just to make it clear: I am not opposing this patch, but considering that shadow page tables might consume a lot of memory, it would be good to know who is responsible for it from the OOM perspective. Something to solve on top of this.
The amount of memory is generally proportional to the size of the virtual machine memory, which is reflected directly into RSS. Because KVM processes are usually huge, and will probably dwarf everything else in the system (except firefox and chromium of course :)), the general order of magnitude of the oom_badness should be okay.
I think we will need MM_KERNEL longterm anyway. As I've said, this is not a must for this patch to go in. But it is better to have a fair comparison and kill larger processes if at all possible. It seems this should be the case here.
I will look more into the MM_KERNEL counter. I still have a couple more kmem allocations in kvm (like the dirty bitmap) which I want to be accounted. I will bundle them together.
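For the dirty bitmap, the change would presumably have the same shape, e.g. in virt/kvm/kvm_main.c (just a sketch of the possible follow-up, not a patch being posted here):

	static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
	{
		unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot);

		/* charge the bitmap to the allocating task's kmemcg as well */
		memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL_ACCOUNT);
		if (!memslot->dirty_bitmap)
			return -ENOMEM;

		return 0;
	}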
thanks, Shakeel