We have usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But, the _mapcount < 0 check doesn't fit to page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate.
With this fix /sys/kernel/debug/kvm/largepage would show reasonable pages are PMD mapped by EPT as the below:
7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 557056 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 271
And the benchmarks are as same as anonymous THPs.
Fixes: dd78fedde4b9 ("rmap: support file thp") Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Reported-by: Gang Deng gavin.dg@linux.alibaba.com Tested-by: Gang Deng gavin.dg@linux.alibaba.com Suggested-by: Hugh Dickins hughd@google.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Matthew Wilcox willy@infradead.org Cc: stable@vger.kernel.org 4.8+ --- include/linux/mm.h | 5 ----- include/linux/mm_types.h | 5 +++++ include/linux/page-flags.h | 20 ++++++++++++++++++-- 3 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index cc29227..a2adf95 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
extern void kvfree(const void *addr);
-static inline atomic_t *compound_mapcount_ptr(struct page *page) -{ - return &page[1].compound_mapcount; -} - static inline int compound_mapcount(struct page *page) { VM_BUG_ON_PAGE(!PageCompound(page), page); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 2222fa7..270aa8f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -221,6 +221,11 @@ struct page { #endif } _struct_page_alignment;
+static inline atomic_t *compound_mapcount_ptr(struct page *page) +{ + return &page[1].compound_mapcount; +} + /* * Used for sizing the vmemmap region on some architectures */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index f91cb88..1bf83c8 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -622,12 +622,28 @@ static inline int PageTransCompound(struct page *page) * * Unlike PageTransCompound, this is safe to be called only while * split_huge_pmd() cannot run from under us, like if protected by the - * MMU notifier, otherwise it may result in page->_mapcount < 0 false + * MMU notifier, otherwise it may result in page->_mapcount check false * positives. + * + * We have to treat page cache THP differently since every subpage of it + * would get _mapcount inc'ed once it is PMD mapped. But, it may be PTE + * mapped in the current process so comparing subpage's _mapcount to + * compound_mapcount to filter out PTE mapped case. */ static inline int PageTransCompoundMap(struct page *page) { - return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0; + struct page *head; + + if (!PageTransCompound(page)) + return 0; + + if (PageAnon(page)) + return atomic_read(&page->_mapcount) < 0; + + head = compound_head(page); + /* File THP is PMD mapped and not PTE mapped */ + return atomic_read(&page->_mapcount) == + atomic_read(compound_mapcount_ptr(head)); }
/*
On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
We have usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But, the _mapcount < 0 check doesn't fit to page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate.
I don't understand why you care how it's mapped into userspace. If there is a PMD-sized page in the page cache, then you can use a PMD mapping in the EPT tables to map it. Why would another process having a PTE mapping on the page cause you to not use a PMD mapping?
On 10/24/19 6:55 AM, Matthew Wilcox wrote:
On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
We have usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But, the _mapcount < 0 check doesn't fit to page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate.
I don't understand why you care how it's mapped into userspace. If there is a PMD-sized page in the page cache, then you can use a PMD mapping in the EPT tables to map it. Why would another process having a PTE mapping on the page cause you to not use a PMD mapping?
We don't care if the THP is PTE mapped by other process, but either PageDoubleMap flag or _mapcount/compound_mapcount can't tell us if the PTE map comes from the current process or other process unless gup could return pmd's status.
I think the commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") elaborates the trade-off clearly (not full commit log, just paste the most related part):
Ideally instead of the page->_mapcount < 1 check, get_user_pages() should return the granularity of the "page" mapping in the "mm" passed to get_user_pages(). However it's non trivial change to pass the "pmd" status belonging to the "mm" walked by get_user_pages up the stack (up to the caller of get_user_pages). So the fix just checks if there is not a single pte mapping on the page returned by get_user_pages, and in turn if the caller can assume that the whole compound page is mapped in the current "mm" (in a pmd_trans_huge()). In such case the entire compound page is safe to map into the secondary MMU without additional get_user_pages() calls on the surrounding tail/head pages. In addition of being faster, not having to run other get_user_pages() calls also reduces the memory footprint of the secondary MMU fault in case the pmd split happened as result of memory pressure.
On Thu, Oct 24, 2019 at 09:33:11AM -0700, Yang Shi wrote:
On 10/24/19 6:55 AM, Matthew Wilcox wrote:
On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
We have usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But, the _mapcount < 0 check doesn't fit to page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate.
I don't understand why you care how it's mapped into userspace. If there is a PMD-sized page in the page cache, then you can use a PMD mapping in the EPT tables to map it. Why would another process having a PTE mapping on the page cause you to not use a PMD mapping?
We don't care if the THP is PTE mapped by other process, but either PageDoubleMap flag or _mapcount/compound_mapcount can't tell us if the PTE map comes from the current process or other process unless gup could return pmd's status.
But why do you care if the THP is PTE mapped by _this_ process? This process has a reference to the page; the page is PMD sized and PMD aligned, so you can use a PMD mapping in the guest, regardless of how it's mapped by userspace. Maybe this process doesn't even have the page mapped at all?
On Fri, Oct 25, 2019 at 08:32:05AM -0700, Matthew Wilcox wrote:
On Thu, Oct 24, 2019 at 09:33:11AM -0700, Yang Shi wrote:
On 10/24/19 6:55 AM, Matthew Wilcox wrote:
On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
We have usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But, the _mapcount < 0 check doesn't fit to page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate.
I don't understand why you care how it's mapped into userspace. If there is a PMD-sized page in the page cache, then you can use a PMD mapping in the EPT tables to map it. Why would another process having a PTE mapping on the page cause you to not use a PMD mapping?
We don't care if the THP is PTE mapped by other process, but either PageDoubleMap flag or _mapcount/compound_mapcount can't tell us if the PTE map comes from the current process or other process unless gup could return pmd's status.
But why do you care if the THP is PTE mapped by _this_ process? This process has a reference to the page; the page is PMD sized and PMD aligned, so you can use a PMD mapping in the guest, regardless of how it's mapped by userspace. Maybe this process doesn't even have the page mapped at all?
Consider the case with MAP_PRIVATE. If part of the THP was overritten in the process, you want EPT to reflect the case, not map the page from page cache regardless.
On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
We have usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But, the _mapcount < 0 check doesn't fit to page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate.
With this fix /sys/kernel/debug/kvm/largepage would show reasonable pages are PMD mapped by EPT as the below:
7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 557056 kB [snip] Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages 271
And the benchmarks are as same as anonymous THPs.
Fixes: dd78fedde4b9 ("rmap: support file thp") Signed-off-by: Yang Shi yang.shi@linux.alibaba.com Reported-by: Gang Deng gavin.dg@linux.alibaba.com Tested-by: Gang Deng gavin.dg@linux.alibaba.com Suggested-by: Hugh Dickins hughd@google.com Cc: Andrea Arcangeli aarcange@redhat.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Matthew Wilcox willy@infradead.org Cc: stable@vger.kernel.org 4.8+
Looks good to me.
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
linux-stable-mirror@lists.linaro.org