In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases: User space application call madvise() with MADV_FREE, MADV_COLD and MADV_PAGEOUT for specific address range. There are THP mapped to the range. Without the patchset, the THP is skipped. With the patch, the THP will be split and handled accordingly.
David reported the cow self test skip some cases because of MADV_PAGEOUT skip THP: https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.... and I confirmed this patchset make it work again.
Changelog from v1: - Avoid two Fixes tags make backport harder. Thank Andrew for pointing this out.
- Add note section to mention this is a temporary fix which is fine to reduce user-visble effects. For long term fix, we should wait for David's solution. Thank Ryan and David for pointing this out.
- Spell user-visible effects out. Then people could decide whether these patches are necessary for stable branch. Thank Andrew for pointing this out.
V1: https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel....
Yin Fengwei (3): madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folio. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com --- mm/madvise.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 1d7933933e31..49af35e2d99a 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */ - if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -459,7 +459,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err;
- if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break;
Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 154c210892a1..0b709d2c46c6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1612,7 +1612,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this folio, we couldn't discard * the folio unless they all do MADV_FREE so let's skip the folio. */ - if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) goto out;
if (!folio_trylock(folio))
On Mon, Aug 7, 2023 at 8:11 PM Yin Fengwei fengwei.yin@intel.com wrote:
Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") Cc: stable@vger.kernel.org
Andrew, this one isn't really a bug fix but an optimization, so please feel free to drop the Fixes and Cc tags above. (It seems to me it doesn't hurt for stable to take it though.)
Thank you.
Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 154c210892a1..0b709d2c46c6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1612,7 +1612,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this folio, we couldn't discard * the folio unless they all do MADV_FREE so let's skip the folio. */
if (folio_mapcount(folio) != 1)
if (folio_estimated_sharers(folio) != 1) goto out; if (!folio_trylock(folio))
-- 2.39.2
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com --- mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 49af35e2d99a..4dded5d27e7e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
- if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Daniel
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com
mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 49af35e2d99a..4dded5d27e7e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
if (folio_mapcount(folio) != 1)
if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
-- 2.39.2
On 15.08.23 15:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
It's all a mess right now.
1) page_mapcount(page): how often it this page mapped
For a THP: entire mapcount of the THP (PMD-mapping) + mapcount of *this very subpage* (PTE-mapping) only
2) folio_mapcount(): how often is this folio mapped
For a THP: entire mapcount of the THP (PMD-mapping) + mapcount of *all* subpages (PTE-mapping) of the folio
3) folio_estimated_sharers(): how often is the first page mapped
For a THP: entire mapcount of the THP (PMD-mapping) + mapcount of *the first subpage* (PTE-mapping) only
For the time being, folio_estimated_sharers() is better then folio_mapcount(), because for a PTE-mapped THP folio_mapcount() > 1.
I'm looking into a replacement for folio_estimated_sharers() that is more precise ("folio_mapped_shared()"), but it's all a bit tricky. :)
On 8/15/23 21:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
I saw David explained this very well in another mail.
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result? And I suppose swap need be enabled also for the testing.
Regards Yin, Fengwei
Daniel
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com
mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 49af35e2d99a..4dded5d27e7e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
if (folio_mapcount(folio) != 1)
if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
-- 2.39.2
On Wed, Aug 16, 2023 at 07:30:35AM +0800, Yin Fengwei wrote:
On 8/15/23 21:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
I saw David explained this very well in another mail.
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result? And I suppose swap need be enabled also for the testing.
You may find a dump of the logs in the link below with system information. Let me know if you find something wrong in my setup or if you need something else. Besides CONFIG_TRANSPARENT_HUGEPAGE, CONFIG_SWAP is also enabled in the kernel.
https://gitlab.com/-/snippets/2584135
Also, strace reports ENOSYS for MADV_*: madvise(0x7f2912465000, 4096, MADV_NOHUGEPAGE) = -1 ENOSYS (Function not implemented) madvise(0x7f2912000000, 2097152, MADV_HUGEPAGE) = -1 ENOSYS (Function not implemented)
Regards Yin, Fengwei
Daniel
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com
mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 49af35e2d99a..4dded5d27e7e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
if (folio_mapcount(folio) != 1)
if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
-- 2.39.2
On 8/16/2023 7:44 PM, Daniel Gomez wrote:
On Wed, Aug 16, 2023 at 07:30:35AM +0800, Yin Fengwei wrote:
On 8/15/23 21:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
I saw David explained this very well in another mail.
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result? And I suppose swap need be enabled also for the testing.
You may find a dump of the logs in the link below with system information. Let me know if you find something wrong in my setup or if you need something else. Besides CONFIG_TRANSPARENT_HUGEPAGE, CONFIG_SWAP is also enabled in the kernel.
https://gitlab.com/-/snippets/2584135
Also, strace reports ENOSYS for MADV_*: madvise(0x7f2912465000, 4096, MADV_NOHUGEPAGE) = -1 ENOSYS (Function not implemented) madvise(0x7f2912000000, 2097152, MADV_HUGEPAGE) = -1 ENOSYS (Function not implemented)
O. The problem here is MADV_HUGEPAGE/MADV_NOHUGEPAGE doesn't work. Do you have CONFIG_ADVISE_SYSCALLS enabled?
Regards Yin, Fengwei
Regards Yin, Fengwei
Daniel
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com
mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 49af35e2d99a..4dded5d27e7e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
if (folio_mapcount(folio) != 1)
if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
-- 2.39.2
On Wed, Aug 16, 2023 at 08:04:11PM +0800, Yin, Fengwei wrote:
On 8/16/2023 7:44 PM, Daniel Gomez wrote:
On Wed, Aug 16, 2023 at 07:30:35AM +0800, Yin Fengwei wrote:
On 8/15/23 21:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
I saw David explained this very well in another mail.
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result? And I suppose swap need be enabled also for the testing.
You may find a dump of the logs in the link below with system information. Let me know if you find something wrong in my setup or if you need something else. Besides CONFIG_TRANSPARENT_HUGEPAGE, CONFIG_SWAP is also enabled in the kernel.
https://gitlab.com/-/snippets/2584135
Also, strace reports ENOSYS for MADV_*: madvise(0x7f2912465000, 4096, MADV_NOHUGEPAGE) = -1 ENOSYS (Function not implemented) madvise(0x7f2912000000, 2097152, MADV_HUGEPAGE) = -1 ENOSYS (Function not implemented)
O. The problem here is MADV_HUGEPAGE/MADV_NOHUGEPAGE doesn't work. Do you have CONFIG_ADVISE_SYSCALLS enabled?
It worked after I enabled the conf. Some tests failed and some were skipped. But I managed to reproduce the issue now, thanks Yin!
Bail out! 4 out of 166 tests failed # Totals: pass:146 fail:4 xfail:0 xpass:0 skip:16 error:0
Here the full log: https://gitlab.com/-/snippets/2584190/raw/main/cow.txt
Regards Yin, Fengwei
Regards Yin, Fengwei
Daniel
User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Cc: stable@vger.kernel.org Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com
mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 49af35e2d99a..4dded5d27e7e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -683,7 +683,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
if (folio_mapcount(folio) != 1)
if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
-- 2.39.2
On 16.08.23 16:13, Daniel Gomez wrote:
On Wed, Aug 16, 2023 at 08:04:11PM +0800, Yin, Fengwei wrote:
On 8/16/2023 7:44 PM, Daniel Gomez wrote:
On Wed, Aug 16, 2023 at 07:30:35AM +0800, Yin Fengwei wrote:
On 8/15/23 21:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote:
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
I saw David explained this very well in another mail.
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result? And I suppose swap need be enabled also for the testing.
You may find a dump of the logs in the link below with system information. Let me know if you find something wrong in my setup or if you need something else. Besides CONFIG_TRANSPARENT_HUGEPAGE, CONFIG_SWAP is also enabled in the kernel.
https://gitlab.com/-/snippets/2584135
Also, strace reports ENOSYS for MADV_*: madvise(0x7f2912465000, 4096, MADV_NOHUGEPAGE) = -1 ENOSYS (Function not implemented) madvise(0x7f2912000000, 2097152, MADV_HUGEPAGE) = -1 ENOSYS (Function not implemented)
O. The problem here is MADV_HUGEPAGE/MADV_NOHUGEPAGE doesn't work. Do you have CONFIG_ADVISE_SYSCALLS enabled?
It worked after I enabled the conf. Some tests failed and some were skipped. But I managed to reproduce the issue now, thanks Yin!
Bail out! 4 out of 166 tests failed # Totals: pass:146 fail:4 xfail:0 xpass:0 skip:16 error:0
These hugetlb that are failing are known failures.
On Wed, Aug 16, 2023 at 05:11:54PM +0200, David Hildenbrand wrote:
On 16.08.23 16:13, Daniel Gomez wrote:
On Wed, Aug 16, 2023 at 08:04:11PM +0800, Yin, Fengwei wrote:
On 8/16/2023 7:44 PM, Daniel Gomez wrote:
On Wed, Aug 16, 2023 at 07:30:35AM +0800, Yin Fengwei wrote:
On 8/15/23 21:25, Daniel Gomez wrote:
Hi Yin, On Tue, Aug 08, 2023 at 10:09:17AM +0800, Yin Fengwei wrote: > Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a > folio") replaced the page_mapcount() with folio_mapcount() to check > whether the folio is shared by other mapping. > > It's not correct for large folios. folio_mapcount() returns the total > mapcount of large folio which is not suitable to detect whether the folio > is shared. > > Use folio_estimated_sharers() which returns a estimated number of shares. > That means it's not 100% correct. It should be OK for madvise case here.
I'm trying to understand why it should be ok for madvise this change, so I hope it's okay to ask you few questions.
folio_mapcount() calculates the total maps for all the subpages of a folio. However, the folio_estimated_sharers does it only for the first subpage making it not true for large folios. Then, wouldn't this change drop support for large folios?
I saw David explained this very well in another mail.
Seems like folio_entire_mapcount() is not accurate either because of it does not inclue PTE-mapped sub-pages which I think we need here. Hence, the folio_mapcount(). Could this be something missing in the test side?
I tried to replicate the setup with CONFIG_TRANSPARENT_HUGEPAGE but seems like I'm not able to do it:
./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb size: 2048 KiB # [INFO] detected hugetlb size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..166 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page not ok 1 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped out base page not ok 2 MADV_NOHUGEPAGE failed # [RUN] Basic COW after fork() ... with THP not ok 3 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out THP not ok 4 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with PTE-mapped THP not ok 5 MADV_HUGEPAGE failed # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP not ok 6 MADV_HUGEPAGE failed ...
Can you post the MADV_PAGEOUT and PTE-mapped THP related testing result? And I suppose swap need be enabled also for the testing.
You may find a dump of the logs in the link below with system information. Let me know if you find something wrong in my setup or if you need something else. Besides CONFIG_TRANSPARENT_HUGEPAGE, CONFIG_SWAP is also enabled in the kernel.
https://gitlab.com/-/snippets/2584135
Also, strace reports ENOSYS for MADV_*: madvise(0x7f2912465000, 4096, MADV_NOHUGEPAGE) = -1 ENOSYS (Function not implemented) madvise(0x7f2912000000, 2097152, MADV_HUGEPAGE) = -1 ENOSYS (Function not implemented)
O. The problem here is MADV_HUGEPAGE/MADV_NOHUGEPAGE doesn't work. Do you have CONFIG_ADVISE_SYSCALLS enabled?
It worked after I enabled the conf. Some tests failed and some were skipped. But I managed to reproduce the issue now, thanks Yin!
Bail out! 4 out of 166 tests failed # Totals: pass:146 fail:4 xfail:0 xpass:0 skip:16 error:0
These hugetlb that are failing are known failures.
Hi David, thanks for letting me know. Also, thanks for the description of the folio mapping in the other mail.
-- Cheers,
David / dhildenb
On Mon, Aug 7, 2023 at 8:10 PM Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases: User space application call madvise() with MADV_FREE, MADV_COLD and MADV_PAGEOUT for specific address range. There are THP mapped to the range. Without the patchset, the THP is skipped. With the patch, the THP will be split and handled accordingly.
David reported the cow self test skip some cases because of MADV_PAGEOUT skip THP: https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.... and I confirmed this patchset make it work again.
Changelog from v1:
Avoid two Fixes tags make backport harder. Thank Andrew for pointing this out.
Add note section to mention this is a temporary fix which is fine to reduce user-visble effects. For long term fix, we should wait for David's solution. Thank Ryan and David for pointing this out.
Spell user-visible effects out. Then people could decide whether these patches are necessary for stable branch. Thank Andrew for pointing this out.
LGTM, thank you.
On 8/8/2023 10:43 AM, Yu Zhao wrote:
On Mon, Aug 7, 2023 at 8:10 PM Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases: User space application call madvise() with MADV_FREE, MADV_COLD and MADV_PAGEOUT for specific address range. There are THP mapped to the range. Without the patchset, the THP is skipped. With the patch, the THP will be split and handled accordingly.
David reported the cow self test skip some cases because of MADV_PAGEOUT skip THP: https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.... and I confirmed this patchset make it work again.
Changelog from v1:
Avoid two Fixes tags make backport harder. Thank Andrew for pointing this out.
Add note section to mention this is a temporary fix which is fine to reduce user-visble effects. For long term fix, we should wait for David's solution. Thank Ryan and David for pointing this out.
Spell user-visible effects out. Then people could decide whether these patches are necessary for stable branch. Thank Andrew for pointing this out.
LGTM, thank you.
Thanks a lot for looking at this patchset.
Regards Yin, Fengwei
linux-stable-mirror@lists.linaro.org