In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
The commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
But it's not correct for large folio. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. But it should be OK for madvise case here.
Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com --- mm/madvise.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 886f06066622..148b46beb039 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -383,7 +383,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */ - if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -457,7 +457,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err;
- if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break;
The commits 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping.
But it's not correct for large folio. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared.
Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. But it should be OK for madvise case here.
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") Signed-off-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Yu Zhao yuzhao@google.com --- mm/huge_memory.c | 2 +- mm/madvise.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eb3678360b97..68c890875257 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1613,7 +1613,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this folio, we couldn't discard * the folio unless they all do MADV_FREE so let's skip the folio. */ - if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) goto out;
if (!folio_trylock(folio)) diff --git a/mm/madvise.c b/mm/madvise.c index 148b46beb039..55bdf641abfa 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -678,7 +678,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
- if (folio_mapcount(folio) != 1) + if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break;
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.' Subject: [PATCH 2/2] madvise: don't use mapcount() against large folio for sharing check Link: https://lore.kernel.org/stable/20230728161356.1784568-3-fengwei.yin%40intel....
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
On Sat, 29 Jul 2023 00:13:56 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
Having two Fixes: for one patch presumably makes backporting more complicated and adds risk of making mistakes.
So I have split this into a three-patch series and I've fixed up the patch naming:
Subject: madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check Subject: madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check Subject: madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
I haven't added cc:stable at this time - that awaits the description of user-visible effects.
Hi Andrew,
On 7/29/2023 1:41 AM, Andrew Morton wrote:
On Sat, 29 Jul 2023 00:13:56 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio")
Having two Fixes: for one patch presumably makes backporting more complicated and adds risk of making mistakes.
So I have split this into a three-patch series and I've fixed up the patch naming:
Subject: madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check Subject: madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check Subject: madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
Thanks a lot for your kind help. Will be careful for the future patches.
I haven't added cc:stable at this time - that awaits the description of user-visible effects.
The impact of the patch: Without the patch, when user calls madvise() with MADV_COLD, MADV_PAGEOUT and MADV_FREE, it's likely THP pages will be skipped. With the patch, It's likely the THP pages will be split to pages which will be made code, reclaimed and freed.
Regards Yin, Fengwei
On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
What are the user-visible runtime effects of these changes?
(and please try to avoid using the same Subject: for different patches)
Hi Andrew,
On 7/29/2023 1:24 AM, Andrew Morton wrote:
On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
What are the user-visible runtime effects of these changes?
(and please try to avoid using the same Subject: for different patches)
Can you hold on these patches to mm-unstable? I think we need to wait for David's work on folio_maybe_mapped_shared() and redo the fix base on that. Thanks and sorry for the noise.
Regards Yin, Fengwei
Hi Andrew,
On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
Hi Andrew,
On 7/29/2023 1:24 AM, Andrew Morton wrote:
On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
What are the user-visible runtime effects of these changes?
(and please try to avoid using the same Subject: for different patches)
Can you hold on these patches to mm-unstable? I think we need to wait for David's work on folio_maybe_mapped_shared() and redo the fix base on that. Thanks and sorry for the noise.
Sorry for bothering you again for this patchset.
Let me explain the situation here: - The reason to hold on the patches to mm-unstable is that I don't want to promote the fix in this patch (using folio_estimated_sharers()). The correct way is waiting for folio_maybe_mapped_shared() from David.
Merging these patches motivate using folio_estimated_sharers() in other places. So once folio_maybe_mapped_shared() is ready, we need to replace folio_estimated_sharers() with folio_maybe_mapped_shared().
- For this specific patches, if they are suitable for stable, we may want to merge it (special for stable branch. I assume folio_maybe_mapped_shared() may not be back ported to stable branch).
So how do we deal with this situation? Thanks in advance.
Regards Yin, Fengwei
Regards Yin, Fengwei
On Fri, 4 Aug 2023 15:14:53 +0800 "Yin, Fengwei" fengwei.yin@intel.com wrote:
Hi Andrew,
On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
Hi Andrew,
On 7/29/2023 1:24 AM, Andrew Morton wrote:
On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
What are the user-visible runtime effects of these changes?
(and please try to avoid using the same Subject: for different patches)
Can you hold on these patches to mm-unstable? I think we need to wait for David's work on folio_maybe_mapped_shared() and redo the fix base on that. Thanks and sorry for the noise.
Sorry for bothering you again for this patchset.
Let me explain the situation here:
The reason to hold on the patches to mm-unstable is that I don't want to promote the fix in this patch (using folio_estimated_sharers()). The correct way is waiting for folio_maybe_mapped_shared() from David.
Merging these patches motivate using folio_estimated_sharers() in other places. So once folio_maybe_mapped_shared() is ready, we need to replace folio_estimated_sharers() with folio_maybe_mapped_shared().
For this specific patches, if they are suitable for stable, we may want to merge it (special for stable branch. I assume folio_maybe_mapped_shared() may not be back ported to stable branch).
So how do we deal with this situation? Thanks in advance.
I think I'll stage them for 6.5, with a cc:stable.
I'll drop the current three patches. Please resend with
a) different Subject:s for all patches and
b) changelogs which fully describe the user-visible effects of the change.
Thanks.
Hi Andrew,
On 8/8/2023 12:43 AM, Andrew Morton wrote:
On Fri, 4 Aug 2023 15:14:53 +0800 "Yin, Fengwei" fengwei.yin@intel.com wrote:
Hi Andrew,
On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
Hi Andrew,
On 7/29/2023 1:24 AM, Andrew Morton wrote:
On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei fengwei.yin@intel.com wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
What are the user-visible runtime effects of these changes?
(and please try to avoid using the same Subject: for different patches)
Can you hold on these patches to mm-unstable? I think we need to wait for David's work on folio_maybe_mapped_shared() and redo the fix base on that. Thanks and sorry for the noise.
Sorry for bothering you again for this patchset.
Let me explain the situation here:
The reason to hold on the patches to mm-unstable is that I don't want to promote the fix in this patch (using folio_estimated_sharers()). The correct way is waiting for folio_maybe_mapped_shared() from David.
Merging these patches motivate using folio_estimated_sharers() in other places. So once folio_maybe_mapped_shared() is ready, we need to replace folio_estimated_sharers() with folio_maybe_mapped_shared().
For this specific patches, if they are suitable for stable, we may want to merge it (special for stable branch. I assume folio_maybe_mapped_shared() may not be back ported to stable branch).
So how do we deal with this situation? Thanks in advance.
I think I'll stage them for 6.5, with a cc:stable.
I'll drop the current three patches. Please resend with
Thanks. Will resend the patches.
Regards Yin, Fengwei
a) different Subject:s for all patches and
b) changelogs which fully describe the user-visible effects of the change.
Thanks.
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Given the aspiration for most of the memory to be large folios going forwards, wouldn't it be better to avoid splitting the large folio where the large folio is mapped entirely within the range of the madvise operation? Sorry if this has already been discussed and decided against - I didn't follow the RFC too closely. Or perhaps you plan to do this as a follow up?
Thanks, Ryan
On 02.08.23 12:27, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the
There are probably ways to work around lack of mm_struct, but it would not be completely for free. But passing the mm_struct should probably be an easy refactoring.
folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
We should probably just have a
folio_maybe_mapped_shared()
with proper documentation. Nobody should care about the exact number.
If my scheme for anon pages makes it in, that would be precise for anon pages and we could document that. Once we can handle pagecache pages as well to get a precise answer, we could change to folio_mapped_shared() and adjust the documentation.
I just saw
https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
that converts a lot of code to folio_estimated_sharers().
That patchset, for example, also does
total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
I'm not 100% sure what to think about that at this point. We eventually add false negatives (actually shared but we fail to detect it) all over the place, instead of having false positives (actually exclusive, but we fail to detect it).
And that patch set doesn't even spell that out.
Maybe it's as good as we will get, especially if my scheme doesn't make it in. But we should definitely spell that out.
On 02/08/2023 11:48, David Hildenbrand wrote:
On 02.08.23 12:27, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the
There are probably ways to work around lack of mm_struct, but it would not be completely for free. But passing the mm_struct should probably be an easy refactoring.
folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
We should probably just have a
folio_maybe_mapped_shared()
with proper documentation. Nobody should care about the exact number.
If my scheme for anon pages makes it in, that would be precise for anon pages and we could document that. Once we can handle pagecache pages as well to get a precise answer, we could change to folio_mapped_shared() and adjust the documentation.
Makes sense to me. I'm assuming your change would allow us to get rid of PG_anon_exclusive too? In which case we would also want a precise API specifically for anon folios for the CoW case, without waiting for pagecache page support.
I just saw
https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
that converts a lot of code to folio_estimated_sharers().
That patchset, for example, also does
total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
I'm not 100% sure what to think about that at this point. We eventually add false negatives (actually shared but we fail to detect it) all over the place, instead of having false positives (actually exclusive, but we fail to detect it).
And that patch set doesn't even spell that out.
Maybe it's as good as we will get, especially if my scheme doesn't make it in.
I've been working on the assumption that your scheme is plan A, and I'm waiting for it to unblock forward progress on large anon folios. Is this the right approach, or do you think your scheme is sufficiently riskly and/or far out that I should aim not to depend on it?
But we should definitely spell that out.
On 02.08.23 13:20, Ryan Roberts wrote:
On 02/08/2023 11:48, David Hildenbrand wrote:
On 02.08.23 12:27, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the
There are probably ways to work around lack of mm_struct, but it would not be completely for free. But passing the mm_struct should probably be an easy refactoring.
folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
We should probably just have a
folio_maybe_mapped_shared()
with proper documentation. Nobody should care about the exact number.
If my scheme for anon pages makes it in, that would be precise for anon pages and we could document that. Once we can handle pagecache pages as well to get a precise answer, we could change to folio_mapped_shared() and adjust the documentation.
Makes sense to me. I'm assuming your change would allow us to get rid of PG_anon_exclusive too? In which case we would also want a precise API specifically for anon folios for the CoW case, without waiting for pagecache page support.
Not necessarily and I'm currently not planning that
On the COW path, I'm planning on using it only when PG_anon_exclusive is clear for a compound page, combined with a check that there are no other page references besides from mappings: all mappings from me and #refs == #mappings -> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as possible and simple.
I just saw
https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
that converts a lot of code to folio_estimated_sharers().
That patchset, for example, also does
total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
I'm not 100% sure what to think about that at this point. We eventually add false negatives (actually shared but we fail to detect it) all over the place, instead of having false positives (actually exclusive, but we fail to detect it).
And that patch set doesn't even spell that out.
Maybe it's as good as we will get, especially if my scheme doesn't make it in.
I've been working on the assumption that your scheme is plan A, and I'm waiting for it to unblock forward progress on large anon folios. Is this the right approach, or do you think your scheme is sufficiently riskly and/or far out that I should aim not to depend on it?
It is plan A. IMHO, it does not feel too risky and/or far out at this point -- and the implementation should not end up too complicated. But as always, I cannot promise anything before it's been implemented and discussed upstream.
Hopefully, we know more soon. I'll get at implementing it fairly soon.
On 02/08/2023 12:36, David Hildenbrand wrote:
On 02.08.23 13:20, Ryan Roberts wrote:
On 02/08/2023 11:48, David Hildenbrand wrote:
On 02.08.23 12:27, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the
There are probably ways to work around lack of mm_struct, but it would not be completely for free. But passing the mm_struct should probably be an easy refactoring.
folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
We should probably just have a
folio_maybe_mapped_shared()
with proper documentation. Nobody should care about the exact number.
If my scheme for anon pages makes it in, that would be precise for anon pages and we could document that. Once we can handle pagecache pages as well to get a precise answer, we could change to folio_mapped_shared() and adjust the documentation.
Makes sense to me. I'm assuming your change would allow us to get rid of PG_anon_exclusive too? In which case we would also want a precise API specifically for anon folios for the CoW case, without waiting for pagecache page support.
Not necessarily and I'm currently not planning that
On the COW path, I'm planning on using it only when PG_anon_exclusive is clear for a compound page, combined with a check that there are no other page references besides from mappings: all mappings from me and #refs == #mappings -> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as possible and simple.
I just saw
https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
that converts a lot of code to folio_estimated_sharers().
That patchset, for example, also does
total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
I'm not 100% sure what to think about that at this point. We eventually add false negatives (actually shared but we fail to detect it) all over the place, instead of having false positives (actually exclusive, but we fail to detect it).
And that patch set doesn't even spell that out.
Maybe it's as good as we will get, especially if my scheme doesn't make it in.
I've been working on the assumption that your scheme is plan A, and I'm waiting for it to unblock forward progress on large anon folios. Is this the right approach, or do you think your scheme is sufficiently riskly and/or far out that I should aim not to depend on it?
It is plan A. IMHO, it does not feel too risky and/or far out at this point -- and the implementation should not end up too complicated. But as always, I cannot promise anything before it's been implemented and discussed upstream.
OK, good we are on the same folio... (stolen from Hugh; if a joke is worth telling once, its worth telling 1000 times ;-)
Hopefully, we know more soon. I'll get at implementing it fairly soon.
On 02.08.23 13:51, Ryan Roberts wrote:
On 02/08/2023 12:36, David Hildenbrand wrote:
On 02.08.23 13:20, Ryan Roberts wrote:
On 02/08/2023 11:48, David Hildenbrand wrote:
On 02.08.23 12:27, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the
There are probably ways to work around lack of mm_struct, but it would not be completely for free. But passing the mm_struct should probably be an easy refactoring.
folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
We should probably just have a
folio_maybe_mapped_shared()
with proper documentation. Nobody should care about the exact number.
If my scheme for anon pages makes it in, that would be precise for anon pages and we could document that. Once we can handle pagecache pages as well to get a precise answer, we could change to folio_mapped_shared() and adjust the documentation.
Makes sense to me. I'm assuming your change would allow us to get rid of PG_anon_exclusive too? In which case we would also want a precise API specifically for anon folios for the CoW case, without waiting for pagecache page support.
Not necessarily and I'm currently not planning that
On the COW path, I'm planning on using it only when PG_anon_exclusive is clear for a compound page, combined with a check that there are no other page references besides from mappings: all mappings from me and #refs == #mappings -> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as possible and simple.
I just saw
https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
that converts a lot of code to folio_estimated_sharers().
That patchset, for example, also does
total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
I'm not 100% sure what to think about that at this point. We eventually add false negatives (actually shared but we fail to detect it) all over the place, instead of having false positives (actually exclusive, but we fail to detect it).
And that patch set doesn't even spell that out.
Maybe it's as good as we will get, especially if my scheme doesn't make it in.
I've been working on the assumption that your scheme is plan A, and I'm waiting for it to unblock forward progress on large anon folios. Is this the right approach, or do you think your scheme is sufficiently riskly and/or far out that I should aim not to depend on it?
It is plan A. IMHO, it does not feel too risky and/or far out at this point -- and the implementation should not end up too complicated. But as always, I cannot promise anything before it's been implemented and discussed upstream.
OK, good we are on the same folio... (stolen from Hugh; if a joke is worth telling once, its worth telling 1000 times ;-)
Heard it first the time :))
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
Given the aspiration for most of the memory to be large folios going forwards, wouldn't it be better to avoid splitting the large folio where the large folio is mapped entirely within the range of the madvise operation? Sorry if this has already been discussed and decided against - I didn't follow the RFC too closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens on the RFC. So I tried to submit this part of change which is bug fix. The other thing left in RFC is optimization (avoid split large folio if we can).
Regards Yin, Fengwei
Thanks, Ryan
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Given the aspiration for most of the memory to be large folios going forwards, wouldn't it be better to avoid splitting the large folio where the large folio is mapped entirely within the range of the madvise operation? Sorry if this has already been discussed and decided against - I didn't follow the RFC too closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens on the RFC. So I tried to submit this part of change which is bug fix. The other thing left in RFC is optimization (avoid split large folio if we can).
Regards Yin, Fengwei
Thanks, Ryan
On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.
Regards Yin, Fengwei
Given the aspiration for most of the memory to be large folios going forwards, wouldn't it be better to avoid splitting the large folio where the large folio is mapped entirely within the range of the madvise operation? Sorry if this has already been discussed and decided against - I didn't follow the RFC too closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens on the RFC. So I tried to submit this part of change which is bug fix. The other thing left in RFC is optimization (avoid split large folio if we can).
Regards Yin, Fengwei
Thanks, Ryan
On 02/08/2023 13:42, Yin, Fengwei wrote:
On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Regards Yin, Fengwei
Given the aspiration for most of the memory to be large folios going forwards, wouldn't it be better to avoid splitting the large folio where the large folio is mapped entirely within the range of the madvise operation? Sorry if this has already been discussed and decided against - I didn't follow the RFC too closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens on the RFC. So I tried to submit this part of change which is bug fix. The other thing left in RFC is optimization (avoid split large folio if we can).
Regards Yin, Fengwei
Thanks, Ryan
On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote:
On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Let's wait for David's work ready. Otherwise, some other similar fix could be submitted. Unfortunately, we can't build folio_estimated_sharers() based on folio_maybe_mapped_shared() because latter one requires the extra parameter.
Regards Yin, Fengwei
Regards Yin, Fengwei
Given the aspiration for most of the memory to be large folios going forwards, wouldn't it be better to avoid splitting the large folio where the large folio is mapped entirely within the range of the madvise operation? Sorry if this has already been discussed and decided against - I didn't follow the RFC too closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens on the RFC. So I tried to submit this part of change which is bug fix. The other thing left in RFC is optimization (avoid split large folio if we can).
Regards Yin, Fengwei
Thanks, Ryan
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote:
On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote: > In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), > folio_mapcount() is used to check whether the folio is shared. But it's > not correct as folio_mapcount() returns total mapcount of large folio. > > Use folio_estimated_sharers() here as the estimated number is enough. > > Yin Fengwei (2): > madvise: don't use mapcount() against large folio for sharing check > madvise: don't use mapcount() against large folio for sharing check > > mm/huge_memory.c | 2 +- > mm/madvise.c | 6 +++--- > 2 files changed, 4 insertions(+), 4 deletions(-) >
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote:
On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote: > On 28/07/2023 17:13, Yin Fengwei wrote: >> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >> folio_mapcount() is used to check whether the folio is shared. But it's >> not correct as folio_mapcount() returns total mapcount of large folio. >> >> Use folio_estimated_sharers() here as the estimated number is enough. >> >> Yin Fengwei (2): >> madvise: don't use mapcount() against large folio for sharing check >> madvise: don't use mapcount() against large folio for sharing check >> >> mm/huge_memory.c | 2 +- >> mm/madvise.c | 6 +++--- >> 2 files changed, 4 insertions(+), 4 deletions(-) >> > > As a set of fixes, I agree this is definitely an improvement, so: > > Reviewed-By: Ryan Roberts Thanks.
> > > But I have a couple of comments around further improvements; > > Once we have the scheme that David is working on to be able to provide precise > exclusive vs shared info, we will probably want to move to that. Although that > scheme will need access to the mm_struct of a process known to be mapping the > folio. We have that info, but its not passed to folio_estimated_sharers() so we > can't just reimplement folio_estimated_sharers() - we will need to rework these > call sites again. Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
Stable was CCed. Andrew asked about the user-visible impact and I replied: https://lore.kernel.org/linux-mm/24e7429c-14ed-d953-e652-eac178de76e3@intel....
I was not aware that selftest is impact by the issue. And waiting for whether these patches are necessary for stable.
But I don't want to promote this kind of change in other place as we will move to David's solution.
Regards Yin, Fengwei
On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
"
On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote:
On 8/2/2023 8:40 PM, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote: > > > On 8/2/2023 6:27 PM, Ryan Roberts wrote: >> On 28/07/2023 17:13, Yin Fengwei wrote: >>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >>> folio_mapcount() is used to check whether the folio is shared. But it's >>> not correct as folio_mapcount() returns total mapcount of large folio. >>> >>> Use folio_estimated_sharers() here as the estimated number is enough. >>> >>> Yin Fengwei (2): >>> madvise: don't use mapcount() against large folio for sharing check >>> madvise: don't use mapcount() against large folio for sharing check >>> >>> mm/huge_memory.c | 2 +- >>> mm/madvise.c | 6 +++--- >>> 2 files changed, 4 insertions(+), 4 deletions(-) >>> >> >> As a set of fixes, I agree this is definitely an improvement, so: >> >> Reviewed-By: Ryan Roberts > Thanks. > >> >> >> But I have a couple of comments around further improvements; >> >> Once we have the scheme that David is working on to be able to provide precise >> exclusive vs shared info, we will probably want to move to that. Although that >> scheme will need access to the mm_struct of a process known to be mapping the >> folio. We have that info, but its not passed to folio_estimated_sharers() so we >> can't just reimplement folio_estimated_sharers() - we will need to rework these >> call sites again. > Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
Stable was CCed.
Need to add the "Cc: stable@vger.kernel.org" tag: Documentation/process/stable-kernel-rules.rst
On 8/4/2023 7:38 AM, Yu Zhao wrote:
On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
"
On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote:
On 8/2/2023 8:40 PM, Ryan Roberts wrote: > On 02/08/2023 13:35, Yin, Fengwei wrote: >> >> >> On 8/2/2023 6:27 PM, Ryan Roberts wrote: >>> On 28/07/2023 17:13, Yin Fengwei wrote: >>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >>>> folio_mapcount() is used to check whether the folio is shared. But it's >>>> not correct as folio_mapcount() returns total mapcount of large folio. >>>> >>>> Use folio_estimated_sharers() here as the estimated number is enough. >>>> >>>> Yin Fengwei (2): >>>> madvise: don't use mapcount() against large folio for sharing check >>>> madvise: don't use mapcount() against large folio for sharing check >>>> >>>> mm/huge_memory.c | 2 +- >>>> mm/madvise.c | 6 +++--- >>>> 2 files changed, 4 insertions(+), 4 deletions(-) >>>> >>> >>> As a set of fixes, I agree this is definitely an improvement, so: >>> >>> Reviewed-By: Ryan Roberts >> Thanks. >> >>> >>> >>> But I have a couple of comments around further improvements; >>> >>> Once we have the scheme that David is working on to be able to provide precise >>> exclusive vs shared info, we will probably want to move to that. Although that >>> scheme will need access to the mm_struct of a process known to be mapping the >>> folio. We have that info, but its not passed to folio_estimated_sharers() so we >>> can't just reimplement folio_estimated_sharers() - we will need to rework these >>> call sites again. >> Yes. This could be extra work. Maybe should delay till David's work is done. > > What you have is definitely an improvement over what was there before. And is > probably the best we can do without David's scheme. So I wouldn't delay this. > Just pointing out that we will be able to make it even better later on (if > David's stuff goes in). Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
Stable was CCed.
Need to add the "Cc: stable@vger.kernel.org" tag: Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).
I'd like to wait for answer from Andrew whether these patches are suitable for stable (I suppose you think so) branch.
Regards Yin, Fengwei
On 04.08.23 02:17, Yin, Fengwei wrote:
On 8/4/2023 7:38 AM, Yu Zhao wrote:
On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
"
On 8/2/2023 8:49 PM, Ryan Roberts wrote:
On 02/08/2023 13:42, Yin, Fengwei wrote: > > > On 8/2/2023 8:40 PM, Ryan Roberts wrote: >> On 02/08/2023 13:35, Yin, Fengwei wrote: >>> >>> >>> On 8/2/2023 6:27 PM, Ryan Roberts wrote: >>>> On 28/07/2023 17:13, Yin Fengwei wrote: >>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >>>>> folio_mapcount() is used to check whether the folio is shared. But it's >>>>> not correct as folio_mapcount() returns total mapcount of large folio. >>>>> >>>>> Use folio_estimated_sharers() here as the estimated number is enough. >>>>> >>>>> Yin Fengwei (2): >>>>> madvise: don't use mapcount() against large folio for sharing check >>>>> madvise: don't use mapcount() against large folio for sharing check >>>>> >>>>> mm/huge_memory.c | 2 +- >>>>> mm/madvise.c | 6 +++--- >>>>> 2 files changed, 4 insertions(+), 4 deletions(-) >>>>> >>>> >>>> As a set of fixes, I agree this is definitely an improvement, so: >>>> >>>> Reviewed-By: Ryan Roberts >>> Thanks. >>> >>>> >>>> >>>> But I have a couple of comments around further improvements; >>>> >>>> Once we have the scheme that David is working on to be able to provide precise >>>> exclusive vs shared info, we will probably want to move to that. Although that >>>> scheme will need access to the mm_struct of a process known to be mapping the >>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we >>>> can't just reimplement folio_estimated_sharers() - we will need to rework these >>>> call sites again. >>> Yes. This could be extra work. Maybe should delay till David's work is done. >> >> What you have is definitely an improvement over what was there before. And is >> probably the best we can do without David's scheme. So I wouldn't delay this. >> Just pointing out that we will be able to make it even better later on (if >> David's stuff goes in). > Yes. I agree that we should wait for David's work ready and do fix based on that.
I was suggesting the opposite - not waiting. Then we can do separate improvement later.
Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
Stable was CCed.
Need to add the "Cc: stable@vger.kernel.org" tag: Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).
I'd like to wait for answer from Andrew whether these patches are suitable for stable (I suppose you think so) branch.
Note that the COW test does not fail -- it skips -- but the behavir changed:
$ ./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..190 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page ok 1 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped out base page ok 2 No leak from parent into child # [RUN] Basic COW after fork() ... with THP ok 3 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out THP ok 4 No leak from parent into child # [RUN] Basic COW after fork() ... with PTE-mapped THP ok 5 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with single PTE of THP ok 7 No leak from parent into child # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP ok 8 No leak from parent into child # [RUN] Basic COW after fork() ... with partially mremap()'ed THP ok 9 No leak from parent into child # [RUN] Basic COW after fork() ... with partially shared THP ok 10 No leak from parent into child ...
Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
The code that broke it is
commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82 Author: Vishal Moola (Oracle) vishal.moola@gmail.com Date: Wed Dec 21 10:08:46 2022 -0800
madvise: convert madvise_cold_or_pageout_pte_range() to use folios
This change removes a number of calls to compound_head(), and saves 1729 bytes of kernel text.
Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) vishal.moola@gmail.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org
Ever since v6.3.
The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(), conversion.
Probably all that is information worth having in the patch description.
Hi David,
On 8/4/2023 3:31 PM, David Hildenbrand wrote:
On 04.08.23 02:17, Yin, Fengwei wrote:
On 8/4/2023 7:38 AM, Yu Zhao wrote:
On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
"
On 8/2/2023 8:49 PM, Ryan Roberts wrote: > On 02/08/2023 13:42, Yin, Fengwei wrote: >> >> >> On 8/2/2023 8:40 PM, Ryan Roberts wrote: >>> On 02/08/2023 13:35, Yin, Fengwei wrote: >>>> >>>> >>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote: >>>>> On 28/07/2023 17:13, Yin Fengwei wrote: >>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >>>>>> folio_mapcount() is used to check whether the folio is shared. But it's >>>>>> not correct as folio_mapcount() returns total mapcount of large folio. >>>>>> >>>>>> Use folio_estimated_sharers() here as the estimated number is enough. >>>>>> >>>>>> Yin Fengwei (2): >>>>>> madvise: don't use mapcount() against large folio for sharing check >>>>>> madvise: don't use mapcount() against large folio for sharing check >>>>>> >>>>>> mm/huge_memory.c | 2 +- >>>>>> mm/madvise.c | 6 +++--- >>>>>> 2 files changed, 4 insertions(+), 4 deletions(-) >>>>>> >>>>> >>>>> As a set of fixes, I agree this is definitely an improvement, so: >>>>> >>>>> Reviewed-By: Ryan Roberts >>>> Thanks. >>>> >>>>> >>>>> >>>>> But I have a couple of comments around further improvements; >>>>> >>>>> Once we have the scheme that David is working on to be able to provide precise >>>>> exclusive vs shared info, we will probably want to move to that. Although that >>>>> scheme will need access to the mm_struct of a process known to be mapping the >>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we >>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these >>>>> call sites again. >>>> Yes. This could be extra work. Maybe should delay till David's work is done. >>> >>> What you have is definitely an improvement over what was there before. And is >>> probably the best we can do without David's scheme. So I wouldn't delay this. >>> Just pointing out that we will be able to make it even better later on (if >>> David's stuff goes in). >> Yes. I agree that we should wait for David's work ready and do fix based on that. > > I was suggesting the opposite - not waiting. Then we can do separate improvement > later. Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
Stable was CCed.
Need to add the "Cc: stable@vger.kernel.org" tag: Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).
I'd like to wait for answer from Andrew whether these patches are suitable for stable (I suppose you think so) branch.
Note that the COW test does not fail -- it skips -- but the behavir changed:
$ ./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..190 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page ok 1 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped out base page ok 2 No leak from parent into child # [RUN] Basic COW after fork() ... with THP ok 3 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out THP ok 4 No leak from parent into child # [RUN] Basic COW after fork() ... with PTE-mapped THP ok 5 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with single PTE of THP ok 7 No leak from parent into child # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP ok 8 No leak from parent into child # [RUN] Basic COW after fork() ... with partially mremap()'ed THP ok 9 No leak from parent into child # [RUN] Basic COW after fork() ... with partially shared THP ok 10 No leak from parent into child ...
Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
The code that broke it is
commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82 Author: Vishal Moola (Oracle) vishal.moola@gmail.com Date: Wed Dec 21 10:08:46 2022 -0800
madvise: convert madvise_cold_or_pageout_pte_range() to use folios This change removes a number of calls to compound_head(), and saves 1729 bytes of kernel text. Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) vishal.moola@gmail.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org
Ever since v6.3.
The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(), conversion.
Probably all that is information worth having in the patch description.
Thanks a lot for checking this. I will try this patchset to see whether it can restore the behavior (I suppose so from your broken commit info but want to confirm).
Regards Yin, Fengwei
On 8/4/2023 3:31 PM, David Hildenbrand wrote:
On 04.08.23 02:17, Yin, Fengwei wrote:
On 8/4/2023 7:38 AM, Yu Zhao wrote:
On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei fengwei.yin@intel.com wrote:
On 8/4/2023 4:46 AM, Yu Zhao wrote:
On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei fengwei.yin@intel.com wrote:
"
On 8/2/2023 8:49 PM, Ryan Roberts wrote: > On 02/08/2023 13:42, Yin, Fengwei wrote: >> >> >> On 8/2/2023 8:40 PM, Ryan Roberts wrote: >>> On 02/08/2023 13:35, Yin, Fengwei wrote: >>>> >>>> >>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote: >>>>> On 28/07/2023 17:13, Yin Fengwei wrote: >>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), >>>>>> folio_mapcount() is used to check whether the folio is shared. But it's >>>>>> not correct as folio_mapcount() returns total mapcount of large folio. >>>>>> >>>>>> Use folio_estimated_sharers() here as the estimated number is enough. >>>>>> >>>>>> Yin Fengwei (2): >>>>>> madvise: don't use mapcount() against large folio for sharing check >>>>>> madvise: don't use mapcount() against large folio for sharing check >>>>>> >>>>>> mm/huge_memory.c | 2 +- >>>>>> mm/madvise.c | 6 +++--- >>>>>> 2 files changed, 4 insertions(+), 4 deletions(-) >>>>>> >>>>> >>>>> As a set of fixes, I agree this is definitely an improvement, so: >>>>> >>>>> Reviewed-By: Ryan Roberts >>>> Thanks. >>>> >>>>> >>>>> >>>>> But I have a couple of comments around further improvements; >>>>> >>>>> Once we have the scheme that David is working on to be able to provide precise >>>>> exclusive vs shared info, we will probably want to move to that. Although that >>>>> scheme will need access to the mm_struct of a process known to be mapping the >>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we >>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these >>>>> call sites again. >>>> Yes. This could be extra work. Maybe should delay till David's work is done. >>> >>> What you have is definitely an improvement over what was there before. And is >>> probably the best we can do without David's scheme. So I wouldn't delay this. >>> Just pointing out that we will be able to make it even better later on (if >>> David's stuff goes in). >> Yes. I agree that we should wait for David's work ready and do fix based on that. > > I was suggesting the opposite - not waiting. Then we can do separate improvement > later. Let's wait for David's work ready.
Waiting is fine as long as we don't miss the next merge window -- we don't want these two bugs to get into another release. Also I think we should cc stable, since as David mentioned, they have been causing selftest failures.
Stable was CCed.
Need to add the "Cc: stable@vger.kernel.org" tag: Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).
I'd like to wait for answer from Andrew whether these patches are suitable for stable (I suppose you think so) branch.
Note that the COW test does not fail -- it skips -- but the behavir changed:
$ ./cow # [INFO] detected THP size: 2048 KiB # [INFO] detected hugetlb page size: 2048 KiB # [INFO] detected hugetlb page size: 1048576 KiB # [INFO] huge zeropage is enabled TAP version 13 1..190 # [INFO] Anonymous memory tests in private mappings # [RUN] Basic COW after fork() ... with base page ok 1 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped out base page ok 2 No leak from parent into child # [RUN] Basic COW after fork() ... with THP ok 3 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out THP ok 4 No leak from parent into child # [RUN] Basic COW after fork() ... with PTE-mapped THP ok 5 No leak from parent into child # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with single PTE of THP ok 7 No leak from parent into child # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP ok 8 No leak from parent into child # [RUN] Basic COW after fork() ... with partially mremap()'ed THP ok 9 No leak from parent into child # [RUN] Basic COW after fork() ... with partially shared THP ok 10 No leak from parent into child ...
Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
The code that broke it is
commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82 Author: Vishal Moola (Oracle) vishal.moola@gmail.com Date: Wed Dec 21 10:08:46 2022 -0800
madvise: convert madvise_cold_or_pageout_pte_range() to use folios This change removes a number of calls to compound_head(), and saves 1729 bytes of kernel text. Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) vishal.moola@gmail.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: SeongJae Park sj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org
Ever since v6.3.
The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(), conversion.
Confirmed this patchset also can make swapped-out, PTE-mapped THP related tests from skip to pass.
Regards Yin, Fengwei
Probably all that is information worth having in the patch description.
On 02.08.23 14:40, Ryan Roberts wrote:
On 02/08/2023 13:35, Yin, Fengwei wrote:
On 8/2/2023 6:27 PM, Ryan Roberts wrote:
On 28/07/2023 17:13, Yin Fengwei wrote:
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
Yin Fengwei (2): madvise: don't use mapcount() against large folio for sharing check madvise: don't use mapcount() against large folio for sharing check
mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-)
As a set of fixes, I agree this is definitely an improvement, so:
Reviewed-By: Ryan Roberts
Thanks.
But I have a couple of comments around further improvements;
Once we have the scheme that David is working on to be able to provide precise exclusive vs shared info, we will probably want to move to that. Although that scheme will need access to the mm_struct of a process known to be mapping the folio. We have that info, but its not passed to folio_estimated_sharers() so we can't just reimplement folio_estimated_sharers() - we will need to rework these call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.
What you have is definitely an improvement over what was there before. And is probably the best we can do without David's scheme. So I wouldn't delay this. Just pointing out that we will be able to make it even better later on (if David's stuff goes in).
Agreed, we just should be careful and clearly spell out the implications and that this is eventually also not what we 100% want.
That MADV_PAGEOUT now fails on a PTE-mapped THP -- as can be seen when executing the cow selftest where MADV_PAGEOUT will essentially fail -- is certainly undesired and should be fixed IMHO.
linux-stable-mirror@lists.linaro.org