We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
---
v2:
* add fix, cc stable and put description about the flow of current code
* move deferred_split_folio() into map_anon_folio_pmd()
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..f13de93637bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1232,6 +1232,7 @@ static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd,
 	count_vm_event(THP_FAULT_ALLOC);
 	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
 	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+	deferred_split_folio(folio, false);
 }
 
 static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1272,7 +1273,6 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 	map_anon_folio_pmd(folio, vmf->pmd, vma, haddr);
 	mm_inc_nr_ptes(vma->vm_mm);
-	deferred_split_folio(folio, false);
 	spin_unlock(vmf->ptl);
 }
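For reference, a rough sketch of how map_anon_folio_pmd() looks with this change applied. This is reconstructed from the diff context and is not a verbatim copy of the upstream function; the exact body in mm/huge_memory.c may differ slightly:

```
/*
 * Sketch only: map_anon_folio_pmd() with deferred_split_folio() moved
 * into it, so every caller that installs a fresh anon pmd folio also
 * registers it on the deferred split queue.
 */
static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd,
		struct vm_area_struct *vma, unsigned long haddr)
{
	pmd_t entry;

	entry = folio_mk_pmd(folio, vma->vm_page_prot);
	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
	folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	set_pmd_at(vma->vm_mm, haddr, pmd, entry);
	update_mmu_cache_pmd(vma, haddr, pmd);
	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
	count_vm_event(THP_FAULT_ALLOC);
	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
	deferred_split_folio(folio, false);	/* moved here by this patch */
}
```

Both callers, __do_huge_pmd_anonymous_page() and do_huge_zero_wp_pmd(), then get the ds_queue registration without having to remember it themselves.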
On Thu, Oct 02, 2025 at 01:38:25AM +0000, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Since we move deferred_split_folio() into map_anon_folio_pmd(), I am wondering whether we can also consolidate the flow in collapse_huge_page().
Use map_anon_folio_pmd() in collapse_huge_page(), but skip the statistics adjustment.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
On 2025/10/2 09:46, Wei Yang wrote:
On Thu, Oct 02, 2025 at 01:38:25AM +0000, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Since we move deferred_split_folio() into map_anon_folio_pmd(), I am wondering whether we can also consolidate the flow in collapse_huge_page().
Use map_anon_folio_pmd() in collapse_huge_page(), but skip the statistics adjustment.
Yeah, that's a good idea :)
We could add a simple bool is_fault parameter to map_anon_folio_pmd() to control the statistics.
The fault paths would call it with true, and the collapse paths could then call it with false.
Something like this:
```
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..9924180a4a56 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1218,7 +1218,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 }
 
 static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd,
-		struct vm_area_struct *vma, unsigned long haddr)
+		struct vm_area_struct *vma, unsigned long haddr, bool is_fault)
 {
 	pmd_t entry;
 
@@ -1228,10 +1228,15 @@ static void map_anon_folio_pmd(struct folio *folio, pmd_t *pmd,
 	folio_add_lru_vma(folio, vma);
 	set_pmd_at(vma->vm_mm, haddr, pmd, entry);
 	update_mmu_cache_pmd(vma, haddr, pmd);
-	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-	count_vm_event(THP_FAULT_ALLOC);
-	count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
-	count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+	if (is_fault) {
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		count_vm_event(THP_FAULT_ALLOC);
+		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
+		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+	}
+
+	deferred_split_folio(folio, false);
 }
 
 static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d0957648db19..2eddd5a60e48 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1227,17 +1227,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	__folio_mark_uptodate(folio);
 	pgtable = pmd_pgtable(_pmd);
 
-	_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
+	map_anon_folio_pmd(folio, pmd, vma, address, false);
 	spin_unlock(pmd_ptl);
 
 	folio = NULL;
```
Untested, though.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
On Thu, Oct 02, 2025 at 10:31:53AM +0800, Lance Yang wrote:
On 2025/10/2 09:46, Wei Yang wrote:
On Thu, Oct 02, 2025 at 01:38:25AM +0000, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Since we move deferred_split_folio() into map_anon_folio_pmd(), I am wondering whether we can also consolidate the flow in collapse_huge_page().
Use map_anon_folio_pmd() in collapse_huge_page(), but skip the statistics adjustment.
Yeah, that's a good idea :)
We could add a simple bool is_fault parameter to map_anon_folio_pmd() to control the statistics.
The fault paths would call it with true, and the collapse paths could then call it with false.
Something like this:
[... proposed diff snipped; same as above ...]

Untested, though.
This is the same as I thought.
Will prepare a patch for it.
On 02.10.25 05:17, Wei Yang wrote:
On Thu, Oct 02, 2025 at 10:31:53AM +0800, Lance Yang wrote:
On 2025/10/2 09:46, Wei Yang wrote:
On Thu, Oct 02, 2025 at 01:38:25AM +0000, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Since we move deferred_split_folio() into map_anon_folio_pmd(), I am wondering whether we can also consolidate the flow in collapse_huge_page().
Use map_anon_folio_pmd() in collapse_huge_page(), but skip the statistics adjustment.
Yeah, that's a good idea :)
We could add a simple bool is_fault parameter to map_anon_folio_pmd() to control the statistics.
The fault paths would call it with true, and the collapse paths could then call it with false.
Something like this:
[... proposed diff snipped; same as above ...]

Untested, though.
This is the same as I thought.
Will prepare a patch for it.
Let's do that as an add-on patch, though.
On 2025/10/2 15:16, David Hildenbrand wrote:
On 02.10.25 05:17, Wei Yang wrote:
On Thu, Oct 02, 2025 at 10:31:53AM +0800, Lance Yang wrote:
On 2025/10/2 09:46, Wei Yang wrote:
On Thu, Oct 02, 2025 at 01:38:25AM +0000, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Since we move deferred_split_folio() into map_anon_folio_pmd(), I am wondering whether we can also consolidate the flow in collapse_huge_page().
Use map_anon_folio_pmd() in collapse_huge_page(), but skip the statistics adjustment.
Yeah, that's a good idea :)
We could add a simple bool is_fault parameter to map_anon_folio_pmd() to control the statistics.
The fault paths would call it with true, and the collapse paths could then call it with false.
Something like this:
[... proposed diff snipped; same as above ...]

Untested, though.
This is the same as I thought.
Will prepare a patch for it.
Let's do that as an add-on patch, though.
Yeah, let’s do that separately ;)
On 02.10.25 03:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
Acked-by: David Hildenbrand <david@redhat.com>
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
Cool. LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
v2:
- add fix, cc stable and put description about the flow of current code
- move deferred_split_folio() into map_anon_folio_pmd()
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[... patch diff snipped ...]
On 02/10/25 7:08 am, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
Thanks!
Reviewed-by: Dev Jain <dev.jain@arm.com>
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Thanks, Lance
Signed-off-by: Wei Yang richard.weiyang@gmail.com Cc: David Hildenbrand david@redhat.com Cc: Lance Yang lance.yang@linux.dev Cc: Dev Jain dev.jain@arm.com Cc: stable@vger.kernel.org
v2:
- add fix, cc stable and put description about the flow of current code
- move deferred_split_folio() into map_anon_folio_pmd()
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[... patch diff snipped ...]
On 3 Oct 2025, at 9:49, Lance Yang wrote:
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Yes, I agree. In this case, this patch looks more like an optimization for splitting underused THPs.
One observation on this change: right after the zero-pmd wp fault, the deferred split queue could be scanned and the newly added pmd folio would be split, since it is all zero except one subpage. This means we should probably allocate a base folio for the zero-pmd wp fault and map the rest to the zero page from the beginning, if splitting underused THPs is enabled, to avoid this long trip. The downside is that the user application cannot get a pmd folio if it intends to write data into the entire folio.
Usama might be able to give some insight here.
Thanks, Lance
Signed-off-by: Wei Yang richard.weiyang@gmail.com Cc: David Hildenbrand david@redhat.com Cc: Lance Yang lance.yang@linux.dev Cc: Dev Jain dev.jain@arm.com Cc: stable@vger.kernel.org
v2:
- add fix, cc stable and put description about the flow of current code
- move deferred_split_folio() into map_anon_folio_pmd()
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[... patch diff snipped ...]
Best Regards, Yan, Zi
On 03/10/2025 15:08, Zi Yan wrote:
On 3 Oct 2025, at 9:49, Lance Yang wrote:
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Yes, I agree. In this case, this patch looks more like an optimization for splitting underused THPs.
One observation on this change: right after the zero-pmd wp fault, the deferred split queue could be scanned and the newly added pmd folio would be split, since it is all zero except one subpage. This means we should probably allocate a base folio for the zero-pmd wp fault and map the rest to the zero page from the beginning, if splitting underused THPs is enabled, to avoid this long trip. The downside is that the user application cannot get a pmd folio if it intends to write data into the entire folio.
Usama might be able to give some insight here.
Thanks for CCing me Zi!
hmm I think the downside of not having a PMD folio probably outweighs the cost of splitting a zero-filled page? Of course I don't have any numbers to back that up, but that would be my initial guess.
Also:
Acked-by: Usama Arif <usamaarif642@gmail.com>
Thanks, Lance
Signed-off-by: Wei Yang richard.weiyang@gmail.com Cc: David Hildenbrand david@redhat.com Cc: Lance Yang lance.yang@linux.dev Cc: Dev Jain dev.jain@arm.com Cc: stable@vger.kernel.org
v2:
- add fix, cc stable and put description about the flow of current code
- move deferred_split_folio() into map_anon_folio_pmd()
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[... patch diff snipped ...]

Best Regards, Yan, Zi
On 3 Oct 2025, at 11:30, Usama Arif wrote:
On 03/10/2025 15:08, Zi Yan wrote:
On 3 Oct 2025, at 9:49, Lance Yang wrote:
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Yes, I agree. In this case, this patch looks more like an optimization for splitting underused THPs.
One observation on this change: right after the zero-pmd wp fault, the deferred split queue could be scanned and the newly added pmd folio would be split, since it is all zero except one subpage. This means we should probably allocate a base folio for the zero-pmd wp fault and map the rest to the zero page from the beginning, if splitting underused THPs is enabled, to avoid this long trip. The downside is that the user application cannot get a pmd folio if it intends to write data into the entire folio.
Usama might be able to give some insight here.
Thanks for CCing me Zi!
hmm I think the downside of not having a PMD folio probably outweighs the cost of splitting a zero-filled page?
Yeah, I agree.
Of course I don't have any numbers to back that up, but that would be my initial guess.
Also:
Acked-by: Usama Arif <usamaarif642@gmail.com>
Thanks, Lance
Signed-off-by: Wei Yang richard.weiyang@gmail.com Cc: David Hildenbrand david@redhat.com Cc: Lance Yang lance.yang@linux.dev Cc: Dev Jain dev.jain@arm.com Cc: stable@vger.kernel.org
v2:
- add fix, cc stable and put description about the flow of current code
- move deferred_split_folio() into map_anon_folio_pmd()
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[... patch diff snipped ...]

Best Regards, Yan, Zi
Best Regards, Yan, Zi
On Fri, Oct 03, 2025 at 10:08:37AM -0400, Zi Yan wrote:
On 3 Oct 2025, at 9:49, Lance Yang wrote:
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Yes, I agree. In this case, this patch looks more like an optimization for splitting underused THPs.
One observation on this change: right after the zero-pmd wp fault, the deferred split queue could be scanned and the newly added pmd folio would be split, since it is all zero except one subpage. This means we should probably allocate a base folio for the zero-pmd wp fault and map the rest to the zero page from the beginning, if splitting underused THPs is enabled, to avoid this long trip. The downside is that the user application cannot get a pmd folio if it intends to write data into the entire folio.
Thanks for raising this.
IMHO, we could face a similar situation in __do_huge_pmd_anonymous_page(). If my understanding is correct, the allocated folio is zeroed and we have no idea how the user would write data to it.
Since the shrinker is active when memory is low, maybe vma_alloc_anon_folio_pmd() has already told us the current status of memory. If it does get a pmd folio, we probably have enough memory in the system.
Usama might be able to give some insight here.
On Fri, Oct 03, 2025 at 09:49:28PM +0800, Lance Yang wrote:
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
Thanks for taking a look.
But at that time do_huge_zero_wp_pmd() had not been introduced yet, so how could that commit fix a non-existent case? And how could the fix be backported? I am confused here.
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Thanks, Lance
On 2025/10/4 10:04, Wei Yang wrote:
On Fri, Oct 03, 2025 at 09:49:28PM +0800, Lance Yang wrote:
Hey Wei,
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory
IIRC, it was commit dafff3f4c850 ("mm: split underused THPs") that started unconditionally adding all new anon THPs to _deferred_list :)
Thanks for taking a look.
But at that time do_huge_zero_wp_pmd() had not been introduced yet, so how could that commit fix a
Ah, I see. I was focused on the policy change ...
non-existent case? And how could the fix be backported? I am confused here.
And, yes, 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") was merged later and it introduced the new do_huge_zero_wp_pmd() path without aligning with the policy ...
Thanks for clarifying! Lance
under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Shouldn't this rather be the following?
Fixes: dafff3f4c850 ("mm: split underused THPs")
Thanks, Lance
On 1 Oct 2025, at 21:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
v2:
- add fix, cc stable and put description about the flow of current code
- move deferred_split_folio() into map_anon_folio_pmd()
mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
LGTM.

Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards, Yan, Zi
On 2025/10/2 09:38, Wei Yang wrote:
We add the pmd folio to ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it under memory pressure. The same should apply to a pmd folio installed during a wp page fault.
Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") missed adding it to ds_queue, which means the system may not reclaim enough memory under memory pressure even if the pmd folio is underused.
Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent.
Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Dev Jain <dev.jain@arm.com>
Cc: stable@vger.kernel.org
Nice catch. LGTM.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>