On 6/2/23 1:11 AM, Peter Xu wrote:
On Thu, Jun 01, 2023 at 01:16:14PM +0500, Muhammad Usama Anjum wrote:
On 6/1/23 2:46 AM, Peter Xu wrote:
Muhammad,
Sorry, I can probably only review the non-interface part, and leave the interface/buffer-handling review for others and the real potential users of it.
Thank you so much for the review. I think we should mostly be okay with the interface, as everybody has been making suggestions on it over the past revisions.
On Thu, May 25, 2023 at 01:55:14PM +0500, Muhammad Usama Anjum wrote:
+static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+					  unsigned long addr, pte_t *ptep,
+					  pte_t ptent)
+{
+	pte_t old_pte;
+
+	if (!huge_pte_none(ptent)) {
+		old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
+		ptent = huge_pte_mkuffd_wp(old_pte);
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, ptent);
huge_ptep_modify_prot_start()?
Sorry, I didn't realize that huge_ptep_modify_prot_start() is different from its pte version.
Here I meant huge_ptep_modify_prot_commit()..
I'll update.
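Something like this, I suppose (just an untested sketch of the corrected pairing, committing with the hugetlb helper that matches the hugetlb start helper):

	if (!huge_pte_none(ptent)) {
		old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
		ptent = huge_pte_mkuffd_wp(old_pte);
		/* commit with the hugetlb helper, not the pte one */
		huge_ptep_modify_prot_commit(vma, addr, ptep, old_pte, ptent);
	}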
The other thing is, what if it's a pte marker already? What if it's a hugetlb migration entry? Please check hugetlb_change_protection().
I've updated it in a better way. Please let me know what you think of the following:
static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
					 unsigned long addr, pte_t *ptep,
					 pte_t ptent)
{
	if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent))
		return;

	if (is_hugetlb_entry_migration(ptent))
		set_huge_pte_at(vma->vm_mm, addr, ptep,
				pte_swp_mkuffd_wp(ptent));
	else if (!huge_pte_none(ptent))
		huge_ptep_modify_prot_commit(vma, addr, ptep, ptent,
					     huge_pte_mkuffd_wp(ptent));
	else
		set_huge_pte_at(vma->vm_mm, addr, ptep,
				make_pte_marker(PTE_MARKER_UFFD_WP));
}
The is_pte_marker() check can be extended to double check the pte_marker_uffd_wp() bit, but it shouldn't matter a lot: besides the uffd-wp bit, we currently only support the swapin error marker, which should SIGBUS when accessed, so there's no point in tracking it anyway.
Yeah, we are good with what we have: even if more bits are supported in pte markers, this function is only reached when UNPOPULATED + async WP are enabled, so no other bit would be set on the marker.
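For reference, spelling that double check out would look roughly like this (a sketch only, using the existing pte_marker_uffd_wp() helper; behavior is unchanged either way, since the only other marker today is the swapin error one):

	if (is_hugetlb_entry_hwpoisoned(ptent))
		return;
	if (is_pte_marker(ptent)) {
		/* Already wr-protected: nothing to do. */
		if (pte_marker_uffd_wp(ptent))
			return;
		/* Swapin error marker: SIGBUS on access, no point tracking. */
		return;
	}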
As we always set UNPOPULATED, markers are always set on none ptes initially. But is it possible that a none pte becomes present, then swapped, and finally none again? So I'll make the following addition to make_uffd_wp_pte():
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1800,6 +1800,9 @@ static inline void make_uffd_wp_pte(struct vm_area_struct *vma,
 	} else if (is_swap_pte(ptent)) {
 		ptent = pte_swp_mkuffd_wp(ptent);
 		set_pte_at(vma->vm_mm, addr, pte, ptent);
+	} else {
+		set_pte_at(vma->vm_mm, addr, pte,
+			   make_pte_marker(PTE_MARKER_UFFD_WP));
 	}
Makes sense, you can leverage userfaultfd_wp_use_markers() here, and you should probably keep the protocol (only set the marker when WP_UNPOPULATED for anon).
This function is only reachable when UNPOPULATED + async WP are set, so we don't need to use userfaultfd_wp_use_markers().
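For completeness, the guarded form Peter suggests would look roughly like this (a sketch; redundant here given the reachability argument above):

	} else if (userfaultfd_wp_use_markers(vma)) {
		/* Only install markers when the WP_UNPOPULATED protocol applies. */
		set_pte_at(vma->vm_mm, addr, pte,
			   make_pte_marker(PTE_MARKER_UFFD_WP));
	}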
+	} else {
+		set_huge_pte_at(vma->vm_mm, addr, ptep,
+				make_pte_marker(PTE_MARKER_UFFD_WP));
+	}
+}
+#endif
[...]
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	pte_t *pte, *orig_pte;
+	spinlock_t *ptl;
+	bool is_written;
+	int ret = 0;
+
+	arch_enter_lazy_mmu_mode();
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+
+		/*
+		 * Break the huge page into small pages if the WP operation
+		 * needs to be performed on only a portion of the huge page.
+		 */
+		if (is_written && IS_PM_SCAN_WP(p->flags) &&
+		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
+			spin_unlock(ptl);
+
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		if (IS_PM_SCAN_GET(p->flags))
+			ret = pagemap_scan_output(is_written, vma->vm_file,
+						  pmd_present(*pmd),
+						  is_swap_pmd(*pmd),
+						  p, start, n_pages);
+
+		if (ret >= 0 && is_written && IS_PM_SCAN_WP(p->flags))
+			make_uffd_wp_pmd(vma, addr, pmd);
+
+		if (IS_PM_SCAN_WP(p->flags))
+			flush_tlb_range(vma, start, end);
+
+		spin_unlock(ptl);
+
+		arch_leave_lazy_mmu_mode();
+		return ret;
+	}
+
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd)) {
+		arch_leave_lazy_mmu_mode();
+		return 0;
I'm not sure whether this is right.. Shouldn't you return -EAGAIN and let the user retry? Returning 0 means you'll move on to the next pmd afaict, ignoring this one.
This has come up before. We are just replicating pagemap_pmd_range() here, as we are doing almost the same thing through the ioctl. It doesn't return any error in this case and just skips the pmd, so we do the same.
Hmm, is it a bug for pagemap? pagemapread.buffer should be linear with the address range being scanned, to me. If it skips some unstable pmd without filling in anything, it seems everything later will be shifted by PMD_SIZE.. I have a feeling it should set walk->action = ACTION_AGAIN before returning.
I don't think this is a bug if this is how it was implemented in the first place. In task_mmu.c we can find several examples of the same pattern, where no error is returned when pmd_trans_unstable() succeeds.
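For reference, the retry variant Peter describes would look something like this (an untested sketch using the page-walk API's ACTION_AGAIN):

process_smaller_pages:
	if (pmd_trans_unstable(pmd)) {
		arch_leave_lazy_mmu_mode();
		/* The pmd is being split or changed under us: redo this range. */
		walk->action = ACTION_AGAIN;
		return 0;
	}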