On Sat, Nov 19, 2022 at 10:20:20PM +0800, hev wrote:
Hi, Peter,
Hev,
On Fri, Nov 18, 2022 at 2:53 AM Peter Xu peterx@redhat.com wrote:
On Thu, Nov 17, 2022 at 10:12:07AM -0500, Peter Xu wrote:
Hi, Huacai,
On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote:
Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry over dirty bit when thp splits on pmd").
The reason is: when fork(), parent process use pmd_wrprotect() to clear huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW);
Is it safe to drop dirty bit when wr-protect? It means the mm can reclaim the page directly assuming the page contains rubbish.
Consider after fork() and memory pressure kicks the kswapd, I don't see anything stops the kswapd from recycling the pages and lose the data in both processes.
Feel free to ignore this question.. I think I got an answer from Hev (and I then got a follow up question):
https://lore.kernel.org/all/Y3Z9Zf0jARMOkFBq@x1n/
then pte_mkdirty() set _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages; once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW machanism fails; and at last memory corruption occurred between parent and child processes.
So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_ mkdirty().
Cc: stable@vger.kernel.org Cc: Peter Xu peterx@redhat.com Signed-off-by: Huacai Chen chenhuacai@loongson.cn
Note: CC sparc maillist because they have similar issues.
I also had a look on sparc64, it seems to not do the same as loongarch here (not removing dirty in wr-protect):
static inline pmd_t pmd_wrprotect(pmd_t pmd) { pte_t pte = __pte(pmd_val(pmd));
pte = pte_wrprotect(pte); return __pmd(pte_val(pte));
}
static inline pte_t pte_wrprotect(pte_t pte) { unsigned long val = pte_val(pte), tmp;
__asm__ __volatile__( "\n661: andn %0, %3, %0\n" " nop\n" "\n662: nop\n" " nop\n" " .section .sun4v_2insn_patch, \"ax\"\n" " .word 661b\n" " sethi %%uhi(%4), %1\n" " sllx %1, 32, %1\n" " .word 662b\n" " or %1, %%lo(%4), %1\n" " andn %0, %1, %0\n" " .previous\n" : "=r" (val), "=r" (tmp) : "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U), "i" (_PAGE_WRITE_4V | _PAGE_W_4V)); return __pte(val);
}
(Same here; I just overlooked what does _PAGE_W_4U meant..)
arch/loongarch/include/asm/pgtable.h | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h index 946704bee599..debbe116f105 100644 --- a/arch/loongarch/include/asm/pgtable.h +++ b/arch/loongarch/include/asm/pgtable.h @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
static inline pte_t pte_mkdirty(pte_t pte) {
- pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
- pte_val(pte) |= _PAGE_MODIFIED;
- if (pte_val(pte) & _PAGE_WRITE)
pte_val(pte) |= _PAGE_DIRTY;
I'm not sure whether mm has rule to always set write bit then set dirty bit, need to be careful here because the outcome may differ when use:
pte_mkdirty(pte_mkwrite(pte)) (expected)
VS:
pte_mkwrite(pte_mkdirty(pte)) (dirty not set)
I had a feeling I miss some arch-specific details here on why loongarch needs such implementation, but I can't quickly tell.
After a closer look I think it's fine for loongarch as pte_mkwrite will also set the dirty bit unconditionally, so at least the two ways will still generate the same pte (DIRTY+MODIFIED+WRITE).
But this whole thing is still confusing to me. It'll still be great if anyone can help explain why the _DIRTY cannot be set only in pte_mkwrite() if that's the solo place in charge of "whether the pte is writable".
The other follow up question is: how do we mark "this pte contains valid data" (the common definition of "dirty bit"), while "this pte is not writable" on loongarch?
It can happen when we're installing a page with non-zero data meanwhile wr-protected. That's actually a valid case for userfaultfd wr-protect mode where user specified UFFDIO_COPY ioctl with flag UFFDIO_COPY_MODE_WP, where we'll install a non-zero page from user buffer but don't grant write bit.
From code-wise, I think it can be done currently with this on loongarch:
pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))
Where pte_wrprotect(pte_mkwrite(pte)) is not a no-op but applying MODIFIED.
We would like to note that on LoongArch (for misunderstanding naming):
- _PAGE_DIRTY meaning hardware writable.
- _PAGE_WRITE meaning software writable.
- _PAGE_MODIFIED meaning software dirty, this page contains updated valid data.
PTE APIs:
- pte_mkwrite: Allow to write, only needs set _PAGE_WRITE.
- pte_mkdirty: Mark as dirty, only needs set _PAGE_MODIFIED.
- pte_dirty: Test is dirty, only test _PAGE_MODIFIED.
- pte_wrprotect: Clear both writable, force to raise exception to
handle_mm_fault.
If a pte is only set _PAGE_WRITE without _PAGE_DIRTY by pte_mkwrite, then a write memory access will cause mmu exception, and the (_PAGE_DIRTY|_PAGE_MODIFIED) will be set in this exception handler. I think the _PAGE_DIRTY is also possible to set in pte_mkwrite for speedup, then _PAGE_MODIFIED must be set at the same time. To avoid the page data being modified but not detected by pte_dirty. (Current code may needs to fix
pte_mkdirty mark pte as dirty is the main function, It can also make pte writeable by hardware(_PAGE_DIRTY) for speedup (too) if and only if the pte is writable(_PAGE_WRITE). (mkdirty sets _PAGE_DIRTY unconditionally is the root cause of the huge page COW issue.
Yes, and that's why Huacai's patch can fix this issue for Loongarch, but sparc64 still suffers from it so far.
For write-protection, pte_wrprotect will clear both writable(software and hardware) in pte to force a MMU exception to handle_mm_fault.
So yeah, the pte marked as dirty(_PAGE_MODIFIED) and without any writable in the following code:
pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))
Thanks for the further explanation, Hev.
I think so far I keep the questioning on whether it's a good optimization to apply _DIRTY in pte_mkdirty here.
Since we have pte_mkwrite() API after all, even if there's an optimization IMHO it should be done by the kernel generic code already, I don't immediately see why it needs to be arch-specific. IOW, I'm not sure whether such optimization for loongarch will bring any real performance benefit?
The thing is, generic mm code should already have called pte_mkwrite() along with pte_mkdirty() when proper, so _DIRTY will be properly applied even if pte_mkdirty() only applies _MODIFIED in loongarch code without extra mmu faults - if you see the code base that's really the major cases, except a few very special conditions where we want to only set _MODIFIED but may want to keep the pte read-only, but in that case it's never wise to set _DIRTY in pte_mkdirty anyway.
I just don't know whether there's anything else I could have overlooked, so maybe removing "set _DIRTY" in pte_mkdirty() will regress in some other way.
For now, I think I'll go ahead and try to propose another trivial fix for the pmd split generic code so we'll have the pte_mkdirty back (I think I need to reorder pte_wrprotect() so that it needs to appear after pte_mkdirty()) but hopefully also work for sparc64; loongarch will also benefit if before this patch lands.
(Sorry to be off-topic on this thread)