On Aug 17, 2022, at 12:17 AM, Huang, Ying <ying.huang@intel.com> wrote:
Alistair Popple <apopple@nvidia.com> writes:
Peter Xu <peterx@redhat.com> writes:
On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
Peter Xu <peterx@redhat.com> writes:
On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			bool anon_exclusive;
>  			pte_t swp_pte;
>
> +			flush_cache_page(vma, addr, pte_pfn(*ptep));
> +			pte = ptep_clear_flush(vma, addr, ptep);
Although I think it's possible to batch the TLB flushing just before unlocking the PTL, the current code looks correct.
If we're unconditionally doing ptep_clear_flush(), does it mean we should probably also drop the "unmapped" counter and the final flush_tlb_range(), since they'll be redundant?
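For reference, iiuc the code that would become redundant is the tail of migrate_vma_collect_pmd() (quoting from memory, so please double-check against the tree):

	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);

	/* Only flush the TLB if we actually modified any entries */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);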
This patch does that, unless I missed something?
Yes it does. Somehow I didn't read into the real v2 patch, sorry!
If that needs to be dropped, it indeed looks better to me to still keep the batching, but just move it earlier (before the unlock, iiuc, so it'll be safe); then we can keep using ptep_get_and_clear() afaiu, but keep "pte" updated.
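To be concrete, here's a minimal (untested) sketch of what I mean; the helpers are all existing ones, only the placement of the flush changes:

	/*
	 * Clear PTEs with ptep_get_and_clear() in the loop, then do a
	 * single batched flush while the PTL is still held, so the
	 * flush can't race with a concurrent zap or fault.
	 */
	for (; addr < end; addr += PAGE_SIZE, ptep++) {
		...
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		pte = ptep_get_and_clear(mm, addr, ptep);
		unmapped++;
		...
	}

	/* Batched flush before the unlock, not after it */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);
	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);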
I think we would also need to check should_defer_flush(). Looking at try_to_unmap_one(), there is this comment:
	if (should_defer_flush(mm, flags) && !anon_exclusive) {
		/*
		 * We clear the PTE but do not flush so potentially
		 * a remote CPU could still be writing to the folio.
		 * If the entry was previously clean then the
		 * architecture must guarantee that a clear->dirty
		 * transition on a cached TLB entry is written through
		 * and traps if the PTE is unmapped.
		 */
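For completeness, the code that follows that comment (quoting mm/rmap.c from memory, so worth double-checking against the tree) is:

		pteval = ptep_get_and_clear(mm, address, pvmw.pte);

		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}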
And as I understand it we'd need the same guarantee here. Given try_to_migrate_one() doesn't do batched TLB flushes either, I'd rather keep the code as consistent as possible between migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at introducing batched TLB flushing for both in some future patch series.
should_defer_flush() is TTU-specific code?
I'm not sure, but I think we need the same guarantee here as mentioned in the comment; otherwise we could miss a subsequent CPU write that dirties the PTE after we have cleared it but before the TLB flush.
My assumption was should_defer_flush() would ensure we have that guarantee from the architecture, but maybe there are alternate/better ways of enforcing that?
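For reference, iirc should_defer_flush() in mm/rmap.c looks like this when CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH is selected (and is a stub returning false otherwise); note that it only checks the TTU flag and whether remote CPUs would need an IPI, so the clear->dirty write-through guarantee is implied by the architecture selecting the config rather than being checked here:

	static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
	{
		bool should_defer = false;

		if (!(flags & TTU_BATCH_FLUSH))
			return false;

		/* If remote CPUs need to be flushed then defer batch the flush */
		if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
			should_defer = true;
		put_cpu();

		return should_defer;
	}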
IIUC the caller sets TTU_BATCH_FLUSH to indicate that the TLB flush can be omitted, since the caller will be responsible for doing it. In migrate_vma_collect_pmd(), iiuc, we don't need that hint because the TLB will be flushed within the same function, just after the loop that modifies the PTEs. Also, it'll be done with the pgtable lock held.
Right, but the pgtable lock doesn't protect against HW PTE changes such as setting the dirty bit, so we need to ensure the HW does the right thing here, and I don't know if all HW does.
This sounds sensible. But I took a look at zap_pte_range(), and it appears that the implementation requires the PTE dirty bit to be write-through. Did I miss something?
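The code I'm looking at is where zap_pte_range() clears the PTE, defers the flush via the mmu_gather, and then trusts the dirty bit of the value it read back (excerpted from mm/memory.c, trimmed):

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	tlb_remove_tlb_entry(tlb, pte, addr);
	...
	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
		...
	}

If the hardware could set the dirty bit in a cached TLB entry without writing it through to the PTE, a write racing with the deferred flush could be lost here too.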
Hi, Nadav,

Can you help?
Sorry for joining the discussion late. I read most of this thread and I hope I understand what you are asking me. So at the risk of rehashing or repeating what you already know, here are my 2 cents. Feel free to ask me again if I did not understand your questions:
1. ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH is currently x86-specific. There is a recent patch that wants to use it for arm64 as well [1]. The assumption that Alistair cited from the code (regarding should_defer_flush()) might not be applicable to certain architectures (although it most likely is). I tried to encapsulate the logic of whether an unflushed RO entry can become dirty in an arch-specific function [2].
2. Having said all of that, using the logic of "flush if there are pending TLB flushes for this mm", as UNMAP_TLB_FLUSH does, makes sense IMHO (although I would have considered doing it at the finer granularity of a VMA/page-table, as I proposed before and got a somewhat lukewarm response [3]).
3. There is no question that flushing after dropping the ptl is wrong. But reading the thread, I think that you only focus on whether a dirty indication might get lost. The problem, I think, is bigger, as it might also cause correctness problems after concurrently removing mappings. What happens if, for a clean PTE, we get something like:
	CPU0				CPU1
	----				----
	migrate_vma_collect_pmd()
	[ defer flush, release ptl ]
					madvise(MADV_DONTNEED)
					-> zap_pte_range()
					[ PTE not present;
					  mmu_gather not updated ]
	[ no flush; stale PTE in TLB ]

	[ page is still accessible ]
[ This might also apply to munmap(); I usually use MADV_DONTNEED as the example since it does not take mmap_write_lock(). ]
4. Having multiple TLB flushing infrastructures makes all of these discussions very complicated and the code unmaintainable. On every occasion (including this one) I need to convince myself whether calls to flush_tlb_batched_pending() and tlb_flush_pending() are needed or not.
What I would like to have [3] is a single infrastructure that gets a "ticket" (the generation when the batching started) plus the old and new PTE, and checks whether a TLB flush is needed based on the arch behavior and the current TLB generation. If needed, it would update the "ticket" to the new generation. Andy wanted a ring for pending TLB flushes, but I think it is overkill, with more overhead and complexity than needed.
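Purely to illustrate the idea, something like the following; every identifier here is invented for this sketch and matches no posted patch:

	/* Hypothetical "ticket" for a batched flush */
	struct flush_ticket {
		u64 gen;	/* mm TLB generation when batching started */
	};

	/*
	 * Hypothetical: called for each PTE change; returns true if an
	 * immediate flush is required, false if it may be deferred to
	 * the batched flush. Bumps the ticket when a flush becomes owed.
	 */
	static bool pte_change_needs_flush(struct mm_struct *mm,
					   struct flush_ticket *ticket,
					   pte_t old_pte, pte_t new_pte)
	{
		/* Arch-specific: may old_pte -> new_pte be left unflushed? */
		if (!arch_pte_change_needs_flush(old_pte, new_pte))
			return false;

		/* A flush newer than our ticket already covered this change */
		if (mm_flushed_tlb_gen(mm) > ticket->gen)
			return false;

		ticket->gen = mm_current_tlb_gen(mm);
		return true;
	}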
But the current situation, in which every TLB flush is the basis for long discussions and prone to bugs, is untenable.
I hope this helps. Let me know if you want me to revive the patch-set, or if you have other feedback.
[1] https://lore.kernel.org/all/20220711034615.482895-5-21cnbao@gmail.com/
[2] https://lore.kernel.org/all/20220718120212.3180-13-namit@vmware.com/
[3] https://lore.kernel.org/all/20210131001132.3368247-16-namit@vmware.com/