Alistair Popple apopple@nvidia.com writes:
Peter Xu peterx@redhat.com writes:
On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
@@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			bool anon_exclusive;
 			pte_t swp_pte;
flush_cache_page(vma, addr, pte_pfn(*ptep));
pte = ptep_clear_flush(vma, addr, ptep);
Although I think it's possible to batch the TLB flushing just before unlocking the PTL, the current code looks correct.
If we unconditionally use ptep_clear_flush(), does that mean we should drop "unmapped" and the final flush_tlb_range() as well, since they would be redundant?
This patch does that, unless I missed something?
Yes it does. Somehow I hadn't read the actual v2 patch, sorry!
If that needs to be dropped, it still looks better to me to keep the batching but move the flush earlier (before the unlock, which iiuc makes it safe); then we can keep using ptep_get_and_clear() afaiu, as long as "pte" is kept updated.
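To make the two options concrete, here is a toy userspace model (not kernel code) that contrasts flushing after every cleared entry with clearing all entries first and issuing one flush before the lock is dropped. Every name in it (`toy_pt`, `toy_flush`, `collect_flush_each`, `collect_batched`) is hypothetical and only stands in for the real ptep_clear_flush(), ptep_get_and_clear() and flush_tlb_range(); it models ordering and flush counts, nothing about real TLB behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPTES 8

/* Toy stand-in for one page table page guarded by a lock (the "PTL"). */
struct toy_pt {
	unsigned long pte[TOY_NPTES];	/* 0 means "not present" */
	bool locked;
	int flushes;			/* number of flush operations issued */
	bool flushed_under_lock;	/* false if any flush ran unlocked */
};

static void toy_init(struct toy_pt *pt)
{
	for (size_t i = 0; i < TOY_NPTES; i++)
		pt->pte[i] = 0x1000UL * (i + 1);
	pt->locked = false;
	pt->flushes = 0;
	pt->flushed_under_lock = true;
}

/* Record one flush and whether the lock was held at that point. */
static void toy_flush(struct toy_pt *pt)
{
	pt->flushes++;
	if (!pt->locked)
		pt->flushed_under_lock = false;
}

/* Option 1: clear and flush each entry, in the style of ptep_clear_flush(). */
static void collect_flush_each(struct toy_pt *pt)
{
	pt->locked = true;
	for (size_t i = 0; i < TOY_NPTES; i++) {
		pt->pte[i] = 0;
		toy_flush(pt);
	}
	pt->locked = false;
}

/* Option 2: clear every entry, then flush once before dropping the lock. */
static void collect_batched(struct toy_pt *pt)
{
	bool cleared = false;

	pt->locked = true;
	for (size_t i = 0; i < TOY_NPTES; i++) {
		if (pt->pte[i]) {
			pt->pte[i] = 0;
			cleared = true;
		}
	}
	if (cleared)
		toy_flush(pt);	/* single flush, still under the lock */
	pt->locked = false;
}
```

In this model both variants flush with the lock held; the batched one issues a single flush instead of one per entry. What the model deliberately leaves out is the window between clearing an entry and the later flush, which is what the rest of the thread is about.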
I think we would also need to check should_defer_flush(). Looking at try_to_unmap_one() there is this comment:
	if (should_defer_flush(mm, flags) && !anon_exclusive) {
		/*
		 * We clear the PTE but do not flush so potentially
		 * a remote CPU could still be writing to the folio.
		 * If the entry was previously clean then the
		 * architecture must guarantee that a clear->dirty
		 * transition on a cached TLB entry is written through
		 * and traps if the PTE is unmapped.
		 */

And as I understand it we'd need the same guarantee here. Given try_to_migrate_one() doesn't do batched TLB flushes either, I'd rather keep the code as consistent as possible between migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at introducing batched TLB flushing for both in some future patch series.
should_defer_flush() is TTU-specific code?
I'm not sure, but I think we need the same guarantee here as the one mentioned in the comment; otherwise we wouldn't see a subsequent CPU write that dirties the PTE after we have cleared it but before the TLB flush.
My assumption was should_defer_flush() would ensure we have that guarantee from the architecture, but maybe there are alternate/better ways of enforcing that?
IIUC the caller sets TTU_BATCH_FLUSH to indicate that the TLB flush can be omitted because the caller will be responsible for doing it. In migrate_vma_collect_pmd() iiuc we don't need that hint, because the flush happens within the same function, just after the loop that modifies the ptes. It will also happen with the pgtable lock held.
Right, but the pgtable lock doesn't protect against HW PTE changes such as setting the dirty bit, so we need to ensure the HW does the right thing here, and I don't know if all HW does.
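The concern can be illustrated with a small userspace model of one in-memory PTE and the copy of it that a remote CPU holds in its TLB. This is only a sketch under stated assumptions: the names (`toy_cpu`, `toy_get_and_clear`, `toy_remote_write`) and the `dirty_writethrough` switch are hypothetical, and the "no guarantee" branch is an assumed failure mode for illustration, not a description of any particular architecture.

```c
#include <stdbool.h>

#define TOY_PRESENT 0x1u
#define TOY_DIRTY   0x2u

/* One in-memory PTE plus the copy a remote CPU has cached in its TLB. */
struct toy_cpu {
	unsigned int pte;		/* page table entry in memory */
	unsigned int tlb;		/* cached translation on the remote CPU */
	bool tlb_valid;
	bool dirty_writethrough;	/* assumed architectural guarantee */
};

/* Migration side: read and clear the PTE, returning the old value. */
static unsigned int toy_get_and_clear(struct toy_cpu *c)
{
	unsigned int old = c->pte;

	c->pte = 0;
	return old;
}

/* Migration side: invalidate the remote CPU's cached translation. */
static void toy_flush_tlb(struct toy_cpu *c)
{
	c->tlb_valid = false;
}

/*
 * Remote CPU side: attempt a store. Returns true if the store reached
 * the page, false if it faulted.
 */
static bool toy_remote_write(struct toy_cpu *c)
{
	if (!c->tlb_valid) {
		/* TLB miss: walk the page table. */
		if (!(c->pte & TOY_PRESENT))
			return false;		/* page fault */
		c->tlb = c->pte;
		c->tlb_valid = true;
	}
	if (!(c->tlb & TOY_DIRTY)) {
		/* Clean -> dirty transition on a cached entry. */
		if (c->dirty_writethrough) {
			if (!(c->pte & TOY_PRESENT)) {
				c->tlb_valid = false;
				return false;	/* traps: PTE was unmapped */
			}
			c->pte |= TOY_DIRTY;
		}
		c->tlb |= TOY_DIRTY;
	}
	return true;
}
```

With `dirty_writethrough` set, a store issued after toy_get_and_clear() but before toy_flush_tlb() faults. With it cleared, the store succeeds while neither the value returned by toy_get_and_clear() nor the in-memory PTE shows the dirty bit, which is the lost update being discussed. Flushing immediately after the clear makes the store fault in both cases; holding a lock around the sequence changes nothing in the model, since the remote store never takes it.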
This sounds sensible. But I took a look at zap_pte_range(), and it appears that the implementation requires the PTE dirty bit to be write-through. Am I missing something?
Hi Nadav, can you help?
Best Regards, Huang, Ying
[snip]