When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence:
CPU0                                    CPU1
----                                    ----
migrate_vma_collect_pmd()
pte_unmap_unlock()
                                        madvise(MADV_DONTNEED)
                                        -> zap_pte_range()
                                        pte_offset_map_lock()
                                        [ PTE not present, TLB not flushed ]
                                        pte_unmap_unlock()
                                        [ page is still accessible via stale TLB ]
flush_tlb_range()
In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL.
Signed-off-by: Alistair Popple apopple@nvidia.com
Reported-by: Nadav Amit nadav.amit@gmail.com
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org
---
Changes for v3:
- New for v3
---
 mm/migrate_device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..6a5ef9f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
 
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
 	return 0;
 }
base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that installs migration entries directly if it can lock the migrating page. When removing a dirty pte the dirty bit is supposed to be carried over to the underlying page to prevent it being lost.
Currently migrate_vma_*() can only be used for private anonymous mappings. That means loss of the dirty bit usually doesn't result in data loss because these pages are typically not file-backed. However pages may be backed by swap storage which can result in data loss if an attempt is made to migrate a dirty page that doesn't yet have the PageDirty flag set.
In this case migration will fail due to unexpected references but the dirty pte bit will be lost. If the page is subsequently reclaimed data won't be written back to swap storage as it is considered uptodate, resulting in data loss if the page is subsequently accessed.
Prevent this by copying the dirty bit to the page when removing the pte to match what try_to_migrate_one() does.
Signed-off-by: Alistair Popple apopple@nvidia.com
Acked-by: Peter Xu peterx@redhat.com
Reported-by: Huang Ying ying.huang@intel.com
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org
---
Changes for v3:
- Defer TLB flushing
- Split a TLB flushing fix into a separate change.
Changes for v2:
- Fixed up Reported-by tag.
- Added Peter's Acked-by.
- Atomically read and clear the pte to prevent the dirty bit getting set after reading it.
- Added fixes tag
---
 mm/migrate_device.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6a5ef9f..51d9afa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 			if (anon_exclusive) {
 				flush_cache_page(vma, addr, pte_pfn(*ptep));
-				ptep_clear_flush(vma, addr, ptep);
+				pte = ptep_clear_flush(vma, addr, ptep);
 
 				if (page_try_share_anon_rmap(page)) {
 					set_pte_at(mm, addr, ptep, pte);
@@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 					goto next;
 				}
 			} else {
-				ptep_get_and_clear(mm, addr, ptep);
+				pte = ptep_get_and_clear(mm, addr, ptep);
 			}
 
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(
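For reference, the try_to_migrate_one() behaviour that the above change mirrors looks roughly like this (abridged sketch of mm/rmap.c from memory; surrounding context is omitted and details may differ):

	pteval = ptep_clear_flush(vma, address, pvmw.pte);

	/* Set the dirty flag on the folio now the pte is gone. */
	if (pte_dirty(pteval))
		folio_mark_dirty(folio);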
On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that installs migration entries directly if it can lock the migrating page. When removing a dirty pte the dirty bit is supposed to be carried over to the underlying page to prevent it being lost.
Currently migrate_vma_*() can only be used for private anonymous mappings. That means loss of the dirty bit usually doesn't result in data loss because these pages are typically not file-backed. However pages may be backed by swap storage which can result in data loss if an attempt is made to migrate a dirty page that doesn't yet have the PageDirty flag set.
In this case migration will fail due to unexpected references but the dirty pte bit will be lost. If the page is subsequently reclaimed data won't be written back to swap storage as it is considered uptodate, resulting in data loss if the page is subsequently accessed.
Prevent this by copying the dirty bit to the page when removing the pte to match what try_to_migrate_one() does.
Signed-off-by: Alistair Popple apopple@nvidia.com
Acked-by: Peter Xu peterx@redhat.com
Reported-by: Huang Ying ying.huang@intel.com
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org
Changes for v3:
- Defer TLB flushing
- Split a TLB flushing fix into a separate change.
Changes for v2:
- Fixed up Reported-by tag.
- Added Peter's Acked-by.
- Atomically read and clear the pte to prevent the dirty bit getting set after reading it.
- Added fixes tag
 mm/migrate_device.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6a5ef9f..51d9afa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 			if (anon_exclusive) {
 				flush_cache_page(vma, addr, pte_pfn(*ptep));
-				ptep_clear_flush(vma, addr, ptep);
+				pte = ptep_clear_flush(vma, addr, ptep);
 
 				if (page_try_share_anon_rmap(page)) {
 					set_pte_at(mm, addr, ptep, pte);
@@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 					goto next;
 				}
 			} else {
-				ptep_get_and_clear(mm, addr, ptep);
+				pte = ptep_get_and_clear(mm, addr, ptep);
 			}
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() were moved above the condition check so they're called unconditionally. Could you explain the rationale for changing it back (since I think v2 was the correct approach)?
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(
-- git-series 0.9.1
Peter Xu peterx@redhat.com writes:
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)?
Mainly because I agree with your original comments that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there are any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (e.g. zap_pte_range).
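For context, the zap_pte_range() case mentioned above likewise clears the pte without an immediate per-pte TLB flush, defers the flush to the mmu_gather batching, and still carries the dirty bit over to the page. An abridged sketch from memory of mm/memory.c (details may differ):

	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	tlb_remove_tlb_entry(tlb, pte, addr);
	if (pte_dirty(ptent)) {
		force_flush = 1;
		set_page_dirty(page);
	}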
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
Isn't that already the case? Patch 1 moves the TLB flush before the PTL as suggested, patch 2 atomically copies the dirty bit without changing any TLB flushing.
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)?
Mainly because I agree with your original comments, that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there is any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (eg. zap_pte_range).
Yes, TLB batching was kept, thanks for doing it that way. Though if we only apply patch 1 we'll have both ptep_clear_flush() and the batched flush, which seems redundant.
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
Isn't that already the case? Patch 1 moves the TLB flush before the PTL as suggested, patch 2 atomically copies the dirty bit without changing any TLB flushing.
IMHO it's cleaner to have patch 1 fix batch flush, replace ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
No strong opinions on the layout, but I still think we should drop the redundant ptep_clear_flush() above, meanwhile add the flush_cache_page() properly for !exclusive case too.
Thanks,
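A rough sketch of what that suggestion could look like in migrate_vma_collect_pmd() (hypothetical restructuring for illustration only, not a posted patch; the existing error path is abbreviated):

	/* Flush the cache and clear the pte for both cases; the TLB is
	 * covered by the batched flush_tlb_range() at the end. */
	flush_cache_page(vma, addr, pte_pfn(*ptep));
	pte = ptep_get_and_clear(mm, addr, ptep);

	if (anon_exclusive && page_try_share_anon_rmap(page)) {
		/* Restore the pte and skip this page, as in the
		 * existing error path. */
		set_pte_at(mm, addr, ptep, pte);
		goto next;
	}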
Peter Xu peterx@redhat.com writes:
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)?
Mainly because I agree with your original comments, that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there is any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (eg. zap_pte_range).
Yes tlb batching was kept, thanks for doing that way. Though if only apply patch 1 we'll have both ptep_clear_flush() and batched flush which seems to be redundant.
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
Isn't that already the case? Patch 1 moves the TLB flush before the PTL as suggested, patch 2 atomically copies the dirty bit without changing any TLB flushing.
IMHO it's cleaner to have patch 1 fix batch flush, replace ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
Which ptep_clear_flush() are you referring to? This one?
	if (anon_exclusive) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		ptep_clear_flush(vma, addr, ptep);
My understanding is that we need to do a flush for anon_exclusive.
No strong opinions on the layout, but I still think we should drop the redundant ptep_clear_flush() above, meanwhile add the flush_cache_page() properly for !exclusive case too.
Good point, we need flush_cache_page() for !exclusive. Will add.
Thanks,
Alistair Popple apopple@nvidia.com writes:
Peter Xu peterx@redhat.com writes:
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)?
Mainly because I agree with your original comments, that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there is any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (eg. zap_pte_range).
Yes tlb batching was kept, thanks for doing that way. Though if only apply patch 1 we'll have both ptep_clear_flush() and batched flush which seems to be redundant.
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
Isn't that already the case? Patch 1 moves the TLB flush before the PTL as suggested, patch 2 atomically copies the dirty bit without changing any TLB flushing.
IMHO it's cleaner to have patch 1 fix batch flush, replace ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
Which ptep_clear_flush() are you referring to? This one?
	if (anon_exclusive) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		ptep_clear_flush(vma, addr, ptep);
My understanding is that we need to do a flush for anon_exclusive.
No strong opinions on the layout, but I still think we should drop the redundant ptep_clear_flush() above, meanwhile add the flush_cache_page() properly for !exclusive case too.
Good point, we need flush_cache_page() for !exclusive. Will add.
That can be in another patch. For the patch itself, it looks good to me. Feel free to add,
Reviewed-by: "Huang, Ying" ying.huang@intel.com
Best Regards, Huang, Ying
On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)?
Mainly because I agree with your original comments, that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there is any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (eg. zap_pte_range).
Yes tlb batching was kept, thanks for doing that way. Though if only apply patch 1 we'll have both ptep_clear_flush() and batched flush which seems to be redundant.
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
Isn't that already the case? Patch 1 moves the TLB flush before the PTL as suggested, patch 2 atomically copies the dirty bit without changing any TLB flushing.
IMHO it's cleaner to have patch 1 fix batch flush, replace ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
Which ptep_clear_flush() are you referring to? This one?
	if (anon_exclusive) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		ptep_clear_flush(vma, addr, ptep);
Correct.
My understanding is that we need to do a flush for anon_exclusive.
To me anon exclusive only shows that this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on TLB flushes. Did I perhaps miss something?
Thanks,
On 26.08.22 16:32, Peter Xu wrote:
On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
Peter Xu peterx@redhat.com writes:
I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are moved above the condition check so they're called unconditionally. Could you explain the rational on why it's changed back (since I think v2 was the correct approach)?
Mainly because I agree with your original comments, that it would be better to keep the batching of TLB flushing if possible. After the discussion I don't think there is any issues with HW pte dirty bits here. There are already other cases where HW needs to get that right anyway (eg. zap_pte_range).
Yes tlb batching was kept, thanks for doing that way. Though if only apply patch 1 we'll have both ptep_clear_flush() and batched flush which seems to be redundant.
The other question is if we want to split the patch, would it be better to move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
Isn't that already the case? Patch 1 moves the TLB flush before the PTL as suggested, patch 2 atomically copies the dirty bit without changing any TLB flushing.
IMHO it's cleaner to have patch 1 fix batch flush, replace ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
Which ptep_clear_flush() are you referring to? This one?
	if (anon_exclusive) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		ptep_clear_flush(vma, addr, ptep);
Correct.
My understanding is that we need to do a flush for anon_exclusive.
To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something?
GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush.
include/linux/mm.h:gup_must_unshare() contains documentation.
Without GUP-fast, some things would be significantly easier to handle.
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something?
GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush.
include/linux/mm.h:gup_must_unshare() contains documentation.
Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already guarantee that no other process/thread will see this pte anymore afterwards?
On 26.08.22 17:55, Peter Xu wrote:
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something?
GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush.
include/linux/mm.h:gup_must_unshare() contains documentation.
Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already guarantees that no other process/thread will see this pte anymore afterwards?
You could have a GUP-fast thread that just looked up the PTE and is going to pin the page afterwards, after the ptep_get_and_clear() returned. You'll have to wait until that thread finished.
Another user that relies on this interaction between GUP-fast and TLB flushing is for example mm/ksm.c:write_protect_page()
There is a comment in there explaining the interaction a bit more detailed.
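The write_protect_page() interaction being referred to is roughly the following (abridged sketch from memory of mm/ksm.c; the exact conditions and comments in the source differ):

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    mm_tlb_flush_pending(mm)) {
		pte_t entry;

		flush_cache_page(vma, pvmw.address, page_to_pfn(page));
		/*
		 * GUP-fast takes no locks, so the pte is cleared and the TLB
		 * flushed before the mapcount/refcount comparison below, to
		 * stop a concurrent pin (e.g. O_DIRECT) racing with the check.
		 */
		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
		if (page_mapcount(page) + 1 + PageSwapCache(page) !=
		    page_count(page)) {
			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
			goto out_unlock;
		}
		/* ... proceed to write-protect the pte ... */
	}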
Maybe we'll be able to handle this differently in the future (maybe once this turns out to be an actual performance problem). Unfortunately, mm->write_protect_seq isn't easily usable because we'd have to make sure we're the exclusive writer.
For now, it's not too complicated. For PTEs:

* try_to_migrate_one() already uses ptep_clear_flush().
* try_to_unmap_one() already conditionally used ptep_clear_flush().
* migrate_vma_collect_pmd() was the one case that didn't use it already
  (and I wonder why it's different than try_to_migrate_one()).
On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
On 26.08.22 17:55, Peter Xu wrote:
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something?
GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush.
include/linux/mm.h:gup_must_unshare() contains documentation.
Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already guarantees that no other process/thread will see this pte anymore afterwards?
You could have a GUP-fast thread that just looked up the PTE and is going to pin the page afterwards, after the ptep_get_and_clear() returned. You'll have to wait until that thread finished.
IIUC the early tlb flush won't protect concurrent fast-gup from happening, but I think it's safe because fast-gup will check pte after pinning, so either:
(1) fast-gup runs before ptep_get_and_clear(), then page_try_share_anon_rmap() will fail properly, or,
(2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup will see that either the pte is none or changed, then it'll fail the fast-gup itself.
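The recheck being described here is the one in gup_pte_range(); roughly (abridged sketch from memory of mm/gup.c, helper names vary between kernel versions):

	folio = try_grab_folio(page, 1, flags);
	if (!folio)
		goto pte_unmap;

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		/* The pte changed after the pin was taken: back off. */
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

	if (!pte_write(pte) && gup_must_unshare(flags, page)) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}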
Another user that relies on this interaction between GUP-fast and TLB flushing is for example mm/ksm.c:write_protect_page()
There is a comment in there explaining the interaction a bit more detailed.
Maybe we'll be able to handle this differently in the future (maybe once this turns out to be an actual performance problem). Unfortunately, mm->write_protect_seq isn't easily usable because we'd need have to make sure we're the exclusive writer.
For now, it's not too complicated. For PTEs:
- try_to_migrate_one() already uses ptep_clear_flush().
- try_to_unmap_one() already conditionally used ptep_clear_flush().
- migrate_vma_collect_pmd() was the one case that didn't use it already
(and I wonder why it's different than try_to_migrate_one()).
I'm not sure whether I fully get the point, but one major difference here is that all the rest handle one page, so a TLB flush alongside the pte clear sounds reasonable. Even so, try_to_unmap_one() was modified to use TLB batching, but then I see that anon exclusive made that batching conditional. I also have a question there on whether we can keep using TLB batching even with anon exclusive pages.
In general, I still don't see how a stale TLB could affect anon exclusive pages when racing with fast-gup, because the only side effect of a stale TLB entry is an unwanted page update IIUC; the problem is fast-gup doesn't even use the TLB, AFAICT..
Thanks,
On 26.08.22 23:37, Peter Xu wrote:
On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
On 26.08.22 17:55, Peter Xu wrote:
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
To me anon exclusive only shows this mm exclusively owns this page. I didn't quickly figure out why that requires different handling on tlb flushs. Did I perhaps miss something?
GUP-fast is the magic bit, we have to make sure that we won't see new GUP pins, thus the TLB flush.
include/linux/mm.h:gup_must_unshare() contains documentation.
Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already guarantees that no other process/thread will see this pte anymore afterwards?
You could have a GUP-fast thread that just looked up the PTE and is going to pin the page afterwards, after the ptep_get_and_clear() returned. You'll have to wait until that thread finished.
Good that we're talking about it, very helpful! If that's actually not required -- good.
What I learned how GUP-fast and TLB flushes interact is the following:
GUP-fast disables local interrupts. A TLB flush will send an IPI and wait until it has been processed. This implies that, once the TLB flush succeeded, the interrupt was handled and GUP-fast cannot be running anymore.
BUT, there is the new RCU variant nowadays, and the TLB flush itself should not actually perform such a sync. They merely protect the page tables from getting freed and the THP from getting split IIUC. And you're correct that that wouldn't help.
IIUC the early tlb flush won't protect concurrent fast-gup from happening, but I think it's safe because fast-gup will check pte after pinning, so either:
(1) fast-gup runs before ptep_get_and_clear(), then page_try_share_anon_rmap() will fail properly, or,
(2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup will see that either the pte is none or changed, then it'll fail the fast-gup itself.
I think you're right and I might have managed to confuse myself with the write_protect_page() comment. I placed the gup_must_unshare() check explicitly after the "pte changed" check for this reason. So once the PTE was cleared, GUP-fast would undo any pin performed on a now-stale PTE.
Another user that relies on this interaction between GUP-fast and TLB flushing is for example mm/ksm.c:write_protect_page()
There is a comment in there explaining the interaction a bit more detailed.
Maybe we'll be able to handle this differently in the future (maybe once this turns out to be an actual performance problem). Unfortunately, mm->write_protect_seq isn't easily usable because we'd need have to make sure we're the exclusive writer.
For now, it's not too complicated. For PTEs:
- try_to_migrate_one() already uses ptep_clear_flush().
- try_to_unmap_one() already conditionally used ptep_clear_flush().
- migrate_vma_collect_pmd() was the one case that didn't use it already
(and I wonder why it's different than try_to_migrate_one()).
I'm not sure whether I fully get the point, but here one major difference is all the rest handles one page, so a tlb flush alongside with the pte clear sounds reasonable. Even if so try_to_unmap_one() was modified to use tlb batching, but then I see that anon exclusive made that batching conditional. I also have question there on whether we can keep using the tlb batching even with anon exclusive pages there.
In general, I still don't see how stall tlb could affect anon exclusive pages on racing with fast-gup, because the only side effect of a stall tlb is unwanted page update iiuc, the problem is fast-gup doesn't even use tlb, afaict..
I have the gut feeling that the comment in write_protect_page() is indeed stale, and that clearing PageAnonExclusive doesn't strictly need the TLB flush.
I'll try to refresh my memory if there was any other case that I had to handle over the weekend.
Thanks!
We were not correctly copying PTE dirty bits to pages during migrate_vma_setup() calls. This could potentially lead to data loss, so add a test for this.
Signed-off-by: Alistair Popple apopple@nvidia.com
---
 tools/testing/selftests/vm/hmm-tests.c | 124 ++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 529f53b..70fdb49 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1200,6 +1200,130 @@ TEST_F(hmm, migrate_multiple)
 	}
 }
+static char cgroup[] = "/sys/fs/cgroup/hmm-test-XXXXXX";
+static int write_cgroup_param(char *cgroup_path, char *param, long value)
+{
+	int ret;
+	FILE *f;
+	char *filename;
+
+	if (asprintf(&filename, "%s/%s", cgroup_path, param) < 0)
+		return -1;
+
+	f = fopen(filename, "w");
+	if (!f) {
+		ret = -1;
+		goto out;
+	}
+
+	ret = fprintf(f, "%ld\n", value);
+	if (ret < 0)
+		goto out1;
+
+	ret = 0;
+
+out1:
+	fclose(f);
+out:
+	free(filename);
+
+	return ret;
+}
+
+static int setup_cgroup(void)
+{
+	pid_t pid = getpid();
+	int ret;
+
+	if (!mkdtemp(cgroup))
+		return -1;
+
+	ret = write_cgroup_param(cgroup, "cgroup.procs", pid);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int destroy_cgroup(void)
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = write_cgroup_param("/sys/fs/cgroup/cgroup.procs",
+				"cgroup.proc", pid);
+	if (ret)
+		return ret;
+
+	if (rmdir(cgroup))
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Try and migrate a dirty page that has previously been swapped to disk. This
+ * checks that we don't loose dirty bits.
+ */
+TEST_F(hmm, migrate_dirty_page)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int tmp = 0;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	ASSERT_EQ(setup_cgroup(), 0);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = 0;
+
+	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
+
+	/* Fault pages back in from swap as clean pages */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		tmp += ptr[i];
+
+	/* Dirty the pte */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/*
+	 * Attempt to migrate memory to device, which should fail because
+	 * hopefully some pages are backed by swap storage.
+	 */
+	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
+
+	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
+
+	/* Check we still see the updated data after restoring from swap. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+	destroy_cgroup();
+}
+
 /*
  * Read anonymous memory multiple times.
  */
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.'
Subject: [PATCH v3 3/3] selftests/hmm-tests: Add test for dirty bits
Link: https://lore.kernel.org/stable/8e425a8ec191682fa2036260c575f065eeb56a53.1661...
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
kernel test robot lkp@intel.com writes:
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.'
Subject: [PATCH v3 3/3] selftests/hmm-tests: Add test for dirty bits
Link: https://lore.kernel.org/stable/8e425a8ec191682fa2036260c575f065eeb56a53.1661...
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
Ugh. I didn't mean to Cc stable on this patch, just the first two.
- Alistair
On 24.08.22 05:03, Alistair Popple wrote:
When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence:
CPU0                                    CPU1
----                                    ----
migrate_vma_collect_pmd()
pte_unmap_unlock()
                                        madvise(MADV_DONTNEED)
                                        -> zap_pte_range()
                                        pte_offset_map_lock()
                                        [ PTE not present, TLB not flushed ]
                                        pte_unmap_unlock()
                                        [ page is still accessible via stale TLB ]
flush_tlb_range()
In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL.
Signed-off-by: Alistair Popple apopple@nvidia.com
Reported-by: Nadav Amit nadav.amit@gmail.com
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org
Changes for v3:
- New for v3
 mm/migrate_device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..6a5ef9f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
 
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
 	return 0;
 }
base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
I'm not a TLB-flushing expert, but this matches my understanding (and a TLB flushing Linux documentation I stumbled over some while ago but cannot quickly find).
In the ordinary try_to_migrate_one() path, flushing would happen via ptep_clear_flush() (just like we do for the anon_exclusive case here as well), correct?
David Hildenbrand david@redhat.com writes:
I'm not a TLB-flushing expert, but this matches my understanding (and a TLB flushing Linux documentation I stumbled over some while ago but cannot quickly find).
In the ordinary try_to_migrate_one() path, flushing would happen via ptep_clear_flush() (just like we do for the anon_exclusive case here as well), correct?
Correct.
On 24.08.22 14:26, Alistair Popple wrote:
David Hildenbrand david@redhat.com writes:
I'm not a TLB-flushing expert, but this matches my understanding (and a TLB flushing Linux documentation I stumbled over some while ago but cannot quickly find).
In the ordinary try_to_migrate_one() path, flushing would happen via ptep_clear_flush() (just like we do for the anon_exclusive case here as well), correct?
Correct.
Acked-by: David Hildenbrand david@redhat.com
Alistair Popple apopple@nvidia.com writes:
When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence:
CPU0                                    CPU1
----                                    ----
migrate_vma_collect_pmd()
pte_unmap_unlock()
                                        madvise(MADV_DONTNEED)
                                        -> zap_pte_range()
                                        pte_offset_map_lock()
                                        [ PTE not present, TLB not flushed ]
                                        pte_unmap_unlock()
                                        [ page is still accessible via stale TLB ]
flush_tlb_range()
In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL.
Signed-off-by: Alistair Popple apopple@nvidia.com
Reported-by: Nadav Amit nadav.amit@gmail.com
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org
Changes for v3:
- New for v3
 mm/migrate_device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..6a5ef9f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
It appears that we can increase "unmapped" only if ptep_get_and_clear() is used?
Best Regards, Huang, Ying
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
 	return 0;
 }

base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
"Huang, Ying" ying.huang@intel.com writes:
It appears that we can increase "unmapped" only if ptep_get_and_clear() is used?
In other words you mean we only need to increase unmapped if pte_present && !anon_exclusive?
Agree, that's a good optimisation to make. However I'm just trying to solve a data corruption issue (not dirtying the page) here, so will post that as a separate optimisation patch. Thanks.
- Alistair
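For illustration, the deferred optimisation might look something like this (hypothetical sketch only; it assumes the existing unconditional accounting of unmapped is dropped):

	if (anon_exclusive) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		/* ptep_clear_flush() already flushes the TLB for this entry. */
		pte = ptep_clear_flush(vma, addr, ptep);
	} else {
		pte = ptep_get_and_clear(mm, addr, ptep);
		/* Only this path can leave a stale TLB entry behind, so only
		 * it needs the deferred flush_tlb_range(). */
		unmapped++;
	}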
Alistair Popple apopple@nvidia.com writes:
"Huang, Ying" ying.huang@intel.com writes:
It appears that we can increase "unmapped" only if ptep_get_and_clear() is used?
In other words you mean we only need to increase unmapped if pte_present && !anon_exclusive?
Agree, that's a good optimisation to make. However I'm just trying to solve a data corruption issue (not dirtying the page) here, so will post that as a separate optimisation patch. Thanks.
OK. Then the patch looks good to me. Feel free to add my,
Reviewed-by: "Huang, Ying" ying.huang@intel.com
Best Regards, Huang, Ying