From: Andrea Arcangeli <aarcange(a)redhat.com>
[ Upstream commit d7c3393413fe7e7dc54498ea200ea94742d61e18 ]
Patch series "migrate_misplaced_transhuge_page race conditions".
Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants that was apparently left unfixed.
While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.
These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non-zero after MADV_DONTNEED.
I did only minor testing, and I don't expect to be able to reproduce this
(especially the lack of ->invalidate_range before migrate_page_copy, which
requires the latest iommu hardware or infiniband to reproduce). The last
patch is a noop for x86 and it needs further review from maintainers of the
archs that implement flush_cache_range() (not in CC yet).
To avoid confusion: the first patch is not what introduces the bug fixed in
the second patch; even before the removal of pmdp_huge_clear_flush_notify,
that _notify variant was called after migrate_page_copy had already run.
This patch (of 3):
This is a corollary of ced108037c2aa ("thp: fix MADV_DONTNEED vs. numa
balancing race"), 58ceeb6bec8 ("thp: fix MADV_DONTNEED vs. MADV_FREE
race") and 5b7abeae3af8c ("thp: fix MADV_DONTNEED vs clear soft dirty
race").
When the above three fixes were posted, Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.
The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
in a54a407fbf7 ("mm: Close races between THP migration and PMD numa
clearing").
The important part of that commit is the part where the page lock is not
released until the first do_huge_pmd_numa_page() has finished disarming the
pagenuma/protnone.
The addition of pmdp_clear_flush() wasn't beneficial to that commit, and
there's no commentary about the addition either.
I guess the pmdp_clear_flush() in that commit was added just in case, for
safety, but it ended up introducing the MADV_DONTNEED race condition found
by Aaron.
At that point in time nobody had thought of this kind of MADV_DONTNEED race
yet (those races were fixed later), so the code may have looked more robust
with the pmdp_clear_flush() added.
This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be zeroed
out.
This also optimizes the code and removes a superfluous TLB flush.
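For illustration, here is a rough sketch of the interleaving (not part of
the patch; the callee names follow my reading of the upstream sources and
the exact call chain may differ per tree):

  CPU A: migrate_misplaced_transhuge_page()   CPU B: madvise(MADV_DONTNEED)
                                              (mmap_sem held for reading)

    pmdp_huge_clear_flush_notify()
      -> *pmd is transiently pmd_none()
                                              zap_pmd_range()
                                                pmd_none_or_trans_huge_or_clear_bad()
                                                  -> sees pmd_none(), skips
                                                     the range without
                                                     waiting on the pmd lock
    set_pmd_at(mm, mmun_start, pmd, entry)
      -> the new THP mapping is installed

  Result: the range still maps non-zero data after MADV_DONTNEED has
  returned; removing the transient clear (this patch) closes that window.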
[akpm(a)linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>
Reported-by: Aaron Tomlin <atomlin(a)redhat.com>
Acked-by: Mel Gorman <mgorman(a)suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
mm/migrate.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 2a55289ee9f1..1990d1ccee29 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2086,15 +2086,26 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
/*
- * Clear the old entry under pagetable lock and establish the new PTE.
- * Any parallel GUP will either observe the old page blocking on the
- * page lock, block on the page table lock or observe the new page.
- * The SetPageUptodate on the new page and page_add_new_anon_rmap
- * guarantee the copy is visible before the pagetable update.
+ * Overwrite the old entry under pagetable lock and establish
+ * the new PTE. Any parallel GUP will either observe the old
+ * page blocking on the page lock, block on the page table
+ * lock or observe the new page. The SetPageUptodate on the
+ * new page and page_add_new_anon_rmap guarantee the copy is
+ * visible before the pagetable update.
*/
flush_cache_range(vma, mmun_start, mmun_end);
page_add_anon_rmap(new_page, vma, mmun_start, true);
- pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
+ /*
+ * At this point the pmd is numa/protnone (i.e. non present) and the TLB
+ * has already been flushed globally. So no TLB can be currently
+ * caching this non present pmd mapping. There's no need to clear the
+ * pmd before doing set_pmd_at(), nor to flush the TLB after
+ * set_pmd_at(). Clearing the pmd here would introduce a race
+ * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the
+ * mmap_sem for reading. If the pmd is set to NULL at any given time,
+ * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this
+ * pmd.
+ */
set_pmd_at(mm, mmun_start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
@@ -2108,7 +2119,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* No need to double call mmu_notifier->invalidate_range() callback as
* the above pmdp_huge_clear_flush_notify() did already call it.
*/
- mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
--
2.17.1
In d8ae72427187, start was changed to point to the character under the
cursor, instead of the beginning of the cleared area. The offset was
correctly applied when clearing the backing buffer, but not when
updating the framebuffer region. Fix this by having start point once
again to the beginning of the cleared area.
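For context, the relevant cases in csi_K() (paraphrased from the driver;
the values are from my reading of the source) show why the offset matters:

  switch (vpar) {
  case 0:         /* erase from cursor to end of line */
          offset = 0;
          count = vc->vc_cols - vc->vc_x;
          break;
  case 1:         /* erase from start of line to cursor */
          offset = -vc->vc_x;
          count = vc->vc_x + 1;
          break;
  case 2:         /* erase whole line */
          offset = -vc->vc_x;
          count = vc->vc_cols;
          break;
  }

For vpar 1 and 2 offset is negative, so the cleared area begins before the
cursor and the region passed to do_update_region() must start at
vc_pos + offset rather than at vc_pos, which is what the hunk below does.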
Cc: stable(a)vger.kernel.org
Fixes: d8ae72427187 ("vt: preserve unicode values corresponding to screen characters")
Signed-off-by: Samuel Holland <samuel(a)sholland.org>
---
drivers/tty/vt/vt.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 5f1183b0b89d..be7bc83d6c41 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -1528,7 +1528,7 @@ static void csi_J(struct vc_data *vc, int vpar)
static void csi_K(struct vc_data *vc, int vpar)
{
unsigned int count;
- unsigned short *start = (unsigned short *)vc->vc_pos;
+ unsigned short *start;
int offset;
switch (vpar) {
@@ -1548,7 +1548,8 @@ static void csi_K(struct vc_data *vc, int vpar)
return;
}
vc_uniscr_clear_line(vc, vc->vc_x + offset, count);
- scr_memsetw(start + offset, vc->vc_video_erase_char, 2 * count);
+ start = (unsigned short *)vc->vc_pos + offset;
+ scr_memsetw(start, vc->vc_video_erase_char, 2 * count);
vc->vc_need_wrap = 0;
if (con_should_update(vc))
do_update_region(vc, (unsigned long) start, count);
--
2.18.1
If there's no entry to drop in the bucket that corresponds to the hash,
early_drop() should look for one in other buckets. But since it increments
the hash instead of the bucket number, it actually looks in the same bucket
8 times: hsize is 16k by default (14 bits) and the hash is a 32-bit value,
so reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.
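To make the arithmetic concrete, here is a minimal userspace sketch using
the same formula as the kernel's reciprocal_scale() helper (the hash value
is arbitrary, chosen only for illustration):

  #include <stdio.h>
  #include <stdint.h>

  /* Same formula as the kernel's reciprocal_scale() helper. */
  static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
  {
          return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
  }

  int main(void)
  {
          uint32_t hash = 0xdeadbeef;     /* arbitrary example hash */
          uint32_t hsize = 16384;         /* default table size, 2^14 */
          unsigned int i;

          /*
           * With hsize = 2^14 the formula reduces to hash >> 18, so the
           * bucket changes only once every 2^18 consecutive hash values:
           * all eight iterations below (matching the eviction range of 8)
           * print the same bucket number.
           */
          for (i = 0; i < 8; i++)
                  printf("hash+%u -> bucket %u\n", i,
                         (unsigned int)reciprocal_scale(hash + i, hsize));
          return 0;
  }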
Fix it by incrementing the bucket number instead of the hash, and rename
_hash to bucket to avoid future confusion.
Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable(a)vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh(a)arista.com>
---
v2: move 'bucket' outside of the loop
net/netfilter/nf_conntrack_core.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index ca1168d67fac..e92e749aff53 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1073,19 +1073,22 @@ static unsigned int early_drop_list(struct net *net,
return drops;
}
-static noinline int early_drop(struct net *net, unsigned int _hash)
+static noinline int early_drop(struct net *net, unsigned int hash)
{
- unsigned int i;
+ unsigned int i, bucket;
for (i = 0; i < NF_CT_EVICTION_RANGE; i++) {
struct hlist_nulls_head *ct_hash;
- unsigned int hash, hsize, drops;
+ unsigned int hsize, drops;
rcu_read_lock();
nf_conntrack_get_ht(&ct_hash, &hsize);
- hash = reciprocal_scale(_hash++, hsize);
+ if (!i)
+ bucket = reciprocal_scale(hash, hsize);
+ else
+ bucket = (bucket + 1) % hsize;
- drops = early_drop_list(net, &ct_hash[hash]);
+ drops = early_drop_list(net, &ct_hash[bucket]);
rcu_read_unlock();
if (drops) {
--
2.19.1
We need to make sure that the carrier check polling is disabled
while suspending. Otherwise we can end up with usbnet_read_cmd()
being issued when only usbnet_read_cmd_nopm() is allowed. If this
happens, read operations lock up.
Fixes: d69d169493 ("usbnet: smsc95xx: fix link detection for disabled autonegotiation")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Frieder Schrempf <frieder.schrempf(a)kontron.de>
---
Changes in v2:
* move cancel_delayed_work_sync() to correct error path
drivers/net/usb/smsc95xx.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index 262e7a3..2d17f3b 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -1598,6 +1598,8 @@ static int smsc95xx_suspend(struct usb_interface *intf, pm_message_t message)
return ret;
}
+ cancel_delayed_work_sync(&pdata->carrier_check);
+
if (pdata->suspend_flags) {
netdev_warn(dev->net, "error during last resume\n");
pdata->suspend_flags = 0;
@@ -1840,6 +1842,11 @@ static int smsc95xx_suspend(struct usb_interface *intf, pm_message_t message)
*/
if (ret && PMSG_IS_AUTO(message))
usbnet_resume(intf);
+
+ if (ret)
+ schedule_delayed_work(&pdata->carrier_check,
+ CARRIER_CHECK_DELAY);
+
return ret;
}
--
2.7.4
From: Linus Torvalds <torvalds(a)linux-foundation.org>
Commit eb66ae030829605d61fbef1909ce310e29f78821 upstream.
This is a backport to stable 3.18.y, based on Will Deacon's 4.4.y
backport.
Jann Horn points out that our TLB flushing was subtly wrong for the
mremap() case. What makes mremap() special is that we don't follow the
usual "add page to list of pages to be freed, then flush tlb, and then
free pages". No, mremap() obviously just _moves_ the page from one page
table location to another.
That matters, because mremap() thus doesn't directly control the
lifetime of the moved page with a freelist: instead, the lifetime of the
page is controlled by the page table locking that serializes access to
the entry.
As a result, we need to flush the TLB not just before releasing the lock
for the source location (to avoid any concurrent accesses to the entry),
but also before we release the destination page table lock (to avoid the
TLB being flushed after somebody else has already done something to that
page).
This also makes the whole "need_flush" logic unnecessary, since we now
always end up flushing the TLB for every valid entry.
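As an illustration of why the destination lock matters, here is a sketch of
one possible interleaving (my reconstruction of the problem described
above, not taken verbatim from the report):

  CPU A: mremap() -> move_ptes()             CPU B: another thread whose TLB
                                             still caches the old virtual
                                             address

    ptep_get_and_clear(old_addr)
    set_pte_at(new_addr, ...)
    spin_unlock(new_ptl) / spin_unlock(old_ptl)
      [previously the TLB flush for the
       old range only happened later, in
       move_page_tables()]
                                             the page is unmapped and freed
                                             through the new mapping
                                             CPU B writes through its stale
                                             TLB entry for old_addr
                                               -> modifies a page that is no
                                                  longer owned by this
                                                  mapping

  Flushing the old range before either page-table lock is released, as this
  patch does, keeps the stale TLB entry from outliving the page.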
Reported-and-tested-by: Jann Horn <jannh(a)google.com>
Acked-by: Will Deacon <will.deacon(a)arm.com>
Tested-by: Ingo Molnar <mingo(a)kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[will: backport to 4.4 stable]
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[ghackmann(a)google.com: adjust context]
Signed-off-by: Greg Hackmann <ghackmann(a)google.com>
---
mm/huge_memory.c | 6 +++++-
mm/mremap.c | 21 ++++++++++++++++-----
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bc402f39ac48..977437ee859b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1462,7 +1462,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
spinlock_t *old_ptl, *new_ptl;
int ret = 0;
pmd_t pmd;
-
+ bool force_flush = false;
struct mm_struct *mm = vma->vm_mm;
if ((old_addr & ~HPAGE_PMD_MASK) ||
@@ -1490,6 +1490,8 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+ if (pmd_present(pmd))
+ force_flush = true;
VM_BUG_ON(!pmd_none(*new_pmd));
if (pmd_move_must_withdraw(new_ptl, old_ptl)) {
@@ -1498,6 +1500,8 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+ if (force_flush)
+ flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
diff --git a/mm/mremap.c b/mm/mremap.c
index b147f66f4c40..62199110babc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -97,6 +97,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
+ bool force_flush = false;
+ unsigned long len = old_end - old_addr;
/*
* When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma
@@ -143,12 +145,26 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (pte_none(*old_pte))
continue;
pte = ptep_get_and_clear(mm, old_addr, old_pte);
+ /*
+ * If we are remapping a valid PTE, make sure
+ * to flush TLB before we drop the PTL for the PTE.
+ *
+ * NOTE! Both old and new PTL matter: the old one
+ * for racing with page_mkclean(), the new one to
+ * make sure the physical page stays valid until
+ * the TLB entry for the old mapping has been
+ * flushed.
+ */
+ if (pte_present(pte))
+ force_flush = true;
pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
pte = move_soft_dirty_pte(pte);
set_pte_at(mm, new_addr, new_pte, pte);
}
arch_leave_lazy_mmu_mode();
+ if (force_flush)
+ flush_tlb_range(vma, old_end - len, old_end);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
@@ -168,7 +184,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
{
unsigned long extent, next, old_end;
pmd_t *old_pmd, *new_pmd;
- bool need_flush = false;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
@@ -207,7 +222,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
anon_vma_unlock_write(vma->anon_vma);
}
if (err > 0) {
- need_flush = true;
continue;
} else if (!err) {
split_huge_page_pmd(vma, old_addr, old_pmd);
@@ -224,10 +238,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
extent = LATENCY_LIMIT;
move_ptes(vma, old_pmd, old_addr, old_addr + extent,
new_vma, new_pmd, new_addr, need_rmap_locks);
- need_flush = true;
}
- if (likely(need_flush))
- flush_tlb_range(vma, old_end-len, old_addr);
mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
--
2.19.1.930.g4563a0d9d0-goog