bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:
[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read
[reader]
numjobs=4
directory=/mnt
where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).
Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.
Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging")
CC: stable(a)vger.kernel.org
Acked-by: Paolo Valente <paolo.valente(a)linaro.org>
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
block/bfq-iosched.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e4eb0fc1c16..dc5f12d0d482 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2937,6 +2937,7 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
}
bfqd->in_service_queue = bfqq;
+ bfqd->in_serv_last_pos = -1;
}
/*
--
2.26.2
There is a race condition between __free_huge_page()
and dissolve_free_huge_page().
CPU0: CPU1:
// page_count(page) == 1
put_page(page)
__free_huge_page(page)
dissolve_free_huge_page(page)
spin_lock(&hugetlb_lock)
// PageHuge(page) && !page_count(page)
update_and_free_page(page)
// page is freed to the buddy
spin_unlock(&hugetlb_lock)
spin_lock(&hugetlb_lock)
clear_page_huge_active(page)
enqueue_huge_page(page)
// It is wrong, the page is already freed
spin_unlock(&hugetlb_lock)
The race windows is between put_page() and dissolve_free_huge_page().
We should make sure that the page is already on the free list
when it is dissolved.
Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: stable(a)vger.kernel.org
---
mm/hugetlb.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4741d60f8955..4a9011e12175 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
static int num_fault_mutexes;
struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
+static inline bool PageHugeFreed(struct page *head)
+{
+ return page_private(head + 4) == -1UL;
+}
+
+static inline void SetPageHugeFreed(struct page *head)
+{
+ set_page_private(head + 4, -1UL);
+}
+
+static inline void ClearPageHugeFreed(struct page *head)
+{
+ set_page_private(head + 4, 0);
+}
+
/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
+ SetPageHugeFreed(page);
}
static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1044,6 +1060,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
+ ClearPageHugeFreed(page);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
return page;
@@ -1504,6 +1521,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
+ ClearPageHugeFreed(page);
spin_unlock(&hugetlb_lock);
}
@@ -1770,6 +1788,14 @@ int dissolve_free_huge_page(struct page *page)
int nid = page_to_nid(head);
if (h->free_huge_pages - h->resv_huge_pages == 0)
goto out;
+
+ /*
+ * We should make sure that the page is already on the free list
+ * when it is dissolved.
+ */
+ if (unlikely(!PageHugeFreed(head)))
+ goto out;
+
/*
* Move PageHWPoison flag from head page to the raw error page,
* which makes any subpages rather than the error page reusable.
--
2.11.0
Changes since v2 [1]:
- Collect some reviewed-by's from David and Oscar
- Rework subsection validity to include pfn_valid() gated by
CONFIG_HAVE_ARCH_PFN_VALID (David, Oscar)
- Introduce pgmap_pfn_valid() to validate metadata vs data in a pgmap (David)
! Kill put_ref_page(): the extra "if (ref_page) put_page(ref_page)" still
feels more cluttered than adding a tiny helper. (Oscar)
[1]: http://lore.kernel.org/r/161044407603.1482714.16630477578392768273.stgit@dw…
---
Michal reminds that the discussion about how to ensure pfn-walkers do
not get confused by ZONE_DEVICE pages never resolved. A pfn-walker that
uses pfn_to_online_page() may inadvertently translate a pfn as online
and in the page allocator, when it is offline managed by a ZONE_DEVICE
mapping (details in Patch 3: ("mm: Teach pfn_to_online_page() about
ZONE_DEVICE section collisions")).
The 2 proposals under consideration are teach pfn_to_online_page() to be
precise in the presence of mixed-zone sections, or teach the memory-add
code to drop the System RAM associated with ZONE_DEVICE collisions. In
order to not regress memory capacity by a few 10s to 100s of MiB the
approach taken in this set is to add precision to pfn_to_online_page().
In the course of validating pfn_to_online_page() a couple other fixes
fell out:
1/ soft_offline_page() fails to drop the reference taken in the
madvise(..., MADV_SOFT_OFFLINE) case.
2/ The libnvdimm sysfs attribute visibility code was failing to publish
the resource base for memmap=ss!nn defined namespaces. This is needed
for the regression test for soft_offline_page().
3/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages,
however that mapping may contain data pages and metadata raw pfns.
Introduce pgmap_pfn_valid() to delineate the 2 types and fail the
handling of raw metadata pfns.
---
Dan Williams (6):
mm: Move pfn_to_online_page() out of line
mm: Teach pfn_to_online_page() to consider subsection validity
mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
mm: Fix page reference leak in soft_offline_page()
mm: Fix memory_failure() handling of dax-namespace metadata
libnvdimm/namespace: Fix visibility of namespace resource attribute
drivers/nvdimm/namespace_devs.c | 10 +++---
include/linux/memory_hotplug.h | 17 +--------
include/linux/memremap.h | 6 +++
include/linux/mmzone.h | 22 ++++++++----
mm/memory-failure.c | 26 ++++++++++++--
mm/memory_hotplug.c | 70 +++++++++++++++++++++++++++++++++++++++
mm/memremap.c | 15 ++++++++
7 files changed, 134 insertions(+), 32 deletions(-)
From: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Backport for 4.4
(cherry picked from commit d85be8a49e733dcd23674aa6202870d54bf5600d)
The placeholder for instruction selection should use the second
argument's operand, which is %1, not %0. This could generate incorrect
assembly code if the memory addressing of operand %0 is a different
form from that of operand %1.
Also remove the %Un placeholder because having %Un placeholders
for two operands which are based on the same local var (ptep) doesn't
make much sense. By the way, it doesn't change the current behaviour
because "<>" constraint is missing for the associated "=m".
[chleroy: revised commit log iaw segher's comments and removed %U0]
Fixes: 9bf2b5cdc5fe ("powerpc: Fixes for CONFIG_PTE_64BIT for SMP support")
Cc: <stable(a)vger.kernel.org> # v2.6.28+
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Acked-by: Segher Boessenkool <segher(a)kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/96354bd77977a6a933fe9020da57629007fdb920.16033589…
---
arch/powerpc/include/asm/pgtable.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index b64b4212b71f..408f9e1fa24a 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -149,9 +149,9 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
flush_hash_entry(mm, ptep, addr);
#endif
__asm__ __volatile__("\
- stw%U0%X0 %2,%0\n\
+ stw%X0 %2,%0\n\
eieio\n\
- stw%U0%X0 %L2,%1"
+ stw%X1 %L2,%1"
: "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
: "r" (pte) : "memory");
--
2.25.0